Screenshot of a web interface showing an ASL interpreter on the left, and an article on the right segmented by sentence. One particular sentence is highlighted, and the signer is frozen.

ASL STEM Wiki

Dataset and Benchmark for Interpreting STEM Articles

To help advance the state of sign language modeling, we created ASL STEM Wiki, the first continuous signing dataset focused on Science, Technology, Engineering, and Math (STEM). The corpus contains 254 Wikipedia articles on STEM topics in English, interpreted into over 300 hours of American Sign Language (ASL). Beyond its size and topic, it differs from many prior datasets in two ways: it contains videos of professional signers, including many Certified Deaf Interpreters (CDIs), and it was collected with consent from each contributor under IRB approval. Deaf research team members were involved throughout.

This dataset is released alongside our paper, which identifies several use cases for ASL STEM Wiki and provides baselines for one of these tasks: fingerspelling detection and identification. Because the dataset focuses on STEM, and STEM terminology often lacks standardized signs, fingerspelled technical terms appear frequently in our dataset. To help identify them, we provide models for fingerspelling detection and alignment, and release benchmark performance on the ASL STEM Wiki dataset for the research community to build on. Our results highlight the difficulty of the detection and alignment task, and provide the first evidence that self-supervised contrastive pretraining can improve fingerspelling detection.

Our dataset also powers a small bilingual resource for students, which pairs the full English text of each STEM article with its professional ASL interpretation. Students and other readers can use it to view spot translations of selected sentences or to play through entire articles. We release this resource as well.

This project was conducted at Microsoft Research with collaborators.

Microsoft: Danielle Bragg (PI), Hal Daumé III, Alex Lu, Vanessa Milan, Fyodor Minakov, Chinmay Singh, Cyril Zhang
University of California, Berkeley: Kayo Yin

Dataset License: Please see the supporting tab. If you are interested in commercial use, please contact ASL_Citizen@microsoft.com.

Dataset Download:

To download via web interface, please visit: Download ASL STEM Wiki from Official Microsoft Download Center

To download via command line, please execute: wget https://download.microsoft.com/download/4/c/f/4cfec788-7478-4e47-9a15-ace9b6a96198/ASL_STEM_Wiki.zip
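
For scripted setups, the same download can be done from Python. The following is a minimal sketch using only the standard library, not an official tool. The function names `download` and `top_level_entries` are our own, and the layout inside the archive is not documented on this page, so the sketch only lists whatever top-level entries it finds instead of assuming specific folder names.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the wget command above.
DATASET_URL = (
    "https://download.microsoft.com/download/4/c/f/"
    "4cfec788-7478-4e47-9a15-ace9b6a96198/ASL_STEM_Wiki.zip"
)


def download(url: str, dest: Path) -> Path:
    """Stream the file at `url` to `dest`; do nothing if `dest` already exists."""
    if dest.exists():
        return dest
    # Write to a temporary ".part" file first so that an interrupted
    # download is never mistaken for a complete archive.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.rename(dest)
    return dest


def top_level_entries(archive: Path) -> list:
    """Return the sorted top-level names in a zip archive without extracting it."""
    with zipfile.ZipFile(archive) as zf:
        return sorted({name.split("/", 1)[0] for name in zf.namelist()})


# Example usage (not run automatically, since the archive holds hundreds
# of hours of video and is likely to be large):
#
#   archive = download(DATASET_URL, Path("ASL_STEM_Wiki.zip"))
#   print(top_level_entries(archive))
```

Listing the entries before extracting is a cheap way to confirm that the download completed and to see how the archive is organized before committing disk space to it.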

Bilingual STEM article resource: Wiki – The ASL Data Community.

Open-source Repo: Coming soon!

Citation: If you use this dataset in your work, please cite our paper.

@inproceedings{yin-etal-2024-asl,
    title = "{ASL} {STEM} {W}iki: Dataset and Benchmark for Interpreting {STEM} Articles",
    author = "Yin, Kayo  and
      Singh, Chinmay  and
      Minakov, Fyodor O  and
      Milan, Vanessa  and
      Daum{\'e} III, Hal  and
      Zhang, Cyril  and
      Lu, Alex Xijie  and
      Bragg, Danielle",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = Nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.801",
    pages = "14474--14490",
    abstract = "Deaf and hard-of-hearing (DHH) students face significant barriers in accessing science, technology, engineering, and mathematics (STEM) education, notably due to the scarcity of STEM resources in signed languages. To help address this, we introduce ASL STEM Wiki: a parallel corpus of 254 Wikipedia articles on STEM topics in English, interpreted into over 300 hours of American Sign Language (ASL). ASL STEM Wiki is the first continuous signing dataset focused on STEM, facilitating the development of AI resources for STEM education in ASL. We identify several use cases of ASL STEM Wiki with human-centered applications. For example, because this dataset highlights the frequent use of fingerspelling for technical concepts, which inhibits DHH students{'} ability to learn, we develop models to identify fingerspelled words{---}which can later be used to query for appropriate ASL signs to suggest to interpreters.",
}

Acknowledgements: We are deeply grateful to all community members who participated in this dataset project.