Screenshot of a web interface showing an ASL interpreter on the left, and an article on the right segmented by sentence. One particular sentence is highlighted, and the signer is frozen.

ASL STEM Wiki

Dataset and Benchmark for Interpreting STEM Articles

To help advance the state of sign language modeling, we created ASL STEM Wiki, the first continuous signing dataset focused on Science, Technology, Engineering, and Math (STEM). The corpus contains 254 Wikipedia articles on STEM topics in English, interpreted into 300 hours of American Sign Language (ASL). Beyond its size and topic, it differs from many prior datasets in two ways: it contains videos of professional signers, including many Certified Deaf Interpreters (CDIs), and it was collected with consent from each contributor under IRB approval. Deaf research team members were involved throughout.

This dataset is released alongside our paper, which identifies several use cases for ASL STEM Wiki and provides baselines for one of these tasks: fingerspelling detection and identification. Because the dataset focuses on STEM, and STEM terminology often lacks standardized signs, fingerspelling of technical terms appears frequently in our dataset. To help identify fingerspellings, we provide models for fingerspelling detection and alignment, and release benchmark performance on the ASL STEM Wiki dataset for the research community to build on. Our models highlight the difficulty of the detection and alignment tasks, and provide the first evidence that self-supervised contrastive pretraining can improve fingerspelling detection.
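For readers unfamiliar with contrastive pretraining, the general idea can be sketched with the widely used InfoNCE objective. This is an illustrative sketch only, not the training code or loss from our paper: the function name `info_nce_loss`, the `temperature` value, and the toy embeddings are all assumptions made for this example. Each anchor embedding is pulled toward its paired "positive" embedding and pushed away from the other positives in the batch.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired embeddings.

    anchors[i] and positives[i] are two views of the same example;
    every other positives[j] in the batch acts as a negative.
    Hypothetical helper for illustration; assumes non-zero vectors.
    """

    def normalize(vec):
        # Scale to unit length so dot products are cosine similarities.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of this anchor to every candidate in the batch.
        logits = [
            sum(x * y for x, y in zip(anchor, cand)) / temperature
            for cand in p
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy with the matching pair (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy check: correctly paired embeddings give a much lower loss
# than the same embeddings paired with the wrong partner.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = aligned[1:] + aligned[:1]
print(info_nce_loss(aligned, aligned) < info_nce_loss(aligned, shuffled))
```

In a real pretraining setup the embeddings would come from a video encoder applied to two views of the same clip, and the loss would be minimized with gradient descent; the sketch only shows how the objective scores a batch.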

Our dataset also powers a small bilingual resource for students, which provides full English texts of STEM articles alongside professional ASL interpretations. This resource lets students and other readers access spot-translations for selected sentences, or play through entire articles as desired. We release this resource as well.

This project was conducted at Microsoft Research with collaborators.

  • Microsoft: Danielle Bragg (PI), Hal Daumé III, Alex Lu, Vanessa Milan, Fyodor Minakov, Chinmay Singh, Cyril Zhang
  • University of California, Berkeley: Kayo Yin

Dataset License: Please see the supporting tab. If you are interested in commercial use, please contact ASL_Citizen@microsoft.com

Dataset Download:

To download via web interface, please visit: Coming soon!

To download via command line, please execute: Coming soon!

Bilingual STEM article resource: Wiki – The ASL Data Community.

Open-source Repo: Coming soon!

Citation: If you use this dataset in your work, please cite our paper (link coming soon!)

Citation: Coming soon!

Acknowledgements: We are deeply grateful to all community members who participated in this dataset project.