
Research Forum Brief | June 2024

Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: A Case Study on CLIP



“Today’s AI models hold incredible potential for assisting the blind community—from text recognition to object identification to question answering. Apps like Seeing AI are already deploying some of these AI features. But there is potential for much more.”

Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Transcript: Lightning Talk

Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: A Case Study on CLIP

Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Daniela Massiceti delves into the transformative potential of multimodal models such as CLIP for assistive technologies. Focusing on the blind and low-vision community, the talk explores how far current models are from realizing this potential and the advances needed to bridge the gap.

Microsoft Research Forum, June 4, 2024

DANIELA MASSICETI: Hi there. My name is Daniela Massiceti, and I’m a senior researcher at Microsoft Research Cambridge. Today, I will be sharing our recent CVPR paper, which examines the challenges and opportunities of large multi-modal models for blind and low-vision users.

Today’s AI models hold incredible potential for assisting the blind community—from text recognition to object identification to question answering. Apps like Seeing AI are already deploying some of these AI features. But there is potential for much more. And I think this is hinted at by the recent partnership between OpenAI and Be My Eyes, with the promise that one day, human assistance could be replaced by AI agents that provide instantaneous assistance to blind users around the world. But despite their potential, no work has really looked at how well these models actually work on image and text data captured by blind users. And we know from the literature that this data is likely to be out of distribution or different in a number of ways. For example, blind users use a range of quite specialized assistive objects. They are also more likely to capture images with quality variation, things like camera blur and occlusion. And they’re also more likely to make use of non-visual vocabulary, for example, describing their objects by their physical rather than their visual properties.

Our work, therefore, set out to remedy this. Specifically, we systematically evaluated 25 variants of the CLIP model on data from blind and low-vision users. CLIP is one of today’s most widely used multi-modal models. It has over 15,000 citations and 75 million downloads. We used the ORBIT and the VizWiz-Classification datasets. Both of these are collected by blind users through real-world assistive applications. And we inspected CLIP’s performance on both a zero-shot image classification task directly as well as through examining the performance of models that use CLIP as a component, which is very widely done in the community. I unfortunately don’t have time to go into all the details of our work, but I will share our top three findings with you. First, we confirmed that CLIP does indeed underperform on data that is captured by blind and low-vision users. Second, these disparities trickle down to models that use CLIP as a component. And then third, these disparities stem from the fact that disability content is significantly underrepresented and sometimes missing completely from the datasets that are used to pretrain these large models. And I’ll dive into our three findings in a bit more detail.
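To make the zero-shot classification setup concrete, here is a minimal sketch using a publicly available CLIP checkpoint via Hugging Face transformers. This is not the paper’s evaluation code; the checkpoint name, image path, and label prompts are illustrative assumptions.

```python
# Minimal sketch of zero-shot image classification with CLIP
# (illustrative only; not the evaluation code from the paper).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any CLIP variant can be swapped in; this is a common public checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical label prompts mixing disability and common objects.
labels = [
    "a photo of a Braille keyboard",
    "a photo of a guide cane",
    "a photo of a TV remote",
]

image = Image.open("example.jpg").convert("RGB")  # e.g., an image captured by a blind user
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# CLIP scores the image against each text prompt; the highest score is the prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
print(labels[probs[0].argmax().item()])
```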

So for the first finding, we found that CLIP underperforms on objects, image quality, and language that is typically used by blind users. On object type, CLIP recognizes disability objects like a Braille keyboard, for example, up to 28 percentage points less accurately than common objects like a TV remote. On image quality, CLIP is up to 23 percentage points more sensitive to images with quality issues like camera blur and lighting problems compared to images without these issues. And on language, CLIP recognizes objects that are described by their material—so, for example, a leather boot—up to 12 percentage points less accurately than objects described by their color—for example, a brown boot. And we know that blind users rely heavily on this tactile rather than visual language.
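The image-quality sensitivity can be probed with a simple perturbation test, sketched below under stated assumptions: the image path and label prompts are hypothetical, and Gaussian blur stands in for the kinds of camera blur found in real blind-user photos.

```python
# Sketch: probing CLIP's sensitivity to image-quality issues such as blur
# (illustrative only; image path and labels are hypothetical).
import torch
from PIL import Image, ImageFilter
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a Braille keyboard", "a photo of a TV remote"]

clean = Image.open("keyboard.jpg").convert("RGB")
blurred = clean.filter(ImageFilter.GaussianBlur(radius=6))  # simulate camera blur

for name, img in [("clean", clean), ("blurred", blurred)]:
    inputs = processor(text=labels, images=img, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)
    # A sizeable drop in the correct label's probability indicates quality sensitivity.
    print(name, {label: round(p, 3) for label, p in zip(labels, probs.tolist())})
```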

Towards our second finding, we examined three models that use CLIP under the hood—an object detection model, an image segmentation model, and an image generation model—and found that all three struggle with disability content. For example, DALL-E 2, which relies on a CLIP vision encoder, cannot generate common disability objects like guide canes and Braille keyboards. Instead, as you can see here, it gives us very strange-looking walking canes and lots and lots of randomly placed white dots. In comparison, DALL-E 2 generated really high-quality and realistic images for almost all of the non-disability objects that we tested.

And then towards our third and final finding, we really wanted to understand where these performance disparities were stemming from. And so we quantified just how prevalent disability content is in three popular datasets that are commonly used to pretrain these large models: LAION-400M, LAION-2B, and DataComp-1B. Specifically, we counted how many times objects are mentioned in these datasets’ captions and found that disability objects appear 16 to 17 times less frequently than non-disability objects across all three of the datasets.
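In spirit, the prevalence analysis boils down to counting object mentions across captions. Here is a minimal sketch of that idea; the caption file, its one-caption-per-line format, and the term lists are assumptions for illustration (the actual LAION and DataComp releases distribute captions in sharded metadata files that would be streamed similarly).

```python
# Sketch: counting how often object terms appear in a dataset's captions,
# mirroring the prevalence analysis described above (hypothetical file/terms).
from collections import Counter

disability_terms = ["braille keyboard", "guide cane", "white cane"]
common_terms = ["tv remote", "keyboard", "mug"]

counts = Counter()
with open("captions.txt", encoding="utf-8") as f:  # assumed: one caption per line
    for line in f:
        caption = line.lower()
        for term in disability_terms + common_terms:
            if term in caption:
                counts[term] += 1

disability_total = sum(counts[t] for t in disability_terms)
common_total = sum(counts[t] for t in common_terms)
print(counts)
print("common-to-disability mention ratio:", common_total / max(disability_total, 1))
```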

So as you can see, our work has identified a clear gap in current models’ capabilities for blind users, and this could have very real consequences if these models are then integrated into assistive technologies for the blind and low-vision community. So what should we, as a research community, be doing about it? First, I think more work is needed to understand how models come to learn or adapt to long-tailed data. Some of our early results show that few-shot learning approaches hold some promise, but they don’t always work, especially in more challenging scenarios, for example, when objects appear in highly cluttered scenes. And second, I think it’s important for us to really focus on including more disability content in these large-scale pretraining datasets. And our team is currently working on developing equitable and fair practices alongside disabled communities to source data that is truly representative of their needs. And so with that, I will wrap up.

Thank you to all the people behind this work and thank you for listening.