Abstracts: September 30, 2024


By Amber Tingle, Managing Editor, Research Publishing; Daniela Massiceti, Senior Researcher; and Martin Grayson, Principal Research Software Development Engineer

Outline illustration of Daniela Massiceti next to Martin Grayson

Members of the research community at Microsoft work continuously to advance their respective fields.  Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Researcher Daniela Massiceti and Principal Research Software Development Engineer Martin Grayson join host Amber Tingle to discuss the research project and AI-powered tool Find My Things. Find My Things is a personalizable object recognizer that people who are blind or have low vision can train to find personal items from just a few videos of those objects. It was recently recognized as a 2024 Innovation by Design Awards finalist in the accessible design and AI categories by the US-based business media brand Fast Company and, earlier this year, became available as a feature in the Seeing AI app.

The Find My Things story is an example of research at Microsoft enhancing Microsoft products and services. To try the Find My Things tool, download the free, publicly available Seeing AI app.

Transcript

[MUSIC] 

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft offer a quick snapshot—or a podcast abstract—of their new and noteworthy papers and achievements. 

[MUSIC FADES] 

Our guests today are Daniela Massiceti and Martin Grayson. Daniela is a senior researcher at Microsoft, and Martin is a software development engineer with the company. They are members of a team creating technology that can be personalized to meet individual needs. Their research project, called Find My Things, enables people who are blind or have low vision to train an AI system to find their personal items based on a few examples of the objects. Find My Things has now shipped as a new feature within Seeing AI, which is a free app that narrates a person’s surroundings, including nearby people, text, and objects. The team was also recently recognized by the US-based business media brand Fast Company as an Innovation by Design Awards finalist in both the accessible design and artificial intelligence categories. Daniela and Martin, congratulations, and thank you so much for joining us today for Abstracts.

MARTIN GRAYSON: Pleasure, thank you. 

DANIELA MASSICETI: Thanks very much, Amber. Nice to be here. 

TINGLE: So, Daniela, let’s start with a Find My Things overview. What is it, how does it work, and who’s it for? 

MASSICETI: I think the best way I can describe Find My Things is a personalizable object recognizer. So when we think about object recognizers in the past, they’ve, kind of, been what I would call generic object recognizers. So they can only really recognize generic things like maybe chairs, desks, tables. But for the blind and low-vision community, who are really key users of object recognition apps and technologies, for them, that’s not quite enough. They need to be able to recognize all of their personal objects and items. So things like their sunglasses, their partner’s sunglasses, um, perhaps their house keys. So a range of these really specific personal objects that generic object recognizers cannot recognize and help them find. And so Find My Things aims to tackle that by being a personalizable object recognizer. A user can essentially teach this object recognizer what their personal items look like, and then the personalized feature can then help them locate those objects at any point in the future. The experience is divided into two phases: a teaching phase and a finding phase. So in a teaching phase, a user would capture four short videos of each of their personal objects, and those videos are then ingested into the app[1], and the machine learning model that sits underneath that app learns what those objects actually look like. And then in the second, finding phase a user at any point in the future can, kind of, say, hey, I want to find my partner’s sunglasses or my sunglasses. And that will initiate this 3D localization experience, which will help guide them with sound and touch cues to that specific object, wherever it is in the room around them. 

TINGLE: I’ve heard Find My Things described as a teachable AI system. Daniela alluded to this, but, Martin, break it down a bit more for us. What do you and your collaborators mean when you use the term teachable AI? 

GRAYSON: Something you can say about every person is that we’re all unique. Unique in the things that we like, whether that’s music, movies, food; the things we do, whether it’s at home, at work, or in your hobbies; and of course, the things that we have and own and keep with us. The same applies to accessibility. Everyone has their own unique sets of skills and tools that help them get things done, and we have them set up in just the way that matches us. The other day, I, like, came into the office and I sat in my chair, and I realized immediately that it wasn’t right. And of course, somebody had borrowed my desk the previous day and changed the height of my chair, but it was no problem because I could just re-personalize the chair back to my liking. When it comes to tools for accessibility, we think that people should have the same ability to personalize those tools to work the very best way for them. Typically, these have been settings like text size, speech, and color display, but AI has become a more and more important component in those tools. And one way we’re really excited about how to enable that is through teachable AI. So I think for us, teachable AI means that we can take some already really smart AI technology that might have some great general skills, but with a tiny amount of time from a person, that AI can be taught what matters to them and what works for them and become an even better AI to help them get things done. 

TINGLE: Describe the origins of this work for our listeners, Daniela. What influenced or inspired the Find My Things pursuit? And how does your work build on or differ from previous work in the accessible technology space? 

MASSICETI: Yeah, great question. And this is going to require me to cast my mind back to around four years ago. Our team at Microsoft Research was developing a system called the PeopleLens. So this was a head-mounted camera device that could be worn by people who are blind or low vision, specifically children who are blind or low vision. And it would help them identify or it would describe to them all the people that are around them in their social scenario—where those people were, were those people looking at them. And I think the team realized very quickly that, as Martin was saying there, each person has a really unique need or a unique view of what they actually want described to them about the social environment around them. And so that got us thinking, well, actually, being able to personalize this system is really important. But in complex social environments, personalization is a really hard problem. And so that prompted the team to think, OK, well, we want to study this idea of personalization; let’s try and find almost the simplest possible example of an AI technology with which we could actually deeply explore this space of personalization. And that led us to object recognizers. Object recognizers are, as I mentioned, a very commonly used technology across the blind and low-vision community, and we know that there is a need for personalization there. And so that really prompted or started this journey along personalizable, or teachable, object recognizers, which we then have been working on for the last three or four years to eventually get us to a point now where we’re seeing this feature available in Seeing AI. 

TINGLE: Your team identified few-shot learning and the availability of new datasets as keys to this work. Martin, how have those particular advances helped to make Find My Things possible? And are there other approaches you’ve incorporated to make sure that it’s both practical and valuable for people who are blind or have low vision? 

GRAYSON: So AI loves data. In fact, data is essential to make AI work. But for AI to work for everyone, it needs to learn from data that somehow represents everyone. The earliest challenge for Find My Things was that people who are blind or low vision don’t often use their cameras to take lots of photos and videos. And this actually gives us two big data gaps. The first is that we don’t have lots of image data that is representative of their own lives, their environments, and their things. And the second is that if you’re someone who’s blind, you may hold your phone differently, or you may use your camera in different ways. And that’s, too, missing from the data, certainly in the established datasets that exist. So new datasets, like ORBIT, have collected thousands of images and videos by members of the blind and low-vision community, specifically of objects and environments that are really important to them. And this means that we’ve now addressed those two big data gaps. And the few-shot part is really important, too. Find My Things is not a general object recognizer. It’s a find my things. We want Find My Things to be able to recognize anything you throw at it—whether it’s your fluffy keyring, your colorful tote bag, or your favorite gadget or toy. However, traditional object detectors, they often need hundreds or thousands of images to learn how to recognize something accurately. Few-shot learning is a super-smart approach that means you only need to trouble our users for a couple of short five-second videos, and then our app will take it from there. Find My Things can use that tiny amount of data and still be able to spot your object from across the room. 

Maybe one more thing we did, and this also became so important, was to build and try prototype experiences as soon as we possibly could. And we would try so many models and designs out and then iterate. The team has definitely seen so many videos of me trying to find things around my house. But it’s actually one of the things we’re most proud of in the project, is this, kind of, graveyard of interactive prototypes that have all led us to the final experience. 

TINGLE: Daniela, what have you learned from the Find My Things journey that may help the broader research community create more inclusive and more human-centric AI experiences? 

MASSICETI: The first one I would say is the importance of doing participatory research. And what that means is really working with the communities that you are developing technologies for. And the second is really learning how to balance this tension between developing something in a research environment and actually deploying that technology in a real-world environment. To jump to the first learning around participatory research, Martin mentioned the ORBIT dataset. The ORBIT dataset was collected in partnership with users who are blind or low vision across both the UK and Canada over the years 2020 to 2021. And it was really important for us to actually engage directly with users who are blind as we were collecting that dataset from them to really understand what they wanted from a personalizable object recognition technology, how they would use their cameras, how they would hold their phones, what kinds of objects they would use this technology to find. And all of that was really, really critical in helping us shape what that dataset ended up being. That dataset became such a pivotal part of the ultimate Find My Things experience. To the second point around this tension between building something in research and deploying something in the real world, I think often as a researcher, we don’t really have to engage with real-world constraints. But of course, when you build a machine learning model or a machine learning system and you want to deploy it in the real world, suddenly those constraints really hit you in the face. And that was exactly the case with Find My Things. I remember quite distinctly in the model development process, we had a number of different models. They were, sort of, ranging up in size in terms of how much memory they would take on a phone to run. And of course, the larger the model was, the more accurate it was. But when we deployed these models of varying sizes onto a phone, we saw that they each had vastly different reactions to being on this phone. And I think if I recall from memory, some of our largest models ended up basically draining the phone’s battery in a couple of minutes, which would mean that the experience would be totally unusable to the user. And so one of the key things we had to do there is really find this sweet spot, or this balance, between what is good enough performance that does not end up, kind of, degrading the actual experience of running this model on a phone. 

TINGLE: You mentioned participatory research, and your team’s version feels a little different from what we typically encounter. Talk a little bit more about the citizens who helped you build out this app. 

MASSICETI: So these were a group of perhaps eight to 10 users who are blind or low vision who we hosted at Microsoft Research a number of times over the course of the development of the Find My Things experience. And they were … perhaps the best way I can describe them is they were co-designers; they were really helping us design—co-design—what the Find My Things experience ultimately turned out to be. We weren’t coming to them as simply testers of our system. We, kind of, went to them with a blank slate and asked them, well, we have these ideas of what we want to build; what do you think? And from there, we, kind of, iterated upwards and ultimately crafted, co-crafted, the ultimate design of the Find My Things experience, both the teaching part and the finding part. 

TINGLE: One of the members of that citizen design team, Karolina Pakėnaitė, visited the Microsoft Research Podcast back in December with your colleague Cecily Morrison. Martin, talk a bit more about how influential citizen designers like Karolina are to this effort. 

GRAYSON: There were so many key ideas and innovations that came from the workshops with Karolina and the rest of the citizen design team. Maybe I can share a couple that have had the biggest impact on the app and the experience. So the first was the process of teaching an object. Our testing of AI models showed that collecting videos of objects from different sides and on different backgrounds was critical. So we developed this thing called the drawback technique, where we leaned on the phone’s augmented reality capabilities to make it possible. We’d ask the user to start with the phone right next to their object and then slowly draw it away. This meant that we could track all of the different distances the images were, and the user could really comfortably create a video without leaving their seat. And what’s more, you can do this so easily without even needing to look at the camera. It’s really natural. The second big design innovation came later on when you were actually looking for the thing. We called it the last yard. So many of the lost-item scenarios that we learned about from the citizen designers … they shared with us that they had dropped something in a public space. Their wallet fell out of their pocket as they took their phone out, or they knocked their earbud off the table onto the floor of the train on their way to work. And in both of those moments, the last thing anyone wants to be doing is feeling around on the floor, especially on public transport. So we tested these early versions of Find My Things with the design team, and they would get close to their object, overstep it, and then reach down. And they’d still be feeling around the floor before they found their object, which mostly ended up back behind them. So our last yard design completely changed this. As the user got close to their object, within the last yard, we change the sounds, and the app actually tells them to move down. The phone then responds to the distance to the object exactly like a metal detector. And this meant that when they reached down just at the right moment, they found their object on the floor and it was much easier. No more overstepping. We spent lots of time exploring how the experience and the phone capabilities like AR and AI could work best together, and our citizen design team gave us all of the key insights that led to us coming to these approaches. 

TINGLE: So what’s next for Find My Things? I’d like you to share a bit about the opportunities or even the obstacles that exist for more widespread adoption of the teachable AI approach. 

GRAYSON: So Find My Things was such a great project to work on. It sat right in the center of the triangle of AI innovation, designing with your community, and of course product impact. We’re taking so much of what we’ve learned during this project and building it into our research going forwards—how we build and evaluate AI, how to engage with the communities that we want to build for, and of course the value of building lots and lots of prototypes. Teachable AI, I think, is going to be a key approach in addressing the challenge for AI working equally well for everyone. The challenge is how do we ensure that we build these new fantastic models on data that gives representation to all that’ll use it. And so often, the people that might benefit the most from innovations in AI might have the smallest representation in data. And our work with people in the blind and low-vision community has really brought that into focus for us. AI can and will be transformational for them, so long as we can make it work just as well for everyone. And then that creates the opportunity: ensuring that these systems and technologies that we design can learn from and build in all of the diverse and wonderful uniqueness of being a human.

MASSICETI: I think one of the things I’m most excited about is unlocking this power of personalization. Hopefully, we’ve convinced you how impactful having personalized AI technologies would be for not only the blind and low-vision community, but for you and me. And so one of the things I’m most excited about is seeing how we can transplant some of these learnings and ideas that we’ve had in building Find My Things into now the generative AI era. And so, yeah, I think I’m really excited to, kind of, bring together these ideas of teachable AI with these new generative models to help really bring to life more useful AI technologies that service not just a small few but all the people across the user distribution.

TINGLE: Daniela and Martin, thank you so much for joining Abstracts today. 

MASSICETI: Thank you, Amber. 

GRAYSON: Thank you for having us. 

[MUSIC] 

TINGLE: And thanks to our listeners, too. If you’d like to learn more about Find My Things and teachable AI, visit aka.ms/TeachableAI. Thank you for tuning in. I’m Amber Tingle. Join us next time for more Abstracts.

[MUSIC FADES]


[1] The AI-powered tool Find My Things is not a standalone app. It is available as a feature in the Seeing AI app. Download the Seeing AI app to try Find My Things.
