{"id":305930,"date":"2011-04-14T11:00:06","date_gmt":"2011-04-14T18:00:06","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=305930"},"modified":"2016-10-15T13:41:54","modified_gmt":"2016-10-15T20:41:54","slug":"kinect-audio-preparedness-pays-off","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/kinect-audio-preparedness-pays-off\/","title":{"rendered":"Kinect Audio: Preparedness Pays Off"},"content":{"rendered":"

By Rob Knies, Senior Editor, Microsoft Research<\/em><\/p>\n

It always helps to be prepared. Just ask Ivan Tashev<\/a>.<\/p>\n

A principal software architect in the Speech<\/a> group at Microsoft Research Redmond<\/a>, Tashev played an integral role in developing the audio technology that enabled Kinect for Xbox 360<\/a> to become the fastest-selling consumer-electronics device ever, with eight million units sold in its first 60 days on the market.<\/p>\n

Kinect represents part of Microsoft\u2019s deep investment in natural user interfaces, which make computing intuitive to use and able to do far more for users. On April 13, Scott Guthrie<\/a>, Microsoft corporate vice president of the .NET Developer Platform, announced features of the impending Kinect for Windows non-commercial software-development kit<\/a> during MIX11<\/a>, a three-day, web-focused conference in Las Vegas. Tashev himself will be speaking that day about his work in a talk entitled \u201cAudio for Kinect: From Idea to \u2018Xbox, Play!\u2019\u201d<\/p>\n

Such prominence isn\u2019t earned easily. In the case of the audio functionality for Kinect, it took a combination of preparation and patience to do the trick.<\/p>\n

\u201cI spent pretty much my entire career in Microsoft Research,\u201d Tashev says, \u201cknowing that, sooner or later, people would be talking to their computers. I was absolutely sure they would not want to wear a headset. So, from my first day with Microsoft Research, I\u2019ve been working on the problem of hands-free sound capturing from a certain distance in normal conditions and having enough clean sound in the output, good enough for telecommunications and for speech recognition.<\/p>\n

\u201cI didn\u2019t know which product would be interested in this. These technologies were designed in Microsoft Research, and, in our experiments, they worked on a small set of data, well enough that we wrote a scientific publication.\u201d<\/p>\n

Enter Alex Kipman, general manager of Xbox Incubation within Microsoft\u2019s Interactive Entertainment Business. He was driving the development of Kinect, the revolutionary product that enables controller-free command of an Xbox. He encountered Tashev in 2008 during Microsoft Research\u2019s annual TechFest<\/a> showcase, and several months later, Kipman decided to follow up.<\/p>\n

\u201cWe came to Microsoft Research,\u201d he recalls, \u201cand asked: \u2018Can you help us make a system that can do speech recognition without having to push a button to talk? We\u2019re all about no buttons, so you can\u2019t have a push-to-talk system.\u2019<\/p>\n

\u201cAnd we said: \u2018The system needs to be listening to us 100 percent of the time. You can leave this on for days, and it still needs to work.\u2019<\/p>\n

\u201cWe said: \u2018We want a system that can do speech recognition four meters at a distance. You\u2019re not going to have a captive audience a few feet in front of a microphone. People can be anywhere about four meters\u2019 distance, and they should still be able to talk and be recognized.\u2019<\/p>\n

\u201cAnd then we said: \u2018Our environment is all about people having fun. If we do our jobs correctly, every single person is going to be having fun, so there\u2019s a lot of noise from the loudspeakers, and the system still needs to pick out the signal when that person to whom you\u2019ve been listening all day says, \u201cXbox, play movie.\u201d\u2019\u201d<\/p>\n

Many people might have been daunted by such a formidable laundry list, but not Tashev.<\/p>\n

Ivan Tashev<\/p><\/div>\n

\u201cThe most difficult part to resolve was overcoming the problem of the microphones hearing the sound from loudspeakers,\u201d he explains. \u201cFirst, gamers tend to listen to very loud sounds. Second, the Kinect device is closer to the loudspeakers than to the humans speaking in the room. The sound from the loudspeakers is way louder than the normal human voice.\u201d<\/p>\n

The technique for removing loudspeaker sound from a microphone signal is called acoustic echo cancellation, and it\u2019s included in virtually all speakerphones. But in normal speakerphone usage, the loudspeaker sound level is about the same as that of a human voice. In the Kinect-usage scenario, the loudspeakers are louder, the humans are farther away, and the loudspeaker signal is not a single, mono signal\u2014it\u2019s in stereo or surround sound.<\/p>\n

That meant that not only did Tashev need to suppress loudspeaker echoes an order of magnitude louder than usual, but he also had to create a stereo acoustic-echo-cancellation algorithm\u2014a longstanding research problem. And he had to cope with reverberation, which makes speech recognition even more difficult from four meters\u2019 distance, and to capture an enormous dynamic range, one that could tease out, amid blaring loudspeakers, the soft voice of a young child.<\/p>\n

Tashev was inured to such challenges. While a professor at the Technical University of Sofia in his native Bulgaria, he had worked with a student on microphone arrays, which enabled the localization of a human speaker with just a couple of microphones. Upon joining Microsoft Research in 2001, Tashev began work on beam-forming research that helped lead to the Microsoft RoundTable, a videoconferencing device with a 360-degree camera. And his algorithm enabled Windows Vista to offer integrated microphone-array support.<\/p>\n

He also had pursued some pure research over the years. He began exploring multichannel acoustic-echo cancellation in 2007, and while it remained uncertain who would want such technology, or why, Tashev remained intrigued.<\/p>\n

Ready for Action<\/h2>\n

Thus, when Kipman contacted Tashev in 2009 to inquire about the demo on surround-sound acoustic-echo cancellation seen during TechFest, the researcher might have been caught by surprise, but he certainly wasn\u2019t unprepared.<\/p>\n

By May, Tashev had been embedded in the Xbox team, had been briefed about the new product, and was starting to design the audio pipeline for Kinect.<\/p>\n

\u201cWhen we derived the requirements for the pipeline,\u201d he says, \u201cthere was a small meeting, and we found that even if we could take care of all the problems, we still needed an acoustic-echo canceller 10 times better than normal industrial devices have.<\/p>\n

\u201cThe first reaction was to cut the feature, but Alex said, very firmly: \u2018We\u2019re shipping this. We have to make it work.\u2019\u201d<\/p>\n

That took a lot of teamwork\u2014and dedication.<\/p>\n

\u201cWe had the technologies,\u201d Tashev says, \u201cand Xbox has an exceptional engineering team. But even under those conditions, we needed the determination of every member of that team, from developer, tester, and program manager on up to the general manager\u2019s level. Without their hard work and determination, Kinect wouldn\u2019t have happened.\u201d<\/p>\n

The speech research posed a significant hurdle for the audio team working feverishly on Kinect.<\/p>\n

\u201cSpeech is a serious beast,\u201d Tashev acknowledges. \u201cIt is still more science and art than engineering. Speech has its strong points, and it has scenarios where it is weaker. If I have to select one of 30,000 songs in my collection, speech is a perfect modality. I can send a speech query like \u2018Play me that song about submarines by the Beatles,\u2019 and our existing technology will find, relatively quickly, all songs with \u2018submarine\u2019 in the metadata and filter it with ones with \u2018Beatles.\u2019 We might end up with three or four candidates.<\/p>\n

A combination of speech and gestures gives Kinect a user-friendly interface.<\/p><\/div>\n

\u201cBut then what? If this is a speech-only interface, we\u2019ll have to listen to the computer read us the title of all four songs. That\u2019s annoying\u2014not a good modality for speech.\u201d<\/p>\n

On the other hand, in the scenario of selecting from a short list, gestures work perfectly.<\/p>\n

\u201cYou can just point and select,\u201d Tashev says. \u201cIn this simple exercise, speech is good in one part, and gesture is good for another. Combining those, adding sounds and graphics, and we have something very powerful: a multimodal user interface. If properly designed, it can provide an intuitive and natural way of communication between the computer and the human\u2014and does not require any controllers or buttons.\u201d<\/p>\n

Thus, when he began work in earnest with the Xbox team, Tashev came complete with a handful of valuable technologies. At that point, it was time to transfer them into an actual product. But things did not go smoothly: Several acoustical consultants doubted that the target functionality was achievable. After a series of refinements, however, the assembled engineers were able to devise an acoustical-analysis program using basic algorithms. It was slow, but it worked.<\/p>\n

\u201cOnce we can analyze,\u201d Tashev notes, \u201cwe can optimize.\u201d<\/p>\n

Using a computing cluster, architects Ivan Tashev and Wei-ge Chen carefully tuned the acoustical models for Kinect.<\/p><\/div>\n

They put the program on a large computing cluster and began to vary the parameters of the microphones\u2019 design and placement. After several days of optimization on the cluster, a measurement using the final plastic mold tested at quite close to the desired product specs. By this point, Tashev was acting as the point of contact for almost all Kinect audio issues. In May 2010, the audio-processing pipeline was ready. The next month, the speech recognizer was trained, but the results needed significant improvement.<\/p>\n

Tashev went back to Microsoft Research to summon Wei-ge Chen<\/a>, software architect. They spent four months iterating and testing, working closely with the Microsoft Tellme<\/a> group, and by Sept. 26, test results on the latest acoustical models met the shipping criteria.<\/p>\n

Thirty-four days later, Kinect was shipped to an eager public.<\/p>\n

\u201cMicrosoft didn\u2019t get to this position by accident,\u201d Tashev smiles. \u201cIt happens when you have technologies designed over the years, when you have your teams built\u2014an excellent engineering team from Xbox and very good research from Microsoft Research\u2014and those teams are willing and trained and encouraged to work together. Nothing serious\u2014no breakthrough\u2014happens by accident.\u201d<\/p>\n

And, as Tashev celebrated his 10th year with Microsoft Research, and with his contributions having helped Kinect transform consumer electronics, he took a moment to reflect.<\/p>\n

Support System<\/h2>\n

\u201cThe algorithms I was supposed to deliver were mine,\u201d he says, \u201cdesigned by me and people I worked with. But I have been encouraged over the years by my managers, Anoop Gupta, Rico Malvar<\/a>, and Alex Acero.<\/p>\n

\u201cIn Xbox, I\u2019d add Alex Kipman, the visionary behind Kinect, the person who said, \u2018We want this.\u2019 And, at the end, Ben Kilgore was the person who said: \u2018We have a challenge with our audio code. We should ask Microsoft Research to help us ship it.\u2019\u201d<\/p>\n

The assistance, Kipman makes it abundantly clear, was most appreciated.<\/p>\n

\u201cIf you know anything about the space of speech recognition,\u201d he says, \u201cyou need a few improbables made possible to make this happen. There was one person who really stood up to the challenge and said: \u2018You know what? This stuff is improbable, but we\u2019re going to make it happen.\u2019 It\u2019s Ivan, one of the wicked smart people that made the improbable possible.\u201d<\/p>\n

Now, with Kinect having shipped, Tashev and others responsible can rest easy, right?<\/p>\n

Hardly.<\/p>\n

\u201cThis is just the beginning,\u201d Tashev says. \u201cWe have a lot of new, amazing technologies in the pipeline that will allow us to improve the user experience and the ability for better communication between the humans and the gaming console. I constantly work with those guys to add more interesting tools for the game developers so they can make more fascinating games.\u201d<\/p>\n

Nevertheless, the Kinect experience will remain, for Tashev, unforgettable.<\/p>\n

\u201cI\u2019ve had a lot of experience working on real systems,\u201d he says, \u201cbut never something on the scale of Kinect. It was a really big, end-to-end system with extremely high requirements. Not many people get an opportunity to put together their work for the last seven years into a product that becomes a huge breakthrough. It\u2019s been a unique opportunity.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"

By Rob Knies, Senior Editor, Microsoft Research It always helps to be prepared. Just ask Ivan Tashev. A principal software architect in the Speech group at Microsoft Research Redmond, Tashev played an integral role in developing the audio technology that enabled Kinect for Xbox 360 to become the fastest-selling consumer-electronics device ever, with eight million […]<\/p>\n","protected":false},"author":39507,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[194476,194456,194462],"tags":[214406,214403,193602,196135,214412,214400,214409,193514,204625],"research-area":[13552,13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-305930","post","type-post","status-publish","format-standard","hentry","category-devices-and-hardware","category-natural-language-processing","category-speech-and-dialog","tag-acoustical-models","tag-computing-cluster","tag-kinect-for-windows","tag-kinect-for-xbox-360","tag-microsoft-roundtable","tag-mix11","tag-multichannel-acoustic-echo-cancellation","tag-techfest","tag-windows-vista","msr-research-area-hardware-devices","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144923],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","byline":"","formattedDate":"April 14, 2011","formattedExcerpt":"By Rob Knies, Senior Editor, Microsoft Research It always helps to be prepared. Just ask Ivan Tashev. 
A principal software architect in the Speech group at Microsoft Research Redmond, Tashev played an integral role in developing the audio technology that enabled Kinect for Xbox 360…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/305930"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=305930"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/305930\/revisions"}],"predecessor-version":[{"id":305945,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/305930\/revisions\/305945"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=305930"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=305930"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=305930"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=305930"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=305930"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=305930"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\
/wp\/v2\/msr-locale?post=305930"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=305930"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=305930"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=305930"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=305930"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}