{"id":305861,"date":"2011-05-26T09:00:41","date_gmt":"2011-05-26T16:00:41","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=305861"},"modified":"2016-10-15T12:39:00","modified_gmt":"2016-10-15T19:39:00","slug":"mavis-unlocks-spoken-words","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/mavis-unlocks-spoken-words\/","title":{"rendered":"MAVIS Unlocks Spoken Words"},"content":{"rendered":"

*By Janie Chang, Writer, Microsoft Research*

Not long ago, Internet content was mostly text-based, with search tools supporting the need to index text efficiently and browsers providing the ability to search within a document for every instance of a keyword or phrase.

Now, multimedia content has exploded onto the scene, thanks to technology that makes it easy to create and share multimedia. High-quality video cameras have become affordable, and every phone contains a camera. Low storage and bandwidth costs make it viable to upload and access large multimedia files, and the growth of social networking provides venues for consumers to share their experiences via audio, video, and photos. Search engines now find images, audio, and video files that have been tagged with text.

For short audio or video clips, a textual description of the content may be sufficient. When faced with a two-hour video, or a collection of videos that could number in the hundreds or even thousands, users lack the equivalent of a document's Find function that enables them to skip through footage directly to spots where a keyword or phrase is mentioned.

\"MAVIS\"\u201cImagine having to read through every text document,\u201d says Behrooz Chitsaz, director of IP Strategy for Microsoft Research, \u201cjust to find the one paragraph that contains the one topic of relevance. This is basically how we are consuming speech content today, and we want to change that.\u201d<\/p>\n

Multimedia search, he says, is much the same as it was 10 years ago: heavily textual, with limited capabilities for searching audio or video files for specific words. Where such capabilities do exist, they usually are applied to content such as popular movies or lyrics. Speech-recognition technology can automate transcription, but making an audio or video file search-ready often still requires a person to listen to it and transcribe it by hand.

Hence Chitsaz's enthusiasm for MAVIS, the Microsoft Research Audio Video Indexing System. MAVIS comprises a set of software components that use speech-recognition technology to enable efficient, automated indexing and searching of digitized spoken content. By focusing on speech recognition, MAVIS enables search not only within audio files, but also within video. Meetings, presentations, online lectures, and other typically non-closed-captioned footage all benefit from a speech-based approach.

## Opening Up the Archives Just Got Easier

How significant is the functionality envisioned by MAVIS? Chitsaz and Microsoft Research Asia's Frank Seide, senior researcher and research manager, and Kit Thambiratnam, lead researcher, are conducting technical previews to find out. MAVIS has been running on a trial basis on digital archives for the U.S. states of Georgia, Montana, and Washington, as well as for the U.S. Department of Energy, the British Library, and, most recently, CERN, the European Organization for Nuclear Research.

\"Behrooz

Behrooz Chitsaz<\/p><\/div>\n

"I started to realize this was really important," Chitsaz recalls, "when the state of Washington contacted us. They had audio files of House of Representatives sessions from the '70s and '80s that they were transferring from tapes to digital files. They had digitized the audio but didn't know what was in them. It was like they had backups but no way to restore them."

As a matter of policy, governmental organizations have to archive meetings for public access, but the archives are of little use if they can't be searched. Manual transcriptions are expensive, and it is unreasonable to expect state residents and legislators to listen through hours of recordings to find relevant information.

MAVIS was able to index the thousands of files automatically. Now, governmental users and state residents can search for topics of interest by keyword or phrase. Search results list the precise moments in specific sessions when the keyword was mentioned, and users can jump directly to those spots. Because MAVIS is integrated into the archive's text-search infrastructure, the search mechanism and user experience are the same as for searching textual documents.

"MAVIS has made legislative research far easier and faster," Chitsaz says. "Users can search through tens of thousands of session hours and find discussions on a particular bill or issue. They can discover exactly how debates went or the original, historical reasons behind certain decisions. This has an enormously positive impact on government transparency."

Governmental archives are an ideal starting point for implementing MAVIS. Magnetic tapes start to degrade after about 30 years, a factor that is driving digital-preservation initiatives. Those initiatives, in turn, increase the need for technologies that can search and categorize multimedia files. The content of such archives is also ideal, because the recordings are "speech-recognition friendly": mostly speech, with minimal background noise.

## The High-Accuracy Speech-Recognition Challenge

Background noise is only one of the challenges MAVIS researchers are trying to solve in the quest for high-accuracy speech recognition. Their goal is for MAVIS to handle general conversational speech, and that means coping with variables such as accents, ambient noise, reverberation, vocabulary, and language.

"Our brains can filter out noises," Chitsaz notes, "but it's hard for a computer. Vocabulary is also difficult. For instance, domains such as health care have specific terminologies. There's also context, which helps humans understand, but that's hard to introduce to a computer system. We were confronted with all those variables. Speech recognition isn't new; it's all about developing techniques that make it highly accurate."

An important step forward has been a technique developed by researchers at Microsoft Research Asia called Probabilistic Word-Lattice Indexing, which improves accuracy when indexing conversational speech. Lattice indexing records the system's confidence in each recognized word along with alternate recognition candidates.

"When we recognize the audio track of a video," Seide explains, "we keep the alternatives. If I say 'Crimean War,' the system may think I've said 'crime in a war,' because it lacks context. But we retain that as an alternative. By keeping the multiple word alternatives as well as the highest-confidence word, we get much better recall rates during the search phase.

"We represent word alternatives as a graph structure: the lattice. Experiments showed that for multiword queries, indexing and searching this word lattice significantly improved document-retrieval accuracy compared with plain speech-to-text transcripts: a 30- to 60-percent improvement for phrase queries, and more than a 200-percent improvement for queries consisting of multiple words or phrases."
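To make the idea concrete, here is a minimal sketch of lattice-style indexing. It is illustrative only, not the MAVIS implementation: the arc format, confidence values, and function names are all hypothetical. The point is that each time slot keeps lower-confidence alternatives alongside the top hypothesis, so a query can still match a word the recognizer almost discarded.

```python
# Hypothetical sketch of word-lattice indexing (not the MAVIS code).
# Each lattice arc is (start_seconds, word, confidence); alternatives
# for the same time slot share a start time.
from collections import defaultdict

lattice = [
    (12.0, "crimean", 0.41), (12.0, "crime", 0.38), (12.0, "crying", 0.21),
    (12.6, "war", 0.90), (12.6, "oar", 0.10),
]

def build_index(arcs):
    """Inverted index: word -> list of (start_seconds, confidence)."""
    index = defaultdict(list)
    for start, word, conf in arcs:
        index[word].append((start, conf))
    return index

def search(index, query, min_conf=0.2):
    """Return hits above a confidence floor, highest-confidence first."""
    hits = [h for h in index.get(query.lower(), []) if h[1] >= min_conf]
    return sorted(hits, key=lambda h: -h[1])

index = build_index(lattice)
print(search(index, "crimean"))  # [(12.0, 0.41)] -- kept as an alternative
```

A transcript-only index would have stored just the single best word per time slot; keeping the alternatives is what recovers a query like "Crimean" when "crime" narrowly won the recognition.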

Another challenge is handling the broad range of potential topics.

"Unfortunately, speech recognizers are pretty dumb and can only recognize words they've seen before," Thambiratnam explains. "That means many useful terms, like names and technology jargon, probably aren't going to be known to our speech recognizer. We leverage Bing search to try to solve that, essentially trying to guess up front what words are most relevant for a video and then finding data on the web that we can use to adapt the vocabulary of our speech recognizer so that it does a better job on a particular file."
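That adaptation pipeline can be sketched in a few lines. Everything below is hypothetical: the web-search step is stubbed out with canned text rather than a real Bing call, and the lexicon is a toy set, but the flow (guess topic terms, fetch related text, add unseen words to the recognizer's vocabulary) mirrors the description above.

```python
# Hypothetical sketch of vocabulary adaptation (not the MAVIS pipeline).
def fetch_related_text(terms):
    """Stand-in for a web search over the video's metadata terms; a real
    system would retrieve pages about the same topic."""
    return "the Crimean War began in 1853 between Russia and an alliance"

def adapt_vocabulary(base_lexicon, metadata_terms):
    """Return the lexicon extended with topic words found on the web."""
    web_text = fetch_related_text(metadata_terms)
    return base_lexicon | set(web_text.lower().split())

lexicon = {"the", "war", "between", "and", "in", "an"}
adapted = adapt_vocabulary(lexicon, ["Crimean", "War", "history"])
print(sorted(adapted - lexicon))  # new words the recognizer can now emit
```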

Another piece of information critical to the usability of MAVIS is timing: the system keeps timestamps, so search results include the points in an audio or video stream where the word occurs.
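Those timestamps are what turn a text hit into a playable moment. The small illustration below converts a second offset into a readable position and a jump link; the URL scheme is made up for the example, not a MAVIS interface.

```python
# Illustration only: turning a timestamped hit into a jump point.
def to_hms(seconds):
    """Format a second offset as HH:MM:SS."""
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

hit_seconds = 4375  # hypothetical hit inside a two-hour session recording
print(to_hms(hit_seconds))                                        # 01:12:55
print(f"https://example.org/session-1979-03-02#t={hit_seconds}")  # deep link
```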

## Implementing for Usability and Easy Deployment

Accurate speech recognition is compute-intensive and, therefore, ideally suited to a cloud-computing environment. The MAVIS architecture takes advantage of the Windows Azure Platform to handle the speech-recognition process. The actual multimedia content can live behind the content owner's firewall, and the organization can submit thousands of hours of audio and video for indexing by the speech-recognition engine running on Azure, without having to invest in upgrading its in-house computing infrastructure.

While MAVIS provides tools that make it easy to submit audio and video content for indexing, just as critical to usability is the format of the results, which come back in a file that can be imported into Microsoft SQL Server for full-text indexing. This enables audio or video content to be searched just as any other textual content can be searched.

"Compatibility with SQL Server is very important," Chitsaz comments, "because it means that searching for spoken words inside audio or video files becomes just like searching for text in an SQL Server database, a process familiar to IT organizations. We are not introducing a new search mechanism. They can maintain the same search infrastructure and processes."
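In practice, that might look like an ordinary full-text query. The sketch below is an assumption-laden illustration, not the MAVIS schema: the table, columns, and connection string are invented, and it presumes a SQL Server full-text index already exists on the transcript column.

```python
# Hypothetical example: querying full-text-indexed recognition output
# in SQL Server via pyodbc. Table and column names are invented.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=archive;DATABASE=SpokenArchive;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# CONTAINS goes through the SQL Server full-text index, exactly as it
# would for a table of ordinary text documents.
cursor.execute(
    """
    SELECT MediaFile, StartSeconds
    FROM SpokenWordIndex
    WHERE CONTAINS(RecognizedText, ?)
    ORDER BY MediaFile, StartSeconds
    """,
    '"Crimean War"',  # phrase query
)
for media_file, start_seconds in cursor.fetchall():
    print(f"{media_file} @ {start_seconds:.1f}s")
```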

A demo of MAVIS on the Microsoft Video Web demonstrates the team's implementation. The site contains more than 15,000 MSNBC news videos. Searches are fast, and the results enable direct access into a video stream. Users can see textual information, as well as a timeline that shows where a keyword or phrase occurs.

## Unlocking Information in a Disruptive Way

Even for Chitsaz, who is intimately familiar with the technology, the information MAVIS delivers still manages to surprise. During Iceland's volcanic eruptions in April 2010, he used MAVIS to search Microsoft's video archives to see what was available on the topic of "volcano." He found more than he expected: lectures that included volcano imagery, sensors for tracking volcanic activity, and interviews with people who had experienced eruptions. When searching through government archives, he has found interesting discussions on topics such as taxes, public safety, and the environment that still have a bearing on people in a community.

MAVIS, Chitsaz says, is a disruptive technology that will affect the way we consume speech content, much as web search affected the consumption of text content on the Internet.

"Each time I experience the value of MAVIS for myself," he says, "it occurs to me that the textual information originally associated with the files did not include the term I was searching on. Without MAVIS, I would not have known about the information locked in those files."
