{"id":1016412,"date":"2024-03-19T15:53:17","date_gmt":"2024-03-19T22:53:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=1016412"},"modified":"2024-03-26T10:25:37","modified_gmt":"2024-03-26T17:25:37","slug":"training-audio-captioning-models-without-audio","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/training-audio-captioning-models-without-audio\/","title":{"rendered":"Training Audio Captioning Models without Audio"},"content":{"rendered":"

Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.<\/p>\n
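The encoder swap and noise injection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the random vectors merely stand in for CLAP text and audio embeddings, the embedding size of 1024 and the helper names inject_noise and decoder_condition are assumptions made for illustration, and the noise standard deviation of 0.015 is the value quoted in the results discussion on this page.

```python
import numpy as np

EMB_DIM = 1024    # assumed embedding size, for illustration only
NOISE_STD = 0.015  # standard deviation quoted in the results discussion


def l2_normalize(x):
    # Scale each row to unit length, as is common for contrastive embeddings.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def inject_noise(text_emb, std=NOISE_STD, rng=None):
    # Add zero-mean Gaussian noise to text embeddings (training only).
    rng = np.random.default_rng() if rng is None else rng
    return text_emb + rng.normal(0.0, std, size=text_emb.shape)


def decoder_condition(embedding, training, rng=None):
    # Training: noisy text embedding. Inference: audio embedding, unchanged.
    if training:
        return inject_noise(embedding, rng=rng)
    return embedding


rng = np.random.default_rng(0)
# Hypothetical stand-ins for the frozen CLAP text and audio encoder outputs.
text_emb = l2_normalize(rng.normal(size=(4, EMB_DIM)))
audio_emb = l2_normalize(rng.normal(size=(4, EMB_DIM)))

train_input = decoder_condition(text_emb, training=True, rng=rng)
infer_input = decoder_condition(audio_emb, training=False)
```

In this sketch the decoder only ever sees perturbed text embeddings during training, which is one way to make it tolerant of the offset between text and audio embeddings when the audio encoder is swapped in at inference.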

 <\/p>\n

 <\/p>\n

\"graphical<\/p>\n

The first panel depicts the modality gap between CLAP pretrained audio and pretrained text embeddings in the joint audio-text space. The second panel shows the proposed method of text-only training for Automated Audio Captioning. At inference, the text encoder is swapped with the audio encoder and a caption is produced for the input audio. Only the mapping network m is trainable, while modules marked with a snowflake are frozen. The Prefix is the output of m. Single arrows depict embedding vectors, while multiple arrows indicate a sequence of vectors.<\/p>\n

 <\/p>\n

 <\/p>\n

\"table\"<\/p>\n

The table shows results for various models trained on both AudioCaps and Clotho. Models in rows 1-4 use both audio and text in training. The proposed text-only model (row 5) uses only text data and random Gaussian noise with a standard deviation of 0.015. It achieves performance comparable to the best audio captioning models in the literature and obtains a SPIDEr score of 0.256 on Clotho and 0.455 on AudioCaps, higher than the 0.215 and 0.437 reported by Kim et al.<\/p>\n

Text-only training is a valid alternative for training and\/or initializing audio captioning systems. We also train the same architecture, designed for text-only training, on audio-text pairs. The setup is similar to Fig. 1, except that during training we use audio files with an audio encoder instead of text with a text encoder and Gaussian noise. This is the last, grayed-out row of the table above. The difference in SPIDEr score between audio-text and text-only training is small: +0.02 on AudioCaps and +0.01 on Clotho. This indicates that our text-only training can achieve comparable results without audio data. The main benefit of text-only training is the ability to train on unpaired, openly available text. We explore this in Section 5.1, where, using LLM-generated text, we show that text-only training can improve over audio-text training.<\/p>\n","protected":false},"excerpt":{"rendered":"

Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[243062,13555],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[247741],"msr-conference":[259657],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1016412","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-audio-acoustics","msr-research-area-search-information-retrieval","msr-locale-en_us","msr-field-of-study-audio-signal-processing"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2024-4-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/resear
ch\/uploads\/prod\/2024\/03\/TRAINING-AUDIO-CAPTIONING-MODELS-WITHOUT-AUDIO_2309.07372.pdf","id":"1016424","title":"training-audio-captioning-models-without-audio_2309-07372","label_id":"243132","label":0}],"msr_related_uploader":"","msr_attachments":[{"id":1016424,"url":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2024\/03\/TRAINING-AUDIO-CAPTIONING-MODELS-WITHOUT-AUDIO_2309.07372.pdf"}],"msr-author-ordering":[{"type":"user_nicename","value":"Soham Deshmukh","user_id":40312,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Soham Deshmukh"},{"type":"user_nicename","value":"Benjamin Elizalde","user_id":41662,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Benjamin Elizalde"},{"type":"user_nicename","value":"Dimitra Emmanouilidou","user_id":37461,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Dimitra Emmanouilidou"},{"type":"text","value":"Bhiksha Raj","user_id":0,"rest_url":false},{"type":"text","value":"Rita Singh","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Huaming Wang","user_id":32052,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Huaming 
Wang"}],"msr_impact_theme":[],"msr_research_lab":[199565],"msr_event":[],"msr_group":[144923],"msr_project":[],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1016412"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":6,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1016412\/revisions"}],"predecessor-version":[{"id":1018218,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1016412\/revisions\/1018218"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1016412"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=1016412"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1016412"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1016412"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1016412"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=1016412"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1016412"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp
-json\/wp\/v2\/msr-platform?post=1016412"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=1016412"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1016412"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1016412"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1016412"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1016412"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1016412"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1016412"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1016412"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}