{"id":162762,"date":"2012-06-01T00:00:00","date_gmt":"2012-06-01T00:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/msr-research-item\/a-challenge-set-for-advancing-language-modeling\/"},"modified":"2018-10-16T20:59:29","modified_gmt":"2018-10-17T03:59:29","slug":"a-challenge-set-for-advancing-language-modeling","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/a-challenge-set-for-advancing-language-modeling\/","title":{"rendered":"A Challenge Set for Advancing Language Modeling"},"content":{"rendered":"
\n

In this paper, we describe a new, publicly available corpus intended to stimulate research into language modeling techniques which are sensitive to overall sentence coherence. The task uses the Scholastic Aptitude Test\u2019s sentence completion format. The test set consists of 1040 sentences, each of which is missing a content word. The goal is to select the correct replacement from amongst five alternates. In general, all of the options are syntactically valid, and reasonable with respect to local N-gram statistics. The set was generated by using an N-gram language model to generate a long list of likely words, given the immediate context. These options were then hand-groomed, to identify four decoys which are globally incoherent, yet syntactically correct. To ensure the right to public distribution, all the data is derived from out-of-copyright materials from Project Gutenberg. The test sentences were derived from five of Conan Doyle\u2019s Sherlock Holmes novels, and we provide a large set of Nineteenth and early Twentieth Century texts as training material.<\/p>\n<\/div>\n

<\/p>\n","protected":false},"excerpt":{"rendered":"

In this paper, we describe a new, publicly available corpus intended to stimulate research into language modeling techniques which are sensitive to overall sentence coherence. The task uses the Scholastic Aptitude Test\u2019s sentence completion format. The test set consists of 1040 sentences, each of which is missing a content word. The goal is to select […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13556,13554],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-162762","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-computer-interaction","msr-locale-en_us"],"msr_publishername":"ACL\/SIGPARSE","msr_edition":"Workshop on the Future of Language Modeling for HLT, NAACL-HLT 2012","msr_affiliation":"","msr_published_date":"2012-06-01","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"205977","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","title":"holmes.pdf","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/holmes.pdf","id":205977,"label_id":0}],"msr_related_uploader":"","msr_attachments":[{"id":205977,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/holmes.pdf"}],"msr-author-ordering":[{"type":"user_nicename","value":"gzweig","user_id":31938,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=gzweig"},{"type":"user_nicename","value":"cburges","user_id":31351,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=cburges"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[],"msr_project":[171065,170884],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":171065,"post_title":"Recurrent Neural Networks for Language Processing","post_name":"recurrent-neural-networks-for-language-processing","post_type":"msr-project","post_date":"2012-11-23 11:45:31","post_modified":"2019-08-19 14:55:37","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/recurrent-neural-networks-for-language-processing\/","post_excerpt":"This project focuses on advancing the state-of-the-art in language processing with recurrent neural networks. We are currently applying these to language modeling, machine translation, speech recognition, language understanding and meaning representation. A special interest in is adding side-channels of information as input, to model phenomena which are not easily handled in other frameworks. A toolkit for doing RNN language modeling with side-information is in the associated download. Sample word vectors for use with this toolkit…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/171065"}]}},{"ID":170884,"post_title":"MSR Sentence Completion Challenge","post_name":"msr-sentence-completion-challenge","post_type":"msr-project","post_date":"2011-12-08 11:11:26","post_modified":"2017-05-31 16:13:49","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/msr-sentence-completion-challenge\/","post_excerpt":"The MSR sentence completion challenge is intended to stimulate research in the area of semantic modeling. The challenge set consists of fill-in-the-blank questions similar to those found on the widely used Scholastic Aptitude Test. The sentence completion questions we focus on test the students ability to select words which are meaningful and coherent in the the context of a complete sentence. In general, this determination cannot be made on the basis of grammatical correctness alone.…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/170884"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/162762","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/162762\/revisions"}],"predecessor-version":[{"id":531904,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/162762\/revisions\/531904"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=162762"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=162762"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=162762"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=162762"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=162762"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=162762"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=162762"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=162762"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=162762"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=162762"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=162762"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=162762"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=162762"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=162762"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=162762"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=162762"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}