{"id":158919,"date":"2010-06-01T00:00:00","date_gmt":"2010-06-01T00:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/msr-research-item\/on-indexing-error-tolerant-set-containment\/"},"modified":"2018-10-16T21:08:24","modified_gmt":"2018-10-17T04:08:24","slug":"on-indexing-error-tolerant-set-containment","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/on-indexing-error-tolerant-set-containment\/","title":{"rendered":"On Indexing Error-Tolerant Set Containment"},"content":{"rendered":"

Prior work has identi\ufb01ed set based comparisons as a useful primitive for supporting a wide variety of similarity functions in record matching. Accordingly, various techniques have been proposed to improve the performance of set similarity lookups. However, this body of work focuses almost exclusively on symmetric notions of set similarity. In this paper, we study the indexing problem for the asymmetric Jaccard containment similarity function that is an error-tolerant variation of set containment. We enhance this similarity function to also account for string transformations that re\ufb02ect synonyms such as \u201cBob\u201d and \u201cRobert\u201d referring to the same \ufb01rst name. We propose an index structure that builds inverted lists on carefully chosen token-sets and a lookup algorithm using our index that is sensitive to the output size of the query. Our experiments over real life data sets show the bene\ufb01ts of our techniques. To our knowledge, this is the \ufb01rst paper that studies the indexing problem for Jaccard containment in the presence of string transformations.<\/p>\n","protected":false},"excerpt":{"rendered":"

Prior work has identi\ufb01ed set based comparisons as a useful primitive for supporting a wide variety of similarity functions in record matching. Accordingly, various techniques have been proposed to improve the performance of set similarity lookups. However, this body of work focuses almost exclusively on symmetric notions of set similarity. In this paper, we study […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13563,13555],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-158919","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-data-platform-analytics","msr-research-area-search-information-retrieval","msr-locale-en_us"],"msr_publishername":"","msr_edition":"Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD)","msr_affiliation":"","msr_published_date":"2010-06-01","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD)","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"207095","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","title":"p927.pdf","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/p927.pdf","id":207095,"label_id":0}],"msr_related_uploader":"","msr_attachments":[{"id":207095,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/p927.pdf"}],"msr-author-ordering":[{"type":"text","value":"Parag Agrawal","user_id":0,"rest_url":false},{"type":"user_nicename","value":"arvinda","user_id":31106,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=arvinda"},{"type":"user_nicename","value":"skaushi","user_id":33680,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=skaushi"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[957177],"msr_project":[169513],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/158919"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/158919\/revisions"}],"predecessor-version":[{"id":533208,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/158919\/revisions\/533208"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=158919"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=158919"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=158919"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=158919"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=158919"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=158919"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=158919"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=158919"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=158919"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=158919"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=158919"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=158919"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=158919"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=158919"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=158919"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}