{"id":744385,"date":"2021-05-05T21:26:33","date_gmt":"2021-05-06T04:26:33","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=744385"},"modified":"2022-10-17T21:36:20","modified_gmt":"2022-10-18T04:36:20","slug":"discovering-related-data-at-scale","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/discovering-related-data-at-scale\/","title":{"rendered":"Discovering Related Data At Scale"},"content":{"rendered":"
Analysts frequently require data from multiple sources for their tasks, but finding these sources is challenging in exabyte-scale data lakes. In this paper, we address this problem for our enterprise’s data lake by using machine-learning to identify related data sources. Leveraging queries made to the data lake over a month, we build a relevance model that determines whether two columns across two data streams are related or not. We then use the model to find relations at scale across tens of millions of column-pairs and thereafter construct a data relationship graph in a scalable fashion, processing a data lake that has 4.5 Petabytes of data in approximately 80 minutes. Using manually labeled datasets as ground-truth, we show that our techniques show improvements of at least 23% when compared to state-of-the-art methods.<\/p>\n","protected":false},"excerpt":{"rendered":"
Analysts frequently require data from multiple sources for their tasks, but finding these sources is challenging in exabyte-scale data lakes. In this paper, we address this problem for our enterprise’s data lake by using machine-learning to identify related data sources. Leveraging queries made to the data lake over a month, we build a relevance model […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13563,13547],"msr-publication-type":[193715],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[247552,255436,254995],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-744385","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-data-platform-analytics","msr-research-area-systems-and-networking","msr-locale-en_us","msr-field-of-study-big-data","msr-field-of-study-databases","msr-field-of-study-systems"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2021-8-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"VLDB Endowment","msr_volume":"14","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"8","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"http:\/\/www.vldb.org\/pvldb\/vol14\/p1392-bharadwaj.pdf","label_id":"243109","label":0}],"msr_related_uploader":"","msr_attachments":[],"msr-author-ordering":[{"type":"text","value":"Sagar Bharadwaj","user_id":0,"rest_url":false},{"type":"text","value":"Praveen Gupta","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Ranjita Bhagwan","user_id":31217,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ranjita Bhagwan"},{"type":"user_nicename","value":"Saikat Guha","user_id":33493,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Saikat Guha"}],"msr_impact_theme":[],"msr_research_lab":[199562],"msr_event":[759268],"msr_group":[144939],"msr_project":[887322],"publication":[],"video":[],"download":[],"msr_publication_type":"article","related_content":{"projects":[{"ID":887322,"post_title":"Network Brain","post_name":"netbrain","post_type":"msr-project","post_date":"2022-10-17 05:39:45","post_modified":"2023-11-02 01:11:16","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/netbrain\/","post_excerpt":"Holistic optimization of large-scale networked services The growth of large-scale networked services has brought to the fore myriad challenges: performance, reliability, efficiency, cost, and more. Traditionally, work on addressing and balancing these has been done in silos. For instance, an application could make choices to optimize its own performance or cost, while treating the rest of the workload and infrastructure as outside its purview. However, such an approach breaks down when an application or service…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/887322"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/744385"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/744385\/revisions"}],"predecessor-version":[{"id":744388,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/744385\/revisions\/744388"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=744385"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=744385"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=744385"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=744385"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=744385"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=744385"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=744385"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=744385"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=744385"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=744385"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=744385"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=744385"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=744385"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=744385"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=744385"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=744385"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}