{"id":245480,"date":"2011-10-13T10:00:20","date_gmt":"2011-10-13T17:00:20","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=245480"},"modified":"2016-12-08T06:31:07","modified_gmt":"2016-12-08T14:31:07","slug":"eliminating-duplicated-primary-data","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/eliminating-duplicated-primary-data\/","title":{"rendered":"Eliminating Duplicated Primary Data"},"content":{"rendered":"

By Douglas Gantenbein<\/em><\/p>\n

The amount of data created and stored in the world doubles about every 18 months. Some of that data is distinctive\u2014but by no means all of it. A PowerPoint<\/span><\/a> presentation might start bouncing around a work group, and within a week, many nearly identical copies could be scattered across an enterprise\u2019s desktops or servers.<\/p>\n

Eliminating redundant data is a process called data deduplication. It\u2019s not new\u2014appliances that scrub hard drives and delete duplicate data have existed for years. But those appliances work across multiple backup copies of the data. They do not touch the primary copy of the data, which exists on a live file server.<\/p>\n

Deduplicating as data is created and accessed\u2014primary data, as opposed to backup data\u2014is challenging. The process of deduplicating data consumes processing power, memory, and disk resources, and deduplication can slow data storage and retrieval when operating on live file systems.<\/p>\n

\n

Sudipta Sengupta, Principal Research Scientist, Microsoft Research<\/p><\/div>\n

\u201cPeople access lots of data stored on servers,\u201d says Sudipta Sengupta<\/span><\/a>, a senior researcher with Microsoft Research Redmond<\/span><\/a>. \u201cThey need fast access to that data\u2014ideally, as fast as without deduplication\u2014so it\u2019s a challenge to deduplicate data while also serving it in real time.\u201d<\/p>\n

Sengupta and Jin Li<\/span><\/a>, a principal researcher at the Redmond facility, have cracked that nut. In a partnership with the Windows Server 8<\/span><\/a> team, they have developed a fast, effective approach to deduplicating primary data, and they have delivered production code that will ship with Windows Server 8.<\/p>\n

The researchers began work on data deduplication about four years ago. Sengupta and Li believed there were big opportunities for reducing redundancies within primary data, an area that hadn\u2019t really been examined because of the impact deduplication could have on a server managing live data. They built a tool that would crawl directories on servers and analyze the data for deduplication savings. This showed that there were significant redundancies in primary data.<\/p>\n

Sengupta and Li next tackled the problem of detecting duplicated data. That required building and maintaining an index of existing data fragments\u2014also called \u201cchunks\u201d\u2014in the system. Their goal was to make the indexing process perform well with low resource usage. The Microsoft Research team\u2019s solution is based on a technology they designed called ChunkStash<\/span><\/a>, for \u201cchunk<\/i><\/b> metadata st<\/i><\/b>ore on flash<\/i><\/b>.\u201d ChunkStash stores the chunk metadata on flash memory in a log-structured manner, accesses it using a low RAM-footprint index, and exploits the fast-random-access nature of the flash device to determine whether new data is unique or duplicate. Not all of ChunkStash\u2019s performance benefits depend on flash memory: ChunkStash also greatly accelerates deduplication when hard disks alone are used for storage, which is the case in most server farms.<\/p>\n
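The lookup pattern described above (an append-only metadata log plus a small in-memory index) can be illustrated with a toy sketch. This is not ChunkStash itself: the class name, the 2-byte signature length, the use of SHA-256, and the Python list standing in for the on-flash log are all assumptions made for illustration.

```python
import hashlib


class ToyChunkIndex:
    # Hypothetical stand-in: a Python list plays the append-only metadata log
    # (the part the article places on flash), and a dict keyed by a short
    # signature plays the compact in-RAM index.

    def __init__(self, signature_bytes=2):
        self.signature_bytes = signature_bytes
        self.log = []    # append-only full digests, in arrival order
        self.index = {}  # short signature -> positions in the log

    def add_if_new(self, chunk):
        # Return True if the chunk is new (and record it), False if duplicate.
        digest = hashlib.sha256(chunk).digest()
        signature = digest[:self.signature_bytes]
        for position in self.index.get(signature, []):
            # Signatures can collide, so confirm against the full digest in the log.
            if self.log[position] == digest:
                return False
        self.log.append(digest)
        self.index.setdefault(signature, []).append(len(self.log) - 1)
        return True


idx = ToyChunkIndex()
results = [idx.add_if_new(c) for c in (b'alpha', b'beta', b'alpha')]
print(results, len(idx.log))  # [True, True, False] 2
```

Keeping only a short signature in memory is what keeps the RAM footprint low in this sketch; the cost is an extra read of the log to confirm each candidate match, which is the kind of random access that flash serves quickly.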

Product-Team Engagement<\/h1>\n

Sengupta and Li\u2019s work on deduplication caught the eye of the Windows Server team, which was in the early stages of working on Windows Server 8. The opportunity to include deduplication in the release was tempting, given customer needs and industry trends.<\/p>\n

\u201cStorage deduplication,\u201d says Thomas Pfenning, general manager for Windows Server, \u201cis the No. 1 technology customers are considering when investing in file-based storage solutions.\u201d<\/p>\n

The process of deduplication breaks up data into smaller fragments that become the targets of deduplication. These fragments could be entire files or \u201cchunks\u201d of a few kilobytes. Because data is subject to edits and modifications over time, breaking data into smaller chunks and deduplicating those smaller pieces might be more effective than finding and deduplicating entire files.<\/p>\n
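A minimal sketch of the difference between whole-file and chunk-level deduplication follows. It assumes fixed-size chunks and SHA-256 digests purely for illustration; the article does not describe the production chunking scheme at this level of detail, and the function names and sample data are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the toy data below splits into several chunks


def unique_bytes_whole_file(files):
    # Keep one copy of each distinct file, identified by its SHA-256 digest.
    seen = {}
    for data in files:
        seen.setdefault(hashlib.sha256(data).digest(), len(data))
    return sum(seen.values())


def unique_bytes_chunked(files, chunk_size=CHUNK_SIZE):
    # Split every file into fixed-size chunks and keep one copy of each distinct chunk.
    seen = {}
    for data in files:
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            seen.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return sum(seen.values())


# Three hypothetical versions of one document: same body, different final chunk.
versions = [b'AAAABBBBCCCCv001', b'AAAABBBBCCCCv002', b'AAAABBBBCCCCv003']
total = sum(len(v) for v in versions)
print(total, unique_bytes_whole_file(versions), unique_bytes_chunked(versions))  # 48 48 24
```

No two versions are byte-identical, so whole-file deduplication saves nothing here, while chunk-level deduplication stores the shared body once and halves the footprint.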

Take a PowerPoint presentation, for instance. A dozen slightly different versions might exist on a server. Is it better to find entire files that are identical and toss out the spares, or to unearth the pieces multiple files might have in common, and remove those duplicates?<\/p>\n

To find out, the Microsoft team analyzed data from 15 globally distributed servers within Microsoft. These servers contained data folders of single users\u2019 Office files, music, and photos; files shared by workgroups; SharePoint team sites; software-deployment tools; and more.<\/p>\n

They discovered that chunking data resulted in significantly larger savings compared with deduplication of entire files.<\/p>\n

\n

Jin Li, Partner Research Manager of the Cloud Computing and Storage (CCS) group in Microsoft Research \u2013 Technologies<\/p><\/div>\n

\u201cIf people are working on a PowerPoint presentation, the file can be edited lots of times,\u201d Li says. \u201cThat generates a lot of different versions, and although the files are not the same, they have a large amount of common data.\u201d<\/p>\n<\/div>\n

They also found that the use of higher average chunk sizes, in the range of 70 to 80 kilobytes, together with chunk compression, could preserve the high deduplication savings typically associated with much smaller chunk sizes\u20144 to 8 kilobytes\u2014that previously have been used in the context of backup data deduplication. This has huge implications for a primary data server, because larger chunk sizes reduce chunk metadata and the number of chunks stored in the system, leading to increased efficiencies in many parts of the pipeline, from deduplicating data to serving data.<\/p>\n
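The effect of chunk size on the number of chunks, and so on the number of index entries to maintain, is simple arithmetic. The sketch below assumes a hypothetical 1-terabyte data set and decimal units; only the 8-kilobyte and 80-kilobyte average chunk sizes come from the article.

```python
def chunk_count(total_bytes, average_chunk_bytes):
    # Approximate number of chunks (and index entries) for a data set.
    return total_bytes // average_chunk_bytes


KB = 1000
TB = 10**12
small = chunk_count(1 * TB, 8 * KB)    # backup-style chunk size from the article
large = chunk_count(1 * TB, 80 * KB)   # upper end of the 70 to 80 KB range
print(small, large, small // large)  # 125000000 12500000 10
```

Under these assumptions, moving from 8-kilobyte to 80-kilobyte chunks cuts the number of chunks to track by a factor of 10.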

From Research to Production<\/h1>\n

The Microsoft Research team contributed in three key areas in designing and building a production-quality data deduplication feature in Windows Server 8: data chunking, indexing for detecting duplicate data, and data partitioning.<\/p>\n

For the first contribution, Sengupta and Li devised a new data-chunking algorithm, called regression chunking, that achieves a more uniform chunk-size distribution and increased deduplication savings.<\/p>\n

Their second contribution was the indexing system for detecting duplicate data. The researchers used ideas from their ChunkStash research project to deliver a highly efficient chunk-indexing module that makes light use of CPU, memory, and disk resources.<\/p>\n

\u201cIndexing is usually a big bottleneck in deduplication,\u201d Sengupta says. \u201cIt\u2019s a big challenge to build a scalable, high-performance index to identify the duplicate chunks without slowing down performance.\u201d<\/p>\n

In the third contribution, the Microsoft researchers worked with the product team to devise a data-partitioning technique that scales up as a data set grows. Partitioning data enables the deduplication process to work across a smaller set of files, reducing resource consumption.<\/p>\n

Through data analysis, they found that two partitioning strategies\u2014partitioning by file type or by file-system-directory hierarchy\u2014work well, with negligible to marginal loss in deduplication quality from partitioned processing. The system also includes an optional reconciliation process that can be used to deduplicate across partitions if significant additional space savings can be extracted.<\/p>\n

Finally, the Microsoft Research team worked with the Windows Server team to write production code for the new data deduplication feature in Windows Server 8. For a time, Sengupta and Li wore two hats: researching deduplication and writing code for use in Windows Server 8. They were joined by Microsoft Research colleagues Kirk Olynyk, a senior research software-design engineer, and Sanjeev Mehrotra<\/span><\/a>, a principal software architect, in shipping production code to the Windows Server 8 team.<\/p>\n

\u201cIt was a great collaboration between Microsoft Research and the product team,\u201d Li says. \u201cWe got very good feedback from the team, and some of the challenges they posed to us helped make the product better.\u201d<\/p>\n

Evidence of that is in how much Windows Server 8 will reduce the need for server storage. In one recent demo, a virtual-hard-drive (VHD) store holding 10 terabytes of VHD files consumed only 400 gigabytes of disk space. In this instance, as much as 96 percent of the data was detected as duplicate and then eliminated, because of the presence of identical or slightly different operating-system and application-software files in the VHDs.<\/p>\n
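The 96 percent figure follows directly from the two sizes quoted for the demo. The short check below assumes decimal units (1 terabyte = 1,000 gigabytes); the function name is hypothetical.

```python
def dedup_savings_percent(logical_bytes, stored_bytes):
    # Share of the logical data that no longer needs its own space on disk.
    return 100.0 * (1 - stored_bytes / logical_bytes)


TB = 10**12
GB = 10**9
savings = dedup_savings_percent(10 * TB, 400 * GB)
print(round(savings, 1))  # 96.0
```

With binary units (10 TiB against 400 GiB) the result would be slightly higher, about 96.1 percent, so the quoted figure holds either way.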

Windows Server 8 was shown in mid-September during Microsoft\u2019s BUILD conference, and a preview edition is available to developers, with final release expected in 2012. Its data deduplication capability has been widely praised. A \u201ckiller feature<\/span><\/a>,\u201d wrote ITWorld<\/i>. \u201cAn impressive feature<\/span><\/a> that should do wonders for storage efficiency and network utilization,\u201d added <\/span>Windows IT Pro<\/i>. <\/span>Ars Technica <\/i><\/span>added to the chorus<\/span><\/a>: \u201c<\/span>Microsoft demonstrations of the technology reduced the disk footprint of a [virtual desktop infrastructure] server by some 96 percent.\u201d<\/p>\n

The Windows Server 8 team also is happy with the new addition to their product.<\/p>\n

\u201cWe are very pleased with the end result of the collaboration with Microsoft Research,\u201d Pfenning says. \u201cIt\u2019s great to see research work coming through in a product that we expect to bring tremendous customer value.\u201d<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"

By Douglas Gantenbein The amount of data created and stored in the world doubles about every 18 months. Some of that data is distinctive\u2014but by no means all of it. A PowerPoint presentation might start bouncing around a work group, and within a week, many nearly identical copies could be scattered across an enterprise\u2019s desktops […]<\/p>\n","protected":false},"author":39507,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[194470,194475,194476],"tags":[206690,187216,206714,206696,206693,206711],"research-area":[13563,13552],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-245480","post","type-post","status-publish","format-standard","hentry","category-computer-architecture","category-database-data-analytics-platforms","category-devices-and-hardware","tag-chunkstash","tag-flash-memory","tag-storage-deduplication","tag-vhd","tag-virtual-hard-drive","tag-windows-server-8","msr-research-area-data-platform-analytics","msr-research-area-hardware-devices","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[474786],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","byline":"","formattedDate":"October 13, 2011","formattedExcerpt":"By Douglas Gantenbein The amount of data created and stored in the world doubles about every 18 months. Some of that data is distinctive\u2014but by no means all of it. 
A PowerPoint presentation might start bouncing around a work group, and within a week, many…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/245480"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=245480"}],"version-history":[{"count":12,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/245480\/revisions"}],"predecessor-version":[{"id":333560,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/245480\/revisions\/333560"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=245480"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=245480"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=245480"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=245480"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=245480"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=245480"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=245480"},{"taxonomy":"msr-post-option","embeddable":true,"href"
:"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=245480"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=245480"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=245480"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=245480"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}