{"id":307355,"date":"2008-10-27T08:29:46","date_gmt":"2008-10-27T15:29:46","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=307355"},"modified":"2016-10-17T20:38:58","modified_gmt":"2016-10-18T03:38:58","slug":"dryad-programming-datacenter","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/dryad-programming-datacenter\/","title":{"rendered":"Dryad: Programming the Datacenter"},"content":{"rendered":"

By Rob Knies, Managing Editor, Microsoft Research<\/em><\/p>\n

Concurrent programming is demanding. While one part of a program is modifying data, the other parts must be prevented from doing likewise. Manually orchestrating such tasks is challenging even for the most adept experts. People have been trying for decades to make it easier.<\/p>\n

Concurrent programming is in demand. More programs are communicating with Web services. Fundamental limitations in physics are dictating a move to multicore chips that enable many processes to run in parallel. There\u2019s no turning back.<\/p>\n

Enter Dryad (opens in new tab)<\/span><\/a>.<\/p>\n

\u201cThe Dryad project,\u201d says Michael Isard, senior researcher for Microsoft Research Silicon Valley, \u201cis trying to make it easier to write programs that can run over very large collections of computers, both efficiently and reliably.<\/p>\n

\u201cWe\u2019re trying to take a large and useful class of programs, still let the programmer think about them sequentially, and have the system automatically parallelize them. Concurrent-programming researchers have long looked for approaches that let the programmer think sequentially while the system finds the parallelism.\u201d<\/p>\n

In Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks<\/em> (opens in new tab)<\/span><\/a>, a paper by Isard and Microsoft Research Silicon Valley colleagues Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly, the project\u2019s value is stated thus:<\/p>\n

\u201cThe Dryad execution engine handles all the difficult problems of creating a large, distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between [computation] vertices.\u201d<\/p>\n

Isard elaborates.<\/p>\n

\u201cThe goal,\u201d he says, \u201cis to abstract away a lot of the practical details of the cluster\u2014the data placement, the network hierarchy, the strategy for fault tolerance\u2014so that the programmer can work at a higher level and concentrate more on the structure of the computation and rely on Dryad to do the scheduling, the fault tolerance, and those kinds of things.\u201d<\/p>\n

In the process, programmers are freed to contemplate the more difficult, abstract issues involved in making distributed systems work over large-scale computer clusters\u2014without having to concern themselves with low-level, though critical, details.<\/p>\n

\u201cThe programmer,\u201d Isard says, \u201cshould be able to describe the computation at quite a high level and write some declarative or sequential code that looks like a SQL query or a single-threaded C# program.<\/p>\n

\u201cIt\u2019s the job of Dryad to look at the program and the actual physical resources\u2014what computers there are, what network they\u2019re connected with, how the data has been split up\u2014and figure out how to break that program up so it can be run in parallel and then send it out to this cluster and run the necessary pieces and send the data between computers when necessary.\u201d<\/p>\n
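To make that idea concrete, here is a minimal sketch, assuming nothing about Dryad\u2019s actual API: a job can be modeled as a dataflow graph whose vertices are sequential programs and whose edges are data channels, and the system\u2019s job is to place vertices on machines and move data along the edges. All names below are invented for illustration.

```csharp
// Illustrative sketch (not the Dryad API): a job as a dataflow graph.
// Vertices stand for sequential computations; edges stand for the
// channels Dryad would use to ship data between computers.
using System;
using System.Collections.Generic;

public class JobGraphSketch
{
    public record Vertex(string Name);
    public record Edge(Vertex From, Vertex To);

    public List<Vertex> Vertices { get; } = new();
    public List<Edge> Edges { get; } = new();

    public Vertex Add(string name)
    {
        var v = new Vertex(name);
        Vertices.Add(v);
        return v;
    }

    public void Connect(Vertex from, Vertex to) =>
        Edges.Add(new Edge(from, to));

    // A two-stage plan: read two data partitions, then merge the results.
    public static JobGraphSketch Build()
    {
        var g = new JobGraphSketch();
        var r0 = g.Add("read-partition-0");
        var r1 = g.Add("read-partition-1");
        var merge = g.Add("merge");
        g.Connect(r0, merge);
        g.Connect(r1, merge);
        return g;
    }

    public static void Main()
    {
        var g = Build();
        Console.WriteLine($"{g.Vertices.Count} vertices, {g.Edges.Count} edges");
    }
}
```

In the real system, the scheduler would assign each vertex to a machine near its input partition and re-run any vertex whose machine fails; this sketch only captures the graph shape the programmer reasons about.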

Scale, you see, is a significant part of the value Dryad brings. It is designed to scale effectively, from a single, powerful multicore computer to small clusters of computers to datacenters with thousands of computers.<\/p>\n

\u201cSuppose you have some very large data set,\u201d Isard says, \u201cmaybe a few terabytes of data stored on a cluster of a few thousand computers\u2014and that the data set has been split up and partitioned, and the partitions have been spread over thousands of computers. Dryad enables a user to analyze that data.\u201d<\/p>\n

Of course, having a large collection of data distributed among a large number of computers introduces a higher risk of something going wrong.<\/p>\n

\u201cIf there are failures,\u201d Isard says, \u201cif there are transient network failures or some of the computers crash, Dryad hides all that and makes sure that the computation finishes anyway.\u201d<\/p>\n

As mentioned, work to enable concurrent programming has been conducted for years, and Dryad has its antecedents; Isard himself has worked on some of them. But none has combined a flexible programming model and support for rich relational queries with fault tolerance and data-directed design. That combination\u2014scalable and better able to handle complex programs\u2014is what sets Dryad apart.<\/p>\n

Isard has been investigating large-scale distributed systems for about five years now, and his Dryad work began in the spring of 2005. But things change quickly in technological circles, with their steep learning curves, and the project has morphed to reflect that dynamic environment.<\/p>\n

\u201cWhen we started out,\u201d he recalls, \u201cwe didn\u2019t think very carefully about the programming model that the programmer would actually see. We thought more about the classes of computation that we wanted Dryad to support, and one of the things we learned is that what we built in the first version was more middleware that most programmers don\u2019t want to program to.<\/p>\n

\u201cSubsequently, we\u2019ve spent more time putting a layer on top of that, at a higher level of abstraction, and that layer is what programmers see. The original Dryad system is mostly targeted by higher-level programming languages. So it\u2019s quite flexible.\u201d<\/p>\n

Another lesson had to do with the evolution of the hardware itself.<\/p>\n

\u201cThere\u2019s a very long history of parallel and distributed databases and supercomputer research that tended to ignore some of the problems like fault tolerance,\u201d Isard says, \u201cbecause the assumption was that the hardware would be built to be very reliable. One thing that\u2019s changed is that the most cost-effective way of building large clusters is now to use cheaper, less reliable hardware, so fault tolerance is now essential.\u201d<\/p>\n

Increasingly specialized needs in computer systems also have created a need for correspondingly specialized kinds of computing capabilities.<\/p>\n

\u201cThere are high-performance computing systems and grid-computing systems which are similar,\u201d he states, \u201cbut are optimized for different kinds of workloads. The high-performance computing systems tend to be optimized more for things like finite element simulations\u2014bomb simulations, weather forecasting, and that kind of thing\u2014where they\u2019re more compute-intensive and less data-intensive. Dryad is optimized more for very large data sets, such as mining logs from search. There are also systems that scale well on this kind of application and offer fault-tolerance guarantees similar to those of Dryad, but they have a much more restricted computational model that makes it hard to get good performance on complex problems.\u201d<\/p>\n

Still, there was that programming model to address.<\/p>\n

\u201cDryad takes away the programmer\u2019s need to understand low-level concurrency, but it still relies on the programmer to think at some level of abstraction about how the job could be divided up,\u201d Isard says. \u201cProgrammers don\u2019t have to worry about low-level synchronization primitives, but they do still have to understand something about the structure of what needs to be done and what depends on what else.<\/p>\n

\u201cThat isn\u2019t necessarily the traditional way that people think about concurrency, but it\u2019s not like you can just sit down and write any old sequential C++ program and have it magically turn into a distributed program.\u201d<\/p>\n

Yes, but then there\u2019s LINQ (opens in new tab)<\/span><\/a>\u2014Language-Integrated Query extensions to the C# (opens in new tab)<\/span><\/a> programming language that enable developers to write and debug applications in a SQL-like query language, with the entire .NET (opens in new tab)<\/span><\/a> library at their disposal and within the familiar Visual Studio (opens in new tab)<\/span><\/a> environment.<\/p>\n
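To give a flavor of that programming style, here is a hedged sketch using ordinary LINQ-to-Objects, not DryadLINQ itself: the record fields and values are invented, but the query shape\u2014a declarative group-and-aggregate, much like SQL\u2014is the kind of sequential-looking code the system can parallelize over a partitioned data set.

```csharp
// Sketch of the LINQ programming style discussed above. In DryadLINQ
// the source would be a partitioned table on a cluster; here a small
// in-memory array of made-up search-log records stands in.
using System;
using System.Collections.Generic;
using System.Linq;

public class QuerySketch
{
    // Stand-in records: (query term, hit count). Values are illustrative.
    static readonly (string Term, int Hits)[] Log =
    {
        ("dryad", 3),
        ("linq",  4),
        ("dryad", 2),
    };

    // Declarative, SQL-like query: group by term, sum hits, order by
    // total descending. The programmer writes this sequentially; a
    // DryadLINQ-style system would turn it into a parallel plan.
    public static IEnumerable<(string Term, int Total)> Totals() =>
        Log.GroupBy(r => r.Term)
           .Select(g => (Term: g.Key, Total: g.Sum(r => r.Hits)))
           .OrderByDescending(t => t.Total);

    public static void Main()
    {
        foreach (var t in Totals())
            Console.WriteLine($"{t.Term}: {t.Total}");
    }
}
```

The point is that nothing in the query mentions machines, partitions, or failures; those concerns are left to the runtime, which is exactly the division of labor Isard describes.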

\u201cWe didn\u2019t know about LINQ,\u201d Isard says, \u201cwhen we started writing the Dryad system.\u201d<\/p>\n

Yuan Yu, though, got up to speed in a hurry, taking the lead of the DryadLINQ (opens in new tab)<\/span><\/a> project, which combines the complementary features of the two technologies. \u00dalfar Erlingsson, like Yu a researcher for Microsoft Research Silicon Valley, also played an instrumental role with DryadLINQ.<\/p>\n

\u201cThe conclusion that we\u2019ve come to and that Yu has really pushed forward,\u201d Isard says, \u201cis that LINQ is extremely well-suited to Dryad, and we think that LINQ is the best programming model that we know of for expressing this kind of program.\u201d<\/p>\n

The results have been sufficiently impressive that Dryad technology\u2014to which Mark Manasse, a principal researcher at Microsoft Research Silicon Valley, also has made integral contributions\u2014has been implemented by both Microsoft\u2019s Live Search (opens in new tab)<\/span><\/a> and adCenter (opens in new tab)<\/span><\/a> teams for various data-mining tasks. Other uses could be forthcoming.<\/p>\n

\u201cDryad is a fairly applied piece of research,\u201d Isard explains. \u201cIt was built partly to help product groups with the short-term need they had to analyze data and partly in the hopes of being an enabling platform that would allow us to do research into other aspects of distributed computing.<\/p>\n

\u201cThere\u2019s certainly research to be done. There are many research aspects we\u2019re looking at now to improve Dryad\u2019s performance, but the basic task that it has to perform is generally useful. If you want to run single programs on large clusters, then you\u2019ll need something playing the Dryad role for the foreseeable future.\u201d<\/p>\n

That, Isard says, makes all the hard work on Dryad worth every minute.<\/p>\n

\u201cOne of the main goals,\u201d he says, \u201cwas to make this a solid base on which we could build other things, and I think it\u2019s been very successful at that. That you could write complicated programs that run reliably on thousands of computers \u2026 it\u2019s not just a research project anymore.\u201d<\/p>\n

Reliable, productive\u2014count Dryad as a large-scale success.<\/p>\n","protected":false},"excerpt":{"rendered":"

By Rob Knies, Managing Editor, Microsoft Research Concurrent programming is demanding. While part of a program is modifying data, the other parts must be prevented from doing likewise. Manually organizing such tasks is challenging for the most adept experts. People have been trying for decades to make it easier. Concurrent programming is in demand. More […]<\/p>\n","protected":false},"author":39507,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[194488],"tags":[215216,215192,215201,186461,187126,186553,215213,215207,215210,215195,215198,215204,186664],"research-area":[13560],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-307355","post","type-post","status-publish","format-standard","hentry","category-program-languages-and-software-engineering","tag-c-programming-language","tag-concurrent-programming","tag-data-placement","tag-distributed-systems","tag-dryad","tag-fault-tolerance","tag-language-integrated-query","tag-large-scale-computer-clusters","tag-linq-extensions","tag-modifying-data","tag-multicore-chips","tag-network-hierarchy","tag-web-services","msr-research-area-programming-languages-software-engineering","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[169537,169536],"related-events":[],"related-researchers":[],"msr_type":"Post","byline":"","formattedDate":"October 27, 2008","formattedExcerpt":"By Rob Knies, Managing Editor, Microsoft Research 
Concurrent programming is demanding. While part of a program is modifying data, the other parts must be prevented from doing likewise. Manually organizing such tasks is challenging for the most adept experts. People have been trying for decades…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/307355"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=307355"}],"version-history":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/307355\/revisions"}],"predecessor-version":[{"id":307403,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/307355\/revisions\/307403"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=307355"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=307355"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=307355"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=307355"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=307355"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=307355"},{"taxonomy":"msr-locale","embeddable":tru
e,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=307355"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=307355"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=307355"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=307355"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=307355"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}