Dryad: Programming the Datacenter

By Rob Knies, Managing Editor, Microsoft Research

Concurrent programming is demanding. While one part of a program is modifying data, the other parts must be prevented from doing likewise. Manually organizing such tasks is challenging even for the most adept experts. People have been trying for decades to make it easier.

Concurrent programming is in demand. More programs are communicating with Web services. Fundamental limitations in physics are dictating a move to multicore chips that enable many processes to run in parallel. There’s no turning back.

Enter Dryad.

“The Dryad project,” says Michael Isard, senior researcher for Microsoft Research Silicon Valley, “is trying to make it easier to write programs that can run over very large collections of computers, both efficiently and reliably.

“We’re trying to take a large and useful class of programs and still let the programmer think about it sequentially but have the system automatically parallelize it. Concurrent-programming researchers have always looked for ways in which the concurrency can be automatically found by the system. There has always been an emphasis on approaches that would let the programmer think sequentially and have the system find parallelism.”

In Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, a paper written by Isard and Microsoft Research Silicon Valley colleagues Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly, the project’s value is stated as follows:

“The Dryad execution engine handles all the difficult problems of creating a large, distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between [computation] vertices.”

Isard elaborates.

“The goal,” he says, “is to abstract away a lot of the practical details of the cluster—the data placement, the network hierarchy, the strategy for fault tolerance—so that the programmer can work at a higher level and concentrate more on the structure of the computation and rely on Dryad to do the scheduling, the fault tolerance, and those kinds of things.”

In the process, programmers are freed to contemplate the more difficult, abstract issues involved in making distributed systems work over large-scale computer clusters—without having to concern themselves with low-level, though critical, details.

“The programmer,” Isard says, “should be able to describe the computation at quite a high level and write some declarative or sequential code that looks like a SQL query or a single-threaded C# program.

“It’s the job of Dryad to look at the program and the actual physical resources—what computers there are, what network they’re connected with, how the data has been split up—and figure out how to break that program up so it can be run in parallel and then send it out to this cluster and run the necessary pieces and send the data between computers when necessary.”

Scale, you see, is a significant component of the value Dryad brings to bear. It is designed to scale effectively, from a single, powerful multicore computer to small clusters of computers to datacenters with thousands of computers.

“Suppose you have some very large data set,” Isard says, “maybe a few terabytes of data stored on a cluster of a few thousand computers—and that the data set has been split up and partitioned, and the partitions have been spread over thousands of computers. Dryad enables a user to analyze that data.”
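To make the idea of a partitioned data set concrete, here is a minimal single-machine sketch in Python. It is not Dryad code: the function names (`count_terms`, `analyze`), the sample data, and the use of a thread pool in place of a cluster are all assumptions made for illustration. The programmer writes one sequential function for a single partition, and a small driver applies it to every partition in parallel and merges the results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_terms(partition):
    """Sequential building block: count the terms in one partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def analyze(partitions, max_workers=4):
    """Apply the sequential block to every partition in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(pool.map(count_terms, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical data: three partitions of a tiny "search log".
partitions = [
    ["dryad linq", "dryad cluster"],
    ["cluster scheduling", "dryad"],
    ["linq query"],
]
totals = analyze(partitions)
print(totals.most_common(3))
```

In a real cluster the partitions would live on different computers and the driver's job would fall to the execution engine; the point of the sketch is only that `count_terms` itself contains no concurrency logic.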

Of course, having a large collection of data distributed among a large number of computers introduces a higher risk of something going wrong.

“If there are failures,” Isard says, “if there are transient network failures or some of the computers crash, Dryad hides all that and makes sure that the computation finishes anyway.”
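One common way to hide transient failures from the programmer is simply to run a failed piece of work again. The Python sketch below illustrates that general idea under stated assumptions; `TransientFailure`, `run_with_retries`, and `make_flaky_sum` are hypothetical names, and the sketch does not claim to reproduce Dryad's actual recovery strategy.

```python
class TransientFailure(Exception):
    """Stands in for a crashed computer or a dropped network connection."""


def run_with_retries(task, partition, max_attempts=3):
    """Re-run a task until it succeeds or the attempt budget is exhausted.

    Returns the task's result together with the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition), attempt
        except TransientFailure:
            if attempt == max_attempts:
                raise  # give up: the failure was not transient after all


def make_flaky_sum(failures_before_success):
    """Build a task that fails a fixed number of times, then sums its input."""
    remaining = {"count": failures_before_success}

    def flaky_sum(partition):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientFailure("simulated crash")
        return sum(partition)

    return flaky_sum


# The task fails twice and succeeds on the third attempt.
result, attempts = run_with_retries(make_flaky_sum(2), [1, 2, 3])
print(result, attempts)
```

The caller sees only the final result. Re-running is safe here because the task is a pure function of its input partition, which is one reason deterministic, side-effect-free building blocks suit this style of system.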

As mentioned, work to enable concurrent programming has been conducted for years, and Dryad has its antecedents. Isard himself has worked on some of them. But none has delivered the flexible programming model and the ability to achieve rich relational queries in addition to providing fault tolerance and data-directed design. That combination—scalable and better able to handle complex programs—is what sets Dryad apart.

Isard has been investigating large-scale distributed systems for about five years now, and his Dryad work began in the spring of 2005. But things change quickly in technological circles, with their steep learning curves, and the project has morphed to reflect that dynamic environment.

“When we started out,” he recalls, “we didn’t think very carefully about the programming model that the programmer would actually see. We thought more about the classes of computation that we wanted Dryad to support, and one of the things we learned is that what we built in the first version was more middleware that most programmers don’t want to program to.

“Subsequently, we’ve spent more time putting a layer on top of that, but at a higher level of abstraction that programmers see. The original Dryad system is mostly targeted by higher-level programming languages. So it’s quite flexible.”

Another lesson had to do with the evolution of the hardware itself.

“There’s a very long history of parallel and distributed databases and super-computer research that tended to ignore some of the problems like fault tolerance,” Isard says, “because the assumption was that the hardware would be built to be very reliable. One thing that’s changed is that the most cost-effective way of building large clusters is now to use cheaper and less reliable hardware, so fault tolerance is now essential.”

Increasingly specialized demands on computer systems also have created a need for specialized kinds of computing capabilities.

“There are high-performance computing systems and grid-computing systems which are similar,” he states, “but are optimized for different kinds of workloads. The high-performance computing systems tend to be optimized more for things like finite element simulations—bomb simulations, weather forecasting, and that kind of thing—where they’re more compute-intensive and less data-intensive. Dryad is optimized more for very large data sets, such as mining logs from search. There are also systems that scale well on this kind of application and offer fault-tolerance guarantees similar to those of Dryad, but they have a much more restricted computational model that makes it hard to get good performance on complex problems.”

Still, there was that programming model to address.

“Dryad takes away the programmer’s need to understand low-level concurrency, but it still relies on the programmer to think at some level of abstraction about how the job could be divided up,” Isard says. “Programmers don’t have to worry about low-level synchronization primitives, but they do still have to understand something about the structure of what needs to be done and what depends on what else.

“That isn’t necessarily the traditional way that people think about concurrency, but it’s not like you can just sit down and write any old sequential C++ program and have it magically turn into a distributed program.”
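The phrase "what depends on what else" can be pictured as a dependency graph. The Python sketch below is an illustration rather than Dryad's interface: the stage names and the `schedule_in_waves` helper are hypothetical. The programmer states only which stages depend on which, and the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) works out which stages could run at the same time.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages whose output it needs.
dependencies = {
    "parse_a": set(),
    "parse_b": set(),
    "join": {"parse_a", "parse_b"},
    "aggregate": {"join"},
}


def schedule_in_waves(deps):
    """Group stages into waves; stages in the same wave are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unlocking successors
    return waves


waves = schedule_in_waves(dependencies)
print(waves)  # [['parse_a', 'parse_b'], ['join'], ['aggregate']]
```

No locks or synchronization primitives appear anywhere in the description. The two parse stages land in the same wave because neither depends on the other, which is the kind of structural information the programmer is still expected to supply.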

Yes, but then there’s LINQ: Language-Integrated Query extensions to the C# programming language that enable developers to write and debug applications in a SQL-like query language, with the entire .NET library at their disposal and within the familiar Visual Studio environment.

“We didn’t know about LINQ,” Isard says, “when we started writing the Dryad system.”

Yuan Yu, though, got up to speed in a hurry, taking the lead of the DryadLINQ project, which combines the complementary features of the two technologies. Úlfar Erlingsson, like Yu a researcher for Microsoft Research Silicon Valley, also played an instrumental role with DryadLINQ.

“The conclusion that we’ve come to and that Yu has really pushed forward,” Isard says, “is that LINQ is extremely well-suited to Dryad, and we think that LINQ is the best programming model that we know of for expressing this kind of program.”

The results have been sufficiently impressive that Dryad technology (to which Mark Manasse, a principal researcher at Microsoft Research Silicon Valley, also has made integral contributions) has been implemented by both Microsoft’s Live Search and adCenter teams for various data-mining tasks. Other uses could be forthcoming.

“Dryad is a fairly applied piece of research,” Isard explains. “It was built partly to help product groups with the short-term need they had to analyze data and partly in the hopes of being an enabling platform that would allow us to do research into other aspects of distributed computing.

“There’s certainly research to be done. There are many research aspects we’re looking at now to improve Dryad’s performance, but the basic task that it has to perform is generally useful. If you want to run single programs on large clusters, then you’ll need something playing the Dryad role for the foreseeable future.”

That, Isard says, makes all the hard work on Dryad worth every minute.

“One of the main goals,” he says, “was to make this a solid base on which we could build other things, and I think it’s been very successful at that. That you could write complicated programs that run reliably on thousands of computers … it’s not just a research project anymore.”

Reliable, productive—count Dryad as a large-scale success.
