Catapult: Moving Beyond CPUs in the Cloud

Posted by Rob Knies

Operating a datacenter at web scale means managing many conflicting requirements. Delivering computation at high volume and speed is a given, but because of the demands such a facility must meet, a datacenter also needs flexibility. It must also be efficient in its use of power, keeping costs as low as possible.

Balancing these often conflicting goals is a challenge. It leads datacenter providers to seek constant performance and efficiency improvements and to weigh general-purpose hardware against task-tuned alternatives, particularly in an era in which, as some suggest, Moore's Law is nearing its end.

Microsoft researchers and colleagues from Bing have been collaborating with others from industry and academia to examine datacenter hardware alternatives, and their work, a project known as Catapult, was presented in Minneapolis on June 16 during the 41st International Symposium on Computer Architecture (ISCA).

Their paper, titled A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, describes an effort to combine programmable hardware and software that uses field-programmable gate arrays (FPGAs) to deliver performance improvements of as much as 95 percent.

The significance of this work, says Peter Lee, head of Microsoft Research, could be dramatic.

“Going into production with this new technology will be a watershed moment for Bing search,” he says. “For the first time ever, the quality of Bing’s page ranking will be driven not only by great algorithms but also by hardware—incredibly advanced hardware that can be made more highly specialized than anything ever seen before at datacenter scale.”

Microsoft researcher Doug Burger, one of 23 co-authors of the ISCA paper, explains the motivation behind the project.

“We are addressing two problems,” he says. “First, how do we keep accelerating services and reducing costs in the cloud as the performance gains from CPUs continue to flatten?

“Second, we wanted to enable Bing to run computations at a scale that was not possible in software alone, for much better results at lower cost.”

Derek Chiou, a Bing hardware architect, discusses the benefits of the collaboration.

“The partnership between Doug and his team at Microsoft Research and Bing has been fantastic and has resulted in significant results that will have real impact on Bing,” Chiou says. “The factor of two throughput improvement demonstrated in the pilot means we can do the same amount of work with half the number of servers or double the amount of work with the same number of servers—or some mix of the two.

“Those kinds of numbers are especially significant at the scale of a datacenter. The potential benefits go beyond simple dollars. To give some examples, Bing’s ranking could be further enhanced to provide an even better customer experience, power could be saved, and the size of the datacenters could be reduced. The strength of the pilot results has led Bing to deploy this technology in one datacenter for customers, starting in early 2015.”
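
To make Chiou's arithmetic concrete, here is a back-of-the-envelope sketch in Python. The fleet size is a hypothetical number chosen for illustration, not one taken from the paper or this post.

```python
# Illustrative only: the fleet size is hypothetical; the factor-of-two
# throughput gain is the figure Chiou cites from the pilot.
servers = 1000           # hypothetical baseline fleet
throughput_gain = 2.0    # per-server throughput improvement

# Same workload, fewer machines:
servers_for_same_work = servers / throughput_gain   # 500 servers

# Same machines, more workload:
workload_with_same_fleet = throughput_gain          # 2x the work

print(servers_for_same_work, workload_with_same_fleet)
```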

As the ISCA paper notes, FPGAs have become powerful computing devices in recent years, making them particularly suited for use as fine-grained accelerators.

“We designed a platform that permits the software in the cloud, which is inherently programmable, to partner with programmable hardware,” Burger says. “You can move functions into custom hardware, but rather than burning them into fixed chips [application-specific integrated circuits], we map them to Altera FPGAs, which can run hardware designs but can be changed by reconfiguring the FPGA.

“We’ve demonstrated a ‘programmable hardware’ enhanced cloud, running smoothly and reliably at large scale.”
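
To illustrate that software/hardware partnership in ordinary code, here is a minimal conceptual sketch. Everything in it (ReconfigurableAccelerator, load, score) is hypothetical; this is not Catapult's actual interface, only an analogy for a host-side dispatch that prefers a loaded FPGA design and falls back to software.

```python
# Conceptual sketch only: all names here are hypothetical, not
# Catapult's real API. The point is the dispatch pattern: use the
# reconfigurable hardware path when a design is loaded, otherwise
# fall back to the software implementation.

def score_in_software(doc_features):
    # Baseline software path: a stand-in for a ranking computation.
    return sum(w * x for w, x in doc_features)

class ReconfigurableAccelerator:
    """Hypothetical stand-in for an FPGA card the host can reprogram."""

    def __init__(self):
        self.bitstream = None

    def load(self, bitstream):
        # Reprogramming swaps in a new hardware design, unlike an
        # ASIC, whose function is fixed at fabrication.
        self.bitstream = bitstream

    def score(self, doc_features):
        # A real card would compute this in hardware; we emulate it.
        return sum(w * x for w, x in doc_features)

def score(doc_features, accel=None):
    # Prefer the hardware path when a design is loaded; otherwise
    # fall back to software, which keeps the fabric flexible.
    if accel is not None and accel.bitstream is not None:
        return accel.score(doc_features)
    return score_in_software(doc_features)

# Usage: load a placeholder "design," then dispatch to the hardware path.
accel = ReconfigurableAccelerator()
accel.load("ranking_stage_v1")  # placeholder for a real bitstream
print(score([(0.5, 2.0), (1.5, 1.0)], accel))  # 2.5
```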

In the evaluation deployment outlined in the paper, the reconfigurable fabric (interconnected nodes linked by high-bandwidth connections) was tested on a collection of 1,632 servers to measure its efficacy in accelerating the workload of a production web-search service. The results were impressive: a 95 percent improvement in throughput at latency comparable to a software-only solution. With the increases in power consumption and total per-server cost held below 30 percent, the net result is substantial savings and efficiency.
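
One rough way to read those headline numbers is as throughput per dollar. The sketch below is an illustrative reading of the figures quoted in this post, not a calculation taken from the paper.

```python
# Illustrative reading of the post's figures, not from the paper.
throughput_gain = 1.95   # 95 percent more throughput per server
cost_increase = 1.30     # per-server cost up by less than 30 percent

perf_per_dollar = throughput_gain / cost_increase
print(f"~{perf_per_dollar:.1f}x throughput per dollar")  # ~1.5x
```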

The results demonstrated that the system could run stably for long periods, with every stage in the pipeline exceeding the overall throughput goal. In addition, a failure-handling service quickly reconfigures the fabric after errors or machine failures.
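
The post does not detail how that failure-handling service works. As a minimal sketch of the general idea, assuming a simple heartbeat model, the loop below drops unhealthy nodes and remaps the pipeline across the survivors; all names here are hypothetical.

```python
# Minimal sketch, assuming a heartbeat model; FabricNode,
# remap_pipeline, and monitor are hypothetical names, and the
# paper's actual service is more involved.
import time

class FabricNode:
    """Hypothetical node in the reconfigurable fabric."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.healthy = True

    def heartbeat_ok(self):
        # A real service would poll the node over the network.
        return self.healthy

def remap_pipeline(nodes):
    # Rebuild the pipeline mapping using only healthy nodes.
    healthy = [n for n in nodes if n.heartbeat_ok()]
    print(f"remapping pipeline across {len(healthy)} healthy nodes")
    return healthy

def monitor(nodes, interval_s=1.0, rounds=3):
    # Bounded loop for the sketch; a real monitor would run forever.
    for _ in range(rounds):
        if any(not n.heartbeat_ok() for n in nodes):
            nodes = remap_pipeline(nodes)
        time.sleep(interval_s)
    return nodes

# Usage: simulate one machine failure and let the monitor remap.
nodes = [FabricNode(i) for i in range(4)]
nodes[2].healthy = False
nodes = monitor(nodes, interval_s=0.0, rounds=1)  # 3 healthy nodes
```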

The ISCA paper concludes by underscoring the belief that distributed reconfigurable fabrics will play a critical role as gains in server performance level off. Such techniques could become indispensable to datacenter managers balancing their conflicting goals.

“This portends a future where systems are specialized dynamically by compiling a good chunk of demanding workloads into hardware,” Burger says. “I would imagine that a decade hence, it will be common to compile applications into a mix of programmable hardware and programmable software.

“This is a radical shift that will offer continued performance improvements past the end of Moore’s Law as we move more and more of our applications and services into hardware.”
