{"id":572415,"date":"2019-03-13T09:58:54","date_gmt":"2019-03-13T16:58:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=572415"},"modified":"2019-03-13T09:58:54","modified_gmt":"2019-03-13T16:58:54","slug":"researchers-seek-to-simplify-the-complex-in-cloud-computing","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/researchers-seek-to-simplify-the-complex-in-cloud-computing\/","title":{"rendered":"Researchers seek to simplify the complex in cloud computing"},"content":{"rendered":"
<\/p>\n
From February 26\u201328, researchers gathered in Boston for the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI)<\/span><\/a>, one of the top conferences in the networking and systems field. Microsoft, a silver sponsor of the event, was represented by researchers serving on the program committee, as well as those presenting papers<\/a>, including two research teams using novel abstractions to empower and better serve cloud users.<\/p>\n \u201cBoth papers describe new ways to cope with the ever-increasing scale and complexity of what it means to do state-of-the-art computing in the cloud,\u201d said Thomas Moscibroda, Microsoft Partner Research Scientist, Azure Compute<\/a>.<\/p>\n With their respective work, the teams seek to simplify the underlying operations\u2014or what Microsoft Principal Scientist Konstantinos Karanasos<\/a>, co-author on the other paper, calls \u201cthe magic\u201d\u2014to deliver a more efficient and seamless user experience.<\/p>\n Field-programmable gate arrays (FPGAs) are becoming widely used in today\u2019s data centers<\/span><\/a>. These reprogrammable circuits combine the speed of hardware with some of the flexibility that makes software ideal for programming. 
But taking advantage of their full potential at cloud computing scale has been extremely challenging for several reasons, and researchers in the Networking Research Group at Microsoft Research Asia<\/a>, in collaboration with engineering leaders in Microsoft Azure<\/span><\/a>, are hoping to change that by addressing one such obstacle: the absence of an efficient, reliable, easy-to-use communications layer.<\/p>\n In their paper \u201cDirect Universal Access: Making Data Center Resources Available to FPGA,\u201d<\/a> they present a new communications architecture, one that Microsoft Researcher Peng Cheng<\/a> and his co-authors liken to the Internet Protocol or the operating system of a computer.<\/p>\n \u201cOur challenge has been, how do we provide a software-like IP layer inside this hardware-based platform,\u201d said Cheng, adding that the goal is a unified platform.<\/p>\n Currently, communication between pairs of FPGAs and other data center resources, such as CPUs, GPUs, memory, and storage, is complex, making programming large-scale heterogeneous applications impractical and, at times, nearly impossible.<\/p>\n There are several reasons for this, the researchers explain in their paper: First, the communications paradigms used for connecting resources that are local to a server and resources that are remote\u2014that is, located on a different server in the data center\u2014differ and use vastly different communications stacks. Second, resources are named in a way that is specific to the server they live on. And finally, current FPGA architecture is inefficient when it comes to multiplexing multiple diverse communications links to different local and remote resources.<\/p>\n Current FPGA communications architecture (top) compared to an ideal FPGA communications architecture. 
Deploying a common communications interface, a global unified naming scheme, and an underlying network service providing routing and multiplexing, the ideal architecture captured by DUA will allow designers and developers to build large-scale heterogeneous FPGA-based applications.<\/p><\/div>\n Direct Universal Access (DUA) makes communication among data center resources easier by providing a common communications interface, a global unified naming scheme, and an underlying network service that provides routing and resource multiplexing, creating a common resource pool that can be accessed uniformly and efficiently. The architecture is implemented as an overlay network\u2014a layer between the developer and the various data center communications stacks and resources\u2014and supports systems and communications protocols currently in place. This is critical because it means that no manufacturing overhaul of existing devices is required; DUA can be deployed as is on existing frameworks.<\/p>\n \u201cDUA connects all resources in a data center regardless of location and type of resource,\u201d explained Microsoft Associate Researcher Ran Shu<\/a>. 
\u201cAll these resources are in a unified naming space and unified IP-based networking scheme, so each application can access different resources with the same code, so it is easy for developers to port their code, and it greatly reduces application development time.\u201d<\/p>\n The researchers hope DUA will allow developers to build large-scale, diverse, and novel FPGA-based applications that were previously out of reach.<\/p>\n \u201cThe proliferation of FPGAs in the cloud is a reality and offers gigantic promise because if we can make it easy for developers to use and connect the different types of data center resources in an efficient way, they can build novel types of applications that are inconceivable otherwise,\u201d said Moscibroda.<\/p>\n To demonstrate this potential, the research team has built two large-scale FPGA applications\u2014regular expression matching for packet inspection and deep crossing, a machine learning algorithm\u2014on top of DUA.<\/p>\n The team is in the process of making DUA open source, and it will be available on GitHub soon.<\/p>\n Cloud services for storing, analyzing, and managing big data can process thousands of jobs for thousands of users in a single day. No small order. And it\u2019s the responsibility of the service\u2019s resource manager to make sure these jobs go off without a hitch. The resource management infrastructure determines where a particular job and its tasks should run and what share of resources each user should get to accomplish that job. For smaller-scale services, the challenges of task placement and share determination can generally be tackled together. 
But Microsoft is no small-scale operation.<\/p>\n Serving 10,000 users and running half a million jobs daily across hundreds of thousands of machines, Microsoft was in need of a new approach, so its researchers set out to deliver a resource manager capable of offering the scalability and utilization of its existing infrastructure while also meeting several additional key requirements: It needed to be able to handle not only a high volume of work but also a diverse workload, including both internal Microsoft applications and open-source frameworks; it needed to allocate resources in a more principled and efficient way; and it needed to make the testing of new features easier.<\/p>\n The result of their work is Hydra, the main resource manager behind the big-data analytics clusters of Microsoft today. The infrastructure has actually been in place for a few years now, with the team migrating 99 percent of users over in real time while keeping services running. \u201cThis is what we call changing airplane engines mid-flight,\u201d Karanasos said with a laugh.<\/p>\nDirect Universal Access: A communications architecture<\/h3>\n
<\/p>\n
Hydra: A resource management framework<\/h3>\n