{"id":625260,"date":"2019-12-09T08:16:44","date_gmt":"2019-12-09T16:16:44","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=625260"},"modified":"2019-12-09T08:16:44","modified_gmt":"2019-12-09T16:16:44","slug":"project-petridish-efficient-forward-neural-architecture-search","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/project-petridish-efficient-forward-neural-architecture-search\/","title":{"rendered":"Project Petridish: Efficient forward neural architecture search"},"content":{"rendered":"

\"Animation (opens in new tab)<\/span><\/a><\/p>\n

Having experience in deep learning doesn't hurt when it comes to the often mysterious, time- and cost-consuming process of hunting down an appropriate neural architecture. But truth be told, no one really knows what works best on a new dataset and task. Relying on well-known, top-performing networks provides few guarantees in a space where your dataset can look very different from anything those proven networks have encountered before. For example, a network that worked well on satellite images won't necessarily work well on the selfies and food photos making the rounds on social media. Even when a task dataset is similar to other common datasets and a bit of prior knowledge can be applied by starting with similar architectures, it's challenging to find architectures that meet not only accuracy requirements but also memory and latency constraints, among others, at serving time. These challenges can lead to a frustrating amount of trial and error.

In our paper "Efficient Forward Architecture Search," which is being presented at the 33rd Conference on Neural Information Processing Systems (NeurIPS), we introduce Petridish, a neural architecture search algorithm that opportunistically adds new layers determined to be beneficial to a parent model, resulting in a gallery of models capable of satisfying a variety of constraints for researchers and engineers to choose from. The team behind the ongoing work includes myself; Carnegie Mellon University PhD graduate Hanzhang Hu; John Langford, Partner Research Manager; Rich Caruana, Senior Principal Researcher; Shital Shah, Principal Research Software Engineer; Saurajit Mukherjee, Principal Engineering Manager; and Eric Horvitz, Technical Fellow and Director, Microsoft Research AI.

With Petridish, we seek to increase efficiency and speed in finding suitable neural architectures, making the process easier both for those in the field and for those without deep learning expertise who are interested in machine learning solutions.

Neural architecture search: forward search vs. backward search

The machine learning subfield of neural architecture search (NAS) aims to take the guesswork out of the process and let algorithms search for good architectures. While NAS experienced a resurgence in 2016 and has become a very popular topic (see the AutoML Freiburg-Hannover website for a continuously updated compilation of published papers), the earliest papers on the topic date back to NeurIPS 1988 and NeurIPS 1989. Most of the well-known NAS algorithms today, such as Efficient Neural Architecture Search (ENAS), Differentiable Architecture Search (DARTS), and ProxylessNAS, are examples of backward search. During backward search, smaller networks are sampled from a supergraph, a large architecture containing multiple subarchitectures. A limitation of backward search algorithms is that human domain knowledge is needed to create the supergraph in the first place. In contrast, Petridish is an example of forward search, a paradigm first introduced 30 years ago by Scott Fahlman and Christian Lebiere of Carnegie Mellon University in that 1989 NeurIPS paper. Forward search requires far less human knowledge when it comes to search space design.
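To make the contrast concrete, here is a toy Python sketch, not drawn from any specific NAS system: backward search can only select subarchitectures from inside a human-designed supergraph, while forward search grows a model from a small seed. The operation names and the `score` callable are hypothetical.

```python
import random

# Backward search: a human first designs a supergraph that fixes the candidate
# operations available at each position; the search then only picks a
# subarchitecture from inside it.
SUPERGRAPH = [
    ["conv3x3", "conv5x5", "max_pool"],   # hypothetical choices for layer 1
    ["conv3x3", "sep_conv3x3", "skip"],   # hypothetical choices for layer 2
    ["conv3x3", "conv5x5", "skip"],       # hypothetical choices for layer 3
]

def sample_subarchitecture(supergraph):
    """Every model backward search can return lives inside the supergraph."""
    return [random.choice(choices) for choices in supergraph]

def grow_forward(seed, candidate_ops, score, steps):
    """Forward search: start from a small seed and repeatedly append whichever
    candidate layer currently looks most useful; no supergraph is required.
    `score` is a hypothetical callable estimating the benefit of an addition."""
    model = list(seed)
    for _ in range(steps):
        model.append(max(candidate_ops, key=lambda op: score(model, op)))
    return model
```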

Petridish, which was also inspired by gradient boosting, produces as its search output a gallery of models to choose from, incorporates stop-forward and stop-gradient layers to more efficiently identify beneficial candidate layers for building that gallery, and uses asynchronous training.
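Here is a minimal PyTorch sketch of how stop-gradient and stop-forward connections can let a candidate layer train alongside a parent model without affecting it. The wiring, tensor shapes, and `candidate_layer` below are illustrative assumptions, not the project's actual implementation.

```python
import torch

def stop_gradient(x: torch.Tensor) -> torch.Tensor:
    # Forward: identity. Backward: blocks gradients from flowing back into
    # the parent model's weights through the candidate branch.
    return x.detach()

def stop_forward(x: torch.Tensor) -> torch.Tensor:
    # Forward: contributes zero, so the parent's predictions and loss are
    # unchanged. Backward: passes gradients through unchanged, so the
    # candidate still receives a learning signal from the task loss.
    return x - x.detach()

# Hypothetical wiring of one candidate layer onto a parent activation.
parent_activation = torch.randn(8, 64, requires_grad=True)  # stands in for a parent feature map
candidate_layer = torch.nn.Linear(64, 64)                   # candidate operation being evaluated

candidate_out = candidate_layer(stop_gradient(parent_activation))
merged = parent_activation + stop_forward(candidate_out)    # forward value equals parent_activation
```

In the forward pass, `merged` is numerically identical to `parent_activation`, so the parent behaves as if the candidate were not there; in the backward pass, the candidate receives the gradient of the loss at that point, which, in the spirit of gradient boosting, indicates how useful the candidate would be if it were actually added.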

\"Figure

(opens in new tab)<\/span><\/a> Figure 1: Petridish, a neural architecture search algorithm that grows a nominal seed model during search by opportunistically adding layers as needed, comprises three phases. Phase 0 starts with the small parent model. In Phase 1, a large number of candidates is considered for addition to the parent. If a candidate is promising, then it\u2019s added to the parent in Phase 2. Models in Phase 2 that lie near the boundary of the current estimate of the Pareto frontier (see Figure 2) are then added to the pool of parent models in Phase 0 so they have the chance to grow further.<\/p><\/div>\n
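Putting the phases together, the following Python pseudocode sketches the search loop summarized in Figure 1. The helper functions passed in (`pick_parent`, `train_with_candidates`, `select_top_candidates`, `grow_and_finetune`, `near_pareto_frontier`) are hypothetical placeholders rather than the actual Petridish API, and the real system runs these steps asynchronously across many parents and candidates.

```python
# A schematic sketch of the three-phase search loop, under the assumption
# that suitable implementations of the helper callables are supplied.
def petridish_search(seed_model, candidate_ops, iterations,
                     pick_parent, train_with_candidates,
                     select_top_candidates, grow_and_finetune,
                     near_pareto_frontier):
    parent_pool = [seed_model]      # Phase 0: pool of parent models
    gallery = [seed_model]          # search output: a gallery of models

    for _ in range(iterations):
        parent = pick_parent(parent_pool)

        # Phase 1: attach many candidate layers behind stop-gradient /
        # stop-forward connections and estimate how much each would help.
        candidate_scores = train_with_candidates(parent, candidate_ops)
        promising = select_top_candidates(candidate_scores)

        # Phase 2: add the promising candidates to the parent for real and
        # train the resulting child model.
        child = grow_and_finetune(parent, promising)
        gallery.append(child)

        # Children near the current accuracy-vs-cost Pareto frontier go back
        # into the parent pool (Phase 0) so they can keep growing.
        if near_pareto_frontier(child, gallery):
            parent_pool.append(child)

    return gallery
```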

Overview of Petridish

There are three main phases to Petridish: