{"id":1233,"date":"2023-04-17T14:00:07","date_gmt":"2023-04-17T14:00:07","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/startups\/blog\/?p=1233"},"modified":"2025-06-24T20:52:14","modified_gmt":"2025-06-25T04:52:14","slug":"discovering-holistic-infrastructure-strategies-for-compute-intensive-startups","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/startups\/blog\/discovering-holistic-infrastructure-strategies-for-compute-intensive-startups\/","title":{"rendered":"Discovering holistic infrastructure strategies for compute-intensive startups"},"content":{"rendered":"\n
This is part two of a three-part AI-Core Insights series. Click here for part one<\/a>, \u201cFoundation models: To open-source or not to open-source?\u201d<\/em><\/p>\n\n\n\n In the first part of this three-part blog series, we discussed a practical approach to foundation models (FM), both open and closed source. From a deployment perspective, the proof of the pudding is which foundation model best solves the intended use case.<\/p>\n\n\n\n Let us now simplify the seemingly infinite infrastructure needed to build a product on compute-intensive foundation models. There are two heavily discussed problem statements<\/a>:<\/p>\n\n\n\n Put simply, return and investment should go hand in hand. In the beginning, however, this can require a huge sunk cost. So, what do you focus on?<\/p>\n\n\n\n If you have a fine-tuning pipeline, it looks something like this:<\/p>\n\n\n\n Note: if you do not have a fine-tuning pipeline, the pre-processing elements drop out, but you still need to think about serving infrastructure. <\/em><\/p>\n\n\n\n The biggest decision that relates to our sunk cost conversation is this: What constitutes your infrastructure? Do you A) offload the infrastructure problem and borrow<\/em> it from providers, while focusing on your core product, or do you B) build <\/em>components in-house, investing time and money upfront and discovering and solving the challenges as you go? Do you A) consolidate locations, saving on ingress\/egress and the many costs associated with regions and zones, or do you B) decentralize across various sources, diversifying the points of failure but spreading your infrastructure across zones or regions, potentially creating a latency problem that needs its own solution?<\/p>\n\n\n\n The trend that I see in growing startups is this: focus on your core product differentiation and commoditize the rest. 
Infrastructure can be a complicated overhead that takes you away from the monetizable problem statement, or it can be a powerhouse of components that scale with a single click as you grow.<\/p>\n\n\n\n There is a saying that I have heard in the startup community: \u201cYou cannot throw GPU at every problem.\u201d How I interpret it is this: \u201cOptimization is a problem that can\u2019t be completely solved by hardware (generally speaking).\u201d There are other factors at play like model compression and quantization, not to mention the crucial role of platform and runtime software such as inference acceleration<\/a> and checkpointing<\/a>.<\/p>\n\n\n\n Thinking of the big picture, the role of optimization and acceleration rapidly becomes central. Runtime accelerators like ONNX Runtime can give 1.4X faster inference, while rapid checkpointing features like Nebula can help recover your training jobs from hardware failures, thus saving the most vital resource: time. Along with this, simple techniques like autoscaling and workload-based scaling triggers can help you spin down GPUs that sit idle while waiting for your next burst of inference requests, returning to a minimum from which you can scale back up.<\/p>\n\n\n\n In the roundtables that we\u2019ve hosted for startups, sometimes the most cash-burning questions are the simplest ones: To manage your growth, how do you balance serving your customers short-term with the most efficient hardware and scale vs. serving them long-term with efficient scale-ups and scale-downs?<\/p>\n\n\n\n As we think about productionizing with foundation models, involving large-scale training and inference, we need to consider the role of platform and inference acceleration together with the role of infrastructure. Tools such as ONNX Runtime and Nebula are only a couple of such considerations, and there are many more. 
Ultimately, startups face the challenge of efficiently serving customers in the short term while managing growth and scalability in the long term.<\/p>\n\n\n\n\n
The infrastructure dilemma for FM startups<\/h2>\n\n\n\n
\n
Beyond compute: The role of platform and inference acceleration<\/h2>\n\n\n\n
Summary<\/h2>\n\n\n\n