Reducing AI’s Carbon Footprint

Improve edge-device AI efficiency


Machine learning models are increasingly running on edge hardware, such as mobile phones or Internet of Things (IoT) devices. Motivations include protection of private data and avoidance of networking latency, for example with applications that recognize speech. Ensuring efficient inference is especially important on battery-powered devices with constrained processor, memory and power budgets. Several approaches have proven fruitful.

In collaboration with NVIDIA, we’ve developed efficient Neural Architecture Search (NAS) to find network architectures that will run efficiently on hardware with specific constraints, such as low power consumption for mobile devices. Hardware-Aware Network Transformation (HANT) employs a two-level strategy to achieve this goal. First, knowledge distillation is used to train a library of efficient operators; this step is performed only once. HANT can then search this library quickly and repeatedly to generate energy-efficient, hardware-specific architectures. This highly efficient method can find high-performing architectures in minutes, enabling carbon savings compared to previous methods.

Hardware-Aware Network Transformation (HANT) diagram
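
To make the two-level idea concrete, here is a minimal sketch in Python. The operator names, accuracy proxies, and latency numbers are hypothetical, and the exhaustive enumeration stands in for HANT’s actual search procedure:

```python
# Sketch of HANT's two-level idea: (1) distill a library of efficient candidate
# operators per layer once, (2) reuse that library for fast, hardware-specific
# searches. All names and numbers below are illustrative, not from HANT.
import itertools

# Stage 1 (done once): for each layer, a set of pre-distilled efficient
# alternatives, each with an accuracy proxy measured during distillation.
operator_library = {
    "layer1": [("conv3x3", 0.95), ("depthwise_conv", 0.93), ("identity", 0.90)],
    "layer2": [("conv3x3", 0.96), ("grouped_conv", 0.94)],
}

# Hardware-specific latency table (would come from profiling the target device).
latency_ms = {"conv3x3": 3.0, "depthwise_conv": 1.2, "grouped_conv": 1.5, "identity": 0.1}

def search(max_latency_ms):
    """Stage 2 (fast, repeatable): pick the most accurate combination under a latency budget."""
    best = None
    for combo in itertools.product(*operator_library.values()):
        total_latency = sum(latency_ms[op] for op, _ in combo)
        accuracy_proxy = sum(acc for _, acc in combo)
        if total_latency <= max_latency_ms and (best is None or accuracy_proxy > best[1]):
            best = (combo, accuracy_proxy, total_latency)
    return best

print(search(max_latency_ms=3.0))  # e.g. depthwise_conv + grouped_conv under a 3 ms budget
```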

AsyMo takes a different approach, focusing on deep learning inference latency and energy efficiency on the asymmetric processors found in mobile phones. Mobile CPUs have multiple cores with different characteristics, such as a larger core intended for high performance and a smaller core for when energy conservation matters more than speed. AsyMo incorporates knowledge of the processor asymmetry and the model architecture into the partitioning of neural network inference tasks to reduce inference latency. AsyMo also exploits the observation that high CPU clock speeds do not benefit (and can actually harm) models that are memory-bandwidth limited. Leveraging this insight, AsyMo intelligently sets the CPU clock speed based on the hardware and model architecture to improve energy efficiency. Depending on the deep learning framework and model evaluated, AsyMo achieved improvements of 46% or more for inference latency, and up to 37% for energy efficiency.
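
A simplified sketch of these two ideas follows. The core names, throughput ratios, and frequency values are invented for illustration; a real system would profile the device rather than hard-code them, and this is not the AsyMo implementation:

```python
# Illustrative sketch of AsyMo's two ideas: partition inference work in proportion
# to measured big/little core throughput, and avoid needlessly high clock speeds
# for memory-bandwidth-bound models. Numbers are placeholders.
def partition_rows(num_rows, core_throughputs):
    """Split matrix rows across asymmetric cores proportionally to throughput."""
    total = sum(core_throughputs.values())
    shares = {core: round(num_rows * t / total) for core, t in core_throughputs.items()}
    # Give any rounding remainder to the fastest core.
    fastest = max(core_throughputs, key=core_throughputs.get)
    shares[fastest] += num_rows - sum(shares.values())
    return shares

def pick_frequency(model_is_memory_bound, freqs_mhz):
    """Crude illustration: memory-bound models gain little from the highest clock,
    so cap the frequency to save energy; compute-bound models get the fastest clock."""
    return min(freqs_mhz) if model_is_memory_bound else max(freqs_mhz)

print(partition_rows(1024, {"big_core": 3.0, "little_core": 1.0}))  # {'big_core': 768, 'little_core': 256}
print(pick_frequency(model_is_memory_bound=True, freqs_mhz=[1000, 1800, 2400]))
```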

Emit less carbon from AI


Multiple activities are involved in developing and using machine learning models, including selection of model architectures and algorithms, hyperparameter tuning, training on existing datasets, and making predictions on new data (aka inference). Optimizing results across these activities involves many complex problems that researchers are addressing, as described in the sections below.

Efficient model architectures and hyperparameter tuning

Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO) are both automated optimization techniques that aim to identify promising candidates within combinatorial search spaces that are typically too large to search exhaustively. For NAS, the search is conducted through a space of potential neural network architectures; for HPO, the search is for high-performing combinations of hyperparameters. While “high-performing” traditionally refers to the prediction accuracy of the resulting model, techniques like NAS and HPO can also be used to satisfy different objective functions, such as computational efficiency or cost. Thus, NAS and HPO can be useful for identifying more resource-efficient machine learning models – but it is also critical that these techniques themselves operate efficiently.

Research-based meta-learning techniques, in which an algorithm learns from experience to guide more efficient exploration of the search space, have been incorporated into Azure Machine Learning. These techniques are the basis for efficient model selection in Automated Machine Learning and for efficient hyperparameter optimization in the HyperDrive service. Other research approaches include Probabilistic Neural Architecture Search (PARSEC), which uses a memory-efficient sampling procedure that requires only as much memory as is needed to train a single architecture in the search space, greatly reducing memory requirements compared to previous methods of identifying high-performing neural network architectures. Weightless PARSEC built on that approach to achieve comparable results with 100x less computational cost, with implications for both embodied and emitted carbon reduction.
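
As a rough illustration of searching with an objective other than accuracy alone, here is a minimal cost-aware random search. This is not the meta-learning algorithm used in Azure Machine Learning; the search space, cost proxy, and evaluation function are placeholders:

```python
# Minimal sketch of cost-aware hyperparameter search: candidates are scored on
# accuracy *and* an estimated compute cost, so cheaper configurations can win.
import random

search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "hidden_size": [128, 256, 512],
    "num_layers": [2, 4, 6],
}

def sample_config():
    return {name: random.choice(values) for name, values in search_space.items()}

def estimate_cost(config):
    # Rough proxy: compute grows with width^2 * depth.
    return config["hidden_size"] ** 2 * config["num_layers"]

def search(train_and_evaluate, num_trials=20, cost_weight=1e-7):
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config()
        accuracy = train_and_evaluate(config)            # the expensive step
        score = accuracy - cost_weight * estimate_cost(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config

# Stand-in evaluation so the sketch runs end-to-end; real use would train a model.
def fake_train_and_evaluate(config):
    return 0.9 - 0.1 / config["hidden_size"] * config["num_layers"]

print(search(fake_train_and_evaluate))  # tends to favor smaller, cheaper configurations
```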


Since those advances, the growth of very large-scale deep learning models has motivated researchers to develop a theory of scaling as model width grows, called Tensor Programs. This theory enabled a procedure, called muTransfer, that can transfer training hyperparameters across model sizes, allowing the optimal hyperparameters discovered for a small model to be applied directly to a scaled-up target version. Compared to directly tuning the hyperparameters of large models, muTransfer enables equivalent accuracy while using at least an order of magnitude (~10x) less compute, with no limit to the efficiency gain as the target model size grows. For very large-scale models (e.g., trillions of parameters), where direct tuning is prohibitively costly, muTransfer makes hyperparameter tuning feasible at all. For models of any size, muTransfer is available as an open-source package.
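
The workflow the package enables looks roughly like the sketch below. This is only a conceptual outline: in practice the models must be written in the muP parameterization (which the open-source mup package provides), and `build_model` and `train_and_evaluate` here are hypothetical placeholders:

```python
# Conceptual sketch of muTransfer-style hyperparameter transfer (not the mup API):
# tune on a small proxy model, then reuse the optimum on the scaled-up model.
def transfer_tuned_lr(build_model, train_and_evaluate,
                      candidate_lrs, proxy_width=256, target_width=8192):
    # 1. Tune the learning rate on a small, cheap proxy model.
    scores = {lr: train_and_evaluate(build_model(width=proxy_width), lr)
              for lr in candidate_lrs}
    best_lr = max(scores, key=scores.get)
    # 2. Reuse the proxy's optimum directly on the scaled-up model, avoiding
    #    hyperparameter sweeps at the expensive target size.
    return build_model(width=target_width), best_lr
```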


Efficient model training

Training is the process of exposing a model to existing data, from which the model “learns” how to weight different features in the input data to predict the outcome of a specified task. For example, models for translation from English to Hebrew might learn from large data sets of example translations to predict that “I” in English should translate to “ani” in Hebrew. As language models have grown larger and more generalizable to multiple tasks, training has been refined into two stages: pre-training of a general model, followed by fine-tuning to produce accurate outcomes on a specific task. Continuing the translation example, the pre-training stage might draw from multi-lingual translation examples, while the fine-tuning might focus on one target language. While pre-training can be especially computationally intensive, for production systems it typically happens less frequently than fine-tuning, and both happen much less frequently than inference. Nevertheless, improving the efficiency of both aspects of training is critical to reducing AI’s carbon footprint.

LoRA (Low-Rank Adaptation of Large Language Models) focuses on compute and memory reductions during fine-tuning. It freezes the pre-trained model weights (aka parameter values) and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the GPU memory requirement for fine-tuning by a factor of 3. LoRA has been successfully applied in practice to make real-world model development more resource-efficient. For example, Power Apps Ideas uses AI to help people create apps using plain language to describe what they want the app to do. The product’s development relied on fine-tuning the general-purpose GPT-3 175B model; using LoRA for fine-tuning reduced the storage and hardware requirements from 64 GPUs (specialized processors for AI training) to 16. A package that facilitates the integration of LoRA with PyTorch models can be found on GitHub.
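
The core mechanism can be sketched for a single linear layer as follows. This is an illustrative simplification, not the official loralib implementation; the layer sizes, rank, and initialization details are chosen only for the example:

```python
# Minimal sketch of the LoRA idea for one linear layer: the pre-trained weight W
# is frozen and only the low-rank update B @ A is trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)   # stands in for pre-trained weights
        self.weight.requires_grad = False        # frozen during fine-tuning
        # Trainable low-rank factors; B starts at zero so training begins from
        # the unmodified pre-trained behavior.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        frozen = x @ self.weight.T
        update = x @ self.lora_A.T @ self.lora_B.T
        return frozen + self.scaling * update

layer = LoRALinear(1024, 1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16,384 trainable parameters vs. ~1M frozen in the base weight
```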

EarlyBERT, in contrast, provides a general, computationally efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. Inspired by the Early-Bird Lottery Tickets studied for computer vision tasks, EarlyBERT slims the self-attention and fully-connected sub-layers inside a transformer model to identify structured winning tickets in the early stage of BERT training. In essence, EarlyBERT compresses the model by pruning it into a sparser version of the original. The result is comparable performance to standard BERT, with 35-45% less compute time needed for training.
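
The underlying prune-early idea can be illustrated with a small sketch. This is a loose simplification of EarlyBERT, which learns slimming coefficients on the self-attention and fully-connected sub-layers; the coefficients and keep ratio below are invented for the example:

```python
# Illustrative sketch of "slim early, then train the smaller model": score
# structured units (e.g. attention heads) by learned importance coefficients
# after a short warm-up phase and keep only the strongest ones.
import torch

def keep_mask(importance_coefficients: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the top `keep_ratio` fraction of units."""
    num_keep = max(1, int(keep_ratio * importance_coefficients.numel()))
    threshold = torch.topk(importance_coefficients.abs(), num_keep).values.min()
    return (importance_coefficients.abs() >= threshold).float()

# After the warm-up, units with small coefficients are pruned and the remaining,
# smaller model is trained to convergence.
coeffs = torch.tensor([0.9, 0.05, 0.7, 0.01, 0.6, 0.3])
print(keep_mask(coeffs, keep_ratio=0.5))  # tensor([1., 0., 1., 0., 1., 0.])
```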


Efficient inference

Once designed, trained and tuned, models are used for inference – making predictions on new data. Because production models at scale may handle trillions of inferences per day, efficient use of compute hardware for inference is a critical objective for minimizing embodied carbon, alongside objectives to minimize cost and response times (aka latency) while maximizing predictive accuracy. Many approaches are being pursued to balance these tradeoffs, including algorithmic improvements, techniques such as model compression and knowledge distillation that reduce model size, and methods such as quantization and factorization that lessen the intensity of the arithmetic operations (such as matrix multiplication) fundamental to deep learning. Example results in this area include:

Factorization: Factorizable neural operators (FNO) aim to reduce the memory and compute requirements of very large AI models, such as those used for language-based tasks. FNO leverages low-rank matrix decompositions to achieve hardware-friendly compression of the most costly neural network layers. Applied to large-scale language models such as BERT, FNO reduced memory usage by 80 percent and prediction time by 50 percent, with less than a 5 percent reduction in accuracy. Further improvements may be achieved by combining FNO with other model compression methods, such as distillation and quantization.
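
The general idea behind this kind of compression can be shown with a small example. This is generic low-rank factorization via SVD, not the FNO method itself; the matrix sizes and rank are arbitrary, and trained weight matrices are typically far more compressible than the random matrix used here:

```python
# Simple illustration of low-rank factorization: a dense weight matrix is replaced
# by the product of two thin matrices, cutting both parameters and multiply-adds
# when the rank is small.
import numpy as np

def factorize(weight: np.ndarray, rank: int):
    """Approximate `weight` (out x in) with two factors of the given rank."""
    U, S, Vt = np.linalg.svd(weight, full_matrices=False)
    first = U[:, :rank] * S[:rank]          # (out x rank)
    second = Vt[:rank, :]                   # (rank x in)
    return first, second

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))
A, B = factorize(W, rank=64)
# Parameters: 1024*1024 ~= 1.05M dense vs. 2*1024*64 ~= 0.13M factored.
x = rng.standard_normal(1024)
dense_out, factored_out = W @ x, A @ (B @ x)
print(np.abs(dense_out - factored_out).mean())
```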

Model compression via knowledge distillation: These techniques aim to reduce model size while retaining model accuracy, and are particularly critical for the large-scale, pre-trained models commonly used for language tasks. Compression can be achieved by distilling knowledge from a larger teacher model into a smaller student model; several developments based on this approach have achieved significant reductions in parameter counts, which typically reduce memory requirements correspondingly, as well as inference latency.

F1-score comparison for different models across 41 languages; the left axis shows F1 scores, and the right axis (blue dots) shows the number of training labels, in thousands.
  • XtremeDistil uses a complementary stage-wise optimization scheme that leverages the teacher’s internal representations and is agnostic of the teacher architecture. XtremeDistil outperformed prior art, compressing teacher models like mBERT by up to 35x in parameters and 51x in batch-inference latency while retaining 95% of the teacher’s accuracy. Using the XtremeDistil approach with the ONNX runtime on a multilingual Named Entity Recognition (NER) task, Walmart Labs was able to reduce inference time by over a factor of 30 (from 30 milliseconds to less than 1 millisecond).
  • MiniLM uses deep self-attention distillation to compress pre-trained language models into students that are 50% smaller (i.e., 50% of the teacher’s parameters and computation) while retaining 99% of the model accuracy. It also obtains competitive results when applying deep self-attention distillation to multilingual pre-trained models.
  • AutoDistil combines distillation with neural architecture search to further improve compression, outperforming leading compression techniques by achieving up to a 2.7x reduction in computational cost with negligible loss in task performance.
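
As a concrete illustration of the teacher-student objective these methods build on, here is a minimal, generic distillation loss (not the exact loss used by any of the papers above); the temperature, weighting, and random logits are placeholders:

```python
# Minimal knowledge-distillation loss: the student matches softened teacher
# probabilities in addition to the ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    distill = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard

student_logits = torch.randn(4, 10)   # outputs of a small student model
teacher_logits = torch.randn(4, 10)   # outputs of a large teacher model
labels = torch.tensor([1, 3, 5, 7])
print(distillation_loss(student_logits, teacher_logits, labels))
```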

Empower AI developers

GPU energy usage graph showing 4 of 8 GPUs

Progress in machine learning is measured in part through the constant improvement of performance metrics such as accuracy or latency. Carbon footprint metrics, while being an equally important target, have not received the same degree of attention. With contributions from our research team, Azure ML now provides transparency around machine learning resource utilization, including GPU energy consumption and computational cost, for both training and inference at scale. This reporting can raise developers’ awareness of the carbon cost of their model development process and encourage them to optimize their experimentation strategies.


An animated illustration of the neural architecture search platform Archai automatically identifying neural network architectures for a given dataset.

Archai, an open-source tool, can inform model development tradeoffs. In combination with a set of Neural Architecture Search (NAS) algorithms, Archai can perform a cost-aware architecture search, where “cost” can represent different resources of interest, such as compute time or peak memory footprint. Running Archai provides the model developer with the entire spectrum of cost vs. accuracy tradeoffs, so they can choose the point that best meets their needs.
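
That “spectrum of tradeoffs” can be pictured as a Pareto frontier over candidate architectures. The sketch below is not Archai code; the candidate names, costs, and accuracies are invented to show the selection step:

```python
# Sketch of the cost-vs-accuracy tradeoff view: given candidate architectures with
# measured cost and accuracy, keep only the Pareto-optimal ones, i.e. those not
# beaten on both axes by another candidate.
def pareto_frontier(candidates):
    """candidates: list of (name, cost, accuracy); lower cost and higher accuracy are better."""
    frontier = []
    for name, cost, acc in candidates:
        dominated = any(c <= cost and a >= acc and (c, a) != (cost, acc)
                        for _, c, a in candidates)
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda t: t[1])

candidates = [("tiny", 1.0, 0.80), ("small", 2.0, 0.86),
              ("medium", 4.0, 0.88), ("wasteful", 5.0, 0.87)]
print(pareto_frontier(candidates))  # "wasteful" is dominated by "medium" and dropped
```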

Finally, Accera is an open-source compiler that aggressively optimizes for AI workloads. The Accera compiler doesn’t change or approximate a model; rather, it finds the most efficient implementation of that model. For example, matrix multiplication with a ReLU activation is commonly used in machine learning algorithms; by optimizing its implementation, developers can reduce the computational intensity of running their models.
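
To illustrate what “finding a better implementation of the same model” means, the NumPy sketch below (not Accera code) contrasts a reference matmul-plus-ReLU with one alternative implementation choice, loop tiling; a compiler like Accera searches over many such choices for the target hardware:

```python
# The math of matmul + ReLU stays fixed; only the implementation strategy changes.
import numpy as np

def matmul_relu_reference(A, B):
    """The unchanged mathematical specification: max(A @ B, 0)."""
    return np.maximum(A @ B, 0.0)

def matmul_relu_tiled(A, B, tile=64):
    """One possible implementation choice: blocked (tiled) computation,
    which improves cache locality for large matrices."""
    m, k = A.shape
    _, n = B.shape
    out = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                out[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return np.maximum(out, 0.0)

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
print(np.allclose(matmul_relu_reference(A, B), matmul_relu_tiled(A, B)))  # True
```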
