Multiple activities are involved in developing and using machine learning models, including selection of model architectures and algorithms, hyperparameter tuning, training on existing datasets, and making predictions on new data (aka inference). Optimizing results across these activities involves many complex problems that researchers are addressing, as described in the sections below.
Efficient model architectures and hyperparameter tuning
Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO) are both automated optimization techniques that aim to identify promising candidates within combinatorial search spaces that are typically too large to search exhaustively. For NAS, the search is conducted through a space of potential neural network architectures; for HPO, the search is for high-performing combinations of hyperparameters. While “high-performing” traditionally refers to the prediction accuracy of the resulting model, techniques like NAS and HPO can also be used to satisfy different objective functions, such as computational efficiency or cost. Thus, NAS and HPO can be useful for identifying more resource-efficient machine learning models – but it is also critical that these techniques themselves operate efficiently. Research-based meta-learning techniques, in which an algorithm learns from experience to guide more efficient exploration of the search space, have been incorporated into Azure Machine Learning. These techniques are the basis for efficient model selection in Automated Machine Learning and for efficient hyperparameter optimization in the HyperDrive service. Other research approaches include Probabilistic Neural Architecture Search (PARSEC), which uses a memory-efficient sampling procedure that requires only as much memory as is needed to train a single architecture in the search space, greatly reducing memory requirements compared to previous methods of identifying high-performing neural network architectures. Weightless PARSEC built on that approach, achieving comparable accuracy with 100x less computational cost, with implications for both embodied and emitted carbon reduction.
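To make the idea of efficiency-aware search concrete, here is a minimal sketch of hyperparameter search that scores candidates on a composite objective (accuracy minus a compute-cost penalty) rather than accuracy alone. The search space, the `train_and_evaluate` stub, and the cost proxy are hypothetical stand-ins for illustration; they are not the Azure Machine Learning, HyperDrive, or PARSEC implementations, which use far more sophisticated meta-learning and sampling strategies.

```python
import random

# Hypothetical search space; real NAS/HPO spaces are far larger.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "hidden_width": [128, 256, 512, 1024],
    "num_layers": [2, 4, 6],
}

def sample_candidate():
    """Draw one hyperparameter combination at random."""
    return {name: random.choice(values) for name, values in SEARCH_SPACE.items()}

def estimate_compute_cost(cfg):
    """Crude proxy for training/inference cost (parameter count of the hidden stack)."""
    return cfg["num_layers"] * cfg["hidden_width"] ** 2

def train_and_evaluate(cfg):
    """Placeholder: in practice, train a model with cfg and return validation accuracy.
    A toy synthetic score stands in here so the sketch runs end to end."""
    return 0.8 + 0.05 * random.random()

def search(num_trials=20, cost_weight=1e-8):
    """Random search with an efficiency-aware objective:
    accuracy minus a penalty proportional to estimated compute cost."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(num_trials):
        cfg = sample_candidate()
        score = train_and_evaluate(cfg) - cost_weight * estimate_compute_cost(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

if __name__ == "__main__":
    print(search())
```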
Since those advances, the growth of very large-scale deep learning models motivated researchers to develop a theory of scaling as model width grows, called Tensor Programs. This theory enabled a procedure, called muTransfer, that can transfer training hyperparameters across model sizes, allowing the optimal hyperparameters discovered for a small model to be applied directly to a target scaled-up version. Compared to directly tuning the hyperparameters of large models, muTransfer enables equivalent accuracy levels while using at least an order of magnitude (~10x) less compute, with no limit to the efficiency gain as the target model size grows. With very large-scale models (e.g., trillions of parameters) for which hyperparameter tuning is prohibitively costly, muTransfer makes hyperparameter tuning possible. For models of any size, muTransfer is available as an open-source package.
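The sketch below follows the documented workflow of the open-source `mup` package (MuReadout, set_base_shapes, MuAdam) on a toy model; exact API details may differ between package versions, and steps such as replacing custom initializers are omitted. The key point is that hyperparameters tuned on the small base model can be reused directly on the scaled-up target model.

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam  # pip install mup

class MLP(nn.Module):
    """Toy model whose hidden width can be scaled up."""
    def __init__(self, width, d_in=64, d_out=10):
        super().__init__()
        self.hidden = nn.Linear(d_in, width)
        # MuReadout replaces the final nn.Linear so the output layer
        # is parameterized correctly under muP.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(self.hidden(x).relu())

# The "base" and "delta" models define which dimensions scale with width;
# the target model is the scaled-up version actually trained.
base, delta, target = MLP(width=64), MLP(width=128), MLP(width=4096)
set_base_shapes(target, base, delta=delta)

# Hyperparameters (e.g., the learning rate) tuned on a small muP model
# can be applied to the large target model without re-tuning.
optimizer = MuAdam(target.parameters(), lr=3e-4)
```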
Efficient model training
Training is the process of exposing a model to existing data, from which the model “learns” how to weight different features in the input data to predict the outcome of a specified task. For example, models for translation from English to Hebrew might learn from large data sets of example translations to predict that “I” in English should translate to “ani” in Hebrew. As language models have grown larger and more generalizable to multiple tasks, training has been refined into two stages: pre-training of a general model, followed by fine-tuning to produce accurate outcomes on a specific task. Continuing the translation example, the pre-training stage might draw from multi-lingual translation examples, while the fine-tuning might focus on one target language. While pre-training can be especially computationally intensive, for production systems it typically happens less frequently than fine-tuning, and both happen much less frequently than inference. Nevertheless, improving the efficiency of both stages of training is critical to reducing AI’s carbon footprint.
LoRA (Low-Rank Adaptation of Large Language Models) focuses on compute and memory reductions during fine-tuning. It freezes the pre-trained model weights (aka parameter values) and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the GPU memory requirement for fine-tuning by a factor of 3. LoRA has been successfully applied in practice to make real-world model development more resource-efficient. For example, Power Apps Ideas uses AI to help people create apps using plain language to describe what they want the app to do. The product’s development relied on fine-tuning the general purpose GPT-3 175B model. Using LoRA for fine-tuning reduced the storage and hardware requirements from 64 to 16 GPUs (specialized processors for AI training). A package that facilitates the integration of LoRA with PyTorch models can be found on GitHub.
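The core mechanism can be illustrated with a single linear layer: the pre-trained weight is frozen and only two small rank-r matrices are trained. This is a standalone sketch for illustration, not the official loralib package referenced above; the pre-trained weight here is randomly initialized as a placeholder, whereas in practice it would be loaded from the pre-trained checkpoint.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Pre-trained weight: frozen during fine-tuning.
        # (Placeholder init; in practice this comes from the pre-trained model.)
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features) * 0.02, requires_grad=False
        )
        # Low-rank factors: the only trainable parameters.
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))  # B stays zero, so the update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        frozen = x @ self.weight.t()
        update = (x @ self.lora_A.t()) @ self.lora_B.t() * self.scaling
        return frozen + update

layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank factors (2 * 8 * 768 parameters) are trained
```

Because the update is itself a low-rank matrix, it can be merged into the frozen weight after fine-tuning (W + scaling * B @ A), so serving the adapted model adds no extra inference latency.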
EarlyBERT, in contrast, provides a general computationally efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. Inspired by the Early-Bird Lottery Tickets studied for computer vision tasks, EarlyBERT slims the self-attention and fully-connected sub-layers inside a transformer model to identify structured winning tickets in the early stage of BERT training. In essence, EarlyBERT compresses the model by pruning it into a sparser version of the original. The result is comparable performance to standard BERT, with 35-45% less compute time needed for training.
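A rough sketch of the general recipe behind this kind of structured early pruning follows: attach learnable gate coefficients to attention heads, regularize them toward zero, and after a short warm-up keep only the highest-magnitude gates. This is a simplified illustration under those assumptions, not the EarlyBERT code, and the `GatedHeads` module and helper functions are hypothetical.

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Learnable per-head gates on multi-head attention output, used to rank heads for pruning."""
    def __init__(self, num_heads, head_dim):
        super().__init__()
        self.gates = nn.Parameter(torch.ones(num_heads))
        self.head_dim = head_dim

    def forward(self, attn_out):
        # attn_out: (batch, seq, num_heads * head_dim)
        b, s, _ = attn_out.shape
        heads = attn_out.view(b, s, -1, self.head_dim)
        return (heads * self.gates.view(1, 1, -1, 1)).reshape(b, s, -1)

def l1_gate_penalty(model, coeff=1e-4):
    """L1 regularization added to the training loss; drives unimportant gates toward zero."""
    return coeff * sum(m.gates.abs().sum() for m in model.modules() if isinstance(m, GatedHeads))

def heads_to_keep(model, keep_ratio=0.6):
    """After a short warm-up, keep only the highest-magnitude gates in each layer."""
    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, GatedHeads):
            k = max(1, int(keep_ratio * m.gates.numel()))
            plan[name] = torch.topk(m.gates.abs(), k).indices.tolist()
    return plan
```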
Efficient inference
Once designed, trained, and tuned, models are used for inference – making predictions on new data. Because production models at scale may handle trillions of inferences per day, efficient use of compute hardware for inference is a critical objective for minimizing embodied carbon, alongside objectives to minimize cost and response times (aka latency) while maximizing predictive accuracy. Many approaches are being pursued to balance these tradeoffs, including algorithmic improvements, techniques such as model compression and knowledge distillation that reduce model size, and methods such as quantization and factorization to lessen the intensity of computations needed for the arithmetic operations (such as matrix multiplication) that are fundamental to deep learning. Example results in this area include:
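Of the techniques listed above, quantization is among the simplest to try in practice. The sketch below uses PyTorch's built-in dynamic quantization to convert the linear layers of a model to 8-bit integer weights; the toy model is a stand-in for a trained network, and the actual memory and latency savings vary by model and hardware.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained model; in practice this would be a real trained network.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
model.eval()

# Dynamic quantization: weights of nn.Linear layers are stored as int8, and
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller weights, int8 matrix multiplies
```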
Factorization: Factorizable neural operators (FNO) aim to reduce the memory and compute requirements of very large AI models, such as those used for language-based tasks. FNO leverages low-rank matrix decompositions to achieve hardware-friendly compression of the most costly neural network layers. Applied to large-scale language models such as BERT, FNO reduced memory usage by 80 percent and prediction time by 50 percent, with less than a 5 percent reduction in accuracy. Further improvements may be achieved by combining FNO with other model compression methods, such as distillation and quantization.
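The underlying low-rank idea can be illustrated generically: factor a large weight matrix with a truncated SVD and replace one big linear layer with two thin ones, trading a small approximation error for fewer parameters and multiply-adds. This is a generic sketch of low-rank factorization, not the FNO implementation, and the `factorize_linear` helper is hypothetical.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace Linear(d_in, d_out) with Linear(d_in, rank) -> Linear(rank, d_out)
    using a truncated SVD of the original weight matrix."""
    W = layer.weight.data                      # shape: (d_out, d_in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]               # (d_out, rank), singular values folded in
    V_r = Vh[:rank, :]                         # (rank, d_in)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

# Example: a 1024x1024 layer (about 1.05M weights) factorized at rank 128
# keeps roughly a quarter of the parameters (2 * 1024 * 128 = 262,144).
big = nn.Linear(1024, 1024)
small = factorize_linear(big, rank=128)
```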
Model compression via knowledge distillation: These techniques aim to reduce model size while retaining model accuracy, and are particularly critical for the large-scale, pre-trained models commonly used for language tasks. Compression can be achieved by distilling knowledge from a larger teacher model into a smaller student model; several developments based on this approach have achieved significant reductions in parameter counts and, correspondingly, in memory requirements and inference latency. A generic sketch of the distillation objective follows the examples below.
- XtremeDistil uses a complementary stage-wise optimization scheme that leverages the teacher's internal representations and is agnostic of the teacher architecture. XtremeDistil outperformed prior art, compressing teacher models like mBERT by up to 35x in parameters and 51x in batch-inference latency while retaining 95% of the teacher's accuracy. Using the XtremeDistil approach with the ONNX runtime on a multilingual Named Entity Recognition (NER) task, Walmart Labs was able to reduce inference time by over a factor of 30 (from 30 milliseconds to less than 1 millisecond).
- MiniLM uses deep self-attention distillation to compress pre-trained language models into student models that are 50% smaller (i.e., 50% of the teacher's parameters and computations) while retaining 99% of the model accuracy. It also obtains competitive results when applying deep self-attention distillation to multilingual pre-trained models.
- AutoDistil combines knowledge distillation with neural architecture search, outperforming leading compression techniques by achieving up to a 2.7x reduction in computational cost with negligible loss in task performance.
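To make the teacher-student mechanism concrete, here is a minimal, generic distillation loss: soft targets from the teacher, combined with the usual hard-label loss. This is not the specific objective used by XtremeDistil, MiniLM, or AutoDistil, each of which distills richer signals such as internal representations or self-attention distributions; it is a sketch of the basic idea only.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of (1) KL divergence between temperature-softened teacher and student
    output distributions and (2) ordinary cross-entropy on the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Training loop fragment (teacher and student are hypothetical models):
# the teacher runs with gradients disabled, and only the smaller student is updated.
#
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, labels)
# loss.backward()
```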