Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
Dongkuan Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed Hassan Awadallah, Jianfeng Gao
Knowledge distillation (KD) is effective in compressing large pre-trained language models: a small student model is trained to mimic the output distribution of a large teacher model (e.g., BERT, GPT-X). KD relies on hand-designed student model architectures that require several trials and pre-specified compression rates. In our paper, Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models, we discuss AutoDistil, a new technique pioneered by Microsoft Research that leverages advances in KD and neural architecture search (NAS) to automatically generate a suite of compressed models with variable computational cost (e.g., varying sizes, FLOPs and latency). Hand-engineering compressed model architectures is hard to customize for diverse deployment environments with variable resource constraints; NAS for distillation addresses this challenge with an automated framework. AutoDistil-generated compressed models obtain up to a 41x reduction in FLOPs with limited regression in task performance, and a 6x reduction in FLOPs at parity in performance with the large teacher model. Given any state-of-the-art compressed model, AutoDistil finds a compressed variant with a better trade-off between task performance and computational cost during inference.
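To make "mimic the output distribution" concrete, here is a minimal sketch of a common distillation objective: the KL divergence between temperature-softened teacher and student distributions. This illustrates KD in general, not AutoDistil's training objective or its architecture search; the function names, the temperature value, and the toy logits are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits (hypothetical): one student agrees with the teacher, one does not.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close < loss_far)
```

A student whose outputs track the teacher's gets a loss near zero, while a student that ranks the classes differently is penalized more heavily.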
Neuron with steady response leads to better generalization
Qiang Fu, Lun Du, Haitao Mao, Xu Chen, Wei Fang, Shi Han, Dongmei Zhang
Improving models’ ability to generalize is one of the most important research problems in machine learning. Deep neural networks with diverse architectures have been invented and widely applied to various domains and tasks. Our goal was to study and identify the fundamental properties commonly shared by different kinds of deep neural networks, and then design a generic technique applicable for all of them to improve their generalization.
In this paper, we study the characteristics of individual neurons’ responses during training, at neuron-level granularity. We find that keeping the response of activated neurons stable for the same class helps improve models’ ability to generalize. This is a new regularization perspective based on the neuron-level class-dependent response distribution. We also observed that vanilla models usually lack steady intra-class responses. Based on these observations, we designed a generic regularization method, Neuron Steadiness Regularization (NSR), to reduce large intra-class neuron response variance. NSR is computationally efficient and applicable to various architectures and tasks. Extensive experiments on multiple types of datasets and various network architectures show significant improvements. We will continue our research on improving model generalization.
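The idea of penalizing intra-class neuron response variance can be sketched as follows. This is a simplified illustration and not the NSR implementation from the paper: it assumes activations are available as plain lists, treats every neuron and class equally, and uses a hypothetical weight `lam` to combine the penalty with a task loss.

```python
from collections import defaultdict


def intra_class_variance_penalty(responses, labels):
    """Sum, over classes and neurons, of each neuron's response variance within a class.

    responses: list of per-sample activation vectors (lists of floats)
    labels:    list of class labels, one per sample
    """
    by_class = defaultdict(list)
    for vec, y in zip(responses, labels):
        by_class[y].append(vec)

    penalty = 0.0
    for vecs in by_class.values():
        n = len(vecs)
        for j in range(len(vecs[0])):  # iterate over neurons
            col = [v[j] for v in vecs]
            mean = sum(col) / n
            penalty += sum((x - mean) ** 2 for x in col) / n
    return penalty


def regularized_loss(task_loss, responses, labels, lam=0.1):
    """Task loss plus the weighted steadiness penalty (lam is a hypothetical weight)."""
    return task_loss + lam * intra_class_variance_penalty(responses, labels)


# Toy activations for two neurons and two classes (made-up numbers).
labels = [0, 0, 1, 1]
steady = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
unsteady = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]

print(intra_class_variance_penalty(steady, labels))
print(intra_class_variance_penalty(unsteady, labels))
```

Samples of the same class that trigger identical responses incur no penalty, while responses that fluctuate within a class raise the loss.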
Long-form video-language pre-training with multimodal temporal contrastive learning
Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu
Huge numbers of videos on diverse topics and of various lengths are shared on social media. Analyzing and understanding these videos is an important but challenging problem. Previous work on action and scene recognition has been limited to certain labels, while neglecting the rich semantic and dynamic information in other videos. Inspired by the cross-modal pre-training paradigm in the image-language domain (e.g., CLIP, Florence), researchers have explored video-language joint pre-training, which mainly uses short-form videos (e.g., < 30 seconds). Long-form video-language pre-training has not yet been well studied, even though long-form videos contain much richer and more complex semantic content in real scenarios.
In this research, we propose a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) to explore long-form video representation learning, and train it on a long-form video-language dataset (LF-VILA-8M) built on our newly collected video-language dataset (HD-VILA-100M). We then design a Multimodal Temporal Contrastive (MTC) loss to capture the temporal relation between video clips and single sentences. We also propose the Hierarchical Temporal Window Attention (HTWA) mechanism in the video encoder to reduce the training time by one-third. Our model achieves significant improvements on nine benchmarks, including paragraph-to-video retrieval, long-form video question-answering, and action recognition tasks. In the future, we will explore using it for broader scenarios, such as ego-centric video understanding.
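For readers unfamiliar with contrastive objectives between video clips and sentences, the sketch below shows a generic symmetric InfoNCE-style loss over paired clip and sentence embeddings. It is not the MTC loss itself, whose temporal components are not detailed here; the embeddings, the temperature, and all function names are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy where anchors[i] should match candidates[i]."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, c) / temperature for c in candidates]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)


def symmetric_contrastive_loss(clip_embs, sent_embs, temperature=0.1):
    """Average of clip-to-sentence and sentence-to-clip InfoNCE losses."""
    clips = [normalize(v) for v in clip_embs]
    sents = [normalize(v) for v in sent_embs]
    return 0.5 * (info_nce(clips, sents, temperature)
                  + info_nce(sents, clips, temperature))


# Toy 2-D embeddings (hypothetical): sentence i describes clip i.
clips = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(clips, aligned)
loss_swapped = symmetric_contrastive_loss(clips, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each clip is closest to its own sentence and large when the pairing is wrong, which is the signal a contrastive pre-training objective optimizes.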
Microsoft Research Causality and ML team features multiple papers and workshops at NeurIPS 2022
Parikshit Bansal, Ranveer Chandra, Eleanor Dillon, Saloni Dash, Rui Ding, Darren Edge, Adam Foster, Wenbo Gong, Shi Han, Agrin Hilmkil, Joel Jennings, Jian Jiao, Emre Kıcıman, Hua Li, Chao Ma, Sara Malvar, Robert Ness, Nick Pawlowski, Yashoteja Prabhu, Eduardo Rodrigues, Amit Sharma, Swati Sharma, Cheng Zhang, Dongmei Zhang
Identifying causal effects is an integral part of scientific inquiry, helping us to understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause-and-effect are also critical for the design and data-driven improvement and evaluation of business and technological systems we build today. The intersection of causal analysis and machine learning is driving rapid advances. Microsoft researchers are excited to be presenting three papers at NeurIPS, along with workshops on new methods and their applications. This includes work improving deep methods for causal discovery, applying causal insights to improve responsible language models, and improving soil carbon modeling with causal approaches. To accelerate research and broaden adoption of the latest causal methods, Microsoft researchers are co-organizing the Workshop on Causality for Real-world Impact and releasing new no-code interactive ShowWhy tools for causal discovery and analysis. We encourage NeurIPS attendees to learn more via the links below or stop by the Microsoft booth for demos and talks.
Main conference papers
Workshop papers
Workshop on Causality for Real-world Impact
- A Causal AI Suite for Decision-Making
- The Counterfactual-Shapley Value: Attributing Change in System Metrics
- Counterfactual Generation Under Confounding
- Deep End-to-end Causal Inference
- Rhino: Deep Causal Temporal Relationship Learning with history-dependent noise
- Causal Reasoning in the Presence of Latent Confounders via Neural ADMG Learning
Workshop on Tackling Climate Change with Machine Learning
Workshop on Distribution Shifts
Workshop on Understanding Deep Learning Through Empirical Falsification («I can’t believe it’s not better»)
We’ll be participating in the panel.
Causal AI Software Resources
- Download Causal No-Code Tools (ShowWhy)
New research on generative models
Two papers covering new research on generative models will be presented at NeurIPS 2022.
Vikas Raunak, Matt Post, Arul Menezes
The first paper, Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models, presents recommendations on the evaluation of state-of-the-art generative models for constrained generation tasks. Progress on generative models has been rapid in recent years. These large-scale models have had three impacts: 1) The fluency of generation in both language and vision modalities has rendered common average-case evaluation metrics much less useful in diagnosing system errors; 2) The same substrate models now form the basis of a number of applications, driven both by the utility of their representations and by phenomena such as in-context learning, which raise the abstraction level of interacting with such models; 3) The user expectations around these models have made the technical challenge of out-of-domain generalization much less excusable in practice. Yet our evaluation methodologies haven’t adapted to these changes. More concretely, while the associated utility and methods of interacting with generative models have expanded, a similar expansion has not been observed in their evaluation practices. In this paper, we argue that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted, and we provide recommendations for doing so. Our recommendations are based on leveraging specifications as a powerful instrument to evaluate generation quality and are readily applicable to a variety of tasks.
The second paper is Rank-One Editing of Encoder-Decoder Models. Here, we look at large sequence-to-sequence models for tasks such as neural machine translation (NMT), which are usually trained over hundreds of millions of samples. However, training is just the start of a model’s life-cycle. Real-world deployments of models require further behavioral adaptations as new requirements emerge or shortcomings become known. Typically, in the space of model behaviors, behavior deletion requests are addressed through model retraining, whereas model finetuning is done to address behavior addition requests. Both procedures are instances of data-based model intervention. In this work, we present a preliminary study investigating rank-one editing as a direct intervention method for behavior deletion requests in encoder-decoder transformer models. We propose four editing tasks for NMT and show that the proposed editing algorithm achieves high efficacy, while requiring only a single positive example to fix an erroneous (negative) model behavior. This research therefore explores a path towards fixing the deleterious behaviors of encoder-decoder models for tasks such as translation, making them safer and more reliable without investing in a huge computational budget.
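As background on what a rank-one edit does to a weight matrix, the sketch below applies the textbook update W' = W + (v - Wk)kᵀ / (kᵀk), which forces the edited matrix to map a chosen key vector k to a target value v while leaving inputs orthogonal to k untouched. It is a generic linear-algebra illustration and not the editing algorithm from the paper; which layer is edited and how keys and values are derived are not covered here, and the matrix and vectors are made-up numbers.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, key, value):
    """Return W + (value - W key) key^T / (key^T key).

    The result maps `key` exactly to `value`; the change to W has rank one.
    """
    residual = [v - y for v, y in zip(value, matvec(W, key))]
    kk = sum(k * k for k in key)
    return [[w + r * k / kk for w, k in zip(row, key)]
            for row, r in zip(W, residual)]


# Toy example (hypothetical numbers).
W = [[1.0, 2.0], [3.0, 4.0]]
key = [1.0, 0.0]      # input whose output we want to change
target = [0.0, 0.0]   # desired output for that input
other = [0.0, 1.0]    # an input orthogonal to the key

W_edit = rank_one_edit(W, key, target)
print(matvec(W_edit, key))
print(matvec(W_edit, other) == matvec(W, other))
```

A single key-value pair is enough to define the update, which is why this style of intervention needs no retraining data beyond the example being fixed.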
- Venue: The Second Workshop On Interactive Learning For Natural Language Processing (InterNLP 2022)
Award Winner: A Neural Corpus Indexer for Document Retrieval
Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Allen Sun, Weiwei Deng, Qi Zhang, Mao Yang
Note: this paper was named an Outstanding Paper at NeurIPS 2022
Current state-of-the-art document retrieval solutions typically follow an index-retrieve paradigm, where the index is not directly optimized towards the final target. The proposed Neural Corpus Indexer (NCI) model instead leverages a sequence-to-sequence architecture, which serves as a model-based index that takes a query as input and outputs the most relevant document identifiers. For the first time, we demonstrate that an end-to-end differentiable document retrieval model can significantly outperform both sparse inverted index and dense retrieval methods. Specifically, NCI achieves relative improvements of +17.6% for Recall@1 on the NQ320k dataset and +16.8% for R-Precision on the TriviaQA dataset, as well as a competitive MRR without using an explicit re-ranking model. This work has received a NeurIPS 2022 Outstanding Paper award.
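For reference, the retrieval metrics named above can be computed as in the sketch below. It assumes each query has exactly one relevant document, and the queries, document identifiers, and scores are toy values that have nothing to do with the NCI experiments; the helper names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_id, k=1):
    """1.0 if the relevant document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.2 means +20%."""
    return (new - baseline) / baseline


# Toy results: (ranked document ids returned for a query, the relevant id).
results = [
    (["d3", "d1", "d7"], "d3"),  # relevant document ranked first
    (["d2", "d5", "d9"], "d5"),  # relevant document ranked second
    (["d4", "d8", "d6"], "d0"),  # relevant document not retrieved
]

recall_at_1 = sum(recall_at_k(r, rel, k=1) for r, rel in results) / len(results)
mrr = sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)
print(recall_at_1, mrr)
```

A "relative" gain is measured against the baseline score, so `relative_improvement(0.6, 0.5)` corresponds to +20% even though the absolute difference is 0.1.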
The pipeline is composed of three stages. In the first stage, documents are encoded into semantic identifiers by the hierarchical k-means algorithm. In the second stage, a query generation model is employed to prepare
Microsoft Research career opportunities – come join us!
We’re hiring for multiple roles including internships and researchers at all levels in multiple Microsoft Research labs. Join us and work on causal ML, precision health, genomics, deep learning, robotics, or computational chemistry. If you’re attending the conference, stop by the Microsoft booth (Expo Hall G, Booth #202) to speak with researchers and recruiters about working at Microsoft and open job opportunities. Or you can browse our current openings at NeurIPS 2022 – Microsoft Research career opportunities.