About
I am currently a senior researcher in the Systems and Networking Group at Microsoft Research Asia (MSRA). I received my Ph.D. degree from the University of Science and Technology of China (USTC) in 2018, and my B.S. degree from USTC in 2013. My research interests lie broadly in AI algorithms. Since joining MSRA, my research has focused on inventing novel algorithms for improving AI inference efficiency, including: (1) compression for pre-trained transformer models and LLMs, and (2) hardware-aware NAS for edge AI.
These days, I am deeply interested in exploring and addressing cutting-edge research problems in LLMs and AGI. Topics I am actively working on and thinking about:
- Long-context LLM
- LLM self-play
Update: I’m seeking several research interns. If you’re interested in long-context LLMs and LLM self-play, feel free to contact me (lzhani@microsoft.com).
News:
- 2024-9: rStar has been recommended as a key technique in OpenAI o1-like approaches. Read more: Awesome-LLM-Strawberry, technical blogs
- 2024-8: We introduce rStar, a self-play mutual reasoning approach that can significantly improve SLM reasoning capabilities during inference! We reveal that SLMs already exhibit strong reasoning capabilities before domain-specialized SFT. For LLaMA2-7B, rStar boosts GSM8K accuracy from 12.51% to 63.91%. Our work is featured on Huggingface Daily Papers and 机器之心
- 2024-8: Phi3.5-128k LLMs are released! We made significant improvements in LongRoPE for recovering short-context performance during context window extension.
- 2024-7: LongRoPE has been open-sourced: https://github.com/microsoft/LongRoPE
- 2024-6: We have made improvements to LongRoPE for Phi3-mini-128k in the June update, with significantly enhanced long-context capabilities.
- 2024-4: LongRoPE has been integrated into the Microsoft Phi-3 family for long context support! (See the usage sketch after this news list.)
- 2024-2: LLMs can now read all eight Harry Potter books in a single inference! Excited to share LongRoPE! For the first time, LongRoPE extends the context window of pre-trained LLMs to an impressive 2048k tokens. We will release the code and extended LLMs (LongRoPE-LLaMA2-7B-2048k and LongRoPE-Mistral-7B-2048k) soon. Our work is featured on Huggingface Daily Papers.
- 2023-12: We introduce CoT-Influx to boost LLM math reasoning capability. Without any fine-tuning, LLaMA2-70B with CoT-Influx surpasses GPT-3.5 and a wide range of larger LLMs (PaLM, Minerva 540B, etc.) on GSM8K. Our work is featured on the Microsoft blog: https://mp.weixin.qq.com/s/ewg8ue6QRy2D7BIUy4r32A
- 2023-10: Excited to share our work on a collaborative structured pruning paradigm for LLMs! We can efficiently prune LLaMA-7B to 5.4B parameters while preserving its original performance.
- 2023-10: Our work is featured on the Microsoft Research blog: “Efficient and hardware-friendly neural architecture search with SpaceEvo”
- 2023-7: Code for ToP has been released!
- 2023-7: paper “LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search” accepted to NSDI 2024 Spring
- 2023-7: paper “ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices” accepted to ICCV 2023
- 2023-7: paper “SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference” accepted to ICCV 2023
- 2023-5: paper “Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference” accepted to SIGKDD 2023
- 2023-5: paper “Accurate and Structured Pruning for Efficient Automatic Speech Recognition” accepted to InterSpeech 2023
- 2023-1: paper “On Modular Learning of Distributed Systems for Predicting End-to-End Latency” accepted to NSDI 2023
- 2022-8: paper “SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance” accepted to CIKM 2022 (Applied Research Track)
- 2022-5: shipped two pruned BERT models (by SwiftPruner) to Bing Ad Relevance
- 2021-12: paper “Towards Efficient Vision Transformers Inference: A First Study of Transformers on Mobile Devices” accepted to HotMobile 2022
- 2021-8: we open-sourced nn-Meter, https://github.com/microsoft/nn-Meter
- 2021-6: paper “nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices” accepted to MobiSys 2021, winning the Best Paper Award; nn-Meter was also selected as a SIGMOBILE Research Highlight
- 2021-1: paper “To Bridge Neural Network Design and Real-World Performance: A Behaviour Study for Neural Networks” accepted to MLSys 2021
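
For readers who want to try the long-context Phi-3 checkpoints mentioned above, here is a minimal sketch of loading a 128k-context model with the Hugging Face transformers library. This is not the official LongRoPE code; the model id (microsoft/Phi-3-mini-128k-instruct), dtype, device settings, and prompt are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): run a long-context prompt through a
# Phi-3 128k checkpoint whose extended context window was enabled by LongRoPE.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # pick a dtype suited to the available hardware
    device_map="auto",       # requires the `accelerate` package
    trust_remote_code=True,  # Phi-3 checkpoints ship custom modeling code
)

# A very long document can be placed here, up to the 128k-token context window.
prompt = "Summarize the following document:\n" + "..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```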