Shijie Cao

Senior Researcher

About

I am Shijie Cao (曹士杰), a senior researcher in the Systems Research Group at Microsoft Research Asia (MSRA). I received my Ph.D. in Computer Science from Harbin Institute of Technology (HIT) in 2021 through a joint Ph.D. program with MSRA, under the supervision of Dr. Hsiao-Wuen Hon and Prof. Lanshun Nie. Prior to that, I earned my B.E. in Computer Science from HIT in 2016. From 2015 to 2021, I was a long-term intern in the systems area at MSRA, mentored by Dr. Ningyi Xu and Dr. Lintao Zhang.

My research interests lie at the intersection of computer systems/architecture and deep learning, including domain-specific architectures, software-hardware co-design, and deep learning compression and acceleration. More recently, my research has focused on low-bit and sparse large language models and their efficient computation in systems and hardware.

Please feel free to contact me for internships and collaborations at shijiecao@microsoft.com.

 

-News-

Nov 2024 SeerAttention code is open-sourced at microsoft/SeerAttention.

Oct 2024 We released SeerAttention, a new attention mechanism that augments conventional attention with a learnable gate that adaptively selects significant blocks in the attention map and treats the remaining blocks as sparse. By natively learning attention sparsity in LLMs, SeerAttention achieves a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67× speedup over FlashAttention-2.
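
To illustrate the gating idea, here is a small PyTorch sketch of my own (a simplification for exposition, not the released SeerAttention implementation; gate_q and gate_k stand in for the learnable gate and are assumed to be nn.Linear layers). Block-level summaries of queries and keys are scored by the gate, only the top-scoring key blocks are kept for each query block, and the rest of the attention map is masked out. A real kernel would skip the pruned blocks entirely rather than compute and mask them.

import torch

def gated_block_sparse_attention(q, k, v, gate_q, gate_k, block=64, keep_ratio=0.1):
    # q, k, v: [batch, heads, seq, dim]; gate_q, gate_k: learnable nn.Linear(dim, dim).
    B, H, S, D = q.shape
    nb = S // block
    # Block-level summaries via mean pooling over each block of tokens.
    q_blk = q.view(B, H, nb, block, D).mean(dim=3)
    k_blk = k.view(B, H, nb, block, D).mean(dim=3)
    # The learnable gate scores every (query block, key block) pair.
    scores = torch.einsum("bhqd,bhkd->bhqk", gate_q(q_blk), gate_k(k_blk))
    # Keep only the top-scoring key blocks for each query block.
    n_keep = max(1, int(keep_ratio * nb))
    top = scores.topk(n_keep, dim=-1).indices
    block_mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, top, True)
    # Expand the block mask to token resolution and mask out pruned blocks.
    token_mask = block_mask.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
    attn = torch.einsum("bhqd,bhkd->bhqk", q, k) / D ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))
    return torch.einsum("bhqk,bhkd->bhqd", attn.softmax(dim=-1), v)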

Aug 2024  Our paper LUT Tensor Core is now available on arXiv.

Aug 2024  T-MAC is accepted to EuroSys 2025.

Jul 2024    T-MAC [paper, code] is now available on arXiv. T-MAC is a kernel library that directly supports mixed-precision matrix multiplication (int1/2/3/4 × int8/fp16/fp32) without dequantization by using lookup tables.
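
The lookup-table trick can be illustrated with a toy NumPy sketch of my own (not the T-MAC kernel, which operates on packed bit-planes with vectorized table lookups). For 1-bit weights grouped four at a time, all 16 possible partial sums of each activation group are precomputed once, so the matrix-vector product becomes table indexing and accumulation with no dequantized multiplies; multi-bit weights decompose into several such 1-bit planes.

import numpy as np

def lut_matvec_1bit(w_bits, x, g=4):
    # w_bits: [out, in] matrix of 1-bit weights in {0, 1}; x: [in] activations.
    out_dim, in_dim = w_bits.shape
    y = np.zeros(out_dim, dtype=np.float32)
    for start in range(0, in_dim, g):
        xg = x[start:start + g]
        # Precompute partial sums for all 2^g bit patterns of this activation group.
        table = np.zeros(1 << g, dtype=np.float32)
        for pattern in range(1 << g):
            bits = [(pattern >> i) & 1 for i in range(g)]
            table[pattern] = sum(b * v for b, v in zip(bits, xg))
        # Each output row just indexes the table with its packed weight bits.
        for row in range(out_dim):
            idx = 0
            for i in range(g):
                idx |= int(w_bits[row, start + i]) << i
            y[row] += table[idx]
    return y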

May 2024  BitDistiller [code] is accepted to the ACL 2024 main conference. AFPQ [code] is accepted to the ACL 2024 Findings.

Apr 2024   We released BitBLAS and T-MAC, libraries that support mixed-precision matrix multiplication on GPU and CPU respectively, specially designed for low-bit LLM deployment.

Mar 2024   Our paper Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation [code] is accepted to OSDI 2024.

Mar 2024   Our paper Pre-gated MoE is accepted to ISCA 2024.

Feb 2024   We released BitDistiller, a quantization-aware training (QAT) framework with self-distillation to enhance ultra-low-bit LLMs.
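
In spirit, the self-distillation objective treats the full-precision model as a teacher for its own low-bit quantized copy. The sketch below is my own minimal PyTorch rendition using a plain KL divergence (the paper's actual objective and training recipe differ); fp_model and quant_model are assumed to be HuggingFace-style causal LMs that return .logits.

import torch
import torch.nn.functional as F

def self_distillation_loss(fp_model, quant_model, input_ids, temperature=1.0):
    # The full-precision model acts as its own teacher for the low-bit student.
    with torch.no_grad():
        teacher_logits = fp_model(input_ids).logits / temperature
    student_logits = quant_model(input_ids).logits / temperature
    # KL divergence between teacher and student token distributions.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2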

Nov 2023   We released AFPQ, an asymmetric floating-point quantization method for LLMs.
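
The core idea, as I understand it, is to use separate scales for positive and negative weights, since low-bit FP grids are symmetric around zero while weight distributions are not. Below is a minimal PyTorch sketch of my own (not the released code); fp4_grid is a hypothetical tensor of the non-negative magnitudes representable in the target FP4 format, e.g. torch.tensor([0., 0.5, 1, 1.5, 2, 3, 4, 6]) for E2M1.

import torch

def afpq_quantize(w, fp4_grid):
    # Separate scales for the positive and negative halves of the weight tensor.
    scale_pos = (w.clamp(min=0).max() / fp4_grid.max()).clamp_min(1e-8)
    scale_neg = ((-w).clamp(min=0).max() / fp4_grid.max()).clamp_min(1e-8)

    def snap(x, scale):
        # Round each scaled magnitude to the nearest representable grid point.
        idx = (x.unsqueeze(-1) / scale - fp4_grid).abs().argmin(dim=-1)
        return fp4_grid[idx] * scale

    w_q = torch.where(w >= 0, snap(w, scale_pos), -snap(-w, scale_neg))
    return w_q, scale_pos, scale_neg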

May 2023   We released a comparative analysis of integer and floating-point formats for low-bit quantization of large language models. [paper]
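
As a toy illustration of what such a format comparison looks like (my own sketch, not the paper's methodology or results), one can compare per-tensor quantization error under an INT4 grid versus an FP4 (E2M1) grid on a Gaussian weight tensor:

import torch

def quant_mse(w, grid):
    # Scale to the grid's range, snap to the nearest level, and measure the error.
    scale = w.abs().max() / grid.abs().max()
    idx = (w.unsqueeze(-1) / scale - grid).abs().argmin(dim=-1)
    return (grid[idx] * scale - w).pow(2).mean()

w = torch.randn(1024, 1024)
int4_grid = torch.arange(-7., 8.)                        # symmetric INT4 levels
fp4_mags = torch.tensor([0., 0.5, 1, 1.5, 2, 3, 4, 6])   # E2M1 magnitudes
fp4_grid = torch.cat([-fp4_mags, fp4_mags])              # mirror to signed values
print("INT4 MSE:", quant_mse(w, int4_grid).item())
print("FP4  MSE:", quant_mse(w, fp4_grid).item())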

Feb 2023   Our paper nmSPARSE is accepted to MLSys 2023.

 

-Media-

Our series of research efforts on low-bit LLM systems and hardware (including BitBLAS, T-MAC, and LUT Tensor Core) is featured by 机器之心 and MSRA.

T-MAC is featured by 新智元 and 量子位.