{"id":1123686,"date":"2025-02-05T09:32:32","date_gmt":"2025-02-05T17:32:32","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1123686"},"modified":"2025-02-05T09:32:36","modified_gmt":"2025-02-05T17:32:36","slug":"advances-to-low-bit-quantization-enable-llms-on-edge-devices","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/advances-to-low-bit-quantization-enable-llms-on-edge-devices\/","title":{"rendered":"Advances to low-bit quantization enable LLMs on edge devices"},"content":{"rendered":"\n
\"Three<\/figure>\n\n\n\n

Large language models (LLMs) are increasingly being deployed on edge devices: hardware that processes data locally, near its source, such as smartphones, laptops, and robots. Running LLMs on these devices supports advanced AI and real-time services, but their massive size, often billions of parameters, demands significant memory and computational power, limiting widespread adoption. Low-bit quantization, a technique that compresses models and reduces their memory demands, offers a solution by enabling more efficient operation.
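To make quantization concrete, here is a minimal sketch of one common scheme, symmetric per-tensor int4 quantization, written in Python with NumPy. The helper names (`quantize_int4`, `dequantize`) are illustrative assumptions, not APIs from the work described in this post.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor quantization to 4-bit integers in [-8, 7]."""
    scale = np.abs(w).max() / 7.0  # one floating-point scale shared by the whole tensor
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit values kept in int8 storage
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the 4-bit integers back to an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int4(w)
print(np.abs(w - dequantize(q, scale)).max())  # worst-case reconstruction error
```

Each weight now needs 4 bits instead of 32, an 8x reduction in storage, at the cost of the reconstruction error printed above.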

Recent advances in low-bit quantization have made mixed-precision matrix multiplication (mpGEMM) viable for LLMs. This technique multiplies matrices stored in different numeric formats, such as int8*int1, int8*int2, or FP16*int4. By combining a variety of precision levels, mpGEMM strikes a balance among speed, memory efficiency, and computational accuracy.
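As a reference for what such a kernel computes, the sketch below performs an int8*int2 mpGEMM in plain NumPy: each operand keeps its own format, and products are accumulated in a wider integer, the behavior that native mixed-precision hardware would provide. This illustrates the semantics only; it is not one of the optimized kernels discussed here.

```python
import numpy as np

# Illustrative int8*int2 mpGEMM: operands stay in their own low-bit formats.
a_int8 = np.random.randint(-128, 128, size=(2, 8), dtype=np.int8)  # int8 activations
w_int2 = np.random.randint(-2, 2, size=(8, 4), dtype=np.int8)      # 2-bit weights in int8 storage

# NumPy has no mixed int8*int2 matmul, so we widen to int32 here; on hardware
# with native mpGEMM support, the widening would happen only in the accumulator.
acc = a_int8.astype(np.int32) @ w_int2.astype(np.int32)

scale = 0.02  # example dequantization scale chosen at quantization time
y = acc.astype(np.float32) * scale  # rescale the accumulator to real-valued outputs
print(y.shape)  # (2, 4)
```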

However, most hardware supports only symmetric computations, meaning operations on operands of the same format, which creates challenges for mixed-precision calculations during general matrix multiplication (GEMM), a critical operation for LLMs. Overcoming these hardware limitations is essential to fully benefit from mpGEMM and to support asymmetric computations.
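On such symmetric-only hardware, a mixed-precision product is typically emulated by first upcasting the low-bit operand to the common format, as in the illustrative sketch below (an assumption about the conventional workaround, not a specific vendor's code path). The explicit conversion step is exactly the overhead that native mpGEMM support would remove.

```python
import numpy as np

# Typical workaround on symmetric-only hardware: dequantize the int4 weights
# to FP16 so a standard FP16*FP16 GEMM can run. The upcast weights occupy
# 4x the bits of the int4 originals, so the memory and bandwidth savings of
# quantization are largely lost at compute time.
a_fp16 = np.random.randn(2, 8).astype(np.float16)              # FP16 activations
w_int4 = np.random.randint(-8, 8, size=(8, 4), dtype=np.int8)  # 4-bit weights in int8 storage
scale = np.float16(0.05)

w_fp16 = w_int4.astype(np.float16) * scale  # explicit dequantization (the extra work)
y = a_fp16 @ w_fp16                         # symmetric FP16 GEMM the hardware supports
print(y.dtype, y.shape)  # float16 (2, 4)
```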

To unlock the potential of low-bit quantization on resource-constrained edge devices, hardware must natively support mpGEMM. To address this, we developed the following three approaches for computing kernels and hardware architectures: