1-bit large language models (LLMs) have emerged as a promising approach to making generative AI more accessible and affordable. By representing model weights with a very limited number of bits, 1-bit LLMs dramatically reduce the memory and compute needed to run them.
Microsoft Research has been pushing the limits of 1-bit LLMs with its BitNet architecture. In a new paper, the researchers introduce BitNet a4.8, a technique that further improves the efficiency of 1-bit LLMs without sacrificing their performance.
The rise of 1-bit LLMs
Traditional LLMs use 16-bit floating-point numbers (FP16) to represent their parameters. This requires large amounts of memory and compute, which limits the accessibility and deployment options of LLMs. 1-bit LLMs address this challenge by drastically reducing the precision of model weights while matching the performance of full-precision models.
Previous BitNet models used 1.58-bit values (-1, 0, 1) to represent model weights and 8-bit values for activations. While this approach significantly reduces memory and I/O costs, the computational cost of matrix multiplication remains a bottleneck, and optimizing neural networks with such extremely low-bit parameters is difficult.
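For readers who want a concrete picture, here is a minimal sketch of how BitNet-style ternary weight quantization can be expressed: the weight matrix is scaled by its mean absolute value and each entry is rounded to -1, 0 or 1. The function below is illustrative only and is not Microsoft's implementation.

```python
import torch

def ternary_quantize(weights: torch.Tensor):
    """Round a full-precision weight matrix to {-1, 0, 1} (illustrative sketch).

    Scaling by the mean absolute value ("absmean") before rounding follows the
    general recipe described for BitNet b1.58; the production kernels differ.
    """
    scale = weights.abs().mean().clamp(min=1e-5)       # per-tensor absmean scale
    ternary = (weights / scale).round().clamp(-1, 1)   # values in {-1, 0, 1}
    return ternary, scale                              # scale is folded back in after the matmul
```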
Two techniques can help address this problem. Sparsification reduces the number of computations by pruning activations with smaller magnitudes. This is especially useful in LLMs because activation values tend to follow a long-tailed distribution, with a few very large values and many small ones.
Quantization, on the other hand, uses fewer bits to represent activations, reducing the computational and memory cost of processing them. However, simply reducing activation precision can result in significant quantization errors and performance degradation.
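As a rough, hypothetical illustration of the two techniques, the sketch below keeps only the largest-magnitude activations (sparsification) and then maps the survivors to a small signed-integer range (quantization). The keep ratio and bit-widths are placeholders, not the paper's exact recipe.

```python
import torch

def sparsify_topk(x: torch.Tensor, keep_ratio: float = 0.55) -> torch.Tensor:
    """Zero out all but the largest-magnitude activations (illustrative)."""
    k = max(1, int(x.numel() * keep_ratio))
    threshold = x.abs().flatten().kthvalue(x.numel() - k + 1).values
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

def quantize_absmax(x: torch.Tensor, bits: int = 4):
    """Map activations to signed integers of the given bit-width via absmax scaling."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = x.abs().max().clamp(min=1e-5) / qmax
    x_q = (x / scale).round().clamp(-qmax, qmax)
    return x_q, scale
```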
Furthermore, combining sparsification and quantization is not straightforward and poses particular challenges when training 1-bit LLMs.
“Both quantization and sparsification introduce non-differentiable operations, which makes gradient calculations particularly difficult during training,” Furu Wei, partner research manager at Microsoft Research, told VentureBeat.
Gradient calculations are essential for calculating errors and updating parameters when training neural networks. The researchers also needed to ensure that their technique could be efficiently implemented on existing hardware while preserving the benefits of both sparsification and quantization.
BitNet a4.8
BitNet a4.8 addresses the challenge of optimizing 1-bit LLMs through what the researchers call “hybrid quantization and sparsification.” They designed an architecture that selectively applies quantization or sparsification to different components of the model, based on the specific distribution patterns of their activations. The architecture uses 4-bit activations for the inputs to the attention and feed-forward network (FFN) layers, and 8-bit sparsification for intermediate states, keeping only the top 55% of the parameters. The architecture is also optimized to take advantage of existing hardware.
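To make the hybrid idea tangible, here is a simplified, assumption-laden sketch of how activations might be routed in one transformer block, reusing the illustrative sparsify_topk and quantize_absmax helpers above. The real BitNet a4.8 kernels, gating and layer layout are more involved.

```python
def hybrid_block_activations(attn_input, ffn_input, ffn_intermediate, keep_ratio=0.55):
    """Illustrative BitNet a4.8-style activation routing (not the actual kernels).

    - Inputs to attention and FFN layers: 4-bit quantization.
    - Intermediate FFN states: sparsified, then kept at 8 bits.
    """
    attn_q = quantize_absmax(attn_input, bits=4)
    ffn_q = quantize_absmax(ffn_input, bits=4)
    inter_q = quantize_absmax(sparsify_topk(ffn_intermediate, keep_ratio), bits=8)
    return attn_q, ffn_q, inter_q
```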
“In BitNet b1.58, the bottleneck of 1-bit LLM inference switches from memory/IO to computation, which is constrained by the activation bits (i.e., 8 bits in BitNet b1.58),” Wei said. “BitNet a4.8 pushes the activation bits to 4 bits, so it can leverage 4-bit kernels (such as INT4/FP4) to double the speed of LLM inference on GPU devices. Combining the 1-bit model weights of BitNet b1.58 with the 4-bit activations of BitNet a4.8 effectively addresses both the memory/IO and the computational constraints.”
BitNet a4.8 also uses 3-bit values to represent the key (K) and value (V) states in the attention mechanism. The KV cache is a crucial component of transformer models: it stores the representations of previous tokens in the sequence. By lowering the precision of KV cache values, BitNet a4.8 further reduces memory requirements, especially when processing long sequences.
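For intuition, a 3-bit KV cache entry could be produced with the same kind of absmax scaling, as in the sketch below; the article does not describe BitNet a4.8's actual 3-bit format, so this is only an assumed illustration.

```python
import torch

def quantize_kv_3bit(kv: torch.Tensor):
    """Quantize key/value states to a signed 3-bit range (illustrative sketch)."""
    qmax = 2 ** (3 - 1) - 1                                        # 3 levels per side: -3..3
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / qmax
    kv_q = (kv / scale).round().clamp(-qmax, qmax).to(torch.int8)  # stored in wider int containers
    return kv_q, scale
```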
The promise of BitNet a4.8
Experimental results show that BitNet a4.8 delivers performance comparable to its predecessor, BitNet b1.58, while using less compute and memory.
Compared to full-precision Llama models, BitNet a4.8 reduces memory usage by a factor of 10 and achieves a 4x speedup. Compared to BitNet b1.58, it achieves a 2x speedup through 4-bit activation kernels. But the design can deliver much more.
“The estimates of the computational gains are based on existing hardware (GPUs),” Wei said. “With hardware specifically optimized for 1-bit LLMs, the computational improvements can be significantly enhanced. BitNet introduces a new computational paradigm that minimizes the need for matrix multiplication, which is a primary focus in current hardware design optimization.”
BitNet a4.8’s efficiency makes it particularly suitable for deploying LLMs at the edge and on resource-constrained devices. This can have important privacy and security implications: by running LLMs on-device, users can benefit from the capabilities of these models without sending their data to the cloud.
Wei and his team continue to work on 1-bit LLMs.
“We continue to advance our research and our vision toward the era of 1-bit LLMs,” Wei said. “While our current focus is on the model architecture and software support (i.e., bitnet.cpp), we aim to explore the co-design and co-evolution of model architecture and hardware to fully unlock the potential of 1-bit LLMs.”