SplitQuant: Layer Splitting for Low-Bit Neural Network Quantization for Edge AI Devices

 

Jaewoo Song

Software Engineer, Nota AI

 

Summary

  • This study proposes an AI model preprocessing method that improves quantization accuracy on edge AI devices, which often cannot support advanced quantization methods due to hardware limitations.

  • By splitting layers based on parameter clustering, the method achieves accuracies close to those of the original model using only basic linear quantization.

  • The study has been accepted for presentation at Edge AI Foundation Austin 2025.

 

Key Messages of the Paper

Edge AI devices often cannot support advanced quantization methods due to hardware limitations, so only the basic linear quantization method can be used on them. This study proposes an AI model preprocessing method that splits layers based on parameter clustering, allowing such devices to achieve quantized accuracies close to those of the original model using only basic linear quantization.

 

Significance/Importance of the Paper

As AI applications increasingly demand on-device processing—ranging from real-time video analysis and autonomous navigation to smart home automation—there is a growing need for models that can run efficiently on limited hardware.

By recovering accuracy lost to low-bit quantization, SplitQuant makes it feasible to deploy compact Transformer models on edge devices using only the basic linear quantization they support. This not only broadens the potential applications of advanced AI but also enhances energy efficiency and reduces operational costs, democratizing access to powerful AI solutions.

 

Summary of Methodology

SplitQuant improves quantization resolution by increasing the scaling factor. In basic linear quantization with bit-width b, the scaling factor is s = (2^b − 1) / (max − min), where max and min are the largest and smallest original parameter values. Since the bit-width is fixed, SplitQuant increases the scaling factor by decreasing the difference between the maximum and minimum values.
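The effect is easy to see numerically. Below is a minimal NumPy sketch of basic asymmetric linear quantization at INT2 (illustrative code, not the paper's implementation; the toy weight array and the 4.0 outlier are assumptions). A single outlier stretches the min-max range, shrinks the scaling factor, and destroys the resolution available to the remaining parameters:

```python
import numpy as np

def linear_quantize(x, bits=2):
    """Basic asymmetric linear quantization followed by dequantization.
    The scaling factor s maps the value range onto the integer grid."""
    qmax = 2 ** bits - 1
    s = qmax / (x.max() - x.min())   # larger s means finer resolution
    q = np.round((x - x.min()) * s)  # integers in [0, qmax]
    return q / s + x.min()           # back to float to measure the error

# Most parameters sit in a narrow band; one outlier stretches the range.
weights = np.linspace(-0.1, 0.1, 100)
with_outlier = np.append(weights, 4.0)

print(np.abs(linear_quantize(weights) - weights).mean())             # ~0.017
print(np.abs(linear_quantize(with_outlier)[:100] - weights).mean())  # ~0.1
```

Splitting the outlier into its own layer restores the fine grid for the remaining values, which is exactly the effect SplitQuant exploits.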

SplitQuant splits linear, convolution, and activation layers into multiple layers and recombines their outputs while preserving the functionality of the DNN. Since the range between the maximum and minimum values in each split layer is narrower than in the original layer, the quantization resolution improves.

The following figure illustrates how SplitQuant splits layers.

SplitQuant runs k-means clustering on the weights and biases. With k = 3, the weight and bias parameters are clustered into lower, middle, and upper clusters, with the initial cluster centroids selected by the greedy k-means++ algorithm. Three new layers are then created from the clustered parameters; for example, the lower layer consists of the lower clusters of the weight and bias parameters. The original layer is then replaced by the newly created layers. The shapes of the weights and biases in each new layer are preserved by injecting 0 where needed, as sketched below.
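As an illustration of this procedure, here is a minimal PyTorch sketch of splitting a linear layer. The function split_linear and its structure are assumptions for this example rather than the paper's implementation, and for brevity the bias is kept whole in one split instead of being clustered like the weights:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def split_linear(layer: nn.Linear, k: int = 3) -> nn.ModuleList:
    w = layer.weight.detach()
    # Cluster the weight values into k groups (sklearn's k-means++ init
    # is the greedy variant described in the paper).
    km = KMeans(n_clusters=k, init="k-means++", n_init=1).fit(
        w.reshape(-1, 1).numpy()
    )
    labels = torch.from_numpy(km.labels_).reshape(w.shape)
    order = km.cluster_centers_.ravel().argsort()  # lower, middle, upper
    parts = nn.ModuleList()
    for i, c in enumerate(order):
        part = nn.Linear(layer.in_features, layer.out_features)
        # Keep only this cluster's weights; inject 0 elsewhere so the
        # tensor shape (and the layer's interface) is unchanged.
        part.weight = nn.Parameter(
            torch.where(labels == int(c), w, torch.zeros_like(w))
        )
        # Simplification: full bias in the first split, zeros in the others.
        bias = layer.bias.detach().clone() if i == 0 else torch.zeros(layer.out_features)
        part.bias = nn.Parameter(bias)
        parts.append(part)
    return parts

layer = nn.Linear(16, 8)
parts = split_linear(layer)
x = torch.randn(4, 16)
# The split layers' summed outputs reproduce the original layer.
assert torch.allclose(sum(p(x) for p in parts), layer(x), atol=1e-5)
```

Because the three masked weight tensors have disjoint supports and sum exactly to the original weights, the combined outputs match the original layer, while each split now quantizes over a much narrower value range.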

Experimental Results

SplitQuant was applied to two BERT-Tiny models from Hugging Face, fine-tuned on DAIR.AI's emotion recognition dataset and on UC Irvine's SMS Spam Collection dataset for spam detection. These models were selected because BERT effectively represents the transformer architecture, and their relatively small size makes them suitable for tinyML and edge AI.

The effect of SplitQuant was most dramatic for INT2 quantization, where SplitQuant improved emotion recognition accuracy by 3.3%pt, from 86.5% to 89.8%. Spam detection accuracy also improved by 2.1%pt, from 96.2% to 98.3%. The improved accuracies were very close to the original FP32 accuracies of 90.2% and 98.4%, respectively. SplitQuant also improved the results for INT4 and INT8 quantization, albeit less dramatically than for INT2. The full experimental results are shown in the following table.

Conclusion

SplitQuant proves to be a highly effective method for improving the accuracy of low-bit quantization, such as INT2, which is especially vulnerable to outliers due to its low quantization resolution. By splitting each quantizable layer into three layers that are together mathematically equivalent to the original, SplitQuant preserves the important signals conveyed by outliers while simultaneously enhancing the quantization resolution. The use of k-means clustering to optimize the split of weights and biases refines this process further. SplitQuant can also be integrated with other quantization algorithms to enhance their performance. Tests on two fine-tuned BERT-Tiny language models demonstrated significant accuracy improvements of 3.3%pt and 2.1%pt with INT2 quantization, achieving accuracies comparable to those of the original FP32 models. Future research could explore the application of SplitQuant to large language models and investigate potential benefits from advances in sparse DNN technologies.


If you have any further inquiries about this research, please feel free to reach out to us at the following 📧 email address: contact@nota.ai.

Furthermore, if you are interested in AI optimization technologies, please visit our website at 🔗 netspresso.ai.
