Two papers discussing quantization methods specialized for MoE LLMs, accepted to an ICML 2026 workshop

 

Hancheol Park, Ph.D.
AI Research Engineer, NetsPresso Tech, Nota AI

Geonho Lee
Edge AI Engineer Intern, NetsPresso Tech, Nota AI

Tae-Ho Kim
CTO & Co-Founder, Nota AI

 

Summary

  • Two of our papers on quantization methods specialized for MoE LLMs have been accepted to an ICML 2026 workshop organized by Amazon (https://adaptfm.gitlab.io/).

  • DREAM-MoE: A PTQ method for MoE LLMs that preserves routing-critical expert orderings and uses downstream router supervision to reduce quantization-induced routing errors without adding inference-time overhead.

  • SRA-MoE: A selective router-alignment method that focuses on routing shifts that actually affect model outputs, improving MoE quantization by prioritizing output-critical tokens.

 

Introduction

Mixture-of-Experts (MoE) LLMs have become an important architecture for scaling large language models efficiently, since only a subset of experts is activated for each token. However, despite this sparse computation, deployment remains memory-intensive because all expert parameters must still reside in memory. Low-bit post-training quantization (PTQ) is therefore essential for practical MoE LLM deployment.

Unlike dense LLMs, however, MoE models introduce an additional challenge: quantization can perturb router outputs and change which experts are selected. This means the quantized model may not simply approximate the full-precision model with small numerical errors; it may execute a different expert pathway altogether. Our two studies address this MoE-specific quantization problem from a routing-stability perspective.

 

Key Messages of the Papers

The key message is that accurate MoE quantization requires preserving routing decisions that actually matter for model behavior. Instead of treating MoE quantization as only a weight reconstruction problem, these works show that expert selection, expert ordering, routing margins, and output-relevant routing shifts must be explicitly considered during PTQ.

 

Significance/Importance of the Papers

In dense models, quantization errors usually appear as continuous perturbations in activations or outputs. In MoE models, small perturbations can change the selected top-k experts, causing a discrete change in the computation path. This makes MoE quantization uniquely sensitive to router behavior.
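The discrete nature of this failure can be seen in a toy example. The sketch below uses made-up router logits for a single token and a made-up perturbation standing in for quantization error; the values and the `top_k` helper are illustrative assumptions, not taken from the papers.

```python
def top_k(logits, k):
    """Return the indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

# Hypothetical full-precision router logits for one token over four experts.
fp_logits = [2.00, 1.51, 1.50, 0.20]
# The same logits after a small, invented quantization perturbation.
q_logits = [1.98, 1.49, 1.52, 0.21]

fp_experts = top_k(fp_logits, k=2)   # experts chosen by the full-precision router
q_experts = top_k(q_logits, k=2)     # experts chosen after the perturbation

# Largest per-expert logit change caused by the perturbation.
max_shift = max(abs(a - b) for a, b in zip(fp_logits, q_logits))

print("full-precision top-2:", fp_experts)
print("quantized top-2:     ", q_experts)
print("largest logit shift: ", round(max_shift, 3))
```

No logit moves by more than about 0.02, yet the second selected expert changes, so the token is processed by a different expert rather than by a slightly noisier version of the same one.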

The broader implication is that MoE LLM compression needs MoE-aware objectives: not just minimizing reconstruction error, but preserving the routing structure that determines how tokens flow through experts.

 

Summary of Methodology

We developed two complementary PTQ strategies for improving routing stability in quantized MoE LLMs.

The first strategy, DREAM-MoE, focuses on routing-critical expert ordering. Existing router-aware PTQ methods often match router logits between the quantized and full-precision models. However, matching logit values does not necessarily preserve the relative ordering of experts, and even a small change near the top-k boundary can replace one selected expert with another. To address this, DREAM-MoE preserves pairwise margins among the selected experts and the unselected experts closest to the boundary, directly targeting the expert-ranking relationships that determine top-k routing. It also extends this idea to the next MoE router, so each quantized block is regularized not only by its local reconstruction quality but also by how much it disrupts downstream routing. This downstream supervision is added only as a calibration-time objective, so it introduces no extra inference-time modules or computational overhead.
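One way to picture a margin-preserving objective is a hinge penalty on pairs of experts near the top-k boundary. The sketch below is only an illustration under assumptions: the function name `pairwise_margin_loss`, the `boundary` parameter, and the hinge form are hypothetical and are not the loss defined in the paper. It covers the local margin term only; a downstream variant would apply the same kind of penalty to the logits of the next router.

```python
def pairwise_margin_loss(fp_logits, q_logits, k=2, boundary=1):
    """Hinge penalty on margin shrinkage among top-k and near-boundary experts.

    Hypothetical illustration: `boundary` is the number of unselected experts
    just below the top-k cut that are included in the comparison.
    """
    # Rank experts by the full-precision router logits, highest first.
    order = sorted(range(len(fp_logits)), key=lambda i: fp_logits[i], reverse=True)
    candidates = order[:k + boundary]  # selected experts plus nearest unselected ones
    penalties = []
    for a in range(len(candidates)):
        for b in range(a + 1, len(candidates)):
            hi, lo = candidates[a], candidates[b]
            fp_margin = fp_logits[hi] - fp_logits[lo]  # non-negative by construction
            q_margin = q_logits[hi] - q_logits[lo]
            # Penalize only when quantization shrinks or inverts the margin.
            penalties.append(max(0.0, fp_margin - q_margin))
    return sum(penalties) / len(penalties)

# Invented logits for one token over four experts.
fp = [2.00, 1.51, 1.50, 0.20]
q_flipped = [1.98, 1.49, 1.52, 0.21]   # experts 1 and 2 swap order
q_shifted = [x - 0.30 for x in fp]     # large logit error, ordering untouched

loss_flipped = pairwise_margin_loss(fp, q_flipped)
loss_shifted = pairwise_margin_loss(fp, q_shifted)
print(loss_flipped > loss_shifted)
```

The uniformly shifted logits differ from the full-precision values by 0.30 everywhere but keep every margin, so the penalty is essentially zero, while the much smaller perturbation that swaps two experts is penalized. A plain logit-matching loss would rank the two cases the other way around.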

The second strategy, SRA-MoE, focuses on output-aware selective router alignment. A key observation is that not all routing shifts are equally harmful: many tokens show substantial routing changes but almost no output discrepancy. Therefore, instead of uniformly aligning routing behavior for every token, SRA-MoE first identifies the tokens whose output distributions change meaningfully after quantization, then applies router alignment primarily to those tokens. This makes the alignment objective more targeted and avoids spending optimization capacity on routing shifts that do not affect model behavior.
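The selection step can be sketched as a two-stage filter. The code below assumes a KL-divergence test on output distributions with a fixed threshold and a mean-squared router-logit gap as the alignment loss; both choices, along with every function name and number, are assumptions made for illustration and may differ from what the paper uses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def select_tokens(fp_out, q_out, threshold):
    """Indices of tokens whose output distribution moved more than threshold."""
    return [t for t, (fp, q) in enumerate(zip(fp_out, q_out))
            if kl_divergence(softmax(fp), softmax(q)) > threshold]

def router_alignment_loss(fp_router, q_router, tokens):
    """Mean squared router-logit gap, averaged over the given tokens only."""
    if not tokens:
        return 0.0
    per_token = [sum((a - b) ** 2 for a, b in zip(fp_router[t], q_router[t]))
                 / len(fp_router[t]) for t in tokens]
    return sum(per_token) / len(per_token)

# Invented output logits (3 tokens, 3 classes) before and after quantization.
fp_out = [[3.0, 0.0, 0.0], [1.0, 0.9, 0.0], [0.0, 2.0, 0.0]]
q_out = [[3.0, 0.0, 0.0], [0.2, 1.6, 0.0], [0.0, 2.0, 0.05]]
# Invented router logits (3 tokens, 2 experts). Token 0 flips its expert
# ranking, yet its output above is unchanged.
fp_router = [[2.0, 1.0], [1.5, 1.4], [0.5, 2.5]]
q_router = [[1.0, 2.0], [1.3, 1.6], [0.5, 2.4]]

selected = select_tokens(fp_out, q_out, threshold=0.01)  # threshold is hypothetical
selective = router_alignment_loss(fp_router, q_router, selected)
uniform = router_alignment_loss(fp_router, q_router, [0, 1, 2])
print(selected, round(selective, 4), round(uniform, 4))
```

In this toy data, token 0 has the largest routing shift but an unchanged output, so it dominates the uniform loss while contributing nothing to the selective one. The selective loss is driven entirely by token 1, the only token whose output actually moved.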

Together, the two studies suggest a common principle: MoE PTQ should preserve the routing behavior that changes outputs, rather than blindly minimizing all router differences.

 

Experimental Results

Across three MoE LLMs and two low-bit weight-only quantization settings (3-bit and 4-bit), DREAM-MoE achieved the best average downstream accuracy in 5 of the 6 model-bit settings and the lowest language-modeling perplexity in 5 of 6. At 4-bit, it improved average downstream accuracy over the strongest baseline by +1.30 points when averaged across the three models. At 3-bit, it achieved the best perplexity on all three evaluated models and improved average downstream accuracy on two of the three.

In the 4-bit setting, the method improved average downstream accuracy from 39.46 → 40.54, 49.64 → 50.25, and 64.06 → 66.27 on three representative MoE models, respectively, compared with the strongest baseline in each case. The largest improvement, +2.21 points, came on the model with the highest baseline accuracy.
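The per-model gains and their mean can be checked directly from the numbers above. The snippet uses only the accuracy pairs quoted in this section; the variable names are arbitrary.

```python
# (strongest baseline, proposed method) average downstream accuracy at 4-bit.
pairs = [(39.46, 40.54), (49.64, 50.25), (64.06, 66.27)]

# Gain per model, rounded to two decimals to suppress floating-point noise.
gains = [round(after - before, 2) for before, after in pairs]
mean_gain = round(sum(gains) / len(gains), 2)

print("per-model gains:", gains)
print("mean gain:", mean_gain)
```

The three gains are 1.08, 0.61, and 2.21 points, and their mean is 1.30 points.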

For SRA-MoE, the output-aware selective alignment method, only about 49% of tokens in the 4-bit setting and 57% of tokens in the 3-bit setting were selected for alignment, meaning a large fraction of tokens were excluded because their outputs were already preserved after quantization. Despite using fewer alignment tokens, the method generally matched or improved over conventional router alignment.

On a representative model, selective alignment improved 3-bit quantized performance from 69.58 → 72.08 on one reasoning benchmark, 52.50 → 55.42 on another, and 64.73 → 67.17 on a broad knowledge benchmark compared with uniform router alignment. In 4-bit quantization, it also improved one reasoning benchmark from 67.50 → 70.00 and another from 60.73 → 61.87.

The method also scaled to larger MoE models. In 4-bit quantization, selective alignment improved representative results such as 73.33 → 76.67, 60.00 → 63.75, and 63.54 → 65.79 over the base PTQ method on one large MoE model, and 64.58 → 67.92, 68.18 → 69.26, and 61.69 → 62.96 on another.

 

Conclusion

These studies show that routing stability is a central bottleneck in low-bit MoE LLM quantization. The main lesson is that MoE PTQ should not only reconstruct weights or activations, but also preserve the expert-selection structure that determines the model’s computation path.

By preserving routing-critical expert orderings, accounting for downstream router behavior, and selectively aligning only output-relevant routing shifts, these methods provide a practical direction for accurate and efficient MoE LLM deployment under low-bit quantization.

 

If you're curious about Nota AI's model optimization technology, find us at NetsPresso®.
