Feng Yao$^{\star\dagger}$ Junxia Cui$^{\star}$ Ruohan Zhang$^{\star}$ Liyuan Liu$^{\dagger}$ Shibo Hao Li Zhang Chengyu Dong Shuohang Wang Yelong Shen Jianfeng Gao Jingbo Shang
$^{\dagger}$: Project Lead; $^{\star}$: Core Contributors; (Work in Progress)
UCSD, Microsoft
Last Updated on July 7, 2025; First Published on June 30, 2025 | GitHub: https://github.com/yaof20/DenseMixer
<aside>
We introduce DenseMixer, a novel and effective MoE post-training technique that makes MoE models easier to train and better performing.
By trading one extra forward pass over the inactive experts for a more precise router gradient, DenseMixer consistently outperforms the conventional method across different MoE scales (7B, 14B, 30B), architectures (with/without shared experts), pre-training methods (from scratch/up-cycling), and post-training data types (instruction/long-CoT data).
We provide a plug-and-play implementation of DenseMixer that can be enabled with a simple pip install densemixer. It is fully compatible with existing libraries (e.g., transformers, llama-factory, open-instruct, verl), can be combined with parameter-efficient methods (e.g., LoRA), and introduces no changes to inference.
</aside>
Figure 1. Performance gains of Qwen3-30B MoE after post-training with the conventional method vs. DenseMixer. Results are reported with the decoding parameters temperature = 0.6 and top-p = 0.95. Additional results under other decoding configurations are provided in the Empirical Results section.
MoE models are notoriously harder to train than dense models. The only architectural difference MoE introduces is its sparse routing mechanism, typically implemented with a Top-K router, which is mathematically non-differentiable. This non-differentiability blocks straightforward back-propagation and complicates gradient computation. We elaborate on this in the Technical Details section.
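To make this concrete, here is the standard Top-K MoE formulation, written in our own notation:

$$
y \;=\; \sum_{i \in \mathcal{K}(x)} g_i(x)\, E_i(x), \qquad
g(x) \;=\; \operatorname{softmax}\!\big(W_r x\big), \qquad
\mathcal{K}(x) \;=\; \operatorname{TopK}\big(g(x),\, K\big),
$$

where the $E_i$ are the expert networks and $W_r$ is the router weight. The selection $\mathcal{K}(x)$ is a discrete function of the router output: its derivative is zero almost everywhere and undefined at the selection boundaries, so naive back-propagation only sees gradients through the gate values $g_i(x)$ of the selected experts and never through the selection itself.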
To address this non-differentiability, we introduce DenseMixer for MoE post-training, which trades additional compute (one extra forward pass over the inactive experts) for a more precise estimate of the router gradient. We delve into the details in the Technical Details section.
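The sketch below illustrates the general "sparse forward, dense backward" idea in PyTorch. It is our own straight-through-style illustration, not the DenseMixer implementation: the class, layer shapes, and the exact way gradients are routed are assumptions, and the actual estimator in the repo may differ (for example, in how the expert weights themselves receive gradients).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative Top-K MoE layer with a dense-backward trick (names are ours)."""

    def __init__(self, dim: int, num_experts: int, k: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [tokens, dim]
        gates = F.softmax(self.router(x), dim=-1)           # [tokens, num_experts]
        topk_val, topk_idx = gates.topk(self.k, dim=-1)

        # Extra compute: run *all* experts, not just the Top-K ones.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # [tokens, E, dim]

        # Dense mixture over all experts: differentiable w.r.t. every gate value.
        dense_mix = torch.einsum("te,ted->td", gates, expert_out)

        # Sparse Top-K mixture: what a standard MoE layer (and inference) returns.
        # (Renormalization of the Top-K weights is omitted for brevity.)
        sparse_gates = torch.zeros_like(gates).scatter(-1, topk_idx, topk_val)
        sparse_mix = torch.einsum("te,ted->td", sparse_gates, expert_out)

        # Straight-through-style combination: the forward value equals the sparse
        # mixture, while gradients flow as if the output were the dense mixture,
        # so the router also receives signal from the inactive experts.
        return dense_mix + (sparse_mix - dense_mix).detach()
```

Note that the returned value is numerically identical to the standard sparse Top-K mixture, consistent with DenseMixer introducing no changes to inference; the training-time cost is the extra forward pass over the inactive experts.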
DenseMixer consistently outperforms the conventional method in downstream performance across different MoE scales (7B, 14B, 30B), architectures (with/without shared experts), pre-training methods (from scratch/up-cycling), and post-training data types (instruction/long CoT).
It is universally applicable to any MoE that uses a Top-K router and is trained with back-propagation. It works in a plug-and-play manner, compatible with existing training libraries (transformers, llama-factory, open-instruct, verl) and with parameter-efficient methods such as LoRA (a combined usage sketch follows the setup commands below).
To switch from the conventional method to DenseMixer, the only change you need is:
# Your current MoE training
python your_moe_training_script.py
# Switch to DenseMixer (no code changes needed!)
pip install densemixer
densemixer setup
export DENSEMIXER_ENABLED=1
python your_moe_training_script.py
Please refer to our GitHub repo for more details. → https://github.com/yaof20/DenseMixer
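To make the compatibility claim concrete, here is a hedged sketch of combining the setup above with a standard transformers + peft LoRA fine-tuning script. The model name, LoRA hyperparameters, and target modules are placeholders chosen for illustration; only pip install densemixer, densemixer setup, and DENSEMIXER_ENABLED=1 come from the instructions above.

```python
# Assumes `pip install densemixer`, `densemixer setup`, and
# `export DENSEMIXER_ENABLED=1` have already been run in the shell, as shown above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen3-30B-A3B"  # placeholder MoE checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Illustrative LoRA config; the right target modules depend on the architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ... continue with your usual training pipeline
# (e.g., transformers Trainer, llama-factory, open-instruct, verl).
```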
We conduct experiments with MoEs of varying scales, architectures, and pre-training recipes, training on both (relatively) short instruction data and long chain-of-thought reasoning data.