Why this paper
Mixtral 8x7B (arXiv:2401.04088) is one of the few open-weights Mixture-of-Experts releases at frontier-quality scale. Most MoE work to date either stayed proprietary (GPT-4 is widely rumoured, though not confirmed, to be an MoE) or, like the Switch Transformer family, shipped at quality levels that didn't compete with the best dense models. Mixtral changes that.
What we want to surface: the routing mechanism is the entire contribution. The base architecture is Mistral 7B. The deltas are local — and they're the reason the model exists.
How the routing works
At each layer, instead of one feedforward block, there are 8. A small "router" at each layer (a single linear map in the paper's formulation) takes the current token's hidden state and emits 8 scores. The 2 highest-scoring experts process the token; their outputs are combined as a weighted sum, with the weights given by a softmax over those 2 scores.
Top-2 routing matters. Switch Transformer used top-1 (each token goes to exactly one expert). Top-2 runs twice as many experts per token as top-1, in exchange for what is generally argued to be more stable routing: gradients flow back through 2 experts at each layer, which is thought to mitigate the routing collapse seen in top-1 setups where one expert dominates.
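The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names `top2_route` and `moe_layer` are ours, the experts are stand-in callables, and we assume the softmax is taken over the two selected logits only.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, gate_weight), ...] for the 2 highest-scoring experts."""
    # Indices of the two largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax restricted to the selected logits (subtract the max for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(hidden, router_weights, experts):
    """hidden: list of floats; router_weights: one row per expert; experts: callables."""
    # Assumed router: a single linear map producing one logit per expert.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    out = [0.0] * len(hidden)
    for idx, gate in top2_route(logits):
        y = experts[idx](hidden)  # only the 2 selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Toy usage: 8 stand-in "experts", where expert k just scales its input by k.
toy_experts = [lambda h, k=k: [k * v for v in h] for k in range(8)]
toy_router = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [2.0, 0.0],
              [-1.0, 0.0], [0.5, 0.0], [0.2, 0.0], [0.1, 0.0]]
mixed = moe_layer([1.0, 2.0], toy_router, toy_experts)
```

In the toy usage, experts 2 and 3 win the routing, and the other six are never called. That is the compute saving in miniature.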
What the routing buys
Two decoupled axes:
- Capacity: 47B total parameters. Knowledge "stored" in the model is at this scale.
- Compute: 13B active parameters per token. Each forward pass is roughly equivalent to running a 13B dense model.
This decoupling is the whole point. At inference, you pay roughly a 13B dense model's compute per token. On quality, the paper reports results that match or beat Llama 2 70B on most benchmarks.
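To see where the 47B and 13B figures come from, here is a back-of-the-envelope parameter count. The hyperparameters are the ones listed in the paper's architecture table; the layer breakdown is our assumption (three weight matrices per SwiGLU expert, grouped-query attention, two norms per layer, untied embedding and output head), so treat it as a sanity check rather than an official accounting.

```python
# Hyperparameters as listed in the Mixtral paper's architecture table.
DIM, N_LAYERS, HIDDEN_DIM = 4096, 32, 14336
N_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
VOCAB, NUM_EXPERTS, TOP_K = 32000, 8, 2

def param_count(experts_counted):
    """Parameters when `experts_counted` experts per layer are included."""
    # Assumption: each expert is a SwiGLU block with three DIM x HIDDEN_DIM matrices.
    expert = 3 * DIM * HIDDEN_DIM
    # Grouped-query attention: full-width Q and output, narrower K and V.
    attn = 2 * DIM * N_HEADS * HEAD_DIM + 2 * DIM * N_KV_HEADS * HEAD_DIM
    router = DIM * NUM_EXPERTS  # one linear map per layer
    norms = 2 * DIM             # assumption: two norm weight vectors per layer
    per_layer = experts_counted * expert + attn + router + norms
    # Assumption: untied input embedding and output head, plus a final norm.
    return N_LAYERS * per_layer + 2 * VOCAB * DIM + DIM

total = param_count(NUM_EXPERTS)  # everything that must sit in memory
active = param_count(TOP_K)       # what a single token actually touches
print(f"total  ~ {total / 1e9:.1f}B")
print(f"active ~ {active / 1e9:.1f}B")
```

Under these assumptions the totals come out at about 46.7B and 12.9B, which round to the 47B and 13B quoted above. Note that the "8x7B" name is misleading: attention and embeddings are shared, so the total is well under 56B.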
What it costs
The headline catch: you have to load all 8 experts into memory, even though only 2 fire per token. So a 47B-parameter footprint dominates your VRAM requirement, not a 13B-equivalent footprint.
For deployment:
- FP16 serving: ~94 GB VRAM for the weights alone. That needs two 80 GB GPUs (A100 or H100); a single 80 GB card is not enough.
- INT4 quantisation: ~24 GB VRAM. One A100 80GB, comfortable headroom.
- Throughput: Mistral's announcement claims ~6× faster inference than Llama 2 70B at comparable benchmark quality. Routing overhead is small (roughly 10-15% slower than a pure dense 13B model).
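The VRAM figures above are just parameter count times bytes per parameter. A minimal sketch, assuming the rounded 47B total and counting weights only (KV cache, activations and quantisation metadata come on top):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 47e9  # rounded total from the summary above

fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
int4_gb = weight_memory_gb(TOTAL_PARAMS, 4)
print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.1f} GB")
```

This gives 94 GB at FP16 and 23.5 GB at INT4. Real deployments need headroom beyond these numbers, so budget for more than the weights.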
For research:
- Easier to ablate than closed MoE models — you can swap routing functions and re-train.
- The open weights enable a wave of community fine-tuning that closed models can't get.
What the paper doesn't address
A few things to look up separately if you're evaluating:
- Routing decisions during long-context inference (32k window) — paper doesn't disclose router stability at extreme positions.
- Expert specialisation: the paper's routing analysis reports no obvious specialisation by topic, only some syntactic patterns, and it doesn't go deeper into whether the 8 experts develop interpretable roles.
- Inference latency variance — top-2 routing means per-token decisions can be heterogeneous across batch positions, which complicates serving.
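To make the last point concrete, here is a small sketch of how uneven routing shows up as uneven expert load within a batch. The helper names `expert_load` and `imbalance` are ours, and the batch is a made-up example rather than measured data.

```python
from collections import Counter

def expert_load(assignments, num_experts=8):
    """assignments: one (expert_a, expert_b) pair per token in the batch."""
    counts = Counter(e for pair in assignments for e in pair)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance(load):
    """Busiest expert's load divided by the mean load (1.0 means perfectly even)."""
    return max(load) / (sum(load) / len(load))

# Hypothetical batch of 4 tokens, each routed to 2 experts.
batch = [(0, 1), (0, 2), (0, 3), (1, 0)]
load = expert_load(batch)
ratio = imbalance(load)
```

Here expert 0 receives every token while four experts sit idle. If experts are spread across devices, the busiest one sets the pace for the whole batch, which is one way routing heterogeneity can turn into latency variance.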
Why we summarised it this way
The summariser prompt for the Industry Pro profile (which we ran first) prioritises deployment cost over methodological elegance. That's deliberate. The same paper through the Researcher lens would lead with the top-1 vs top-2 routing analysis instead. Same paper, different lens, different headline.
If you're evaluating Mixtral for production, the Industry Pro summary in your daily digest will lead with VRAM and throughput numbers. If you're tracking the MoE field, the Researcher summary will lead with the routing comparison. Same digest pipeline, different optimisation target.