Researchers from NVIDIA and MIT unveiled QeRL (Quantization-enhanced Reinforcement Learning), a framework that makes RL training of LLMs up to 1.5× faster while keeping accuracy equal to or better than 16-bit baselines.
Traditional RL fine-tuning is painfully slow: rollouts dominate wall-clock time, and both the policy and the reference model are usually kept in high precision. QeRL changes that.
Here’s how it works 👇
- 🧮 Uses NVFP4 4-bit weights with Marlin kernels for rollouts and scoring.
- ⚙️ Keeps LoRA adapters for gradient updates only.
- 🔁 Reuses one quantized policy for both rollout and logit scoring, cutting memory use roughly in half (see the sketch after this list).
- 🎲 Adds Adaptive Quantization Noise (AQN): extra noise that boosts exploration early on and decays on a schedule as training progresses.
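To make the rollout/update split concrete, here is a minimal PyTorch sketch, not the paper's implementation: `fake_quantize` is a crude stand-in for NVFP4 (the real system serves 4-bit weights through Marlin kernels), and `QuantLoRALinear` with its rank and sizes is purely illustrative. The point is that the base weight is frozen and quantized, the same module serves sampling and scoring, and only the LoRA factors carry gradients.

```python
# Minimal sketch (not QeRL's actual code): frozen fake-quantized base weight
# plus trainable LoRA adapters. Names, rank, and sizes are illustrative.
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization, standing in for NVFP4."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class QuantLoRALinear(nn.Module):
    def __init__(self, in_f: int, out_f: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        # Frozen quantized base weight: reused for rollouts and logit scoring.
        w = torch.randn(out_f, in_f) * 0.02
        self.register_buffer("w_q", fake_quantize(w))
        # LoRA adapters: the only parameters that receive gradient updates.
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.w_q.t()                                   # quantized path
        lora = (x @ self.lora_a.t()) @ self.lora_b.t() * self.scaling  # trainable path
        return base + lora

layer = QuantLoRALinear(64, 128)
# Only the LoRA matrices show up as trainable parameters.
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```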
The key insight: quantization noise ≠ problem — it’s a feature.
It increases token-level entropy, helping models explore better during training.
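A hedged sketch of that idea, assuming an exponentially decaying noise schedule: `aqn_sigma`, `token_entropy`, and the constants below are illustrative, not the paper's exact AQN formulation, but they show the two quantities you would track, the noise level injected during rollouts and the token-level entropy of the policy.

```python
# Illustrative only: a "strong early, fades later" noise schedule and the
# token-level entropy one would log to watch exploration during RL training.
import math
import torch
import torch.nn.functional as F

def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 5e-2, sigma_end: float = 5e-3) -> float:
    """Noise magnitude that decays exponentially from sigma_start to sigma_end."""
    t = min(step / max(total_steps - 1, 1), 1.0)
    return sigma_start * math.exp(t * math.log(sigma_end / sigma_start))

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy (nats) of the next-token distribution."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

logits = torch.randn(4, 32000)  # stand-in batch of next-token logits
print(f"entropy of a random-logit batch: {token_entropy(logits):.3f} nats")
for step in (0, 250, 500, 1000):
    print(f"step {step:4d}: noise sigma = {aqn_sigma(step, 1000):.4f}")
```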
On math reasoning benchmarks, QeRL trains faster, converges sooner, and matches or beats FP16 LoRA and QLoRA models — all while fitting a 32B model on a single H100 (80 GB) GPU.
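Back-of-the-envelope arithmetic (weights only, ignoring KV cache, activations, and optimizer state) shows why 4-bit weights make that single-GPU fit plausible:

```python
# Rough weight-memory estimate for a 32B-parameter policy; scale factors and
# other overheads are ignored, so treat these as ballpark figures only.
params = 32e9
for name, bits in [("FP16", 16), ("NVFP4 (4-bit)", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:>14}: ~{gb:.0f} GB of weights")
# FP16 weights alone would eat most of an 80 GB H100; 4-bit leaves headroom.
```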
Bottom line: QeRL delivers faster rollouts, lower memory, and room for larger models, putting RL fine-tuning of big LLMs within reach of single-GPU setups.
📄 Paper: “QeRL: Beyond Efficiency — Quantization-enhanced Reinforcement Learning for LLMs”

