TD-MPC-Opt: Distilling Model-Based Multi-Task Reinforcement Learning Agents: what it means for business leaders
TD-MPC-Opt compresses a 317M-parameter world-model agent into a 1M-parameter controller with near-teacher performance, cutting compute, memory, and deployment costs for robotics and edge-AI teams.
1. What the method is
TD-MPC-Opt distills a large TD-MPC2 teacher into a tiny student by adding a reward-matching loss to the usual model-based objectives, then optionally quantises to FP16. The result is a single, lightweight, multi-task policy with near-teacher performance.
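A schematic form of the student's objective, per the description above and in section 4: the standard TD-MPC2 consistency and value terms plus a reward-matching penalty against the teacher. The symbols below (the weight \(\lambda\), student and teacher reward heads \(R_s\) and \(R_t\), latent state \(z_t\), action \(a_t\)) are illustrative notation, not necessarily the paper's.

```latex
\mathcal{L}_{\text{student}}
  = \mathcal{L}_{\text{consistency}}
  + \mathcal{L}_{\text{value}}
  + \lambda \,\bigl\lVert R_s(z_t, a_t) - R_t(z_t, a_t) \bigr\rVert_2^2
```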
2. Why the method was developed
State-of-the-art model-based RL agents are large and energy-hungry, which blocks deployment on real-world robots and drones. Training small models from scratch wastes data and yields weaker policies. TD-MPC-Opt offers a shortcut: retain most of the teacher's quality while slashing the parameter count for practical deployment.
3. Who should care
- Robotics CTOs fitting control stacks on embedded boards
- Edge-AI engineers squeezing inference into tight power budgets
- Autonomous logistics managers planning multi-task fleets
4. How the method works
A frozen 317M-parameter teacher and a 1M-parameter student ingest the same offline dataset. In addition to the usual consistency and value losses, the student minimises a mean-squared error between its reward predictions and the teacher's. A weighting coefficient balances imitation against self-learning. After one million gradient steps the student is quantised to FP16, halving memory while preserving accuracy.
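A minimal sketch of one training step, assuming PyTorch and a hypothetical world-model interface (encode, next, reward, Q, and policy heads on nn.Module objects); the loss structure follows the description above, but the names, the simplified one-step value target, and the default weights are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, obs, action, next_obs, reward,
                      reward_coef=1.0, gamma=0.99):
    """Student world-model losses plus a reward-matching term against the
    frozen teacher. `student`/`teacher` are hypothetical nn.Modules exposing
    encode(), next(), reward(), Q(), and policy() heads (illustrative API)."""
    z = student.encode(obs)                    # student latent state
    z_next_pred = student.next(z, action)      # predicted next latent

    # Consistency: predicted next latent vs. encoding of the observed next state
    with torch.no_grad():
        z_next = student.encode(next_obs)
    consistency = F.mse_loss(z_next_pred, z_next)

    # Value: simplified one-step TD regression on the dataset reward
    q_pred = student.Q(z, action)
    with torch.no_grad():
        q_target = reward + gamma * student.Q(z_next, student.policy(z_next))
    value = F.mse_loss(q_pred, q_target)

    # Reward matching (distillation): student reward head vs. the frozen
    # teacher's prediction; reward_coef trades imitation against self-learning
    with torch.no_grad():
        r_teacher = teacher.reward(teacher.encode(obs), action)
    distill = F.mse_loss(student.reward(z, action), r_teacher)

    return consistency + value + reward_coef * distill

# After training, FP16 post-training quantisation is a one-line cast in PyTorch:
# student.half()
```

Raising `reward_coef` pushes the student toward pure imitation of the teacher's reward model; lowering it lets the dataset signal dominate, which is the trade-off the coefficient in the text controls.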
5. How it was evaluated
Distillation ran on the DM-Control MT30 offline dataset (690k episodes). Metrics were normalised return, sample-efficiency curves, and post-quantisation model size. Baselines included scratch-trained 1M networks and DreamerV3.
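For context on the "normalised return" metric: a common DM-Control convention maps raw episode returns (per-step rewards in [0, 1] over 1000-step episodes, hence 0 to 1000 per episode) onto a 0 to 100 score. Whether TD-MPC-Opt uses exactly this scaling is an assumption, and the helper below is purely illustrative.

```python
import numpy as np

def mt30_normalised_score(returns_per_task: dict) -> float:
    """Average episode return per task, scaled from DM-Control's [0, 1000]
    range to [0, 100], then averaged across tasks. Illustrative only."""
    task_scores = [np.mean(returns) / 10.0 for returns in returns_per_task.values()]
    return float(np.mean(task_scores))

# Illustrative values, not results from the paper:
print(mt30_normalised_score({"walker-walk": [320.0, 280.0], "cheetah-run": [250.0]}))
# -> 27.5
```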
6. How it performed
The 1M student reached a normalised score of 28.5, roughly 50% above models trained from scratch, and shrank to 3.9 MiB after FP16 quantisation. A 200k-step distillation run already matched scratch training while using 40% less data. (Source: arXiv 2507.01823, 2025)