Demajh, Inc.

TD-MPC-Opt: Distilling Model-Based Multi-Task Reinforcement Learning Agents: what it means for business leaders

TD-MPC-Opt compresses a 317 M-parameter world-model agent into a 1 M-parameter controller with minimal loss in task performance, cutting compute, memory, and deployment cost for robotics and edge-AI teams.

1. What the method is

TD-MPC-Opt distills a large TD-MPC2 teacher into a tiny student by adding a reward-matching loss to the usual model-based objectives, then optionally quantises to FP16. The result is a single, lightweight, multi-task policy with near-teacher performance.
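
A minimal PyTorch-style sketch of that loss composition, assuming both models expose predicted rewards, next-step latents, and values as tensors; the dictionary keys and the `beta` weight are illustrative, not the paper's exact interface:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, targets, beta=1.0):
    # Standard model-based terms the student already optimises.
    consistency = F.mse_loss(student_out["next_latent"], targets["next_latent"])
    value = F.mse_loss(student_out["value"], targets["value"])
    # Added reward-matching term: match the frozen teacher's predicted reward.
    reward_match = F.mse_loss(student_out["reward"], teacher_out["reward"])
    return consistency + value + beta * reward_match
```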

2. Why the method was developed

State-of-the-art model-based RL agents are huge and energy-hungry, blocking real-world robot or drone use. Training small models from scratch wastes data and yields weaker policies. TD-MPC-Opt offers a shortcut: keep teacher quality but slash parameters for practical deployment.

3. Who should care

Robotics and drone manufacturers, edge-AI product teams, and ML-infrastructure leads who need capable multi-task control policies to run on resource-constrained hardware.

4. How the method works

A frozen 317 M-parameter teacher and a 1 M-parameter student ingest the same dataset. In addition to the usual consistency and value losses, the student minimises a reward-matching MSE against the teacher's predicted rewards. A coefficient balances imitation against self-learning. After one million gradient steps the student is quantised to FP16, roughly halving memory while preserving accuracy.
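
A rough sketch of such a distillation loop, under assumed interfaces: `predict_reward` and `compute_losses` are hypothetical helpers standing in for whatever the actual teacher and student models expose, and the final `.half()` call illustrates the FP16 step:

```python
import torch
import torch.nn.functional as F
from itertools import cycle, islice

def distill(student, teacher, dataloader, steps=1_000_000, beta=1.0, lr=3e-4):
    teacher.eval()                                            # teacher stays frozen
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for batch in islice(cycle(dataloader), steps):
        with torch.no_grad():
            teacher_reward = teacher.predict_reward(batch)    # assumed teacher API
        out = student.compute_losses(batch)                   # assumed student API
        loss = (out["consistency"] + out["value"]
                + beta * F.mse_loss(out["reward"], teacher_reward))
        opt.zero_grad()
        loss.backward()
        opt.step()
    student.half()   # FP16 quantisation: roughly halves model memory
    return student
```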

5. How it was evaluated

Distillation ran on the DM-Control MT30 offline dataset (690 k episodes). Metrics were normalised return, sample-efficiency curves, and post-quantisation model size. Baselines included scratch-trained 1 M-parameter networks and DreamerV3.
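
The headline metric averages per-task returns after rescaling each task to a common range. A rough illustration of that kind of normalisation, assuming per-task random-policy and reference returns are available (the benchmark's exact scaling convention may differ):

```python
import numpy as np

def normalised_score(task_returns, random_returns, reference_returns):
    r = np.asarray(task_returns, dtype=float)
    lo = np.asarray(random_returns, dtype=float)
    hi = np.asarray(reference_returns, dtype=float)
    per_task = 100.0 * (r - lo) / (hi - lo)   # 0 = random policy, 100 = reference
    return per_task.mean()                    # benchmark score: mean over tasks
```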

6. How it performed

The 1 M-parameter student reached a 28.5 normalised score, roughly 50 % above scratch-trained models, and shrank to 3.9 MiB after FP16 quantisation. A 200 k-step distillation already matched scratch training while using 40 % less data. (Source: arXiv 2507.01823, 2025)
