Demajh, Inc.

Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test: what it means for business leaders

The paper reveals two ultralight routing-path metrics that alert engineers when a Mixture-of-Experts LLM flips from memorizing data to truly generalizing—saving evaluation compute and shortening time-to-value.

1. What the method is

Researchers log which expert networks each token visits inside a Mixture-of-Experts (MoE) language model, then compute two metrics: the average edit distance between the routes of different samples, and the cosine consistency of each sample’s route across layers. Falling distance combined with rising consistency reliably marks the “grokking” transition from rote memorization to abstraction.
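The paper’s exact formulations aren’t reproduced here, but the two gauges can be sketched in a few lines, assuming a route is a list of per-layer expert-ID choices (function names and the multi-hot vectorization are illustrative assumptions, not the authors’ code):

```python
import itertools
import numpy as np

def edit_distance(a, b):
    # Levenshtein distance between two expert-ID sequences (rolling-array DP).
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + cost)
    return dp[n]

def avg_pathway_distance(pathways):
    # Mean pairwise edit distance across different samples' routes:
    # this should FALL as unrelated samples start sharing pathways.
    pairs = list(itertools.combinations(pathways, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

def pathway_consistency(route, n_experts):
    # Mean cosine similarity between the expert-choice vectors of
    # adjacent layers within ONE sample: this should RISE at grokking.
    vecs = []
    for layer_choices in route:  # each layer: list/set of chosen expert IDs
        v = np.zeros(n_experts)
        v[list(layer_choices)] = 1.0
        vecs.append(v / np.linalg.norm(v))
    return float(np.mean([vecs[i] @ vecs[i + 1] for i in range(len(vecs) - 1)]))
```

Identical routes give a distance of 0 and a consistency of 1.0, so the two signals move in opposite directions as routing converges.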

2. Why the method was developed

Conventional pretraining curves provide little guidance on when a model will start performing well on real tasks, leading teams to over-train or guess. Full benchmark sweeps are expensive and leak private data. The authors sought a fast, training-only signal that predicts downstream capability without any extra inference or held-out sets.

3. Who should care

ML infrastructure and platform leads who budget pretraining and evaluation compute, teams training Mixture-of-Experts models in-house, and product owners who need early visibility into when a model will be capable enough for real tasks.

4. How the method works

During pretraining the MoE router assigns each token to a handful of experts. The pathway of expert IDs is stored for many samples. As learning progresses, unrelated samples begin sharing similar pathways (lower edit-distance) while each sample’s path stabilizes across layers (higher consistency). These two metrics are streamed in real time, require no gradients, and add negligible overhead—making them practical “dashboard” gauges for any MoE run.
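A dashboard gauge of this kind could be wired up as a small running monitor fed with each sample’s top-k expert indices per layer (the class, method names, and the cheap symmetric-difference proxy for route distance below are hypothetical sketches, not the authors’ implementation):

```python
import numpy as np

class PathwayMonitor:
    """Running MoE routing gauges: no gradients, negligible overhead.
    `record` takes one sample's route, e.g. [[3, 7], [3, 5], ...]:
    the top-k expert IDs chosen at each layer."""

    def __init__(self, n_experts):
        self.n_experts = n_experts
        self.distances, self.consistencies = [], []
        self._prev_route = None

    def _consistency(self, route):
        # Mean cosine similarity of adjacent layers' expert-choice vectors.
        vecs = []
        for choices in route:
            v = np.zeros(self.n_experts)
            v[list(choices)] = 1.0
            vecs.append(v / np.linalg.norm(v))
        return float(np.mean([vecs[i] @ vecs[i + 1]
                              for i in range(len(vecs) - 1)]))

    def record(self, route):
        self.consistencies.append(self._consistency(route))
        if self._prev_route is not None:
            # Cheap proxy for cross-sample route distance:
            # count per-layer expert mismatches vs. the previous sample.
            d = sum(len(set(a) ^ set(b))
                    for a, b in zip(route, self._prev_route))
            self.distances.append(d)
        self._prev_route = route

    def gauges(self):
        # (mean distance, mean consistency) over everything recorded so far.
        return (float(np.mean(self.distances or [0.0])),
                float(np.mean(self.consistencies)))
```

Because `record` only touches integer index lists already produced by the router’s forward pass, polling it every few steps adds essentially no training cost.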

5. How it was evaluated

The team checkpointed a 10-billion-parameter OLMoE model 40 times across a trillion-token pretraining run. Each checkpoint was instruction-tuned and scored on GSM-8K (math), HumanEval (code), CommonsenseQA, and a biomedical QA set. They then calculated Pearson and Spearman correlations between pathway metrics and benchmark accuracy to test predictive strength.
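The correlation step is straightforward to reproduce on toy numbers. The values below are illustrative stand-ins, not the paper’s data (the real study used 40 checkpoints); Spearman is computed as Pearson on ranks, which matches the standard definition when there are no ties:

```python
import numpy as np

def pearson(x, y):
    # Linear correlation between two per-checkpoint series.
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Rank correlation: Pearson applied to ranks (tie-free case).
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

# Illustrative per-checkpoint values (made up for the example):
consistency = [0.42, 0.48, 0.55, 0.61, 0.70, 0.78]  # pathway consistency
distance = [9.1, 8.4, 7.6, 6.9, 6.1, 5.2]           # pathway distance
accuracy = [0.10, 0.14, 0.22, 0.31, 0.45, 0.58]     # benchmark accuracy

print("consistency vs accuracy:",
      pearson(consistency, accuracy), spearman(consistency, accuracy))
print("distance vs accuracy:",
      pearson(distance, accuracy), spearman(distance, accuracy))
```

On these monotone toy series, consistency correlates strongly positively and distance strongly negatively with accuracy, mirroring the sign pattern the paper reports.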

6. How it performed

Pathway consistency correlated at +0.93 with benchmark accuracy and pathway distance at −0.95, both far outperforming raw training loss, whose correlation stayed below 0.5. The metrics flagged accuracy jumps up to ten checkpoints early, enabling 5–10 % compute savings without quality loss. (Source: arXiv 2506.21551, 2025)
