
Grokking: The Long Road to Understanding

Grokking, Delayed Generalisation, and What We Still Don't Know About How Neural Networks Learn


1. The Accident

To grok means to understand so thoroughly that the observer becomes part of the observed. — Robert Heinlein, Stranger in a Strange Land (1961)

In late 2021, a researcher at OpenAI left a training run active before going on vacation. The model—a small transformer learning modular arithmetic—had already memorised its training data. Training loss was zero. Test accuracy was at chance. By any standard heuristic, the run should have been stopped.

When the researcher returned, test accuracy had jumped to near-perfect. No new data. No architecture changes. No hyperparameter adjustments. Just more gradient steps, with weight decay quietly reshaping the loss landscape underneath a metric that had long since flatlined.

The team named the phenomenon grokking, borrowing Heinlein's term for a depth of understanding so complete it transforms the observer. The paper, published in January 2022 by Power, Burda, Edwards, Babuschkin, and Misra, documented something that should not have happened according to conventional machine learning wisdom: generalisation arriving orders of magnitude after memorisation, separated by a vast plateau of apparent stagnation.

Three years later, grokking has become one of the most generative phenomena in deep learning research. It has attracted mechanistic interpretability researchers, statistical physicists, optimisation theorists, and alignment scientists. It has spawned competing theoretical frameworks, been replicated in deep networks and non-neural models, and—as of mid-2025—been observed for the first time in large-scale LLM pretraining.

Yet for all the attention it has received, grokking remains incompletely understood. The competing explanations are partially reconciled at best, genuinely contradictory at worst. The phenomenon's implications for frontier AI systems—for scaling, for safety, for the fundamental question of what it means for a neural network to "understand"—remain largely unresolved.

This essay surveys where the field stands, what we can confidently claim, and—more importantly—what remains dangerously unclear.

2. What Grokking Is (and What It Is Not)

Grokking is not a synonym for generalisation. It names a specific, sometimes-observed training phenomenon: a model abruptly transitions from overfitting to generalising, well after training loss has converged, with held-out performance rising suddenly rather than gradually alongside training performance. The delay is the defining feature. In typical training, generalisation and memorisation improve roughly in tandem. In grokking, they decouple—sometimes by thousands or even millions of training steps.

The canonical setup involves a small transformer trained on modular arithmetic—computing (a + b) mod p for some prime p. The model is given a fraction of all possible input pairs as training data. It memorises these rapidly, achieving zero training loss within a few hundred epochs. Test accuracy remains at chance. Then, after continued training with weight decay, test accuracy abruptly climbs to near-perfect.
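The canonical setup is small enough to sketch end to end. A minimal dataset construction, assuming p = 113 and a 30% training fraction (typical values in the literature, though exact choices vary by paper):

```python
import random

def modular_addition_dataset(p=113, train_frac=0.3, seed=0):
    """Enumerate all p*p input pairs for (a + b) mod p, then split a
    fixed fraction into the training set. The held-out pairs form the
    test set on which grokking is eventually observed."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    cut = int(train_frac * len(pairs))
    train = [(a, b, (a + b) % p) for a, b in pairs[:cut]]
    test = [(a, b, (a + b) % p) for a, b in pairs[cut:]]
    return train, test

train, test = modular_addition_dataset()
print(len(train), len(test))  # 3830 8939 of the 12769 total pairs
```

Because the full input space is finite and enumerable, "generalisation" here has an unusually clean meaning: predicting the label of every pair the model has never seen.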

This matters because it violates one of deep learning's most deeply held practical assumptions: that if a model has overfit and training loss has plateaued, continued training is wasted compute. Grokking suggests the opposite—that some of the most structurally important learning can happen in exactly the regime where conventional wisdom says to stop.

2.1 Why It Challenges Optimisation Theory

Standard generalisation theory offers two broad frameworks for understanding why neural networks generalise: implicit regularisation (the optimiser's trajectory naturally favours simpler solutions) and PAC-Bayesian or norm-based bounds (generalisation correlates with parameter norm or flatness of minima). Both frameworks predict that generalisation should emerge alongside or shortly after training loss decreases. Neither predicts a regime where training loss is zero, generalisation is absent, and yet the model is silently building the machinery for generalisation underneath the surface.

The phenomenon also challenges the common intuition behind early stopping. If a model can overfit completely and then, given sufficient additional training, transition to a qualitatively superior solution, then our entire framework for deciding when to stop training may be fundamentally incomplete. We are not merely missing a few extra percentage points of accuracy—we may be missing an entirely different algorithmic solution.

2.2 The Three Phases

Neel Nanda and colleagues, in their landmark ICLR 2023 paper, fully reverse-engineered the algorithm learned by a one-layer transformer that had grokked modular addition. They identified three continuous phases that underlie the apparently discontinuous transition:

Memorisation (early training): The model fits the training data via a brute-force lookup table. Training loss drops to zero. The embedding space shows no particular structure. The model has, in effect, created a hash map.

Circuit formation (mid-training): Beneath the frozen training loss, weight decay is slowly amplifying a qualitatively different solution—one based on discrete Fourier transforms and trigonometric identities. The model learns to embed numbers as rotations on a circle, compose rotations via attention and MLP layers, and read off the answer using cosine similarity. This circuit is forming gradually even though no external metric detects it.

Cleanup (late training): The generalising Fourier circuit has lower weight norm than the memorisation circuit. Weight decay tips the balance: the memorisation components are shed, the Fourier circuit dominates, and test accuracy suddenly spikes. The "sudden" transition is actually the culmination of a gradual process that was invisible to standard metrics.
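The grokked algorithm itself is compact enough to state directly. A minimal sketch using a single frequency k = 1 (the real circuit spreads the computation over several learned frequencies and learns its embeddings rather than hard-coding them):

```python
import math

def fourier_predict(a, b, p=113, k=1):
    """Represent each number as a rotation by angle 2*pi*k*a/p, compose
    the two rotations via the angle-addition identities, then score
    every candidate answer c by cosine alignment with the composed
    rotation. The score cos(2*pi*k*(a + b - c)/p) peaks exactly at
    c = (a + b) mod p."""
    w = 2 * math.pi * k / p
    # Angle-addition identities, which the transformer implements with
    # its attention and MLP nonlinearities:
    cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
    sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
    # "Unembedding": cosine similarity against each candidate residue.
    scores = [cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c) for c in range(p)]
    return max(range(p), key=scores.__getitem__)

print(fourier_predict(50, 90))  # 27, i.e. (50 + 90) mod 113
```

No lookup table appears anywhere: the prediction follows from the rotational structure alone, which is why the circuit generalises to pairs it has never seen.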

This three-phase narrative is one of the strongest results in mechanistic interpretability to date. It demonstrates that what appears discontinuous from the outside can be continuous when measured with the right instruments—a finding with deep implications for how we monitor and evaluate training in practice.

3. The Competing Explanations

Grokking has attracted theoretical attention from multiple research communities, each bringing different formalisms and different intuitions. The result is a landscape of partially overlapping, partially contradictory explanations. Understanding where they agree—and where they don't—is essential for evaluating what we actually know.

3.1 Weight Decay and Implicit Simplicity Bias

The most widely accepted proximate explanation is that weight decay acts as a slow pressure toward simpler solutions. During memorisation, the model converges to a high-norm solution (the lookup table). Weight decay gradually reduces parameter norms, and the low-norm generalising solution—which is harder to find but has better inductive structure—eventually becomes dominant. Liu et al. (2022) formalised this as an effective theory of representation learning, showing that grokking corresponds to the embeddings transitioning from unstructured to geometrically organised (parallelogram structures for addition, circular structures for modular arithmetic).

This explanation is strong but incomplete. It tells us what drives the transition (weight decay penalises the memoriser), but it does not explain why the generalising circuit forms at all, or why it takes the specific algorithmic form it does. Weight decay acts as a catalyst, not a sufficient explanation. Indeed, grokking has been observed in classification settings even without weight decay (via the implicit bias of gradient descent on separable data), suggesting the phenomenon has deeper roots than any single regulariser.
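Mechanically, the pressure in question is just the decoupled decay term of an AdamW-style update. A simplified sketch (plain gradient step with decoupled decay; grokking papers typically use AdamW with an unusually large weight_decay, often around 1):

```python
def decayed_step(params, grads, lr=1e-3, weight_decay=1.0):
    """One update with decoupled weight decay: every weight is shrunk
    by a factor (1 - lr * weight_decay) independently of the gradient.
    Over many steps this multiplicative shrinkage erodes weights that
    the task gradient does not actively maintain."""
    return [(1.0 - lr * weight_decay) * w - lr * g for w, g in zip(params, grads)]

# A weight supporting only memorisation receives no sustaining gradient
# once training loss is zero, so decay alone determines its fate:
w = [10.0]
for _ in range(1000):
    w = decayed_step(w, [0.0], lr=1e-2, weight_decay=1.0)
print(w[0])  # shrunk by (0.99)**1000, i.e. to roughly 4e-4
```

This is the sense in which the generalising circuit "wins" under decay: its weights are continually replenished by the task gradient, while the memoriser's are not.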

3.2 Lazy-to-Rich Regime Transition

Kumar, Bordelon, Gershman, and Pehlevan (ICLR 2024) proposed a unifying framework: grokking occurs when neural networks transition from a lazy training regime (where weights stay near initialisation and the model behaves like a linear kernel method) to a rich regime (where weights move in task-relevant directions and genuine feature learning occurs). In the lazy regime, the model memorises efficiently but cannot learn transferable representations. In the rich regime, the model discovers structure. The transition between these regimes can be abrupt, producing the characteristic grokking signature.

This framework is attractive because it unifies several earlier observations: the role of weight decay (which pushes the system toward the rich regime), the importance of initialisation scale (large initialisations stay lazy longer), and the connection to adaptive optimisers (which can accelerate the transition). It also connects grokking to the broader neural tangent kernel literature, grounding it in existing theory.

The limitation is that the lazy-to-rich framework operates at a high level of abstraction. It identifies the regime transition but does not predict the specific algorithm the model will learn in the rich regime, or how long the transition will take. A 2025 theoretical paper from INRIA formalised the optimisation dynamics on the interpolation manifold, providing rigorous analysis of the two-phase structure, but acknowledged that extending this to practical architectures remains open.

3.3 Phase Transitions from Statistical Physics

A parallel line of work, drawing on statistical mechanics, treats grokking as a genuine phase transition in the training process. Zunkovic and Ilievski (JMLR 2024) provided exactly solvable grokking models with analytic expressions for critical exponents, grokking probability, and grokking time distributions. An ICLR 2024 paper demonstrated that the pre-activations of grokking networks undergo a first-order phase transition, with the latent kernels developing entirely new features that alter sample complexity.

This perspective has the advantage of mathematical precision and universality—phase transitions are a framework-independent phenomenon. However, the models studied are simplified to the point where feature learning is abstracted away, and the connection to realistic deep networks is still largely qualitative. The critical exponents computed for toy models have not been measured in practical-scale systems.

3.4 Circuit Competition and Efficiency

Varma, Shah, Kenton, Kramár, and Kumar (2023) offered an explanation grounded in circuit efficiency: memorising and generalising circuits compete for the model's limited capacity, and grokking occurs when the generalising circuit eventually wins because it is more parameter-efficient. Merrill et al. (2023) extended this, showing that the competition between dense memorising circuits and sparse generalising circuits is a general feature of transformer training.

Huang et al. (2024) went further, proposing a unified framework connecting grokking, double descent, and emergent abilities through the lens of circuit competition across scales. They delineated four regimes—no-fit, memorise, grok, comprehend—and showed that model capacity and data volume determine which regime a model occupies. This suggests that sufficiently large models might skip grokking entirely, jumping straight to comprehension—which may explain why the phenomenon is harder to observe in frontier systems unless one looks carefully.

3.5 Local Complexity and Linear Region Dynamics

Humayun, Balestriero, and Baraniuk (ICML 2024) made the provocative claim that deep networks always grok. They demonstrated grokking in practical settings—CNNs on CIFAR-10, ResNets on Imagenette—by showing that the linear regions (spline partition regions) tiling the input space undergo a phase transition during training: they migrate away from training samples (smoothing the mapping there) and toward decision boundaries (sharpening discrimination). They also introduced delayed robustness, where adversarial robustness emerges long after generalisation, suggesting grokking is just one instance of a broader family of delayed learning phenomena.
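The core of their local-complexity measure is simple to illustrate. A toy sketch for a single ReLU layer (the paper's actual measure is more refined; this just shows what "counting linear regions near a point" means):

```python
import random

def region_id(x, W, b):
    """The sign pattern of the ReLU pre-activations identifies which
    linear region of the spline partition the input x lies in."""
    return tuple(sum(wi * xi for wi, xi in zip(row, x)) + bi > 0
                 for row, bi in zip(W, b))

def local_complexity(x, W, b, radius=0.1, n=500, seed=0):
    """Count distinct linear regions hit by random perturbations of x.
    Regions migrating away from a training point shows up as this count
    falling toward 1 in its neighbourhood."""
    rng = random.Random(seed)
    regions = {region_id([xi + rng.uniform(-radius, radius) for xi in x], W, b)
               for _ in range(n)}
    return len(regions)

W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, -5.0]
print(local_complexity([0.0, 0.0], W, b))  # a hyperplane passes through this point
print(local_complexity([3.0, 3.0], W, b))  # no nearby hyperplane: one region
```

During the claimed phase transition, this count drops around training samples (the function becomes locally smooth there) while rising near decision boundaries.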

This work is significant because it moves grokking from a curiosity of small algorithmic models to a potentially universal feature of deep network training. If confirmed at scale, it would fundamentally change how we think about training schedules, early stopping, and model evaluation.

4. What We Know, What We Suspect, and Where We're Hand-Waving

The honest assessment of the grokking literature is that we have several strong local explanations and no fully satisfying global theory. Each framework captures part of the elephant. None captures it whole.

4.1 What the Evidence Genuinely Supports

First, grokking is real and reproducible across architectures, tasks, and even model classes. It has been observed in transformers, MLPs, CNNs, kernel methods, and tensor networks. The mechanistic interpretability work on modular arithmetic is airtight: Nanda et al. fully reverse-engineered the learned algorithm, confirmed it via ablations in Fourier space, and showed the three-phase training dynamics with continuous progress measures. This is one of the most complete explanations of a learned neural computation in the literature.

Second, weight decay (or some form of regularisation) reliably accelerates grokking and, in regression settings, appears necessary to produce it. The role of regularisation as a catalyst for the lazy-to-rich transition is well-supported across multiple theoretical frameworks.

Third, the "suddenness" of grokking is, at least in small models, a measurement artefact. The underlying process is gradual when tracked with appropriate progress measures. This is an important finding for the broader emergence debate: what looks discontinuous may reflect our poverty of metrics, not a discontinuity in the underlying dynamics.

4.2 Where the Evidence Is Thin

The critical weakness of the field is scale. Almost everything we know about grokking's mechanisms comes from models with fewer than a million parameters, trained on synthetic datasets with hundreds to thousands of examples. The Li et al. (2025) paper on grokking in 7B-parameter LLM pretraining is an important first step, but it studied a single architecture (OLMoE), and the "grokking" observed—local, asynchronous, domain-dependent—may be a qualitatively different phenomenon from the sharp transitions in toy settings.

The assumption that mechanistic insights from modular arithmetic transfer to natural language, code generation, or multi-step reasoning is, frankly, an act of faith. The Fourier multiplication algorithm is elegant and interpretable, but no one has found analogous clean circuits in large language models. The algorithms learned by frontier systems may be far more distributed, redundant, and resistant to reverse-engineering than the neat structures found in toy models.

Similarly, the phase transition analyses from statistical physics are mathematically rigorous but rely on simplifications (single-layer networks, Gaussian data, mean-field approximations) that may not survive contact with the deep, heterogeneous, noisy reality of modern training. Computing critical exponents for a 70-billion-parameter model on web-scale data is not currently feasible.

4.3 The Uncomfortable Questions

Several assumptions in the grokking literature deserve more scrutiny than they typically receive:

Is multi-epoch training necessary? Most grokking results involve training for hundreds or thousands of epochs on small datasets. Modern LLM pretraining typically involves a single pass (or near-single pass) through the data. The Li et al. result suggests grokking can occur even without repeated exposure, but the dynamics may differ fundamentally. If grokking requires the optimiser to slowly erode a memorisation solution through repeated gradient pressure, single-epoch training may produce a weaker or absent version of the phenomenon.

Does scale suppress or merely hide grokking? Huang et al.'s framework suggests that large models may jump directly to the "comprehend" regime, skipping the grok phase entirely. If true, grokking may be primarily a small-model phenomenon—interesting for theory but irrelevant for frontier systems. Alternatively, grokking may still occur in large models but be masked by the averaging of many asynchronous local grokking events across different capabilities and data domains.

Are we conflating correlation with mechanism? Weight norm decreases during grokking. Feature rank decreases during grokking. Pathway similarity increases during grokking. But which of these are causes, which are effects, and which are epiphenomena of a deeper process? The field has many correlated progress measures and few causal experiments.

5. Grokking and the Bigger Picture

5.1 Emergent Abilities and Scaling

The connection between grokking and emergent abilities in large language models is suggestive but unproven. Huang et al. (2024) argued that emergent abilities—sharp improvements in capability at certain model scales—can be understood as grokking along the scale axis rather than the time axis: memorisation circuits dominate until sufficient capacity allows generalisation circuits to form and compete. If correct, this would mean that studying grokking in controlled settings could help predict what abilities will emerge in frontier models and when.

This connects to the work of Du et al. (NeurIPS 2024), who showed that emergent abilities align more closely with pre-training loss thresholds than with parameter count. If grokking is a loss-landscape phenomenon rather than a scale phenomenon, it may be possible to induce or accelerate it through training dynamics alone—curriculum design, learning rate schedules, data ordering—without brute-force scaling.

The counterargument, articulated by Schaeffer et al. (2023), is that much of what looks like emergence may be an artefact of discontinuous evaluation metrics. Grokking researchers should take this seriously: the sharp transitions that define grokking may, in some cases, reflect the sharpness of the metric rather than the sharpness of the underlying learning.

5.2 Reasoning and Implicit Computation

Wang et al. (2024) demonstrated that small transformers trained to grok reasoning tasks become "implicit reasoners"—performing multi-step logical inferences through their feedforward layers without explicit chain-of-thought. This is provocative: it suggests that grokking doesn't just improve pattern matching but can produce qualitatively different computational strategies, where the model internalises an algorithm rather than memorising input-output pairs.

For the reasoning model paradigm—exemplified by OpenAI's o1 and o3—this raises an important question: are the impressive reasoning capabilities of these models the result of grokking-like transitions during training, where the model crosses from pattern-matched responses to genuine algorithmic reasoning? If so, understanding grokking dynamics could inform how to train better reasoning models and how to evaluate whether their reasoning is robust or fragile.

5.3 Safety and Alignment

The safety implications of grokking are under-discussed relative to their importance. If models can harbour latent capabilities that activate only after extensive training or fine-tuning, this creates a fundamental monitoring problem. A model that appears to have reached its capability plateau may be in the pre-grokking memorisation phase for dangerous capabilities—deception, situational awareness, reward hacking—that have not yet crystallised into functional circuits.

This connects directly to the sleeper agents research (Hubinger et al., 2024) and alignment faking observations (Greenblatt et al., 2024): if safety-relevant behaviours can grok into existence, then evaluations conducted at one training checkpoint may miss capabilities that emerge later. The Li et al. (2025) finding that different domains grok asynchronously makes this worse: a model could be safe on evaluated capabilities while simultaneously developing dangerous capabilities in domains not yet assessed.

Conversely, grokking offers a potential tool for alignment. Grokked models have been shown to develop more structured, disentangled representations—which makes them both more interpretable and, recently, better candidates for machine unlearning. If we can understand and control the grokking process, we may be able to steer models toward internalising safe algorithms rather than merely memorising safe responses.

5.4 Implications for Safety-Critical AI Systems

For those of us building AI systems in safety-critical domains—healthcare, surgical environments, autonomous systems—grokking underscores a principle that should already be non-negotiable: classical deterministic systems should make final decisions, not neural networks operating in regimes we do not fully understand. A surgical instrument recognition model that appears to have converged may still be in a pre-grokking phase for some class of instruments, some lighting condition, or some camera angle. The model's training loss tells you it has memorised your training data. It tells you nothing about whether it has learned a robust algorithm that generalises to the operating room.

This is not an argument against using deep learning in safety-critical systems. It is an argument for building hybrid architectures where neural networks handle perception and pattern recognition while classical systems enforce hard constraints, validate outputs, and maintain deterministic fallbacks. Grokking reminds us that the gap between memorisation and generalisation can be vast, silent, and invisible to the metrics we routinely monitor.

6. The Open Frontier

The grokking literature is three years old and already rich. But the most important questions remain unanswered. Here are five that I believe could define the next wave of progress:

6.1 Can We Predict Grokking Before It Happens?

If grokking is preceded by gradual circuit formation, then there should exist early-warning signals detectable before test accuracy improves. Nanda et al.'s progress measures (restricted loss, Fourier component norms) work for modular arithmetic. Notsawo et al. (2023) showed that loss landscape properties can predict grokking long before it occurs. Li et al.'s pathway metrics track generalisation in MoE models. But no general-purpose, architecture-agnostic grokking predictor exists. Developing one would be transformative: it would tell practitioners whether continued training is likely to yield a generalisation leap or is genuinely wasted compute.
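To give a flavour of what such a signal looks like, here is a sketch of one modular-arithmetic progress measure: the energy of the embedding matrix at each Fourier frequency over Z_p (a simplified rendering of the kind of measure Nanda et al. use; the function and variable names are mine):

```python
import math

def fourier_energies(embedding, p):
    """For each frequency k, project every embedding dimension onto
    cos(2*pi*k*a/p) and sin(2*pi*k*a/p) and sum the squared projections.
    During circuit formation, a handful of key frequencies grow steadily
    while test accuracy is still flat."""
    energies = []
    for k in range(1, p // 2 + 1):
        e = 0.0
        for dim in range(len(embedding[0])):
            c = sum(embedding[a][dim] * math.cos(2 * math.pi * k * a / p) for a in range(p))
            s = sum(embedding[a][dim] * math.sin(2 * math.pi * k * a / p) for a in range(p))
            e += c * c + s * s
        energies.append(e)
    return energies

# A toy embedding that is a pure frequency-3 signal lights up only at k = 3:
p = 7
emb = [[math.cos(2 * math.pi * 3 * a / p)] for a in range(p)]
print(fourier_energies(emb, p))
```

The open problem is precisely that this measure is task-specific: it presupposes knowing that the generalising circuit is Fourier-shaped, which is exactly what we cannot assume in a general predictor.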

6.2 Does Grokking Occur in Frontier Models?

The Li et al. (2025) result on a 7B MoE is the first evidence at practical scale, but the frontier is at 70B–400B+ parameters with dense architectures. Whether grokking manifests in GPT-class or Claude-class models during pretraining, fine-tuning, or RLHF is unknown. The asynchronous, domain-specific pattern observed by Li et al. may intensify or disappear at larger scales. This is arguably the single most important empirical question in the field.

6.3 Can We Induce or Accelerate Grokking on Demand?

If grokking produces qualitatively superior solutions—more robust, more interpretable, better at unlearning—then we should want to induce it deliberately rather than stumbling into it accidentally. This implies a research programme at the intersection of curriculum learning, optimiser design, and regularisation scheduling. Could we design training regimes that systematically push models from lazy to rich regimes? Could we use weight decay annealing, learning rate warmup-cooldown cycles, or data presentation ordering to compress the grokking timeline from thousands of epochs to tens?
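Any such recipe is speculative, but its ingredients are easy to parameterise. A purely hypothetical sketch (not a published schedule; the shape and constants are illustrative assumptions about what "weight decay annealing" might look like):

```python
import math

def wd_schedule(step, total_steps, wd_max=1.0, wd_min=0.01):
    """Hold weight decay high early, maximising pressure against the
    memorising solution, then cosine-anneal it down so the surviving
    generalising circuit is no longer being shrunk late in training."""
    t = min(step / total_steps, 1.0)
    return wd_min + 0.5 * (wd_max - wd_min) * (1.0 + math.cos(math.pi * t))

print(wd_schedule(0, 1000), wd_schedule(1000, 1000))  # 1.0 at the start, 0.01 at the end
```

Whether any such schedule actually compresses the grokking timeline is exactly the empirical question this research programme would need to answer.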

6.4 Is There a Unified Theory?

The lazy-to-rich transition, circuit competition, phase transitions, and implicit regularisation are all partial views of the same phenomenon. A unified theory would need to explain why the model learns the specific algorithm it does (not just that it transitions between regimes), predict the grokking time as a function of architecture, data, and optimiser, and account for the phenomenon across model classes and scales. Singular Learning Theory (SLT), which quantifies model complexity through the local learning coefficient, offers one promising path. It has been used to identify developmental stages in transformer training and could potentially provide the mathematical scaffolding for a unified account. But this work is early.

6.5 What Does Grokking Tell Us About Understanding?

The deepest question is also the most speculative. When a transformer groks modular addition, it discovers discrete Fourier transforms and trigonometric identities—an algorithm that would be recognisable to a mathematician. Is this understanding? The model was never taught trigonometry. It discovered it through gradient descent on a loss function. The algorithm is interpretable to us, but the model has no access to our interpretive framework.

This matters beyond philosophy. If grokking reliably produces interpretable algorithms, it could become a powerful tool for scientific discovery: train a network on data, wait for it to grok, then reverse-engineer the solution it found. If the algorithms are reliably interpretable only in toy settings and become opaque at scale, then grokking's practical value for interpretability diminishes. The answer to this question will determine whether grokking is a curiosity, a tool, or a clue to something fundamental about the relationship between optimisation and understanding.

7. Conclusion: The Unsettling Lesson

Grokking's most important lesson is not about modular arithmetic or Fourier transforms. It is about the limits of our observability. For years, we have monitored training with loss curves and accuracy metrics, treating them as reliable windows into what a model has learned. Grokking reveals that these windows can be opaque: a model can be zero-loss and zero-understanding simultaneously, and the transition between these states can be invisible to our instruments.

This should make us humble. Every time we look at a loss curve and conclude that a model has "converged," we are making an assumption—that the important learning is over—that grokking shows can be spectacularly wrong. Every time we evaluate a model's capabilities at a single checkpoint, we are taking a snapshot of a dynamical system that may be in the middle of a phase transition we cannot see.

The field has made real progress: we can fully explain grokking in toy models, we have multiple theoretical frameworks that partially overlap, and we have the first evidence of the phenomenon at LLM scale. But the gap between what we can explain in controlled settings and what we can predict in frontier systems remains enormous. That gap is where the most important work will happen in the years ahead.

An OpenAI researcher forgot to stop a training run, and what they found when they returned has not yet finished reshaping our understanding of how neural networks learn. Three years in, we are still grokking grokking.


References

[1] Power, A., Burda, Y., Edwards, H., Babuschkin, I. & Misra, V. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177.

[2] Nanda, N., Chan, L., Lieberum, T., Smith, J. & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. ICLR 2023.

[3] Liu, Z., Kitouni, O., Nolte, N., Michaud, E.J., Tegmark, M. & Williams, M. (2022). Towards understanding grokking: An effective theory of representation learning. arXiv:2205.10343.

[4] Kumar, T., Bordelon, B., Gershman, S.J. & Pehlevan, C. (2024). Grokking as the transition from lazy to rich training dynamics. ICLR 2024.

[5] Humayun, A.I., Balestriero, R. & Baraniuk, R. (2024). Deep networks always grok and here is why. ICML 2024.

[6] Fan, S. et al. (2024). Deep grokking: Would deep neural networks generalize better? arXiv:2405.19454.

[7] Varma, V., Shah, R., Kenton, Z., Kramár, J. & Kumar, R. (2023). Explaining grokking through circuit efficiency. arXiv:2309.02390.

[8] Huang, W. et al. (2024). Unified framework: Grokking, double descent, and emergent abilities via circuit competition.

[9] Zunkovic, B. & Ilievski, E. (2024). Grokking phase transitions in learning local rules with gradient descent. JMLR 25.

[10] Li, Z. et al. (2025). Where to find grokking in LLM pretraining? Monitor memorization-to-generalization without test. arXiv:2506.21551.

[11] Wang, Z. et al. (2024). Grokked transformers are implicit reasoners: Making transformers do implicit multi-hop reasoning.

[12] Notsawo, P. et al. (2023). Predicting grokking long before it happens: A look into the loss landscape of models which grok.

[13] Du, Z. et al. (2024). Understanding emergent abilities of language models from the loss perspective. NeurIPS 2024.

[14] Schaeffer, R. et al. (2023). Are emergent abilities of large language models a mirage? NeurIPS 2023.

[15] Hubinger, E. et al. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training.

[16] Greenblatt, R. et al. (2024). Alignment faking in large language models. Anthropic.

[17] Thilak, V. et al. (2022). The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon.

[18] Gromov, A. (2023). Grokking modular arithmetic. arXiv:2301.02679.

[19] INRIA (2025). A theoretical framework for grokking: Interpolation manifold dynamics. hal-05425613.
