
When Less Is More: The Quiet Revolution in How AI Learns to See



Your brain is doing something extraordinary right now. As you read these words, only a tiny fraction of your neurons are firing. The vast majority sit silent — not because they're broken, but because silence is the signal. Decades of neuroscience have converged on a profound insight: the brain encodes information sparsely. A few neurons fire with precision; the rest stay quiet. It's metabolically cheap, remarkably robust, and deeply efficient.

Now consider how today's most powerful AI systems represent information. They do the opposite. Every dimension of every embedding is active, all the time — a dense, buzzing cloud of numbers where nothing is zero and nothing is quiet. This works, in the way that brute force often works. But it comes at a cost that's easy to miss: these representations are opaque, energy-hungry, and surprisingly fragile in ways we're only beginning to understand.

A new paper from Yilun Kuang, Yash Dagade, Tim Rudner, Randall Balestriero, and Yann LeCun — Rectified LpJEPA — asks a deceptively simple question: What if we taught AI to be sparse by design?

The answer turns out to be elegant, mathematically principled, and genuinely surprising.

The Collapse Problem

To understand why this matters, you need to understand the central tension of self-supervised learning. The setup is intuitive: show a model two different views of the same image — say, two random crops — and train it so that the internal representations of those views match. If the model learns that a dog is still a dog whether you crop the ears or the tail, it has learned something meaningful about the world.

But there's a trap. The easiest way to make two representations match is to make every representation identical. Map everything to the same point — a constant vector — and the matching loss is perfect. The representation is also perfectly useless. This is called collapse, and avoiding it has been the obsessive focus of the self-supervised learning community for years.
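The trap is easy to reproduce in a toy setting. The sketch below (plain Python, hypothetical data and encoder names) compares a "collapsed" encoder, which maps every input to the same point, against one that preserves input differences: the collapsed encoder scores a perfect matching loss while carrying no information at all.

```python
# Toy illustration of representation collapse (illustrative sketch only).

def matching_loss(z1, z2):
    """Mean squared distance between two batches of representations."""
    return sum((a - b) ** 2 for v1, v2 in zip(z1, z2) for a, b in zip(v1, v2)) / len(z1)

# Two augmented "views" of the same four inputs (hypothetical data).
views_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]]
views_b = [[1.0, 0.0], [0.1, 0.9], [0.6, 0.4], [0.8, 0.2]]

def collapsed_encoder(x):
    return [0.0, 0.0]          # every input maps to the same point

def identity_encoder(x):
    return list(x)             # keeps input differences

z_a = [collapsed_encoder(x) for x in views_a]
z_b = [collapsed_encoder(x) for x in views_b]
print(matching_loss(z_a, z_b))   # 0.0: a perfect score, and zero information
```

The collapsed encoder "wins" on the matching objective, which is exactly why the objective needs a regularizer that forces representations to spread out.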

The dominant solution? Push the learned representations to resemble a Gaussian distribution — that familiar, smooth bell curve spread evenly across all dimensions. Gaussian regularization is mathematically clean: it maximizes entropy under an energy constraint, it prevents collapse by forcing representations to spread out, and it has closed-form properties that make optimization tractable.

LeJEPA, the predecessor to this work, formalized this beautifully through the Cramér–Wold theorem — the insight that matching a high-dimensional distribution can be decomposed into matching many one-dimensional projections. This makes the whole approach scalable and theoretically grounded.

There's just one problem. The Gaussian assumption quietly forces density.

When you regularize representations toward a Gaussian, you're implicitly telling the model: every dimension should be active, every component should contribute, and the overall shape should be smooth and symmetric. There is no room for zeros. No room for silence. No room for the kind of selective, sparse encoding that makes biological neural systems so efficient.

Changing the Target

The core insight of Rectified LpJEPA is that the choice of target distribution — the shape you push representations toward — is not a technical footnote. It's a design decision that fundamentally determines what kind of intelligence your model can express.

The authors propose replacing the Gaussian target with a Rectified Generalized Gaussian (RGG) distribution. The construction is intuitive and proceeds in stages.

First, they generalize from Gaussian to the broader family of Generalized Gaussian distributions, parameterized by an exponent p. When p = 2, you get the standard Gaussian. When p = 1, you get the Laplace distribution — already known to promote sparsity (it's the distributional cousin of L1 regularization, the engine behind Lasso). When p < 1, you enter territory where sparsity pressure intensifies dramatically — the geometry of the Lp ball develops sharp corners along coordinate axes, concentrating probability mass near configurations where many dimensions are close to zero.

Then comes the critical step: rectification. By applying a ReLU nonlinearity — clamping negative values to zero — they create a distribution that naturally lives in the non-negative orthant. This isn't a hack; it's a principled construction with precise mathematical consequences. The authors prove that the expected L0 norm (the number of non-zero entries) of a Rectified Generalized Gaussian vector is exactly controlled by three parameters: the mean shift μ, the scale σ, and the shape parameter p.

This means you can dial in the sparsity level of your representations through distributional parameters alone — no ad-hoc thresholding, no post-processing, no hand-tuned penalties.
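The paper's result covers the full (μ, σ, p) family; the sketch below checks only the simplest special case, p = 2, where the ReLU of a Gaussian leaves a coordinate non-zero exactly when the pre-activation is positive, so the expected fraction of non-zeros is the standard normal CDF evaluated at μ/σ. This is a standard rectified-Gaussian fact, not the paper's general theorem:

```python
# Dialing sparsity through distributional parameters, in the p = 2
# (rectified Gaussian) special case: after ReLU, the expected fraction
# of non-zero coordinates is P(X > 0) = Phi(mu / sigma).
import math
import random

random.seed(0)

def expected_nonzero_fraction(mu, sigma):
    """Analytic P(X > 0) for X ~ N(mu, sigma^2): the normal CDF at mu/sigma."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2))))

def empirical_nonzero_fraction(mu, sigma, n=200_000):
    """Monte Carlo estimate of the fraction of coordinates surviving the ReLU."""
    return sum(max(0.0, random.gauss(mu, sigma)) > 0 for _ in range(n)) / n

for mu in (-1.0, 0.0, 1.0):                        # shift the mean to dial sparsity
    analytic = expected_nonzero_fraction(mu, 1.0)
    empirical = empirical_nonzero_fraction(mu, 1.0)
    print(f"mu={mu:+.1f}: analytic {analytic:.3f}, empirical {empirical:.3f}")
```

Shifting μ from -1 to +1 moves the expected active fraction from roughly 16% to roughly 84%, with no thresholding or post-processing anywhere in the loop.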

The Entropy Guarantee

Here's where the mathematics gets genuinely beautiful. A natural worry with sparsity is that you'll throw the baby out with the bathwater — make representations sparse but destroy the information they carry. The authors address this head-on through an information-theoretic argument.

The Rectified Generalized Gaussian distribution preserves a maximum-entropy guarantee, rescaled by the Rényi information dimension. In plain terms: among all distributions with a given energy budget and a given sparsity level, the RGG spreads information as evenly as possible across the active dimensions. It's sparse where it should be silent, and maximally informative where it speaks.

This is the key theoretical contribution. Sparsity and information preservation are not in tension — they're co-designed through the geometry of the target distribution. The Gaussian turns out to be just one point on a continuous spectrum, and it happens to be the point where sparsity is zero.

Making It Work: RDMReg

The practical implementation introduces a new regularizer called Rectified Distribution Matching Regularization (RDMReg). Because the Rectified Generalized Gaussian is not closed under linear projections — unlike the Gaussian, projecting an RGG along a random direction doesn't give you another RGG — the authors can't use the one-sample goodness-of-fit test from LeJEPA. Instead, they use a two-sample test based on the sliced Wasserstein distance: sort the projected features and the projected RGG samples, then measure the gap.

The resulting loss function is elegant enough to fit in roughly fifteen lines of PyTorch. The architecture adds a single ReLU at the output of the representation network. That's it. No new modules, no complex training schedules, no adversarial components.
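The two-sample sliced test can be sketched in a few lines. The stdlib-Python stand-in below is an illustrative proxy, not the authors' RDMReg code: it projects both the feature batch and a reference sample onto random unit directions, sorts each one-dimensional projection, and penalizes the sorted gap. For brevity the reference uses the p = 2 (rectified Gaussian) special case:

```python
# A minimal stand-in for a two-sample sliced Wasserstein test:
# sort the projected features and the projected reference samples,
# then measure the gap. Illustrative sketch only.
import math
import random

random.seed(0)

def sorted_projection(batch, direction):
    """Project every row onto `direction`, then sort the 1-D result."""
    return sorted(sum(x * d for x, d in zip(row, direction)) for row in batch)

def sliced_distance(features, reference, dim, n_slices=64):
    """Average squared gap between sorted projections (sliced Wasserstein-2 proxy)."""
    total = 0.0
    for _ in range(n_slices):
        d = [random.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]                   # random unit direction
        pf = sorted_projection(features, d)
        pr = sorted_projection(reference, d)
        total += sum((a - b) ** 2 for a, b in zip(pf, pr)) / len(pf)
    return total / n_slices

batch, dim = 512, 32
reference = [[max(0.0, random.gauss(0.0, 1.0)) for _ in range(dim)] for _ in range(batch)]
dense_feats = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch)]
matched_feats = [[max(0.0, random.gauss(0.0, 1.0)) for _ in range(dim)] for _ in range(batch)]

d_match = sliced_distance(matched_feats, reference, dim)
d_dense = sliced_distance(dense_feats, reference, dim)
print(f"matched vs reference: {d_match:.4f}")      # small: distributions agree
print(f"dense   vs reference: {d_dense:.4f}")      # larger: density is penalized
```

Features already distributed like the rectified target incur a small penalty, while dense Gaussian features are pushed toward it; in training, this loss term would sit alongside the usual view-matching objective.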

The entire framework — which the authors call Rectified LpJEPA — strictly generalizes LeJEPA. Set p = 2, remove the rectification, and you recover the original Gaussian-regularized method exactly. Every prior JEPA with Gaussian regularization is a special case of this broader family.

What the Experiments Show

On CIFAR-100 and ImageNet-100, the paper maps out clear sparsity-performance tradeoff curves. As representations become sparser, performance holds steady across a surprisingly wide range of sparsity levels before eventually degrading. The sweet spots — representations where 50–80% of dimensions are zero — often match or exceed the accuracy of fully dense baselines.

The practical implication is striking: you can learn representations that carry the same task-relevant information in a fraction of the active dimensions. Fewer active features mean cheaper downstream computation, more interpretable representations, and potentially more robust models — benefits that compound across every system that consumes these embeddings.

The Bigger Picture

What makes this work significant is not just the results, but the shift in perspective it represents. For years, the self-supervised learning community has treated the Gaussian as the natural, default regularization target. This paper demonstrates that the Gaussian is just one choice — and not always the best one. By treating the distribution of learned representations as an explicit design variable, the authors open a door that leads in several directions at once.

Consider what comes next. Can these sparse, non-negative representations improve continual learning, where catastrophic forgetting remains an open wound? Biological systems — which are sparse by nature — handle continual learning effortlessly. Could sparse representations enable more efficient model merging or modular composition, where distinct subsets of dimensions encode distinct concepts? What happens when you push p toward zero and representations approach truly binary, on-off codes — does something qualitatively different emerge?

And there's a deeper question lurking beneath the mathematics: if the optimal regularization target depends on the downstream task, the data modality, and the computational budget, then are we at the beginning of a new field — distributional architecture search — where the shape of the representation space is as much a hyperparameter as learning rate or model depth?

Rectified LpJEPA doesn't answer all of these questions. But it provides something arguably more valuable: a principled framework for asking them. It reminds us that how a model organizes its internal world — how it chooses what to say and what to leave silent — may matter just as much as how many parameters it has or how much data it's seen.

Sometimes, the most intelligent thing a neuron can do is not fire.

