The Awakening of 'Hidden Experts': How MIT Is Changing AI Training
The Moment My Brain Flipped
I thought I understood how AI training worked. Pre-training to build the foundation, post-training to specialize. Simple, linear, expensive.
Until I read the “Neural Thickets” paper from MIT, published on March 12, 2026. And suddenly, the mental model I had for how AIs learn turned upside down.
The core idea is so counterintuitive that I had to read it three times to believe it: the specialist you’re looking for is already inside the pre-trained model. It doesn’t need to be “taught” — it needs to be found. And finding it can be as simple as adding random noise to the parameters and seeing what happens.
Yes, you read that right. Random noise. No gradient. No iterative training. No massive infrastructure.
Let me explain.
The Traditional Playbook (And Why It’s a Problem)
Until now, the journey of an AI model like ChatGPT followed a fixed script. First comes pre-training: the model ingests colossal amounts of text (and increasingly, images and audio) and learns the statistical patterns of the world. This is the most expensive step — we’re talking weeks of training on GPU clusters costing millions of dollars.
Then comes post-training: fine-tuning, RLHF, PPO, GRPO… an alphabet soup that basically serves to transform a generalist AI into something useful and aligned. This second step is what turns a generic model into a “doctor,” “lawyer,” or “programmer.”
The problem? Post-training is expensive, slow, and complex. Each specialization requires curated data, human annotators, training infrastructure, and iteration cycles that can take weeks. For smaller companies, it’s prohibitive. For academic researchers, it’s nearly inaccessible.
What if we didn’t need it?
Neural Thickets: The Hidden Specialists
The paper is by Yulu Gan (PhD student at MIT’s CSAIL, Peking University graduate) and Phillip Isola (MIT professor, one of the most respected names in computer vision). Published on arXiv with open-source code on GitHub.
Their discovery reminds me of a metaphor I can’t get out of my head: large-scale pre-trained models are like graduates from an elite university. They have enormous potential and vast knowledge, but haven’t yet manifested a specialization. The knowledge of a chemist, a mathematician, a programmer — it’s already “baked” inside the model. It’s just hidden under layers of generalist parameters.
The researchers called this concentration of latent abilities a “Neural Thicket.” And here’s the key insight:
In small models, these hidden specialists are like needles in a haystack. They’re there, but they occupy such a tiny fraction of parameter space that you need sophisticated methods (like gradient descent) to find them.
But in large, well-pre-trained models? The haystack is mostly needles. The specialists are so dense around the pre-trained weights that you stumble upon them by accident. Literally.
RandOpt: Turning the Radio Dial
To exploit this discovery, MIT developed an algorithm called RandOpt (Random Optimization). And its beauty lies in its almost absurd simplicity.
Instead of using gradient descent — the standard engine of all AI training — RandOpt works in two steps:
Step 1: Random nudges. Add Gaussian noise to the pre-trained model’s weights. Do this N times. It’s a single-step operation — no iteration, no learning rate, no gradient. It’s like turning an old radio dial thousands of times until you find interesting frequencies.
Step 2: Performance voting. Test each of the N perturbed versions on a specific task with a small validation set. Select the top K. At inference time, these K versions “vote” together (majority vote) to arrive at the final answer.
That’s it. No backpropagation. No training cycles. RandOpt workers run fully in parallel, with no communication during the search; they interact only at voting time.
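The two steps can be sketched in a few lines. This is a toy illustration, not the paper’s code: the “model” here is a hypothetical 8-dimensional linear scorer, and the validation task is labeled by a hidden target vector that plays the role of the latent specialist.

```python
import random
from collections import Counter

random.seed(0)

DIM = 8
THRESHOLD = 2.4
PRETRAINED = [0.5] * DIM   # pretend pre-trained weights
TARGET = [0.7] * DIM       # hidden "specialist" the task rewards

def predict(weights, x):
    """Toy model: binary answer from a linear score."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > THRESHOLD else 0

def accuracy(weights, val_set):
    return sum(predict(weights, x) == y for x, y in val_set) / len(val_set)

def perturb(weights, sigma):
    """Step 1: one Gaussian nudge per copy -- no gradient, no iteration."""
    return [w + random.gauss(0.0, sigma) for w in weights]

def randopt(val_set, n=300, k=10, sigma=0.15):
    """Sample N perturbed copies, keep the top K by validation accuracy."""
    candidates = [perturb(PRETRAINED, sigma) for _ in range(n)]
    candidates.sort(key=lambda w: accuracy(w, val_set), reverse=True)
    return candidates[:k]

def vote(ensemble, x):
    """Step 2: the K kept copies answer independently; majority wins."""
    return Counter(predict(w, x) for w in ensemble).most_common(1)[0][0]

# Small validation set labeled by the hidden target weights.
val = []
for _ in range(60):
    x = [random.random() for _ in range(DIM)]
    val.append((x, predict(TARGET, x)))

ensemble = randopt(val)
print("pre-trained accuracy:", accuracy(PRETRAINED, val))
print("ensemble size:", len(ensemble))
```

The hyperparameters (N=300, K=10, sigma=0.15) are mine, chosen for the toy setting; the real method operates on billions of parameters and scores candidates on actual benchmarks.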
The Results (That Surprised Me)
I’ll admit that when I read “random noise competes with PPO and GRPO,” my reaction was skepticism. These are elite methods — the same ones used to align GPT-4 and Gemini.
But benchmarks don’t lie.
With K=50 ensembles, RandOpt matched or outperformed sequential RL and ES (evolution strategies) methods on mathematical reasoning (GSM8K, MATH-500, OlympiadBench), code generation (MBPP), creative writing (ROCStories), and chemistry (USPTO), with the same FLOP budget.
For vision-language models (tested on Qwen2.5-VL-3B), RandOpt improved accuracy on the GQA benchmark from 56.6% to 69.0% — a 12.4 percentage point jump.
And the scaling effect is the most fascinating part: the larger the model, the better RandOpt works. Because the larger the model, the denser the “thicket” of specialists around the pre-trained weights. In sufficiently large models, the majority of random perturbations improve task-specific performance.
Why This Is Revolutionary (In My Opinion)
I’m cautious with the word “revolutionary” — it’s overused in AI. But here I think it fits, for three reasons:
Democratization. If you don’t need massive training infrastructure to specialize a model, the cost of creating specialized AIs drops dramatically. Startups, academic researchers, smaller companies — everyone gains access to something that today is the privilege of those with millions to spend on compute.
Perfect parallelism. RandOpt workers are 100% independent. No communication during training, no sequential dependencies. This means the algorithm scales trivially with hardware — throw more GPUs at the problem and it resolves faster, with no coordination overhead.
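The “embarrassingly parallel” property is easy to see in miniature. In this sketch (my illustration, with a made-up fitness function standing in for validation accuracy), each worker perturbs and scores its own copy with zero shared state; the only synchronization point is the final ranking. A thread pool stands in for independent GPU workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def worker(seed):
    rng = random.Random(seed)  # private RNG: no shared state between workers
    weights = [0.5 + rng.gauss(0.0, 0.1) for _ in range(8)]
    # Toy stand-in for a validation score: closeness to a hidden target.
    fitness = -sum((w - 0.7) ** 2 for w in weights)
    return fitness, weights

# 100 fully independent workers; none of them communicate.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(worker, range(100)))

# The only coordination step: rank the results and keep the top K.
top_k = sorted(results, key=lambda r: r[0], reverse=True)[:10]
print("best fitness:", top_k[0][0])
```

Because each worker is a pure function of its seed, adding more hardware just means launching more seeds; there is no gradient to synchronize and no parameter server.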
Conceptual reframe. Perhaps the deepest contribution isn’t the algorithm but the shift in perspective. Instead of thinking of pre-training as a “starting point” for optimization, think of it as a distribution over parameter vectors whose support already contains specialists. This framing opens doors to an entirely new line of research.
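In my own notation (not the paper’s), the reframe reads: sample around the pre-trained weights, keep the best-scoring samples, and let them vote.

```latex
% Sampling around the pre-trained weights:
\theta_i = \theta_{\mathrm{pre}} + \epsilon_i,
\qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I),
\qquad i = 1, \dots, N
% Selection and majority-vote inference:
\mathcal{E} = \operatorname{top}_K\bigl(\{\theta_i\}_{i=1}^{N};\ \mathrm{score}_{\mathrm{val}}\bigr),
\qquad \hat{y}(x) = \operatorname{majority}\{\, f_{\theta}(x) : \theta \in \mathcal{E} \,\}
```

The “thicket” claim, in this framing, is that for large enough models the probability mass of task-improving samples under that Gaussian is substantial, so plain sampling suffices where smaller models would need gradient search.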
An Honest Warning
I wouldn’t be myself if I didn’t include the caveats.
The paper admits that RandOpt gains appear to saturate with increasing model size and perturbation count. There’s a ceiling. It doesn’t learn dramatically new skills that aren’t at least latent in the pre-training — it unlocks what already exists.
And the benchmarks, as impressive as they are, are still… benchmarks. Performance on GSM8K isn’t the same as performance in a real production use case. The code is available on GitHub, so anyone can test in practice — and I intend to do so.
On a brighter note, image diffusion models showed the Neural Thickets phenomenon too — certain regions of parameter space tended to generate images with specific color tones or visual styles. This suggests the phenomenon is more general than just language, which is encouraging.
Conclusion: The AI Already Knows, We Just Need to Ask the Right Way
This research made me rethink something fundamental: perhaps we’re underestimating the power of pre-training.
If the knowledge is already there, the challenge of the next decade won’t just be “teaching” AI but finding the right keys to unlock the potential it already possesses. And if those keys are as simple as Gaussian noise and majority voting… that changes everything.
I keep wondering: will the future of AI focus on smaller, more specialized models extracted from a single giant model? Or will we continue the race for ever-larger ones? Or perhaps — and this is the possibility that excites me most — both paths converge, and the “large model vs. small model” dichotomy simply ceases to exist.
For now, I’m going to clone the RandOpt repository and play with it. If the premise is right, this could fundamentally change how I think about deploying specialized models.
What if the specialist you’re looking for is already inside the model, and you just need to poke it in the right place?
Share if this intrigued you:
- Email: fodra@fodra.com.br
- LinkedIn: linkedin.com/in/mauriciofodra
The best pre-training isn’t one that creates a generalist — it’s one that hides a thousand specialists waiting to be found.
Read Also
- Neural Networks: Understanding the Brain Behind Modern AI — The fundamentals that make Neural Thickets possible: how weights and parameters work.
- The AI Explosion in 2026: Real Evolution or Algorithmic ‘Cheating’? — If RandOpt works, are we closer to real recursive improvement than we thought?
- The Illusion of Intelligence: Why AI Still ‘Freezes’ When Facing the New — RandOpt unlocks existing specialists, but can it go beyond what pre-training taught?