
Virtual Experts: Injecting Python into MoE Routing - Inside GPT-OSS

Channel: Chris Hay
Date: January 19, 2026
Duration: 10:21
Tags: MoE, Virtual Experts, LLM Architecture, MLX
TL;DR

Chris Hay demonstrates a novel approach to improving LLM math accuracy by injecting "virtual experts" (Python functions) directly into the Mixture of Experts (MoE) routing architecture. Instead of external tool calling, the model's router intercepts arithmetic tasks in early layers and routes them to a Python expert that executes actual code - making the model natively accurate at math while potentially running at half the size.

Summary

The Problem: Neural Networks Can't Do Math

Chris opens with a striking demo: asking GPT-OSS-20B (a 20 billion parameter model) to calculate 127 x 89. The model confidently answers 11263 - wrong. The correct answer is 11303. This isn't a confused model; it's confident and incorrect.
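For comparison, exact integer arithmetic is trivial outside the network; a one-line Python check confirms the product the model missed:

```python
# Exact integer arithmetic -- the operation the neural experts only approximate.
a, b = 127, 89
print(a * b)  # 11303 (the model answered 11263)
```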

Understanding MoE Architecture

Modern models like GPT-OSS-20B are Mixture of Experts (MoE) models, unlike older dense models (LLaMA 2, GPT-2, GPT-3) where every parameter is active during inference. In MoE, each feed-forward block is split into many expert sub-networks, and a learned router activates only a small subset of them (the top-k) for each token, so most of the model's parameters sit idle on any given forward pass.
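The routing step can be sketched in a few lines. This is a minimal toy, not GPT-OSS code (the dimensions, expert count, and `moe_forward` helper are illustrative), but the mechanics are the standard MoE recipe: router scores, top-k selection, and a softmax-weighted mix of only the active experts:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=4):
    """Route one token's hidden state to its top-k experts.

    x        : (d,) hidden state for a single token
    router_w : (n_experts, d) router projection
    experts  : list of callables, each mapping (d,) -> (d,)
    """
    logits = router_w @ x                      # one score per expert
    top_k = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                   # softmax over the selected k only
    # Only the chosen experts execute; the rest contribute nothing.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Toy demo: 8 experts, 4 active per token.
rng = np.random.default_rng(0)
d, n = 16, 8
experts = [lambda x, W=rng.normal(size=(d, d)) / d: W @ x for _ in range(n)]
out = moe_forward(rng.normal(size=d), rng.normal(size=(n, d)), experts)
print(out.shape)
```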

The Virtual Expert Concept

The key insight: attention mechanisms already do routing based on context. If attention can route "2 plus 3" to expert 8, why does expert 8 need to be neural? Chris's answer: it doesn't.

"Who says we need a neural expert in the first place? Why can't we just have a tool built into the model?"

How It Works

Using his Lazarus introspection tool, Chris shows that:

  1. As early as layer 0, the model classifies the prompt as a multiplication task
  2. The confidence is locked in - the model knows what it's dealing with
  3. At this point, Chris hijacks the routing and sends it to a Python virtual expert
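A minimal sketch of the hijack idea, with hypothetical `python_expert` and `route` helpers (the real interception happens inside the model's routing layer on hidden states, not at the prompt level as shown here):

```python
import re

def python_expert(prompt):
    """A 'virtual expert': exact arithmetic instead of neural weights.
    Returns None when the task isn't one it handles."""
    m = re.search(r"(\d+)\s*[x*]\s*(\d+)", prompt)
    if m:
        return int(m.group(1)) * int(m.group(2))
    return None

def route(prompt, neural_forward):
    """If the arithmetic classifier fires, bypass the neural path entirely."""
    result = python_expert(prompt)
    if result is not None:
        return str(result)          # exact answer; remaining layers skipped
    return neural_forward(prompt)   # everything else takes the normal path

print(route("What is 127 x 89?", lambda p: "<neural output>"))  # 11303
```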

Two implementation methods exist:

Half-Size Models with Better Math

In a teaser for an upcoming video, Chris demonstrates expert pruning: removing experts shrinks the model to roughly half its size, but the already-shaky math gets worse.

Solution: add the virtual Python expert back. The half-size model now gets math right while running in significantly less memory.
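A toy sketch of the combination, with made-up names (`halve_model`, `virtual_experts`); real pruning operates on MoE weight tensors, and the keep-every-other-expert rule here is purely illustrative:

```python
def halve_model(expert_weights, virtual_experts):
    """Drop half the neural experts, then register parameter-free
    virtual experts to cover the capabilities that pruning breaks.

    expert_weights : dict expert_id -> parameter blob
    """
    kept = {i: w for i, w in expert_weights.items() if i % 2 == 0}
    return {"neural": kept, "virtual": virtual_experts}

model = halve_model(
    {i: f"weights_{i}" for i in range(32)},
    {"multiply": lambda a, b: a * b},  # exact arithmetic, zero parameters
)
print(len(model["neural"]))                   # 16 neural experts remain
print(model["virtual"]["multiply"](127, 89))  # 11303
```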

Virtual Experts vs Traditional Tool Calling

| Traditional Tool Calling | Virtual Experts |
| --- | --- |
| External to model | Native to architecture |
| Token generation -> parsing -> tool call -> inject result | Happens within routing layer |
| Must run through all layers | Can bypass remaining layers |
| Extra latency | No additional overhead |

Beyond Math

Virtual experts aren't limited to calculations:

Notable Quotes

"127 x 89. Let's ask a 20 billion parameter model. Wrong. The answer is 11303. It's not confused. It's confident. And it's wrong."

"This isn't a hack for fixing math. It's a new primitive."

"768 experts, 20 billion parameters, and the math is wrong. One virtual expert with Python and it gets it right. And the best thing is the model doesn't know."

Chapters

| Time | Topic |
| --- | --- |
| 00:00 | Expert Math Failure |
| 00:13 | Virtual Math Expert Demo |
| 01:12 | Understanding Virtual Experts |
| 02:00 | Math Classification and Routing |
| 03:30 | Math Demos |
| 04:20 | Different Expert Types |
| 05:04 | Halving GPT-OSS-20B |
| 08:23 | Tool Calling vs Virtual Experts |
| 09:20 | Future Thoughts |
