
Virtual Experts: Injecting Python into MoE Routing - Inside GPT-OSS

Channel: Chris Hay
Date: January 19, 2026
Duration: 10:21
Tags: MoE, Virtual Experts, LLM Architecture, MLX
TL;DR

Chris Hay demonstrates a novel approach to improving LLM math accuracy by injecting "virtual experts" (Python functions) directly into the Mixture of Experts (MoE) routing architecture. Instead of external tool calling, the model's router intercepts arithmetic tasks in early layers and routes them to a Python expert that executes actual code - making the model natively accurate at math while potentially running at half the size.

Summary

The Problem: Neural Networks Can't Do Math

Chris opens with a striking demo: asking GPT-OSS-20B (a 20 billion parameter model) to calculate 127 x 89. The model confidently answers 11263 - wrong. The correct answer is 11303. This isn't a confused model; it's confident and incorrect.
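For comparison, exact integer arithmetic is trivial outside the network; a one-line Python check confirms the product the model missed:

```python
# Exact integer arithmetic -- the operation the neural experts only approximate.
a, b = 127, 89
print(a * b)  # 11303 (the model answered 11263)
```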

Understanding MoE Architecture

Modern models like GPT-OSS-20B are Mixture of Experts (MoE) models, unlike older dense models (LLaMA 2, GPT-2, GPT-3) where every parameter is active during inference. In MoE, each feed-forward block is split into many expert sub-networks, and a learned router activates only a small subset of them (the top-k) for each token, so most of the model's parameters sit idle on any given forward pass.
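The routing step can be sketched in a few lines. This is a minimal toy, not GPT-OSS code (the dimensions, expert count, and `moe_forward` helper are illustrative), but the mechanics are the standard MoE recipe: router scores, top-k selection, and a softmax-weighted mix of only the active experts:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=4):
    """Route one token's hidden state to its top-k experts.

    x        : (d,) hidden state for a single token
    router_w : (n_experts, d) router projection
    experts  : list of callables, each mapping (d,) -> (d,)
    """
    logits = router_w @ x                      # one score per expert
    top_k = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                   # softmax over the selected k only
    # Only the chosen experts execute; the rest contribute nothing.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Toy demo: 8 experts, 4 active per token.
rng = np.random.default_rng(0)
d, n = 16, 8
experts = [lambda x, W=rng.normal(size=(d, d)) / d: W @ x for _ in range(n)]
out = moe_forward(rng.normal(size=d), rng.normal(size=(n, d)), experts)
print(out.shape)
```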

The Virtual Expert Concept

The key insight: attention mechanisms already do routing based on context. If attention can route "2 plus 3" to expert 8, why does expert 8 need to be neural? Chris's answer: it doesn't.

"Who says we need a neural expert in the first place? Why can't we just have a tool built into the model?"

How It Works

Using his Lazarus introspection tool, Chris shows that:

  1. As early as layer 0, the model classifies the prompt as a multiplication task
  2. The confidence is locked in - the model knows what it's dealing with
  3. At this point, Chris hijacks the routing and sends it to a Python virtual expert
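A minimal sketch of the hijack idea, with hypothetical `python_expert` and `route` helpers (the real interception happens inside the model's routing layer on hidden states, not at the prompt level as shown here):

```python
import re

def python_expert(prompt):
    """A 'virtual expert': exact arithmetic instead of neural weights.
    Returns None when the task isn't one it handles."""
    m = re.search(r"(\d+)\s*[x*]\s*(\d+)", prompt)
    if m:
        return int(m.group(1)) * int(m.group(2))
    return None

def route(prompt, neural_forward):
    """If the arithmetic classifier fires, bypass the neural path entirely."""
    result = python_expert(prompt)
    if result is not None:
        return str(result)          # exact answer; remaining layers skipped
    return neural_forward(prompt)   # everything else takes the normal path

print(route("What is 127 x 89?", lambda p: "<neural output>"))  # 11303
```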

Two implementation methods exist:

Half-Size Models with Better Math

In a teaser for an upcoming video, Chris demonstrates expert pruning: removing experts shrinks the model to roughly half its size, but the already-shaky math gets worse.

Solution: add the virtual Python expert back. The half-size model now gets math right while running in significantly less memory.
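A toy sketch of the combination, with made-up names (`halve_model`, `virtual_experts`); real pruning operates on MoE weight tensors, and the keep-every-other-expert rule here is purely illustrative:

```python
def halve_model(expert_weights, virtual_experts):
    """Drop half the neural experts, then register parameter-free
    virtual experts to cover the capabilities that pruning breaks.

    expert_weights : dict expert_id -> parameter blob
    """
    kept = {i: w for i, w in expert_weights.items() if i % 2 == 0}
    return {"neural": kept, "virtual": virtual_experts}

model = halve_model(
    {i: f"weights_{i}" for i in range(32)},
    {"multiply": lambda a, b: a * b},  # exact arithmetic, zero parameters
)
print(len(model["neural"]))                   # 16 neural experts remain
print(model["virtual"]["multiply"](127, 89))  # 11303
```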

Virtual Experts vs Traditional Tool Calling

| Traditional Tool Calling | Virtual Experts |
| --- | --- |
| External to model | Native to architecture |
| Token generation -> parsing -> tool call -> inject result | Happens within routing layer |
| Must run through all layers | Can bypass remaining layers |
| Extra latency | No additional overhead |

Beyond Math

Virtual experts aren't limited to calculations:

Notable Quotes

"127 x 89. Let's ask a 20 billion parameter model. Wrong. The answer is 11303. It's not confused. It's confident. And it's wrong."

"This isn't a hack for fixing math. It's a new primitive."

"768 experts, 20 billion parameters, and the math is wrong. One virtual expert with Python and it gets it right. And the best thing is the model doesn't know."

Chapters

| Time | Topic |
| --- | --- |
| 00:00 | Expert Math Failure |
| 00:13 | Virtual Math Expert Demo |
| 01:12 | Understanding Virtual Experts |
| 02:00 | Math Classification and Routing |
| 03:30 | Math Demos |
| 04:20 | Different Expert Types |
| 05:04 | Halving GPT-OSS-20B |
| 08:23 | Tool Calling vs Virtual Experts |
| 09:20 | Future Thoughts |
