For two years, running models locally meant accepting a 30–40 point penalty on coding benchmarks. Llama derivatives scored in the 20s on SWE-bench while Claude and GPT-4 hit 50s and 60s. The gap was too wide. Local was for experimentation, not work.
GLM-4.7-Flash closes that gap to 14.6 points.
Zhipu AI's model scores 59.2% on SWE-bench Verified — real GitHub issue fixes, not synthetic benchmarks. The full cloud GLM-4.7 (358B parameters) hits 73.8%. That remaining gap matters for complex refactors. It doesn't matter for code review, test generation, or documentation. Most daily coding work falls into the second category.
The Architecture That Makes It Work
30 billion parameters, 3 billion active per token. Mixture-of-Experts with 64 specialists, only 4–5 of them activating per token. This is how a model runs at 82 tokens/second on an M4 Max MacBook Pro without melting the hardware.
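GLM's router internals aren't published, but top-k expert routing generally works like this sketch (pure NumPy, toy dimensions; `moe_layer` and the linear "experts" are illustrative, not the model's actual code). The point is the compute savings: only the k selected experts run per token, so cost scales with k, not with the total expert count.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=4):
    """Route one token through a top-k Mixture-of-Experts layer.

    Only k of the expert networks execute per token, so compute
    scales with k, not with len(experts).
    """
    logits = x @ gate_w                      # router score for each expert
    top_k = np.argsort(logits)[-k:]          # indices of the k best experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                 # softmax over the chosen k only
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, n_experts = 16, 64
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
# Toy experts: each is a small linear map; only 4 of the 64 ever run.
mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda v, m=m: m @ v for m in mats]
y = moe_layer(x, gate_w, experts, k=4)
print(y.shape)  # (16,)
```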
The Setup
The falcon-demo repo has the complete integration. Three commands:
```sh
./scripts/install.sh
source ~/.claude-local-config
claude --model glm-4.7-flash
```
The install script handles Ollama 0.14.3+ deployment (earlier versions throw 412 errors), the 19GB model download, and environment configuration. Ollama's Anthropic API compatibility layer routes Claude Code requests to localhost:11434/v1/messages.
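You can hit the compatibility endpoint directly, too. A minimal sketch, assuming Ollama accepts the standard Anthropic Messages payload at that path (field names follow Anthropic's API; the prompt is a placeholder):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/messages"  # Anthropic-compatible endpoint

def build_request(prompt, model="glm-4.7-flash", max_tokens=1024):
    """Build a Messages API request for Ollama's compatibility layer."""
    body = json.dumps({
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return request.Request(
        OLLAMA_URL,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )

req = build_request("Explain this regex: ^\\d{4}-\\d{2}$")
print(req.full_url, req.get_method())
# Send only with the Ollama server running:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```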
Requirements: macOS Apple Silicon or Linux, 24GB+ RAM, Node.js 18+.
Limitations: Ollama's compatibility layer lacks token counting, prompt caching, URL-based images (base64 only), PDF documents, and Batch API.
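The base64-only image restriction means local screenshots have to be inlined rather than referenced by URL. A sketch of building the image content block, assuming the layer follows Anthropic's base64 source format:

```python
import base64

def image_block(path, media_type="image/png"):
    """Inline a local image as a base64 content block (URL sources unsupported)."""
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }

# Demo with a 4-byte placeholder file; a real run would use an actual screenshot.
with open("/tmp/pixel.png", "wb") as f:
    f.write(b"\x89PNG")
block = image_block("/tmp/pixel.png")
print(block["source"]["data"])  # iVBORw==
```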
Model weights are MIT-licensed on HuggingFace. Audit them, fork them, run air-gapped. No legal review required.
The Math Shifted
Cloud APIs: $20–200/month, recurring. Local hardware: a one-time $700–800 (used RTX 3090) or $3,500 (M4 Max), then nothing but electricity. Break-even: 6–18 months.
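The break-even arithmetic, with two illustrative combinations (the dollar figures are examples inside the ranges above, not measured spend):

```python
def break_even_months(hardware_cost, monthly_api_cost):
    """Months of cloud API spend needed to cover a one-time hardware purchase."""
    return hardware_cost / monthly_api_cost

print(break_even_months(750, 100))   # used RTX 3090, moderate API user -> 7.5
print(break_even_months(3500, 200))  # M4 Max, heavy API user -> 17.5
```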
But the interesting cases aren't about cost. They're about access.
HIPAA-regulated medical devices. Financial platforms with data residency requirements. Defense contractors on air-gapped networks. Government agencies with security mandates. These organizations couldn't use AI coding tools because sending code to external APIs violated policy. That constraint just disappeared.
What Changed
Mixture-of-Experts isn't new. Local inference isn't new. What's new is a 59.2% SWE-bench score running on consumer hardware under MIT license. The previous best local options scored in the 30s and 40s — useful for prototyping, not production.
GLM-4.7-Flash isn't as good as Claude Sonnet 4.5 or GPT-5. It won't be for a while. But "good enough for daily work" is a different bar than "state of the art," and this model clears it.
The capability floor for local AI coding just rose significantly. That matters more than the ceiling.
References
- falcon-demo — Claude Code + Ollama setup
- GLM-4.7-Flash — Model weights (MIT license)
- Zhipu AI announcement — Benchmarks and architecture details