Running Gemma-4 26B at 124 tokens/sec on a CPU, no GPU

I wanted to see how fast a 26B mixture-of-experts model runs on a normal desktop with no graphics card. Just the CPU: an i9-13900K, 64GB of plain DDR5. I went looking because one question keeps deciding the architecture of the robot brain I'm building: how fast can a model think on hardware you don't rent? The answer turned into its own project.

About 40 tokens/sec single-stream, lossless, or about 124 if you batch a few requests. For a 26B model with no GPU, that's a little wild.

One thing up front, because it surprised me. The instinct with a mixture of experts is to quantize the experts, because that's where the parameters are. But I counted the bytes you actually read per token, and the experts are 16% of them. The output head, the projection to the 262K-token vocabulary, is 32%. For this model you compress the head, not the experts.

Recording: the run itself, on the i9. One stream generates live, then the batched sweep fills in past 120 tok/s. No GPU, and the closing numbers are measured from this very run.

the setup

Gemma-4-26B-A4B is 26B parameters but only ~3.8B fire per token, because it's a mixture of experts: 128 of them, 8 used at a time. That sparsity is the only reason this is workable on a CPU at all. And "lossless" here is literal: every trick either skips work the model would have thrown away, or guesses ahead and has the full model check the guess. The tokens that come out are exactly Q4_0's tokens, just faster. (One lever, running fewer experts, is an approximation; I'll show how I checked it.)

where the time goes

Plain Q4_0, one token at a time: 25 tokens/sec, and it didn't move whether I gave it 8 threads or 24. That's the tell. If cores don't help, you're waiting on memory.

tokens/sec = memory bandwidth ÷ bytes read per token

To make one token you read the active model out of RAM, once. Everything below is a fight over those two numbers: faster RAM, or fewer bytes.
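A back-of-envelope version of that equation, as a minimal Python sketch. The function names are mine, and pairing the 25 tok/s plain-decode figure with the ~48 GB/s decode bandwidth measured further down is an assumption, since the two may not have been measured under the same configuration.

```python
def tokens_per_sec(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Memory-bound ceiling: bandwidth divided by bytes read per token."""
    return bandwidth_gb_s / gb_read_per_token


def implied_gb_per_token(bandwidth_gb_s: float, observed_tok_s: float) -> float:
    """Same equation inverted: how much each token must be pulling from RAM."""
    return bandwidth_gb_s / observed_tok_s


# Assumed pairing: ~48 GB/s used by decode, 25 tok/s on plain Q4_0.
gb_per_token = implied_gb_per_token(48.0, 25.0)
print(f"implied read per token: {gb_per_token:.2f} GB")

# At fixed bandwidth, halving the bytes read doubles the ceiling.
print(f"ceiling at half the bytes: {tokens_per_sec(48.0, gb_per_token / 2):.0f} tok/s")
```

The point of the inversion is that either side of the division can be checked against a measurement, which is how the rest of the post proceeds.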

two levers, almost free

Speculative decoding. Gemma ships a small official drafter that guesses the next few tokens; the big model verifies them all in a single pass instead of one at a time. Good guesses mean several tokens per pass. That's 25 → 40 tok/s, still exactly lossless, because the big model signs off on every token.
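Why the output stays identical is easiest to see in a toy. The sketch below is generic greedy speculative decoding, not the engine's actual code: `toy_target` and `toy_draft` are made-up stand-ins for the big model and the drafter, and the inner verify loop only imitates what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # toy stand-in: context -> greedy next token


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (new tokens, verify passes used)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The drafter guesses k tokens ahead.
        guess: List[int] = []
        for _ in range(k):
            guess.append(draft(seq + guess))
        # 2. One verify pass: the target's own pick at every drafted position.
        passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            want = target(seq + guess[:i])
            accepted.append(want)  # always the target's token, never the guess
            if i == k or guess[i] != want:
                break  # first mismatch: discard the remaining guesses
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], passes


def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    # Hypothetical drafter: agrees with the target except at every third position.
    right = toy_target(seq)
    return right if len(seq) % 3 else (right + 1) % 11


baseline = greedy_decode(toy_target, [1], 12)
fast, passes = speculative_decode(toy_target, toy_draft, [1], 12)
print("identical output:", fast == baseline, "| verify passes:", passes)
```

Every token appended comes from the target, so a bad drafter can only cost speed, never change the text.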

Running 3 of the 8 experts. This one nearly scared me off: perplexity on raw Wikipedia jumped 1.6x. But raw-text perplexity is meaningless for a chat model: it scores terribly there no matter what, and that broken regime blows small differences out of proportion. So I read the actual outputs instead, and top-3 and top-8 give the same answers on real prompts. Free speed, confirmed by looking rather than trusting the number.
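For readers who haven't seen what "3 of the 8" means mechanically, here is a minimal top-k routing sketch. It assumes the router softmaxes over the selected logits only, which is one common convention and not something confirmed for Gemma; the logits are made up.

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalise their mixing weights.

    Assumption: softmax over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Made-up router scores for 128 experts.
logits = [3.0 * math.sin(1.7 * i) for i in range(128)]
top8 = top_k_route(logits, 8)
top3 = top_k_route(logits, 3)

# Top-3 is the head of the top-8 list: same experts, fewer of them read.
print([i for i, _ in top3], [i for i, _ in top8])
# Share of the top-8 mixing weight that the three strongest experts carried.
print(f"{sum(w for _, w in top8[:3]):.2f}")
```

Dropping from 8 to 3 keeps the strongest experts and skips reading the other five, which is why it is an approximation rather than a lossless trick.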

the surprise: the head, not the experts

Then I stopped guessing and counted the bytes per token, straight from the model file. The thing people skip is that bytes-per-token is not size-on-disk. For a mixture of experts they're wildly different: you only read the experts that fire, 3 of 128, but you read the whole head on every token.

Share of all weights, on disk: experts ~88%, everything else the rest.

Bytes actually read, per token: always-on 52%, head 32%, experts 16%.

Experts: most of the model on disk, smallest slice per token. The head is the inverse, tiny on disk, read in full every step.

So the head is the thing to cut. It sits at 6.5 bits per weight; I dropped it to 2.4 and couldn't find any damage: same answers on every prompt. That's 606 MB down to 225 MB on the single most-read tensor in the model. 2.4 is the floor, not a step on the way down: at 1.75 bits it got slower (more time unpacking the tighter format than saved reading it), and the output started looping in the arithmetic.
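The arithmetic behind those numbers, as a rough sketch. It assumes the 32% per-token share refers to the 606 MB head, which the post implies but doesn't state outright; everything else is derived from that.

```python
def scaled_size_mb(size_mb: float, old_bits: float, new_bits: float) -> float:
    """Tensor size after changing bits per weight (ignores per-format overhead)."""
    return size_mb * new_bits / old_bits


# Per-token read shares from the breakdown above.
SHARES = {"always-on": 0.52, "head": 0.32, "experts": 0.16}
HEAD_MB = 606.0  # assumed to be the tensor behind the 32% share

total_mb = HEAD_MB / SHARES["head"]
budget = {name: total_mb * share for name, share in SHARES.items()}

# Swap in the compressed head and see what the per-token read becomes.
new_budget = dict(budget, head=225.0)
new_total = sum(new_budget.values())
saved = 1 - new_total / total_mb

print(f"6.5 -> 2.4 bits predicts {scaled_size_mb(HEAD_MB, 6.5, 2.4):.0f} MB")
print(f"per-token read: {total_mb:.0f} MB -> {new_total:.0f} MB ({saved:.0%} fewer bytes)")
for name, mb in new_budget.items():
    print(f"  {name}: {mb / new_total:.0%}")
```

The pure bit-ratio estimate lands within a couple of MB of the measured 225 MB, and compressing the head alone removes about a fifth of the bytes read per token under this assumption.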

what didn't work

Two dead-ends, left in because they're the useful part. First, quantizing harder is supposed to compound. A 32%-smaller model is 18% faster on plain decode. But stack it with speculative decoding and it buys exactly nothing: both fight over memory bandwidth, and spec already won it. (I logged a 43.6 once, got excited, re-ran it, and it was noise around 41.) Second, you can't shrink the experts much anyway: their down-projection width doesn't fit the low-bit block formats, so they fall back to 4 bits. It doesn't matter, since they're 16% of the bytes; forcing it would buy about 5%.

the wall, and the way around it

Is 40 a real wall or a lazy engine? I measured the RAM's actual bandwidth and how much decode uses of it.

DDR5-4800, on paper: 76.8 GB/s
Actually achievable: 64.5 GB/s
Decode uses: ~48 GB/s

Decode already runs at ~74% of what the RAM can deliver (~48 of 64.5 GB/s). Pinning threads, changing core counts, quantizing more: none of it moved that. The gap comes from how a MoE reads memory, in scattered accesses, not from a lazy engine.
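Where the on-paper figure comes from, as a small sketch. It assumes the usual desktop layout of two memory channels with 64 data bits each; the helper name is mine.

```python
def ddr_peak_gb_s(mega_transfers_per_s: float, channels: int = 2,
                  bus_bytes: int = 8) -> float:
    """Theoretical peak: transfers/sec x bytes per transfer x channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


peak = ddr_peak_gb_s(4800)          # DDR5-4800, dual channel
achievable_fraction = 64.5 / peak   # measured ceiling vs. paper ceiling

print(f"paper peak: {peak:.1f} GB/s")
print(f"measured ceiling is {achievable_fraction:.0%} of paper")
```

The same formula shows why the remaining headroom is a hardware question: the only inputs are transfer rate and channel count.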

So single-stream is genuinely near the wall. But the equation is per token, for one stream. Serve a few at once and you read each weight once for all of them: the activations turn from a vector into a matrix, the bottleneck moves from memory to compute, and the cores that sat idle finally have work.
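One way to see the shift is arithmetic intensity: how much compute each byte of weights buys. A minimal sketch, assuming a dense weight matrix at Q4_0's 4.5 bits per weight (32 weights packed in 18 bytes). In a MoE, concurrent requests can route to different experts, so the real reuse on expert weights will be lower than this idealised figure.

```python
def flops_per_weight_byte(batch: int, bits_per_weight: float) -> float:
    """FLOPs per byte of weights read for Y = W @ X, X having `batch` columns.

    Each weight does one multiply and one add per column, and is read once.
    """
    bytes_per_weight = bits_per_weight / 8
    return 2 * batch / bytes_per_weight


for batch in (1, 4, 8, 16, 32):
    print(f"batch {batch:>2}: {flops_per_weight_byte(batch, 4.5):6.1f} FLOPs/byte")
```

Intensity grows linearly with batch size while the bytes read stay fixed, so at some batch size the cores, not the memory bus, become the limit.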

Batch sizes swept: 1, 4, 8, 16, 32. Labeled points on the chart: 114 and 124 tok/s.

Aggregate tok/s by batch size. It crosses 100 at batch 16 and climbs to 124. Single-stream, meanwhile, never moves off ~40.

And where threads did nothing for single-stream, here they help: batch-16 goes from 93 at eight threads to 114 at twenty-four, because now it's compute-bound. Same chip, two different limits. This is continuous batching, the trick vLLM made famous on GPUs; it just isn't usually pointed at a CPU. The catch is that it's aggregate, not per-stream: at batch 32 each request gets about 4 tok/s. Right for a server with concurrent load, wrong for one person waiting on one answer.
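The aggregate-versus-per-stream catch is a single division, sketched here with the two batch points quoted above. It assumes every request in the batch gets an equal share, which a real scheduler won't guarantee.

```python
def per_stream_tok_s(aggregate_tok_s: float, batch: int) -> float:
    """What one request sees if the aggregate rate is split evenly."""
    return aggregate_tok_s / batch


for batch, aggregate in ((16, 114.0), (32, 124.0)):
    print(f"batch {batch}: {aggregate:.0f} tok/s total, "
          f"{per_stream_tok_s(aggregate, batch):.1f} tok/s per request")
```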

where it lands

Single-stream latency: ~40 tok/s
Aggregate throughput: ~124 tok/s

100 tok/s on a no-GPU desktop is done, as throughput. Single-stream sits at 40 because that's where the memory bus tops out, and pushing past it is a hardware question (faster RAM or more memory channels), not a software one. Two ceilings, 3x apart, on the same chip.

None of the pieces are mine. Speculative decoding off Google's drafter, the low-bit kernels from ik_llama.cpp, all on llama.cpp. What I did was measure where the walls are, dead-ends and all, and write it down.

It all runs on public models, and the recipe, every number, and the scripts to reproduce it on your own machine are on GitHub: arun-prasath2005/gemma4-cpu-moe.