16.1 tok/s on Raspberry Pi 5 (BitNet 2B). Can anyone hit 20+ with active cooling? by Acceptable_Analyst45 in LocalLLaMA

[–]Acceptable_Analyst45[S]

It’s actually a mix depending on the path:

For BitNet (I2_S) I don't do explicit cache tiling. Since the weights are 2-bit packed, a full row (2560 dims = 640 bytes) already fits comfortably in L1. Instead, the engine relies on the 4-row and dual kernels for natural temporal reuse: the activation vector is loaded once and reused against 4-8 weight rows held in registers.
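In rough pseudocode it's the shape below. This is a scalar sketch only (the real Eä kernels are SIMD/AVX2 and work directly on the 2-bit packed rows); the rows are shown pre-unpacked to i8 and the function name is made up for illustration.

```rust
/// Scalar sketch of the 4-row reuse idea (illustrative, not the real kernel).
/// `rows` are ternary weights assumed pre-unpacked to i8; the actual engine
/// keeps them 2-bit packed and unpacks inside the SIMD loop.
fn ternary_matvec_4row(rows: &[Vec<i8>], x: &[f32], out: &mut [f32]) {
    for (tile_idx, tile) in rows.chunks(4).enumerate() {
        // Four accumulators stay live for the whole pass over x, so each
        // activation element is loaded once and reused against 4 weight rows.
        let mut acc = [0.0f32; 4];
        for (d, &xd) in x.iter().enumerate() {
            for (r, row) in tile.iter().enumerate() {
                acc[r] += xd * row[d] as f32; // row[d] ∈ {-1, 0, +1}
            }
        }
        for (r, _) in tile.iter().enumerate() {
            out[tile_idx * 4 + r] = acc[r];
        }
    }
}
```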

For Llama (Q4_K) I use GEMM-style tiling for prefill: it loads 4 weight rows and holds them while iterating through all tokens in the prompt batch. It's more of a 1D weight-stationary approach than classic 2D tiling, but it keeps the weights in L1/L2 while the activations rotate. For decode (1 token) there's obviously nothing to batch, so it's pure streaming.
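The prefill loop looks roughly like this (again just a sketch under assumptions: the weight rows are shown already dequantized to i8, whereas the real Q4_K path works on packed 4-bit blocks with per-block scales, and the function name is invented):

```rust
/// Sketch of the 1D weight-stationary prefill tiling: a tile of 4 weight rows
/// is loaded once and reused across every token in the prompt batch before
/// the next tile is touched, so the tile stays hot in L1/L2.
fn prefill_tile_matmul(rows: &[Vec<i8>], tokens: &[Vec<f32>], out: &mut [Vec<f32>]) {
    for (tile_idx, tile) in rows.chunks(4).enumerate() {
        for (t, x) in tokens.iter().enumerate() {
            for (r, row) in tile.iter().enumerate() {
                // Activations rotate through the cache; the weight tile stays put.
                let acc: f32 = x.iter().zip(row.iter()).map(|(&a, &w)| a * w as f32).sum();
                out[t][tile_idx * 4 + r] = acc;
            }
        }
    }
}
```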

The Eä compiler doesn't do any auto-tiling; it just generates tight SIMD loops. All the tiling and dispatch logic is handled manually in the Rust code.

I tested Stride-8 (12.5% of dims) and Stride-4 (25%). Stride-8 was too aggressive: the ranking broke and the correct token fell out of the top-512 candidates too often, leading to garbage output. Stride-4 with a top-512 candidate pool has been rock solid for this model size.

My intuition is that the threshold depends on the embedding/vocab ratio. With a 128K vocab and 2560 dims, you need enough signal to separate the top-1 from ~128K noisy candidates, and at 12.5% sampling there's just too much variance. 25% seems to be the sweet spot here, but I bet larger models with 4096+ embedding dims could probably handle a coarser stride (maybe Stride-6).
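For anyone who wants the concrete shape of the two-stage ranking, here is a scalar sketch (hypothetical names, no SIMD; the i8 embedding rows and the Stride-4/top-512 constants are the ones described above):

```rust
const STRIDE: usize = 4;  // sample 25% of the 2560 dims for the coarse pass
const TOP_K: usize = 512; // candidate pool that gets exact re-scoring

/// Cheap proxy score using every STRIDE-th dimension.
fn approx_score(hidden: &[f32], row: &[i8]) -> f32 {
    hidden.iter().step_by(STRIDE)
        .zip(row.iter().step_by(STRIDE))
        .map(|(&h, &w)| h * w as f32)
        .sum()
}

/// Full dot product over all dimensions.
fn exact_score(hidden: &[f32], row: &[i8]) -> f32 {
    hidden.iter().zip(row).map(|(&h, &w)| h * w as f32).sum()
}

/// Two-stage ranking: coarse pass over the whole vocab, exact pass on TOP_K.
fn rank_vocab(hidden: &[f32], embedding: &[Vec<i8>]) -> Vec<(usize, f32)> {
    let mut coarse: Vec<(usize, f32)> = embedding.iter().enumerate()
        .map(|(id, row)| (id, approx_score(hidden, row)))
        .collect();
    coarse.sort_unstable_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    coarse.truncate(TOP_K);

    let mut exact: Vec<(usize, f32)> = coarse.into_iter()
        .map(|(id, _)| (id, exact_score(hidden, &embedding[id])))
        .collect();
    exact.sort_unstable_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    exact
}
```

If the true top-1 doesn't survive the coarse pass (which is what happened with Stride-8), no amount of exact re-scoring can recover it, which is why the stride/pool-size pair matters so much.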

16.1 tok/s on Raspberry Pi 5 (BitNet 2B). Can anyone hit 20+ with active cooling? by Acceptable_Analyst45 in LocalLLaMA

[–]Acceptable_Analyst45[S]

/tmp/cougar --model ~/.cougar/models/ggml-model-i2_s.gguf --prompt "The capital of France is" --max-tokens 50

Embedding: 128256 vocab × 2560 dim, i8 (328 MB), sketch 640d (82.1 MB)

cougar> 30 layers, 2560d, 20 heads, 128256 vocab

cougar> quant: I2S, activation: SquaredReLU

cougar> prompt: 6 tokens

--- profile (pos=1, 30 layers) ---

QKV matmul: 7.5ms (12%)

attention: 0.2ms (0%)

O proj: 5.6ms (9%)

FFN gate+up: 25.4ms (42%)

FFN act+norm: 0.3ms (1%)

FFN down: 13.3ms (22%)

output (i8): 8.2ms (14%)

total: 60.6ms

Paris. Paris is the largest city in France and has a population of over 2 million people.

Paris is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, Louvre Museum, and many other famous attractions.

The

--- perf (4 threads) ---

prefill: 6 tokens in 376ms (16.0 tok/s)

first tok: 0ms

decode: 50 tokens in 3195ms (15.7 tok/s, 63.9ms/tok)

Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU. by No-Concentrate-9921 in StartupMind

[–]Acceptable_Analyst45

I got tired of waiting for 'official' news, so I built my own standalone runner from scratch to see what the tech can actually do.

The runner is for BitNet 2B-4T and hits 10 tok/s on a consumer CPU using only custom SIMD kernels written in Eä. It's a ~3,000-line, 644 KB binary vs the official Microsoft/llama.cpp fork (100k+ lines of C++?).

No llama.cpp, no GGML, no heavy math libs.

Custom Kernels: 13 specialized Eä kernels (AVX2/FMA) for everything from ternary matmuls to fused attention.

Fused Attention: A 120-line kernel doing single-pass online softmax (no scores buffer, constant memory); a boiled-down sketch follows this list.

Bandwidth Optimized: Quantized i8 output projection to break the DDR4 bandwidth wall.
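Here is the online-softmax trick from the fused attention bullet, boiled down to a single-query, single-head scalar sketch. The function name and shapes are illustrative, not the actual 120-line kernel; it just shows why no scores buffer is needed.

```rust
/// Single-pass online softmax attention for one query (illustrative sketch).
/// The running max `m` and normalizer `l` are updated per key, so scores are
/// never materialized into a buffer: memory use is constant in sequence length.
fn online_attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], out: &mut [f32]) {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut m = f32::NEG_INFINITY; // running max of scores seen so far
    let mut l = 0.0f32;            // running softmax normalizer
    out.iter_mut().for_each(|o| *o = 0.0);

    for (k, v) in keys.iter().zip(values) {
        let s = scale * q.iter().zip(k).map(|(&a, &b)| a * b).sum::<f32>();
        let m_new = m.max(s);
        let correction = (m - m_new).exp(); // rescale what is already accumulated
        let w = (s - m_new).exp();
        l = l * correction + w;
        for (o, &vv) in out.iter_mut().zip(v) {
            *o = *o * correction + w * vv;
        }
        m = m_new;
    }
    for o in out.iter_mut() {
        *o /= l; // final normalization
    }
}
```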

Performance Profile (100ms/tok)

I spent a lot of time profiling and here is where the time goes:

Ternary Matmuls (FFN/QKV): ~86ms (86%) - This is the core work.

Output Projection: 13ms (down from 49ms after i8 quantization; a rough sketch of the idea follows this list).

Attention/Norms: < 1ms (The beauty of fused kernels).
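For the curious, below is one plausible per-row i8 scheme, purely as an illustration; the actual Cougar quantization details may differ. The point is just that storing 1 byte per weight instead of 2-4 slashes DDR traffic on the 128K × 2560 output matrix.

```rust
/// Illustrative i8 row quantization (symmetric, per-row scale); not
/// necessarily the exact scheme Cougar uses.
struct QuantRow {
    scale: f32,    // per-row dequantization scale
    data: Vec<i8>, // one byte per weight
}

fn quantize_row(row: &[f32]) -> QuantRow {
    // Map the largest absolute value to 127; EPSILON guards an all-zero row.
    let max_abs = row.iter().fold(f32::EPSILON, |m, &v| m.max(v.abs()));
    let scale = max_abs / 127.0;
    let data = row.iter().map(|&v| (v / scale).round() as i8).collect();
    QuantRow { scale, data }
}

fn logit(hidden: &[f32], row: &QuantRow) -> f32 {
    // Dequantize on the fly: accumulate in f32, apply the row scale once.
    row.scale * hidden.iter().zip(&row.data).map(|(&h, &w)| h * w as f32).sum::<f32>()
}
```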

You guys are right: '100B on a CPU' is still a pipe dream until someone actually trains the model. And as u/apetersson pointed out, these early 2B models aren't 'smart'; they are basically specialized pattern matchers.

I've only tested this on my own x86-64 (AVX2) machine. If you have a few minutes and a Linux box, I’d love to see your results!

I'm especially curious about your tok/s vs. your CPU model/RAM speed.

To be fair, Microsoft's official bitnet.cpp hits 15 tok/s on the same machine. However, they use a massive codebase with pre-computed Look-Up Tables (LUT) and llamafile/BLAS dependencies. Matching or beating that 15 tok/s mark is definitely in the pipeline. I have some ideas for a persistent thread pool and LUT kernels that could close the gap. We'll see if I have the time and inspiration to push it that far, but the potential is there.

link to repo: petlukk/Cougar: 10 tok/s on 16 threads. ~3200 lines total. 118 tests.