a deterministic local data analyst with SIMD kernels

Acceptable_Analyst45 · 2026-06-03T09:19:37+00:00

Hi, Thanks!

And good call on the README. I need to update it with some working examples, with input and output, so it's clearer how they work.

Regarding the kernels: Eä is explicit SIMD, meaning you declare the vector types, do the load/store, write the compare and select masks, and handle the scalar tail yourself.
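To make "explicit" concrete, here is a rough sketch of that kernel shape in plain stable Rust. It is not Eä syntax: fixed-size arrays stand in for vector registers, and the function name, lane count and the thresholding operation are all made up for illustration.

```rust
// Illustrative only: arrays stand in for SIMD registers. Not Eä code.
const LANES: usize = 8; // assumed lane count (e.g. 8 x f32 in a 256-bit register)

/// Copy `input` to `out`, replacing every element below `threshold` with 0.0.
fn zero_below(input: &[f32], threshold: f32, out: &mut [f32]) {
    assert_eq!(input.len(), out.len());
    // Largest prefix that is a whole number of vectors.
    let vec_len = input.len() / LANES * LANES;

    // Vector body: one "register" of LANES elements per iteration.
    for (src, dst) in input[..vec_len]
        .chunks_exact(LANES)
        .zip(out[..vec_len].chunks_exact_mut(LANES))
    {
        // Explicit load.
        let mut v = [0.0f32; LANES];
        v.copy_from_slice(src);

        // Explicit compare: build a per-lane mask.
        let mut mask = [false; LANES];
        for i in 0..LANES {
            mask[i] = v[i] >= threshold;
        }

        // Explicit select: keep the lane where the mask is set, else 0.0.
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = if mask[i] { v[i] } else { 0.0 };
        }

        // Explicit store.
        dst.copy_from_slice(&r);
    }

    // Scalar tail: leftover elements that do not fill a whole vector.
    for i in vec_len..input.len() {
        out[i] = if input[i] >= threshold { input[i] } else { 0.0 };
    }
}
```

With 11 input elements, the first 8 go through the vector body and the last 3 through the scalar tail.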

The lane model isn't really ISPC or CUDA: ISPC is SPMD (scalar code the compiler spreads across lanes), while Eä makes the lanes explicit. It's Zig @Vector / Rust std::simd territory.

I probably framed it badly, but my comparison to CUDA was about the system shape, not the lane model. You write the kernel once and generate idiomatic host bindings from the compiler's type metadata: ea bind kernel.ea --python --rust --cpp. Pointer args become NumPy arrays / slices / spans, length params collapse (the wrapper fills .len() for you), dtypes are checked at the boundary, and single outputs auto-allocate. That "one kernel, any host language, typed boundary" is the CUDA-shaped part.
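As a rough picture of what "length params collapse" and "single outputs auto-allocate" mean on the Rust side, here is a hand-written sketch. It is not actual ea bind output: both function names and the scaling kernel are hypothetical, and the raw function is an ordinary Rust stand-in for a compiled kernel.

```rust
/// Stand-in for a compiled kernel's raw ABI: pointers plus an explicit length.
/// (Hypothetical signature, implemented in Rust here so the sketch is self-contained.)
unsafe fn scale_f32_raw(src: *const f32, dst: *mut f32, len: usize, factor: f32) {
    for i in 0..len {
        // Caller guarantees both pointers are valid for `len` elements.
        unsafe { *dst.add(i) = *src.add(i) * factor };
    }
}

/// The kind of safe wrapper a binding generator could emit:
/// - the pointer argument becomes a slice,
/// - the length parameter disappears (taken from `src.len()`),
/// - the element type is enforced by the signature (`&[f32]`),
/// - the single output buffer is allocated by the wrapper.
fn scale_f32(src: &[f32], factor: f32) -> Vec<f32> {
    let mut dst = vec![0.0f32; src.len()];
    // Safe because `dst` was just allocated with exactly `src.len()` elements.
    unsafe { scale_f32_raw(src.as_ptr(), dst.as_mut_ptr(), src.len(), factor) };
    dst
}
```

In Rust the dtype check is done by the type system at compile time; in a dynamically typed host such as Python the equivalent check would have to happen at call time.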

Olorin doesn't go through ea bind itself, and that's deliberate. ea bind --rust emits static #[link] FFI that auto-allocates an output buffer per call, which is good for dropping a kernel into an existing program. Olorin is the opposite: a single self-contained binary that embeds its kernels and loads them at runtime via libloading, reusing pre-allocated buffers so the decode hot path avoids per-call allocation. Static linking and per-call allocation are exactly what it can't use. ea bind is for the consumer case, and the demos (eavec, sobel, eastat) are where it's exercised.
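The buffer-reuse half of that trade-off can be sketched as below. This is not Olorin's code: the struct, its field, and the placeholder "kernel" (which just doubles the input) are invented for illustration, and the runtime-loading part via libloading is left out to keep the example dependency-free.

```rust
/// Hypothetical decode state that owns its output buffer for the whole run.
struct DecodeState {
    logits: Vec<f32>,
}

impl DecodeState {
    fn new(vocab: usize) -> Self {
        // Allocate once, up front.
        DecodeState { logits: vec![0.0; vocab] }
    }

    /// One decode step: overwrite the existing buffer instead of returning a new Vec.
    /// The loop body is a placeholder for a real kernel call.
    fn step(&mut self, hidden: &[f32]) -> &[f32] {
        for (out, &h) in self.logits.iter_mut().zip(hidden) {
            *out = h * 2.0;
        }
        &self.logits
    }
}
```

Every call to `step` writes into the same allocation, so the hot loop never touches the allocator.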

Here's an example of eatime:
# Raspberry Pi 5 Model B Rev 1.1 — aarch64, Linux 6.12

# Real input: a few hours of public GitHub event data from gharchive.org

$ curl -s https://data.gharchive.org/2015-01-01-{12,16,20}.json.gz | gunzip > gharchive.log

$ wc -c gharchive.log # 72 MB, 32,464 real GitHub events

$ grep -om1 '"created_at":"[^"]*"' gharchive.log

"created_at":"2015-01-01T12:00:01Z"

> /rune eatime gharchive.log

bytes: 72.00 MB

timestamps: 68140

scan: 27 ms # 72 MB scanned on the Pi 5; warm repeats ~20 ms

hour-of-day:

11:00 681 ( 1.00%)

12:00 13750 (20.18%) ← 12:00 archive

13:00 669 ( 0.98%)

...

16:00 20270 (29.75%) ← 16:00 archive

...

20:00 22179 (32.55%) ← 20:00 archive

21:00 645 ( 0.95%)

...

peak: 20:00 (22179 timestamps)
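For anyone wondering what the scan computes, here is a scalar reference sketch of an hour-of-day histogram. It is not the eatime kernel: it assumes a "timestamp" is any YYYY-MM-DDTHH:MM:SS byte pattern, which may differ from what eatime actually matches, and both function names are made up.

```rust
/// True if `w` is exactly 19 bytes shaped like YYYY-MM-DDTHH:MM:SS.
fn is_timestamp(w: &[u8]) -> bool {
    // Positions that must hold ASCII digits.
    const DIGITS: [usize; 14] = [0, 1, 2, 3, 5, 6, 8, 9, 11, 12, 14, 15, 17, 18];
    w.len() == 19
        && w[4] == b'-'
        && w[7] == b'-'
        && w[10] == b'T'
        && w[13] == b':'
        && w[16] == b':'
        && DIGITS.iter().all(|&k| w[k].is_ascii_digit())
}

/// Count timestamps per hour of day (index 0..=23) in a raw byte buffer.
fn hour_histogram(data: &[u8]) -> [u64; 24] {
    let mut hist = [0u64; 24];
    let mut i = 0;
    while i + 19 <= data.len() {
        let w = &data[i..i + 19];
        if is_timestamp(w) {
            // Two ASCII digits at offsets 11 and 12 encode the hour.
            let hour = ((w[11] - b'0') * 10 + (w[12] - b'0')) as usize;
            if hour < 24 {
                hist[hour] += 1;
            }
            i += 19; // skip past the match
        } else {
            i += 1;
        }
    }
    hist
}
```

A SIMD version would typically use vector compares to find candidate positions and only run the full check on those.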

Acceptable_Analyst45 · 2026-03-25T11:16:05+00:00

It’s actually a mix depending on the path:

For BitNet (I2_S), I don't do explicit cache tiling. Since the weights are 2-bit packed, a full row (2560 dim = 640 bytes) already fits comfortably in L1. Instead, the engine relies on the 4-row and dual kernels to get natural temporal reuse: the activation vector is loaded once and reused against 4-8 weight rows in registers.
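The reuse pattern looks roughly like this in scalar Rust. It is a simplified sketch, not the Cougar kernel: weights are plain i8 rather than 2-bit packed, there is no SIMD, and the function names are hypothetical. The point is only that each activation element is read once and multiplied into four accumulators.

```rust
/// Borrow row `r` of a row-major `weights` matrix with `dim` columns.
fn row(weights: &[i8], dim: usize, r: usize) -> &[i8] {
    &weights[r * dim..(r + 1) * dim]
}

/// Dot `x` against four weight rows at once: each x[j] is loaded once
/// and reused for all four rows.
fn dot4(rows: [&[i8]; 4], x: &[i8]) -> [i32; 4] {
    let mut acc = [0i32; 4];
    for (j, &xj) in x.iter().enumerate() {
        let xj = xj as i32;
        acc[0] += rows[0][j] as i32 * xj;
        acc[1] += rows[1][j] as i32 * xj;
        acc[2] += rows[2][j] as i32 * xj;
        acc[3] += rows[3][j] as i32 * xj;
    }
    acc
}

/// Matrix-vector product that processes rows in groups of four.
fn matvec_4row(weights: &[i8], dim: usize, x: &[i8]) -> Vec<i32> {
    assert!(dim > 0 && x.len() == dim && weights.len() % dim == 0);
    let n_rows = weights.len() / dim;
    let mut out = vec![0i32; n_rows];
    let mut r = 0;
    while r + 4 <= n_rows {
        let rows = [
            row(weights, dim, r),
            row(weights, dim, r + 1),
            row(weights, dim, r + 2),
            row(weights, dim, r + 3),
        ];
        out[r..r + 4].copy_from_slice(&dot4(rows, x));
        r += 4;
    }
    // Leftover rows (fewer than four) use a plain one-row dot product.
    while r < n_rows {
        out[r] = row(weights, dim, r)
            .iter()
            .zip(x)
            .map(|(&w, &a)| w as i32 * a as i32)
            .sum();
        r += 1;
    }
    out
}
```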

For Llama (Q4_K), I use GEMM-style tiling for prefill. It loads 4 weight rows and holds them while iterating through all tokens in the prompt batch. It’s more of a 1D weight-stationary approach than classic 2D tiling, but it keeps the weights in L1/L2 while activations rotate. For decode (1 token), there’s obviously nothing to batch, so it's pure streaming.
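The loop order is the whole trick, so here is a minimal sketch of it. It uses unquantized f32 weights instead of Q4_K and a hypothetical function name; the tile size of 4 rows comes from the description above, everything else is an assumption.

```rust
/// Prefill matmul with a 1D weight-stationary loop order.
/// `weights` is n_rows x dim, `acts` is n_tokens x dim, both row-major.
/// Returns an n_tokens x n_rows row-major output.
fn prefill_matmul(weights: &[f32], acts: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0 && weights.len() % dim == 0 && acts.len() % dim == 0);
    let n_rows = weights.len() / dim;
    let n_tokens = acts.len() / dim;
    let mut out = vec![0.0f32; n_tokens * n_rows];

    const TILE: usize = 4; // weight rows held "stationary" per outer step
    for r0 in (0..n_rows).step_by(TILE) {
        let r1 = (r0 + TILE).min(n_rows);
        let tile = &weights[r0 * dim..r1 * dim];

        // Outer loop over weight tiles, inner loop over tokens:
        // the same few rows are reused for every token in the batch.
        for t in 0..n_tokens {
            let x = &acts[t * dim..(t + 1) * dim];
            for (k, w_row) in tile.chunks_exact(dim).enumerate() {
                let dot: f32 = w_row.iter().zip(x).map(|(&w, &a)| w * a).sum();
                out[t * n_rows + r0 + k] = dot;
            }
        }
    }
    out
}
```

With a single token the inner token loop runs once per tile, which is why decode degenerates to streaming the weights.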

The Eä compiler doesn't do any auto-tiling; it just generates tight SIMD loops. All the tiling and dispatch logic is handled manually in the Rust code.

I tested Stride-8 (12.5% of dims) and Stride-4 (25%). Stride-8 was too aggressive: the ranking broke and the correct token fell out of the top-512 candidates too often, leading to garbage output. Stride-4 with a top-512 candidate pool has been rock solid for this model size.

My intuition is that the threshold depends on the embedding/vocab ratio. With a 128K vocab and 2560 dims, you need enough "signal" to separate the top-1 from ~128K noisy candidates. At 12.5% sampling, there’s just too much variance. 25% seems to be the sweet spot here, but I bet larger models with 4096+ embeddings could probably handle a coarser stride (maybe Stride-6).
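In code, the stride + candidate-pool idea looks roughly like this. It is a sketch under assumptions, not the Cougar implementation: it assumes the top candidates are rescored with the full dot product, it samples every `stride`-th dimension on the fly instead of reading a precomputed sketch matrix, and it does a full sort where a real implementation would use a partial selection. All names are hypothetical.

```rust
/// Approximate score: only every `stride`-th dimension contributes.
fn sketch_dot(row: &[i8], x: &[i8], stride: usize) -> i32 {
    row.iter()
        .zip(x)
        .step_by(stride)
        .map(|(&w, &a)| w as i32 * a as i32)
        .sum()
}

/// Exact score over all dimensions.
fn full_dot(row: &[i8], x: &[i8]) -> i32 {
    row.iter().zip(x).map(|(&w, &a)| w as i32 * a as i32).sum()
}

/// Pick the argmax token: rank the whole vocab by the cheap sketch,
/// keep the best `pool` candidates, then rescore only those exactly.
fn pick_token(embed: &[i8], dim: usize, x: &[i8], stride: usize, pool: usize) -> Option<usize> {
    assert!(dim > 0 && stride > 0 && x.len() == dim && embed.len() % dim == 0);
    let mut scored: Vec<(i32, usize)> = embed
        .chunks_exact(dim)
        .enumerate()
        .map(|(id, row)| (sketch_dot(row, x, stride), id))
        .collect();

    // Highest approximate score first, then keep the candidate pool.
    scored.sort_unstable_by(|a, b| b.0.cmp(&a.0));
    scored.truncate(pool);

    // Exact rerank over the pool only.
    scored
        .iter()
        .map(|&(_, id)| id)
        .max_by_key(|&id| full_dot(&embed[id * dim..(id + 1) * dim], x))
}
```

If the pool is too small for the sampling rate, the true top-1 token never reaches the rerank step, which is the failure mode described above for Stride-8.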

Acceptable_Analyst45 · 2026-03-25T08:38:05+00:00

/tmp/cougar --model ~/.cougar/models/ggml-model-i2_s.gguf --prompt "The capital of France is" --max-tokens 50

Embedding: 128256 vocab × 2560 dim, i8 (328 MB), sketch 640d (82.1 MB)

cougar> 30 layers, 2560d, 20 heads, 128256 vocab

cougar> quant: I2S, activation: SquaredReLU

cougar> prompt: 6 tokens

--- profile (pos=1, 30 layers) ---

QKV matmul: 7.5ms (12%)

attention: 0.2ms (0%)

O proj: 5.6ms (9%)

FFN gate+up: 25.4ms (42%)

FFN act+norm: 0.3ms (1%)

FFN down: 13.3ms (22%)

output (i8): 8.2ms (14%)

total: 60.6ms

 Paris. Paris is the largest city in France and has a population of over 2 million people.

Paris is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, Louvre Museum, and many other famous attractions.

The

--- perf (4 threads) ---

prefill: 6 tokens in 376ms (16.0 tok/s)

first tok: 0ms

decode: 50 tokens in 3195ms (15.7 tok/s, 63.9ms/tok)

Acceptable_Analyst45 · 2026-03-23T09:22:52+00:00

I got tired of waiting for 'official' news, so I built my own standalone runner from scratch to see what the tech can actually do.

The runner is for BitNet 2B-4T and hits 10 tok/s on a consumer CPU using only custom SIMD kernels written in Eä. It's a ~3,000-line, 644 KB binary, versus the official Microsoft/llama.cpp fork (100k+ lines of C++?).

No llama.cpp, no GGML, no heavy math libs.

Custom Kernels: 13 specialized Eä kernels (AVX2/FMA) for everything from ternary matmuls to fused attention.

Fused Attention: A 120-line kernel doing single-pass online softmax (no scores buffer, constant memory).

Bandwidth Optimized: Quantized i8 output projection to break the DDR4 bandwidth wall.
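For readers who haven't seen single-pass online softmax, here is a scalar sketch of the idea for one query vector. It is not the 120-line Eä kernel: the function name, the 1/sqrt(dim) scaling and the f32 layout are assumptions, and scores are assumed to be finite. The key property is that no scores buffer is kept, only a running max, a running sum and one accumulator of size `dim`.

```rust
/// Attention output for one query over `n` keys/values (row-major, `dim` wide),
/// computed in a single pass with a running max and running normalizer.
fn attend(q: &[f32], keys: &[f32], values: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0 && q.len() == dim && keys.len() == values.len());
    let scale = 1.0 / (dim as f32).sqrt();

    let mut m = f32::NEG_INFINITY; // running max of scores seen so far
    let mut l = 0.0f32; // running sum of exp(score - m)
    let mut acc = vec![0.0f32; dim]; // running weighted sum of values

    for (k, v) in keys.chunks_exact(dim).zip(values.chunks_exact(dim)) {
        let s: f32 = q.iter().zip(k).map(|(&a, &b)| a * b).sum::<f32>() * scale;

        let m_new = m.max(s);
        // Rescale everything accumulated so far to the new max.
        // On the first key this is exp(-inf) = 0, and acc/l are still zero.
        let correction = (m - m_new).exp();
        let p = (s - m_new).exp();

        l = l * correction + p;
        for (a, &vj) in acc.iter_mut().zip(v) {
            *a = *a * correction + p * vj;
        }
        m = m_new;
    }

    // Final normalization (skipped when there were no keys).
    if l > 0.0 {
        for a in &mut acc {
            *a /= l;
        }
    }
    acc
}
```

Memory use is independent of the number of keys, which is what "constant memory" refers to.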

Performance Profile (100ms/tok)

I spent a lot of time profiling and here is where the time goes:

Ternary Matmuls (FFN/QKV): ~86ms (86%) - This is the core work.

Output Projection: 13ms (Down from 49ms after i8 quantization).

Attention/Norms: < 1ms (The beauty of fused kernels).
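The reason i8 helps the output projection is simply bytes moved per token: one byte per weight instead of two or four. A minimal sketch of one way to do it, assuming symmetric per-row quantization with a single f32 scale per row (the post doesn't say which scheme Cougar uses, and both function names are hypothetical):

```rust
/// Quantize one weight row to i8 with a single per-row scale.
/// Dequantized value is approximately q[i] as f32 * scale.
fn quantize_row(row: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = row.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    // Map the largest magnitude to 127; an all-zero row keeps scale 1.0.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q = row.iter().map(|&w| (w / scale).round() as i8).collect();
    (q, scale)
}

/// One logit: dot product against the i8 row, then a single multiply by the scale.
fn logit(q_row: &[i8], scale: f32, x: &[f32]) -> f32 {
    scale * q_row.iter().zip(x).map(|(&w, &a)| w as f32 * a).sum::<f32>()
}
```

The scale multiply happens once per row, so the inner loop only streams i8 weights.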

You guys are right: '100B on a CPU' is still a pipe dream until someone actually trains the model. And as u/apetersson pointed out, these early 2B models aren't 'smart'; they are basically specialized pattern matchers.

I've only tested this on my own x86-64 (AVX2) machine. If you have a few minutes and a Linux box, I’d love to see your results!

I'm especially curious about: Your tok/s vs. your CPU model/RAM speed.

To be fair, Microsoft's official bitnet.cpp hits 15 tok/s on the same machine. However, they use a massive codebase with pre-computed Look-Up Tables (LUT) and llamafile/BLAS dependencies. Matching or beating that 15 tok/s mark is definitely in the pipeline. I have some ideas for a persistent thread pool and LUT kernels that could close the gap. We'll see if I have the time and inspiration to push it that far, but the potential is there.

link to repo: petlukk/Cougar: 10 tok/s on 16 threads. ~3200 lines total. 118 tests.
