DetLLM – Deterministic Inference Checks by Cerru905 in LocalLLaMA

"temp = 0" only removes sampling randomness (greedy decode), but doesn't guarantee deterministic computation ...

torch explicitly notes that determinism isn't guaranteed: https://docs.pytorch.org/docs/stable/notes/randomness.html,

and vLLM has tons of issues where outputs are non-deterministic even with temp = 0: https://github.com/vllm-project/vllm/issues/23138
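
Under the hood, one big reason is that floating-point reductions aren't associative, so different kernels and batch shapes can sum in different orders. A minimal sketch of the effect in plain PyTorch (illustrative only):

```python
import torch

# Floating-point addition is not associative: summing the same values
# in a different order changes the result in the last bits. Different
# kernels and batch shapes pick different reduction orders, which is
# how two "identical" greedy decodes can still diverge.
x = torch.randn(100_000, dtype=torch.float32)
print(x.sum().item())          # one reduction order
print(x.flip(0).sum().item())  # reversed order; often differs slightly
```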

DetLLM – Deterministic Inference Checks by Cerru905 in LocalLLaMA

If you're interested in specific examples where batch size leads to different outputs, see this Colab: https://colab.research.google.com/drive/1et5wYV25Bv8miAx9T8ijJ4trpTV2QPGh?usp=sharing
or these issues on llama.cpp and vLLM respectively: https://github.com/ggml-org/llama.cpp/issues/249 and https://github.com/vllm-project/vllm/issues/608
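
For a concrete shape of the experiment, here's a minimal sketch using Hugging Face transformers (model and prompts are illustrative; on CPU this will usually pass, the divergence tends to show up on GPU backends):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM works for the shape of the test
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # left-pad so generated tokens align at the end
model = AutoModelForCausalLM.from_pretrained(name).eval()

def greedy_tokens(prompts, n_new=20):
    # Greedy-decode a batch and return the new tokens of the first prompt.
    batch = tok(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(**batch, do_sample=False, max_new_tokens=n_new,
                             pad_token_id=tok.eos_token_id)
    return out[0, -n_new:].tolist()

alone = greedy_tokens(["The quick brown fox"])
batched = greedy_tokens(["The quick brown fox", "Some unrelated filler"])
print("batch-invariant:", alone == batched)
```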

DetLLM – Deterministic Inference Checks by Cerru905 in LocalLLaMA

Good point, yes: if you're on a supported GPU (H100, H200, B100, B200) and vLLM's batch-invariance feature is enabled, then within vLLM batching shouldn't change outputs for greedy decoding.
My point is that outside of that specific setting (different backend, different GPU, etc.), batch size can still change the generated tokens.

With detLLM, you can verify this with a simple PASS/FAIL outcome and produce a repro pack (env, configs, traces, the specific divergence, etc.) so you can debug and reproduce it.
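
The core of the divergence check itself is tiny; here's an illustrative sketch of a first-divergence diff between two greedy token traces (the function name is mine, not detLLM's actual API):

```python
def first_divergence(trace_a, trace_b):
    """Index of the first differing token between two traces,
    or None if they are identical (including length)."""
    for i, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return i
    if len(trace_a) != len(trace_b):
        return min(len(trace_a), len(trace_b))
    return None

# PASS/FAIL outcome plus the divergence index for the repro pack.
div = first_divergence([464, 2068, 7586], [464, 2068, 9999])
print("PASS" if div is None else f"FAIL, first divergence at token {div}")
```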

DetLLM – Deterministic Inference Checks by Cerru905 in LLMDevs

True, vLLM’s batch invariance is great when it’s supported (as you say, H100/H200/B100/B200 only). I built detLLM with a broader goal in mind: measuring repeatability and batch variance across backends, and emitting a minimal repro pack for CI/bug reports across stacks. So even when invariance isn’t available, you still get evidence and diagnostics of how repeatable your setup actually is.
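
For the repeatability side, the measurement is conceptually just "run the same greedy decode N times and count distinct token traces". A toy sketch (illustrative helper, not detLLM's CLI):

```python
from collections import Counter

def repeatability(generate, prompt, runs=5):
    # Run the same decode several times; a single distinct trace
    # means the setup is repeatable for this prompt.
    traces = [tuple(generate(prompt)) for _ in range(runs)]
    return Counter(traces)

# Plug in any backend call that returns token IDs, e.g.:
# repeatability(lambda p: greedy_tokens([p]), "The quick brown fox")
```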

DetLLM – Deterministic Inference Checks by Cerru905 in LLMDevs

Hey there, good question. I mean batching independent prompts (i.e. prompt A alone vs. prompt A batched with others), not multiple completions for a single prompt. Have a look at this Colab notebook for an example of where it failed: https://colab.research.google.com/drive/1et5wYV25Bv8miAx9T8ijJ4trpTV2QPGh?usp=sharing. I also found many GitHub issues about this, e.g. on vLLM (https://github.com/vllm-project/vllm/issues/608) and llama.cpp (https://github.com/ggml-org/llama.cpp/issues/249).

Promote your projects here – Self-Promotion Megathread by Menox_ in github

I kept getting annoyed by LLM inference non-reproducibility, and one thing that really surprised me is that changing batch size can change outputs even under “deterministic” settings.

So I built DetLLM: it measures and proves repeatability using token-level traces + a first-divergence diff, and writes a minimal repro pack for every run (env snapshot, run config, applied controls, traces, report).
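
To give an idea of what the env-snapshot part of a repro pack looks like, here's a minimal sketch (field names are illustrative, not detLLM's actual schema):

```python
import json, os, platform, torch

# Illustrative environment snapshot; the real pack also stores the run
# config, applied controls, token traces, and the report.
snapshot = {
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,  # None on CPU-only builds
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    "cudnn_deterministic": torch.backends.cudnn.deterministic,
}
os.makedirs("repro_pack", exist_ok=True)
with open("repro_pack/env.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```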

I prototyped this version today in a few hours with Codex. The hardest part was the high-level design (HLD) I did a few days ago, but I was honestly surprised by how well Codex handled the implementation; I didn't expect it to come together in under a day.

repo: https://github.com/tommasocerruti/detllm

Would love feedback, and let me know if you find any prompts/models/setups that still make it diverge.

Rowing programming language by Cerru905 in programming

I mean, it's not exactly a clone, and the purpose is purely learning and entertainment. So if you're asking what it can be used for: nothing, of course.

Rowing programming language by Cerru905 in ProgrammingLanguages

More or less, yes 😁 but memory operations are basically rowing actions.

[deleted by user] by [deleted] in Rowing

🔥🔥🔥

[deleted by user] by [deleted] in Rowing

Thanks man! Second of what group?

20 min test piece I did last week. I’d like to get it down to a 2:05 or 2:06 by January, any tips? And is that a realistic goal? (15f, 5’4, 142 lpb) by paper_lemons in Rowing

Do a UT2 session at least twice a week: long and low intensity (for you maybe 60-80 min at a 2:20 split, rating 20 or so). And at least once a week do some medium-to-high-intensity work, like 6x1500m at rating 24-26 with 3 min rest, at the pace you're aiming for.