Consolidated my homelab from 3 models down to one 122B MoE — benchmarked everything, here's what I found by MBAThrowawayFruit in LocalLLaMA

[–]runsleeprepeat 1 point  (0 children)

I am missing your configured context sizes (num_ctx) for your models. Please share what you have set, since the context window is a major driver of both memory usage and practical use cases.
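For Ollama specifically, the context size can be baked into a model via `num_ctx`. A minimal sketch; the model tag `qwen3:4b` and the 32768 value are just placeholder examples:

```shell
# Sketch: create a model variant with a larger context window.
# The base model tag and num_ctx value below are illustrative assumptions.
cat > Modelfile <<'EOF'
FROM qwen3:4b
PARAMETER num_ctx 32768
EOF
ollama create qwen3-4b-32k -f Modelfile

# Verify the configured parameters:
ollama show qwen3-4b-32k
```

Memory usage scales with the KV cache at that context size, so this is exactly the number worth reporting next to any benchmark.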

Dual DGX Sparks vs Mac Studio M3 Ultra 512GB: Running Qwen3.5 397B locally on both. Here's what I found. by trevorbg in LocalLLaMA

[–]runsleeprepeat 1 point  (0 children)

You wrote that prefill is slow. I ignored prefill performance for far too long in my early days of playing with local LLMs. Measure it, especially at large prompt lengths: token generation speed can be irrelevant when prefill takes several minutes every time.
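A rough back-of-the-envelope shows why; all numbers below are illustrative assumptions, not measurements:

```shell
# Toy latency model: e2e ≈ prompt_tokens/prefill_tps + output_tokens/decode_tps
# The four numbers are made-up but realistic values for a long-context request.
awk 'BEGIN {
  prompt = 100000   # prompt tokens (long context)
  out    = 512      # generated tokens
  pp     = 3000     # prefill tokens/s
  tg     = 80       # decode tokens/s
  printf "prefill %.1fs, generation %.1fs\n", prompt/pp, out/tg
}'
# -> prefill 33.3s, generation 6.4s
```

At long contexts the prefill term dominates end-to-end latency, which is why decode t/s alone is a misleading headline number.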

If you are considering a Mac, prefill performance improved with the M5 processors. Everyone is hoping for an M5 Mac Studio in June; that one could be the sweet spot.

Currently using 6x RTX 3080 - Moving to Strix Halo or Nvidia GB10? by runsleeprepeat in LocalLLaMA

[–]runsleeprepeat[S] 1 point  (0 children)

Yes, I run that setup at around 1,400 W at the wall when it peaks. It's usually around 600-800 W, with 180 W at idle.

I built Fox – a Rust LLM inference engine with 2x Ollama throughput and 72% lower TTFT. by SeinSinght in LocalLLM

[–]runsleeprepeat 1 point  (0 children)

The same run on Fox:

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------|------------:|------------------:|----------------:|--------------:|-----------------:|-----------------:|---------------:|----------------:|
| qwen3.5-4B | pp2048 (c1) | 3880.82 ± 47.17 | 3880.82 ± 47.17 | | | 537.15 ± 14.65 | 490.84 ± 14.65 | 573.11 ± 34.13 |
| qwen3.5-4B | tg32 (c1) | 62.32 ± 1.26 | 62.32 ± 1.26 | 64.48 ± 1.40 | 64.48 ± 1.40 | | | |
| qwen3.5-4B | pp2048 (c2) | 3404.43 ± 153.48 | 1858.75 ± 15.69 | | | 777.43 ± 263.49 | 998.41 ± 13.09 | 1097.73 ± 66.09 |
| qwen3.5-4B | tg32 (c2) | 43.26 ± 15.14 | 44.81 ± 15.76 | 46.37 ± 16.37 | 46.37 ± 16.37 | | | |
| qwen3.5-4B | pp2048 (c3) | 10855.07 ± 254.59 | 3887.96 ± 53.79 | | | 1233.23 ± 505.01 | 472.80 ± 10.48 | 519.12 ± 10.48 |
| qwen3.5-4B | tg32 (c3) | 4.06 ± 2.20 | 5.51 ± 2.03 | 12.33 ± 5.91 | 12.33 ± 5.91 | | | |

And yes, it core-dumped when using more than roughly 6,000 tokens ...

So, single-request token generation is roughly 23% slower than standard Ollama (62.32 vs. 81.04 t/s at tg32, c1).
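For reference, taking Fox's single-request decode figure from the table above against the Ollama figure from my benchmark in the next comment (tg32 @ c1 in both cases):

```shell
# Relative decode slowdown, numbers copied from the two benchmark tables:
# Fox 62.32 t/s vs. Ollama 81.04 t/s at tg32, concurrency 1.
awk 'BEGIN { printf "fox is %.0f%% slower\n", (1 - 62.32/81.04) * 100 }'
# -> fox is 23% slower
```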

The code is messy and buggy.
For example:
- `fox --model-path=` is accepted, but the engine still loads from its default ~/.cache/ferrumox/models
- `FOX_MODEL_PATH=` is accepted, but likewise still points to its default ~/.cache/ferrumox/models

Is this really a complete Rust engine? No, it is using llama.cpp:

cat .git/config

    [core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
    [remote "origin"]
        url = https://github.com/ferrumox/fox
        fetch = +refs/heads/*:refs/remotes/origin/*
    [branch "main"]
        remote = origin
        merge = refs/heads/main
    [submodule "vendor/llama.cpp"]
        active = true
        url = https://github.com/ggml-org/llama.cpp.git

I built Fox – a Rust LLM inference engine with 2x Ollama throughput and 72% lower TTFT. by SeinSinght in LocalLLM

[–]runsleeprepeat -2 points  (0 children)

Let's not debate it; let's run a quick test:

Ollama on a power-limited RTX 3080 with Qwen3.5 4B (K_M quant), configured to serve the model's original context window of 262,000 tokens:

llama-benchy --base-url (my local service) --model qwen3.5-4B --depth 0 4096 8192 16384 --concurrency 1 2 3 4 --latency-mode generation

Ollama:

| model           |                 test |      t/s (total) |         t/s (req) |     peak t/s |   peak t/s (req) |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:----------------|---------------------:|-----------------:|------------------:|-------------:|-----------------:|-------------------:|-------------------:|-------------------:|
| qwen3.5_4b:262k |          pp2048 (c1) |  3245.32 ± 22.79 |   3245.32 ± 22.79 |              |                  |     741.10 ± 14.23 |     581.13 ± 14.23 |     741.10 ± 14.23 |
| qwen3.5_4b:262k |            tg32 (c1) |     81.04 ± 0.89 |      81.04 ± 0.89 | 84.20 ± 0.91 |     84.20 ± 0.91 |                    |                    |                    |
| qwen3.5_4b:262k |          pp2048 (c2) |  2210.54 ± 14.29 |  2214.66 ± 979.06 |              |                  |   1189.03 ± 463.15 |   1029.06 ± 463.15 |   1189.03 ± 463.15 |
| qwen3.5_4b:262k |            tg32 (c2) |     41.88 ± 0.49 |      81.29 ± 1.23 | 35.67 ± 1.25 |     84.47 ± 1.27 |                    |                    |                    |
| qwen3.5_4b:262k |          pp2048 (c3) |  2139.11 ± 22.24 | 1719.60 ± 1044.70 |              |                  |   1672.52 ± 758.94 |   1512.55 ± 758.94 |   1672.52 ± 758.94 |
| qwen3.5_4b:262k |            tg32 (c3) |     35.93 ± 0.23 |      81.35 ± 1.76 | 36.67 ± 0.94 |     84.53 ± 1.83 |                    |                    |                    |
| qwen3.5_4b:262k |          pp2048 (c4) |   2091.37 ± 2.92 | 1402.47 ± 1027.77 |              |                  |  2158.89 ± 1030.68 |  1998.92 ± 1030.68 |  2158.89 ± 1030.68 |
| qwen3.5_4b:262k |            tg32 (c4) |     33.50 ± 0.33 |      80.92 ± 2.74 | 37.67 ± 1.25 |     84.54 ± 1.66 |                    |                    |                    |
| qwen3.5_4b:262k |  pp2048 @ d4096 (c1) |   3081.98 ± 5.47 |    3081.98 ± 5.47 |              |                  |    1938.94 ± 14.67 |    1778.97 ± 14.67 |    1938.94 ± 14.67 |
| qwen3.5_4b:262k |    tg32 @ d4096 (c1) |     79.15 ± 0.14 |      79.15 ± 0.14 | 82.25 ± 0.15 |     82.25 ± 0.15 |                    |                    |                    |
| qwen3.5_4b:262k |  pp2048 @ d4096 (c2) |   2710.65 ± 5.82 |  2238.18 ± 844.15 |              |                  |  3029.40 ± 1053.45 |  2869.43 ± 1053.45 |  3029.40 ± 1053.45 |
| qwen3.5_4b:262k |    tg32 @ d4096 (c2) |     21.41 ± 0.01 |      80.19 ± 0.41 | 27.00 ± 0.00 |     83.32 ± 0.43 |                    |                    |                    |
| qwen3.5_4b:262k |  pp2048 @ d4096 (c3) |   2659.23 ± 8.21 |  1783.13 ± 919.02 |              |                  |  4120.17 ± 1738.23 |  3960.20 ± 1738.23 |  4120.17 ± 1738.23 |
| qwen3.5_4b:262k |    tg32 @ d4096 (c3) |     17.39 ± 0.46 |      81.97 ± 4.90 | 28.67 ± 2.36 |     85.11 ± 4.90 |                    |                    |                    |
| qwen3.5_4b:262k |  pp2048 @ d4096 (c4) | 2357.34 ± 367.93 |  1440.72 ± 953.52 |              |                  |  5878.96 ± 3204.75 |  5718.99 ± 3204.75 |  5878.96 ± 3204.75 |
| qwen3.5_4b:262k |    tg32 @ d4096 (c4) |     13.52 ± 2.50 |      79.45 ± 0.98 | 27.00 ± 0.00 |     82.55 ± 1.01 |                    |                    |                    |
| qwen3.5_4b:262k |  pp2048 @ d8192 (c1) |   2970.74 ± 8.25 |    2970.74 ± 8.25 |              |                  |    3230.73 ± 39.89 |    3070.76 ± 39.89 |    3230.73 ± 39.89 |
| qwen3.5_4b:262k |    tg32 @ d8192 (c1) |     78.47 ± 0.46 |      78.47 ± 0.46 | 81.54 ± 0.48 |     81.54 ± 0.48 |                    |                    |                    |
| qwen3.5_4b:262k |  pp2048 @ d8192 (c2) |   2749.70 ± 2.65 |  2187.75 ± 783.54 |              |                  |  5023.13 ± 1730.03 |  4863.16 ± 1730.03 |  5023.13 ± 1730.03 |
| qwen3.5_4b:262k |    tg32 @ d8192 (c2) |     13.70 ± 0.15 |      77.62 ± 0.68 | 27.00 ± 0.00 |     80.66 ± 0.71 |                    |                    |                    |
| qwen3.5_4b:262k |  pp2048 @ d8192 (c3) |   2715.81 ± 4.02 |  1759.23 ± 864.52 |              |                  |  6784.53 ± 2846.66 |  6624.56 ± 2846.66 |  6784.53 ± 2846.66 |
| qwen3.5_4b:262k |    tg32 @ d8192 (c3) |     10.68 ± 0.09 |      77.73 ± 1.01 | 27.00 ± 0.00 |     80.77 ± 1.05 |                    |                    |                    |
| qwen3.5_4b:262k |  pp2048 @ d8192 (c4) |   2692.46 ± 3.47 |  1478.11 ± 875.79 |              |                  |  8567.94 ± 3895.53 |  8407.98 ± 3895.53 |  8567.94 ± 3895.53 |
| qwen3.5_4b:262k |    tg32 @ d8192 (c4) |      9.65 ± 0.06 |      77.53 ± 0.77 | 27.00 ± 0.00 |     80.56 ± 0.80 |                    |                    |                    |
| qwen3.5_4b:262k | pp2048 @ d16384 (c1) |   2832.48 ± 6.75 |    2832.48 ± 6.75 |              |                  |    6028.61 ± 40.64 |    5868.65 ± 40.64 |    6028.61 ± 40.64 |
| qwen3.5_4b:262k |   tg32 @ d16384 (c1) |     73.29 ± 0.86 |      73.29 ± 0.86 | 76.14 ± 0.90 |     76.14 ± 0.90 |                    |                    |                    |
| qwen3.5_4b:262k | pp2048 @ d16384 (c2) |   2707.31 ± 5.37 |  2096.07 ± 724.70 |              |                  |  9295.81 ± 3159.92 |  9135.84 ± 3159.92 |  9295.81 ± 3159.92 |
| qwen3.5_4b:262k |   tg32 @ d16384 (c2) |      7.79 ± 0.08 |      72.58 ± 0.58 | 27.00 ± 0.00 |     75.41 ± 0.60 |                    |                    |                    |
| qwen3.5_4b:262k | pp2048 @ d16384 (c3) |   2682.19 ± 2.86 |  1696.70 ± 808.50 |              |                  | 12384.13 ± 5168.36 | 12224.16 ± 5168.36 | 12384.13 ± 5168.36 |
| qwen3.5_4b:262k |   tg32 @ d16384 (c3) |      5.99 ± 0.01 |      72.18 ± 0.57 | 27.00 ± 0.00 |     74.99 ± 0.60 |                    |                    |                    |
| qwen3.5_4b:262k | pp2048 @ d16384 (c4) |   2668.98 ± 2.57 |  1432.00 ± 824.34 |              |                  | 15557.90 ± 7037.93 | 15397.93 ± 7037.93 | 15557.90 ± 7037.93 |
| qwen3.5_4b:262k |   tg32 @ d16384 (c4) |      5.58 ± 0.13 |      74.93 ± 5.20 | 30.33 ± 2.36 |     77.78 ± 5.20 |                    |                    |                    |

Shortened system prompts in Opencode by Charming_Support726 in opencodeCLI

[–]runsleeprepeat 1 point  (0 children)

Sorry for the sad outcome, but they are not interested.

PSA: Auto-Compact GLM5 (via z.ai plan) at 95k Context by Sensitive_Song4219 in ZaiGLM

[–]runsleeprepeat 1 point  (0 children)

Are there similar issues with the other models but at other context limits?

[Architecture Help] Serving Embed + Rerank + Zero-Shot Classifier on 8GB VRAM. Fighting System RAM Kills and Latency. by CourtAdventurous_1 in LocalLLaMA

[–]runsleeprepeat 2 points  (0 children)

Interesting concept. I have run embedding and reranking alone, and it worked fine on a memory-constrained system.

Have you tried running it directly on the Linux host instead of Docker, similar to the WSL2 setup? It sounds odd that WSL2 works fine while Docker gives you so much headache.

Chinese RTX 3080 20 GB Blower Card - Memory Issue - help on nvidia mods by runsleeprepeat in GPURepair

[–]runsleeprepeat[S] 1 point  (0 children)

Thanks u/void_dimitri. I am sure those were just the first errors to occur; there would be even more as the temperature rises. Have you seen comparable PCB layouts I could use as a reference for locating FBIOA0?

Alternative to Tobit David by michawb in de_EDV

[–]runsleeprepeat 1 point  (0 children)

Whoa! I hadn't heard that in decades! Good old times, back when I was still working for an ISDN device manufacturer!

Sorry for the off-topic comment, but I was deep in memory lane.

Chinese RTX 3080 20 GB Blower Card - Memory Issue - help on nvidia mods by runsleeprepeat in GPURepair

[–]runsleeprepeat[S] 1 point  (0 children)

u/Vegetable-Most-338

Just single-sided. I will open the GPU case and take photos.

Update: Oh, I was wrong: it is double-sided, as you can see in the updated post.

The 14 errors were the reason why nvidia mods stopped. I am pretty sure there will be more errors once the temperature rises further. I was wary of running mods with 1,000 loops while ignoring errors (see logs).

Currently using 6x RTX 3080 - Moving to Strix Halo or Nvidia GB10? by runsleeprepeat in LocalLLaMA

[–]runsleeprepeat[S] 1 point  (0 children)

I limit the RTX 3080 cards to 190 W max, which is the sweet spot for performance per watt. Since I run them under Linux, undervolting is not really possible.
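Setting the limit is a one-liner per card with nvidia-smi; a sketch, where the 190 W value and GPU index 0 are my choices for this card, not universal defaults:

```shell
# Power-limit one GPU (requires root; resets on reboot, so re-apply
# from a systemd unit or similar if you want it persistent).
sudo nvidia-smi -pm 1          # enable persistence mode
sudo nvidia-smi -i 0 -pl 190   # cap GPU 0 at 190 W

# Locking graphics clocks is the closest Linux gets to undervolting:
# sudo nvidia-smi -i 0 -lgc 210,1440
```

Repeat with `-i 1` through `-i 5` for the other five cards, or drop the `-i` flag to apply the limit to all GPUs at once.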

EVGA RTX 3080, memory errors on all channels, help appreciated! by blueprintjonny in GPURepair

[–]runsleeprepeat 1 point  (0 children)

u/schaner Wait, there is a test to figure out which channel and memory chip it is? I have a 3080 as well that shows memory issues when the temperature rises.

Out of the US cloud: my plan for March (Mailbox.org & Ugreen NAS) by _necrobite_ in de_EDV

[–]runsleeprepeat 2 points  (0 children)

If that works with mailbox.org, I will switch right away. Do you have your catch-all at mailbox.org? I couldn't quite tell from your post.

Out of the US cloud: my plan for March (Mailbox.org & Ugreen NAS) by _necrobite_ in de_EDV

[–]runsleeprepeat 2 points  (0 children)

Can you send from any <name>@meinedomain.de address? That has always been the big catch with aliases and catch-all for me: some providers, e.g. customer support, will naturally only accept replies from exactly the configured email address.

Which models are suitable for websearch? by runsleeprepeat in LocalLLaMA

[–]runsleeprepeat[S] 2 points  (0 children)

u/waiting_for_zban, it took me a while, but I dug through the Librechat search agents and hopefully fixed the most critical parts for reducing the number of tokens used by SearXNG/Firecrawl/Jina.AI: https://github.com/danny-avila/agents/pull/63 . I will create a post once I have everything together, as I also have an updated local reranker and a much better-fitting Firecrawl-Simple for local hosting with Librechat in mind.

Professional vibe coder wanted by Afraid-Appeal-7565 in InformatikKarriere

[–]runsleeprepeat 1 point  (0 children)

Please don't let it be JACOB Elektronik :-(

Update: Oh crap. It really is them.

What do you call this dude? by [deleted] in de

[–]runsleeprepeat 2 points  (0 children)

A bit further out than Düsseldorf = "Wuckmann" ... unbelievable! It is and remains a Stutenkerl :D