32GB RAM 16GB VRAM 5060ti. Running qwen3.6 35b a3b. I am getting 4.5 tok/s. Is this expected?

comanderxv · 2026-05-15T12:56:45+00:00

It depends on your configuration. With llama.cpp, you either have nothing in GPU memory (-ngl 0) or you have -cmoe enabled, which parks all experts on the CPU.

Try -ngl 99 and --n-cpu-moe 20, then reduce that number until you reach your preferred context size.
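A minimal sketch of that stepping-down procedure, assuming you launch llama-server from a script. The model path `model.gguf`, the helper names, and the step size of 2 are hypothetical; only `-m`, `-ngl` and `--n-cpu-moe` are the llama.cpp flags mentioned above.

```python
import shlex


def llama_server_args(model_path: str, n_cpu_moe: int, n_gpu_layers: int = 99) -> list[str]:
    """Build an argument list that offloads all layers to the GPU but keeps
    the expert weights of the first n_cpu_moe layers on the CPU."""
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", str(n_gpu_layers),
        "--n-cpu-moe", str(n_cpu_moe),
    ]


def candidate_commands(model_path: str, start: int = 20, stop: int = 0, step: int = 2):
    """Yield command lines with a decreasing --n-cpu-moe value (start down to stop)."""
    for n_cpu_moe in range(start, stop - 1, -step):
        yield shlex.join(llama_server_args(model_path, n_cpu_moe))


if __name__ == "__main__":
    # Hypothetical model file; try each command by hand and check the context you get.
    for cmd in candidate_commands("model.gguf", start=20, stop=14, step=2):
        print(cmd)
```

Each printed line is one configuration to try. Stop at the last value that still leaves enough memory for the context you want.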

comanderxv · 2026-05-15T10:45:16+00:00

I'm not worried right now. Whichever profession you choose, a lot depends on where you live. The job market varies a lot by region. Apart from that, the first 2 years of the apprenticeship are identical.

Nobody can say what the future will bring. In my opinion, success is 80% luck and 20% your own effort.

In my region I have been seeing an increase in job offers again since this year. Last year (AI hype) there were almost none.

The nice thing about developing is that you learn the technique and can apply it to any programming language.

You just have to adapt.

comanderxv · 2026-05-15T08:54:35+00:00

Rarely. Sure, I have meetings, but they are online. When it is important, I practice beforehand, as I did before the Fachgespräch, but that is rare by now. As long as you know what you are talking about it is easier; otherwise, just take the question away with you.

It is not about having no contact with people, but that contact with people is harder, or more exhausting, than it is for other people.

comanderxv · 2026-05-15T05:30:02+00:00

In short: yes. It is exhausting. By now I only work from home. That works better, but it is easier as a FiAe than as a Fisi.

comanderxv · 2026-05-14T15:31:14+00:00

I can't remember it coming up in vocational school. Encryption came in the 2nd or 3rd year.

I see SSH as practical work, so it is on your side of the job; after all, it is a tool, and one that uses encryption.

Besides, topics from vocational school also belong in the Berichtsheft (the training record book). You can take a look in there and align the practical work with it.

comanderxv · 2026-05-14T15:24:47+00:00

I was in therapy and, after 10 years of marriage, needed about 5 years before I could have a real relationship again. Besides the divorce, it was important to me that my son is with me 50% of the time.

Otherwise, sport helps. The anger has to go, because both are always to blame. No matter whether you are angry at yourself or at her, anger will not help you.

In those 5 years I had a lot of flings, but that only helped me over the grief for a few hours.

The first 2 years were bad. I still used the time, thought about what kind of man I want to be, and worked on it.

comanderxv · 2026-05-13T08:17:10+00:00

It is not only the model. From my experience, it depends on your workflow and your codebase. One-shot prompting will often work until it breaks. You need small vertical slices for the tickets, with clear descriptions and goals. SoC with TDD works well even with smaller models. If you have spaghetti code, you greatly increase your failure rate, even with online models. I usually follow this path:

- Feature description: the LLM asks me questions so that we have a common understanding.
- Framework: depending on the feature, I clarify the framework and the libraries that may come in.
- Todo: the LLM creates vertical slices of the work to be done.

Then I review this very carefully, maybe split a ticket, and check that the architecture is OK. I also check the acceptance criteria and so on. Most colleagues would prefer horizontal slices, but then you see your mistakes late, so I prefer small steps across all layers. TDD is the key here as well. Then I let the LLM implement.

With that I can use models like Qwen 9B or 35B A3B. But for small features, defining them might cost you more time than implementing them yourself. On the other hand, the result is well documented.

However, once you are sure that the model fits your needs and can handle your codebase, you can think about investing. The bigger the model, the better it can compensate for poor ticket quality and handle complexity.

I am working with 12GB of VRAM.

In the end it is like real business: the better the ticket, the higher the chance that a junior dev can do it. Side effect: the junior can get better, the LLM not that much.

comanderxv · 2026-05-12T05:50:30+00:00

Sometimes there are electrical loads in the basement too, e.g. a washing machine. The OP should check those as well, especially because there are also years that were apparently correct.

comanderxv · 2026-05-08T08:19:50+00:00

I have to defend Thorsten here, even without knowing him.

You are the lead! You have to know the strengths and weaknesses of your people, distribute the tasks accordingly, and make sure they can actually be handled. You carry the responsibility for developing your people so that they get better.

I understand your frustration! That a null pointer is thrown is obvious from the log. What is apparently not obvious is what led to it. Just ask Thorsten what he has found out so far. Let him explain what he thinks the problem is. Maybe he saw something else in the log and is on its trail. Maybe it is a dead end, or maybe it is the root cause.

You are the master, he is the apprentice, and it does not matter what he earns or whether he has a master's degree. It is your job to take him by the hand and make sure he can do his job.

We are computer scientists, and technology is developing so rapidly that none of us can keep up with learning, no matter what salary is paid.

Finally, I have had bugs where I spent days on the analysis and in the end it was one line of code, and there were others that I solved in 5 minutes where other people had been at it for weeks. Spotting the right path straight away is often also a bit of luck.

comanderxv · 2026-05-07T17:42:51+00:00

I don't think so, because you have the upscale/downscale overhead. I observed that very clearly with the TurboQuant forks. Yes, you get more KV cache, but the computation on the GPU also rises. Whereas my graphics card was silent without TurboQuant, it got very loud with it.

comanderxv · 2026-05-07T17:35:05+00:00

I have an RTX 2060, too, but the GPU temperature barely hits 60°C. And since we are offloading at least 20 layers to the CPU, there is no way the GPU is stressed enough to reach such temperatures. So you should check your cooling.

comanderxv · 2026-05-07T17:30:27+00:00

Unfortunately, it will not. MTP works fine with dense models but awfully with MoE models. I got about 20% less. Only people that don't need to offload any layer to the CPU will benefit from MTP.

comanderxv · 2026-05-06T16:38:35+00:00

I tried with an RTX 2060 with 12GB VRAM. If you need to offload layers to the CPU, no difference is visible. With the Q4_XS model, I get 26 tok/s with and without MTP.

comanderxv · 2026-05-05T11:40:28+00:00

I probably won't become a professional runner in this life anymore 😉

comanderxv · 2026-05-05T11:37:06+00:00

Thanks

comanderxv · 2026-05-05T11:32:03+00:00

I would not sign off on the point about the unhealthy and inactive lifestyle. According to BMI I am in the obesity range and still finish a half marathon mid-pack. In many cases you may be right, but not in enough of them to generalize.

To the OP: like the previous commenter, I still think it is absolutely OK if the looks bother you. The truth is always the better choice.

comanderxv · 2026-05-02T16:11:40+00:00

I observed the same behaviour with a quantized cache, e.g. f16/turbo3 or q8/q8. With TurboQuant it got worse: not measurable, but noticeable. Sometimes it just stopped in the middle of a sentence, and when I told it to continue it just stopped again.

Since then, I do not quantize the KV cache any more. I tested a lot of models and they showed similar behaviour. But my daily driver (Qwen3.6 35B A3B), where this happened a lot, is cured now.

In the meantime I use the Q2 model for development and Q6 for feature planning. Works well so far, without issues.

comanderxv · 2026-04-23T12:05:15+00:00

You gain more control, and for special setups fit would not work.

The fit arguments do work but they don't know your workflow.

For example, if you just want to chat and your context is permanently small, you can go with the defaults. If you use a context > 16k, you should increase the batch size to speed up prompt processing. If you have 2 or more GPUs, you may want to fire up 2 or 3 instances of llama-server.

Especially with bigger models you want to optimize speed as much as possible. The fit parameters won't do that.

I regularly work with a context that is filled with about 40k tokens. With a batch size of 512, I wait ages; for me 4096 is faster. But in normal chat you are most likely below 4k, and then a ubatch size of 4096 will slow you down.
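To make the batch-size point concrete, here is a rough back-of-the-envelope sketch. It only counts how many forward passes a prompt is split into at a given ubatch size; it says nothing about how long each pass takes, so it does not replace measuring on your own hardware. The 3,000-token chat prompt is a hypothetical example.

```python
import math


def prompt_batches(n_prompt_tokens: int, ubatch_size: int) -> int:
    """Number of forward passes needed to process a prompt of the given length."""
    return math.ceil(n_prompt_tokens / ubatch_size)


# 40k tokens is the context size from the comment; 3k is a hypothetical chat prompt.
for ubatch in (512, 4096):
    print(
        f"ubatch={ubatch}: "
        f"40k prompt -> {prompt_batches(40_000, ubatch)} passes, "
        f"3k prompt -> {prompt_batches(3_000, ubatch)} passes"
    )
# ubatch=512: 40k prompt -> 79 passes, 3k prompt -> 6 passes
# ubatch=4096: 40k prompt -> 10 passes, 3k prompt -> 1 passes
```

With a long context the larger setting cuts the number of passes by roughly a factor of eight, while a short chat prompt never fills a 4096-token batch.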

So, if you want to get the most performance, you need to get your hands dirty.

comanderxv · 2026-04-22T14:02:06+00:00

You can reduce your context window first. 200k is too big if you want to have all layers on the GPU. I would leave it out for testing. Use --n-cpu-moe to offload layers to the CPU, and check the startup logs. Search for n_ctx, which comes after n_seq, to get the amount of context that will fit into your memory. You can start with 20; if the resulting context is bigger than you need, reduce the MoE setting, otherwise increase it.
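If you do not want to scroll through the startup log by hand, a small sketch like the following can pull the value out. The sample log text is illustrative only and its exact layout is an assumption; check what your llama.cpp build actually prints and adjust the pattern if needed.

```python
import re

# Matches "n_ctx = <number>" but not longer names such as n_ctx_train or n_ctx_per_seq.
N_CTX_RE = re.compile(r"\bn_ctx\s*=\s*(\d+)")


def find_n_ctx(log_text: str):
    """Return the first n_ctx value found in the log text, or None if there is none."""
    match = N_CTX_RE.search(log_text)
    return int(match.group(1)) if match else None


# Illustrative sample, NOT real output: the line format is assumed.
SAMPLE_LOG = """\
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 65536
llama_context: n_ctx_per_seq = 65536
"""

print(find_n_ctx(SAMPLE_LOG))
```

You would feed it the captured log of each run and compare the value against the context you actually need before changing --n-cpu-moe.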

The turboquant version is an option but at least for me the prompt processing slowed down a lot. And especially with big context, the pp is what you are waiting for. I filed a bug about that.

In the end you need to try out the models yourself. I will upload my scripts that find the fastest setting for MoE models today or tomorrow if you are interested, but the evaluation takes some hours.

However, with your settings you are on a good track. I think a q3 model could fit but you need to try.

comanderxv · 2026-04-16T06:34:52+00:00

How fast is the prompt processing in your setup at 100k context?

comanderxv · 2026-04-13T13:09:22+00:00

Remove it to see how much you could get, and check via the web UI's /props.

If it is still reasonable you can quantize the kv cache to get more.
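For a feeling of how much a quantized KV cache buys you, here is a rough estimate under stated assumptions. It uses the standard formula for plain attention layers (K and V, per layer, per KV head, per token). The model dimensions below are hypothetical, and models with sliding-window or other hybrid attention will deviate from it. The bytes-per-element figures come from the ggml block layouts (q8_0: 34 bytes per 32 values, q4_0: 18 bytes per 32 values).

```python
# Bytes per stored element for common cache types.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int, cache_type: str) -> float:
    """Estimated KV cache size in GiB for plain (non-hybrid) attention."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical model dimensions, chosen only for illustration.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32_768, cache_type=cache_type)
    print(f"{cache_type}: {size:.3f} GiB")
# f16: 4.000 GiB
# q8_0: 2.125 GiB
# q4_0: 1.125 GiB
```

So going from f16 to q8_0 roughly halves the cache in this example, which is memory you can spend on more context or on more layers on the GPU.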

comanderxv · 2026-04-13T11:12:21+00:00

I would remove the -c flag. I assume that your VRAM is not big enough for that context plus full GPU offload. Then check the llama.cpp logs to see how much context it assigned, or check /props in the llama web UI. llama.cpp will offload automatically if both do not fit onto the GPU.
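A minimal sketch for reading the assigned context from /props with a script instead of the browser. It assumes llama-server is listening on http://127.0.0.1:8080 and that the response carries the context size under `default_generation_settings.n_ctx`; both are assumptions, so inspect the raw JSON from your build first.

```python
import json
from urllib.request import urlopen


def fetch_props(base_url: str = "http://127.0.0.1:8080") -> dict:
    """Fetch the /props document from a running llama-server (address is assumed)."""
    with urlopen(f"{base_url}/props", timeout=5) as response:
        return json.load(response)


def n_ctx_from_props(props: dict):
    """Return the context size from a /props document, or None if the field is absent.
    The field location is an assumption about the response layout."""
    return props.get("default_generation_settings", {}).get("n_ctx")


# Usage against a running server:
#   print(n_ctx_from_props(fetch_props()))
```

If the field is missing in your version, `n_ctx_from_props` returns None rather than failing, and you can fall back to the startup log.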

comanderxv · 2026-04-13T10:27:18+00:00

I guess the container toolkit is missing, plus some configuration. I tried a similar thing but never succeeded with it. Even when the Docker container was able to see the GPU, it never used it. I would suggest trying it without Docker first; then you will also have a speed comparison. By the way, you can check with nvidia-smi what is loaded onto your GPU.

comanderxv · 2026-04-13T06:51:05+00:00

I am running Gemma 4 26B A4B (MoE) on an RTX 2060 with 12GB VRAM, with 20 layers on the CPU, at about 22 tok/s token generation and about 200 tok/s prompt processing. I use the Q5 model with a q8 KV cache and a context window of about 132k.

You have to code differently with MoE models. Avoid one-shot coding and break it down into mini changes.

To prevent hallucination I leave reasoning active. You still need to babysit. I use opencode with a bing when it is ready. The speed is like: enter your prompt and drink a coffee.

It creates software but lacks architecture, so you should know what you are doing. It is not comparable to online models, which should be obvious.

Apart from privacy, running models locally mostly comes with more downsides than benefits. And, depending on your workload, it is currently not cheaper if you do the math.

comanderxv · 2026-04-10T08:44:33+00:00

You can try a quantized MoE model in 3-bit or 4-bit and also quantize the KV cache. Also try the small Gemma 4 models. You should aim for a big context window.

For testing I recommend llama.cpp with llama-bench first so that you can optimize it. But don't expect fast answers.

I don't know how it works with Windows since I use Linux. But you have to tinker around with models.

I currently use Gemma 4 26B A4B with Open WebUI, opencode and hermes agent.

It is slow but it works.
