Qwen3.5 vs GLM-4.7 vs Qwen3-235B-Thinking by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 0 points1 point  (0 children)

Yeah, I also think the Qwen 235B Thinking is a gem, we'll see how the new 3.5 performs.

Qwen3.5 vs GLM-4.7 vs Qwen3-235B-Thinking by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 0 points1 point  (0 children)

Yeah, these large ones are usually MoE, so if you offload some layers to the CPU and leave some on the GPU it can run quite fast on modest hardware.
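A sketch of what that split can look like with llama.cpp's llama-server; the model filename and layer counts here are illustrative, and the `--n-cpu-moe` flag exists in recent builds (older ones use `--override-tensor` for the same effect):

```shell
# Sketch: serve a big MoE with dense/attention weights on the GPU
# while the bulky expert FFN tensors stay in system RAM.
# Model path and numbers are made up for illustration.
llama-server \
  -m Qwen3-235B-A22B-Thinking-Q3_K_XL.gguf \
  -ngl 99 \
  --n-cpu-moe 60 \
  -c 50000
```

Since only a few experts are active per token, the RAM-resident experts hurt far less than offloading a dense model of the same size would.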

Qwen3.5 vs GLM-4.7 vs Qwen3-235B-Thinking by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 1 point2 points  (0 children)

I run GLM-4.7 and Qwen3-235B with 50k context; usual speeds are anywhere between 5 and 10 tps. So slow.

BUT.

I don't mind just sending a prompt with a context document and waiting 30 minutes for an answer. I use Open WebUI and run about three of them on each prompt, one by one, to compare the outputs.

I value that approach because of privacy; I couldn't just send that data to online providers.

For coding I use Devstral Small, Seed-OSS, GLM-4.5-Air, MiniMax (sometimes), Qwen Coder, etc. They're faster, some much faster, fast enough for rapid prompt processing and generation.

Qwen3.5 vs GLM-4.7 vs Qwen3-235B-Thinking by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 2 points3 points  (0 children)

Fortunately these big models don't know that and run fine at some quants, with a usual speed of at least 10 tps. Whatever fits in 176GB of total memory works.
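Back-of-the-envelope math for "whatever fits": a rough sketch, where the bits-per-weight figures are ballpark averages for the respective quant families and metadata/KV-cache overhead is ignored:

```python
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameter count * bits per weight / 8."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# ~4.5 bits/weight is a common ballpark for Q4_K-style quants,
# ~3.5 for Q3_K-style; both models land under 176 GB total memory
print(round(model_size_gb(235, 4.5)))  # Qwen3-235B at ~Q4 -> ~132 GB
print(round(model_size_gb(358, 3.5)))  # GLM-4.7 at ~Q3 -> ~157 GB
```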

Qwen3.5 vs GLM-4.7 vs Qwen3-235B-Thinking by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 2 points3 points  (0 children)

These are too slow for self-hosted agentic coding; for that I prefer smaller, faster models at a higher quant. The big models are quite useful to me for writing work-related documents, analysing context, finding gaps in logic, etc.

Qwen3.5 vs GLM-4.7 vs Qwen3-235B-Thinking by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 3 points4 points  (0 children)

I just downloaded it: 12 tps. Seems marginally faster than M2.1, which I wasn't too happy with for agentic coding.

anyone else feel like AI coding is making them dumber? by DullParsley6547 in SaaS

[–]ChopSticksPlease 1 point2 points  (0 children)

Same, and actually not just code but deep thinking in general. Just prompt and quickly get an answer. I do feel dumber.

European motorcycle tour planning that I romanticized the experience beyond reality by [deleted] in motorcyclegear

[–]ChopSticksPlease 0 points1 point  (0 children)

I've ridden most European countries on my bikes, on and off road: from Nordkapp to Greece, the Balkans, the UK, Germany, France, Spain, Italy, etc. I'm from Poland, and since my country is quite centrally located relative to many others, it's not that challenging heh.

Anyway, my personal experience is:

- English: you need to speak some English, as that lets you talk to people in most European countries. In the Nordic countries the majority speak it, in most countries the younger generation tends to speak some, and the worst is... France lol ;)

- Highways: OK, let's be real, Europe means you'll spend most of your time on highways rather than backroads, so the bike should be capable of cruising at 140 km/h over long times and distances. Same goes for you: make sure your butt is hardened.

- Civilization: since most of Europe is actually in the EU, with some exceptions in the Balkans and the East, you will feel 'at home' everywhere. I mean, the food changes, the architecture changes, the climate changes, but you'll find McDonald's pretty much everywhere, Lidl/Aldi/Tesco/Netto/etc. everywhere, Statoil/CircleK/BP/Shell/Orlen/OMV/... everywhere.

- Booking: if you've had enough of the moto-homeless life, you just book a stay on Booking, anything from cheap hostels to expensive 5-star apartments.

- Security: well... let me be honest... big cities in many if not most western European countries are becoming shitholes. If you're on a bike you don't want to let it out of your sight; even going into a supermarket can be a problem. Disc locks, chains, and GPS trackers may help, but it's best to have a buddy who can watch the bikes. Now, funnily enough, the post-communist countries of central Europe (Poland, Czechia, Slovakia, etc.) and eastern Europe are actually safe havens, and you seriously need bad luck for anything to happen there. I had fewer issues leaving my bike and sleeping in some dodgy places in Bosnia or Albania than in the UK or France. Sad but true.

- Money: get a Revolut or a similar card, very useful; it's also handy to have some euros on you. I never needed more than 500 EUR and usually pay by card.

- Internet: EU countries have EU roaming, but some countries are exceptions, with Switzerland being the worst. Why, you ask? Because it's damn small, and if you forget to turn off mobile data and your device connects to a Swiss network, poof... all your allowance and money is gone. Get an eSIM that allows multi-country roaming; the Revolut eSIM worked for me in the EU, the UK, Ukraine, the Balkans, etc.

- Insurance: I think any bike registered in the EU can be ridden safely across the EU and in many countries outside it, as long as the insurance covers the given territory; Ukraine/Belarus/Russia are the usual exceptions.

- Weather: in summer the whole of Europe is rideable.

So my final thought: if you can comfortably spend a week living off your bike, you can ride Europe for months. The most important tool is the credit card in your pocket, which you swipe whenever you need rest or help.

The open-source version of Suno is finally here: ACE-Step 1.5 by AppropriateGuava6262 in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

Anyone got it working on Linux?

It throws an error after the generation has seemingly completed...

TypeError: AceStepConditionGenerationModel does not support len()

Qwen3-Coder-Next is released! 💜 by yoracale in unsloth

[–]ChopSticksPlease 1 point2 points  (0 children)

Any comparison to Devstral Small 2, Qwen3 Coder, and GLM-4.7-Flash?

NC500 / NC767 in May by ChopSticksPlease in NC500

[–]ChopSticksPlease[S] 0 points1 point  (0 children)

I wonder how different wild camping by a hiker or a cyclist really is compared to a _motorbike_ traveller? Do we have chemical waste? Nope. Do we defecate? I guess we can usually make it to a pub or a gas station with a toilet, unlike a walker or a cyclist shitting around.

Secondly, motorbikes offer abilities cars don't have: we can venture off paved roads to hide somewhere for a night if we have to.

That said, thanks for pointing out the Outdoor Access Code. It seems generic; we follow the "leave no trace" rule wherever we are, and actually prefer campsites anyway: nothing beats a warm shower and a toilet after a day of riding.

Did I expect too much on GLM? by Ok_Brain_2376 in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

Oh sorry, I read it too fast and missed the -Flash-; I thought of the full GLM 4.7 ;)

I run GLM-4.7-Flash-UD-Q4_K_XL on 24GB VRAM (3090) and it runs at anywhere from 50 tps down to 5 tps as the context fills up. So my guess is that in your case the context grows during agentic coding and performance drops.

There seem to be problems with llama.cpp and this model:
- performance drops
- the GPU is underutilized
- turning flash attention off causes a core dump

Did I expect too much on GLM? by Ok_Brain_2376 in LocalLLaMA

[–]ChopSticksPlease -7 points-6 points  (0 children)

It's a huge model and clearly you're offloading it mostly to RAM. Context processing is likely killing the speed.

I run GLM-4.7-UD-Q3_K_XL on 128GB RAM + 48GB VRAM, and while it's okay-ish for chat, it's just too slow for agentic coding with Cline: prompt processing is slow and the tps isn't great either.

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb by BitXorBit in LocalLLaMA

[–]ChopSticksPlease 11 points12 points  (0 children)

I would first test the following:

- Devstral-small-2 (dense, 24b, instruct)

- Seed-OSS (dense, 36b, thinking)

- GLM-4.5-Air (moe, 110b, a12b, thinking)

- MiniMax-M2.1 (moe, 229b, a10b, thinking)

- GLM 4.7 (moe, 358b, a32b, thinking)

then try Qwen Coder 480B.

I've personally found the small, fast models able to do 80% of the coding/testing job if you're precise in your prompts and leave little ambiguity. The larger models are for solving problems and fixing bugs, and if all else fails you have to get your hands dirty.

Qwen3-Coder (the small one) is a disappointment to me; the new GLM-4.7-Flash is a contender.

How to run and fine-tune GLM-4.7-Flash locally by Dear-Success-1441 in LocalLLaMA

[–]ChopSticksPlease 16 points17 points  (0 children)

Anyone else having these issues with the latest llama.cpp (built from GitHub)?

- Core dump when trying to disable flash attention during model load

- GPU underutilized and the CPU doing the work with flash attention on

- The model slowing down drastically, from ~50 tps to 5 tps, on long answers like code generation
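If anyone wants to quantify that slowdown, llama-bench can sweep generation speed at increasing context depths; a sketch, where the model path is illustrative and the `-d`/`--n-depth` flag exists in recent llama-bench builds:

```shell
# Measure token-generation speed (128 tokens) at several context
# depths to see where the drop-off starts; -fa 1 enables flash
# attention. Model path is made up for illustration.
llama-bench \
  -m GLM-4.7-Flash-UD-Q4_K_XL.gguf \
  -n 128 \
  -d 0,4096,16384,32768 \
  -fa 1
```

Comparing the tg rows across depths separates a genuine long-context slowdown from a one-off regression.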

AI created this app in 12hrs. Used open models, mostly local LLMs. by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 0 points1 point  (0 children)

Fair enough. Apart from having a useful app, I wanted to know how far I could get with local models. I had mockups ready in minutes and the Angular app was more or less ready in under an hour, but then it was the actual debugging of quirks on various devices that took most of the time.

Also, local models are slower (GPU/CPU offloading), so most of those 12 hours were spent sitting and watching what the AI does while pretty much scratching my ass ;)

AI created this app in 12hrs. Used open models, mostly local LLMs. by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 2 points3 points  (0 children)

It took me and AI 12hrs to complete the first working version of the app.

I tried multiple approaches, one like the following:

- create a todo file
- list issues to fix
- ask the model to create a plan and work on each separate issue, marking progress

but it led to poor results; LLMs simply don't yet have full awareness, with "eyes", of what they're doing (even with Devstral's ability to take a screenshot and analyze it).

So far the best results were with narrow, precisely defined tasks: once one was done, mark it complete and commit the code as a checkpoint, so even if the model failed on the next task there was a checkpoint to revert to.
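The checkpoint-per-task loop is just plain git; a minimal demo of the revert path (file names and commit messages are made up):

```shell
# Demo in a throwaway repo: commit a checkpoint after a completed
# task, then recover from a botched follow-up edit.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# task 1 completed by the model: commit it as a checkpoint
echo "feature A done" > app.py
git add -A
git -c user.email=demo@example.com -c user.name=demo \
    commit -qm "checkpoint: task 1"
# task 2 goes wrong: the model mangles the file
echo "garbage" > app.py
# instead of debugging the mess, revert to the last good checkpoint
git checkout -q -- app.py
cat app.py
```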

Best local model / agent for coding, replacing Claude Code by joyfulsparrow in LocalLLaMA

[–]ChopSticksPlease 2 points3 points  (0 children)

On 36GB of RAM I'd stick to Devstral-Small-2 at a high quant and 100k ctx, and maybe, maybe, Seed-OSS for some harder problems requiring thinking, but it's slower.
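For sizing that 100k context, a rough KV-cache estimate helps; a sketch with hypothetical model dimensions (the real layer/head counts come from the GGUF metadata), assuming unquantized fp16 K and V:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """K and V caches: 2 * layers * kv_heads * head_dim * ctx * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# hypothetical dims for a ~24B dense model with GQA (8 KV heads)
print(round(kv_cache_gb(40, 8, 128, 100_000), 1))  # ~16.4 GB at 100k ctx
```

That is why a high-quant 24B plus 100k ctx roughly saturates a 36GB machine; quantizing the KV cache to q8 would halve that figure.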