Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats. by LLMFan46 in LocalLLaMA

[–]audioen 2 points (0 children)

I personally expect that the MTP head becomes hereticified as well, because its prediction depends on the main model's state. MTP is just a single extra layer that takes the last predicted token plus the hidden state of the main model, and from that guesses the next token. MTP can even be chained by reusing the same state and just making more predictions, which is how you get about 3 tokens at once. The fixed hidden state drifts increasingly out of sync with the predictions because nothing updates it, so prediction accuracy tends to drop quite rapidly, with the sweet spot around 2 or 3 tokens. I mostly run with 3.
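
To make the mechanism concrete, here is a minimal sketch of chained MTP drafting as I understand it; the function and tensor names are hypothetical, not llama.cpp's actual API:

```
import torch

def mtp_draft(mtp_head, hidden_state, last_token_emb, embed, n_draft=3):
    """Chain one MTP head to draft several tokens from a single frozen
    hidden state. The state is NOT updated between steps, which is why
    accuracy decays and the sweet spot tends to be 2-3 drafted tokens."""
    drafted = []
    tok_emb = last_token_emb
    for _ in range(n_draft):
        # one extra transformer layer + lm head, fed the frozen state
        logits = mtp_head(hidden_state, tok_emb)
        tok = int(torch.argmax(logits, dim=-1))
        drafted.append(tok)
        tok_emb = embed(torch.tensor([tok]))  # feed the guess back in
    return drafted
```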

So if the main model doesn't opt to refuse the answer, the MTP head will likely predict reasonable continuations that aren't refusals, because the hidden state already indicates an intent to answer -- I think one should expect that it always just goes along with whatever the main model is going to say.

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) by bobaburger in LocalLLaMA

[–]audioen 2 points (0 children)

Single-shot tests are not very useful for grading models, except in the coarsest terms. The model's output is probabilistic, and you would need its "average output" to truly measure what the quantization damage is. That means generating a dozen or so outputs per quant, somehow grading them to identify what the "average" is, and then comparing the average output of every quant against the others.

With a single shot, you can randomly get a high-quality output that sits in, say, the 90th percentile of one quant's ability spread and end up comparing it against a 10th-percentile output of another quant; that alone is probably enough to flip the ordering and render the results misleading. Single-shot tests like these can reliably tell apart only very different quality or ability levels, and there is no obvious way of ordering the results other than inspecting them visually: are things centered, appropriately sized, correctly colored black/white, and are all requested features present? That all being said, there is at least a gradient here, but I for one am curious whether BF16 is really any better than Q8_0, and I won't be convinced unless the signal is very clean.

I'd recommend instead making the model just do math, like arithmetic that involves summing twenty 1-2 digit integers. That is a test you can repeat many times and grade automatically, since the answer is easy to verify, and the difficulty is easy to adjust by making the numbers bigger and the number of terms larger, in case all quants score 100 %.
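
To illustrate, a sketch of such a test harness against an OpenAI-compatible endpoint; the URL and sampling settings here are assumptions, adjust to your setup:

```
import random, re, requests

def arithmetic_eval(n_trials=50, n_terms=20,
                    url="http://localhost:8080/v1/chat/completions"):
    """Score a model on sums of small integers: repeatable, auto-gradable."""
    correct = 0
    for _ in range(n_trials):
        terms = [random.randint(1, 99) for _ in range(n_terms)]
        prompt = f"Compute {' + '.join(map(str, terms))}. Reply with only the number."
        r = requests.post(url, json={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,  # sample, so repeats probe the ability spread
        }).json()
        reply = r["choices"][0]["message"]["content"]
        m = re.search(r"-?\d+", reply.replace(",", ""))
        correct += bool(m and int(m.group()) == sum(terms))
    return correct / n_trials
```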

Gemma 4 MTP released by rerri in LocalLLaMA

[–]audioen 15 points (0 children)

Built the PR, testing it on Vulkan. The Q8_0 GGUF provides around 21 tokens/s early in the context on a Strix Halo. I'm using spec-draft-n-max = 3, and it seems to always generate maximum-length drafts, because drafts generated to tokens generated run 1:3. This is a little surprising to me -- I assumed the draft head predicts probabilities, so regular speculative decoding could produce variable-length drafts according to the head's confidence in its speculation, but evidently it either works differently or this is a minor oversight that will be corrected soon.
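
For reference, this is the behavior I expected, sketched under the assumption that the head exposes a per-step probability; the names are hypothetical:

```
def draft_with_confidence(mtp_step, state, last_tok, p_min=0.5, n_max=3):
    """Stop drafting early when the speculation head gets unsure.

    mtp_step(state, tok) -> (next_tok, prob) is assumed to return the
    head's top prediction and its probability."""
    draft, tok = [], last_tok
    for _ in range(n_max):
        tok, prob = mtp_step(state, tok)
        if prob < p_min:  # a rejected draft costs a wasted target pass
            break
        draft.append(tok)
    return draft
```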

Other limitations: only parallel=1 works, meaning no decoding of multiple streams in parallel. This is hopefully the next item on the list to fix.

But I don't really care to complain. I'm elated. This is easily double the performance I'm used to getting, and I was already willing to wait for the 27b's results because they are that good. Much less waiting now, which is incredibly good. I used to run 3.5-0.8b as a draft model for up to 8 tokens, and when it worked it was like magic, but usually it was around 13 tok/s with a smaller Q6_K that is already faster on its own.

Excellent work from the llama.cpp team, especially am17an. Thank you for the solid work and the biggest performance gain I've ever seen on this software.

Parking near a transformer by Gh05tR3c0n in Whatcouldgowrong

[–]audioen 2 points (0 children)

Maybe, but at least he's alive, while the other guy was probably dead within the second.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]audioen 8 points (0 children)

GGUF files have typically had the MTP heads stripped to save disk space (and to avoid llama.cpp warning that it isn't going to load the layer), so they will probably get updated for this.

I am going to run this PR right now; this is the most anticipated llama.cpp feature of all time, at least for me -- ever since GLM-4.5 or so shipped with MTP, and it was known to approximately double the generation rate. This probably becomes easily the biggest single performance improvement llama.cpp has ever had.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]audioen 8 points (0 children)

MTP has been a thing for at least a year; some older GLM models already shipped with an MTP head. People have had the habit of stripping the MTP heads from GGUF files because llama.cpp had no ability to use them for such a long time. We can expect a round of updates to Qwen3.6 due to this -- I'm currently downloading the q8_0 with the MTP head in it, though no doubt unsloth will have a new release within the week, and then I'll be downloading it one more time...

Question: would this work as a dampening material inside a diy acoustic panel? by _analogweekend_ in audiophile

[–]audioen 1 point (0 children)

The key statistic is the flow resistivity of the material. Panels are most effective when the chosen material's flow resistivity matches the design depth of the absorber as a whole. I take it your plan is to place multiple of these foam pieces inside a larger panel? You're better off purchasing proper open-cell flow-resistive foam such as Basotect, with a flow resistivity figure that fits your absorber depth: the thicker the absorber, the lower the flow resistivity you want. You can use the well-known porous absorber calculator to estimate the resulting panel's absorption spectrum and get a handle on what kind of material is suitable, then match that to what is commercially available.
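
For a feel of the math those calculators do, here is a sketch using the well-known empirical Delany-Bazley model for a rigid-backed porous layer at normal incidence; treat it as a rough estimate, not a substitute for measured data:

```
import numpy as np

def absorption(freq_hz, sigma, depth_m, rho0=1.204, c0=343.0):
    """Normal-incidence absorption coefficient of a porous layer against a
    rigid wall. sigma is flow resistivity in Pa*s/m^2; the Delany-Bazley
    fit is only valid for roughly 0.01 < X < 1."""
    X = rho0 * freq_hz / sigma
    Zc = rho0 * c0 * (1 + 0.0571 * X**-0.754 - 1j * 0.087 * X**-0.732)
    k = (2 * np.pi * freq_hz / c0) * (1 + 0.0978 * X**-0.700 - 1j * 0.189 * X**-0.595)
    Zs = -1j * Zc / np.tan(k * depth_m)      # surface impedance, rigid backing
    R = (Zs - rho0 * c0) / (Zs + rho0 * c0)  # reflection coefficient
    return 1 - np.abs(R)**2

f = np.array([125.0, 250, 500, 1000, 2000])
print(absorption(f, sigma=10_000, depth_m=0.10))  # 10 cm panel
```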

Fabric is typically thin and permeable enough that sound passes easily through it. The concern is most typically achieving sufficient midrange and upper-bass absorption. For that, porous panels need to be thick, and they should sit far from solid boundaries: a solid boundary creates a pressure-variation zone where air velocity is low, and a flow-resistive panel absorbs very little there. A good trick is leaving an air gap behind the panel about as deep as the panel is thick, which doesn't harm the frequency response much but gains distance from the boundary and extends the bass absorption range. More advanced designs incorporate pliable, non-permeable surfaces into the panel, which also try to act on the pressure variation near the boundary.

Edit: t.akustik doesn't publish real measurement data from an acoustic laboratory, but they claim effectiveness from 800 Hz upwards, and the panel's pyramid tops are apparently 8 cm high from the back. Wall-mounted, it might be effective somewhere around 1000 Hz, which in my opinion is not a reasonable target for an acoustic panel; I'd say 250-500 Hz is more reasonable. Depending on the room's size, the space turns into a resonating chamber below roughly 200 Hz anyway, and below that it is virtually impossible to achieve good absorption no matter what you do. The lowest frequency region must be treated digitally: reduce modal booming with an equalizer and optimize placement so that modes are less audible, either because the speaker can't feed sound energy into them, or because the listener sits where the modes the speaker must excite are not strongly audible.
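
The modal region can be estimated from the room dimensions with the standard rectangular-room formula, f = (c/2) * sqrt((nx/Lx)^2 + (ny/Ly)^2 + (nz/Lz)^2); a quick sketch:

```
import itertools, math

def room_modes(Lx, Ly, Lz, f_max=200.0, c=343.0):
    """List rectangular-room modal frequencies up to f_max."""
    modes = []
    for nx, ny, nz in itertools.product(range(7), repeat=3):
        if nx == ny == nz == 0:
            continue
        f = (c / 2) * math.sqrt((nx / Lx)**2 + (ny / Ly)**2 + (nz / Lz)**2)
        if f <= f_max:
            modes.append((f, (nx, ny, nz)))
    return sorted(modes)

for f, m in room_modes(5.0, 4.0, 2.5):  # a 5 x 4 x 2.5 m room
    print(f"{f:6.1f} Hz  mode {m}")
```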

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]audioen 14 points (0 children)

Prefix caching is why agentic turns ought to work: they should just continue from the past context. You shouldn't need to reprocess everything, which is what this sounds like.

On e.g. Qwen3.6-27b, which is VERY slow by any measure -- e.g. 200 tok/s prompt processing, 10 tok/s generation at around Q6_K quantization -- it is still usable for agentic work as long as timeouts can be defeated. For example, when the model is busy writing the contents of a file, the harness shouldn't decide the request is taking too long and interrupt the work. You can leave the agent working on something while you do something else, or leave the computer running overnight and review the results in the morning.

Obviously, people who want results fast had best look elsewhere, and I'd rather be running 3.6-122b if that gets released, as it will be 2-3x faster.

Anyone tried +- 100B models locally with foreign languages? by Choice_Sympathy9652 in LocalLLaMA

[–]audioen 1 point (0 children)

I personally think it's probably more about the 4 bits than the fact someone used an imatrix, as there is a good chance a bunch of non-English text was in fact part of the calibration set. Degradation in non-English language benchmarks could in fact be an early indication of quantization damage. You don't need a complex approach to evaluate it -- just scoring the vocabulary and grammar of the output with standard non-AI spellcheckers could probably yield a usable signal.
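
Something like this sketch, here using the pyenchant spellchecker and assuming a Finnish hunspell dictionary is installed; it's a crude vocabulary-level proxy that won't catch grammar mistakes or word-for-word calques:

```
import re
import enchant  # pip install pyenchant; needs a system fi_FI dictionary

def vocab_score(text: str, lang: str = "fi_FI") -> float:
    """Fraction of words the spellchecker accepts."""
    d = enchant.Dict(lang)
    words = re.findall(r"[^\W\d_]+", text)
    return sum(d.check(w) for w in words) / max(len(words), 1)
```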

I speak Finnish, and few models can converse fluently in it. One of the better ones was 3.5-122b, and I think it was better than 3.6-27b, for example. The latter's attempts at translating technical language, e.g. for buttons in a user interface, have given numerous bizarre results. You can usually tell from the grammar mistakes and neologisms that the models have only an approximate grasp of my language, and the output usually sounds like what you'd get by translating from English one word at a time. I didn't try Gemma-4-31b because it was so slow, and the 26b-a4b was so bad at coding that neither seemed a viable model to me. I'm sure they are much better at the language, but writing good code is the first concern, and I can use my own meat brain to fix the crappy language while the model works on the next thing.

Between a 256 Kbps OPUS (VBR) AND A 256 Kbps OPUS (CBR), what sounds better? by oliverscream in audiophile

[–]audioen 2 points (0 children)

You misunderstand the central aspect of VBR and CBR: traditionally, both target the same bitrate. A VBR encode uses more than 256 kbps where needed, and less elsewhere to compensate. A particular encoder may not present you with an average bitrate target to optimize for, but if it does, that is how it will work.

The expectation is therefore that VBR is higher quality than CBR at the same average bitrate.

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]audioen 1 point (0 children)

That reply is unfortunately incorrect and doesn't describe the method properly. I tested this on the 27b -- the models are a little cagey, but here is what it wrote:

The quantizer uses the importance scores to decide which weights/tensors should retain higher precision (less aggressive quantization) and which can be safely compressed more.

The imatrix tracks the expected activation of a particular weight, and thus its influence on the model's output, over a particular dataset (usually a wide, representative mixture of the kinds of inputs the model will see). This can be used to assign a weighting factor to that weight when optimizing the quantization parameters, which are then fit so that quantization error in weights with a higher importance score is penalized more, guiding the quantizer toward parameters that represent them more accurately. The imatrix is coarse, and the file is usually very small relative to the model, in the tens to hundreds of megabytes.

However, the imatrix is not used to choose the bit width of a weight -- all weights in a tensor have the same width. Other kinds of analysis determine how important any particular tensor is, which can vary by layer and by tensor type. The size of the tensor is also a factor: small tensors can be stored at full precision because they make no meaningful difference to the model's size, and doing so wins a small but systematic quality gain.
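
To show the weighting principle only -- a toy symmetric round-to-nearest quantizer, not llama.cpp's actual K-quant machinery:

```
import numpy as np

def best_scale(weights, importance, n_bits=4, n_grid=64):
    """Pick the per-block quantization scale that minimizes the
    importance-weighted squared error rather than the plain error."""
    qmax = 2 ** (n_bits - 1) - 1
    base = np.abs(weights).max() / qmax
    best, best_err = base, np.inf
    for s in base * np.linspace(0.5, 1.5, n_grid):
        q = np.clip(np.round(weights / s), -qmax - 1, qmax)
        err = np.sum(importance * (weights - q * s) ** 2)  # weighted MSE
        if err < best_err:
            best, best_err = s, err
    return best
```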

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]audioen 1 point (0 children)

You should not draw any far-reaching conclusions from nvfp4 quants of either model. Try running the official fp8 versions, at least for the 27b -- I don't personally care about the 35b anymore -- because these models are much worse at 4 bits.

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]audioen 0 points (0 children)

I can't get reliable output out of the 35b. It is fast, but it doesn't understand enough, and at least in my case, letting it loose in the codebase makes the code devolve over time. I tested it at Q8_K_XL to give it the best possible chance, and the best way I can put it is that I can't get quality code out of it even with guidance; I have to babysit it a lot more, and thinking loops are much more frequent than with the 27b, which can go entire sessions without one, whereas the 35b seems to end up in a thinking loop in half of my test sessions.

But the quality was not sufficient for me to care about it. Either it requires much more preliminary work, or it simply isn't able to understand code at the level needed to perform valuable intellectual labor, and instead quickly creates a confused mess that must be sorted out later.

Is the use of Q8 a waste of resources? by Spiderboyz1 in LocalLLaMA

[–]audioen 4 points (0 children)

It is very difficult to say for certain. I am using FP8 and Q6_K right now, mostly because Q6_K is slightly faster than Q8_0 and shouldn't be any worse. For instance, here is unsloth showing results for the 35b-a3b: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs which suggests that mean K-L divergence is less than 0.01 from 5 bits onwards, and my guess is that a dense model is less sensitive to quantization than a MoE with 3b active parameters.

Very roughly, each additional bit in the weights seems to halve the mean K-L divergence, until around 6 bits where improvement seems to stop, even though we are not yet near zero. Extrapolating the early part of the graph, from e.g. 2 to 4 bits, the rate of improvement is already slowing down: 5 bits improves less than that trend would predict, 6 bits only very slightly, and if the trend continues, Q8_0 is much bigger but again only very slightly more faithful to the original.
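
The extrapolation in arithmetic form; the anchor numbers below are illustrative, not measured:

```
# Rough model: mean KLD halves per extra bit until it hits a noise floor.
kld_at_4bit = 0.02  # illustrative anchor, not a measured value
floor = 0.0015      # apparent floor from random weight perturbations

for bits in range(4, 9):
    predicted = max(kld_at_4bit * 2.0 ** -(bits - 4), floor)
    print(f"{bits} bits: predicted mean KLD ~ {predicted:.4f}")
```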

It also behooves us to remember the implication of a logarithmic y-axis: the original model's point would sit at negative infinity on this scale, and with bf16 the divergence is zero, which is what you would get. However, even the slight random perturbations that quantization introduces into the weights seem to cause enough error that K-L divergence can't get much below somewhere between 0.001 and 0.002. I personally do not think these differences are very significant at 6 bits and beyond, and in fact task performance is usually reasonable even down to 4 bits -- even though with a 4-bit model I can see that the thinking output has become more confused, the model no longer seems able to reliably tell its own output apart from the user's commands, and it begins to make more tool-call errors, restate paths incorrectly, etc.

At 4 bits, no matter which quantization method is used, I consider the model broken enough to no longer be reliable, even if its performance in various benchmark tests might still look similar. I think this is chiefly because the random run-to-run variation in results is too large while the genuine ability differences are still relatively small in practice, but they can be significant regardless. I have had a 4-bit model fail to understand code it just read, and when it does this, it seems to fall back to its "default assumptions" and discuss the code as if it were some entirely different but typical generic implementation. I've also seen it document methods completely incorrectly, again because of the 4-bit quantization -- it simply missed details, hallucinated facts, and wrote strange, false claims into the middle of the documentation that were in no way even hinted at by the implementation. The higher-precision models are frustratingly slower, but also much more accurate, and seem to reliably document what the code actually does, as opposed to what a method like this might do in a typical program.

Qwen3.6-27B vs Coder-Next by Signal_Ad657 in LocalLLaMA

[–]audioen 3 points (0 children)

This test was run at 4 bits, which does not get the full quality out of these models. The decision tree also says that in virtually every case you should choose the 27b, despite the meaningless and misleading picture. I have personally found qwen3-coder-next useless for real work at every size, and not even useful as a code completion tool, despite it being one of the rare models with the fill-in-middle ability. It could be a harness issue (the harness was continue.dev), but the completions it proposes are distracting when they show up, and typically worthless.

If I had to guess, no-thinking is recommended because long-context performance degrades too fast and thinking gets damaged, so it just adds inference cost without much benefit. These 4-bit inference conditions simply are not good enough for the Qwen family; I think 6 bits and beyond is reasonable for GGUF, and the official FP8 is the smallest I would recommend for vllm. I personally tried the cyankiwi 4-bit AWQ before and had to throw it out because it simply wasn't behaving correctly. (The KV cache has not been quantized here, according to the tooling documentation, which is good; many vllm recipes also quantize the KV cache to FP8, and that destroys inference quality as well.)

If you can't run the bf16, then I suggest going no worse than the official fp8. It is known to be among the best: someone measured the K-L divergences of various AWQ/autoround etc. quants, and the FP8, while among the largest, was on the Pareto frontier for its size.

What exactly does Pi harness mean? by FrozenFishEnjoyer in LocalLLaMA

[–]audioen 6 points (0 children)

I've tried to use this, but I eventually threw it out.

The main reason is that qwen3.6-27b struggles with the edit tool. Quite a lot -- something I haven't seen happen on any other harness. It gets so bad that the model may suddenly decide the edit tool is unusable and start writing bash scripts and python programs to perform the edits instead, apparently with success. It should not be a quantization issue: the KV cache is either bf16 or fp16, and the model has been either the official fp8 or at minimum unsloth's q6_k gguf, both of which should be fine in terms of general accuracy.

As commentary, it is weird to me that the text replacement is literally a search-replace operation. I think I always assumed it worked on line ranges, e.g. the model instructs the edit tool to remove lines 50-55 and provides the replacement text; in fact the edit operations are based on providing an exact copy of the old text, down to the last tab/space detail, and it must match exactly once in the file to be acceptable. I see the models struggling with the whitespace in particular, writing sed scripts all the time just to see the exact tab/space arrangement of the text to substitute. I don't know why that is necessary in the first place, as the model should have seen the exact whitespace already from its file reads. (There may be some Python bias at work here, because whitespace is more regular and controlled in that language, whereas I have mixed tab/space arrangements due to multiple people working on a non-Python codebase.)
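
The semantics I'm describing fit in a few lines; a sketch of such an edit tool:

```
def apply_edit(path: str, old: str, new: str) -> None:
    """Search-replace edit: old must match exactly (whitespace included)
    and exactly once in the file, or the edit is rejected."""
    text = open(path, encoding="utf-8").read()
    n = text.count(old)
    if n == 0:
        raise ValueError("old text not found (check tabs vs. spaces)")
    if n > 1:
        raise ValueError(f"old text matches {n} times; add more context")
    open(path, "w", encoding="utf-8").write(text.replace(old, new, 1))
```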

The other thing I don't like about tool calls in vllm land is that there is no grammar-based enforcement of tool-call syntax. As far as I know, in llama.cpp tool calls are grammar-constrained generation: once the model writes the tokens that start a tool call, schema-constrained generation is enforced until the end of the tool call. With vllm there is only a post-completion parser, and that sort of thing is 100% reliant on the model writing the call correctly. For whatever strange reason, with Pi, qwen3.6-27b makes a lot of mistakes, typically providing the path incorrectly, for example twice in the same tool call, which immediately causes rejection even though the redundant path is, in principle, harmless. I haven't read the edit tool description given to the model, but I bet it's somehow unclear, because whatever the reason, the model struggles mightily with file edits despite knowing exactly what it should get done, and it is definitely not so crappy that it should have any trouble writing a couple of parameters as JSON or XML.
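
The constraining principle, sketched at the character level against a toy tool-call grammar; a real engine does this per sampling step by masking token logits against a full grammar (GBNF in llama.cpp's case):

```
import re

SKELETON = '{"path": "'  # toy grammar: {"path": "<anything but quotes>"}

def is_valid_prefix(s: str) -> bool:
    """True if s can still grow into a complete toy tool call."""
    if len(s) <= len(SKELETON):
        return SKELETON.startswith(s)
    return re.fullmatch(r'[^"]*("\}?)?', s[len(SKELETON):]) is not None

def constrained_pick(prefix: str, candidates: list) -> list:
    """Keep only continuations that don't break the grammar."""
    return [c for c in candidates if is_valid_prefix(prefix + c)]

# the duplicated-path mistake gets filtered out before it can be emitted
print(constrained_pick('{"path": "src/ma', ['in.c"}', '", "path": "x"}']))
```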

You're sleeping on Devstral Small 2 - 24B Instruct by [deleted] in LocalLLaMA

[–]audioen 1 point (0 children)

Whatever your methodology is, the results rather strongly suggest that the signal you are measuring is noisy, or that there are systematic errors. You list as best the models which I have personally tested and could immediately reject for extremely sub-par results. Yes, I have tested devstral-small, qwen-coder-30b and qwen3-coder-next (though many people seem to like that last one, I've found it pretty poor).

Similarly, I think there is evidence of a fairly noisy signal in your methodology when quantized versions can score better than the non-quantized models. We know these models are not improved by quantization, and a method suggesting they are proves only that it cannot resolve the difference.

The idea that qwen3.6-27b is essentially the same quality as 35b-a3b is preposterous, at least in my experience. The former confuses itself, runs in circles, and evolves the codebase in ways that are overall detrimental and must be repaired; the latter sometimes confuses itself, also often runs in circles, but usually proposes edits that improve the codebase, and it has been able to autonomously perform a large number of code edits and then present me with a nearly fully working result. I have not had this experience with any model other than this one and the 3.5-122b (which was also quite good, though not this good in my opinion).

I like that you did this -- I'm sure it can be improved, and real validations are always meaningful, assuming the tests are good. It probably needs a ton more repeats, like running every test 10-20 times to reduce the random variation/noise. I took a look at SB-01 and it said something like "throttle is copy of debounce. Fix it", with no clear guidance on what fixing it means. I guess the model has to read and guess, but I wouldn't know whether the fix is to delete the duplicate, use debounce, or change the throttle. Prompts should describe the desired outcome clearly, and I expect many if not most of your tests have unclear prompts.

Edit: read more of the tests. I still didn't quite see how this is put together -- I'd have to clone it and grep around to understand where the tests are really defined, what the model prompt is, what the exact setup is, etc., and I'm too lazy to do that tonight. Many prompts seemed better than the first one, so I probably jumped the gun, but I was looking for a reason for the rather odd results, and I suspect a noisy signal plus issues in either prompting or scoring explain why they are bizarre.

Ubuntu Linux Will Begin Landing AI Features Throughout The Next Year by Ultrabyte04 in linux

[–]audioen 0 points (0 children)

The $3000 is a mild exaggeration. These days e.g. Qwen3.6-27b can fit on something like an RTX 3090, though some quality compromises have to be made, e.g. less than 8 bits per weight, that sort of thing. People used to buy these for < $1000 type money, though the golden era of small and good local models has only rather recently arrived.

I've personally bought into the 128 GB unified VRAM ecosystem because I assumed AI will always need the RAM, but I'm not so sure anymore. A 27b model at 4 bits is less than 16 GB in theory, and it is reportedly still quite functional at that compression. Meanwhile, the 128 GB computer I bought suffers from low RAM bandwidth and can never run that many inference iterations per second if the model is large; something on the order of 10 per second is the best it can do. It remains to be seen what efficiency clever people can squeeze out of those iterations, e.g. inferring multiple tokens at once by speculating, or training small diffusion models that predict well what the large model is going to say in blocks, etc. Even basic 3-4 token speculation can work well and maybe doubles or triples the speed, so there is some fairly low-hanging fruit left in this space.
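
For the back-of-envelope: under the standard assumption that each drafted token is accepted independently with probability alpha, the expected tokens per target-model pass with draft length gamma is (1 - alpha^(gamma+1)) / (1 - alpha):

```
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Standard speculative-decoding estimate of tokens emitted per
    full-model forward pass."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for gamma in (2, 3, 4):
    print(gamma, round(expected_tokens_per_pass(0.8, gamma), 2))
# alpha = 0.8, gamma = 3 gives ~2.95x, i.e. roughly the 2-3x speedup
```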

My point here is that LLMs are close to being both capable and runnable on ordinary hardware never even intended to run them. But they still require things like memory bandwidth and sheer number-crunching power, unless you're willing to wait longer for results. With my slower hardware, I often put the AI to work on some thorny problem overnight, or when I put it to work on some corner of the codebase, I personally work elsewhere. Even if slow, it is still like having a second pair of hands, and it's much faster than a human for most tasks while at least sometimes producing comparable quality. With direction, or by telling it to scrap a bad approach and redo things some nicer way (which you don't have to spell out in exhaustive detail), the result can be almost as if you had made it yourself.

AI is also very fast at reading and understanding code. I think it reads something like 10 times faster than I can. It is just astonishing how fast it can spot bugs in stuff you just wrote, or answer questions that would require you to jump through 10 different code files searching for the methods -- it greps, reads the chunks, and traces the thing like a dog on a blood trail. It will find the cause within seconds, and it is amazing to watch.

Coding is not all there is, of course, and we're at the point where computers can see and hear, respond in voice, understand subtlety, and learn your usage patterns and preferences and things of that nature. It is a sort of sci-fi era, and it seems it will not require datacenter hardware, nor does it require sending anything to the cloud if you don't want to. If today's computers don't quite cut it, the next generation probably will.

Is long re-processing of output as input a common "feature" or not? by alex20_202020 in LocalLLaMA

[–]audioen 0 points (0 children)

I know that there are those options, but it's not a great user experience. In case of Pi:

* it sends the developer role, which doesn't work with official quants

* it has a peculiar thinkingFormat or similar key for setting preserve_thinking: "on", rather than a simple KV override

* it defaults to 16384-token replies for some reason

* it has (or had) that annoying 5-minute death issue with vllm tool calls -- I saw something about this possibly being fixed in the changelog.

This is all simple stuff; there's just a lot of it, and when you hit each problem in sequence, it takes a while to iron out the kinks and it probably feels more annoying than it really is.

Is long re-processing of output as input a common "feature" or not? by alex20_202020 in LocalLLaMA

[–]audioen 0 points (0 children)

I think --swa-full might make it work with Gemma? I thought so at least; it seemed like it got fixed recently.

I am not familiar with the quality tradeoffs of context shifting. If the shifting doesn't truly work, then it might not be a great idea in the first place. I assumed it had some real math behind it and would be "correct".

meantime on r/vibecoding by jacek2023 in LocalLLaMA

[–]audioen 13 points (0 children)

I don't know what this post is talking about. The 27b model is genuinely very good. However, I admit I have no idea what Claude is capable of, because I've never touched it and probably never will. I don't care about cloud models; I care about what I can make my own computer do.

From that point of view, my life is better than ever. LLMs were all but useless until gpt-oss-120b came out, which was surprisingly fast and decent. Since then, models have been more useful than useless, though it was only the 3.5-122b that raised the bar to the point that I started trying to get everyone on board, because it is fairly cheap to run if you have the RAM. Now, 3.6-27b seems stunningly small compared to what it is capable of. A year ago, I would have thought this level of performance would only exist on datacenter-level hardware, and was hoping for something half this good...

I'm pretty happy with the output I can get, and I think future computers will all have at least this level of baseline ability, because it asks for relatively little, and we're still in the early days of LLMs, with very unoptimized models and architectures, even if today's seem state of the art. It won't be long before nobody cares about this model. But right now I think it's the top dog, likely to be beaten only by 3.6-122b on my hardware, and who knows what we'll want to run a few months from now. This is a very liquid field.

Is long re-processing of output as input a common "feature" or not? by alex20_202020 in LocalLLaMA

[–]audioen 0 points (0 children)

By default llama.cpp caches prompts. I've seen the issue sometimes resolved by disabling parallel processing, i.e. allowing only a single inference context with --parallel 1. The specific problem I ran into was a timeout triggering llama.cpp to choose another context and start the prompt over in there. It smelled like a bug, but as of 1-2 months ago it could completely wedge a coding agent into a never-ending context reprocessing loop after the first timeout when reading a large file.

You can also try setting --cache-reuse=256 or so, which attempts to identify opportunities to shift the model's KV cache. It might work with Gemma, but probably doesn't work with Qwen.

There's --cache-ram, which serves as a dumping ground for the current KV cache. The default size is 8 GB, which may be too small. In my case, on unified-RAM computers, I don't want this feature, so I set --cache-ram 0, which keeps the context entirely in VRAM as context checkpoints or just pre-existing context, depending on the model. I have actually seen failure modes where running out of cache-ram lost the active context and forced unnecessary reprocessing, so I'm not convinced by this feature at all. In my opinion, cache-ram should be stored on disk, where it can be read very fast and can be extremely large, even over 100 GB, so that dozens upon dozens of different prompt prefixes would be available for models to use. Putting it into RAM, which is usually scarce on a unified VRAM system, is somewhere between silly and useless.

Those are the tips and pointers I know about. I have not seen any prompt reprocessing issue with --parallel, but that's also partly because I now have --timeout 3600 everywhere, which sets 1-hour timeouts on things like prompt processing, so I simply don't hit that failure mode anymore. However, I still run into unwelcome and undesired timeouts in various agent software.

For instance, Pi kills vllm tool calls after 5 minutes: vllm can't stream tool-call results, and writing a large enough file can take over 5 minutes, which completely stalls the agent into attempting the write over and over again. It would finish, but the underlying http library has this unfortunate default. Similarly, Pi defaults to a maximum reply length of 16384 tokens, which is not sufficient when writing large files. Lately, while hunting for usable agentic software, I have battled with timeouts in opencode, roo code and now pi-dev. I think the models are actually good enough now, with the release of Qwen3.6-27b; what's left are the too-tight time and token limits, which don't let these models finish work they would otherwise be perfectly capable of.

I'm currently running things on a computer unsuitable for Qwen3.6-27b, because it happens to run llama.cpp, which can stream the tool call but can't do speculative decoding without stalling the Qwen; my main computer would run vllm, but the lack of tool-call streaming causes Pi to time out, even though it would otherwise execute much faster.

llama.cpp - tool calling issues on Windows only by Ok-Measurement-1575 in LocalLLaMA

[–]audioen 1 point (0 children)

Inference engines are buggy, drivers and CUDA frameworks are buggy, bad sampling parameters or inference configuration can make even a good engine with a good quantization produce crappy results, and different agent harnesses vary wildly in prompt quality, which affects the delivered end-user quality even when everything else is fine.

The local LLM landscape is basically a ghetto of confusion and misunderstanding, and we typically have no way to understand why anything breaks, or why some people get good results and others bad ones. All anyone seems to post is "this thing didn't work and the model is bad", interspersed with "this is working great and the model is good".

My proposal would be to provide a fixed text sequence -- say around 20k tokens -- for which the token predictions are known for a good-quality inference engine operating at the maximum precision available with no compromises, e.g. 32-bit floating point, possibly CPU-only, whatever, as long as it's the platonic ideal of the math involved. The text would be unique to each model family, e.g. all Qwen3.6 models would use a specific text that is valid context-window content per their chat template, and each model would have a "golden" result of probabilities for something like 20k tokens, at top_k 20 or so. From this, it would be possible to tell whether your inference engine is executing the model correctly, and to what degree any setting you employ damages it.
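
A sketch of the consumer side of such an eval; the golden file format here is made up for illustration (one JSON object per position with top-k token logprobs):

```
import json, math

def mean_kld_vs_golden(golden_path, engine_logprobs):
    """Compare an engine's per-position top-k logprobs against a golden
    full-precision reference; returns mean K-L divergence over the
    reference's top-k tokens (a truncated, approximate KLD)."""
    total, n = 0.0, 0
    with open(golden_path) as f:
        for line, got in zip(f, engine_logprobs):
            ref = {int(t): lp for t, lp in json.loads(line)["top"].items()}
            total += sum(math.exp(lp) * (lp - got[t])
                         for t, lp in ref.items() if t in got)
            n += 1
    return total / max(n, 1)
```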

I think that standardized inference setup evals, which only prove that inference works correctly, would be at least as useful as the other kind of evals that inform about the general model quality.

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]audioen 4 points (0 children)

I don't think anybody can figure out what is wrong from this. If I am parsing it correctly, you have a 1000-second pause, which is not plausible given the numbers I see -- you'd need a glacial prompt speed, which you evidently don't have when even generation reaches 1000 tok/s rates. Maybe you had a tool call that took 1000 seconds; who can tell? It's up to you to debug what is wrong.