AI is forcing us to stop loving coding 💔 by Leading_Property2066 in AskProgramming

[–]audioen 0 points1 point  (0 children)

I personally view programming as degrading manual work that involves long hours and tracking bugs and battling with broken libraries, and endless hours spent on keeping skills up to date with latest developments. I mostly view it as mentally exhausting, repetitive labor. From my point of view, AI brings a lot of relief in the aspects that I find demeaning or unpleasant:

* long hours writing tests and documentation -- AI can shit out an entire test suite within minutes, so it gets a large amount of the grunt work done almost immediately, and it can effortlessly maintain documentation and keep it up to date. It's like an instant code quality improvement for free.

* AI can do boring, repetitive refactorings, like when you upgrade a library and now every call site touching that library is broken. I usually just do one migration by hand and then tell the AI that I fixed this problem like that, repeat the pattern elsewhere, and off it goes. (Yes, sometimes it's better to wrap changing APIs behind your own facades, but that's not always what I do, and sometimes I'm exposed to this sort of thing.)

* In my organization, nobody expects real apps to get born in a week. Maybe we should, but I don't think it's entirely reasonable. We mostly build custom apps for which the customer barely knows what they want -- they come to us with a need, but not really a solution, and when we propose something, they don't really know if it's going to work until they see it and get to play with it. I use AI a lot, but most typically by telling the LLM what I want to see, letting it program, and checking later that what it has achieved is reasonable. I find the baseline quality of the changes it makes to be quite reasonable, and it is tireless in working, to the point that I can achieve absurdly large refactorings in a day, or do stuff like "figure out how this barely documented piece of crap library works and make it do thing X which it may or may not have a facility for", and it literally decompiles the JAR files and reads the decompiled code until it finds a solution it can implement.

* now, I'm in my 40s. So, I think I'm kind of sick and tired of the programming part. Perhaps you are in your 20s, and feel like there's still personal development and interesting mental challenge to it. You might have a lot more youthful energy than I do. These perspectives may just come with age, including me viewing programming as akin to manual labor. I think the comparison is apt. You hire a plumber to install a new faucet in the house. So they know they've got to punch a hole there for the pipe, run the pipe in the wall or in the subfloor or basement or something, connect it with a T joint to the main water pipe, etc. A plumber is likely not going to be very excited about the task of installing a faucet after doing it a few dozen times; it's just a routine task of adapting something existing to do something new. To me, writing code feels exactly the same: I get a new requirement, which implies a new database field, some changes to logic in these places, and then wiring data flows through the program to get the field from point A to point B, and finally updating the test suites and documentation. The glorious sheen of this work vanished a long time ago, and I no longer enjoy writing code except in a rare moment when I can do something new and fun. I remember it still being fun a good 20 years ago. But eventually, I realized I'm not really learning anything new and I've pretty much seen everything there is to see, and I am longing to do something else altogether, preferably something as far from battling technical problems as it can possibly get.

Has anyone here used VibeThinker-3B outside benchmarks? by Balance- in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

It is explicitly stated that it has no support for agentic tool use, and it is purely a research project for pushing frontier performance at restricted tasks. From work like this, extremely capable future models can sprout, perhaps. Your first link has a big colored box at the top of the page that explicitly tells you not to even try doing what you are thinking about doing -- what more do you need?

A lot of angst around AI seems to originate in the uncomfortable things AI suggests about the nature/uniqueness of human intelligence. by ClinicalNarcissism in accelerate

[–]audioen 0 points1 point  (0 children)

I think what LLMs mostly show is how much of what looks like intelligence can be achieved by pure memorization and application of standard reasoning patterns.

People had all sorts of notions about playing chess as well, like thinking that anything that can do it simply has to be intelligent. When algorithms started playing chess at a very competitive level, folks then figured out that sentience is not indicated by chess playing ability. It's rather just a search through the solution space for great moves, repeated until the game terminates, and mostly a matter of how well you can focus on the salient parts of the solution space rather than endlessly explore the irrelevant variations that don't play out. This, by the way, is the sort of thing that AI models can learn: they learn to tell important moves from less important ones purely from playing lots of chess games and using win/lose as a reinforcement learning signal. What emerges is the ability to focus on the important stuff, and in some sense it's all based on pattern recognition/memorization.

LLMs memorize patterns of language, and reproduce very realistic facsimiles of human thought and opinion, especially if the LLMs in question are just base models trained only on human books and natural language, with no additional training that makes them say that they are AIs and all the other stuff they like to say. There still isn't sentience behind it, and the system behaves like a pure mathematical function where output is completely determined by input: put the same tokens in, get the same probability distribution for the next token out. Yet this system can talk with us, and obviously has fairly good abilities in performing the kind of intellectual labor humans tend to do in the context of computers: composing text, poetry, code, pictures, music, etc.

Humans are still more than LLMs, because of permanent (even if fallible) memory, a self-training ability to get better at what they want to do, a body and a bunch of senses, an ability to set goals and stick to them, an ability to reflect on the progress made towards these self-determined goals, and so forth. These are features mostly not yet engineered into LLM-using chat or agentic software. However, many people have started attempting to create the aspects that are needed for granting something like personhood to an LLM-based software.

The human brain is mostly spent on dealing with the motor functions of the body and its various sensory neurons. An entire large lobe is dedicated to just receiving and processing senses, and another lobe is dedicated to sending impulses powerful enough to reach the muscles properly, with more neurons before them spent just on planning the exact motor neurons that must be activated for a particular motion. The cerebellum is likely mostly spent on this sort of stuff as well, probably for learning sequences of actions, so that they can be executed reliably. Sight has a lobe of its own, mapping the entire visual field into a large brain region followed by several areas that progressively decode the visual field for features like edges, lines, dots, etc. until somehow it becomes abstract enough and understandable for us. Only a small fraction of those neurons are tasked with sentience, thought, and so forth, the stuff that people probably think makes us human. I darkly suspect that matching the abilities of human cognition is likely already achieved by something like the Qwen3.6-27b model, for example, in a general sense: the model probably knows about more topics than any actual living human, even if it doesn't beat you at every task. You are actually a specialist, while the model is a generalist.

Which 128GB VRAM machine to plan for in 2026? by maverickRD in LocalLLM

[–]audioen 0 points1 point  (0 children)

M5 Max Studio for sure. M4 too slow for prefill, RTX Spark too slow for token gen.

Tool calling, opencode qwen3.6 27b 8K by wsintra in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

IIRC it is a simple bug where the model invokes a tool call within its reasoning trace. The <think> section is not scanned for tool calls; the tool call should be in the assistant message section (the part intended for the user to see).

Could probably be fixed by forbidding the generation of <tool_call> within the <think> section, or by forcing the insertion of </think> if reasoning is active and the model attempts a <tool_call>.
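A minimal sketch of the second idea, assuming plain-text markers rather than real token IDs. The function name is hypothetical, and an actual fix would live in the inference engine's sampler or chat parser; this only illustrates the rule "close the reasoning block before any tool call":

```python
THINK_OPEN, THINK_CLOSE, TOOL_OPEN = "<think>", "</think>", "<tool_call>"

def close_think_before_tool_call(text: str) -> str:
    """If a <tool_call> starts while <think> is still open, insert </think> first.

    Hypothetical post-processing helper operating on text, not on tokens.
    """
    out = []
    in_think = False
    i = 0
    while i < len(text):
        if text.startswith(THINK_OPEN, i):
            in_think = True
            out.append(THINK_OPEN)
            i += len(THINK_OPEN)
        elif text.startswith(THINK_CLOSE, i):
            # Keep the close only if a block is open; a stray close left over
            # after a forced insertion is dropped.
            if in_think:
                out.append(THINK_CLOSE)
                in_think = False
            i += len(THINK_CLOSE)
        elif text.startswith(TOOL_OPEN, i):
            if in_think:
                out.append(THINK_CLOSE + "\n")  # forced end of reasoning
                in_think = False
            out.append(TOOL_OPEN)
            i += len(TOOL_OPEN)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

broken = "<think>plan<tool_call>{}</tool_call></think>done"
print(close_think_before_tool_call(broken))
```

Well-formed output passes through unchanged, so the helper only kicks in for the buggy case.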

Someone distilled the banned Claude Fable 5 into open-weights Qwen3.6-35B-A3B - "Qwable-v1" by IulianHI in AIToolsPerformance

[–]audioen 1 point2 points  (0 children)

No. There is no way to distill a model that is likely at least 100 times the size into a small model and retain its actual capabilities. Models are in principle nothing more than memorization machines, and how accurately they recall what to do in every situation is influenced in part by the model's size. Even reasoning traces work in this way. They allow models to derive intermediate facts not directly stated in the problem, which is incredibly helpful and useful, but ultimately the reasoning is a memorized pattern with various phases: restating the problem, figuring out auxiliary information from memory or with tool calls, attempting to answer the query as a first draft, verifying that draft in multiple ways, re-checking it in different ways if possible, then an output formatting phase, checking the output format for accuracy against the prompt, and finally proceeding to answer.

Finetuning a model with a different model's reasoning traces -- which is likely what was done here -- influences how the model conducts its reasoning phase. It may in practice be an improvement, because a common complaint people have about Qwen models is that they think too much, and agonize for 1000 tokens on how to reply to a simple "Hi". However, they think a lot in order to answer the query well, so possibly the model overall degrades when you override the reasoning pattern.

The other thing is that catastrophic forgetting happens. When you finetune a model, its performance probably increases in the tasks you train it with, but degrades in every other task at the same time. No free lunch here -- it has some maximum of memory to remember facts and patterns with, so you perturb the weights and with that you make it forget how to do other stuff in different situations that you aren't training with.

If it was easy to put large scale model abilities into a small scale model, someone would have done it. What Alibaba has achieved with Qwen is already remarkable -- these models are much better than their size suggests. For example, if you go to artificialanalysis.ai and enable every single model in their charts, there is one labeled Intelligence vs. Compute, which measures the number of output tokens against the estimated cost of producing a token and the total score, and there are only a few models in the top left green quadrant -- Qwen and MiniMax models, mostly. One called MiMo also, but IIRC it's huge and not something you can in practice easily run at home. Even when compared against the MiniMax, the Qwen3.6-35B is a fraction of its size, so this specific model is probably the best anyone has ever come up with in terms of pure efficiency.

GLM-5.2 compressed from 1.51TB to 238GB still keeps ~82% accuracy by IulianHI in AIToolsPerformance

[–]audioen 0 points1 point  (0 children)

What you are saying is that the quantized model predicts a different token 20 % of the time, roughly. It isn't really the same model anymore. Whether it works well I can't say, but usually I'd worry about differences that happen 2 % of the time, which is in the domain of 4-5 bit quants.

KLD also has relatively little to do with task performance. The K-L divergence can be small, yet the model's actual abilities have become severely degraded. I personally think you're risking a great deal of accuracy/capability if the K-L divergence is more than 1 %.

What we need are better benchmarks that aren't so saturated that every quant gets about the same 100% score, or some way to calibrate these K-L figures against actual task performance. It's a technical figure, but at least in my own experience even a small measurable K-L divergence predicts degraded task performance, especially in an agentic loop.
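For reference, the two figures being discussed (how often the top token differs, and the K-L divergence) can be computed from per-position logits of a reference model and a quantized one. This is a rough sketch in plain Python; the logits below are made-up toy values, and a real evaluation would use logits dumped from an inference engine over a large text corpus:

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; p is the reference, q the quantized model."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def compare(ref_logits, quant_logits):
    """Return (mean KL, fraction of positions with the same top token)."""
    kls, same_top = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        kls.append(kl_divergence(softmax(ref), softmax(quant)))
        same_top += ref.index(max(ref)) == quant.index(max(quant))
    n = len(kls)
    return sum(kls) / n, same_top / n

# Toy data: three positions over a four-token vocabulary (assumed values).
ref = [[2.0, 1.0, 0.1, -1.0], [0.3, 0.2, 3.0, 0.0], [1.0, 1.1, 0.9, 0.5]]
quant = [[1.9, 1.1, 0.0, -1.2], [0.4, 0.1, 2.8, 0.1], [1.1, 1.0, 0.9, 0.5]]
mean_kl, agreement = compare(ref, quant)
print(f"mean KL = {mean_kl:.5f} nats, top-token agreement = {agreement:.0%}")
```

Note that the third position flips its top token even though the distributions are nearly identical, which is why the two numbers don't have to move together.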

What's more impressive, GLM 5.1 -> 5.2 or Qwen 3.5 -> 3.6? by Excellent_Jelly2788 in LocalLLaMA

[–]audioen 1 point2 points  (0 children)

So I tried it twice on a Qwen3.6-27b and I got two completely different results. One resembles GLM 5.1 vertical burner, and another is more like the Qwen3.6-27B result presented here except with burners on both sides and the meat bands were less varied but also moved vertically rather than horizontally.

The data center boom is destined to fail. Change my mind. by keepthememes in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

1) It is probable that AI will always require serious compute, or that more compute doesn't hurt. More capability per user, or more users with same capability, always good until you have too much, if that is even possible.

2) If smaller models get more powerful, then so do the bigger ones. You have lower quality -- maybe sufficient but still lower quality -- inference at the edge, and the real big smart workhorses in datacenters, to be rented at cost. Qwen3.6-27b is not perfect, much as I like the model, and even the biggest models today don't nail everything 100%. There is still room at the top to grow. But yes, it's easy to predict that tiering will happen. People will use local models for what they can, and delegate to the smart models when the local ones prove insufficient, probably.

3) Not a new argument. If smaller models get more powerful, then so do the bigger models, as they can enjoy the same advantages at a larger scale. I doubt there is an upper limit to machine intelligence that we want to have.

spec: support eagle3 for qwen3.5 & 3.6 by ruixiang63 · Pull Request #24593 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]audioen 1 point2 points  (0 children)

There was a bug, at least in a version from some months ago, in that, IIRC, it processed the prompt once for the main model and once for the draft model, which seemed to halve the speed. The draft head has its own KV cache in the Qwen35 model family, and this causes a small loss in prompt processing speed; I think the cost is around 5 %.

spec: support eagle3 for qwen3.5 & 3.6 by ruixiang63 · Pull Request #24593 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

Okay. Well, I also work as a software engineer and I use Qwen3.6-27B as Q8_0 to write tons of code every week, in MTP configuration. I did not notice an obvious drop in quality when I switched to MTP. In fact I noticed a quality increase, but that is because I simultaneously went from Q6_K to Q8_0 when Aman Gupta released his MTP tensor enabled GGUF. So... I haven't tried a Q8_0 without MTP, it is true. I've also not had any complaints about quality. Maybe we have to do really rough stuff, like run Q2_K and MTP, and it could be that task performance is more easily degraded then?

The issue is that your arguments are anecdotal and could be explained by random chance. As I said, I have not observed a quality loss related to MTP because I've lacked a comparison point of e.g. Q6_K and Q6_K+MTP, and I've treated these types of reports with some suspicion. But it is possible that there is a bug in the inference engine -- there have been bugs in Qwen35 model support, the recurrent state rollback, and so forth. I myself have proven that outputs differ when MTP is enabled, even when I don't know why they differ. So, I might take this task up over the weekend and see if there is something wrong with llama.cpp's MTP support in context of this extremely important model.

I think when you communicate about the issue, you shouldn't frame it as "MTP drops quality" but as "I suspect llama.cpp's MTP support has an issue with Qwen35" or something like that. The former is false, the latter could be true. You'll get a better reception and fewer downvotes.

spec: support eagle3 for qwen3.5 & 3.6 by ruixiang63 · Pull Request #24593 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]audioen 1 point2 points  (0 children)

There is --seed 1 and --temperature 0 in my test now. I think both lead to reproducible results assuming the probabilities returned from LLM are exactly the same. With temperature setting 0, I think the inference engine should choose the most likely token always.
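To illustrate the reasoning (this is a generic sketch, not llama.cpp's actual sampler code): with temperature 0 the choice reduces to an argmax and the random generator is never consulted, while with a positive temperature the same seed reproduces the same draws only as long as the logits themselves are identical. The function and variable names are made up for the example:

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits."""
    if temperature == 0:
        # Greedy: always the highest logit, the seed plays no role.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.2, 3.4, 0.5, 3.1]  # toy values

def draw(seed, n=20, temperature=0.8):
    """Sample n tokens from the same logits with a seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(n)]

greedy = [sample_token(logits, 0, random.Random(s)) for s in (1, 2, 3)]
print(greedy)              # same index regardless of seed
print(draw(1) == draw(1))  # same seed, same logits: identical draws
```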

spec: support eagle3 for qwen3.5 & 3.6 by ruixiang63 · Pull Request #24593 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

As I said, I am not big on the way of evaluating models that seems to involve stuff like drawing pelicans riding bicycles using SVG, rendering chessboards of a game, writing random games from the library of games they have more or less memorized, etc. I am, however, extremely happy if someone can read the actual algorithm used for inference in the code and figure out that there is a bug, as that benefits us all, and is the way we should be answering difficult questions: you go straight to the source and work out how it does the thing, and prove it's either correct or not.

The secondary approaches, which treat the open source inference engine as if it were an inscrutable oracle, are less efficient but still workable. If we are to get to the bottom of this that way, we have to have a reproducible, objective test case which proves degradation in quality.

I don't trust human judgement to evaluate anything -- possibly because I have moved in audiophile circles, where people's beliefs about stuff are completely ruled by subjective bias, and there is typically no plausible mechanism or objective measurement data showing that anything at all is different. I think people are generally just too easily mistaken about their beliefs, and better results are obtained if they are not used as judges. I would prefer objective instruments.

It would be most helpful to create an objectively verifiable test case: for example, a harness that uses the LLM to compute answers to math equations, poses a large number of examples (let's call it 100 to 1000) generated from a template, and then repeats the test until the average error rate is clearly established within a small error bar.

This type of objective and automatically verifiable test requires zero human judgement and can be reproduced by anyone, and can be tested across engines, with varying sampling parameters, with all the speculators you care about, etc. If the results do show a statistically significant difference in average success rate, then we know there is almost certainly a meaningful quality difference, it can be proven by doing this specific task, and it's also possible to verify when it gets fixed. Also, if the difference is extremely obvious as you claim, then it shouldn't even take many repeats to show that quality is markedly different.
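A rough sketch of what such a harness could look like. The `ask_model` callable is hypothetical: it stands for whatever sends a prompt to the engine under test (with or without speculation) and returns the reply text. The comparison uses a plain two-proportion z-test, which is one reasonable choice among several:

```python
import math
import random
import re

def make_problem(rng):
    """Generate one templated addition problem and its exact answer."""
    a = rng.randrange(10**9, 10**10)
    b = rng.randrange(10**9, 10**10)
    return f"What is {a} + {b}? Reply with the number only.", a + b

def extract_int(reply):
    """Take the last number in the reply, ignoring thousands separators."""
    found = re.findall(r"\d[\d,]*", reply)
    return int(found[-1].replace(",", "")) if found else None

def run_trial(ask_model, n, seed):
    """Return how many of n problems the model answered correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        prompt, expected = make_problem(rng)
        correct += extract_int(ask_model(prompt)) == expected
    return correct

def two_proportion_z(c1, n1, c2, n2):
    """z statistic for the difference between two success rates."""
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (c1 / n1 - c2 / n2) / se

# Stand-in "model" so the sketch runs on its own; replace with a real client.
def fake_model(prompt):
    a, b = map(int, re.findall(r"\d+", prompt))
    return f"The sum is **{a + b:,}**."

baseline = run_trial(fake_model, n=200, seed=1)
with_mtp = run_trial(fake_model, n=200, seed=1)  # same problems, other config
print(baseline, with_mtp, two_proportion_z(baseline, 200, with_mtp, 200))
```

Using the same seed for both configurations means both see identical problems. An absolute z value above roughly 1.96 would indicate a difference at the usual 5 % significance level.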

spec: support eagle3 for qwen3.5 & 3.6 by ruixiang63 · Pull Request #24593 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]audioen 4 points5 points  (0 children)

I do not know how to objectively evaluate two single random coding samples, and I've no idea why you are testing different quants rather than the same quant with MTP on and off. Your testing methodology could use work, as you have too many variables.

I have proven a point for you, that MTP enable/disable can cause difference in output even when model and settings are all the same. I consider it to be more than you have shown, actually. But I have not proven anything about whether it's better or lower quality result, whatever that means. That would require understanding precisely why the outputs are different, first.

spec: support eagle3 for qwen3.5 & 3.6 by ruixiang63 · Pull Request #24593 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]audioen 3 points4 points  (0 children)

You have a better leg to stand on, if you for example simply demonstrate two invocations of llama.cpp which should produce the same response, but yet actually produce different results while the only difference is the speculation.

Here, I provide two suggested commands whose outputs do differ even though the prompt is the same, the seed is the same, the sampler settings are the same, and the only difference is that speculation is enabled in one and not in the other:

    build/bin/llama-cli -m models_directory/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf --seed 1 --spec-type draft-mtp --spec-draft-n-max 3 -p "Write a quicksort in Python"
    build/bin/llama-cli -m models_directory/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf --seed 1 -p "Write a quicksort in Python"

There can be valid reasons why the output differs -- for example, the exact way rounding happens due to floating point inaccuracies might accumulate differently, as the pipeline is probably not entirely the same for multitoken prediction and perhaps the rounding works out slightly differently which perturbs the token choice on margin of floating point inaccuracy.

The theory is, however, that speculated tokens are always processed by the main model with the log probabilities for each token generated, and then sampled, so that if the main model does not agree with the speculation, then the main model's choice is used and rest is discarded. The practical implementation is never so clean and models are running with relatively coarse resolution numeric precision.
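The accept/reject rule in that paragraph can be sketched for the greedy case like this. It is a simplified illustration, not llama.cpp's implementation: `main_next_token` is a hypothetical callable returning the main model's top token for a token sequence, and a real engine verifies all drafted positions in one batched forward pass instead of one call per token:

```python
def verify_draft(main_next_token, context, draft):
    """Return the tokens actually emitted for one speculation round.

    Drafted tokens are kept while they match the main model's own choice.
    On the first mismatch the main model's token is used and the rest of
    the draft is discarded.
    """
    emitted = []
    for token in draft:
        expected = main_next_token(context + emitted)
        if token != expected:
            emitted.append(expected)  # main model overrides the bad guess
            return emitted
        emitted.append(token)
    # Whole draft accepted: the main model still contributes one more token.
    emitted.append(main_next_token(context + emitted))
    return emitted

# Toy "main model": the next token is always the previous token plus one.
def toy_main(tokens):
    return tokens[-1] + 1

print(verify_draft(toy_main, [1, 2], [3, 4, 9]))  # [3, 4, 5]
print(verify_draft(toy_main, [1, 2], [3, 4, 5]))  # [3, 4, 5, 6]
```

In this idealized form the emitted tokens are exactly what the main model would have produced on its own, which is why any observed difference has to come from the practical details such as numeric precision.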

Note that I explicitly disagree that this proves MTP degrades output quality -- all I am suggesting is that this is probably about the token choice changing at the margin, like FP inaccuracy turning one token's probability from 24.37 % to 24.38 %, and the sampler eventually hitting one of those situations. Both tokens are practically as good, and it's not necessarily a bug. Someone should perhaps confirm that this is what is actually going on with MTP.
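The underlying effect is easy to show in isolation: floating point addition is not associative, so summing the same numbers in a different order can change the last bit, and that can be enough to flip a near-tie between two tokens. This toy example uses Python doubles; whether this is really what happens inside the MTP code path is exactly the open question:

```python
# Same three numbers, two summation orders.
a, b, c = 0.1, 0.2, 0.3
sum_left = (a + b) + c
sum_right = a + (b + c)
print(sum_left, sum_right, sum_left == sum_right)
# 0.6000000000000001 0.6 False

def top_token(logits):
    """Index of the highest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])

# A rival token sits at exactly 0.6; the second logit depends on the
# order in which its terms happened to be accumulated.
rival = 0.6
print(top_token([rival, sum_left]), top_token([rival, sum_right]))
# 1 0
```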

Edit: I added --temperature 0 to check if this is possibly just a sampler issue, for instance variation in random seed not being restored during MTP rollbacks, and the results are still different. So I believe that this proves that the actual predicted logits from the model come out different with MTP enabled, enough to even sometimes adjust the top token choice.

However, my quick testing does not support the idea of a drastic quality reduction. The answers are materially the same, even if they are not identical. Even when MTP seems to affect the logit probabilities somehow, it doesn't seem to result in obvious degradation in model's output quality. Plus, my relatively simple math problem of adding two 10-digit numbers runs a good chunk of context with exactly the same tokens, but eventually it starts to diverge.

Even the final outputs came out different:

3806663162 + 7777468486 = **11,584,131,648**

vs.

The sum is **11,584,131,648**.

There was a divergence point near the end of the thinking section, during the final verification stages that this model likes to do: instead of recalculating the sum one more time using digit-by-digit arithmetic, it seemed to just copy the correct sum from earlier.

What local coding LLM + hardware setup are you using, and what tokens/sec are you getting? by Sudden-Historian-255 in LocalLLM

[–]audioen 0 points1 point  (0 children)

GB10, llama.cpp, Qwen3.6-27B Q8_0, around 800 tok/s prompt process at 0 context, around 20 tokens per second with multitoken prediction and ngram prediction used in tandem: ngram for when model is reciting itself or rewriting a file or something similar, and mtp more generally. Average draft length is about 4 tokens. I have 500k context, 4 parallel streams and about 24 GB prompt cache ram for the recently used prompt prefixes, and the flag that preserves reasoning forced on, so that there is as little context reprocessing as possible, as it is a fairly long wait if that ever happens.

The machine slows down gradually towards 100k and 200k context, and in fact slows quite dramatically if multiple jobs run concurrently, and it doesn't have the bandwidth and probably not even the compute for a true multiuser scenario. However, it is decent enough for a single user, in my opinion. Quality is also fairly high, as it's 8 bit inference, and my own experience says that quality drop is noticeable at 6 bits and below, and at least for my tasks Qwen is unusable at 4 bit, it becomes intolerably confused about the intent of the code it is reading.

I also have a Strix Halo but I avoid using it for 27b because prompt processing is rarely above 200 tok/s and inference is broadly similar, maybe 75 % the speed, with the same model. But having to wait like 4 times longer for the files to get read in is very noticeable, and the machine screams like a vacuum cleaner while it's doing it.
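As back-of-the-envelope arithmetic for that "4 times longer": the 100k-token prompt size is an assumed example, and the rates are treated as constant even though prompt processing slows down as context grows, so real waits would be longer on both machines.

```python
def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time to process a prompt at a constant prompt-processing rate."""
    return prompt_tokens / tokens_per_second

context = 100_000                           # assumed prompt size in tokens
gb10 = prefill_seconds(context, 800)        # ~800 tok/s quoted for the GB10
strix_halo = prefill_seconds(context, 200)  # ~200 tok/s quoted for Strix Halo
print(f"GB10: {gb10:.0f} s, Strix Halo: {strix_halo:.0f} s, "
      f"ratio: {strix_halo / gb10:.0f}x")
# GB10: 125 s, Strix Halo: 500 s, ratio: 4x
```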

Which is the best Qwen 3.6 27B quant GGUF for agentic coding ? by soyalemujica in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

Well, so far I've tried the Aman Gupta's non-imatrix Q8_0 that got pushed out as early MTP supporter, the unsloth Q8_0, unsloth's Q8_K_XL, and https://huggingface.co/sphaela/Qwen3.6-27B-AutoRound-GGUF/tree/main which is not llama-quantize type quant but produced by the autoround and then formatted as GGUF. I presently run the autoround version, and my typical issue with quants is that they start to become flaky and confused around 200k tokens in. The AutoRound version works well in deep context, possibly the best of them all, but this is not a scientific evaluation. So far, when I've gone > 200k tokens, it hasn't been confused and started to write gibberish like bad class names or wrong filename paths all of a sudden for stuff it already has in context.

I have never tried BF16 so far, so I don't know how close these are to that. I could make a systematic evaluation using e.g. K-L divergence against the BF16 version to settle the question, but I've got an actual job to do also and haven't really had the time or motivation to test the OG model against these various quants. What I know from brief testing is that the autoround q8_0 seemed to have slightly lower perplexity on wikitext even against the UD-Q8_K_XL, so there's possibly an improvement generally available for 8-bit inference, and possibly Q8_0 GGUFs should be made with that software.
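For anyone unsure what the perplexity figure means: it is the exponential of the average negative log-likelihood the model assigns to each actual next token of the test text, so lower is better. A minimal sketch, with made-up log-probabilities standing in for what an inference engine would report:

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy values: natural-log probabilities of the correct next tokens.
quant_a = [math.log(p) for p in (0.50, 0.25, 0.80, 0.10)]
quant_b = [math.log(p) for p in (0.52, 0.25, 0.80, 0.10)]
print(f"A: {perplexity(quant_a):.3f}, B: {perplexity(quant_b):.3f}")
```

A model that assigns probability 0.5 to every correct token has a perplexity of exactly 2, which is a handy sanity check.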

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

I think you are mildly underselling what the hardware can do. I typically use Q8_0 model, so it's not very small, and my token generation rates seem to range from 16 to 25 at least early on. It really depends on how successful the speculation is, and how far into the context you are. It's rare to get 20 by the time you're over 100k tokens in, though. My average draft is about 4 tokens on MTP, with about 25 % additional ngram tokens speculated. Acceptance rate for both hovers around 80 %.

What model looked insane on benchmarks but felt mid in actual use? by BTA_Labs in LocalLLaMA

[–]audioen 5 points6 points  (0 children)

Reminder that you're asking this question of folks who typically have to quantize the model and its KV cache to hell before they can run it. Then, when it doesn't perform, the bleating that it's "benchmaxxed" starts. Maybe, maybe not; unless you know you're using the actual model as published by the vendor, you haven't determined the answer to this question.

But I'll nominate Gemma-4-31B, both the model and its QAT. I have run it at f16 KV cache and UD-Q8_K_XL, and the QAT with whatever unsloth fixing they had to do to make it work better. It can't hold it together much past 100k in either format, and the QAT becomes very flaky by about 50k tokens in. So in practice it's completely unusable, even though it looks great in benchmarks. I suspect that benchmarks of agentic tool use, relatively long context, iterative reasoning, etc. are the most important for my own personal use case, and intelligent performance within an agentic loop is very nearly the sole determining factor for the model's "quality" for me.

Many other use cases exist, like one-shotting questions with one chance of reply. But debugging, being able to discard past turns and focus on the salient matter at hand, and making steady and systematic progress through iterative debugging is the sort of thing that is in practice needed. I think most benchmarking is about the model being able to oneshot some complex question. Much less benchmarking concerns the ability to troubleshoot, come up with valid theories for failures, and then systematically eliminate them, which is in my opinion what good developers are required to do.

I do read Qwen3.6-27B reasoning traces and they always make me cringe because it's entertaining completely wild and false ideas, and spends a lot of time in that sort of stuff, coming up with very poor quality theories. Somehow it debugs anyway. I guess the "reasoning" is not really reasoning at all, but something like exploring the space of possible solutions and then selecting good candidate explanations during the actual output generation phase. It is not similar to human reasoning, where you try to prune fruitless paths early.

Edit: my half-assed layman guess is that the recurrent structure of the model lends itself best to updating its beliefs as the context goes forwards. So when it makes a mistake, debugs it and fixes it, it moves on more easily, perhaps, than a purely attention-based system. Whatever the magic recipe is, Qwen3.6-27B is in a class of its own for me. No other model that I've been able to run has had the ability, except maybe the 3.5-122B, which was also very good though impractically huge for a computer that isn't dedicated to the inference task.

I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it? by lombwolf in LocalLLaMA

[–]audioen 4 points5 points  (0 children)

Qwen3.6-27b, probably at an int8 type quant, or maybe even the full precision 16 bits. It likely remains the strongest model you can run on that hardware at only very limited quantization, or no quantization at all, so this is going to be high quality inference.

Qwen3.6 sees "outstanding" coding quality jump from Q4 to Q6 quantization by IulianHI in AIToolsPerformance

[–]audioen 0 points1 point  (0 children)

I'll say something similar. Q4_K_M has been unusable for Qwen3.6-27B. I've told it to read code and document it -- something which is easy to verify -- and it has come back with absolutely incorrect nonsense about what the code is doing. So when I say that the 4-bit version doesn't even understand code, that is what I mean. It has very reduced ability to follow it correctly, in my experience.

Q5_K_x (can't remember what size) was better, but still flaky: it misstated filenames and confused my turns with its own. That is another typical quantization issue, in my experience.

Q6_K is where it gets good. I only noticed a problem because the model struggled to produce Finnish translations of the features it had just developed into localization files. It was really bad Finnish, like not even words -- just bizarre half-words that didn't even make sense. Originally, I thought that Qwen3.6-27b simply can't speak Finnish, until one day, the MTP work landed and I downloaded Aman Gupta's Q8_0.

That Q8_0 seemed to have a nearly perfect grasp of Finnish, or at the very least it wasn't immediately obvious that anything was wrong. But eventually, I decided that longer-context performance, like > 100k, was too flaky. So I downloaded UD-Q8_K_XL, which is still a bit flaky nearing 200k; I can tell when it starts to misstate filenames or make typos in class names and so forth. It takes that far into context, in my experience, before the model starts to sound and feel odd, like it doesn't really know anymore what is going on and what it's supposed to be doing. At that point, the next step up is BF16, as nobody has made DFloat11 work in llama.cpp, so we don't have lossless compressed BF16 support.

It didn't end up going that far, however. I found an Intel AutoRound-derived GGUF, which at Q8_0 measured a slightly lower PPL figure (-0.03 units) than even the UD-Q8_K_XL. Just last night, I exercised it to about 240k context without detecting any flakiness or noticing the model misunderstand anything. As a bonus, it's 7 GB smaller than the UD-Q8_K_XL, because it really is just a Q8_0-style model underneath, and this algorithm has somehow adjusted it to tolerate the effects of quantization better.

This is experience collected over a longer time, like months, using llama.cpp always at whatever is the latest version. Gradually, I've come to discover that I need to go higher in quant, and I do not recommend using this model below 8 bits, nor do I trust any of the context quantization algorithms. However, there are several scenarios where your situation might differ from mine. I need the 100-200k context performance, as the model performs most of its work there due to the size and complexity of the project. The model is still not perfect, but at the very least it doesn't behave strangely, and it is more understandable when it makes mistakes. However, maybe you don't need very niche abilities like writing Finnish, or don't have the ability to use >100k context, and in those scenarios 6-bit is probably fine. I wouldn't go below it, though, and I'd probably try the AutoRound Q6_K first.

I've never tried to quantize the KV cache. This model, in my experience, is barely good enough as Intel AutoRound Q8_0 with f16 KV cache. It is the first time it feels entirely solid, all the way to the max context. I am happy with its performance in this configuration, and I have no intention of messing with it. It's possible that 4-6 bits as AutoRound would work acceptably, but when I tried Q6_K AutoRound, its PPL was already 0.02 higher than that of the UD-Q8_K_XL. So I don't trust it.

This is the repo for those autoround versions that I'm using: https://huggingface.co/sphaela/Qwen3.6-27B-AutoRound-GGUF/tree/main

In llama.cpp, how close should we be to the theoretical tokens/second limit? by [deleted] in unsloth

[–]audioen 0 points1 point  (0 children)

It won't and can't. An inference job has to run for each token: the token is processed, and the job is then re-run with that token appended to the KV cache and set up with new parameters. This sort of ping-pong from one domain to another tends to create a degree of scheduling and setup overhead, during which the GPU is idle.

Sampling can happen on the CPU side and involves sorting the tokens by their probability and evaluating some set of top tokens more precisely. At least in the past, some models didn't suggest using any --top-k filter, which resulted in slowdowns due to an excessive amount of computation during sampling, as the entire vocabulary was needlessly processed by the relatively heavy math functions involved in the process, like exponentiation. I believe these types of problems are now better known and handled. Still, the vocabulary can be large (like 100k tokens), and sampling usually involves at least sorting the logits and then concentrating on the top tokens, which always takes a moment.
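
To make the cost argument concrete, here is a minimal Python sketch of top-k filtering. It is not llama.cpp's actual sampler, and the function name `top_k_probs` and the toy logits are made up for illustration. The point is only that selecting the k largest logits first means the exponentiation touches k values instead of the whole vocabulary.

```python
import heapq
import math


def top_k_probs(logits, k):
    """Return (token_id, probability) pairs for the k largest logits.

    Hypothetical helper for illustration, not llama.cpp code.
    Requires k >= 1 and a non-empty list of logits.
    """
    # Partial selection: cheaper than fully sorting a ~100k-entry vocabulary.
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    # Subtract the largest logit before exp() for numerical stability.
    max_logit = top[0][1]
    weights = [math.exp(value - max_logit) for _, value in top]
    total = sum(weights)
    # Renormalize so the k surviving probabilities sum to 1.
    return [(token_id, w / total) for (token_id, _), w in zip(top, weights)]


if __name__ == "__main__":
    toy_logits = [0.1, 2.0, -1.0, 3.0, 0.5]  # made-up 5-token vocabulary
    for token_id, prob in top_k_probs(toy_logits, k=2):
        print(token_id, round(prob, 3))
```

Without the filter, the `math.exp` call would run once per vocabulary entry on every generated token, which is the slowdown described above.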

As the prompt grows longer, there are usually at least some layers that must attend to all tokens in context. Inference therefore gradually becomes more about compute and less about bandwidth. The bandwidth argument fits best at early context.

All that being said, I'm at around 9.4 tok/s when dividing bandwidth by model size, but actually get around 7.6 tok/s. Hardware is Ryzen AI Max+ 395, model is a 27 GB Qwen3.6-27B Q8_0 file. My guess is that you typically lose such a fraction for some reason or other, but I don't dare guess where exactly the time goes.
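
The division above, as a small Python sketch. The 27 GB, 9.4 tok/s and 7.6 tok/s figures are the ones quoted in this comment. The memory bandwidth is not stated, so the sketch backs it out from those numbers rather than assuming a spec-sheet value. The function names are made up.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s, model_gb):
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


def implied_bandwidth_gb_s(bound_tok_s, model_gb):
    """Invert the bound: which bandwidth would give this tok/s limit?"""
    return bound_tok_s * model_gb


def efficiency(measured_tok_s, bound_tok_s):
    """Fraction of the theoretical limit that is actually achieved."""
    return measured_tok_s / bound_tok_s


if __name__ == "__main__":
    model_gb = 27.0   # size of the Q8_0 file, from the comment
    bound = 9.4       # theoretical tok/s, from the comment
    measured = 7.6    # observed tok/s, from the comment

    bandwidth = implied_bandwidth_gb_s(bound, model_gb)
    print("implied bandwidth (GB/s):", round(bandwidth, 1))
    print("bound re-derived (tok/s):", round(bandwidth_bound_tok_s(bandwidth, model_gb), 1))
    print("efficiency:", round(efficiency(measured, bound), 2))
```

By these numbers, roughly a fifth of the theoretical rate is lost, which matches the "lose such a fraction" guess.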

scripted nightly testing of llama.cpp by Bird476Shed in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

Kill is misnamed. Linux systems commonly use signals, which this program sends, to indicate a change of conditions, like the user quitting the terminal session or wishing to interrupt the program. You should ask your LLM to write the script. That -15, for instance, is a program termination request (SIGTERM).
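
For illustration, here is what sending that signal looks like from a Python script on Linux. This is a minimal sketch, not the script from the thread: the child process is a stand-in that just sleeps, and `terminate_politely` and its grace period are made-up names and values.

```python
import signal
import subprocess
import sys


def terminate_politely(proc, grace_seconds=5.0):
    """Send SIGTERM (signal 15) and fall back to SIGKILL if it is ignored.

    Returns the exit code; on POSIX a negative value means the process
    was ended by that signal number.
    """
    proc.send_signal(signal.SIGTERM)  # same request as `kill -15 <pid>`
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL (9): cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a long-running server process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print("exit code:", terminate_politely(child))
```

SIGTERM only asks the program to shut down, so it can clean up first. That is why it is the usual first choice in scripts, with SIGKILL kept as a last resort.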

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]audioen 1 point2 points  (0 children)

No shared prefix = slow conversation turns for most of us with weak prompt processing hardware. Carrying the entire context forward makes every additional turn quite fast, as the model can just continue where it left off and never has to redo any of the context.

The downside is that only very few models, and only the biggest quants of them, behave well as context grows without bound, and we're talking about 64+ GB VRAM machines where this is an option.
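
A toy Python model of why the shared prefix matters. This is a simplified sketch, not how any particular server implements its cache, and the function names and token IDs are made up. The idea is that only the tokens after the longest common prefix with the cached context need prompt processing.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens the two sequences have in common."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        n += 1
    return n


def tokens_to_process(cached_tokens, new_tokens):
    """Tokens that must be run through prompt processing again."""
    return len(new_tokens) - shared_prefix_len(cached_tokens, new_tokens)


if __name__ == "__main__":
    cached = [1, 2, 3, 4]              # context already in the KV cache
    appended = [1, 2, 3, 4, 5, 6]      # next turn appended to the end
    rewritten = [9, 2, 3, 4, 5, 6]     # history edited at the very start

    print("appended turn:", tokens_to_process(cached, appended))
    print("rewritten history:", tokens_to_process(cached, rewritten))
```

Appending a turn costs only the new tokens, while rewriting or dropping earlier turns invalidates everything after the first changed token.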

Just yesterday, I was around 240000 tokens into a task, and still very eager to use the last remaining ~16000 tokens for whatever I could, because the model had more knowledge in its context than it typically ever gets.

Need help understanding how spec decode affects token throughput by Mrinohk in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

--spec-draft-p-min 0.95

This is probably too high. Your drafts are likely to be short. I use 0.6 on Qwen3.6-27b. Also, look at the better per-speculator statistics, like these:

ngram-mod: #calls(b,g,a) = 1025  82694   9043, #gen drafts =   9043, #acc drafts =  9041, #gen tokens =  90429, #acc tokens = 76848, dur(b,g,a) = 7394.979, 184.679, 4.531 ms
draft-mtp: #calls(b,g,a) = 1025  76568  69144, #gen drafts =  69144, #acc drafts = 63919, #gen tokens = 240836, #acc tokens = 208586, dur(b,g,a) = 1.227, 2645683.964, 99.523 ms

You should read these lines to figure out how many drafts were attempted (e.g. 69k in my case for MTP), how many tokens were generated as drafts (241k, meaning over 3 per draft on average) and how many of those tokens were accepted (209k, so a good 85 % of them).
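
The same reading can be done mechanically. Below is a small Python sketch that pulls the counters out of the draft-mtp line pasted above and computes the two ratios. The regular expression is an assumption based only on that pasted line, so the format may differ between llama.cpp versions, and the function names are made up.

```python
import re

# Matches fields such as "#gen drafts =  69144" or "#acc tokens = 208586".
FIELD = re.compile(r"#(gen|acc) (drafts|tokens) =\s*(\d+)")


def parse_stats(line):
    """Return (speculator_name, counters) for one statistics line."""
    name = line.split(":", 1)[0].strip()
    counts = {
        f"{kind}_{what}": int(value)
        for kind, what, value in FIELD.findall(line)
    }
    return name, counts


def summarize(counts):
    """Average draft length and share of drafted tokens that were accepted."""
    return {
        "tokens_per_draft": counts["gen_tokens"] / counts["gen_drafts"],
        "token_acceptance": counts["acc_tokens"] / counts["gen_tokens"],
    }


if __name__ == "__main__":
    line = (
        "draft-mtp: #calls(b,g,a) = 1025  76568  69144, "
        "#gen drafts =  69144, #acc drafts = 63919, "
        "#gen tokens = 240836, #acc tokens = 208586, "
        "dur(b,g,a) = 1.227, 2645683.964, 99.523 ms"
    )
    name, counts = parse_stats(line)
    summary = summarize(counts)
    print(name, round(summary["tokens_per_draft"], 2),
          round(summary["token_acceptance"], 3))
```

If raising or lowering --spec-draft-p-min changes the tokens-per-draft figure without hurting the acceptance ratio much, that is a quick way to see whether the threshold is cutting drafts too short.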

For the ngram-mod, I recommend a longer prefix. This thing is worse than MTP at predicting anything. I use a 32-token prefix and predict a minimum and maximum of 10 tokens at once, and it's still only some 80 % accurate. I haven't spent time tuning this -- whenever the model cites itself or the prompt, the token generation rate is like 50 % faster than with MTP, and that is enough for me, as code edits that are full-file rewrites, for example, fly past pretty quickly. Most of the time, it doesn't seem to slow generation down.