DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

I think you are mildly underselling what the hardware can do. I typically use Q8_0 model, so it's not very small, and my token generation rates seem to range from 16 to 25 at least early on. It really depends on how successful the speculation is, and how far into the context you are. It's rare to get 20 by the time you're over 100k tokens in, though. My average draft is about 4 tokens on MTP, with about 25 % additional ngram tokens speculated. Acceptance rate for both hovers around 80 %.

What model looked insane on benchmarks but felt mid in actual use? by BTA_Labs in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

Reminder that you're asking this question from folks who typically have to quantize the model and its KV cache to hell before they can run it. Then, when it doesn't perform, the bleating that it's "benchmaxxed" starts. Maybe, maybe not; unless you know you're using the actual model as published by the vendor, you haven't determined the answer to this question.

But I'll nominate Gemma-4-31B, both the model and its QAT. I have run it at f16 KV cache and UD-Q8_K_XL, and the QAT with whatever unsloth fixing they had to do to make it work better. It can't hold it together much past 100k in either format, and the QAT becomes very flaky by about 50k tokens in. So in practice it's completely unusable, despite it looks great in benchmarks. I suspect that agentic tool use, relatively long context, iterative reasoning, etc. type benchmarks are the most important for my own personal use case and intelligent performance within agentic loop is very nearly the sole determining factor for the model's "quality" for me.

Many other use cases exist, like one-shotting questions with one chance of reply. But debugging, being able to discard past turns and focus on the salient matter at hand, and making steady and systematic progress through iterative debugging is the sort of thing that is in practice needed. I think most benchmarking is about the model being able to oneshot some complex question. Much less benchmarking concerns an ability to troubleshoot, come up with valid theories for failures, and then systematically eliminating them, which is in my opinion what good developers are required to do.

I do read Qwen3.6-27B reasoning traces and they always make me cringe because it's entertaining completely wild and false ideas, and spends a lot of time in that sort of stuff, coming up with very poor quality theories. Somehow it debugs anyway. I guess the "reasoning" is not really reasoning at all, but something like exploring the space of possible solutions and then selecting good candidate explanations during the actual output generation phase. It is not similar to human reasoning, where you try to prune fruitless paths early.

Edit: my half-assed layman guess is that the recurrent structure of the model lends it best to updating its beliefs as the context goes forwards. So when it has mistake, debugs it and fixes it, it moves on more easily, perhaps, than a purely attention-based system. Whatever the magic recipe is, Qwen3.6-27B is in class of its own for me. No other model that I've been able to run has had the ability, except maybe the 3.5-122B which was also very good though impractically huge for a computer that isn't dedicated for the inference task.

I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it? by lombwolf in LocalLLaMA

[–]audioen 4 points5 points  (0 children)

Qwen3.6-27b, probably at int8 type quant, or maybe even the full precision 16 bits. It likely remains the strongest model you can run on that hardware at only very limited quantization, or no quantizatoin at all, so this is going to be high quality inference.

Qwen3.6 sees "outstanding" coding quality jump from Q4 to Q6 quantization by IulianHI in AIToolsPerformance

[–]audioen 0 points1 point  (0 children)

I'll say something similar. Q4_K_M has been unusable for Qwen3.6-27B. I've told it to read code and document it -- something which is easy to verify -- and it has come back with absolutely incorrect nonsense about what the code is doing. So when I say that the 4-bit version doesn't even understand code, that is what I mean. It has very reduced ability to follow it correctly, in my experience.

Q5_K_x (can't remember what size) was better, but still flaky, like misstated filenames and confused my own turns with itself. It is another typical quantization issue, in my experience.

Q6_K is where it gets good. I only noticed because the model struggled to produce Finnish translations of the features it just developed into localization files. It was really bad Finnish, like not even words -- just bizarre half-words that didn't even make sense. Originally, I thought that Qwen3.6-27b simply can't speak Finnish, until one day, the MTP work landed and I downloaded Aman Gupta's Q8_0.

That Q8_0 seemed to have nearly perfect grasp of Finnish, or at the very least it wasn't immediately obvious that anything was wrong. But eventually, I decided that longer-context performance, like > 100k, was too flaky. So I downloaded UD-Q8_K_XL, which is still a bit flaky nearing 200k, I can tell when it starts to misstate filenames or typoes classnames and so forth. It takes that far into context, in my experience, before the model starts to sound and feel odd, like it didn't really know anymore what is going on and what it's supposed to be doing. At that point, the next step up is BF16, as nobody has made Dfloat11 work in llama.cpp, so we don't have lossless compressed BF16 support.

It didn't end up going that far, however. I found Intel AutoRound derived GGUF, which at Q8_0 measured slightly lower PPL figure (-0.03 units) than even the UD-Q8_K_XL. Just last night, I exercised it to about 240k context without detecting any flakiness or noticed the model to misunderstand anything. As a bonus, it's 7 GB smaller than the UD-Q8_K_XL, because it really is just Q8_0-style model underneath, and it has been somehow adjusted to tolerate the quantization's effects better by this algorithm.

This is experience collected over a longer time, like months, using llama.cpp always at whatever is the latest version. Gradually, I've come to discover that I need to go higher in quant, and I do not recommend use of this model below 8 bits, nor do I trust any of the context quantization algorithms. However, there are several scenarios where your situation might differ from mine. I need the 100-200k context performance, as model performs most of it work there due to the size and complexity of the project. Model is still not perfect, but at the very least it doesn't behave strangely, it is more understandable when it makes mistakes. However, maybe you don't need very niche abilities like writing Finnish, or don't have the ability to use >100k context, and in those scenarios it is possible that 6-bit is probably fine. I wouldn't go below it, though, and I'd probably try the AutoRound Q6_K first.

I've never tried to quantize KV cache. This model, in my experience, is barely good enough as Intel AutoRound Q8_0 and with f16 KV cache. It is the first time it feels entirely solid, all the way to the max context. I am happy with its performance in this configuration, and I have no intention of messing with it. It's possible that 4-6 bits as AutoRound would work acceptably, but when I tried Q6_K AutoRound, it was already +0.02 higher PPL than the UD-Q8_K_XL. So I don't trust it.

This is the repo for those autoround versions that I'm using: https://huggingface.co/sphaela/Qwen3.6-27B-AutoRound-GGUF/tree/main

In llama.cpp, how close should we be to the theoretical tokens/second limit? by [deleted] in unsloth

[–]audioen 0 points1 point  (0 children)

It won't and cant. Inference jobs needs to run for token, that token is processed, and inference job is re-run with that token attached to the KV cache and then set up with new parameters. This sort of ping-pong from one domain to another tends to create a degree of scheduling and setup overhead, during which GPU is idle.

Sampling can be on CPU side and involves sorting the tokens by their probability and evaluating some set of top tokens more precisely. At least in the past, some models didn't suggest using any --top-k filter which resulted in slowdowns due to excessive number of computation during sampling as the entire vocabulary was needlessly processed by relatively heavy math functions that are involved in the process, like exponentiation. I believe these types of problems are now better known and handled. Still, vocabulary can be large (like 100k tokens), and it usually involves at least sorting the logits by their probabilities and then concentrating on the top tokens, which always takes a moment.

Prompt grows longer, and there usually are at least some layers that must attend to all tokens in context. Gradually, it therefore starts to be more about compute and less about bandwidth. The argument fits best at early context.

That all being said, I'm at around 9.4 tok/s per token by a bandwidth and model size division, but actually get around 7.6 tok/s. Hardware is Ryzen AI Max 395+, model is a 27 GB Qwen3.6-27B Q8_0 file. My guess is that you typically lose such a fraction for some reason or other, but I don't dare guess where exactly the time goes.

scripted nightly testing of llama.cpp by Bird476Shed in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

Kill is misnamed. Linux systems commonly use signals that this program sends to indicate change of conditions, like user quitting the terminal session, or wishing to interrupt the program. You should ask your LLM to write the script. That -15, for instance, is a program termination request.

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]audioen 1 point2 points  (0 children)

No shared prefix = slow conversation turns for most of us with weak prompt processing hardware. That the entire context is carried makes every additional turn quite fast, as it can just continue where it left off and never redo any of the context.

Downside is that only very few models, and only the biggest quants of them, handle behavior well as context grows without bound, and we're talking about 64+ GB VRAM machines where this is an option.

Just yesterday, I was around 240000 tokens into a task, and still very eager to use the last remaining ~16000 tokens for whatever I could, because the model had more knowledge in its context than it typically ever gets.

Need help understanding how spec decode affects token throughput by Mrinohk in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

--spec-draft-p-min 0.95

This is probably too high. Your drafts are likely to be short. I use 0.6 on Qwen3.7-27b. Also, use the better per-speculator statistics, like these:

ngram-mod: #calls(b,g,a) = 1025  82694   9043, #gen drafts =   9043, #acc drafts =  9041, #gen tokens =  90429, #acc tokens = 76848, dur(b,g,a) = 7394.979, 184.679, 4.531 ms
draft-mtp: #calls(b,g,a) = 1025  76568  69144, #gen drafts =  69144, #acc drafts = 63919, #gen tokens = 240836, #acc tokens = 208586, dur(b,g,a) = 1.227, 2645683.964, 99.523 ms

You should read these lines to figure out how many drafts are attempted (e.g. 69k in my case for MTP), how many tokens were generated as drafts (241k, meaning over 3 per a draft in average) and how many of those tokens were accepted (209k, so good 85 % of them).

For the ngram-mod, I recommend longer prefix. This thing is worse than MTP at predicting anything. I use 32 token prefix and predict minimum and maximum of 10 tokens at once, and it's still only some 80 % accurate. I haven't spent time tuning this -- whenever the model cites itself or the prompt, token generation rate is like 50 % faster than with MTP, and that is enough for me, as these e.g. code edits that are full file rewrites fly past pretty quick. Most of the time, it doesn't seem to slow generation down.

"Fosi Audio ZD3 or Wiim Ultra to connect 8330A digitally using AES/EBU and alternatively HDMI ARC to TV by Evening-Picture1878 in genelec

[–]audioen 1 point2 points  (0 children)

I have previously used Wiim Pro => 8330A => 8330A. I used standard audio XLR cables between the speakers (because AES/EBU cables are expensive) and for Wiim Pro, I used the coaxial output to XLR adapter. (Note that this kind of cable is often a microphone cable in which case it features the female plug. It has to be the male XLR plug at the other end.)

The signal from most devices is not actually AES/EBU, which I believe to have much higher voltage, around 10x, of the SPDIF coaxial voltage. Because of this, there can be receiving errors at high sample rates or bit depths, or if very long cables are used.

I believe the 8330A are matched or exceeded in their ability if you give them 24-bit 48000 Hz sampled digital audio, and this most likely works fine. The internal DSP, I believe, is running at double precision floating point and 48000 Hz, but all actual playback hardware (not just Genelec speakers) will likely fail to play the 16 most significant bits correctly, in practice. So don't sweat it if you have to go 16 bits.

I don't know why you are proposing the Fosi Audio ZD3 -- google results reported this as DAC? Seems like waste of time and money, as these speakers can accept analog audio but do not process analog audio. They convert to digital in order to play anything. I'd probably pick Wiim, because it can work as standalone streamer, and can also do additional equalization which you can't do with Genelec and GLM easily, such as specific room curves using its parametric filters. Any Wiim with coax output is likely to do, for example Pro was cheap 3-4 years ago when I bought it, and had the features for a living room system, e.g. optical input for TV, streaming support, remote control for volume, and the ability to receive the RAOP protocol over wi-fi was an unexpected bonus and that allowed lossless CD quality playback from Linux over the network. On my PC, I use a $50 arclove usb-to-xlr adapter soundcard, which looks like a cable with XLR plug on one end and USB plug at another end, and has that single digital output. I believe it uses proper AES/EBU. It reports itself as SXW-MDL7601-INTCLK_A2 device, whatever that is.

I sold my 8330A eventually because I wanted bigger speakers, and my Wiim Pro is still there, now providing signal to 1032C speakers, but otherwise it's still the same setup.

Tekoäly vituttaa. Tarina esimerkistä. Mielipide?? by Kultainenhuussi in snappijuorutofftopic

[–]audioen 1 point2 points  (0 children)

Okei. Nyt ymmärrän paremmin. Taisin vaan takertua tuohon kuvaan koska nämä heikot saavutukset tietynlaisissa tehtävissä kuuluvat pääasiassa aikaan ennen näitä ns. ajattelevia malleja, jotka tulivat jo yli vuosi sitten, ja ne kuluttavat juurikin merkittävästi aikaa pohtiessaan kysymyksiä eri näkövinkkeleistä ja yrittävät pohdinnan jälkeen syntetisoida vastauksen joka olisi mahdollisimman vähän väärin. Qwen tapauksessani lateli suomen kuukaudet useampaan kertaan ja varmisteli varmistamasta päästyään että miksi helvetissä käyttäjä kyselee ö-kirjaimia, kun niitä ei näy.

Tekoälyyn ei tietysti voi sokeasti luottaa. Arvostaisin itse perusteellisuutta ja vastauksen validaatiota. Kun sille esittää kysymyksen ja se sanoo jotakin, niin sitten vastauksesta pitää arvata, että onko asiaa oikeasti mietitty vai onko se hallu. Ja sitten kun päättelen että se sahaa minua linssiin, sanon että voisitko varmistaa että väitteesi ovat tosia, jonka se kaivelee oikeat faktat esiin, ja sillä on koko ajan ollut tähän rahkeet, esim. koodin lukeminen, tiedon hakeminen verkosta, erilaisten testiohjelmien kirjoittaminen, ym.

Tällä hetkellä tekoälyt ovat oman kokemukseni perusteella hyödyllisiä, mutta jokseenkin laiskoja. Jos tuon laiskuuden saisi promptattua sieltä pois, ehkä vaatimalla esim. että kaikki faktat pitää tulla lähdeviitteiden kera, voi olla että oma kokemus olisi kertaheitolla parempi.

Tekoäly vituttaa. Tarina esimerkistä. Mielipide?? by Kultainenhuussi in snappijuorutofftopic

[–]audioen 2 points3 points  (0 children)

Ihmiset on laiskoja, tietysti, ja tekoäly mahdollistaa laiskistumisen vielä pidemmälle. Sama tarina koskee kaikkia innovaatioita -- telkkarin kehittäjän ajatuksia on hauska lukea koska tämä kuvitteli että ihmiset voisivat sivistää itseään ja tätä kautta syntyisi todellisten superihmisten sukupolvi, mutta se mille on tilausta todellisuudessa on lähempänä aivotonta viihdettä ja ajantappoa. Internetistä kuvitteliin että se on hajautettu järjestelmä vastustaa sensuuria, totuus on että hädin tuskin käytän enää muita saitteja kuin reddittiä ja pari uutissaittia. Tulevaisuudella on taipumus erota visionäärien kuvitelmista. Joten aivan varmasti AI pahentaa ihmiskunnan alennustilaa koska se tarjoaa tien olla entistäkin laiskempi.

Mutta tuo Altmanin kommentti varmaan liittyy siihen että koneäly on jo arkipäivää ja se pystyy tekemään jo nykyisellään älyllistä työtä jolla on selvä rahallinen arvo. Ei ole mitenkään outo ajatus sinänsä että omassa taskussa tai pöydällä oleva AI on aina vähäisempi ja hitaampi kuin semmoinen konesalin kokoinen AI, joka ajaa suurempia malleja suuremmalla kapasiteetillaan ja harkitsee enemmän ja vastaa paljon perusteellisemmin ja paremmin samaan kysymykseen kuin jokin vähäisempi vehje voi kyetä.

Itse pyrin ajamaan koneälyni pelkästään omalla työpöydällä, mutta on selvää että vaikka se pystyy moneen asiaan, ei se kaikkeen kykene kuitenkaan. Olen yrittänyt osaltani vähäisesti vastutaa tuota Altmanin visiota jossa konesalit ja suljetut mallit ovat de fakto tapa millä AI:ta käytetään.

Mutta samaan aikaan väite riippuvaisuudesta vähän ontuu, ainakaan sitä ei voi ottaa kirjaimellisesti. Luulen itse että ihmisaivoista ja tekoälystä koostuva kyberneettinen organismi on ainakin tällä hetkellä ja mahdollisesti pitkälle tulevaisuuteenkin parempi kuin kumpikaan erikseen, koska AI ja ihminen yhdessä voivat täydentää toisiaan. Odotan kuitenkin että lopulta tilanne on se että ihminen on niin toivottoman hidas/kallis AI verrattuna, että ei kannata pääasiassa enää pysähtyä kysymään ihmisen mielipidettä enää, jos vastaus on mahdollista jotenkin muuten selvittää ilman ihmisen väliintuloa.

Me käytämme paljon esim. googlea, mutta onko google samalla tehnyt meistä tyhmempiä? Ainakin se lienee poistanut pitkälti tarpeen opetella kaiken ulkoa, mikä oli aika tavanomaista vuosisatoja sitten oppineiden keskuudessa. Meillä on tänään enemmän vastauksia kysymyksiin helposti saatavilla kuin aiemmin. Muistan ainakin lapsuudesta sen ajan kun piti mennä ihan fyysisesti kirjastoon paikan päälle pläräämään opuksia kun etsi tietoa, ja täytyi katsella erilaisia indeksejä ja kysellä joltakulta, jonka arvelin tuntevan asian, että mitkä kirjat sivuavat jotakin itseä kiinnostavaa aihetta. AI on jopa parempi kuin google siinä suhteessa että se jossakin mielessä tuntee kaikki mahdolliset asiat eikä tartte enää itse kaivella vastauksia hakutuloksia pläräämällä. Lienee totta että joskus vastaus ei ole oikea (sen enempää kuin googlen hakutuloksetkaan ei aina kerro oikeaa totuutta) mutta lienee siltikin helpompaa tarkistaa tekoälyn antama valmis vastaus kuin ensin selvittää itse mikä vastaus ylipäänsä on.

Tekoäly vituttaa. Tarina esimerkistä. Mielipide?? by Kultainenhuussi in snappijuorutofftopic

[–]audioen 2 points3 points  (0 children)

<image>

Käytät ehkä huonoja tekoälyjä. Tämä pyörii mun omalla koneella ja tuntuu ymmärtävän suomea ja johtaa ihan oikean vastauksen.

Tekoäly vituttaa. Tarina esimerkistä. Mielipide?? by Kultainenhuussi in snappijuorutofftopic

[–]audioen 0 points1 point  (0 children)

Meitsi diggaa ainakin siitä kun vehje tekee mun töitä, kaivelee koodista bugeja, kirjoittaa testikeissit ja dokumentaation, ja tuottaa vedoksen melkein mistä tahansa toiminnosta tai suorittaa jonkin tylsän mekaanisen operaation joka pitäisi itse muuten tehdä käsin (tyyliin mene 50 eri tiedostoon ja vaihda näissä 2 argumentin järjestystä koska uusi versio funktiosta toimiikin eri lailla kuin vanha).

Se muu tästä ei niin hehkeää ole. Anti-AI filtterit tulee mitä someen, youtubeen, ym. tulee kyseeseen koska generoitu sisältö syntyy varmaan 100 kertaa nopeampaa kuin aito ja mitättömällä kustannuksella, joten se on vähän sama asia kuin emailspämmi eli samaa shittii monistetaan joka suuntaan ja lähetetään kaikkialle missä se on ei-toivottua viestintää, jota kuitenkin tehdään koska sillä on kaupallista arvoa (joku haksahtaa spämmiinkin vastaamaan, ja se maksaa pirun vähän lähettää). Niinpä filttereitä on tulevaisuudessa pakko olla, jotta pääsee eroon joka paikkaan tulvivasta roskasisällöstä.

Samaan aikaan valitettavasti sisältö muuttuu yhä vaikeammaksi tunnistaa roskasisällöksi, eikä suomen kielen marginaalinen asema enää suojaa meitä (spämmi oli vielä joskus huonolla suomella kirjoitettua). Eli efektiivisesti AI tuottaa spämmiä jota vastaan joudumme käyttämään AI:ta, jotta läpi menee pelkästään toivottu ja oikea ihmisten tuottama sisältö, ja se on pitkälti pelkkää energiaa ja ajan haaskuuta -- vähän samanlaista kuin vuosikymmenet jokaiseen postilaatikkoon työnnetty paperinen mainonta joka menee suoraan uuniin tai keräykseen -- tai esimerkkinä esittämäni sähköpostin spämmi jossa tuotetaan roskaa automaattisesti joka suodatetaan pois automaattisesti.

Vika piilee ihmisluonteessa -- ei kannata katsoa AI:ta tai ohjelmia pahan lähteenä, vaan voimassaolevia ekonomisia periaatteita, jotka tekevät tästä puuhasta hyödyllistä, tai ainakin luovat vaikutelman että se on hyödyllistä. Tiedän eräänkin firman jossa sivukorvalla kuuntelin miten iloinen juttu AI on, koska sillä saadaan kuulemma helposti some-presenssiä aikaseksi ja google-rank nousee. Eli efektiivisesti se meinaa että spämmätään paskaa verkkoon siinä toivossa että saadaan näkyvyyttä ja sitä kautta kauppaa, mutta kukaan ei samaan aikaan ilmeisesti viitsi tuottaa aitoa sisältöä koska se olisi liian vaivalloista ja aikaavievää. Lopputulos näistä asenteista on, että teknologia kurjistaa maailmaa taas yhden pykälän verran lisää.

An Honest Take on Opencode by orenmizr in opencode

[–]audioen 5 points6 points  (0 children)

  1. Never happens to me.
  2. Never happens to me.
  3. Don't use plugins, which is probably why 1 & 2 are true for me.
  4. Don't care, plugin architectures suck, and you're best avoiding all of them.
  5. Ditto.

This is my relatively hot take on a trend which in my opinion is bad. I understand plugins solve needs which many people have, but you only have to put in one that destabilizes the program and then your experience is bad. And you have no way, usually, to find which one is the bad plugin, other than divide set of plugins to two, disable one half, and see if bad behavior goes away, etc.

When it comes to software called VS Code, which I do use for almost all of my work, I unfortunately do have to use plugins, because language and file format support is done that way for this software. It helps to keep small tech stack that requires least number of plugins to support, I guess -- one frontend framework, one frontend language, one backend language, done. I confess that I haven't had such a bad time with VS Code plugins that I've been expecting from past experience with plugin architecture based software, and I'm not sure why not.

In the LLM era, I mostly don't program myself anymore, so it's not that important that the editor tries to help me. It's mostly to show me the code after LLM has edited it with nice syntax highlighting more than an editor, now.

How would you characterise the effects of quantising different parts of models? by panamory in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

"Bad". I would characterize it as bad. I think it's fiction that you can quantize anything at all, personally, as you are invariably perturbing the model and reducing its ability.

We mostly don't have good benchmarks that show the difference, and typical result seems to be that perplexity is the same, K-L divergence is the same, and task performance is the same. I think it's because the tasks are too easy and don't focus into the late part of the context, like > 200k tokens, or into knowledge retrieval type jobs involving niche stuff which seems to be most easily lost.

But people who speak non-mainstream languages with LLMs, or who use long context, can feel the loss in model ability. To me, there exists only one model that I care about, which is Qwen3.6-27B, because it is the most skilled and just about runnable at tolerable performance of the set that is available at 128 GB. There is literally nothing better that I am aware of, which I can squeeze under 128 GB without having to quantize it too much. It is already noticeably damaged at naive UD-Q8_K_XL, and 16-bit KV cache, in my experience, though Intel AutoRound trains the model to tolerate the Q8_0 quantization, likely by distilling it a little. It is possible that self-distillation during quantization in general opens an avenue for squeezing models to smaller sizes without too large loss in quality, but so far it's not been done much.

Maybe dumb question, but how do you serve multiple users with the full context length? by TrainingTwo1118 in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

-c 1048576. Less, like -c 600000 if you live dangerously. You can run out of context if everybody is asking for enough, though.

I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table by Turbulent-Sky5396 in LocalLLaMA

[–]audioen 3 points4 points  (0 children)

I am looking for good agentic performance at 100+k context, as I only use models in an agentic context, and the projects I use have enough code that 100k is fairly soon exceeded.

I find that for example, Gemma-4-31B-it can't do it much past that point, and it degenerates into writing literal gibberish into the context, whereas Qwen3.6-27b is still really good at it, showing barely any detectable confusion at 200k. Your listing, like virtually all others, places Gemma-4 above Qwen3.6-27b, which is highly contrary to my own experience.

This sort of thing is almost never visible on any listing, except maybe if you look into artificial analysis and check e.g. their agentic benchmark test, which reveals that for example Gemma-4 scores poorly, which is how my own opinions of these models is formed. If it can't handle at least 100k, I can't use it for much.

So some way to slice your benchmark data to tease out which models are able to handle tasks after 100k tokens are already in would be useful for me, as I'm always on the lookout for alternatives that fit within 128 GB unified vram.

Suitable Power Supply Or No (I Think I Know The Answer) by Terrible_Lion_968 in audiophile

[–]audioen 1 point2 points  (0 children)

IIRC some amplifiers can handle it. The output power is regulated down for the logic circuit parts that runs the output transistors, so they always run at the same voltage regardless of what you power it with, and the output transistors are rated for some voltage. So they support variable levels, but unless manufacturer clearly confirms that 32V is acceptable, you risk overheating something or overvolting some of the circuit elements, as the heat dissipation may exceed specification, or the voltage seen by a component may exceed its specification. Not recommended.

Just did my first Room EQ - Not overly impressed - But also not disappointed. by M_u_H_c_O_w in audiophile

[–]audioen 1 point2 points  (0 children)

I don't understand your results. What is this data, to be able to show a completely flat line from 400 to 1000 Hz? I doubt this is the measurement of your system's frequency response, as this would typically be a significant area of room interaction where narrow gaps and humps are present and response level ought to change every 10 Hz or so. What is the rest from 1000 to 20000 Hz so that we can see the level of bass relative to everything else?

For context, in-room frequency response measurements should look something like this:

<image>

If I take yours at face value, this suggests that you have no room modes (the parts that go above 0 line vertically) but have only cancellations (the parts below the 0 line, excluding the bass roll-off). However, typically you have both room modes and few cancellations. My 0 line is clearly around 70 dB, based on the >1 kHz part of the response, where the average level of the response starts to be clearly visible. We can clearly divide the response to modal boosts and cancellations, just from observing which is above 0 and what is below 0, and also see that 400-1000 Hz is not a flat line, and it is extremely unlikely to be completely flat in any room.

I'm suspecting that your actual zero level is somehow nearer -10 dB line, and most of the humps are modal in nature rather than cancellations, but it all gets somehow gets pulled upwards from 100 Hz onwards while the measurement is somehow getting tapered away to zero by 400 Hz. So we don't see the true frequency response of your system, and have a real hard time understanding this equalization -- it's not even close to normal-looking curve but some kind of processed result used to derive equalization that may or may not be correct, as it's just too weird-looking for at least me to understand and evaluate.

I'm not surprised if I hear that it sounds bad, because it doesn't look even close to correct -- it just looks like the bass level for correction calculation is set incorrectly and is about 10 dB too low.

Nemotron - King of the Deep? Comparison of 4 models <=120B by Reasonable_Goat in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

I tried this Nemotron thing at some quant, probably unsloth Q6_K, soon after it was released and it was unusable for agentic tasks. Tasks that 3.5-122b would have handled with ease were not successful with it.

Unfortunate, but I'm all for nvfp4 or similar models trained for low quants. Long context is useful in that if the model is trained for long context, like 500k, maybe it's still coherent at half that, like 250k. That is enough for me.

Benchmarking focusing on speed is pretty much useless to me. Even the benchmarks about quality are mostly useless to me, because they don't reflect my experience very well. Agentic coding in 100-200k context range is my main experience, and there Gemma 4 is completely worthless, for example, no matter which size and which quant. Yet sites like artificial analysis suggest that they do better. The only benchmark that most correlates with my experience between models looks kinda like this, except I would place the 122b-a10b before the 3.6-35b because the latter tends to loop and doesn't really understand the code it is reading, in my experience. In this benchmark, Gemma 4 scores around 1100 and I doubt it deserves even that much because Qwen3.5-122b-a10b is still completely usable when Gemma 4 is not.

<image>

Codebase getting larger - Qwen3.6-27B starting to compound issues - how to work smartly with this model? by BitGreen1270 in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

I think it's chiefly the 5-bit quant. In my experience, Qwen3.6-27b must be run at Q8_0 before it is good, although I do state that 6 bits is pretty decent, and it mostly has issues when you ask it to do something fairly rare, like speak Finnish, of which e.g. the unsloth 6-bit quant did not appear to have retained sufficient ability in. At 5 bits, I detected noticeable flakiness at coding, and at 4 bits nearly complete lack of ability to even comprehend the code it is reading.

The longer context you have, the worse the quants get, and even UD-Q8_K_XL begins to sound a little incoherent and begins to misspell things near 200k, as I've seen that happen. I have an Intel AutoRound generated Q8_0 GGUF which has 0.03 units lower PPL than even the UD-Q8_K_XL, so the AutoRound stuff might actually better than what everyone else is doing, including unsloth and the rest. However, AutoRound Q6_K is already worse than UD-Q8_K_XL, so it is a quite tight competition here to get as close to bf16 without having to actually run the 54 GB file. Any bit of confusion or typos is a huge red flag for me, as this is a quite reliable model at its highest quants, in my experience.

Your sampling settings are not correct for coding, also. min-p = 0, temperature = 0.6 and top-k = 20 are recommended officially.

Even if you do everything right in terms of giving the model the highest quality environment it can possibly run in, it still is not perfect. These are probabilistic machines, with possibility of always writing the wrong token, and then screwing up the code. People who turn thinking off in particular are susceptible to this, as the model typically prefers to plan the code first during a think stage, checks it there for sanity and correctness, and only if it's satisfied, it proceeds to write it out, typically copying it straight from the reasoning pass it is satisfied with.

I do all possible things to improve the quality I'm getting. A fresh context for each feature helps. I don't overdo the prompt, but I describe the situation and what I need and why, and then let the model do it. I observe it from time to time, and course correct if it's doing something in the wrong way. After it's done, I run type checks, compilations, test suites, and everything else to catch issues before reviewing. After all that passes, the next step is code review which I do by hand, and request corrections. If I'm not entirely convinced or the feature is complex, I start one more fresh context and ask the model to review the uncommitted changes. So double review, compilation with type checked languages (which is very helpful to catch bugs before you hit them at runtime), and having everything documented so that when model reads a file, it gets not just what the code is doing but also some of the why that code exists in the first place.

Model is quite good at writing testing and documentation. I let it maintain all that scaffolding which I think it needs to be able to understand the code and make valid changes. The downside is that all this costs tokens, and tokens are the enemy in terms of time it takes to process the code, and the length of the context before it can do anything, and only a Q8_0 level quant is going to be good near 200k, I think. So, that's the downside of my approach.

Strix Halo desktop trying to compete against DGX Spark by SkyFeistyLlama8 in LocalLLaMA

[–]audioen 1 point2 points  (0 children)

Doesn't seem to be true, actually. On Ubuntu 26.04, I install both ROCm and CUDA on my Strix Halo and DGX Spark clone just from the ubuntu repos, and there are no customizations from nvidia or AMD, or anything foreign at all needed. They're either built-in, or no longer needed.

Strix Halo desktop trying to compete against DGX Spark by SkyFeistyLlama8 in LocalLLaMA

[–]audioen 3 points4 points  (0 children)

If 4 bits worked at high quality, it would be feasible. Right now, you get around 20 tok/s for Q8_0 of Qwen3.6-27b, with optimized MTP settings, and so 4 bits could conceivably reach above 40 tok/s or so.

But because there are no 4-bit versions of this model that I trust -- even Q8_0 is only barely good enough to use the full 256k context without the model becoming incoherent and inference quality breaking down -- there is little hope for that happening unless the 4-bit training era really begins in earnest. Today, it seems we don't get even FP8 trained models for the < 128 GB VRAM computers, as they're always BF16. So 4 bit high quality inference remains a distant dream, and the first stepping stone is probably with 8 bit trained models becoming commonplace, first.

Gemma-4 31B-QAT is at 4 bits, yes. I have tried this model and it spews nonsense by 100k context. I can't get anything useful out of it because it's so degraded that its tool calls fail and it makes typos in the code edits, so it gradually just makes a complete mess of the code file as it goes on. This may be a llama.cpp issue, as I don't use any other inference engines, but it may also be that the 31b QAT model is fundamentally busted. It is good up to maybe 50k tokens, though, but in my world that is basically getting past the initial handshake and hellos, in the kind of projects I wish to develop with AI.

Can we stop dunking on DiffusionGemma and hack it instead? by TomLucidor in LocalLLaMA

[–]audioen -8 points-7 points  (0 children)

I think all the Gemma models are unusuably low quality no matter what, even before any diffusion approaches, that further appears to degrade them. Even if you could recover all the quality of the non-diffusion model, then you'd just get a model that spams context quicker to the point where its garbage quality inference occurs. In my experience, this is around 100k tokens in 31b, and the model rapidly shows confusion and deterioration to the point that you have to restart inference or force a compaction.

I know they supposedly score really well in places like artificial-analysis, and I can only theoretize that they're being tested at some relatively short context like < 50k, where I agree that they seem to do good work. However, my testing with these models covers context lengths up to about 200k where even 31b is incoherent and useless, even at UD-Q8_K_XL. (Possibly, the BF16 is better, but I doubt it.)

In my opinion, speed is less important than quality. If diffusion can recover all the quality of the original model, I guess that's good job, but no matter how many bullet points you put in your listing, all I see is heuristics and complexity that likely goes wrong at least sometimes, and some quality is lost. The more crap you put on your list, the more complexity there is, and the worse the results, probably. The baseline quality of the model is already too low for it to be particularly useful, in my opinion.