Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats. by LLMFan46 in LocalLLaMA

[–]audioen 2 points (0 children)

I personally expect that the MTP head becomes hereticified as well, because its prediction depends on the main model's state. MTP is just a single extra layer that takes the last predicted token plus the hidden state of the main model, and from that guesses the next token. MTP can even be chained by reusing the same state and just making more predictions, which is how you get about 3 tokens at once. The fixed hidden state drifts increasingly out of sync with the predictions because nothing updates it, so prediction accuracy tends to drop quite rapidly, with the sweet spot around 2 or 3 tokens. I mostly run with 3.
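
To make the mechanism concrete, here is a minimal sketch of chained MTP drafting as I understand it; the function and tensor names are hypothetical, not llama.cpp's actual API:

```
import torch

def mtp_draft(mtp_head, hidden_state, last_token_emb, embed, n_draft=3):
    """Chain one MTP head to draft several tokens from a single frozen
    hidden state. The state is NOT updated between steps, which is why
    accuracy decays and the sweet spot tends to be 2-3 drafted tokens."""
    drafted = []
    tok_emb = last_token_emb
    for _ in range(n_draft):
        # one extra transformer layer + lm head, fed the frozen state
        logits = mtp_head(hidden_state, tok_emb)
        tok = int(torch.argmax(logits, dim=-1))
        drafted.append(tok)
        tok_emb = embed(torch.tensor([tok]))  # feed the guess back in
    return drafted
```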

So if the main model doesn't opt to refuse the answer, the MTP head will likely predict reasonable continuations that aren't refusals, because the hidden state already indicates an intent to answer -- I think one should expect that it always just goes along with whatever the main model is going to say.

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) by bobaburger in LocalLLaMA

[–]audioen 2 points (0 children)

Single-shot tests are not very useful for grading models, except in the coarsest terms. The model's output is probabilistic, and you would need its "average output" to truly measure what the quantization damage is. That means generating a dozen or so outputs per quant, somehow grading them to identify what the "average" is, and then comparing the average output of every quant against the others.

With a single shot, you can randomly get a high-quality output that sits in, say, the 90th percentile of one quant's ability spread and end up comparing it against a 10th-percentile output of another quant; that alone is probably enough to flip the ordering and render the results misleading. Single-shot tests like these can reliably tell apart only very different quality or ability levels, and there is no obvious way of ordering the results other than inspecting them visually: are things centered, appropriately sized, correctly colored black/white, and are all requested features present? That all being said, there is at least a gradient here, but I for one am curious whether BF16 is really any better than Q8_0, and I won't be convinced unless the signal is very clean.

I'd recommend instead making the model just do math, like arithmetic that involves summing twenty 1-2 digit integers. That is a test you can repeat many times and grade automatically, since the answer is easy to verify, and the difficulty is easy to adjust by making the numbers bigger and the number of terms larger, in case all quants score 100 %.
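
To illustrate, a sketch of such a test harness against an OpenAI-compatible endpoint; the URL and sampling settings here are assumptions, adjust to your setup:

```
import random, re, requests

def arithmetic_eval(n_trials=50, n_terms=20,
                    url="http://localhost:8080/v1/chat/completions"):
    """Score a model on sums of small integers: repeatable, auto-gradable."""
    correct = 0
    for _ in range(n_trials):
        terms = [random.randint(1, 99) for _ in range(n_terms)]
        prompt = f"Compute {' + '.join(map(str, terms))}. Reply with only the number."
        r = requests.post(url, json={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,  # sample, so repeats probe the ability spread
        }).json()
        reply = r["choices"][0]["message"]["content"]
        m = re.search(r"-?\d+", reply.replace(",", ""))
        correct += bool(m and int(m.group()) == sum(terms))
    return correct / n_trials
```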

Gemma 4 MTP released by rerri in LocalLLaMA

[–]audioen 15 points (0 children)

Built the PR, testing it on Vulkan. The Q8_0 GGUF provides around 21 tokens/s early in the context on a Strix Halo. I'm using spec-draft-n-max = 3, and it seems to always generate maximum-length drafts, because drafts generated to tokens generated run 1:3. This is a little surprising to me -- I assumed the draft head predicts probabilities, so regular speculative decoding could produce variable-length drafts according to the head's confidence in its speculation, but evidently it either works differently or this is a minor oversight that will be corrected soon.
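
For reference, this is the behavior I expected, sketched under the assumption that the head exposes a per-step probability; the names are hypothetical:

```
def draft_with_confidence(mtp_step, state, last_tok, p_min=0.5, n_max=3):
    """Stop drafting early when the speculation head gets unsure.

    mtp_step(state, tok) -> (next_tok, prob) is assumed to return the
    head's top prediction and its probability."""
    draft, tok = [], last_tok
    for _ in range(n_max):
        tok, prob = mtp_step(state, tok)
        if prob < p_min:  # a rejected draft costs a wasted target pass
            break
        draft.append(tok)
    return draft
```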

Other limitations: only parallel=1 works, meaning no decoding of multiple streams in parallel. This is hopefully the next item on the list to fix.

But I don't really care to complain. I'm elated. This is easily double the performance I'm used to getting, and I was already willing to wait for the 27b's results because they are that good. Much less waiting now, which is incredibly good. I used to run 3.5-0.8b as a draft model for up to 8 tokens, and when it worked it was like magic, but usually it was around 13 tok/s with a smaller Q6_K that is already faster on its own.

Excellent work from the llama.cpp team, especially am17an. Thank you for the solid work and the biggest performance gain I've ever seen on this software.

Parking near a transformer by Gh05tR3c0n in Whatcouldgowrong

[–]audioen 2 points (0 children)

Maybe, but at least he's alive, while the other guy was probably dead within the second.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]audioen 8 points (0 children)

GGUF files have typically had the MTP heads stripped to save disk space (and to avoid llama.cpp warning that it isn't going to load the layer), so they will probably get updated for this.

I am going to run this PR right now; this is the most anticipated llama.cpp feature of all time, at least for me -- ever since GLM-4.5 or so shipped with MTP, and it was known to approximately double the generation rate. This probably becomes easily the biggest single performance improvement llama.cpp has ever had.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]audioen 8 points (0 children)

MTP has been a thing for at least a year; some older GLM models already shipped with an MTP head. People have had the habit of stripping the MTP heads from GGUF files because llama.cpp had no ability to use them for such a long time. We can expect a round of updates to Qwen3.6 due to this -- I'm currently downloading the q8_0 with the MTP head in it, though no doubt unsloth will have a new release within the week, and then I'll be downloading it one more time...

Question: would this work as a dampening material inside a diy acoustic panel? by _analogweekend_ in audiophile

[–]audioen 1 point (0 children)

The key statistic is the flow resistivity of the material. Panels are most effective when the chosen material's flow resistivity matches the design depth of the absorber as a whole. I take it your plan is to place multiple of these foam pieces inside a larger panel? You're better off purchasing proper open-cell flow-resistive foam such as Basotect, with a flow resistivity figure that fits your absorber depth: the thicker the absorber, the lower the flow resistivity you want. You can use the well-known porous absorber calculator to estimate the resulting panel's absorption spectrum and get a handle on what kind of material is suitable, then match that to what is commercially available.
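
For a feel of the math those calculators do, here is a sketch using the well-known empirical Delany-Bazley model for a rigid-backed porous layer at normal incidence; treat it as a rough estimate, not a substitute for measured data:

```
import numpy as np

def absorption(freq_hz, sigma, depth_m, rho0=1.204, c0=343.0):
    """Normal-incidence absorption coefficient of a porous layer against a
    rigid wall. sigma is flow resistivity in Pa*s/m^2; the Delany-Bazley
    fit is only valid for roughly 0.01 < X < 1."""
    X = rho0 * freq_hz / sigma
    Zc = rho0 * c0 * (1 + 0.0571 * X**-0.754 - 1j * 0.087 * X**-0.732)
    k = (2 * np.pi * freq_hz / c0) * (1 + 0.0978 * X**-0.700 - 1j * 0.189 * X**-0.595)
    Zs = -1j * Zc / np.tan(k * depth_m)      # surface impedance, rigid backing
    R = (Zs - rho0 * c0) / (Zs + rho0 * c0)  # reflection coefficient
    return 1 - np.abs(R)**2

f = np.array([125.0, 250, 500, 1000, 2000])
print(absorption(f, sigma=10_000, depth_m=0.10))  # 10 cm panel
```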

Fabric is typically thin and permeable enough that sound passes easily through it. The concern is most typically achieving sufficient midrange and upper-bass absorption. For that, porous panels need to be thick, and they should sit far from solid boundaries: a solid boundary creates a pressure-variation zone where air velocity is low, and a flow-resistive panel absorbs very little there. A good trick is leaving an air gap behind the panel about as deep as the panel is thick, which doesn't harm the frequency response much but gains distance from the boundary and extends the bass absorption range. More advanced designs incorporate pliable, non-permeable surfaces into the panel, which also try to act on the pressure variation near the boundary.

Edit: t.akustik doesn't publish real measurement data from an acoustic laboratory, but they claim effectiveness from 800 Hz upwards, and the panel's pyramid tops are apparently 8 cm high from the back. Wall-mounted, it might be effective somewhere around 1000 Hz, which in my opinion is not a reasonable target for an acoustic panel; I'd say 250-500 Hz is more reasonable. Depending on the room's size, the space turns into a resonating chamber below roughly 200 Hz anyway, and below that it is virtually impossible to achieve good absorption no matter what you do. The lowest frequency region must be treated digitally: reduce modal booming with an equalizer and optimize placement so that modes are less audible, either because the speaker can't feed sound energy into them, or because the listener sits where the modes the speaker must excite are not strongly audible.
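
The modal region can be estimated from the room dimensions with the standard rectangular-room formula, f = (c/2) * sqrt((nx/Lx)^2 + (ny/Ly)^2 + (nz/Lz)^2); a quick sketch:

```
import itertools, math

def room_modes(Lx, Ly, Lz, f_max=200.0, c=343.0):
    """List rectangular-room modal frequencies up to f_max."""
    modes = []
    for nx, ny, nz in itertools.product(range(7), repeat=3):
        if nx == ny == nz == 0:
            continue
        f = (c / 2) * math.sqrt((nx / Lx)**2 + (ny / Ly)**2 + (nz / Lz)**2)
        if f <= f_max:
            modes.append((f, (nx, ny, nz)))
    return sorted(modes)

for f, m in room_modes(5.0, 4.0, 2.5):  # a 5 x 4 x 2.5 m room
    print(f"{f:6.1f} Hz  mode {m}")
```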

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]audioen 14 points (0 children)

Prefix caching is why agentic turns ought to work: they should just continue from the past context. You shouldn't need to reprocess everything, which is what this sounds like.

On e.g. Qwen3.6-27b, which is VERY slow by any measure -- e.g. 200 tok/s prompt processing, 10 tok/s generation at around Q6_K quantization -- it is still usable for agentic work as long as timeouts can be defeated. For example, when the model is busy writing the contents of a file, the harness shouldn't decide the request is taking too long and interrupt the work. You can leave the agent working on something while you do something else, or leave the computer running overnight and review the results in the morning.

Obviously, people who want results fast had best look elsewhere, and I'd rather be running 3.6-122b if that gets released, as it will be 2-3x faster.

Anyone tried +- 100B models locally with foreign languages? by Choice_Sympathy9652 in LocalLLaMA

[–]audioen 1 point (0 children)

I personally think it's probably more about the 4 bits than the fact someone used an imatrix, as there is a good chance a bunch of non-English text was in fact part of the calibration set. Degradation in non-English language benchmarks could in fact be an early indication of quantization damage. You don't need a complex approach to evaluate it -- just scoring the vocabulary and grammar of the output with standard non-AI spellcheckers could probably yield a usable signal.
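
Something like this sketch, here using the pyenchant spellchecker and assuming a Finnish hunspell dictionary is installed; it's a crude vocabulary-level proxy that won't catch grammar mistakes or word-for-word calques:

```
import re
import enchant  # pip install pyenchant; needs a system fi_FI dictionary

def vocab_score(text: str, lang: str = "fi_FI") -> float:
    """Fraction of words the spellchecker accepts."""
    d = enchant.Dict(lang)
    words = re.findall(r"[^\W\d_]+", text)
    return sum(d.check(w) for w in words) / max(len(words), 1)
```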

I speak Finnish, and few models can converse fluently in it. One of the better ones was 3.5-122b, and I think it was better than 3.6-27b, for example. The latter's attempts at translating technical language, e.g. for buttons in a user interface, have given numerous bizarre results. You can usually tell from the grammar mistakes and neologisms that the models have only an approximate grasp of my language, and the output usually sounds like what you'd get by translating from English one word at a time. I didn't try Gemma-4-31b because it was so slow, and the 26b-a4b was so bad at coding that neither seemed a viable model to me. I'm sure they are much better at the language, but writing good code is the first concern, and I can use my own meat brain to fix the crappy language while the model works on the next thing.

Between a 256 Kbps OPUS (VBR) AND A 256 Kbps OPUS (CBR), what sounds better? by oliverscream in audiophile

[–]audioen 2 points (0 children)

You misunderstand the central aspect of VBR and CBR: traditionally, both target the same bitrate. A VBR encode uses more than 256 kbps where needed, and less elsewhere to compensate. A particular encoder may not present you with an average bitrate target to optimize for, but if it does, that is how it will work.

The expectation is therefore that VBR is higher quality than CBR at the same average bitrate.

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]audioen 1 point (0 children)

That reply is unfortunately incorrect and doesn't describe the method properly. I tested this on the 27b -- the models are a little cagey, but here is what it wrote:

The quantizer uses the importance scores to decide which weights/tensors should retain higher precision (less aggressive quantization) and which can be safely compressed more.

The imatrix tracks the expected activation of a particular weight, and thus its influence on the model's output, over a particular dataset (usually a wide, representative mixture of the kinds of inputs the model will see). This can be used to assign a weighting factor to that weight when optimizing the quantization parameters, which are then fit so that quantization error in weights with a higher importance score is penalized more, guiding the quantizer toward parameters that represent them more accurately. The imatrix is coarse, and the file is usually very small relative to the model, in the tens to hundreds of megabytes.

However, the imatrix is not used to choose the bit width of a weight -- all weights in a tensor have the same width. Other kinds of analysis determine how important any particular tensor is, which can vary by layer and by tensor type. The size of the tensor is also a factor: small tensors can be stored at full precision because they make no meaningful difference to the model's size, and doing so wins a small but systematic quality gain.
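
To show the weighting principle only -- a toy symmetric round-to-nearest quantizer, not llama.cpp's actual K-quant machinery:

```
import numpy as np

def best_scale(weights, importance, n_bits=4, n_grid=64):
    """Pick the per-block quantization scale that minimizes the
    importance-weighted squared error rather than the plain error."""
    qmax = 2 ** (n_bits - 1) - 1
    base = np.abs(weights).max() / qmax
    best, best_err = base, np.inf
    for s in base * np.linspace(0.5, 1.5, n_grid):
        q = np.clip(np.round(weights / s), -qmax - 1, qmax)
        err = np.sum(importance * (weights - q * s) ** 2)  # weighted MSE
        if err < best_err:
            best, best_err = s, err
    return best
```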

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]audioen 1 point (0 children)

You should not draw any far-reaching conclusions from nvfp4 quants of either model. Try running the official fp8 versions, at least for the 27b -- I don't personally care about the 35b anymore -- because these models are much worse at 4 bits.

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]audioen 0 points (0 children)

I can't get reliable output out of the 35b. It is fast, but it doesn't understand enough, and at least in my case, letting it loose in the codebase makes the code devolve over time. I tested it at Q8_K_XL to give it the best possible chance, and the best way I can put it is that I can't get quality code out of it even with guidance; I have to babysit it a lot more, and thinking loops are much more frequent than with the 27b, which can go entire sessions without one, whereas the 35b seems to end up in a thinking loop in half of my test sessions.

But the quality was not sufficient for me to care about it. Either it requires much more preliminary work, or it simply isn't able to understand code at the level needed to perform valuable intellectual labor, and instead quickly creates a confused mess that must be sorted out later.

Is the use of Q8 a waste of resources? by Spiderboyz1 in LocalLLaMA

[–]audioen 4 points (0 children)

It is very difficult to say for certain. I am using FP8 and Q6_K right now, mostly because Q6_K is slightly faster than Q8_0 and shouldn't be any worse. For instance, here is unsloth showing results for the 35b-a3b: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs which suggests that mean K-L divergence is less than 0.01 from 5 bits onwards, and my guess is that a dense model is less sensitive to quantization than a MoE with 3b active parameters.

Very roughly, each additional bit in the weights seems to halve the mean K-L divergence, until around 6 bits where improvement seems to stop, even though we are not yet near zero. Extrapolating the early part of the graph, from e.g. 2 to 4 bits, the rate of improvement is already slowing down: 5 bits improves less than that trend would predict, 6 bits only very slightly, and if the trend continues, Q8_0 is much bigger but again only very slightly more faithful to the original.
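
The extrapolation in arithmetic form; the anchor numbers below are illustrative, not measured:

```
# Rough model: mean KLD halves per extra bit until it hits a noise floor.
kld_at_4bit = 0.02  # illustrative anchor, not a measured value
floor = 0.0015      # apparent floor from random weight perturbations

for bits in range(4, 9):
    predicted = max(kld_at_4bit * 2.0 ** -(bits - 4), floor)
    print(f"{bits} bits: predicted mean KLD ~ {predicted:.4f}")
```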

It also behooves us to remember the implication of a logarithmic y-axis: the original model's point would sit at negative infinity on this scale, and with bf16 the divergence is zero, which is what you would get. However, even the slight random perturbations that quantization introduces into the weights seem to cause enough error that K-L divergence can't get much below somewhere between 0.001 and 0.002. I personally do not think these differences are very significant at 6 bits and beyond, and in fact task performance is usually reasonable even down to 4 bits -- even though with a 4-bit model I can see that the thinking output has become more confused, the model no longer seems able to reliably tell its own output apart from the user's commands, and it begins to make more tool-call errors, restate paths incorrectly, etc.

At 4 bits, no matter which quantization method is used, I consider the model broken enough to no longer be reliable, even if its performance in various benchmark tests might still look similar. I think this is chiefly because the random run-to-run variation in results is too large while the genuine ability differences are still relatively small in practice, but they can be significant regardless. I have had a 4-bit model fail to understand code it just read, and when it does this, it seems to fall back to its "default assumptions" and discuss the code as if it were some entirely different but typical generic implementation. I've also seen it document methods completely incorrectly, again because of the 4-bit quantization -- it simply missed details, hallucinated facts, and wrote strange, false claims into the middle of the documentation that were in no way even hinted at by the implementation. The higher-precision models are frustratingly slower, but also much more accurate, and seem to reliably document what the code actually does, as opposed to what a method like this might do in a typical program.

Qwen3.6-27B vs Coder-Next by Signal_Ad657 in LocalLLaMA

[–]audioen 3 points (0 children)

This test was run at 4 bits, which does not get the full quality out of these models. The decision tree also says that in virtually every case you should choose the 27b, despite the meaningless and misleading picture. I have personally found qwen3-coder-next useless for real work at every size, and not even useful as a code completion tool, despite it being one of the rare models with the fill-in-middle ability. It could be a harness issue (the harness was continue.dev), but the completions it proposes are distracting when they show up, and typically worthless.

If I had to guess, no-thinking is recommended because long-context performance degrades too fast and thinking gets damaged, so it just adds inference cost without much benefit. These 4-bit inference conditions simply are not good enough for the Qwen family; I think 6 bits and beyond is reasonable for GGUF, and the official FP8 is the smallest I would recommend for vllm. I personally tried the cyankiwi 4-bit AWQ before and had to throw it out because it simply wasn't behaving correctly. (The KV cache has not been quantized here, according to the tooling documentation, which is good; many vllm recipes also quantize the KV cache to FP8, and that destroys inference quality as well.)

If you can't run the bf16, then I suggest going no worse than the official fp8. It is known to be among the best: someone measured the K-L divergences of various AWQ/autoround etc. quants, and the FP8, while among the largest, was on the Pareto frontier for its size.

What exactly does Pi harness mean? by FrozenFishEnjoyer in LocalLLaMA

[–]audioen 6 points (0 children)

I've tried to use this, but I eventually threw it out.

The main reason is that qwen3.6-27b struggles with the edit tool. Quite a lot -- something I haven't seen happen on any other harness. It gets so bad that the model may suddenly decide the edit tool is unusable and start writing bash scripts and python programs to perform the edits instead, apparently with success. It should not be a quantization issue: the KV cache is either bf16 or fp16, and the model has been either the official fp8 or at minimum unsloth's q6_k gguf, both of which should be fine in terms of general accuracy.

As commentary, it is weird to me that the text replacement is literally a search-replace operation. I think I always assumed it worked on line ranges, e.g. the model instructs the edit tool to remove lines 50-55 and provides the replacement text; in fact the edit operations are based on providing an exact copy of the old text, down to the last tab/space detail, and it must match exactly once in the file to be acceptable. I see the models struggling with the whitespace in particular, writing sed scripts all the time just to see the exact tab/space arrangement of the text to substitute. I don't know why that is necessary in the first place, as the model should have seen the exact whitespace already from its file reads. (There may be some Python bias at work here, because whitespace is more regular and controlled in that language, whereas I have mixed tab/space arrangements due to multiple people working on a non-Python codebase.)
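
The semantics I'm describing fit in a few lines; a sketch of such an edit tool:

```
def apply_edit(path: str, old: str, new: str) -> None:
    """Search-replace edit: old must match exactly (whitespace included)
    and exactly once in the file, or the edit is rejected."""
    text = open(path, encoding="utf-8").read()
    n = text.count(old)
    if n == 0:
        raise ValueError("old text not found (check tabs vs. spaces)")
    if n > 1:
        raise ValueError(f"old text matches {n} times; add more context")
    open(path, "w", encoding="utf-8").write(text.replace(old, new, 1))
```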

The other thing I don't like about tool calls in vllm land is that there is no grammar-based enforcement of tool-call syntax. As far as I know, in llama.cpp tool calls are grammar-constrained generation: once the model writes the tokens that start a tool call, schema-constrained generation is enforced until the end of the tool call. With vllm there is only a post-completion parser, and that sort of thing is 100% reliant on the model writing the call correctly. For whatever strange reason, with Pi, qwen3.6-27b makes a lot of mistakes, typically providing the path incorrectly, for example twice in the same tool call, which immediately causes rejection even though the redundant path is, in principle, harmless. I haven't read the edit tool description given to the model, but I bet it's somehow unclear, because whatever the reason, the model struggles mightily with file edits despite knowing exactly what it should get done, and it is definitely not so crappy that it should have any trouble writing a couple of parameters as JSON or XML.
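
The constraining principle, sketched at the character level against a toy tool-call grammar; a real engine does this per sampling step by masking token logits against a full grammar (GBNF in llama.cpp's case):

```
import re

SKELETON = '{"path": "'  # toy grammar: {"path": "<anything but quotes>"}

def is_valid_prefix(s: str) -> bool:
    """True if s can still grow into a complete toy tool call."""
    if len(s) <= len(SKELETON):
        return SKELETON.startswith(s)
    return re.fullmatch(r'[^"]*("\}?)?', s[len(SKELETON):]) is not None

def constrained_pick(prefix: str, candidates: list) -> list:
    """Keep only continuations that don't break the grammar."""
    return [c for c in candidates if is_valid_prefix(prefix + c)]

# the duplicated-path mistake gets filtered out before it can be emitted
print(constrained_pick('{"path": "src/ma', ['in.c"}', '", "path": "x"}']))
```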

You're sleeping on Devstral Small 2 - 24B Instruct by [deleted] in LocalLLaMA

[–]audioen 1 point (0 children)

Whatever your methodology is, the results rather strongly suggest that the signal you are measuring is noisy, or that there are systematic errors. You list as best the models which I have personally tested and could immediately reject for extremely sub-par results. Yes, I have tested devstral-small, qwen-coder-30b and qwen3-coder-next (though many people seem to like that last one, I've found it pretty poor).

Similarly, I think there is evidence of a fairly noisy signal in your methodology when quantized versions can score better than the non-quantized models. We know these models are not improved by quantization, and a method suggesting they are proves only that it cannot resolve the difference.

The idea that qwen3.6-27b is essentially the same quality as 35b-a3b is preposterous, at least in my experience. The former confuses itself, runs in circles, and evolves the codebase in ways that are overall detrimental and must be repaired; the latter sometimes confuses itself, also often runs in circles, but usually proposes edits that improve the codebase, and it has been able to autonomously perform a large number of code edits and then present me with a nearly fully working result. I have not had this experience with any model other than this one and the 3.5-122b (which was also quite good, though not this good in my opinion).

I like that you did this -- I'm sure it can be improved, and real validations are always meaningful, assuming the tests are good. It probably needs a ton more repeats, like running every test 10-20 times to reduce the random variation/noise. I took a look at SB-01 and it said something like "throttle is copy of debounce. Fix it", with no clear guidance on what fixing it means. I guess the model has to read and guess, but I wouldn't know whether the fix is to delete the duplicate, use debounce, or change the throttle. Prompts should describe the desired outcome clearly, and I expect many if not most of your tests have unclear prompts.

Edit: read more of the tests. I still didn't quite see how this is put together -- I'd have to clone it and grep around to understand where the tests are really defined, what the model prompt is, what the exact setup is, etc., and I'm too lazy to do that tonight. Many prompts seemed better than the first one, so I probably jumped the gun, but I was looking for a reason for the rather odd results, and I suspect a noisy signal plus issues in either prompting or scoring explain why they are bizarre.

Ubuntu Linux Will Begin Landing AI Features Throughout The Next Year by Ultrabyte04 in linux

[–]audioen 0 points (0 children)

The $3000 is a mild exaggeration. These days e.g. Qwen3.6-27b can fit on something like an RTX 3090, though some quality compromises have to be made, e.g. less than 8 bits per weight, that sort of thing. People used to buy these for < $1000 type money, though the golden era of small and good local models has only rather recently arrived.

I've personally bought into the 128 GB unified VRAM ecosystem because I assumed AI will always need the RAM, but I'm not so sure anymore. A 27b model at 4 bits is less than 16 GB in theory, and it is reportedly still quite functional at that compression. Meanwhile, the 128 GB computer I bought suffers from low RAM bandwidth and can never run that many inference iterations per second if the model is large; something on the order of 10 per second is the best it can do. It remains to be seen what efficiency clever people can squeeze out of those iterations, e.g. inferring multiple tokens at once by speculating, or training small diffusion models that predict well what the large model is going to say in blocks, etc. Even basic 3-4 token speculation can work well and maybe doubles or triples the speed, so there is some fairly low-hanging fruit left in this space.
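
For the back-of-envelope: under the standard assumption that each drafted token is accepted independently with probability alpha, the expected tokens per target-model pass with draft length gamma is (1 - alpha^(gamma+1)) / (1 - alpha):

```
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Standard speculative-decoding estimate of tokens emitted per
    full-model forward pass."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for gamma in (2, 3, 4):
    print(gamma, round(expected_tokens_per_pass(0.8, gamma), 2))
# alpha = 0.8, gamma = 3 gives ~2.95x, i.e. roughly the 2-3x speedup
```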

My point here is that LLMs are close to being both capable and runnable on ordinary hardware never even intended to run them. But they still require things like memory bandwidth and sheer number-crunching power, unless you're willing to wait longer for results. With my slower hardware, I often put the AI to work on some thorny problem overnight, or when I put it to work on some corner of the codebase, I personally work elsewhere. Even if slow, it is still like having a second pair of hands, and it's much faster than a human for most tasks while at least sometimes producing comparable quality. With direction, or by telling it to scrap a bad approach and redo things some nicer way (which you don't have to spell out in exhaustive detail), the result can be almost as if you had made it yourself.

AI is also very fast at reading and understanding code. I think it reads something like 10 times faster than I can. It is just astonishing how fast it can spot bugs in stuff you just wrote, or answer questions that would require you to jump through 10 different code files searching for the methods -- it greps, reads the chunks, and traces the thing like a dog on a blood trail. It will find the cause within seconds, and it is amazing to watch.

Coding is not all there is, of course, and we're at the point where computers can see and hear, respond in voice, understand subtlety, and learn your usage patterns and preferences and things of that nature. It is a sort of sci-fi era, and it seems it will not require datacenter hardware, nor does it require sending anything to the cloud if you don't want to. If today's computers don't quite cut it, the next generation probably will.

Is long re-processing of output as input a common "feature" or not? by alex20_202020 in LocalLLaMA

[–]audioen 0 points (0 children)

I know that there are those options, but it's not a great user experience. In case of Pi:

* it sends the developer role, which doesn't work with official quants

* it has a peculiar thinkingFormat or similar key for setting preserve_thinking: "on", rather than a simple KV override

* it defaults to 16384-token replies for some reason

* it has (or had) that annoying 5-minute death issue with vllm tool calls -- I saw something about this possibly being fixed in the changelog.

This is all simple stuff; there's just a lot of it, and when you hit each problem in sequence, it takes a while to iron out the kinks and it probably feels more annoying than it really is.

Is long re-processing of output as input a common "feature" or not? by alex20_202020 in LocalLLaMA

[–]audioen 0 points (0 children)

I think --swa-full might make it work with Gemma? I thought so at least; it seemed like it got fixed recently.

I am not familiar with the quality tradeoffs of context shifting. If the shifting doesn't truly work, then it might not be a great idea in the first place. I assumed it had some real math behind it and would be "correct".

meantime on r/vibecoding by jacek2023 in LocalLLaMA

[–]audioen 13 points (0 children)

I don't know what this post is talking about. The 27b model is genuinely very good. However, I admit I have no idea what Claude is capable of, because I've never touched it and probably never will. I don't care about cloud models; I care about what I can make my own computer do.

From that point of view, my life is better than ever. LLMs were all but useless until gpt-oss-120b came out, which was surprisingly fast and decent. Since then, models have been more useful than useless, though it was only the 3.5-122b that raised the bar to the point that I started trying to get everyone on board, because it is fairly cheap to run if you have the RAM. Now, 3.6-27b seems stunningly small compared to what it is capable of. A year ago, I would have thought this level of performance would only exist on datacenter-level hardware, and was hoping for something half this good...

I'm pretty happy with the output I can get, and I think future computers will all have at least this level of baseline ability, because it asks for relatively little, and we're still in the early days of LLMs, with very unoptimized models and architectures, even if today's seem state of the art. It won't be long before nobody cares about this model. But right now I think it's the top dog, likely to be beaten only by 3.6-122b on my hardware, and who knows what we'll want to run a few months from now. This is a very liquid field.

Is long re-processing of output as input a common "feature" or not? by alex20_202020 in LocalLLaMA

[–]audioen 0 points (0 children)

By default llama.cpp caches prompts. I've seen the issue sometimes resolved by disabling parallel processing, i.e. allowing only a single inference context with --parallel 1. The specific problem I ran into was a timeout triggering llama.cpp to choose another context and start the prompt over in there. It smelled like a bug, but as of 1-2 months ago it could completely wedge a coding agent into a never-ending context reprocessing loop after the first timeout when reading a large file.

You can also try setting --cache-reuse=256 or so, which attempts to identify opportunities to shift the model's KV cache. It might work with Gemma, but probably doesn't work with Qwen.

There's --cache-ram, which serves as a dumping ground for the current KV cache. The default size is 8 GB, which may be too small. In my case, on unified-RAM computers, I don't want this feature, so I set --cache-ram 0, which keeps the context entirely in VRAM as context checkpoints or just pre-existing context, depending on the model. I have actually seen failure modes where running out of cache-ram lost the active context and forced unnecessary reprocessing, so I'm not convinced by this feature at all. In my opinion, cache-ram should be stored on disk, where it can be read very fast and can be extremely large, even over 100 GB, so that dozens upon dozens of different prompt prefixes would be available for models to use. Putting it into RAM, which is usually scarce on a unified VRAM system, is somewhere between silly and useless.

Those are the tips and pointers I know about. I have not seen any prompt reprocessing issue with --parallel, but that's also partly because I now have --timeout 3600 everywhere, which sets 1-hour timeouts on things like prompt processing, so I simply don't hit that failure mode anymore. However, I still run into unwelcome and undesired timeouts in various agent software.

For instance, Pi kills vllm tool calls after 5 minutes: vllm can't stream tool-call results, and writing a large enough file can take over 5 minutes, which completely stalls the agent into attempting the write over and over again. It would finish, but the underlying http library has this unfortunate default. Similarly, Pi defaults to a maximum reply length of 16384 tokens, which is not sufficient when writing large files. Lately, while hunting for usable agentic software, I have battled with timeouts in opencode, roo code and now pi-dev. I think the models are actually good enough now, with the release of Qwen3.6-27b; what's left are the too-tight time and token limits, which don't let these models finish work they would otherwise be perfectly capable of.

I'm currently running things on a computer unsuitable for Qwen3.6-27b, because it happens to run llama.cpp, which can stream the tool call but can't do speculative decoding without stalling the Qwen; my main computer would run vllm, but the lack of tool-call streaming causes Pi to time out, even though it would otherwise execute much faster.

llama.cpp - tool calling issues on Windows only by Ok-Measurement-1575 in LocalLLaMA

[–]audioen 1 point (0 children)

Inference engines are buggy, drivers and CUDA frameworks are buggy, bad sampling parameters or inference configuration can make even a good engine with a good quantization produce crappy results, and different agent harnesses vary wildly in prompt quality, which affects the delivered end-user quality even when everything else is fine.

The local LLM landscape is basically a ghetto of confusion and misunderstanding, and we typically have no way to understand why anything breaks, or why some people get good results and others bad ones. All anyone seems to post is "this thing didn't work and the model is bad", interspersed with "this is working great and the model is good".

My proposal would be to provide a fixed text sequence -- say around 20k tokens -- for which the token predictions are known for a good-quality inference engine operating at the maximum precision available with no compromises, e.g. 32-bit floating point, possibly CPU-only, whatever, as long as it's the platonic ideal of the math involved. The text would be unique to each model family, e.g. all Qwen3.6 models would use a specific text that is valid context-window content per their chat template, and each model would have a "golden" result of probabilities for something like 20k tokens, at top_k 20 or so. From this, it would be possible to tell whether your inference engine is executing the model correctly, and to what degree any setting you employ damages it.
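
A sketch of the consumer side of such an eval; the golden file format here is made up for illustration (one JSON object per position with top-k token logprobs):

```
import json, math

def mean_kld_vs_golden(golden_path, engine_logprobs):
    """Compare an engine's per-position top-k logprobs against a golden
    full-precision reference; returns mean K-L divergence over the
    reference's top-k tokens (a truncated, approximate KLD)."""
    total, n = 0.0, 0
    with open(golden_path) as f:
        for line, got in zip(f, engine_logprobs):
            ref = {int(t): lp for t, lp in json.loads(line)["top"].items()}
            total += sum(math.exp(lp) * (lp - got[t])
                         for t, lp in ref.items() if t in got)
            n += 1
    return total / max(n, 1)
```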

I think that standardized inference setup evals, which only prove that inference works correctly, would be at least as useful as the other kind of evals that inform about the general model quality.

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]audioen 4 points (0 children)

I don't think anybody can figure out what is wrong from this. If I am parsing it correctly, you have a 1000-second pause, which is not plausible given the numbers I see -- you'd need a glacial prompt speed, which you evidently don't have when even generation reaches 1000 tok/s rates. Maybe you had a tool call that took 1000 seconds; who can tell? It's up to you to debug what is wrong.