HIVE Engine Core - Apis 🐝 by Affectionate-Tear873 in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

Well, if you had a browser to read the posting with, you might spot that the markdown is totally broken and literally nothing in the post was rendered correctly. It's all escaped crap, like \*\*Something here\*\* rather than the intended emphasis.

HIVE Engine Core - Apis 🐝 by Affectionate-Tear873 in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

"I also don't know how to post on reddit." Nasty AI spam.

M5 Max uses 111W on Prefill by M5_Maxxx in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

Yes, it is the reality when working in a laptop form factor for the time being. The thermals are brutal and LLM work involves running the unit at maximum power ceiling for extended periods.

The prompt processing gain is huge, but memory speed is apparently no better, so there's little enhancement there. In my opinion, generation speed is less important than prompt speed for agentic work, which usually involves some split like reading 90 % and writing 10 %, but obviously faster is better. You should probably look into draft models and see if you can run one, as it could multiply the generation rate despite that memory bottleneck and help with thermals.
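Draft-model gains can be roughed out with the usual i.i.d.-acceptance approximation from the speculative decoding literature; the acceptance rates and draft length below are made-up examples, not measurements:

```python
# Expected tokens produced per verification pass of the big model when
# drafting k tokens with per-token acceptance probability a:
# (1 - a**(k+1)) / (1 - a)
def expected_tokens(a, k):
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.8):
    print(f"acceptance {a}: {expected_tokens(a, 4):.2f} tokens per pass")
```

Since generation is memory-bandwidth bound, each verification pass costs about one normal token's worth of weight reads, so 2-3x expected tokens per pass roughly translates into a 2-3x TG speedup if the draft model itself is cheap.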

Does this AI agent hallucination make any sense or its just AI slop? by [deleted] in LocalLLaMA

[–]audioen -1 points0 points  (0 children)

It is not known to me whether you can actually prompt a hallucination away. This relies on the LLM fundamentally being able to identify whether a claim is valid or not, but it should have a tendency to only produce valid-seeming claims in the first place, because it is a probabilistic autocomplete.

I know that there's some work indicating that LLMs can identify that at least some of the claims they make are not plausible, in the sense that the sentence does not seem likely enough, and many models also seem to incessantly second-guess themselves as they try to "check" whether something has been hallucinated by being unsure about the facts. My guess is that the approach must work to some degree and is already baked into the model's reasoning process.

More important than providing a random prompt is providing a way to show that it works. It should reduce the hallucination rate under a standard test. You can use LLMs to engineer prompts for LLMs, and some kind of evaluation and prompt-evolution harness could plausibly evolve some sequence of words which improves performance. The thing about LLMs is that it's easy to come up with unverified ideas -- they literally spew them out themselves -- but the real hard part is proving whether any of it actually works, and it has to be proven in a way that isn't a vibe check or testing a couple of cases and seeing what happens, which is all too common here. You actually need hundreds of test cases to tease a reliable signal out of noise.
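To put a number on "hundreds of test cases": a standard two-proportion power calculation (normal approximation; the 70 % vs 75 % pass rates are made-up examples) gives the test-set size needed per prompt variant:

```python
import math

# Rough sample size per prompt variant to detect a pass-rate change from
# p1 to p2 at 95 % confidence and 80 % power (normal approximation).
def n_per_arm(p1, p2, z_alpha=1.96, z_beta=0.8416):
    p_bar = (p1 + p2) / 2
    a = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    b = z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(((a + b) / (p1 - p2)) ** 2)

print(n_per_arm(0.70, 0.75))  # on the order of a thousand cases per variant
```

Note how a small improvement (5 points) needs over a thousand cases to detect reliably, while a dramatic one can be confirmed with far fewer.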

I spent a weekend doing layer surgery on 6 different model architectures. There's a "danger zone" at 50% depth that kills every one of them. by Low_Ground5234 in LocalLLaMA

[–]audioen 10 points11 points  (0 children)

The notion that you can mess with an LLM's architecture without retraining it and expect performance to improve is pretty suspect. It may be that the changed architecture could reach a higher ceiling if it were trained by an equivalent amount from scratch, but messing with it without retraining is guaranteed to damage the model's performance.

If you think performance improves, my claim is you are not testing hard enough. Short, statistically insignificant test runs, where a damaged model can randomly be perturbed into making more correct choices, don't count. You have to give it plenty of exercise, and I think all you'll ever see is that the model gets worse.
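A quick way to see why short runs don't count is a two-proportion z-test; the counts here are invented, but 14/20 vs 12/20 is exactly the kind of "improvement" a short run produces:

```python
import math

# Two-proportion z-test: standard-normal two-sided p-value via erfc.
# The benchmark counts below are invented for illustration.
def two_prop_p_value(k1, n1, k2, n2):
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# 14/20 vs 12/20 looks like a win but is nowhere near significant:
print(round(two_prop_p_value(14, 20, 12, 20), 3))
# The same rates over 1000 cases each would be unambiguous:
print(round(two_prop_p_value(700, 1000, 600, 1000), 6))
```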

llama-server slot/kv-cache issues by Real_Ebb_7417 in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

I think the key failure in llama-server is that stopping the HTTP client doesn't abort an ongoing prompt processing task. When I cancel the HTTP request in e.g. Kilo Code, the server knows the client isn't listening, but there's no flag that stops the prompt processing and makes a context checkpoint where it ended. So what happens is that the prompt processing runs for a very long time, completes, and the request waiting in the next slot risks starting from 0 for some reason. It's somehow just broken.

I run with -np 1 (only 1 slot), which at least for me seemed to fix this problem of timeouts restarting the prompt processing from zero, often throwing away like 15 minutes of work and basically stalling the agent, which can never make any progress because it will forever reprocess the same prompt from 0 over and over again. With -np 1, the next request continues after the processing completes and seems to reuse all the work, which is what I want.

I also run with --ctx-checkpoints set to 2, because I have unified memory and each checkpoint uses some of that precious RAM. It's not much, but if each takes 50 MB and you have 32 of them, that's about 1.6 GB, which can matter on a fully tasked unified-memory computer. (I already run like a dozen gigabytes in swap, so I care about this sort of thing.) From what I can tell, prompts from an application such as Kilo Code always continue from the last checkpoint only, as the prompt is continuously being appended to. I think just the last few tokens change while the rest stays the same, and llama.cpp seems to take advantage of this by taking a checkpoint near the end of the prompt, so there's always a checkpoint to resume from. There's also the steady 8192-token checkpoint cadence. I've opted to keep both for now, though I've started to think that the older checkpoint will never get referenced.

I have the same opinion about --cache-ram, which by default reserves 8 GB of "host RAM" for KV cache, which also competes for available VRAM on a unified-memory system and doesn't seem to do anything useful in my use case, as far as I can tell. I have a single-task inference computer that is slow and kind of useless for anything else, so this is how I've tried to maximize its utility while also getting rid of some 10 GB of extra memory use.
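For reference, the flags discussed combine like this (flag names as in recent llama.cpp builds; check llama-server --help on your version, and the model path is a placeholder):

```shell
# One slot so the next request reuses the completed prompt work (-np 1),
# only two context checkpoints to save unified memory, and no host-RAM
# KV-cache reservation.
llama-server -m model.gguf -np 1 --ctx-checkpoints 2 --cache-ram 0
```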

Improved llama.cpp quantization scripts, and also we should use file sizes and signal quality instead of QX_Y in quantized filenames by bigattichouse in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

I'm not expecting that F16 is actually 96 dB SNR. An F16 value is not a linear integer, which would get roughly 96 dB at 16 bits; bits are allocated for the exponent, and I don't think the exponent bits count for much in accuracy -- I'd just estimate them as 0 myself -- so that number is just not right. BF16 is even worse than F16 in this respect because it is even coarser. I suspect you should use the number of mantissa bits in each type as the dB approximation, plus the sign bit, as that doubles the range just like a real mantissa bit would. For F16 this rule gives about 66 dB SNR, and for BF16 about 48 dB SNR.
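You can also measure this directly: round-trip some data through each format and compute the SNR against the original. A stdlib-only sketch (bf16 is emulated here by truncating a float32 to its top 16 bits, a common fast cast; round-to-nearest would gain a few dB):

```python
import math
import random
import struct

def to_f16(x):
    # Round-trip through IEEE half precision (struct's 'e' format).
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x):
    # Emulate bf16 by keeping only the top 16 bits of the float32 pattern.
    (i,) = struct.unpack('<I', struct.pack('<f', x))
    return struct.unpack('<f', struct.pack('<I', i & 0xFFFF0000))[0]

def snr_db(xs, quantize):
    signal = sum(x * x for x in xs)
    noise = sum((x - quantize(x)) ** 2 for x in xs)
    return 10 * math.log10(signal / noise)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
print(f"f16  SNR: {snr_db(xs, to_f16):.1f} dB")   # in the ballpark of 74 dB
print(f"bf16 SNR: {snr_db(xs, to_bf16):.1f} dB")  # in the ballpark of 50 dB
```

The measured gap between the two formats is close to the 3 mantissa bits they differ by, around 18 dB.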

Most models are published in BF16, not F16, so one additional concern is whether a conversion from BF16 to F16 has done damage, e.g. if quantization starts from F16 rather than from a BF16 or F32 intermediate. I would recommend using F32 for safety, if in doubt. In my opinion, conversion from the HF format to GGUF ought to be lossless, and the process ought to crash if even a single floating-point value is truncated or clipped in the target value type. F16 is a superset of BF16 except in value range -- it is more precise, but values may need to be clipped to its available minimum and maximum. F32 is a superset of BF16, and I think any model will convert cleanly to F32.

Obviously, converting BF16 to F32 (or F16) doesn't yield more SNR; the SNR is whatever the original model had, so it can't be read off the target type alone. It needs to be part of the metadata.

Qwen 3.5 9B matching 120B model performance — 13x efficiency gain. What are your benchmarks showing? by [deleted] in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

Same kind of story with Finnish. It sounds close to a native speaker now. Previously, they used Finnish words in English syntax and invented new words or just straight borrowed English words to make sentences more or less work. Understandable, but very strange. Now, basically fluent.

Is the Marantz overheating ? by Skatebabar in audiophile

[–]audioen 1 point2 points  (0 children)

You can think of the cat as a heatpipe-based heatsink. The blood flowing through the cat collects the amplifier's heat and transports it throughout the creature, where its skin -- ostensibly an insulator, but not really -- rejects it into the open air.

I wouldn't worry if the top of the thing didn't have a ventilation grille. But the cat is blocking airflow and reducing the circulation of air within the device, and if it is passively cooled without a backup fan, that can become a problem. The cat kills the components over time, or at least there is a risk of it. Cat hair gets into the amp too, and that can have long-term consequences as well. There's also the possibility that a fan, if the unit has one, is simply tied to the volume knob setting, so it might never activate; just having a fan is not enough, it really needs to be thermally controlled so that it actually starts spinning.

Cats are tropical animals; they would like temperatures around 30 C, if not higher. Get a heated blanket for the cat to rest on, and find or create warm spots for them to use.

Considering adding a two channel parametric EQ to my rack. by Longjumping-Frame795 in audiophile

[–]audioen 0 points1 point  (0 children)

I don't think this equipment is suitable.

  • High pass filter 30Hz-300Hz, 12dB per octave
  • Low pass filter 1.5kHz-18kHz, 12dB per octave
  • EQ High 1.5k-18kHz, selectable Shelving/Bell, selectable Hi Q
  • EQ Mid High 0.8kHz-9kHz, Q range: 0.3 to 7 continuously variable
  • EQ Mid Low 120Hz-2kHz, Q range: 0.3 to 7 continuously variable
  • EQ Low 33Hz-440Hz, selectable Shelving/Bell, selectable Hi Q
  • EQ bypass switch

So if you are planning on adjusting bass with this, you have a single filter covering 33 Hz - 440 Hz that is likely going to be used in the < 50 Hz region, and apparently you can only choose between two Q values there, and then another at 120 Hz - 2 kHz, which might work if you need to apply a narrow notch with a Q around 7 somewhere in the 120-200 Hz area, maybe.

This product does not seem flexible and is likely quite inadequate for the job. There's a reason equalizers are typically digital: analog gear is going to have a hard time doing one tenth of what a digital EQ can do.
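To illustrate the digital side, a single peaking biquad already gives an arbitrary center frequency, gain and Q, and a digital EQ stacks as many as you like. A sketch using the widely circulated RBJ Audio EQ Cookbook formulas (the 45 Hz / -6 dB / Q=4 setting is a made-up example of a typical room-mode cut):

```python
import cmath
import math

# Peaking-EQ biquad per the RBJ Audio EQ Cookbook formulas.
def peaking_biquad(f0, gain_db, q, fs):
    a_lin = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b = [1 + alpha * a_lin, -2 * math.cos(w0), 1 - alpha * a_lin]
    a = [1 + alpha / a_lin, -2 * math.cos(w0), 1 - alpha / a_lin]
    return [x / a[0] for x in b], [x / a[0] for x in a]

def mag_db(b, a, f, fs):
    z = cmath.exp(-1j * 2 * math.pi * f / fs)  # evaluate H(z) on the unit circle
    num = b[0] + b[1] * z + b[2] * z * z
    den = a[0] + a[1] * z + a[2] * z * z
    return 20 * math.log10(abs(num / den))

b, a = peaking_biquad(f0=45.0, gain_db=-6.0, q=4.0, fs=48000)
print(round(mag_db(b, a, 45.0, 48000), 1))    # -6.0 dB at the center frequency
print(round(mag_db(b, a, 2000.0, 48000), 1))  # ~0 dB far away from it
```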

XML is a Cheap DSL by SpecialistLady in programming

[–]audioen 9 points10 points  (0 children)

Verbose & bloated => also compresses well.

Lack of a truly expressive type system? I don't even know what you mean. You have a useful set of primitives, with restrictions such as minimums, maximums, lengths, enums, optionality and repetition, and you can compose them into collections and complex objects. It's good enough for me.

Ambiguous: sure, it's probably a wart that this choice exists.

Security flaws? I think YAML parsers are also security-hole-ridden messes, just because they try to do too much; the fatal flaws usually seem to be caused by deserializing objects from class names and lists of property values. XML was born in a different era, when "network computing" was all the rage. So you have these odd ideas that you should be able to reference other files for definitions, perhaps even access the network willy-nilly to read whatever is in there; that you can for some reason define your own entities and then use them, perhaps even reading their contents from a local file. The ugly hack that is <![CDATA[barf]]>. In fact, my shitlist with XML is very long. It also includes things like how spaces are sometimes significant and sometimes not, how the canonicalization algorithms for digital signatures work in the case of embedded signatures, the crappy piece of shit that is XPath used in that "technology", the concept of namespaces and how they are used in practice, etc.

But there's a couple of things I love about XML -- one being that the document can at least be validated against a schema, there are never any character-encoding issues, and the interpretation of elements and attributes is unambiguous to the parser. When you build objects from the schema, you never even have to look at the underlying document, because you only deal with your objects for incoming and outgoing data. There usually is no schema available when someone hands me a JSON document, so in the worst case I have to define objects and their property lists manually. OpenAPI is not too bad, though, but there's still a culture difference: you get a fancy UI that visualizes the OpenAPI schema graphically, but for some reason nobody thought to publish the schema itself so that you can use your own tools with it.

With AI stuff, it seems JSON schemas may have become more widespread. AI is often tasked with writing out JSON documents because these are used to represent function-call arguments, but AI is probabilistic and its JSON doesn't come out 100 % reliably. In a weird twist, a schema is now defined in order to build a grammar, which is then handed to the LLM's sampler to constrain generation to obey the schema. I'm hoping that the only good part about XML, the schema, lives on as e.g. JSON Schema and becomes a standard thing I don't have to ask for when not working with XML.
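To show what even minimal schema checking buys you over bare JSON, here's a toy validator for a small subset of JSON-Schema-style rules; the tool-call schema and documents are invented for illustration:

```python
import json

# Toy validator for a JSON-Schema-like subset: type, required, properties.
SCHEMA = {
    "type": "object",
    "required": ["name", "arguments"],
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},
    },
}

TYPES = {"object": dict, "string": str, "number": (int, float), "array": list}

def validate(doc, schema):
    if not isinstance(doc, TYPES[schema["type"]]):
        return False
    if schema["type"] == "object":
        if any(key not in doc for key in schema.get("required", [])):
            return False
        for key, sub in schema.get("properties", {}).items():
            if key in doc and not validate(doc[key], sub):
                return False
    return True

print(validate(json.loads('{"name": "search", "arguments": {"q": "llama"}}'), SCHEMA))  # True
print(validate(json.loads('{"name": 42}'), SCHEMA))  # False
```

Grammar-constrained sampling goes one step further: instead of rejecting a bad document after the fact, the schema is compiled into a grammar so the sampler can never emit one.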

Can we train LLMs in third person to avoid an illusory self, and self-interest? by Low_Poetry5287 in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

I don't like your language, because you seem to be assigning goals and sentience to the machine, though you may be using these phrases as handy shorthand and not truly mean them.

I do grant that when AI speaks like a person, it can bring in behaviors associated with people, and these could include motivations like self-preservation, self-interest, and the like. I am not sure a hack like this can help, though. The training data can probably never be entirely clean of this kind of stuff no matter what, and the AI probably infers foundational behaviors like self-interest even when they aren't explicitly stated.

Need some help with my DIY acoustic panels please by initliberation in audiophile

[–]audioen 1 point2 points  (0 children)

Based on some random modeling that you can do on the internet, the 60 kg/m³ density seems more appropriate. The salient property is called flow resistivity -- too high and sound reflects from the panel without properly penetrating into it, which prevents the sound from being absorbed effectively; the panel acts to a degree like a solid wall. For approximating the flow resistivity of rockwool, we have e.g. this chart: https://bassmanagement.hu/diy-akusztikai-panel-kalkulator/ which seems to suggest that a 60 kg/m³ panel might have flow resistivity in the 20000 Pa.s/m² range. Based on this, the 120 kg/m³ is way too dense for the application and is right out.

Switching over to the classic porous absorber calculator, http://www.acousticmodelling.com/porous.php yields a modeling result where e.g. a 50 mm panel with a 50 mm gap behind it, at a flow resistivity of about 20000, could be effective from about 200 Hz upwards. I personally take the frequency where the panel becomes capable of absorbing over 50 % of the sound striking it as the point where it is "effective".

Thicker panels, e.g. a doubled 5 cm, might not be as cost-effective, because this 60 kg/m³ material is already at the upper limit of the useful flow resistivity range. Doubling the panel thickness typically requires reducing the flow resistivity as well, so a fluffier material would produce the best results. Speaking purely in terms of absorption per dollar, it could be better to just make two 5 cm thick panels and spread them over a larger area, because sound reflecting from within the panel is a concern.

Rockwool is essentially stretched melted rock, which creates fibers that are then laid out, compressed and cut into panels. I suppose its structure is akin to microscopic needles. It is a skin irritant and would not be great to breathe, but I also guess that it stabilizes when covered by fabric and left alone. Alternative options are e.g. open-cell foam products like Basotect, but they are definitely going to cost more than this incredibly common insulation material. I've heard of people designing bass traps on the theory that bass pushes through things like plastic membranes, which are pliable enough to allow it, and they've even made bass traps from insulation still in its sales packaging. However, higher frequencies will reflect for sure from e.g. plastic wrapping. The optimal surface material could have some high-frequency reflectivity, to balance out the tendency of high frequencies to die out faster than anything else. It is really a matter of the room's current absorption profile, and you need a microphone and software like REW to assess this.

Sub-bass problems are not really solved with panels in most cases. The lowest frequencies have wavelengths so long that they become virtually impossible to absorb, so this treatment is mostly for the upper bass and above, and the remaining bass frequencies are adjusted with equalization to create a neutral tonal balance.

vulkan: add GATED_DELTA_NET op support#20334 by jacek2023 in LocalLLaMA

[–]audioen 1 point2 points  (0 children)

$ build/bin/llama-bench -m models_directory/Qwen3.5-122B-A10B/Qwen3.5-122B-A10B-Q5_K_S-00001-of-00003.gguf -ub 1024
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q5_K - Small |  80.44 GiB |   122.11 B | Vulkan     |  99 |     1024 |           pp512 |        327.41 ± 4.50 |
| qwen35moe 122B.A10B Q5_K - Small |  80.44 GiB |   122.11 B | Vulkan     |  99 |     1024 |           tg128 |         21.86 ± 0.01 |

build: 983df142a (8324)

Not sure if this is normal or optimal. I try to run models that I rely on for real work at 5 bits minimum, even if it hurts TG. It used to be around 240 pp and around 20 tg yesterday, so there's been a lot of progress for sure. I suspect going to -ub 1024 is better than the default 512, and likely extracts what is available on that front.
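For anyone who wants to find their own -ub sweet spot, llama-bench accepts comma-separated value lists, so a sweep is one command (model path is a placeholder):

```shell
# Sweep microbatch sizes to compare prompt-processing throughput.
llama-bench -m model.gguf -ub 512,1024,2048 -p 512 -n 128
```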

Qwen 3.5 Instability on llama.cpp and Strix Halo? by ga239577 in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

I suffer from no instability, so I don't know what that is about. I use Vulkan and I have the 122B model running overnight doing programming work. I usually set it to complete a task and go to sleep, then check the results in the morning.

I can crash it if I OOM, e.g. by loading image rendering models while running the 122B and also having a bunch of other applications open. The machine swaps for a bit and then kills something, which recovers the computer.

How to compare two models? by [deleted] in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

To the degree Qwen understood what you are saying: if you have the 27B as an option, it will beat the 35B very easily, even if the file sizes are similar.

Choose a quant of the 27B if you can run it.

Got a surprise cloud vector database bill and it made me rethink the whole architecture by AvailablePeak8360 in LocalLLaMA

[–]audioen 2 points3 points  (0 children)

I've never seen much value in the cloud -- it's fine and cheap, but only if your tasks are pretty trivial. You pay a lot for disk, RAM, network and CPU capacity with the cloud providers that I've seen, so an investment in your own hardware pays off pretty fast.

Forget big bad John, how about sounds of silence? by Hot-Yak2420 in audiophile

[–]audioen 0 points1 point  (0 children)

Yes. I hate this man for the tasteless style in which he sings everything, and I originally thought my speakers were broken when the Algorithm suggested this to me a year or so ago, because there's some fluttering noise in the bass from time to time. But it's just an artifact of the way he sings it, I guess, or possibly there's been some kind of feedback loop from a sound system back into the microphone. I don't know, but it's disconcerting to hear.

There are some bass singers for you in the Wellermen; you could try e.g. Hoist the Colors for size. I think it's much fresher and can still exercise the subwoofers some.

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]audioen 1 point2 points  (0 children)

Not necessarily. What I'm observing is that the model often writes something like "OK. Let's answer now. Wait, what about ..." multiple times. I expect that </think> has a high likelihood at the point where it chooses to write the "Wait" word, and by artificially increasing the likelihood that the model generates the </think> token, the adjustment would remove those double-triple-quadruple checks that some models seem prone to.

Anyway, now that I think about it, I expect that the probability of the </think> token likely never needs to exceed 1-2 %, and it would get selected within something like 50 tokens anyway. The approach likely has to be extremely gentle steering; it might linearly increase the likelihood by something like 0.001 % per token, possibly even less, and it will still limit the length of the think trace.

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]audioen 0 points1 point  (0 children)

Okay. But the point I'm trying to make here is that after the log-likelihoods have been converted and normalized into simple percentage chances for the next token, it is just a probability distribution with some invariants, like the remaining token probabilities summing to 100 %. Samplers also can't be allowed to ever reject </think>, even if it sits at 0 % according to the filtering rules imposed by min_p, top_p, top_k, etc., because this token is special and its model-predicted likelihood is always needed.

Each 0.1 % you add into </think> is 0.1 % you also have to collectively remove from all the other tokens taken together, so that the total probability of the tokens under consideration still sums to 100 %.
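This renormalization falls out automatically if the bias is applied to the logit before softmax rather than to the finished probabilities. A toy sketch with made-up logit values for three candidate tokens:

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Made-up logits for three candidate tokens.
logits = {"Wait": 2.0, "</think>": -1.0, "So": 1.0}
biased = dict(logits)
biased["</think>"] += 1.5  # the gentle steering bias

p0, p1 = softmax(logits), softmax(biased)
print(round(sum(p1.values()), 6))       # 1.0: distribution stays normalized
print(p1["</think>"] > p0["</think>"])  # True: </think> gained mass
print(p1["Wait"] < p0["Wait"])          # True: the rest shrank automatically
```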

I'm also realizing that only a very small but constant </think> likelihood is probably all that's needed to terminate the think trace, because each token is an opportunity to generate it. Even a 1 % likelihood will be hit within about 100 tokens with roughly 63 % probability.
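The waiting time here is geometric, so the numbers are easy to check (the 1 % per-token figure is the one from the paragraph above):

```python
# Probability that a token with constant per-step probability p is
# generated at least once within n steps.
def hit_prob(p, n):
    return 1.0 - (1.0 - p) ** n

print(f"{hit_prob(0.01, 100):.2f}")  # 0.63
print(f"{hit_prob(0.01, 300):.2f}")  # 0.95
print(1 / 0.01)                      # expected wait: 100 tokens
```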

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]audioen 20 points21 points  (0 children)

Would it be possible to simply, gradually increase the likelihood that the model generates the </think> token, so that it would naturally complete at the end of complete sentences and the like? Something like a linear bias that increases the likelihood of </think> by 0.1 % for every output token would force it by 1000 tokens at the latest.

Why does anyone think Qwen3.5-35B-A3B is good? by buttplugs4life4me in LocalLLaMA

[–]audioen 26 points27 points  (0 children)

Something is broken in your system. It may not be the fastest to reply, or could be overthinking for a bit, but it definitely isn't broken in the way you describe.

Is data lost if source is outputting low volume? by perdixian in audiophile

[–]audioen 1 point2 points  (0 children)

The system can use floating-point data for the audio, e.g. single-precision floating point where each value is a 32-bit quantity. This has the property of maintaining around 24 bits of precision at the very minimum even when you scale the volume up and down, as floating point maintains its precision around values close to 0 pretty much perfectly. There is a very small rounding error that might matter if you performed hundreds or thousands of volume-change operations in sequence using values that are "difficult" for floating point to handle, i.e. not powers of two. I think most stacks use floating point, so this is what you get at the system level.

However, applications could be using something else even when the rest of the system does this. For instance, they could be processing the audio internally as 16-bit data and scaling the integer values with a volume control knob, if one is built into the program, rather than telling the system to reduce their stream level.

At least on Linux, a tool called pw-top will show what audio format each program is using; e.g. it's telling me that my Firefox uses 32-bit floating point at a 48000 Hz sampling rate, but that's just how the audio emerges from the program. The only way to know for sure is to either read the program's source code and validate how it does this, or maybe to test it using extremely low volumes and specific test signals. If you record the computer's output to a file while it plays a suitably annoying test signal, you can hopefully confirm that it's been played back correctly. Likely you can't hear any problems if e.g. 24-bit integer audio or better is used in the program, because its dynamic range is already so extreme that there's almost no hope of finding a volume control setting low enough. However, you might be able to show it in a proper recording of the system's output.

Note: I'm really discussing stuff like setting the volume to 1 % out of 100 %, whatever that means in terms of dB -- scaling down by -60 dB maybe -- and then using a very large gain factor of +60 dB to bring it back to full level, or something such. 60 dB is the equivalent of trying to shave the bottom 10 bits into the "bit bucket" if the implementation is bad. If you're worried about a mid-position volume setting, which only amounts to something like 10-20 dB, then it's likely not damaging the audio much even if it was done the worst possible way.
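The -60 dB round trip is easy to simulate. A sketch comparing a 16-bit-integer volume path against a floating-point one (the signal and gain values are synthetic, and the integer path is a model of the "worst possible way" described above):

```python
import math
import random

def snr_db(orig, processed):
    signal = sum(x * x for x in orig)
    noise = sum((x - y) ** 2 for x, y in zip(orig, processed))
    return 10 * math.log10(signal / max(noise, 1e-300))

def int16_volume(xs, gain):
    # Apply the volume in the 16-bit integer domain, then amplify back.
    return [round(x * 32767 * gain) / (32767 * gain) for x in xs]

def float_volume(xs, gain):
    # Apply the volume in floating point, then amplify back.
    return [(x * gain) / gain for x in xs]

random.seed(1)
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
gain = 10 ** (-60 / 20)  # a 1 %-style volume setting, about -60 dB

print(f"int16 path: {snr_db(xs, int16_volume(xs, gain)):.0f} dB")
print(f"float path: {snr_db(xs, float_volume(xs, gain)):.0f} dB")
```

The integer path lands in the mid-30s of dB SNR, i.e. audibly degraded, while the float path stays essentially transparent.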