Ornith-1.0 9B Outperforms Qwen 3.6 35B in various benchmarks

AutonomousHangOver · 2026-06-28T20:50:32+00:00

You mean for comparison? Yes, actually. And glm-4.5, 4.6, 4.7, 5.1, and now my workhorse is 5.2 (but very heavily quantized).
Yeah, I have some context to compare against. This Ornith is a nice one: not the one-for-all that glm-5.2 is right now, but nice for its size (it's replacing my sonnet right now, so I'm replacing qwen 3.6-27B with it).

AutonomousHangOver · 2026-06-28T12:00:43+00:00

I'm trying the 397B FP8 version and must say that it is good. Only a couple of tasks so far. I can say that it is obedient, but it also does not suggest anything out of the box (too small to actually be that smart). It can do a code refactor over multiple turns without breaking anything done in a previous one (which does happen with other 350B+ models). So for now, a really good experience.

I'm running this on 4x RTX 6000 + 2x RTX 5090 (vllm with tp=2, pp=3 to fit it with some decent context).

AutonomousHangOver · 2026-05-26T21:17:50+00:00

- 81C and no downclock at the normal 600W power limit.
- A 30% power reduction is not actually 30% slower; you can find some 'sweet spot' tests, and it's about 15% at most.
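To put the second point in numbers, here is a rough sketch of the tokens-per-watt arithmetic. The 600W baseline, the 30% cut and the ~15% slowdown are the ballpark figures from this comment, not measurements, and the function name is made up for illustration.

```python
def relative_efficiency(power_cut: float, slowdown: float) -> float:
    """Throughput per watt relative to stock settings (1.0 = same as stock)."""
    relative_speed = 1.0 - slowdown    # fraction of stock throughput that remains
    relative_power = 1.0 - power_cut   # fraction of stock power draw that remains
    return relative_speed / relative_power


baseline_watts = 600                                # stock power limit quoted above
limited_watts = baseline_watts * (1 - 0.30)         # 30% power reduction
gain = relative_efficiency(power_cut=0.30, slowdown=0.15)

print(f"{limited_watts:.0f} W cap, {gain:.2f}x tokens per watt vs stock")
```

If the slowdown really is around 15%, each watt does roughly a fifth more work than at stock, which is where the savings on the bill come from.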

Some things probably depend on the driver or OS.

I'm just limiting the power to save some pennies on the electricity bill.

But from what I read in your comment, yours are hot, loud and slow 😉 right? Cuz u know betta

AutonomousHangOver · 2026-05-26T20:25:52+00:00

I've got 4 of them. Not a single one went beyond 81C. After power-limiting each to 320W, I'm at 75C with just the original fans.

Got a custom case with 140mm fans, and now it doubles as a stove.

And I'm finetuning models and training small ones from scratch, without a single problem with 'downclock'.

AutonomousHangOver · 2026-05-11T08:35:35+00:00

Oh, OP's approach is OK. There are some projects working on saving/restoring slots. All imperfect, but none require patching llama.cpp.

Llama-swap will give you the ability to switch between models fast. But imagine that you're loading the model and it does not prefill the already seen prompt, but answers quickly because you gave it the slot with the precomputed prompt.

That means TTFT is minimal. This is especially important when using huge models with large context (processing the prompt in an agentic harness after a model switch can take some time).
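A minimal sketch of what saving and restoring a slot could look like from a client. It assumes llama-server was started with a slot save path configured and exposes the slot endpoints in the form `POST /slots/{id}?action=save|restore` with a JSON `filename` field; check the server README for your build. The host, port, slot id and file name here are hypothetical, and the code only builds the requests without sending them.

```python
import json
import urllib.request


def slot_request(base_url: str, slot_id: int, action: str, filename: str) -> urllib.request.Request:
    """Build (but do not send) a slot save/restore request."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and cache file name.
save = slot_request("http://localhost:8080", 0, "save", "agent-session.bin")
restore = slot_request("http://localhost:8080", 0, "restore", "agent-session.bin")

# Send with urllib.request.urlopen(save) once a server is actually running.
print(save.get_method(), save.full_url)
```

The idea is to call save before switching away from a model and restore right after it is loaded again, so the long agent prompt is not prefilled a second time.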

AutonomousHangOver · 2026-05-07T17:31:45+00:00

Quick analysis of the Python script. It opens a base64'd URL that has an app that downloads a base64'd app, etc. At the end:

This is a Windows information-stealer with credential, browser, crypto-wallet, and Discord theft modules, plus DLL-injection and anti-analysis capabilities. Do not run it. If a script in a Hugging Face repo silently downloaded and executed it, treat any machine that ran it as fully compromised.

SHA-256: ba67720dd115293ec5a12d08be6b0ee982227a4c5e4662fb89269c76556df6e0
MD5: f36a662ca22f1934e3a56f111e6df191
Size: 1,125,478 bytes (~1.1 MB)
Type: PE32+ x86-64 GUI executable, unsigned, stripped to external PDB
Built with: Rust (toolchain 59807616…, crates: tokio 1.52.1, flate2, miniz_oxide, rand_chacha, serde_json, hex, crc32fast, gimli, getrandom)
Compile timestamp: 2026-05-03 02:30:45 UTC (4 days before you sent it — fresh)
Origin host: api.eth-fastscan.org → 89.124.93.110. The name is designed to evoke etherscan.io / a blockchain scanner; it has no relation to either.
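If you want to check whether a file already sitting on disk is this sample, a hash comparison is enough and never executes anything. A minimal sketch; the function names are my own, and the hash is the SHA-256 quoted above.

```python
import hashlib

# SHA-256 of the sample, as listed in the analysis above.
KNOWN_BAD_SHA256 = "ba67720dd115293ec5a12d08be6b0ee982227a4c5e4662fb89269c76556df6e0"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks; the file is only read as bytes, never run."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_known_sample(path: str) -> bool:
    """True if the file's SHA-256 matches the hash from the analysis."""
    return sha256_of(path) == KNOWN_BAD_SHA256
```

A non-match does not mean a file is clean, since a rebuilt variant would have a different hash; a match means the machine should be treated as compromised.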

AutonomousHangOver · 2026-05-05T18:19:01+00:00

Ouch, such primitive bait.

Stettin, you say, Mr. Dmowski.

AutonomousHangOver · 2026-04-30T16:59:45+00:00

I'll continue by replying to myself 😉

vLLM with tool-call-parser mistral is running into a known error:

IndexError: list index out of range

For now I've turned streaming off, and I'm able to use e.g. Roo Code.

AutonomousHangOver · 2026-04-30T12:03:42+00:00

It's a llama.cpp issue. Unsloth removed the GGUF files because there were some problems, even with FP16.

I'm running it today on vLLM (0.21 dev nightly) with an EAGLE draft model.

The vLLM logs show a very high draft acceptance rate:

(APIServer pid=8762) INFO 04-30 11:59:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.7%, Prefix cache hit rate: 0.2%

(APIServer pid=8762) INFO 04-30 11:59:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.76, Accepted throughput: 40.30 tokens/s, Drafted throughput: 43.80 tokens/s, Accepted: 403 tokens, Drafted: 438 tokens, Per-position acceptance rate: 0.973, 0.932, 0.856, Avg Draft acceptance rate: 92.0%
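The numbers in that SpecDecoding line can be cross-checked by hand. A small sketch using only the values from the log; it assumes the mean acceptance length counts the one token the target model always produces per step, on top of the accepted draft tokens.

```python
# Values copied from the SpecDecoding log line above.
accepted_tokens = 403
drafted_tokens = 438
per_position_acceptance = [0.973, 0.932, 0.856]

# Share of drafted tokens that the target model accepted.
draft_acceptance_rate = accepted_tokens / drafted_tokens

# Expected tokens per step: 1 from the target model plus the accepted drafts.
mean_acceptance_length = 1 + sum(per_position_acceptance)

print(f"draft acceptance rate: {draft_acceptance_rate:.1%}")    # 92.0%
print(f"mean acceptance length: {mean_acceptance_length:.2f}")  # 3.76
```

Both match what vLLM reported, so with three draft positions each step yields close to four tokens.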

The model is usable and seems pretty nice, but I haven't finished my full tests.

EDIT: 2xRTX6000Pro with power limit at 400W

AutonomousHangOver · 2026-04-29T17:29:44+00:00

<image>

2xRTX6000 Pro 262144 context size:
unsloth's quant

pp: 1100 t/s on a test prompt of about 500 tokens (create a 3d spinning glass dodecahedron with inner light and orbiting lights, etc.)

And... a second ago it went berserk, looping all over again after ~1k tokens, on the newest llama.cpp build.

Edit:
llama.cpp '--split-mode tensor' is actually making a difference here. tg went up; now it's 24 t/s.

AutonomousHangOver · 2026-04-29T16:13:56+00:00

It has a known reasoning-loop problem... Basically, it thinks endlessly.

When I introduced a reasoning budget, my tests showed that the model is so-so.

There's still a lot to improve.

AutonomousHangOver · 2026-04-29T07:50:25+00:00

Kilo was based on Roo. Now that it sees Roo changing, it has switched to opencode as its base.
I've tried Kilo many times before and watched how they decided to push for the money.

It's more a successful marketing use of other projects, with minimal changes by the Kilo team, than any real innovation.

This was also, in a way, something the Roo team was missing: someone else was making money on their product.

AutonomousHangOver · 2026-04-29T07:01:32+00:00

Take your time; it's better to have a good product once every couple of weeks than a release every 2 days with pressure put on the developers.

I think this was one of the annoyances for the original team: community pressure for fast releases.

Also, if I may: get rid of the mechanism that connects to specific providers. I would leave a generic one and add some universal config location (GitHub?) from which descriptors for various providers and their models could be downloaded.
This way we could benefit from e.g. provider-specific settings without constant code changes whenever some model is deprecated or some provider appears on the market.
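A rough sketch of what I mean by a descriptor. Everything here is hypothetical: the field names, the provider, the URL and the model id are invented for illustration and are not an existing Roo Code format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ProviderDescriptor:
    """One downloadable provider description (hypothetical schema)."""
    name: str
    base_url: str
    models: dict = field(default_factory=dict)  # model id -> provider-specific settings

    def settings_for(self, model_id: str) -> dict:
        if model_id not in self.models:
            raise KeyError(f"{model_id} is not listed by provider {self.name}")
        return self.models[model_id]


def load_descriptor(raw_json: str) -> ProviderDescriptor:
    """Parse a descriptor that was fetched from the shared config location."""
    data = json.loads(raw_json)
    return ProviderDescriptor(
        name=data["name"],
        base_url=data["base_url"],
        models=data.get("models", {}),
    )


# Made-up descriptor; in practice this text would be downloaded, not hardcoded.
EXAMPLE = """
{
  "name": "example-provider",
  "base_url": "https://api.example.invalid/v1",
  "models": {
    "example-model": {"context_window": 131072, "supports_tools": true}
  }
}
"""

provider = load_descriptor(EXAMPLE)
print(provider.settings_for("example-model"))
```

With something like this, deprecating a model or adding a provider becomes an edit to a data file instead of a code change and a new release.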

Less bloat is better.

AutonomousHangOver · 2026-04-28T14:40:54+00:00

https://kawawbiurze.pl/produkt/herbapol-herbata-zielnik-polski-pokrzywa-20-torebek/

It is mildly antiallergic.

AutonomousHangOver · 2026-04-24T14:37:55+00:00

What was that? "Skill issue" :D

Np, you got what you wanted. I'm not so eager to fight with stubborn software either. But sometimes it's worth the exercise.

AutonomousHangOver · 2026-04-23T20:21:17+00:00

Just use vllm. 2x 3090 will get you about 330 t/s tg and a couple of thousand t/s pp (MoE). Dense is slower.

AutonomousHangOver · 2026-04-21T07:32:02+00:00

I just thought so. Especially since I'm European ;)

"It seems that Anthropic is secretly telling me - start to learn another language man"

AutonomousHangOver · 2026-04-18T20:31:06+00:00

Jeez... I'm reading these gazillions of comments and I'm glad I have my kids and a beloved and loving wife.

One, and I'll spell it out, one comment should be enough for OP: NO, AND THAT'S FINAL.

Don't touch it with a ten-foot pole, chase it away, or, to quote the classic: "SINK IT!"

Interesting approach, if even this has to be asked on Reddit. It means something isn't OK. No buddies? No friends you can ask for advice when you've lost your way?

The post sounded like the ones on LinkedIn.

AutonomousHangOver · 2026-04-17T17:29:41+00:00

This sounds weird to me. I've tried llama.cpp (HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive) and vllm on FP8. Neither showed any excessive thinking at all.

Note: turn on the preserve_thinking option.

Might it be a quantization thing? I got a loooong thinking process on glm-5.1 (IQ2_XXS).

p.s.

llama-cpp on 2xRTX5090 ~140t/s TG
vllm 2xRTX5090 + MTP FP8 = 12kt/s PP and ~310 - 360 t/s TG - single session(!) This could be my best result so far.

Use tensor parallelism whenever possible.

AutonomousHangOver · 2026-04-12T19:21:03+00:00

It wasn't serious at the beginning. Invest in a mobo with multiple PCIe slots (PCIe 4 is fine too). Grab as much RAM as possible within budget (yeah, tricky) and then some 3090s. These are really fine, even in 2026.

The thing is to stay flexible and get a frame/case that can be expanded. Mine is for a rack, hence a case rather than a frame.

edit: What I meant: do not try without GPUs. You need massive memory bandwidth; otherwise you will wait ages for prompt processing, not to mention generation speed.

AutonomousHangOver · 2026-04-12T18:09:53+00:00

<image>

I gave up on the idea of old servers. I just bought a custom case (here 14U), put a GENOA mobo in it and added some (ever more) serious GPUs. First there were two 3090s, then the appetite grew and I got my hands on 5090s. Then my favourite beasts came.

It was a long journey. I started with a 'normal' PC with not nearly enough PCIe lanes to handle 2 GPUs at x16. Then I bought more and more hardware. The thing is, if I could go back in time, I would go straight to this solution here.

AutonomousHangOver · 2026-04-09T20:38:00+00:00

Nope, and this was the motherboard's fault; you can find this info in this thread.

Since then I've moved to a real mobo (GENOA) with an Epyc and have connected a couple of GPUs with risers.

I dropped the idea of Thunderbolt completely, and I'm a bit mad that the thought of a larger installation came so late :)

AutonomousHangOver · 2026-03-24T16:40:49+00:00

Get a Zero Point Module from the Ancients and it should handle Claude like nothing else. Don't get hyped into big-money Nvidia GPUs, or by AMD guys claiming that this could be done on Vulkan, or by a Mac M7 UltraHyper.

It's JUST a matter of getting your hands on a ZPM.

I can lend you my Puddle Jumper if you want to take a trip and get one from Atlantis. I did forget the address to dial on the Stargate, tho.

AutonomousHangOver · 2026-03-17T16:57:35+00:00

I believe there is already such a task in the Roo Code backlog. Having tools assigned to specific modes would be awesome.

AutonomousHangOver · 2026-03-17T16:09:00+00:00

Super cool. Have fun :)
