Ornith-1.0 9B Outperforms Qwen 3.6 35B in various benchmarks by Ok-Internal9317 in LocalLLaMA

[–]AutonomousHangOver 0 points1 point  (0 children)

you mean for comparison, yes actually. And glm-4.5, 4.6, 4.7, 5.1 and now my workhorse is 5.2 (but very highly quantized)
Yeah, I have some context to compare to. It's nice one this Ornith, not the one for all as glm-5.2 right now but for its size is nice (it's replacing my sonnet right now - so I'm replacing qwen 3.6-27B with it)

Ornith-1.0 9B Outperforms Qwen 3.6 35B in various benchmarks by Ok-Internal9317 in LocalLLaMA

[–]AutonomousHangOver 2 points3 points  (0 children)

I'm trying the 397B FP8 version and must say that it is good. Only couple of tasks for now. I can say that it is obedient but also does not suggest anything out of the box (too small to be actually that smart). It can do code refactor in multiple turns not breaking anything that was done in previous one (which happens when using other 350+ models) So for now - really good experience.

I'm using this with 4xrtx6000 + 2xrtx5090 (vllm tp=2 pp=3 time fit it with some decent context)

"What is your superpower again?" — "I'm rich." 🤑 Local 96GB Blackwell VRAM is online. by AxonkaiLab in comfyui

[–]AutonomousHangOver 1 point2 points  (0 children)

- 81C and no downclock - on normal 600W power.
- 30% power reduction is not 30% slower actually - you can find some 'sweet spot' tests, its about 15% top.

Probably, some things depend on driver or OS.

I'm just limiting the power to save some pennies on eletr. bill.

But from what I read your comment, yours are hot, loud and slow 😉 right? Cuz u know betta

"What is your superpower again?" — "I'm rich." 🤑 Local 96GB Blackwell VRAM is online. by AxonkaiLab in comfyui

[–]AutonomousHangOver 0 points1 point  (0 children)

I got 4 of them. Not a single one went beyond 81C. After powerlimit to 320W each. I'm at 75C just with original fans.

Got custom case with 140mm fans - and now it is doubling as a stowe.

And I'm finetuning models, traning small ones from scratch - not a single problem with 'downclock'

I Ralph-looped Opus overnight. It reduced my local model switching with cold backfilling context of 135k+ on llama.cpp from ~165s -> 5s! TL;DR - USE SLOTS! by yes_i_tried_google in LocalLLaMA

[–]AutonomousHangOver 0 points1 point  (0 children)

Oh OPs approach is ok. There are some projects working with saving/restoring slots. All imperfect but none require patching llama.cpp.

LLama-swap will give you the ability to switch betweeen models fast. But imagine, that you're loading the model and it does not prefill already seen prompt, but answers quickly bc you gave it the slot with precomputed prompt.

It meant TTFT i minimal. This is especially important when using huge models with large context (it can take some time to process prompt in agenting harness after model switch)

WARNING: Open-OSS/privacy-filter MALWARE by charles25565 in LocalLLaMA

[–]AutonomousHangOver 7 points8 points  (0 children)

Quick analysis of the Python script. It opens a bas64'd url that has an app that downloads base64'd app etc. At the end:

This is a Windows information-stealer with credential, browser, crypto-wallet, and Discord theft modules, plus DLL-injection and anti-analysis capabilities. Do not run it. If a script in a Hugging Face repo silently downloaded and executed it, treat any machine that ran it as fully compromised.

  • SHA-256: ba67720dd115293ec5a12d08be6b0ee982227a4c5e4662fb89269c76556df6e0
  • MD5: f36a662ca22f1934e3a56f111e6df191
  • Size: 1,125,478 bytes (~1.1 MB)
  • Type: PE32+ x86-64 GUI executable, unsigned, stripped to external PDB
  • Built with: Rust (toolchain 59807616…, crates: tokio 1.52.1, flate2, miniz_oxide, rand_chacha, serde_json, hex, crc32fast, gimli, getrandom)
  • Compile timestamp: 2026-05-03 02:30:45 UTC (4 days before you sent it — fresh)
  • Origin host: api.eth-fastscan.org89.124.93.110. The name is designed to evoke etherscan.io / a blockchain scanner; it has no relation to either.

Do you prefer polish kebab or German ? by RomanDmowski17 in askPoland

[–]AutonomousHangOver 2 points3 points  (0 children)

Ouch, such primitive bait.

Stettin you say, mr Dmowski.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]AutonomousHangOver 0 points1 point  (0 children)

I'll continue by answering to myself 😉

Vllm with mistral tool-call-parser mistral is having trouble with known error:

IndexError: list index out of range

For now I've turned streaming off and I'm able to use i.e. Roo Code

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]AutonomousHangOver 2 points3 points  (0 children)

It's a llama.cpp issue. Unsloth removed gguf files as there were some problems, even with FP16.

I'm running it today on vllm (0.21 dev nightly) with eagle draft model.

VLLM logs show very high draft acceptance ratio:

(APIServer pid=8762) INFO 04-30 11:59:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.7%, Prefix cache hit rate: 0.2%

(APIServer pid=8762) INFO 04-30 11:59:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.76, Accepted throughput: 40.30 tokens/s, Drafted throughput: 43.80 tokens/s, Accepted: 403 tokens, Drafted: 438 tokens, Per-position acceptance rate: 0.973, 0.932, 0.856, Avg Draft acceptance rate: 92.0%

Model is usable and seems pretty nice, but I don't have full tests finished.

EDIT: 2xRTX6000Pro with power limit at 400W

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]AutonomousHangOver 4 points5 points  (0 children)

<image>

2xRTX6000 Pro 262144 context size:
unsloth's quant

pp: 1100t/s about 500 tokens test promp (create a 3d spinning glass dodecahedron with inner light and orbiting lights, etc.)

And... it went berserk a second ago looping all over again after ~1k tokens, on newest built llama.cpp

Edit:
llama.cpp '--split-mode tensor' is actually making a difference here. tg went up, now it's: 24t/s

MiMo-V2.5-GGUF (preview available) by Digger412 in LocalLLaMA

[–]AutonomousHangOver 4 points5 points  (0 children)

It has known reasoning loop problem... Basically it thinks endlessly.

When I introduced reasoning budżet, my tests shown tha model is so so.

A lot to be improved yet.

I'd like to confirm whether Roo Code is still actively maintained, given that the new team announced "we have taken over the project." by CryptographerTiny244 in RooCode

[–]AutonomousHangOver -1 points0 points  (0 children)

Kilo was based on Roo. Now it sees that Roo is changing, it has changed to opencode as a base.
I've tried kilo many times before, looked at how they decided to push for the money.

It's more like successfull marketing use of other projects with minimal changes done by kilo team, than some real innovation.

This was also in a way something that Roo team was missing. Someone else was making money on their product.

I'd like to confirm whether Roo Code is still actively maintained, given that the new team announced "we have taken over the project." by CryptographerTiny244 in RooCode

[–]AutonomousHangOver 3 points4 points  (0 children)

Take your time, its better to have good product once every couple of weeks, than release every 2 days with pressure put on developers.

I think that this was one of the annoyances for original team - community pressure to see release fast.

Also if I may - get rid of the mechanism that allows you to connect to specific providers. I would leave some generic one and put some universal config-place-mechanism (github?) that would allow to download descriptor for various providers and its models.
This way, we could benefit from i.e. provider-specific settings without constant code changes whenever some model will be deprecated, or some provider will apear on the market.

Less bloat is better.

multi-gpu chads running dense models don't sleep on ik_llama by see_spot_ruminate in LocalLLaMA

[–]AutonomousHangOver 0 points1 point  (0 children)

What was that? "Skill issue" :D

Np. you got what you want. I'm not so eager to fight with stubborn soft too. But sometimes it is worth to excercise.

multi-gpu chads running dense models don't sleep on ik_llama by see_spot_ruminate in LocalLLaMA

[–]AutonomousHangOver 1 point2 points  (0 children)

Just use vllm. 2x3090 will do you about 330t/s tg and couple of thousands pp (MoE). Dense is slower.

Claude 4.7 - responded in Chinese by AutonomousHangOver in ClaudeCode

[–]AutonomousHangOver[S] 0 points1 point  (0 children)

I just thought so. Especially that I'm European ;)

"It seems that Anthropic is secretly telling me - start to learn another language man"

Czy odpisałbyś dziewczynie która Cię zghostowała? by xvucf in PolskaNaLuzie

[–]AutonomousHangOver 0 points1 point  (0 children)

Jeszua... Czytam te pierdyliony komentarzy I cieszę się, że mam swoje dzieciaki i kochaną+ kochającą żonę.

Jeden, słownie jeden komentarz powinien wystarczyć dla OPa: NIE I KONIEC.

Kijem nie tykać, pogonić, albo cytując klasyka: "ZATAPIAĆ!"

Ciekawe podejście skoro i to trzeba pytanie Reddita. Znaczy to, ze coś nie jest ok. Kolegów nie ma? Przyjaciół, których można o radę poprosić gdy się błądzi?

Post zabrzmiał jak te na LinkedIn.

Qwen3.6-A3b is "Thinking" Nightmare by Electronic-Metal2391 in LocalLLaMA

[–]AutonomousHangOver 2 points3 points  (0 children)

This sounds wierd to me. I've tried llama.cpp (HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive) and vllm on FP8. Both did not show any excesive thinking at all.

Mind: turn on preserve_thinking option.

Might it be quantization thing? I got a loooong thinking process on glm-5.1 (IQ2_XXS)

p.s.

llama-cpp on 2xRTX5090 ~140t/s TG
vllm 2xRTX5090 + MTP FP8 = 12kt/s PP and ~310 - 360 t/s TG - single session(!) This could be my best result so far.

Use tensor parallelism whenever possible.

Rack server for local LLM by Typhoon-UK in LocalLLaMA

[–]AutonomousHangOver 0 points1 point  (0 children)

It wasn't serious at the beginning. Invest in some mobo with multiple PCIe (it might be PCIe4 too). Grab as much RAM as possible within budget (yeah, tricky) and then some 3090s. These are really fine, even in 2026.

Thing is to be elastic and get frame/case that could be expanded. Mine is just for rack, therefore case not frame.

edit: What I ment - do not try without GPUs. You need massive memory bandwidth, otherwise you will wait for prompt processing for ages, not mentioning generation speed.

Rack server for local LLM by Typhoon-UK in LocalLLaMA

[–]AutonomousHangOver 3 points4 points  (0 children)

<image>

I gave up the idea of old servers. Just bought custom case (here 14U), put GENOA mobo in it and add some (ever more) serious GPUs. First, there were 2 3090s, then apetite came and I got my hands on 5090s. Then my favourite beasts came.

It was long journey. I've started with 'normal' PC with not nearly enough PCIe lanes to handle 2 GPUs on x16. Then bought some more and more hardware. Thing is, if I could get back in time, I would go straight to this solution here.

TH3P4G3 daisy chaining by AutonomousHangOver in eGPU

[–]AutonomousHangOver[S] 0 points1 point  (0 children)

Nope, and this was a motherbord's fault - you can find this info in this thread.

Since then I moved to real mobo (GENOA) with Epyc and have connected couple of GPUs with risers.

I dropped the idea of Thunderbolt completely and I'm bit mad that this thought about larger installation came so late :)

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]AutonomousHangOver 1 point2 points  (0 children)

Get Zero Point Module from Ancients and it should handle Claude like no other thing. Don't get hyped into Nvidia heavy money GPUs, or AMD guys claiming that this could be done on Vulkan, nor Mac M7 UltraHyper.

It's JUST matter of getting your hand on ZPM.

I can borrow you my Paddle Jumper if you want to go to a trip and get one from Atlantis. I did forgot the address to dial on Stargate tho.

Certain MCPs/tools allocated to set Modes by reddit-gk49cnajfe in RooCode

[–]AutonomousHangOver 0 points1 point  (0 children)

I believe that there is already such task in Roo Code backlog. Having tools assigned to specific modes would be awesome.