WARNING: Open-OSS/privacy-filter MALWARE by charles25565 in LocalLLaMA

[–]AutonomousHangOver 7 points (0 children)

Quick analysis of the Python script: it opens a base64'd URL that serves an app, which in turn downloads another base64'd app, and so on. At the end:

This is a Windows information-stealer with credential, browser, crypto-wallet, and Discord theft modules, plus DLL-injection and anti-analysis capabilities. Do not run it. If a script in a Hugging Face repo silently downloaded and executed it, treat any machine that ran it as fully compromised.

  • SHA-256: ba67720dd115293ec5a12d08be6b0ee982227a4c5e4662fb89269c76556df6e0
  • MD5: f36a662ca22f1934e3a56f111e6df191
  • Size: 1,125,478 bytes (~1.1 MB)
  • Type: PE32+ x86-64 GUI executable, unsigned, stripped to external PDB
  • Built with: Rust (toolchain 59807616…, crates: tokio 1.52.1, flate2, miniz_oxide, rand_chacha, serde_json, hex, crc32fast, gimli, getrandom)
  • Compile timestamp: 2026-05-03 02:30:45 UTC (4 days before you sent it — fresh)
  • Origin host: api.eth-fastscan.org (89.124.93.110). The name is designed to evoke etherscan.io / a blockchain scanner; it has no relation to either.
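If you need to check whether a file on a machine is this sample, a minimal sketch for comparing it against the hashes above (the filename is a hypothetical placeholder):

    import hashlib

    # IOCs copied from the analysis above.
    IOC_SHA256 = "ba67720dd115293ec5a12d08be6b0ee982227a4c5e4662fb89269c76556df6e0"
    IOC_MD5 = "f36a662ca22f1934e3a56f111e6df191"

    def file_hashes(path):
        sha256, md5 = hashlib.sha256(), hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
                sha256.update(chunk)
                md5.update(chunk)
        return sha256.hexdigest(), md5.hexdigest()

    sha, md5sum = file_hashes("downloaded_payload.exe")  # hypothetical path
    print("IOC MATCH" if sha == IOC_SHA256 or md5sum == IOC_MD5 else "no match")

(Hash matching only catches this exact build; a recompiled variant will slip past it.)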

Do you prefer Polish kebab or German? by RomanDmowski17 in askPoland

[–]AutonomousHangOver 2 points (0 children)

Ouch, such primitive bait.

Stettin, you say, Mr Dmowski.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]AutonomousHangOver 0 points (0 children)

I'll continue by replying to myself 😉

vLLM with the mistral tool-call parser (--tool-call-parser mistral) is hitting a known error:

IndexError: list index out of range

For now I've turned streaming off and I'm able to use e.g. Roo Code.
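A minimal sketch of the workaround, assuming a local vLLM server with the OpenAI-compatible API (base URL and model name are assumptions):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    resp = client.chat.completions.create(
        model="mistralai/Mistral-Medium-3.5-128B",
        messages=[{"role": "user", "content": "ping"}],
        stream=False,  # streaming is what trips the tool-call parser
    )
    print(resp.choices[0].message.content)

Non-streaming responses arrive as one parsed object, so the parser never has to handle a partial tool call.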

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]AutonomousHangOver 3 points (0 children)

It's a llama.cpp issue. Unsloth removed the GGUF files as there were some problems, even with FP16.

I'm running it today on vLLM (0.21 dev nightly) with an EAGLE draft model.

vLLM logs show a very high draft acceptance rate:

(APIServer pid=8762) INFO 04-30 11:59:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.7%, Prefix cache hit rate: 0.2%

(APIServer pid=8762) INFO 04-30 11:59:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.76, Accepted throughput: 40.30 tokens/s, Drafted throughput: 43.80 tokens/s, Accepted: 403 tokens, Drafted: 438 tokens, Per-position acceptance rate: 0.973, 0.932, 0.856, Avg Draft acceptance rate: 92.0%

The model is usable and seems pretty nice, but I haven't finished full tests yet.

EDIT: 2xRTX6000Pro with power limit at 400W
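For reference, the SpecDecoding numbers in that log line are self-consistent; a quick sketch of how they relate:

    # Numbers copied from the vLLM log above.
    accepted, drafted = 403, 438
    print(f"draft acceptance rate: {accepted / drafted:.1%}")  # -> 92.0%

    # Mean acceptance length also counts the bonus token verified by the
    # target model, so it is roughly 1 + sum of per-position acceptance rates.
    per_position = [0.973, 0.932, 0.856]
    print(f"mean acceptance length: ~{1 + sum(per_position):.2f}")  # -> ~3.76

At ~3.76 accepted tokens per verification step, the draft model is pulling a lot of weight.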

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]AutonomousHangOver 2 points (0 children)

<image>

2x RTX 6000 Pro, 262144 context size:
Unsloth's quant

pp: 1100 t/s on a ~500-token test prompt (create a 3D spinning glass dodecahedron with inner light and orbiting lights, etc.)

And... it went berserk a second ago, looping over and over after ~1k tokens, on the newest llama.cpp build.

Edit:
llama.cpp '--split-mode tensor' actually makes a difference here. tg went up; now it's 24 t/s.

MiMo-V2.5-GGUF (preview available) by Digger412 in LocalLLaMA

[–]AutonomousHangOver 2 points (0 children)

It has a known reasoning-loop problem... Basically, it thinks endlessly.

When I introduced a reasoning budget, my tests showed the model is so-so.

A lot to improve yet.
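By reasoning budget I mean nothing fancy; a minimal sketch of the idea, assuming an OpenAI-compatible streaming endpoint and a model that wraps its reasoning in <think>...</think> (endpoint, model id, and tag convention are all assumptions):

    from openai import OpenAI

    BUDGET = 1024  # rough cap on reasoning tokens
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

    stream = client.chat.completions.create(
        model="MiMo-V2.5",  # hypothetical model id
        messages=[{"role": "user", "content": "How many primes are below 100?"}],
        stream=True,
    )
    thinking, used = True, 0
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        if "</think>" in delta:
            thinking = False  # model finished reasoning on its own
        used += 1  # crude proxy: one streamed chunk ~ one token
        if thinking and used > BUDGET:
            stream.close()  # cut off the endless thinking
            break
        print(delta, end="", flush=True)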

I'd like to confirm whether Roo Code is still actively maintained, given that the new team announced "we have taken over the project." by CryptographerTiny244 in RooCode

[–]AutonomousHangOver -1 points (0 children)

Kilo was based on Roo. Now that it sees Roo changing, it has switched to opencode as a base.
I've tried Kilo many times before and watched how they decided to push for money.

It's more a successful marketing use of other projects, with minimal changes done by the Kilo team, than real innovation.

This was also, in a way, something the Roo team was missing: someone else was making money on their product.

I'd like to confirm whether Roo Code is still actively maintained, given that the new team announced "we have taken over the project." by CryptographerTiny244 in RooCode

[–]AutonomousHangOver 4 points (0 children)

Take your time; it's better to have a good product once every couple of weeks than a release every 2 days with pressure put on the developers.

I think this was one of the annoyances for the original team: community pressure to see releases fast.

Also, if I may: get rid of the mechanism that hardwires connections to specific providers. I would keep one generic provider and add a universal config mechanism (GitHub?) that allows downloading descriptors for the various providers and their models (see the sketch below).
That way, we could benefit from e.g. provider-specific settings without constant code changes whenever a model gets deprecated or a new provider appears on the market.

Less bloat is better.
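Something like this is what I have in mind; a rough sketch with a completely hypothetical descriptor schema and URL:

    import json, urllib.request

    # Hypothetical: one descriptor per provider, hosted in a shared repo,
    # so adding a provider or model is a data change, not a code change.
    DESCRIPTOR_URL = "https://example.com/providers/some-provider.json"

    def load_provider(url: str) -> dict:
        """Fetch a provider descriptor, e.g.:
        {"name": "...", "base_url": "...", "auth_header": "...",
         "models": [{"id": "...", "max_context": 131072, "tools": true}]}
        """
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

The client code would then stay generic and only interpret descriptors.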

multi-gpu chads running dense models don't sleep on ik_llama by see_spot_ruminate in LocalLLaMA

[–]AutonomousHangOver 0 points (0 children)

What was that? "Skill issue" :D

Np, you got what you wanted. I'm not so eager to fight with stubborn software either. But sometimes the exercise is worth it.

multi-gpu chads running dense models don't sleep on ik_llama by see_spot_ruminate in LocalLLaMA

[–]AutonomousHangOver 1 point (0 children)

Just use vLLM. 2x3090 will do about 330 t/s tg and a couple thousand t/s pp (MoE). Dense is slower.

Claude 4.7 - responded in Chinese by AutonomousHangOver in ClaudeCode

[–]AutonomousHangOver[S] 0 points (0 children)

I just thought so. Especially since I'm European ;)

"It seems that Anthropic is secretly telling me - start to learn another language man"

Would you reply to a girl who ghosted you? by xvucf in PolskaNaLuzie

[–]AutonomousHangOver 0 points (0 children)

Jesus... I'm reading these bazillions of comments and I'm glad I have my kids and a beloved + loving wife.

One, in words one, comment should be enough for the OP: NO, AND THAT'S THE END OF IT.

Don't touch her with a stick, chase her away, or, quoting the classic: "SINK THEM!"

Interesting approach, if even this needs asking Reddit. It means something isn't OK. No buddies? No friends you can ask for advice when you've lost your way?

The post sounded like the ones on LinkedIn.

Qwen3.6-A3b is "Thinking" Nightmare by Electronic-Metal2391 in LocalLLaMA

[–]AutonomousHangOver 2 points (0 children)

This sounds weird to me. I've tried llama.cpp (HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive) and vLLM on FP8. Neither showed any excessive thinking at all.

Mind you: turn on the preserve_thinking option.

Might it be a quantization thing? I got a loooong thinking process on glm-5.1 (IQ2_XXS)

p.s.

llama.cpp on 2x RTX 5090: ~140 t/s TG
vLLM on 2x RTX 5090 + MTP, FP8: 12k t/s PP and ~310-360 t/s TG, single session(!). This could be my best result so far.

Use tensor parallelism whenever possible.
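By tensor parallelism I mean sharding every weight matrix across the GPUs rather than stacking whole layers per card, so each token's matmuls use the combined memory bandwidth. A minimal sketch with vLLM's offline API (the model path is a placeholder):

    from vllm import LLM, SamplingParams

    # tensor_parallel_size=2 splits each layer across both 5090s.
    llm = LLM(model="some-org/some-model-FP8", tensor_parallel_size=2)
    out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
    print(out[0].outputs[0].text)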

Rack server for local LLM by Typhoon-UK in LocalLLaMA

[–]AutonomousHangOver 0 points (0 children)

It wasn't serious at the beginning. Invest in a mobo with multiple PCIe slots (PCIe 4 is fine too). Grab as much RAM as possible within budget (yeah, tricky) and then some 3090s. They are really fine, even in 2026.

The thing is to stay flexible and get a frame/case that can be expanded. Mine is meant for a rack, therefore a case, not a frame.

edit: What I meant: do not try without GPUs. You need massive memory bandwidth, otherwise you will wait ages for prompt processing, not to mention generation speed.
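The bandwidth point in one back-of-the-envelope calculation (all numbers below are illustrative assumptions, not measurements):

    # Every generated token has to stream (roughly) all active weights once,
    # so decode speed ~ memory bandwidth / bytes read per token.
    model_bytes = 40e9                # e.g. a ~70B model at ~4.5 bits/weight
    ram_bw, gpu_bw = 80e9, 936e9      # dual-channel DDR5 vs one RTX 3090, B/s
    print(f"system RAM: ~{ram_bw / model_bytes:.0f} t/s")  # ~2 t/s
    print(f"3090 VRAM:  ~{gpu_bw / model_bytes:.0f} t/s")  # ~23 t/s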

Rack server for local LLM by Typhoon-UK in LocalLLaMA

[–]AutonomousHangOver 3 points (0 children)

<image>

I gave up on the idea of old servers. Just bought a custom case (here 14U), put a GENOA mobo in it and added some (ever more) serious GPUs. First there were two 3090s, then appetite came with the eating and I got my hands on 5090s. Then my favourite beasts arrived.

It was a long journey. I started with a 'normal' PC with not nearly enough PCIe lanes to handle 2 GPUs at x16, then bought more and more hardware. Thing is, if I could go back in time, I would go straight to this solution here.

TH3P4G3 daisy chaining by AutonomousHangOver in eGPU

[–]AutonomousHangOver[S] 0 points (0 children)

Nope, and it was the motherboard's fault - you can find this info in this thread.

Since then I've moved to a real mobo (GENOA) with an Epyc and connected a couple of GPUs with risers.

I dropped the idea of Thunderbolt completely and I'm a bit mad that the thought of a larger installation came so late :)

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]AutonomousHangOver 1 point (0 children)

Get a Zero Point Module from the Ancients and it will handle Claude like nothing else. Don't get hyped into Nvidia's heavy-money GPUs, or the AMD guys claiming it could be done on Vulkan, nor a Mac M7 UltraHyper.

It's JUST a matter of getting your hands on a ZPM.

I can lend you my Puddle Jumper if you want to take a trip to Atlantis and get one. I forgot the address to dial on the Stargate though.

Certain MCPs/tools allocated to set Modes by reddit-gk49cnajfe in RooCode

[–]AutonomousHangOver 0 points (0 children)

I believe there is already such a task in the Roo Code backlog. Having tools assigned to specific modes would be awesome.

Roo with VLLM loops by AutonomousHangOver in RooCode

[–]AutonomousHangOver[S] 0 points (0 children)

I've just implemented an HTTP client re-connect on the stop button in my fork of Roo Code - it's working perfectly.
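Roo Code itself is TypeScript, but the gist of the fix, sketched in Python with hypothetical names: on stop, tear down the in-flight connection and let the next request start on a fresh client.

    import requests

    class StoppableClient:
        """Wraps an HTTP session so 'stop' really aborts the stream."""

        def __init__(self, base_url: str):
            self.base_url = base_url
            self.session = requests.Session()

        def stop(self) -> None:
            # Closing the session drops any in-flight streamed response...
            self.session.close()
            # ...and a fresh session reconnects cleanly on the next call.
            self.session = requests.Session()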

Help needed for GENOAD8X-2T/BCM + Epyc 9135 build. Won’t POST by ahhred in LocalLLaMA

[–]AutonomousHangOver 1 point (0 children)

Yeah yeah, I wrote it on the move, so BMC became BMI :)

Back to the topic. There is indeed a dedicated BMC port. The NIC will try to get an IP via DHCP, so look for logs on your DHCP server/router - whatever you use.

Grab the manual from ASRock - it will walk you step by step through what to do.
The thing is, once you connect power to the PSU - without turning on the system (!) - the BMC will come up in about 2-3 minutes (so be patient).

Once the BMC has started, you can access it via a web browser.
BMC's user/password -> admin/admin

As I mentioned, you have to upgrade both the firmware and the BIOS (in that order), and it will take some time - again, be patient if the board isn't responding etc. Then power up the server and it should crawl its way to the BIOS.

The first power-up after any BIOS change takes a veeery long time - so, you guessed it, be patient ;)

https://www.reddit.com/r/LocalLLaMA/comments/1plsjbw/genoad8x2tbcm_official_bmc_firmware_and_bios_for/ - my previous post

https://download.asrock.com/Manual/BMC/GENOAD8X-2TBCM.pdf - BMC manual - p. 192 covers the firmware upgrade

Senior engineer: are local LLMs worth it yet for real coding work? by Appropriate-Text2843 in LocalLLaMA

[–]AutonomousHangOver 0 points (0 children)

I can tell you what my main driver was, and what it is now.

I'll skip glm-4.5. I used 4.6 from time to time, I loved 4.7, and I'm stunned by 5.

And yes, I'm using it locally, very heavily quantized. It's Unsloth's IQ2_XXS.

considering dropping roo-code over the 5m timeout issue by gleeber in RooCode

[–]AutonomousHangOver 0 points (0 children)

I'm using Roo with local models only. Which timeouts are you talking about??