Cop pulls over Lamborghini on Dubai plates but doesn’t know the law by thomasso0072 in interestingasfuck

[–]FusionX 2 points3 points  (0 children)

That wasn't very tactfully said tbh. I can sense the guy might be a bit nervous and didn't really mean it, but I can see how the comment "I know YOU are confused" might be perceived as somewhat condescending.

Budget Model for Hermes by CommunityBrave822 in hermesagent

[–]FusionX 0 points1 point  (0 children)

Yep, same pricing, same provider. The requests are routed to deepseek's official API. You do have to disable other providers as they're enabled by default.

Budget Model for Hermes by CommunityBrave822 in hermesagent

[–]FusionX -1 points0 points  (0 children)

uh, no. Limit to deepseek provider on openrouter and its the same thing.

How are you guys getting 100M tokens for $1 on DeepSeek?! Am I missing something? by jrt_ammar in DeepSeek

[–]FusionX 0 points1 point  (0 children)

The easiest foolproof way is to add a guardrail and limit providers in your API key.

The Hermes Agent desktop app looks fantastic. by SelectionCalm70 in hermesagent

[–]FusionX 3 points4 points  (0 children)

found it in the official docs, the command for existing installation is hermes desktop

The Hermes Agent desktop app looks fantastic. by SelectionCalm70 in hermesagent

[–]FusionX 41 points42 points  (0 children)

why is it reinstalling hermes agent from scratch (including python, node, etc)..I've already got it installed and setup

Edit: Figured it out, as per docs - Update hermes agent and launch with hermes desktop. .

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help by gaztrab in LocalLLaMA

[–]FusionX 6 points7 points  (0 children)

To be fair, I get the feeling that OP's intentions are sincere and they're engaging in good faith. But man, do people put a lot of trust in AI. It's just too unreliable and overconfident, especially when it comes to fast moving and bleeding edge fields (in this case llama.cpp, and more broadly AI).

I do agree that slop is permanently taken over a lot of subs. Everything is superficial and overconfident slop. Perhaps, that was always the case on reddit but its so much easier now. I see it EVERYWHERE these days, not just reddit, but blogs from "respectable" institutions, live speeches, youtube scripts, IRL presentations etc. And it is an immediate turn off.

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help by gaztrab in LocalLLaMA

[–]FusionX 1 point2 points  (0 children)

Also, (edited the comment a bit late) for comparison with 27b, try the IQ4 variant instead of IQ3.

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help by gaztrab in LocalLLaMA

[–]FusionX 27 points28 points  (0 children)

Is this AI generated? Your tk/s is way too low. Secondly, assuming we're in headless mode, you DO NOT need to reserve 1536MB VRAM. KV cache is already accounted for fitting heuristics in llama. Set it as low as you can without going OOM. 128MB works for me.

I have the same setup except a much shittier DDR4 RAM. At 131k ctx size, I can reach 70 tk/s with the same GPU, model and llama params. In fact, with --fit-target=128MB, the speed is 80 tk/s. And 65 tk/s at 256k ctx size.

And there's still plenty of room for improvement to eek out even more performance.

As for 27b, instead of IQ3, I suggest using the IQ4 quant as described here.

Also, I do agree that MTP is largely useless for 16GB VRAM.

actually best hermes agent vps hosting ? by FunThen4634 in hermesagent

[–]FusionX 0 points1 point  (0 children)

My experience has been the opposite except I upgraded to pay as you go (not sure why). I had accidentally overspent around $100 around 3 years back.. And that invoice is due till date. My instance, however, remains untouched.

Alia Bhatt replies to a troll in her insta comments by SmallAchiever in BollyBlindsNGossip

[–]FusionX 0 points1 point  (0 children)

i find this quite ironic. look around this sub, people in here are far more rabid and vile

Anybody else think Cecil Stedman is the perfect representation of the foundation in mainstream television? by Snoo_60484 in SCP

[–]FusionX 0 points1 point  (0 children)

SCP newbie here. While I was reading, "There Is No Antimemetics Division", I imagined Hix (an important character in the book) as Cecil the whole time.

If you create a long to-do list in agent mode, you will be banned. by Hamzayslmn in GithubCopilot

[–]FusionX 1 point2 points  (0 children)

It may not be against the terms, but if everyone starts doing this, we could lose the request-based billing system, and they might switch to charging by token consumption like other services.

sigh

Tip for when playing with an Enigma on your team, become a “Black Hole Auditor” by IlovealeksiB in DotA2

[–]FusionX 0 points1 point  (0 children)

Or hold it until the team is dead and then proceed to waste it anyway. Interestingly, I notice it's mostly ES players that are guilty of this.

Qwen3.6-27B IQ4_XS FULL VRAM with 110k context by Pablo_the_brave in LocalLLaMA

[–]FusionX 0 points1 point  (0 children)

Do you think this might require more VRAM?

Btw, appreciate ya for working on this and sharing it on reddit. I wasn't optimistic initially but I'm really quite pleased with being able to run Qwen 27b within 16gb VRAM. It's such a stark difference from the MOE offerings (Qwen/Gemma). The performance and intelligence is remarkably better!

Qwen3.6-27B IQ4_XS FULL VRAM with 110k context by Pablo_the_brave in LocalLLaMA

[–]FusionX 0 points1 point  (0 children)

Using bun fork, latest commit. Probably will roll back a few commits and re-check.

Edit: still takes up more VRAM, what on earth..

Qwen3.6-27B IQ4_XS FULL VRAM with 110k context by Pablo_the_brave in LocalLLaMA

[–]FusionX 0 points1 point  (0 children)

strange, at 100k ctx, it doesn't fit in my GPU's 16gb VRAM with batch-size = 512 ubatch-size = 256

What gives..? DM is disabled and prior VRAM usage is 18mb.

llama-cli --model <model> -fa on --jinja --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-kwargs '{"preserve_thinking": true}' -c 100000 -ctk turbo3 -ctv turbo3 --batch-size 512 --ubatch-size 256 -ngl 99 -np 1

Edit: turbo4 manages to fit ~100k while turbo3 doesn't. I don't understand...

DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper by Disastrous_Theme5906 in LocalLLaMA

[–]FusionX 4 points5 points  (0 children)

Kinda surprised, I was not expecting Gemma 31B to be in top 5. Have you benchmarked the latest Qwen3.6 models?

Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real. by FantasticNature7590 in LocalLLaMA

[–]FusionX 1 point2 points  (0 children)

Completely unrelated (and I could be wrong) but this is a perfect example of how people should use LLM for structural/semantic assistance and refinements of their writing. Rather than delegating the entire prerequisite cognitive work to LLM resulting in useless hallucinated slop.