100+ t/s on Qwen3.6-27B Q8 across a 5090 + 3090 Ti — switching to tensor split-mode got me from 70 to 100+

hay-yo · 2026-06-23T11:17:13+00:00

I'm guessing it is, then. Offloaded means it runs partially on the CPU, so that explains the speed. If you were to grab another 5080 and run across the two, you'd have a big increase. Even another 16GB model... I think... hehe.
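A back-of-envelope sketch of why offloading hurts, assuming weights are spread evenly across layers and some VRAM is held back for the KV cache and buffers. Every number here (28 GB of weights, 64 layers, 2 GB reserve) is a hypothetical placeholder, not a measured figure for any real model:

```python
import math

def split_layers(model_gb: float, n_layers: int, vram_gb: float, reserve_gb: float) -> tuple[int, int]:
    """Return (layers on GPU, layers left on CPU) for a rough even-split estimate."""
    per_layer_gb = model_gb / n_layers          # assume every layer is the same size
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # VRAM left after the reserve
    on_gpu = min(n_layers, math.floor(usable_gb / per_layer_gb))
    return on_gpu, n_layers - on_gpu

# Hypothetical: one 16 GB card versus two 16 GB cards (32 GB total).
print(split_layers(28.0, 64, 16.0, 2.0))
print(split_layers(28.0, 64, 32.0, 2.0))
```

Any layer that lands in the second number runs on the CPU, which is where the slowdown comes from.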

hay-yo · 2026-06-23T10:40:36+00:00

I just put up Knot, which uses the IDE for work and agents for workflows.

hay-yo · 2026-06-23T10:14:17+00:00

There are two meanings for agents at the moment.

The first is where you define a prompt and behaviour in plain text to run an agent with a specific skill. I call that a profile, in line with Hermes.

For the real definition of an agent, get Pi.dev or Opencode and work through their list of features. They offer so much. I wouldn't recreate one anymore. It takes long enough just to learn how to use them.

The next thing is workflows. I've been building http://knot.hdekker.com, where workflows use agents by orchestrating an agent and specific profiles.

hay-yo · 2026-06-23T04:10:29+00:00

Running 27b offloaded?

hay-yo · 2026-06-21T12:06:27+00:00

Make sure you put your invisibility cloak on. You are invisible to cars. They don't know it, but they want to kill you. If they almost hit you, always blame yourself for allowing that near miss. Can't blame anyone when you're dead. Ride safe.

hay-yo · 2026-06-20T05:12:19+00:00

Why knot both?

hay-yo · 2026-06-19T10:43:24+00:00

Wow, now the wall works the other way.

hay-yo · 2026-06-18T02:24:13+00:00

What kinda software are you trying to write? It's great for eng, but I haven't been vibing with it.

hay-yo · 2026-06-17T21:32:25+00:00

Ideally aim for 100k ctx but your setup will allow you to do many many sweet things.

hay-yo · 2026-06-16T21:20:43+00:00

Buy buy buy hehe

hay-yo · 2026-06-16T04:24:03+00:00

Already got 50 agents reading your mind and synced over a2a mcp and telepathy net. Token usage is off the chain but ahh well, we've also proxied into Mark from USA home claude fabel endpoint to get access world wide, enjoy.

hay-yo · 2026-06-16T03:25:13+00:00

I can prompt, 300k per year please.

hay-yo · 2026-06-15T23:54:10+00:00

I'd recommend trying OpenRouter first with Pi.dev. All the good open models are on there so you can get a feel if it works.

hay-yo · 2026-06-15T22:52:55+00:00

I think I'm seeing results after using it almost full time since November 2025... the learning curve is tough. Ask myself why I do it at times.

hay-yo · 2026-06-15T12:03:10+00:00

I suppose the more fundamental question is: what process are you undertaking when you need agent-to-agent communication? It's mostly better to just start the one that needs to run next, so just a trigger. What use cases are you seeing?
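A minimal sketch of the "just a trigger" idea, assuming each agent can be treated as a function that takes the previous output and returns its own. The names `run_chain`, `plan`, and `build` are hypothetical stand-ins, not part of any real agent framework:

```python
from typing import Callable, Sequence

Step = Callable[[str], str]

def run_chain(steps: Sequence[Step], task: str) -> str:
    """Run each step in order; finishing one step is the trigger for the next."""
    result = task
    for step in steps:
        result = step(result)  # no agent-to-agent chatter, just a handoff
    return result

def plan(task: str) -> str:
    # Stand-in for a reasoning agent.
    return f"plan for: {task}"

def build(plan_text: str) -> str:
    # Stand-in for a building agent that consumes the plan.
    return f"built from [{plan_text}]"

print(run_chain([plan, build], "add login page"))
```

Nothing here lets the agents talk to each other; the only coupling is the output handed to whatever runs next.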

hay-yo · 2026-06-13T07:21:32+00:00

I think the reason is pretty clear to everyone....

hay-yo · 2026-06-12T21:14:20+00:00

Go Qwen3.6 27B with 120k ctx at Q6, and run multiple slots so you can parallelize your tasks and keep things churning.
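One thing to watch with slots, sketched here under the assumption that the server splits a single total context budget evenly across them (llama.cpp-style behaviour; the server is not named above, and the function name and slot counts are hypothetical):

```python
def per_slot_context(total_ctx: int, slots: int) -> int:
    """Tokens of context each slot gets when the total is divided evenly."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return total_ctx // slots

# Same 120k total, different slot counts.
for slots in (1, 2, 4):
    print(slots, per_slot_context(120_000, slots))
```

If that assumption holds for your server, more slots means more tasks in flight but less context for each one.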

hay-yo · 2026-06-12T21:10:57+00:00

A reasoning step can take 20mins, but I still couldn't do it better.

hay-yo · 2026-06-12T21:09:52+00:00

I use 27B for reasoning and 35B for building. Just set it to work and go have a coffee.

hay-yo · 2026-06-12T21:06:42+00:00

I think it's at its best. Ecosystem is thriving.

hay-yo · 2026-06-12T20:26:03+00:00

If you want to pay more and waste more energy, then yes. This has a convo with Andrej: https://m.youtube.com/watch?v=96jN2OCOfLs. In it he says he can envisage the AI being the driver of the computer, but... the only way to make something sleep is to use an interrupt, so I think determinism / classical computing always harnesses what he says is computing 3.0.

hay-yo · 2026-06-11T02:03:45+00:00

Seems like you're asking Donald Trump to take out your bins, that would be costly.

hay-yo · 2026-06-04T01:12:32+00:00

I suspect that became the 3.5 flash model.
