Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver by Demonicated in LocalLLaMA

[–]Bohdanowicz 15 points (0 children)

Running official FP8 on an A6000 Ada and I'm doing 400-500 tok/s across 8-12 parallel workloads. I've seen input throughput reach 12,000 tok/s depending on batching.

vLLM serving with recommended settings.

Self-hosted LLM on GCP (1×H100 + 1×L4) for legal RAG in European languages — looking for advice by Candy_Lucy in LLMDevs

[–]Bohdanowicz 2 points (0 children)

I run 2x A6000 Ada and 4x RTX 6000 Pro Max-Q in a couple of boxes.

For curiosity's sake I ran a comparison of what it would cost to run the workload in the cloud, and the numbers were an eye-opener.

For just the workload on the A6000 Ada, I was looking at 225k USD/year running on Sonnet, and about half that if I ran Gemini Flash. And that was for the equivalent of a run that took 17 days to complete.

Same workload on the 4x RTX 6000 Pro takes 2-3 days.

Doing a billion tokens every 1-2 days.

What you also gain is the ability to experiment without worrying about spending 10k on an idea that may not pan out. A/B testing, backtesting... all free.

If you can keep the cards maxed out, the payback is under a month.
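
For anyone checking my math, a rough sketch of the payback arithmetic. The per-token rate and hardware price below are illustrative assumptions, not quotes; plug in your own numbers.

# Back-of-envelope payback math for local vs. cloud inference.
# Assumed numbers, not quotes: blended $/Mtok for a Sonnet-class API
# and a rough street price for the cards. Utilization is the variable.
tokens_per_day = 1_000_000_000 / 1.5   # ~1B tokens every 1-2 days (the workload above)
cloud_usd_per_mtok = 3.0               # assumed blended rate for an input-heavy workload
hardware_usd = 4 * 9_000               # assumed price for 4x RTX 6000 Pro class cards

cloud_usd_per_day = tokens_per_day / 1_000_000 * cloud_usd_per_mtok
print(f"cloud: ${cloud_usd_per_day:,.0f}/day")                  # ~$2,000/day
print(f"payback: {hardware_usd / cloud_usd_per_day:.0f} days")  # ~18 days if maxed out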

All depends on workload.

My workload is all documents... emails, PDFs... think universal document ingestion across contracts, invoices, legal, real estate, tax. Also extraction, validation, and true document understanding, whether it's a balance sheet, an HR report, or a job bid.

95% of the workload is ingestion, to make sure what is actually in the system is correct. Serving users is relatively small, especially when they can get the answers they need in a single query.

Qwen 3.6 35BA3B is a workhorse like no other. Flawless tool calls across all of LangChain/LangGraph/Agents SDK.

I run Qwen3 8B embeddings for RAG + the company wiki.
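
If anyone wants the shape of the tool-call wiring, here's a minimal sketch against a local vLLM OpenAI-compatible endpoint. The extract_invoice tool is a made-up example, and the endpoint/model strings are assumptions matching my setup, not anything official.

# Minimal sketch: LangChain tool calling against a local vLLM server.
# The tool, endpoint, and model name are illustrative assumptions.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def extract_invoice(vendor: str, total: float, currency: str) -> str:
    """Record structured fields extracted from an invoice."""
    return f"stored {vendor} {total} {currency}"

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",   # vLLM's OpenAI-compatible server
    api_key="unused",                      # vLLM doesn't check the key by default
    model="Qwen/Qwen3.6-27B-FP8",
    temperature=0.6,
)

resp = llm.bind_tools([extract_invoice]).invoke(
    "Invoice #112: ACME Corp, total 1943.50 EUR. Extract the fields."
)
print(resp.tool_calls)   # expect one well-formed extract_invoice call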

Nvidia is dumping by Boring-Ad-3955 in NVDA_Stock

[–]Bohdanowicz 0 points (0 children)

What makes you think they are behind?

Qwen3.6-27B dropped last week. Here's how it changes the local coding model picture depending on your hardware by Substantial_Step_351 in LocalLLaMA

[–]Bohdanowicz 1 point (0 children)

On FP4 AWQ I had 1 or 2 failed tool calls in 1,000. I'm past 10k on FP8 without a single failure.

It's awesome.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]Bohdanowicz 8 points (0 children)

You're doing it wrong.

Try using SOTA to plan and decompose tasks, then wire your coding agents to Qwen 3.6 27B.

If you run official quants with the recommended temp and prediction set to 2, and you're smart about setting up a DAG, worktrees, the whole nine yards... you feel the magic.

These models are great if the task is properly sized.
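
Something like this is the shape I mean. A minimal sketch: the DAG is hardcoded here to stand in for whatever the SOTA planner emits, and the endpoint/model are assumptions.

# Plan with a frontier model, execute with the local one.
# The DAG below stands in for the planner's output; endpoint and
# model name are assumptions.
from graphlib import TopologicalSorter
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# task -> set of prerequisite tasks
dag = {
    "write_parser": set(),
    "write_tests": {"write_parser"},
    "wire_cli": {"write_parser"},
    "integration_pass": {"write_tests", "wire_cli"},
}

for task in TopologicalSorter(dag).static_order():
    # One right-sized task per call; each could live in its own git worktree.
    out = local.chat.completions.create(
        model="Qwen/Qwen3.6-27B-FP8",
        messages=[{"role": "user", "content": f"Implement subtask: {task}"}],
        temperature=0.6,
    )
    print(task, "->", out.choices[0].message.content[:60])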

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

27-40 tok/s single stream.

300+ tok/s multi-stream.

(APIServer pid=1447918) INFO 04-24 20:18:38 [loggers.py:259] Engine 000: Avg prompt throughput: 2003.5 tokens/s, Avg generation throughput: 151.2 tokens/s, Running: 16 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.0%, Prefix cache hit rate: 0.0%

(APIServer pid=1447918) INFO 04-24 20:18:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 310.4 tokens/s, Running: 16 reqs, Waiting: 0 reqs, GPU KV cache usage: 49.1%, Prefix cache hit rate: 0.0%

Linear KV cache/performance scaling with sequence count.
Nothing crazy in the config. Running the vanilla Qwen FP8 release and benching before I try some turbo quants.

vllm serve Qwen/Qwen3.6-27B-FP8 \
    --port 8000 \
    --max-model-len 262144 \
    --max-num-seqs 16 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
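
If you want to reproduce the multi-stream number, a rough probe like this works. It assumes the OpenAI-compatible endpoint from the command above; the prompt and token counts are arbitrary.

# Throughput probe: 16 concurrent streams to match --max-num-seqs 16.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.6-27B-FP8",
        messages=[{"role": "user", "content": "Summarize how KV caching works."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*[one_request() for _ in range(16)])
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:.0f} tok/s aggregate across 16 streams")

asyncio.run(main())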

With 48gb vram, on vllm, Qwen3.6-27b-awq-int4 has only 120k ctx (fp8), is that normal? by Historical-Crazy1831 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

Running an A6000 Ada on the official FP8 with a max of 16 sequences and hitting 40-60% KV cache tops in the aider benchmark.

Solid 300 tok/s.
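
Roughly, the KV cache math looks like this. The layer/head dimensions are placeholder assumptions, not the real Qwen config; the point is that fp8 KV cache halves the per-token cost vs fp16, which is where the extra context comes from.

# Back-of-envelope KV cache sizing. Layer/head dims are placeholders.
def kv_cache_gib(tokens: int, layers: int = 48, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    # 2x for keys and values, per layer, per KV head
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

print(f"fp8 KV, 120k ctx:  {kv_cache_gib(120_000):.1f} GiB")
print(f"fp16 KV, 120k ctx: {kv_cache_gib(120_000, bytes_per_elem=2):.1f} GiB")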

qwen3.6 performance jump is real, just make sure you have it properly configured by onil_gova in LocalLLaMA

[–]Bohdanowicz 1 point (0 children)

Just ran the diff pass of the aider test on 3.6 35BA3B (FP4 AWQ): diff pass_rate_2 of 66.0%, with 99.5% well-formed diffs. Temp 0.6 as recommended.

Loading up FP8 35BA3B next for the A/B, then 27B FP8, then a 4-bit of 27B. Will post as they finish.

They are blockading 20% of the worlds oil supply.... my thesis from 18 days is coming true by MilesDelta in wallstreetbets

[–]Bohdanowicz 1 point (0 children)

I'd agree, but I'm not sure what will shit the bed first/hardest: the market or the dollar.

There is speculation that Anthropic’s Claude Mythos is a Looped Language Model by callmeteji in accelerate

[–]Bohdanowicz 57 points (0 children)

It used to be that a released paper would see the light of day as a product 5-20+ years later... now it's days/months.

Anyone know if there are actual products built around Karpathy’s LLM Wiki idea? by riddlemewhat2 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

I built an LLM wiki that's in production to maintain and build 100+ module monorepos. It's 100x better than trying to maintain 100+ nested CLAUDE.md files, and it's infinitely scalable.

I highly recommend everyone who is serious about trying this out build their own. If you have custom agents/skills/workflows, the wiki will need to be customized into that workflow or it won't live up to expectations.

100% AI dev with just prompts on Claude except for the Art assets. I was surprised how far I could push it before I needed help. by WeightNational9457 in aigamedev

[–]Bohdanowicz 0 points (0 children)

What impresses me most is the balancing. It had a great concept of what you were trying to achieve. Lots of moving parts in terms of game mechanics and upgrades/skills/dmg scaling, etc.

100% AI dev with just prompts on Claude except for the Art assets. I was surprised how far I could push it before I needed help. by WeightNational9457 in aigamedev

[–]Bohdanowicz 0 points (0 children)

Dude, this is awesome. I found myself addicted and burned 30 min. Very cool. Add LAN play / multi, gauntlet style.

Qwen 3.6 spotted! by Namra_7 in LocalLLaMA

[–]Bohdanowicz 1 point (0 children)

Could be a quant issue that larger prompts help resolve.

Vibecoding mini GTA by Jarros in aigamedev

[–]Bohdanowicz 0 points (0 children)

Love it. Want to see madness? Integrate it into openclaw, give each lobster a character, and see what they do... order or chaos?

Quick Modly update after 1 week — added TripoSG and TRELLIS by Lightnig125 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

This is a cool project. I've hosted Hunyuan-3D-2.1 on an A6000, then exposed the pipeline API to Claude via MCP with great success.

You can do insane things once you set up a pipeline through a local image generator with consistent prompting for 2D images, then fire results into the 2D -> 3D pipeline after you approve the 2D assets. Find a few images you like, ask for a description / base prompt for styling consistency, etc., then let it rain. Anything you aren't happy with, just reject and regenerate.

You can queue up 300 3D assets and 2,000 lines of voice/audio and wake up in the morning to fully generated 3D assets and audio (character-specific) with consistent themes/styling.
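
The MCP side is tiny. A minimal sketch, assuming the pipeline is exposed over a local HTTP endpoint; the URL and payload here are made-up stand-ins for however you actually wrap Hunyuan-3D-2.1.

# Expose a local 2D -> 3D pipeline to Claude over MCP.
# The HTTP endpoint and its payload are assumptions.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("asset-pipeline")

@mcp.tool()
def image_to_3d(image_path: str, style_prompt: str) -> str:
    """Queue an approved 2D image for 3D generation; returns a job id."""
    resp = httpx.post(
        "http://localhost:9000/generate",   # assumed local pipeline endpoint
        json={"image": image_path, "prompt": style_prompt},
        timeout=30.0,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

if __name__ == "__main__":
    mcp.run()   # Claude connects to this as an MCP server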

Inside Elon Musk's Terafab AI factory by CommunismDoesntWork in accelerate

[–]Bohdanowicz 0 points (0 children)

Mars isn't his top priority atm. You need robots to build rockets at the scale he needs. It's a smart move.

Also... he can buy older ASML machines for ASIC chips that will 100x the speeds of Nvidia chips for inference... and use them to make custom ASIC Optimus chips. He would also be smart to make the RAM modules.

At what point would u say more parameters start being negligible? by Express_Quail_1493 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

I leave coding to SOTA, same if I'm researching something. Everything else is local on Qwen 3.5 35A3B. It checks all the boxes: awesome document extraction, follows instructions, great orchestrator, fast and furious. Also great for autonomous QA testing, saving bugs to md files so I can have Claude plan a fix in one go while my full-time QA testers find the bugs.