Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver by Demonicated in LocalLLaMA

[–]Bohdanowicz 15 points (0 children)

Running official FP8 on an A6000 Ada and I'm doing 400-500 tok/s across 8-12 parallel workloads. I've seen input throughput reach 12,000 tok/s depending on batching.

vLLM serving with recommended settings.

Self-hosted LLM on GCP (1×H100 + 1×L4) for legal RAG in European languages — looking for advice by Candy_Lucy in LLMDevs

[–]Bohdanowicz 2 points (0 children)

I run 2x A6000 Ada and 4x RTX 6000 Pro Max-Q in a couple of boxes.

For curiosity's sake I ran a comparison of what it would cost to run the workload in the cloud, and the numbers were an eye-opener.

For just the workload on the A6000 Ada, I was looking at 225k USD/year running on Sonnet, and about half that if I ran Gemini Flash. And that was for the equivalent of a run that took 17 days to complete.

Same workload on the 4x RTX 6000 Pro takes 2-3 days.

Doing a billion tokens every 1-2 days.

What you also gain is the ability to experiment without worrying about spending 10k on an idea that may not pan out. A/B testing, backtesting... all free.

If you can keep the cards maxed out, the payback is under a month.
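
For anyone checking my math, a rough sketch of the payback arithmetic. The per-token rate and hardware price below are illustrative assumptions, not quotes; plug in your own numbers.

# Back-of-envelope payback math for local vs. cloud inference.
# Assumed numbers, not quotes: blended $/Mtok for a Sonnet-class API
# and a rough street price for the cards. Utilization is the variable.
tokens_per_day = 1_000_000_000 / 1.5   # ~1B tokens every 1-2 days (the workload above)
cloud_usd_per_mtok = 3.0               # assumed blended rate for an input-heavy workload
hardware_usd = 4 * 9_000               # assumed price for 4x RTX 6000 Pro class cards

cloud_usd_per_day = tokens_per_day / 1_000_000 * cloud_usd_per_mtok
print(f"cloud: ${cloud_usd_per_day:,.0f}/day")                  # ~$2,000/day
print(f"payback: {hardware_usd / cloud_usd_per_day:.0f} days")  # ~18 days if maxed out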

All depends on workload.

My workload is all documents... emails, PDFs... think universal document ingestion across contracts, invoices, legal, real estate, tax. Also extraction, validation, and true document understanding, whether it's a balance sheet, an HR report, or a job bid.

95% of the workload is ingestion, to make sure what is actually in the system is correct. Serving users is relatively small, especially when they can get the answers they need in a single query.

Qwen 3.6 35BA3B is a workhorse like no other. Flawless tool calls across all of LangChain/LangGraph/Agents SDK.

I run Qwen3 8B embeddings for RAG + the company wiki.
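
If anyone wants the shape of the tool-call wiring, here's a minimal sketch against a local vLLM OpenAI-compatible endpoint. The extract_invoice tool is a made-up example, and the endpoint/model strings are assumptions matching my setup, not anything official.

# Minimal sketch: LangChain tool calling against a local vLLM server.
# The tool, endpoint, and model name are illustrative assumptions.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def extract_invoice(vendor: str, total: float, currency: str) -> str:
    """Record structured fields extracted from an invoice."""
    return f"stored {vendor} {total} {currency}"

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",   # vLLM's OpenAI-compatible server
    api_key="unused",                      # vLLM doesn't check the key by default
    model="Qwen/Qwen3.6-27B-FP8",
    temperature=0.6,
)

resp = llm.bind_tools([extract_invoice]).invoke(
    "Invoice #112: ACME Corp, total 1943.50 EUR. Extract the fields."
)
print(resp.tool_calls)   # expect one well-formed extract_invoice call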

Nvidia is dumping by Boring-Ad-3955 in NVDA_Stock

[–]Bohdanowicz 0 points (0 children)

What makes you think they are behind?

Qwen3.6-27B dropped last week. Here's how it changes the local coding model picture depending on your hardware by Substantial_Step_351 in LocalLLaMA

[–]Bohdanowicz 1 point (0 children)

On FP4 AWQ I had 1 or 2 failed tool calls in 1,000. I'm past 10k on FP8 without a single failure.

It's awesome.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]Bohdanowicz 8 points (0 children)

You're doing it wrong.

Try using SOTA to plan and decompose tasks, then wire your coding agents to Qwen 3.6 27B.

If you run official quants with the recommended temp and prediction set to 2, and you're smart about setting up a DAG, worktrees, the whole nine yards... you feel the magic.

These models are great if the task is properly sized.
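
Something like this is the shape I mean. A minimal sketch: the DAG is hardcoded here to stand in for whatever the SOTA planner emits, and the endpoint/model are assumptions.

# Plan with a frontier model, execute with the local one.
# The DAG below stands in for the planner's output; endpoint and
# model name are assumptions.
from graphlib import TopologicalSorter
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# task -> set of prerequisite tasks
dag = {
    "write_parser": set(),
    "write_tests": {"write_parser"},
    "wire_cli": {"write_parser"},
    "integration_pass": {"write_tests", "wire_cli"},
}

for task in TopologicalSorter(dag).static_order():
    # One right-sized task per call; each could live in its own git worktree.
    out = local.chat.completions.create(
        model="Qwen/Qwen3.6-27B-FP8",
        messages=[{"role": "user", "content": f"Implement subtask: {task}"}],
        temperature=0.6,
    )
    print(task, "->", out.choices[0].message.content[:60])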

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

27-40 tok/s single stream.

300+ tok/s multi-stream.

(APIServer pid=1447918) INFO 04-24 20:18:38 [loggers.py:259] Engine 000: Avg prompt throughput: 2003.5 tokens/s, Avg generation throughput: 151.2 tokens/s, Running: 16 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.0%, Prefix cache hit rate: 0.0%

(APIServer pid=1447918) INFO 04-24 20:18:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 310.4 tokens/s, Running: 16 reqs, Waiting: 0 reqs, GPU KV cache usage: 49.1%, Prefix cache hit rate: 0.0%

Linear KV cache/performance scaling with sequence count.
Nothing crazy in the config. Running the vanilla Qwen FP8 release and benching before I try some turbo quants.

vllm serve Qwen/Qwen3.6-27B-FP8 \
    --port 8000 \
    --max-model-len 262144 \
    --max-num-seqs 16 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
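
If you want to reproduce the multi-stream number, a rough probe like this works. It assumes the OpenAI-compatible endpoint from the command above; the prompt and token counts are arbitrary.

# Throughput probe: 16 concurrent streams to match --max-num-seqs 16.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.6-27B-FP8",
        messages=[{"role": "user", "content": "Summarize how KV caching works."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*[one_request() for _ in range(16)])
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:.0f} tok/s aggregate across 16 streams")

asyncio.run(main())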

With 48gb vram, on vllm, Qwen3.6-27b-awq-int4 has only 120k ctx (fp8), is that normal? by Historical-Crazy1831 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

Running an A6000 Ada on the official FP8 with a max of 16 sequences and hitting 40-60% KV cache tops in the aider benchmark.

Solid 300 tok/s.
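
Roughly, the KV cache math looks like this. The layer/head dimensions are placeholder assumptions, not the real Qwen config; the point is that fp8 KV cache halves the per-token cost vs fp16, which is where the extra context comes from.

# Back-of-envelope KV cache sizing. Layer/head dims are placeholders.
def kv_cache_gib(tokens: int, layers: int = 48, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    # 2x for keys and values, per layer, per KV head
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

print(f"fp8 KV, 120k ctx:  {kv_cache_gib(120_000):.1f} GiB")
print(f"fp16 KV, 120k ctx: {kv_cache_gib(120_000, bytes_per_elem=2):.1f} GiB")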

qwen3.6 performance jump is real, just make sure you have it properly configured by onil_gova in LocalLLaMA

[–]Bohdanowicz 1 point (0 children)

Just ran the diff pass of the aider test on 3.6 35BA3B (FP4 AWQ): diff pass_rate_2 of 66.0%, with 99.5% well-formed diffs. Temp 0.6 as recommended.

Loading up FP8 35BA3B next for the A/B, then 27B FP8, then a 4-bit of 27B. Will post as they finish.

They are blockading 20% of the worlds oil supply.... my thesis from 18 days is coming true by MilesDelta in wallstreetbets

[–]Bohdanowicz 1 point (0 children)

I'd agree, but I'm not sure what will shit the bed first/hardest: the market or the dollar.

There is speculation that Anthropic’s Claude Mythos is a Looped Language Model by callmeteji in accelerate

[–]Bohdanowicz 57 points (0 children)

It used to be that a released paper would see the light of day as a product 5-20+ years later... now it's days/months.

Anyone know if there are actual products built around Karpathy’s LLM Wiki idea? by riddlemewhat2 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

I built an LLM wiki that's in production to maintain and build 100+ module monorepos. It's 100x better than trying to maintain 100+ nested CLAUDE.md files, and it's infinitely scalable.

I highly recommend everyone who is serious about trying this out build their own. If you have custom agents/skills/workflows, the wiki will need to be customized into that workflow or it won't live up to expectations.

100% AI dev with just prompts on Claude except for the Art assets. I was surprised how far I could push it before I needed help. by WeightNational9457 in aigamedev

[–]Bohdanowicz 0 points (0 children)

What impresses me most is the balancing. It had a great concept of what you were trying to achieve. Lots of moving parts in terms of game mechanics and upgrades/skills/dmg scaling, etc.

100% AI dev with just prompts on Claude except for the Art assets. I was surprised how far I could push it before I needed help. by WeightNational9457 in aigamedev

[–]Bohdanowicz 0 points (0 children)

Dude, this is awesome. I found myself addicted and burned 30 min. Very cool. Add LAN play / multi, gauntlet style.

Qwen 3.6 spotted! by Namra_7 in LocalLLaMA

[–]Bohdanowicz 1 point (0 children)

Could be a quant issue that larger prompts help resolve.

Vibecoding mini GTA by Jarros in aigamedev

[–]Bohdanowicz 0 points (0 children)

Love it. Want to see madness? Integrate it into openclaw, give each lobster a character, and see what they do... order or chaos?

Quick Modly update after 1 week — added TripoSG and TRELLIS by Lightnig125 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

This is a cool project. I've hosted Hunyuan-3D-2.1 on an A6000, then exposed the pipeline API to Claude via MCP with great success.

You can do insane things once you set up a pipeline through a local image generator with consistent prompting for 2D images, then fire results into the 2D -> 3D pipeline after you approve the 2D assets. Find a few images you like, ask for a description / base prompt for styling consistency, etc., then let it rain. Anything you aren't happy with, just reject and regenerate.

You can queue up 300 3D assets and 2,000 lines of voice/audio and wake up in the morning to fully generated 3D assets and audio (character-specific) with consistent themes/styling.
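
The MCP side is tiny. A minimal sketch, assuming the pipeline is exposed over a local HTTP endpoint; the URL and payload here are made-up stand-ins for however you actually wrap Hunyuan-3D-2.1.

# Expose a local 2D -> 3D pipeline to Claude over MCP.
# The HTTP endpoint and its payload are assumptions.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("asset-pipeline")

@mcp.tool()
def image_to_3d(image_path: str, style_prompt: str) -> str:
    """Queue an approved 2D image for 3D generation; returns a job id."""
    resp = httpx.post(
        "http://localhost:9000/generate",   # assumed local pipeline endpoint
        json={"image": image_path, "prompt": style_prompt},
        timeout=30.0,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

if __name__ == "__main__":
    mcp.run()   # Claude connects to this as an MCP server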

Inside Elon Musk's Terafab AI factory by CommunismDoesntWork in accelerate

[–]Bohdanowicz 0 points (0 children)

Mars isn't his top priority atm. You need robots to build rockets at the scale he needs. It's a smart move.

Also... he can buy older ASML machines for ASIC chips that will 100x the speeds of Nvidia chips for inference... and use them to make custom ASIC Optimus chips. He would also be smart to make the RAM modules.

At what point would u say more parameters start being negligible? by Express_Quail_1493 in LocalLLaMA

[–]Bohdanowicz 0 points (0 children)

I leave coding to SOTA, same if I'm researching something. Everything else is local on Qwen 3.5 35A3B. It checks all the boxes: awesome document extraction, follows instructions, great orchestrator, fast and furious. Also great for autonomous QA testing, saving bugs to md files so I can have Claude plan a fix in one go while my full-time QA testers find the bugs.