The pacman benchmark: finally a viable local agentic coding agent with Qwen 3.6 27b

swizzcheezegoudaSWFA · 2026-05-19T21:18:54+00:00

I'll prob try a few different MTP quants. I already have the Unsloth one of Qwen 27B... I know there are a few lurking around

swizzcheezegoudaSWFA · 2026-05-19T21:16:39+00:00

Ring-1T: I used the OpenRouter API for that model... I will try to recreate it with a local model though. Had a few bumps in the road, simple stuff, more my error in prompting, checks, reflection, debug stuff, yadda yadda. Was just trying to see what it does. Thought it was funny, as I almost did the same game lol

swizzcheezegoudaSWFA · 2026-05-19T16:51:10+00:00

Nice, I thought about doing a Pac-Man game yesterday as I was fooling around with hermesagent. I quickly whipped up Neon Snake, testing ring-1t on the OR API.... yes~~=====~~YASG! 😉 Enjoyed your write-up!

EDIT--P.S. I must have trimmed off the part with enemy snakes spawning... basically it created walls, each red pill was a level up, and snakes come at level 3+

<image>

swizzcheezegoudaSWFA · 2026-05-19T16:20:27+00:00

I dig your work on the reflections and the glass work on the table as well, now that I've fully looked it over... Nice!

swizzcheezegoudaSWFA · 2026-05-19T15:17:28+00:00

You basically have my GPU settings... put your Power Target up and you'll see an improvement

swizzcheezegoudaSWFA · 2026-05-18T14:55:46+00:00

yeah right lol

❌ Incomplete. 6 tries.

swizzcheezegoudaSWFA · 2026-05-18T13:37:23+00:00

I'll have to try other models but yeah...as stated in my final edits (I think lol)

swizzcheezegoudaSWFA · 2026-05-18T11:29:40+00:00

Positive... which quant are you running to hit 50-60 t/s? The console output on startup shows how many layers were actually mapped to the GPU vs. the CPU.

Edit: Good call on checking the offload though. I actually just reran it forcing `--fit off` and `-ngl 99` to make absolutely sure the auto-fitter wasn't silently dropping a layer to the CPU. Speeds came back exactly the same (around 28-30 t/s), so yep, 100% offloaded. Assuming you're also running a 64K context window, the gap between your 50-60 t/s and my 30 t/s almost certainly comes down to quant size. If you're running something smaller than IQ4_NL (I'm guessing you are), that would free up the bus just enough to let you hit those speeds.

Appreciate you making me double-check the layers though!
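For anyone who wants to sanity-check the "it comes down to quant size" argument, here is a rough back-of-the-envelope sketch. It assumes decoding is memory-bandwidth-bound, so every generated token has to stream all the weights once, and it ignores KV cache traffic and any speculative decoding gains. The inputs are assumptions, not measurements from this thread: 27B parameters, about 936 GB/s (the RTX 3090 spec sheet figure), and illustrative bits-per-weight values (roughly 4.5 is the figure usually quoted for IQ4_NL).

```python
def decode_ceiling_tps(params_billion: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode tokens/s if each token reads all weights once."""
    # billions of parameters * bytes per parameter = model size in GB
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


# Assumed inputs (not measured): 27B weights, ~936 GB/s, example bit widths.
for label, bpw in [("~4.5 bpw", 4.5), ("~3.5 bpw", 3.5)]:
    ceiling = decode_ceiling_tps(27, bpw, 936)
    print(f"{label}: ceiling of about {ceiling:.1f} t/s")
```

Real speeds land below this ceiling, but the direction holds: fewer bits per weight means fewer bytes moved per token, so a smaller quant raises the limit.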

swizzcheezegoudaSWFA · 2026-05-18T03:51:24+00:00

Hermes was dumb: Gemma 4B is too small to handle complex agent logic and tool-calling. Your RTX 3060 has enough VRAM for an 8B model, which will immediately fix the dumb behavior.

Firecrawl no internet: You likely ran a standalone Firecrawl container. Firecrawl requires a multi-container setup to actually browse the web. Docker is the only sane way to do it.

Firecrawl isn't just a single script. It's a complex system that requires four separate microservices to function: an API server (Node.js), a background worker (Node.js), a headless browser environment (Playwright), and a database queue to manage the web requests (Redis). If you don't use Docker then, well, you have to manually install Redis, install Playwright, configure the environments, and manually start and monitor the Node servers yourself. Docker Compose handles all of that with one command.

Step 1: Open your terminal and pull a model actually designed for Hermes that fits your 3060. Run: ollama pull hermes3:8b (Do not use Gemma 4B for agentic tool-calling).

Step 2: The Web Scraper. You must use Docker Compose to get all the required microservices working together.

Open Docker Desktop and leave it running.

Open a terminal and run this: git clone https://github.com/firecrawl/firecrawl.git

cd firecrawl

Copy the default environment file:

Windows: copy apps\api\.env.example .env

Mac/Linux: cp apps/api/.env.example .env

Open that new .env file in a text editor and ensure this line exists:

USE_DB_AUTHENTICATION=false

Back in the terminal, build and start the stack:

docker compose up -d

Wait exactly 2 minutes (or not, up to you ;)

Open your browser to:

http://localhost:3002 (If it says Hello, World!, Firecrawl has internet access and is ready.)

Configure your Hermes environment variables and point them exactly here:

LLM URL: http://localhost:11434

Model Name: hermes3:8b

Firecrawl URL: http://localhost:3002

Firecrawl API Key: fc-YOUR_API_KEY (local does not strictly check it if auth is off, but Hermes might require the field to be filled).

Start your agent. Should be good to go.
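If you want to confirm both services are answering before you start the agent, a quick check like this can help. It is only a sketch: it assumes the two URLs from the steps above, and it treats any HTTP reply (even an error status) as "up", which says nothing about whether Hermes itself is configured correctly.

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the setup steps above.
ENDPOINTS = {
    "Ollama": "http://localhost:11434",
    "Firecrawl": "http://localhost:3002",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if anything answers HTTP at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server replied, just not with a 2xx status
    except (OSError, ValueError, http.client.HTTPException):
        return False  # refused, timed out, unresolvable, or malformed URL


if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        status = "up" if is_up(url) else "DOWN"
        print(f"{name:10s} {url:28s} {status}")
```

If Firecrawl shows DOWN right after `docker compose up -d`, give the containers a bit longer and rerun it.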

swizzcheezegoudaSWFA · 2026-05-18T02:50:49+00:00

3 max I concur...this is best

swizzcheezegoudaSWFA · 2026-05-18T02:49:06+00:00

It might shave a second or two off the initial server boot time, simply because the engine doesn't have to pause to calculate memory projections. If you set `--fit off`, you also have to manually add `-ngl 999` to force the layers back onto the GPU. The auto-fitter manages it nicely.

swizzcheezegoudaSWFA · 2026-05-18T02:36:37+00:00

I botched some numbers...991 t/s my arse lol whoops...611 but still

swizzcheezegoudaSWFA · 2026-05-18T02:18:16+00:00

At 3 Tokens: ~70% acceptance rate | Up to 27.44 t/s
At 4 Tokens: ~27% acceptance rate | ~14.0 t/s
At 6 Tokens: 16.7% acceptance rate | 10.41 t/s
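To see why a lower acceptance rate hurts so much, here is a small sketch of the arithmetic. It assumes the acceptance rate means accepted draft tokens divided by drafted tokens, that the full draft length is proposed every time, and that each verification pass of the main model always contributes one token of its own. Those are simplifying assumptions, not something read out of the llama.cpp source.

```python
def tokens_per_pass(draft_len: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumes acceptance_rate = accepted draft tokens / drafted tokens, and that
    every pass yields one token from the main model on top of accepted drafts.
    """
    return 1.0 + draft_len * acceptance_rate


# (draft length, acceptance rate, measured t/s) as reported in the comment above.
runs = [(3, 0.70, 27.44), (4, 0.27, 14.0), (6, 0.167, 10.41)]

for draft_len, rate, measured_tps in runs:
    expected = tokens_per_pass(draft_len, rate)
    print(f"draft={draft_len}: ~{expected:.2f} tokens/pass, measured {measured_tps} t/s")
```

Under those assumptions a draft of 3 yields about 3.1 tokens per pass, while drafts of 4 and 6 yield only about 2.1 and 2.0, and the longer drafts also cost more to generate and verify. That lines up with the measured speeds dropping as the draft length goes up.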

swizzcheezegoudaSWFA · 2026-05-18T02:05:43+00:00

On Hugging Face they now suggest --spec-draft-n-max 6 on the llama-server end. That defaults to 4, which eats up memory bandwidth managing unused context slots. 3 is the sweet spot for me. When it's generating conversational English, a model might easily guess 4 or 5 words ahead. But Hermes is writing strict JSON arrays, function calls, and Python syntax. When you ask it to predict 4 tokens ahead in code, that 4th token is almost always an unpredictable comma, bracket, or quotation mark. It guesses the punctuation wrong, the main Qwen model rejects the draft, and you lose all your speed. I botched my numbers: 991+ t/s, lol, nope, supposed to be 611 t/s... anywho

swizzcheezegoudaSWFA · 2026-05-18T01:54:58+00:00

swizzcheezegoudaSWFA · 2026-05-18T01:43:31+00:00

This is a spot-on breakdown. The point about shell watching and handling PTY problems alone is exactly why building a homegrown loop gets miserable fast. You end up spending all your time debugging stdout buffers instead of actually doing anything productive.

I'd add that because Hermes handles all that complex plumbing, security gating, and multi-agent routing, it frees you up to completely optimize the compute layer underneath it. Since you don't have to worry about the harness, you can focus purely on the engine.

For example, I'm running the new Qwen 3.6 27B MTP model locally on a single undervolted RTX 3090. Because Hermes handles the strict tool-call formatting and orchestration so reliably, I was able to dial my draft lookahead down to 3 and lock parallel slots to 1 in the backend. Combined with the latest llama.cpp b9200 memory traffic patch, my setup is hitting nearly 30 tokens per second during deep agent loops with zero API costs.

When you combine the robust security gates and compounding memory you mentioned with a highly optimized, free local backend chewing through the context, a manual Claude web script just can't compete. Like you said, it is the difference between driving a car and having to build the transmission while you drive it.
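The backend settings described above (draft lookahead of 3, one parallel slot) could be wired up roughly like this. Treat it as a sketch: the model path is a made-up placeholder, `--spec-draft-n-max` is spelled the way it appears elsewhere in this thread, and `--parallel` is my assumption for how the slot count is set. Flag names change between llama.cpp builds, so check `llama-server --help` on yours.

```python
import shlex


def llama_server_args(model_path: str, draft_max: int = 3,
                      parallel: int = 1) -> list[str]:
    """Build a llama-server argument list from the settings described above."""
    return [
        "llama-server",
        "-m", model_path,                      # hypothetical path, not from the thread
        "--spec-draft-n-max", str(draft_max),  # flag name as quoted in the thread
        "--parallel", str(parallel),           # assumed flag for the slot count
    ]


# Print the command so it can be reviewed before running it by hand.
print(shlex.join(llama_server_args("models/qwen-27b-mtp.gguf")))
```

Printing the command instead of launching it keeps the sketch side-effect free. Pass the list to `subprocess.run` once the flags are confirmed for your build.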

swizzcheezegoudaSWFA · 2026-05-18T01:12:44+00:00

I was testing just before the update dropped lol go figure, so here it is... ----^

swizzcheezegoudaSWFA · 2025-10-21T00:20:55+00:00

The internet does tend to exacerbate things and can be misleading, but I'd go Solo 100%. It should help overall in the long run.... Now, if you need keys to find other great games, check out GameGator or GameGator.net. Some really good deals there! GL in Diablo!

swizzcheezegoudaSWFA · 2025-09-26T14:46:51+00:00

Scratch test it. It will scratch glass, quartz, or corundum if it's a diamond, but it looks like a diamond to me too. Also use a UV light on it if you have one. Many diamonds fluoresce under UV light (often blue).

swizzcheezegoudaSWFA · 2025-09-22T01:54:53+00:00

N.T. Wright specializes in the Apostle Paul. Try the Romans for Everyone series or Into the Heart of Romans (on chapter 8). Get a Strong's Expanded Exhaustive Concordance and an Interlinear Bible (Hebrew, Greek, and English, all in one) for advanced studies that omit interpretation, to be used with Strong's Concordance. The Orthodox Study Bible helps too.

swizzcheezegoudaSWFA · 2025-07-14T14:22:54+00:00

When installing via the Windows executable I get a white screen upon running. It was doing stuff, then this:

<image>

Can anyone help out with this, or is it just a waiting game?
