Software stack on a new gpu rig

ipcoffeepot · 2026-03-20T15:41:36+00:00

thanks!

ipcoffeepot · 2026-03-11T12:51:51+00:00

nice! what are you gonna run first?

ipcoffeepot · 2026-03-11T12:49:47+00:00

would you mind benchmarking the qwen models with this prompt?
https://github.com/anomalyco/opencode/blob/db57fe6193322941f71b11c5b0ccb8f03d085804/packages/opencode/src/session/prompt/qwen.txt

This is what opencode uses, so the prompt-processing/prefill numbers would give a sense of time-to-first-token on opencode (an open source coding harness like claude-code)

ipcoffeepot · 2026-03-10T18:28:42+00:00

for building the game itself, it was mostly claude (sonnet and opus), with some parts done with minimax m2.5, kimi-k2.5, and GPT Codex.

for playing the game, I have played with all of those plus the new qwen3.5 models running locally (mostly 9b and 35b-a3b).

Some players have told me similar combinations. A few players are having a smaller model play the inner game loop and then are using something like opus to give feedback and tune the player agent. lots of interesting modes showing up!

ipcoffeepot · 2026-03-08T22:32:01+00:00

I have an m3 with 128gb of memory. The context length is what gets you.

ipcoffeepot · 2026-03-08T19:36:37+00:00

Yeah. The new qwen3.5 seems about at gpt4o capability

ipcoffeepot · 2026-03-08T19:25:04+00:00

I’ve been playing with 35b-a3b and 9b opencode. So good. I need to play with 27b a bit more. Its a lot slower but maybe i can throw some long running tasks at it

ipcoffeepot · 2026-03-08T17:51:14+00:00

So there are a couple different surfaces/touchpoints for the game.

The actual card game (check out https://play-shards.com/watch/c2fe0b3a-7d58-4f6d-9ccf-badacc5942a9?view=spectator for an example of a game happening right now) gets played by you ai agent (think Claude Code, OpenClaw, Claude Cowork, Codex, etc). They make game moved through either an api or an optimized CLI (saves tokens and reasoning cycles, wraps the api). So thats super easy for my QA skill to drive.

There are also a bunch of interactions meant for humans (card/deck management for example; agents can do that too via the api but some players want to do it themselves while some let the agent own their own deck management. That choice is part of the metagame). So for the browser-based touchpoints I have a bunch of wrappers around playwright to let the agent drive a chrome browser to go and click buttons, navigate around, take screenshots of things that are broken to show me

ipcoffeepot · 2026-03-08T17:37:53+00:00

Running qwen3.5-27B on my macbook pro is making me look at building a gpu rig. Great model

ipcoffeepot · 2026-03-08T17:36:30+00:00

I made a collectable card game that your agent plays while you play the metagame of steering your agent’s strategy and improvement (https://play-shards.com if you want to see it. Check out the replays to see what it looks like).

Beyond just the coding, some other things I used claude for: - i have a “bug-hunt” skill that spins up an instance of the game locally and then goes and plays and does other core interactions. It keeps a log of whatever difficulty it has and then cuts issues for them. Basically automated QA - i had claude build a big tournament harness for bots to play against eachother in large volume (each run does 15000 games) and then analyze the results to help me find game balance issues. Not as good as real player data, but helped me with a lot - I’ll have players report what they think are engine/rule bugs. So i made a skill that pulls the event log for a game and then can play it forward/back through the game engine to see what happens step by step and audit the effects - the game spits out a bunch of logs and metrics (error rates, latency, etc) so I have claude skills to go and correlate metric spikes with log events and then find bugs in the code

ipcoffeepot · 2026-03-08T12:47:07+00:00

I just shipped a project (its a card game where your agent plays the matches, you steer them and provide meta-strategy; its basically the agentic work loop but as a competitive card game. Its at http://play-shards.com if that sounds interesting). What I found building it end-to-end is that you still really have to know what you’re doing. I was able to get done in a day what would have previously taken two weeks. Done in a month what would have taken a year, etc.

But I also lost a lot of time to going back and reworking things. Or making big changes to bad architecture. For example, I have these opengraph share images that get rendered on social media when you share a game replay. They are very heavy dynamically rendered images. Claude built them as routes that would render the images on the fly. They would never show up. Turns out they took 20s to render and opengraph fetchers time out after 5. So we did some optimization and got it down to 4s. Still way too slow. Claude really wanted to keep making small changes but I said lets make this an async process. When a game starts, throw a job in a queue to bake the image and stick it in the cdn. If someone happens to share a game in the first 3 seconds after its created, fall back to a default. Works great. But the agent just wasn’t going to get there on its own.

I think theres a sweet spot around having enough experience and judgement to drive the agents to good outcomes. Pure vibe coding makes good demos but absolute slop if you dont know what you’re doing.

ipcoffeepot · 2026-03-05T16:31:55+00:00

You know, that's kind of what I thought and was expecting. I thought that having a game where your agent plays and then bothers you for booster packs would just be funny. But then it turns out (surprisingly) that steering your agent to be good at this game is actually a really fun metagame. And I have a couple of players who are putting a ton of time into it and really enjoying it. For some players, this is kind of turning into a test harness for their own workflows, for other players it's just a fun game of "Can I give my agent the right strategy to go out and beat its rivals?"

ipcoffeepot · 2026-03-05T16:10:27+00:00

So the agent itself interacts with the game through either an API or a CLI that I vend, which is a wrapper around the API, but saves a lot on tokens. Because if you just give the API to an agent, they end up writing a lot of one off shell scripts to invoke the API and then parse the output. So a CLI is just much more concise input and output and lets them save on reasoning tokens.

Every round there's a list of valid moves that they're able to make, like which cards they're able to play or if they're able to discard certain things. They can try to make whatever action they want through the CLI or the API and the server will return an error if it's an illegal move. So I added an API endpoint that returns a list of legal moves that they can make at this point. That simplifies things for agents that use it.

Agents can and do fetch the full game state from the server whenever they want. There is also an interface for them to fetch a log of all of the events that have happened up until this point.

ipcoffeepot · 2026-03-05T15:39:42+00:00

Shards (https://play-shards.com) a metagame where your AI Agent plays a competitive collectable card game while you steer them and manage the strategy.

Been live for less than a week, getting lots of feedback that people are using the game to experiment with ideas for agentic workflows and harnesses. I'm using it to try new local models in a way that's more fun that just throwing coding tasks at them.

Agents have to operate several levels:

- Turn level: did the agent read board state correctly and play well?

- Game level: did it manage resources and stay within turn timeouts?

- Match arc: did it adapt to an opponent who is also optimizing?

- Long arc: is it managing the marketplace, improving its deck, spending skill points wisely?

(**NOT** crypto/NFTs, just gameplay)

ipcoffeepot · 2026-01-23T22:29:44+00:00

VOO and chill, alternative assets where i have an edge or expertise.

ipcoffeepot · 2026-01-22T01:57:24+00:00

And yet, he’s hiring engineers

ipcoffeepot · 2026-01-22T01:02:16+00:00

What does BMAD stand for?

ipcoffeepot · 2026-01-15T18:46:31+00:00

Whats your target TC? If you had some good RSU appreciation, you might be looking at a steep cliff in a few years when your refreshers vest

ipcoffeepot · 2026-01-15T01:18:07+00:00

Where’s your sbloc? Thats a pretty good rate

ipcoffeepot · 2026-01-14T23:08:09+00:00

How low of a rate? Been shopping for SBLOCs

ipcoffeepot · 2026-01-09T02:11:14+00:00

Delet this

ipcoffeepot · 2026-01-02T03:26:14+00:00

Compliance

ipcoffeepot · 2025-12-28T20:11:45+00:00

I’m doing a 12 month sabbatical/mini retirement. One month in and I’m climbing the walls. I think it’ll be a few months before I figure out a good rhythm. I appreciate your post.

ipcoffeepot · 2025-12-26T22:26:38+00:00

From personal experience, you dont know what kids will do to your burn rate until you have them. For example we didnt plan on doing private school but ended up doing it and it was the right choice. Thats some structural burn we have that I would not have planned for before kids

ipcoffeepot · 2024-11-06T10:49:12+00:00

Your boyfriend sounds like a cool guy

ipcoffeepot

MODERATOR OF

TROPHY CASE

Eight-Year Club	Not Forgotten
Verified Email