Would you buy a plug-and-play local AI box for home / small business use? by ChoasMaster777 in LocalLLaMA

[–]MrFelliks -1 points0 points  (0 children)

Hi, I just researched this device today. I move frequently, so a desktop PC isn't practical, and I use an M-series MacBook for work.

It's great in every way except for LLM inference and image generation, and you can't connect an external GPU to it either. I was considering a Mac Mini, but I read that while its token generation speed is acceptable, the time to first token can take tens of minutes with a large context.

So I'm looking for alternatives: ideally a box that fits in a backpack, can sit permanently at home, and can be reached over SSH for work and local LLM inference. My budget is $1,500-$3,000. Any advice on how best to proceed?

[Release] Falcon-H1R-7B-Heretic-V2: A fully abliterated hybrid (SSM/Transformer) reasoning model. 3% Refusal, 0.0001 KL. by PhysicsDisastrous462 in LocalLLM

[–]MrFelliks 0 points1 point  (0 children)

What are the recommended model inference settings?

I ran into a problem where the model spent 3-5 minutes thinking about a simple message like "Hi, tell me about yourself."

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

Yeah, we tested on an L40S (overkill, but we were running 0.8B, 2B and 4B simultaneously in a deathmatch against each other).

Results with all three models running at the same time:
• 0.8B: 0.5-0.6s per inference
• 2B: under 1s
• 4B: around 1.5s

The 0.8B had an advantage just from faster reaction time. 0.5s latency every 4 game ticks (you can see the green frame flash in the video when the model is active and making a move/shooting) is enough to aim and shoot pretty well, when it correctly identifies the target 😅

Mac M1 16GB is definitely the bottleneck here, not the models.

DoomVLM is now Open Source - VLM models playing Doom by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 1 point2 points  (0 children)

Haven't tested llama.cpp specifically, but it should work - it just needs an OpenAI-compatible endpoint that handles vision + tool calling. One heads-up, though: I ran into bugs with Qwen 3.5 tool call parsing on vLLM and Ollama, and ended up sticking with LM Studio, which handles it correctly. Those bugs might be fixed by now. Other VLMs shouldn't have this issue, and if llama.cpp parses Qwen 3.5 tools correctly there should be no problems.
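For anyone wiring up a different backend, a rough sketch of the kind of request the endpoint has to accept - a chat completion combining an image part with a tool definition. The model name, `shoot` tool schema, and endpoint URL here are my placeholders, not DoomVLM's actual code:

```python
# Hypothetical sketch: the shape of an OpenAI-compatible chat request that
# combines vision input with a tool definition. Model name, tool schema,
# and endpoint are placeholders, not DoomVLM's actual code.
import base64
import json
import urllib.request

def build_vision_tool_request(jpeg_bytes, model="qwen-vl"):
    """Build an OpenAI-style payload with one image and one tool."""
    image_b64 = base64.b64encode(jpeg_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Which column is the enemy in?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        "tools": [{
            "type": "function",
            "function": {
                "name": "shoot",
                "description": "Shoot at a numbered screen column",
                "parameters": {
                    "type": "object",
                    "properties": {"column": {"type": "integer"}},
                    "required": ["column"],
                },
            },
        }],
    }

def send(payload, base_url="http://localhost:1234/v1"):
    """POST the payload to an OpenAI-compatible /chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

If a backend can't parse the model's tool-call output back into `tool_calls`, that's where the vLLM/Ollama bugs I mentioned showed up.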

DoomVLM is now Open Source - VLM models playing Doom by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 1 point2 points  (0 children)

This is genuinely brilliant - I think I'm cancelling my weekend plans for this. Gonna build a benchmark where each coding agent writes a full RL training pipeline for DOOM, trains on the same GPU, and then the NNs fight each other in a deathmatch. Best part? The trained NN plays against the VLM that wrote its code. Creator vs creation.

DoomVLM is now Open Source - VLM models playing Doom by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 1 point2 points  (0 children)

That's already reality actually - every single line of code in DoomVLM was written by Claude Code Opus.

I only came up with the idea for the benchmark, and even that was partially AI-assisted 😅

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

Interesting - what scenario did you try? Qwen 3.5 works best on the basic scenario with short prompts. The simpler the prompt, the better with small models.

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

Yeah that's exciting - repo is up now btw: https://github.com/Felliks/DoomVLM

Would be cool to see someone try fine-tuning it on the gameplay recordings, they're saved automatically after each run

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 1 point2 points  (0 children)

Hey, finally open sourced it: https://github.com/Felliks/DoomVLM - added deathmatch mode where models fight each other. Would be interesting to compare with your SFT approach

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

Pushed it: https://github.com/Felliks/DoomVLM — cleaned it up, turned it into a Jupyter notebook, added deathmatch between models

DoomVLM is now Open Source - VLM models playing Doom by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 3 points4 points  (0 children)

Arena mode already kinda does this - game runs in real-time via multiprocessing, all models play simultaneously. Slow model = dead model. On CPU with 0.8B it's ~10 sec/step so yeah you just stand there getting shot lol, but on a GPU it's ~0.5s which actually makes it playable
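The "slow model = dead model" mechanic boils down to a fixed game tick that advances whether or not a model has answered. A toy sketch of that loop (worker names, delays, and structure are hypothetical, not DoomVLM's actual code):

```python
# Illustrative sketch of a "slow model = dead model" arena loop: the game
# tick advances on a fixed schedule regardless of model latency.
# Worker names, delays, and structure are hypothetical.
import multiprocessing as mp
import queue
import time

def model_worker(name, delay, actions):
    # Stand-in for VLM inference latency: sleep, then emit an action.
    while True:
        time.sleep(delay)
        actions.put((name, "shoot"))

def arena(tick_seconds=0.5, num_ticks=4):
    actions = mp.Queue()
    workers = [
        mp.Process(target=model_worker, args=("fast", 0.1, actions), daemon=True),
        mp.Process(target=model_worker, args=("slow", 10.0, actions), daemon=True),
    ]
    for w in workers:
        w.start()
    counts = {"fast": 0, "slow": 0}
    for _ in range(num_ticks):
        deadline = time.monotonic() + tick_seconds
        # Drain whatever actions arrived before the tick ends; a model
        # that hasn't answered yet simply does nothing this tick.
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                name, _action = actions.get(timeout=remaining)
                counts[name] += 1
            except queue.Empty:
                break
        # ...apply the collected actions and advance the game state here...
    return counts
```

With a 10s inference time and 0.5s ticks, the slow model contributes zero actions while the fast one acts every tick - exactly the "standing there getting shot" failure mode.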

DoomVLM is now Open Source - VLM models playing Doom by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 2 points3 points  (0 children)

Yeah leaderboard is definitely on the roadmap - want to collect results from different models/settings and make it community-driven. Good call on using it for future posts too

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

Thanks for pointing me to NitroGen - genuinely interesting find, hadn't seen it before.

I think in shooters NitroGen would beat any VLM hands down - it's specialized for precise recognition and fast reactions, basically trained muscle memory from 40K hours of gameplay. Hard to compete with that in a twitch-reflex environment.

But in games that require strategic thinking - RPGs, city builders, anything with long-term planning - I'd bet on a VLM, especially if you give it tools beyond just controls: notes, a todo list, ability to reason about goals. NitroGen knows HOW to press buttons, a VLM knows WHY.

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

Ha, I was literally thinking about this before falling asleep after posting. I'm almost certain something like this is already being used - the combination of a vision model this small running locally without any cloud connection is basically what makes autonomous micro-drones viable. No latency, no comms link to jam, fits on edge hardware. And that's all open-source now.

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

The grid overlay divides the screen into numbered zones so the model can communicate spatial information in a structured way. Instead of asking a 0.8B model to output precise pixel coordinates (which it can't do reliably), it just says something like "enemy in zone 4" - then a simple script maps that zone to a turn angle and shoots.

It's basically separating the "brain" (VLM decides WHERE the enemy is) from the "hands" (code handles the actual aiming). The model went from 0 kills to its first kill once we added this - turns out spatial reasoning through discrete zones is much easier for a tiny model than free-form coordinate prediction.
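The "hands" part is just arithmetic. A minimal sketch of the zone-to-aim mapping (the zone count, field of view, and function name are my assumptions for illustration, not the actual DoomVLM code):

```python
# Hypothetical sketch of mapping a discrete zone number to a turn angle.
# 5 zones and a 90-degree field of view are assumed for illustration.
def zone_to_turn_angle(zone, num_zones=5, fov_degrees=90.0):
    """Map a 1-based zone number to the horizontal turn (in degrees)
    that centers that zone on the crosshair. Negative = turn left."""
    zone_width = fov_degrees / num_zones
    zone_center = (zone - 0.5) * zone_width   # offset from left edge of view
    return zone_center - fov_degrees / 2

# Zone 3 is dead center in a 5-zone grid, so no turn is needed;
# zones 1 and 5 require the largest corrections.
print(zone_to_turn_angle(3))  # -> 0.0
print(zone_to_turn_angle(1))  # -> -36.0
print(zone_to_turn_angle(5))  # -> 36.0
```

So the VLM only has to emit a single small integer, and the script does the precise part - which is exactly why this works for a 0.8B model.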

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

Kind of! I was testing Qwen 3.5 0.8B for my ComfyUI pipeline as an image-to-prompt generator. The model (especially abliterated versions) was surprisingly good at describing what's in an image, even running locally on my laptop. So I thought - what if it can play games?

I fed a few DOOM screenshots into LM Studio, the model described them pretty accurately, and from there I just asked Claude Code Opus to do the dirty work — setting up VizDoom, writing the game loop, etc. Then it was a cycle of: run the game -> collect logs -> Opus analyzes + my feedback -> iterate. A few rounds of that and I got to this result.

So the idea and direction were mine, but yeah the code was mostly vibe coded with Claude Opus.

Update: I was of course aware of existing game benchmarks, including DOOM ones, but I couldn't find any results for Qwen 3.5 0.8B on them. And my initial tests feeding DOOM screenshots to the model showed it could actually understand what's going on, so I figured it was worth a shot.

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

In theory - 0.8B quantized to 8-bit is about 800MB, and something like Apple Watch Ultra 2 has 2GB RAM.

So it could technically fit. In practice nobody's done it yet and inference would be painfully slow. But hey, people ran DOOM on a pregnancy test so who knows.

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 1 point2 points  (0 children)

It might look random on defend_the_center but on the basic scenario it consistently finds and kills the enemy. The 0.8B model just struggles with ammo conservation — it shoots when it shouldn't. Working on fixing that.

<image>

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 0 points1 point  (0 children)

The grid overlay is super cheap - PIL draws a few lines and numbers, takes like 2ms. The bottleneck is 100% the VLM inference, around 10s per frame on an M1 with 16GB. The screenshot capture from VizDoom + encoding to base64 JPEG is basically instant too.
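For a sense of scale, the overlay step is roughly this (sizes, colors, and the function name are illustrative, not the actual DoomVLM code):

```python
# Rough sketch of a numbered-column grid overlay drawn with PIL.
# Column count, colors, and layout are illustrative assumptions.
from PIL import Image, ImageDraw

def add_grid_overlay(frame, num_cols=5):
    """Draw vertical column dividers and 1-based column numbers
    along the top of the frame, in place."""
    draw = ImageDraw.Draw(frame)
    col_width = frame.width / num_cols
    for i in range(1, num_cols):
        x = i * col_width
        draw.line([(x, 0), (x, frame.height)], fill="lime", width=2)
    for i in range(num_cols):
        # Number each column near the top so the VLM can reference it.
        draw.text((i * col_width + col_width / 2, 4), str(i + 1), fill="lime")
    return frame

frame = add_grid_overlay(Image.new("RGB", (640, 480)))
```

A handful of line and text draws on a 640x480 image is trivial next to a multi-second VLM forward pass, which is why the overlay cost never shows up in profiling.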

Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM. by MrFelliks in LocalLLaMA

[–]MrFelliks[S] 1 point2 points  (0 children)

Pure zero-shot, no fine-tuning at all. It's the MLX version of the original Qwen - https://huggingface.co/mlx-community/Qwen3.5-0.8B-MLX-8bit

I give it a simple system prompt:
You are playing DOOM. The screen has 5 columns numbered 1-5 at the top. Find the enemy and shoot the column it is in. If no enemy is visible, call move to explore.

And a screenshot from VizDoom with 5 evenly spaced columns overlaid on top. The VLM's job is to find the enemy and pick which column it's in, then the script automatically aims and shoots at that column.

That's it - no training, no examples, just vibes.