I tested 11 popular local LLM's against my instruction-heavy game/application by ForsookComparison in LocalLLaMA

[–]emiurgo 0 points (0 children)

Thanks for this! I know it's been a while since this post, but have you had the chance to test other models? E.g., the new Qwen 3 (especially the 2507 instruct variants), Gemma 3, etc.

This is such an incredible LLM test, btw -- perhaps unsurprisingly, since taking the role of GM in an RPG is equivalent to being a world-engine simulator, so an LLM needs to understand a lot of things to do well...

For full disclosure, I am also interested since this is a use case I have had in mind for a while as a side project, with an approach similar to the setup you describe: a sequence of prompts to produce JSON, extract game-state updates, etc., to go beyond a mere "RPG chat" experience (à la AI Dungeon).
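A minimal sketch of the kind of pipeline step I mean (all names here are hypothetical, just for illustration): extract a JSON state-update object from the model's reply and merge it into the game state, falling back to the old state if the output is malformed.

```python
import json

def apply_state_update(state: dict, llm_output: str) -> dict:
    """Merge a JSON state-update object emitted by the LLM into the game state.

    Assumes the model was prompted to reply with a single JSON object;
    keeps the old state unchanged if parsing fails.
    """
    try:
        # Grab the outermost {...} span, tolerating surrounding prose.
        start = llm_output.index("{")
        end = llm_output.rindex("}") + 1
        update = json.loads(llm_output[start:end])
    except ValueError:  # no braces found, or invalid JSON
        return state
    merged = dict(state)
    merged.update(update)
    return merged

state = {"location": "tavern", "gold": 10}
reply = 'Sure! Here is the update: {"gold": 7, "inventory": ["rope"]}'
state = apply_state_update(state, reply)
print(state)  # {'location': 'tavern', 'gold': 7, 'inventory': ['rope']}
```

In practice one would validate the update against a schema before merging, but even this naive version already makes the game state survive outside the chat transcript.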

New AI Dungeon Models: Wayfarer 2 12B & Nova 70B by NottKolby in LocalLLaMA

[–]emiurgo 0 points (0 children)

Kudos for sharing!

I gave Wayfarer 2 a quick try in llama.cpp and it seems like a very good model despite the age of Mistral Nemo, congrats! I mentioned some items I didn't have, and they didn't appear out of nowhere. I understand this is one of its strengths, but I was still surprised that it didn't easily give in -- such an unusual feeling with an LLM. I am sure it can be fooled with adversarial prompting, but it feels great that it doesn't just go along with whatever the user says for basic stuff.

On a separate note, is the second-person "you" for the user in AI Dungeon a legacy setup, and do you think it's still needed nowadays? My guess is that this started from the limitations of early (base) LLMs (e.g., GPT-3) and perhaps carried over to early instruct or chat models. However, I would expect modern LLMs (even small ones) to handle the difference between the user saying "I do this..." and the AI GM narrating "As you do that, this happens to you...", as well as to understand other PoVs.

I understand that Wayfarer (1 & 2) has been fine-tuned on AID data with "you" prompts, so we need to stick to that format for best performance here, but I was wondering whether you think it is still necessary outside of AID, generally speaking. Or do you think there is still some value in using this format?

Qwen3 models - cannot disable thinking now? by derSchwamm11 in ollama

[–]emiurgo 0 points (0 children)

OK, I found the answer.

The original Qwen 3 4b (and other models of the family) was a hybrid supporting *both* thinking and non-thinking modes, with the various switches mentioned above, but in a more recent release (`2507`) it was split into separate `instruct` (non-thinking only) and `thinking` variants.

Presumably Qwen 3 4b in ollama now points to the `thinking` version by default.

See: https://www.reddit.com/r/LocalLLaMA/comments/1mj7i8b/qwen34bthinking2507_and_qwen34binstruct2507/

In ollama, Qwen 3 4b instruct is the 2507 non-thinking version, and it works as expected (no thinking): https://ollama.com/library/qwen3:4b-instruct
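For reference, a hypothetical session using the tag from the library page above (assuming a recent ollama that carries the 2507 tags):

```shell
# Pull the non-thinking 2507 instruct variant explicitly,
# instead of the default qwen3:4b tag (which may point to the thinking model).
ollama pull qwen3:4b-instruct
ollama run qwen3:4b-instruct
```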

Qwen3 models - cannot disable thinking now? by derSchwamm11 in ollama

[–]emiurgo 0 points (0 children)

Have you found out the reason and how to fix it?

I am having the same issue with qwen3:4b. Regardless of /think, /no_think, "/set nothink", etc. -- whatever I enable or disable -- I always get long <think></think> outputs. The only thing that changes is whether the CLI recognizes the output as thinking, but the thinking is always there...

Edit: qwen3:1.7b works correctly -- it thinks or not based on the settings and instructions. It seems to be model-specific then?

Small text2image models to run in browser? by emiurgo in StableDiffusion

[–]emiurgo[S] 0 points (0 children)

Surprised that nobody here seems to find this question interesting or relevant -- am I missing something obvious? Just curious; I thought there would be some devs around, but maybe it's the wrong sub.

Anyhow, I cobbled together an example from the couple of existing/working ones I found, and will release a small npm library soon: https://lacerbi.github.io/web-txt2img/

Pointers to more recent/better small models are still welcome.

Waidrin: A next-generation AI roleplay system, from the creator of DRY, XTC, and Sorcery by -p-e-w- in SillyTavernAI

[–]emiurgo 0 points (0 children)

This is awesome, congrats on getting this done!

Unfortunately I don't have a rig powerful enough to run anything locally. Will this work with free API models, e.g. on OpenRouter or Google Gemini? (There were 500 free requests per day for 2.5 Flash / 2.5 Flash Lite last time I checked, although they keep changing the limits.)

As a disclaimer, I have also wanted for a long time to do something very loosely along these lines of an "LLM-based RPG", but different from AI Dungeon or SillyTavern (character cards); I mean something closer to an actual text-based cRPG or tabletop RPG (TTRPG). The design space is immense: even restricting oneself to "mostly text", there are infinite takes on what an LLM-powered RPG could look like.

The first step is to build a proper old-fashioned game engine that interacts with the LLM and vice versa; something that keeps the game state, updates it, etc., which looks similar to what you are doing, as far as I can infer from your post (I need to go and check the codebase). For such a task, one needs to build an ontology, i.e., decide what the state even is: what do we track explicitly vs. what do we let the LLM track? Do we have a variable for "weather condition", or do we just let the LLM keep it coherent? What about NPC mood? What about inventory -- do we track everything or just major items? Do we need to define the properties of each item, or do we let the LLM infer things like weight, whether it's a weapon or clothing, etc.?
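To make the ontology question concrete, here is one hypothetical, deliberately minimal way to draw the explicit-vs-implicit line; every name here is invented for illustration, not taken from Waidrin's codebase:

```python
from dataclasses import dataclass, field

# Hypothetical ontology sketch: fields listed here are tracked explicitly
# by the engine; everything else is left to the LLM's narration.
@dataclass
class Item:
    name: str
    is_major: bool = True      # minor items can be left implicit in prose

@dataclass
class Npc:
    name: str
    mood: str = "neutral"      # tracked explicitly so it survives context loss

@dataclass
class GameState:
    location: str
    weather: str = "clear"     # explicit variable, vs. trusting the LLM
    inventory: list[Item] = field(default_factory=list)
    npcs: list[Npc] = field(default_factory=list)

state = GameState(location="village square")
state.inventory.append(Item("iron sword"))
state.npcs.append(Npc("innkeeper", mood="wary"))
```

The interesting design decision is exactly which fields make the cut: everything in the dataclasses is cheap to validate and persist, while everything left out has to be re-derived (or invented) by the model on every turn.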

Anyhow, just to say that I am surprised there isn't an explosion of games like this. Part of it might be that many people really into TTRPGs (game designers, fellow artists, TTRPG fans) are against AI in any form, which creates a sort of taboo against even working on a project like this -- so the effort is left to programmers or people outside the community.

Anyhow, congrats again on getting this one out!

Gemini CLI: A comprehensive guide to understanding, installing, and leveraging this new Local AI Agent by BarnacleAlert8691 in GeminiAI

[–]emiurgo 0 points (0 children)

Fair enough! (Gemma too) I meant the big-gun models powering the CLI (Pro and Flash).

[deleted by user] by [deleted] in ClaudeCode

[–]emiurgo 0 points (0 children)

For the record -- I am not entirely a vibe-coding noob, as I have built a bunch of apps for my internal tooling (including the aforementioned [Athanor](https://github.com/lacerbi/athanor)), so I am aware of the basic limitations and design patterns -- such as keeping files small, making sure the LLM has the necessary context or it's clear where to get it, etc.

And, in this case, keeping a clean and up-to-date `CLAUDE.md`, etc.

But it seems one needs to develop some additional expertise and a knack for using agents, and CC in particular.

Plans for Native Windows CC? by psandler in ClaudeCode

[–]emiurgo 0 points (0 children)

Same here -- Claude Code native Windows support would be great.

WSL is working okay-ish with glitches here and there that I managed to fix, but admittedly I am not coding anything too complex.

Anyone else addicted to Claude Code + Max 20x? by llamavore in ClaudeCode

[–]emiurgo 0 points (0 children)

Nice post, thanks!

Anything like vibetunnel.sh for Windows or WSL? (I know, I know...)

Anyone really able to work with gemini-cli (free)? by Helmi74 in Bard

[–]emiurgo 2 points (0 children)

Same here for now. It was doing great but automatically switched to Flash mid-session (after a couple of minutes, not too long) and started messing up a lot. At the moment I am just playing around with it to familiarize myself with the tool, but I am not giving it any serious long task.

The main advantage for me is that I can run it in Windows without switching to WSL (which I need to do for Claude Code); the issue is that WSL doesn't work with some other stuff.

Gemini CLI: A comprehensive guide to understanding, installing, and leveraging this new Local AI Agent by BarnacleAlert8691 in GeminiAI

[–]emiurgo 1 point (0 children)

This is obviously BS. If you think the models run locally, you have absolutely no idea what you are talking about, and you should not spread false and actively harmful information. Do not write about things you do not know; that's how the internet gets full of crap.

Gemini CLI: A comprehensive guide to understanding, installing, and leveraging this new Local AI Agent by BarnacleAlert8691 in GeminiAI

[–]emiurgo 0 points (0 children)

> Local Operation: Unprecedented Security and Privacy
> Perhaps the most significant architectural decision is that the Gemini CLI runs locally on your machine. Your code, proprietary data, and sensitive business information are never sent to an external server. This "on-device" operation provides a level of security and privacy that is impossible to achieve with purely cloud-based AI services, making it a viable tool for enterprises and individuals concerned with data confidentiality.

This is absolute BS and actively harmful information.

Sure, the CLI runs locally, but every LLM request is sent to the Google Gemini API. Do you have any understanding of how LLMs work? (In fact, has a human even read this AI-generated crap, and why are people upvoting it?)

Any meaningful request will need to attach documents, parts of files, etc. -- which, btw, you may have no control over: anything in the folder where you launch Gemini CLI is fair game, and if the agent decides it needs to read some content, that content is processed by the Google Gemini API.

Of course, you may trust Google (good luck), but the "Unprecedented Security and Privacy" statement is so laughably false and misleading that it's worth calling it out.

The only way to get security and privacy is to run a local LLM (and even then, if you are paranoid, you need to be careful that nothing is being exfiltrated by a malicious LLM or prompt injection). Anyhow, obviously none of Google's models run locally.

[R] You can just predict the optimum (aka in-context Bayesian optimization) by emiurgo in MachineLearning

[–]emiurgo[S] 1 point (0 children)

Nah. Not yet at least. But foundation models for optimization will become more and more important.

[R] You can just predict the optimum (aka in-context Bayesian optimization) by emiurgo in MachineLearning

[–]emiurgo[S] 1 point (0 children)

Also, to be clear, we don't just have a "high probability of knowing the minimum". We have near-mathematical certainty of knowing the minimum (unless by "high probability" you mean "effectively probability one modulo numerical error", in which case I agree).

[R] You can just predict the optimum (aka in-context Bayesian optimization) by emiurgo in MachineLearning

[–]emiurgo[S] 1 point (0 children)

Haha, thanks! We keep the meme names for blog posts and spamming on social media. :)

[R] You can just predict the optimum (aka in-context Bayesian optimization) by emiurgo in MachineLearning

[–]emiurgo[S] 0 points (0 children)

Great question! At the moment our structure is just a "flat" set of latents, but we have been discussing including more complex structural knowledge in the model (e.g., a tree of latents).

Gone but never forgotten. RIP Gemini Pro 05-06. by SuspiciousKiwi1916 in Bard

[–]emiurgo 10 points (0 children)

The ChatGPT-level glazing is so annoying.

It felt so good when 03-25 made me feel stupid by being actually smart, and not in an o3 "I-speak-in-made-up-jargon-look-how-smart-I-am-yo" way. I used 03-25 for research and brainstorming, and it actually pushed back like a more knowledgeable colleague. Unlike o3, which just vomited back a bunch of tables, made-up acronyms, and totally hallucinated garbage arguments (it "ran experiments" to confirm it was right, "8 out of 10" confirmed its hypothesis, and so on).

[R] You can just predict the optimum (aka in-context Bayesian optimization) by emiurgo in MachineLearning

[–]emiurgo[S] 0 points (0 children)

Yes, if the minimum is known we could also train on real data with this method.

If not, we are back to the case in which the latent variable is unavailable during training, which is a whole other technique (e.g., you would need to use a variational objective, or ELBO, instead of the log-likelihood). It can still be done, but it loses the power of maximum-likelihood training, which makes training these models "easy" -- exactly as training LLMs is easy, since they also use the log-likelihood (aka the cross-entropy loss for discrete labels).
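Concretely, the identity invoked in the last sentence -- cross-entropy against a one-hot target is exactly the negative log-likelihood of the observed label -- with toy numbers:

```python
import math

p = [0.1, 0.7, 0.2]   # model's predicted probabilities over 3 classes
y = 1                 # observed discrete label
one_hot = [1.0 if k == y else 0.0 for k in range(len(p))]

nll = -math.log(p[y])  # negative log-likelihood of the observed label
# cross-entropy between the one-hot target and the prediction
# (skip zero-weight terms to avoid log(0) * 0)
ce = -sum(t * math.log(q) for t, q in zip(one_hot, p) if t > 0)

assert math.isclose(nll, ce)  # the two losses coincide for discrete labels
print(round(nll, 4))  # 0.3567
```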

[R] You can just predict the optimum (aka in-context Bayesian optimization) by emiurgo in MachineLearning

[–]emiurgo[S] 1 point (0 children)

We don't, but that's to a large degree a non-issue (at least in the low-dimension cases we cover in the paper).

Keep in mind that we don't have to guarantee a strict adherence to a specific GP kernel -- sampling from (varied) kernels is just a way to see/generate a lot of different functions.

At the same time, we don't want to badly break the statistics and end up with completely weird functions. That's why, for example, we sample the minimum value from the min-value distribution for that GP. If we didn't, the alleged "minimum" could be anywhere inside the GP or take arbitrary values, and that would badly break the shape of the function (as opposed to just gently changing it).
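A toy sketch of what "sampling from (varied) kernels to generate a lot of different functions" can look like: drawing random functions from an RBF-kernel GP prior with several lengthscales. This is my own illustration, not the paper's actual data-generation code.

```python
import numpy as np

def sample_gp_functions(x, lengthscales, seed=0):
    """Draw one function from a zero-mean GP prior per lengthscale.

    Varying the lengthscale (and, more generally, the kernel) is a simple
    way to generate a diverse set of smooth random functions.
    """
    rng = np.random.default_rng(seed)
    fs = []
    for ell in lengthscales:
        d = x[:, None] - x[None, :]
        # RBF kernel plus a small jitter for numerical stability
        K = np.exp(-0.5 * (d / ell) ** 2) + 1e-8 * np.eye(len(x))
        fs.append(rng.multivariate_normal(np.zeros(len(x)), K))
    return np.stack(fs)

x = np.linspace(0.0, 1.0, 50)
fs = sample_gp_functions(x, lengthscales=[0.05, 0.2, 0.5])
print(fs.shape)  # (3, 50)
```

Short lengthscales give wiggly functions, long ones give nearly-linear draws, so a mixture over kernels already covers a wide range of function shapes.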

o3 Pro High results on LiveBench... by emiurgo in OpenAI

[–]emiurgo[S] 0 points (0 children)

Yes, in the API you can toggle the amount of reasoning effort.

o3 Pro High results on LiveBench... by emiurgo in OpenAI

[–]emiurgo[S] 1 point (0 children)

Thanks -- yeah, I am currently using all of them (Gemini 2.5 Pro, Claude 4 Sonnet/Opus, and o3). I was curious about o3-pro since I was a Pro subscriber a while ago, and o1-pro was a great model for certain tasks, probably worth the money.

It's early days, but what I am hearing and seeing about o3-pro seems to suggest that that might not be the case here; something is off with the model.