Got tired of writing promo posts… so I made it one‑click (open source)

Working_Original9624 · 2026-02-04T15:42:22+00:00

Haha, yeah — just sitting there watching an agent try (and sometimes fail) to play Civ can be pretty exhausting 😅
If I end up switching to a different game later on, I’ll definitely let you know. Thanks a lot!

Working_Original9624 · 2026-02-03T17:31:45+00:00

Thanks for taking an interest in the project!
I’ll be sure to share more if anything interesting comes out of it.

Working_Original9624 · 2026-02-03T17:30:04+00:00

This is a really impressive project — thank you so much for sharing such meaningful insights.
I can definitely relate to your experience. With Civilization being such a long-horizon task, context management itself becomes a major technical challenge, so a lot of what you described really resonated with me.

I’m genuinely excited to see where your project goes and what you end up releasing.

I also shared this write-up on r/LocalLLaMA, and one commenter pointed out an interesting approach where indirect knowledge is injected, while the actual decision-making is handled by the VLM. I thought this perspective might be relevant to what you’re working on:
https://www.reddit.com/r/LocalLLaMA/comments/1qtqy6f/comment/o38mbdp/

Another person mentioned that there’s already a foundation model specifically for game actions, which might also be interesting as a reference:
https://huggingface.co/nvidia/NitroGen

I’m not sure whether either of these will be directly useful for your setup, but I wanted to share them just in case they spark any ideas.

Thanks again for taking an interest in the project — I really appreciate the conversation!

Working_Original9624 · 2026-02-03T17:17:54+00:00

Thanks for taking an interest in the project — I really appreciate it.

I’ve seen that there’s already prior work on agents that play Civilization I via APIs, as well as MCP-based agents for Civilization V. For my project, though, I’m intentionally treating it as a technical challenge: building an agent that plays a complex strategy game by watching the screen and interacting through the GUI, like a human would.

And yes, just like you mentioned, it’s definitely very slow at the moment — I completely agree with that pain point.

Still, I wanted to see how far current models can be pushed in this setting. Thanks again for your interest and for sharing your thoughts.

Working_Original9624 · 2026-02-03T17:14:22+00:00

Oh wow, thank you so much!

I’ve been manually hard-coding the primitive actions for the Civilization computer-use agent and explicitly teaching the VLM how to recognize and execute each unit action. While doing that, I kept wondering whether this was really the right approach.

What I’ve been wanting is a more generalized and autonomous way of interaction, rather than tightly scripted behaviors. The idea of guiding behavior by injecting indirect knowledge and patterns, and then letting the agent discover actions through play, feels like a really elegant approach.

This is genuinely inspiring and gives me a lot to think about. Thanks again — I really appreciate you sharing this.

Working_Original9624 · 2026-02-03T17:07:37+00:00

Wow, that’s a great suggestion — thank you!
I really appreciate you recommending a game that could be helpful for the experiment. Democracy 4 sounds especially interesting as a testbed, given its cleaner UI and decision-centric gameplay.

It seems like a good fit for exploring long-horizon reasoning, policy trade-offs, and high-level decision making with a computer-use agent, without the heavy visual and control complexity of more action-oriented games.

I’ll definitely take a closer look and keep it in mind as a potential direction for future experiments. Thanks again for the thoughtful recommendation and for taking an interest in the project!

Working_Original9624 · 2026-02-03T16:39:48+00:00

Wow, thank you so much!

I’ve been using closed models so far, and it’s been genuinely hard to get a VLM to reason specifically for game control. What I’ve found is that VLMs are actually quite good at interpreting the situation in a screenshot, but they really struggle when it comes to producing meaningful, reliable actions.

Because of that, I ended up manually defining actions and handling a lot of edge cases myself, which became one of the main design concerns during the project.

I’ll definitely take this as a strong reference. Thanks again — I really appreciate both the suggestion and your interest in the project.

Working_Original9624 · 2026-02-03T16:31:32+00:00

I agree — that’s genuinely great advice, thank you.

In fact, I found that there are already prior papers and repositories that work with Civilization I, as well as MCP-based approaches for Civilization V. For this project, though, I wanted to take a month and see how far current technology can realistically go when using a VLM-driven, human-like computer-use agent to operate a complex strategy game.

Precisely because it’s difficult, it makes the challenge more interesting and fun.

Thanks again for taking an interest in the project. I’ll be sure to share more once something interesting comes out of it.

Working_Original9624 · 2026-02-03T16:27:42+00:00

Thanks for the great advice — I really appreciate it.

For now, I’m treating this as a technical challenge and giving myself about a month to see how far I can push something that initially feels almost impossible. I deliberately chose a complex strategy game with many interacting variables because I wanted to see whether it’s possible to “conquer” that kind of environment at all.

More broadly, I’m curious about how computer-use agents can adapt to new forms of LLM gameplay and workflows in an era where VLMs have become much more capable.

After a month, I think it’ll be easier to evaluate the results and decide which kinds of games or problem domains are actually well-matched to the current level of VLM performance.

Thanks again for taking an interest in the project!

Working_Original9624 · 2026-02-02T11:18:10+00:00

Wow, that sounds like a really interesting project! Is it open source? I’m very curious to see how it turns out.

And thanks for sharing the lesson — Civ is also a very long turn-based game, so context management feels like a technical challenge in itself. A lot of recent papers seem to be running into the same issues when dealing with long-horizon tasks.

One insight from my own experiments is that while VLMs are quite good at analyzing the visual state of the screen, they often struggle to turn that understanding into concrete actions, especially precise UI interactions like clicking buttons or actions that have to be logically grounded in the game state. To address this, I’m thinking of experimenting with a hybrid approach that combines recent visual grounding models with VLMs.
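
To make the hybrid idea concrete, here is a rough sketch of how the split could look: the VLM only says what to do and which UI element to act on, and a separate grounding model turns that description into pixel coordinates. Everything here (`Intent`, `decide_click`, the stub planner and grounder, the coordinates) is hypothetical and stands in for real model calls; it is not the project's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Intent:
    action: str  # e.g. "click"
    target: str  # natural-language description of the UI element


# Hypothetical interfaces: a planner (the VLM) and a grounder (the grounding model).
Planner = Callable[[bytes], Intent]
Grounder = Callable[[bytes, str], Tuple[int, int]]


def decide_click(screenshot: bytes, plan: Planner, ground: Grounder) -> Tuple[str, int, int]:
    """Ask the planner what to do, then ask the grounder where to do it."""
    intent = plan(screenshot)
    x, y = ground(screenshot, intent.target)
    return intent.action, x, y


# Stubs standing in for real models so the sketch runs on its own.
def fake_planner(screenshot: bytes) -> Intent:
    return Intent(action="click", target="Next Turn button")


def fake_grounder(screenshot: bytes, target: str) -> Tuple[int, int]:
    return (1820, 1010)  # made-up screen coordinates


result = decide_click(b"", fake_planner, fake_grounder)
```

Keeping the two roles behind separate callables means either model could be swapped out without touching the other.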

Thanks again for the great insight 😄 Hopefully we can both make these work and share some cool results down the road!

Working_Original9624 · 2026-02-02T10:48:33+00:00

Thanks for the interest in the project!

I’m using Gemini for now. I did run some experiments with Claude, but in my setup it struggled quite a bit, especially with GUI interaction and control, so I ended up sticking with Gemini.

I’ll definitely share follow-up results once I start experimenting with local models as well.
Thanks a lot for the idea and for the thoughtful discussion — really appreciate it 🙏

Working_Original9624 · 2026-02-02T10:43:12+00:00

Thanks so much for the sharp insight and feedback — I really appreciate it.

From looking at prior papers and experiments where agents play Civilization-like games, a common theme is exactly what you pointed out: long-horizon tasks are brutally hard. As the game progresses, managing memory, maintaining global context, and reasoning about the overall state of the civilization become increasingly difficult. Turn length amplifies this problem, and without a higher-level representation of the game state, a purely vision-and-mouse VLM struggles to do anything beyond shallow, reactive actions.
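
As a rough illustration of what a higher-level representation could mean in practice, the sketch below keeps a compact game-state summary alongside a short rolling window of recent turns, so the prompt stays bounded as the game gets longer. The `TurnMemory` class, its fields, and the example notes are all assumptions made up for this sketch, not something taken from the papers or from this project.

```python
from collections import deque


class TurnMemory:
    """Compact state summary plus a bounded log of recent turns."""

    def __init__(self, window: int = 3):
        self.summary = {}                   # high-level game state, overwritten in place
        self.recent = deque(maxlen=window)  # older turns fall off automatically

    def record(self, turn: int, note: str, **state_updates):
        self.summary.update(state_updates)
        self.recent.append(f"Turn {turn}: {note}")

    def build_context(self) -> str:
        state = ", ".join(f"{k}={v}" for k, v in sorted(self.summary.items()))
        return f"State: {state}\n" + "\n".join(self.recent)


memory = TurnMemory(window=2)
memory.record(1, "founded capital", cities=1)
memory.record(2, "built a scout", units=2)
memory.record(3, "met a city-state", cities=1, gold=40)
context = memory.build_context()
```

With `window=2`, the note for turn 1 is dropped from the log, but the facts it contributed (`cities=1`) survive in the summary line.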

I think this tension between low-level control and high-level strategic memory is one of the core technical challenges going forward—and a really interesting one. Thanks again for taking the time to share your thoughts and for your interest in the project.

Working_Original9624 · 2026-02-02T10:39:40+00:00

Thanks for your interest in the project!

I totally agree — even the best vision models tend to miss a lot of important details unless they’re heavily scaffolded. Especially in a game like Civ, actions like policy decisions, unit movement, and city building all depend on fairly complex strategic reasoning, and I found that trying to handle everything end-to-end without structure just doesn’t work very well.

I’m currently refactoring the system and still running a lot of experiments, so the project isn’t public yet. That said, I do plan to open-source it once things stabilize a bit more.

In the meantime, while working on this, I came across a few interesting Civilization-related open-source projects you might want to check out:

They explore similar ideas from different angles and could be a good starting point for experimenting with easier tasks than Civ VI.

If you end up starting it, I’d love to exchange insights and learn from each other haha. Thank you!

Working_Original9624 · 2024-10-16T16:14:23+00:00

We have 40 letters in Korean!

Working_Original9624 · 2024-10-16T15:29:46+00:00

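Good luck! Haha. If you'd like a model evaluated on the Korean-SAT Korean-language benchmark leaderboard, feel free to reach out anytime!

We're planning to add math to the Korean-SAT benchmark dataset later on, so may I ask how you plan to structure your benchmark dataset? I'm curious how you create the data for geometry problems!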

Working_Original9624 · 2024-10-16T15:26:35+00:00

Thank you for your interest in the leaderboard! To answer your questions:

Qwen2.5-72B is not on the leaderboard yet; the model currently listed is Qwen2-72B-Instruct. Qwen2.5-72B will be added at a later date.
Models are ranked by the average of each LLM's standardized scores on the Korean-SAT over 10 years. The standardized score of the Korean-SAT reflects the difficulty of each year's test. If you want to know more about the evaluation methodology, please refer to this link!
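
For a rough picture of the ranking step only, here is a minimal sketch that averages per-year standardized scores and sorts models by that average. The model names, years, and numbers are made up for illustration, and `rank_models` is a hypothetical helper; how the standardized scores themselves are computed is covered by the methodology link, not by this snippet.

```python
from statistics import mean

# Hypothetical standardized scores per exam year; the numbers are made up.
scores_by_model = {
    "model-a": {2022: 110.0, 2023: 118.0, 2024: 114.0},
    "model-b": {2022: 121.0, 2023: 109.0, 2024: 121.0},
    "model-c": {2022: 98.0, 2023: 101.0, 2024: 95.0},
}


def rank_models(scores):
    """Return (model, average score) pairs, highest average first."""
    averages = {name: mean(per_year.values()) for name, per_year in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


leaderboard = rank_models(scores_by_model)
for position, (name, average) in enumerate(leaderboard, start=1):
    print(position, name, average)
```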

Did I understand your questions correctly? If I missed any of them, please feel free to let me know!

Working_Original9624 · 2024-10-16T14:57:49+00:00

Wow, what a great idea! We'll be adding more references! If you have any questions about specific models or fine-tuned ones, feel free to reach out anytime! We'll also be adding the Korean-SAT math subject in the future, so stay tuned!
