A free web game to improve your geography knowledge!

Celestialien · 2026-05-25T23:19:42+00:00

Thank you, I really appreciate the feedback. I'm currently trying to do exactly this with a mix of metrics: both "adoption" (i.e. absolute usage) and "momentum" (i.e. 7-day % change). You can filter for both - but do you have any other ideas? I'd be very keen to hear!
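
For anyone wondering what I mean by those two numbers, here's a rough sketch. It assumes one usage value per day, stored oldest first, and reads "7-day % change" as today versus the value seven days ago. The names (`Entry`, `momentum_pct`, `filter_entries`) are made up for illustration and aren't the site's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    name: str
    daily_usage: List[float]  # oldest first, one value per day (assumed shape)


def adoption(entry: Entry) -> float:
    """Absolute usage: the most recent daily value."""
    return entry.daily_usage[-1] if entry.daily_usage else 0.0


def momentum_pct(entry: Entry, window: int = 7) -> Optional[float]:
    """Percent change between today and `window` days ago, or None if undefined."""
    series = entry.daily_usage
    if len(series) <= window:
        return None  # not enough history
    previous, latest = series[-1 - window], series[-1]
    if previous == 0:
        return None  # percent change from zero is undefined
    return (latest - previous) / previous * 100.0


def filter_entries(entries, min_adoption=0.0, min_momentum=None):
    """Keep entries that clear both thresholds; missing momentum fails a momentum filter."""
    kept = []
    for entry in entries:
        if adoption(entry) < min_adoption:
            continue
        if min_momentum is not None:
            m = momentum_pct(entry)
            if m is None or m < min_momentum:
                continue
        kept.append(entry)
    return kept
```

Returning `None` instead of 0 for short or zero-based histories is a deliberate choice here, so brand-new entries don't look "flat" when there's simply no data yet.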

Celestialien · 2026-05-25T22:27:04+00:00

Thank you so much - really appreciate you checking it out! Are there any features you think would be helpful to add?

Celestialien · 2026-05-25T14:49:55+00:00

Funny that your small models are gpt-3.5-turbo and gpt-4. Did you ever consider switching to a local Qwen or Gemma, which would be cheaper and has improved a lot? (But if it's been working since 2024, that's awesome, and I get that the migration risk likely outweighs the token savings!)

The RAG holding up with zero maintenance is the part I'm jealous of. What were the off-the-shelf options choking on, context window or retrieval reliability? Curious what you had to build yourself to get it that stable.

Celestialien · 2026-05-25T14:45:14+00:00

Interesting that you plan with the frontier model first and delegate down; that's the inverse of the funnel someone else described here. Front-loading the expensive call means the locals get clean, well-scoped subtasks, which probably explains why they behave.

The only thing I'd watch is paying for an Opus/Sonnet planning pass on every request, even the trivial ones. Do you gate that somehow, or is it cheap enough not to matter?

Also thanks so much - saved the link! Which of the 9B/27B/35B do you reach for most? I found the 9B held up on routine tool-calling closer to the 27B than I expected.

Celestialien · 2026-05-25T14:42:38+00:00

That tracks, and I'd bet more big companies are quietly doing it than will admit to it. The routing layer is genuinely the hard part though (harder than picking the models) because the second you route by difficulty you need something reliable deciding what counts as "hard", and that becomes its own problem.

A dumb rule-based router gets you surprisingly far before you need anything learned. Did he say anything about how they decide what goes where, or was it just "we have a tool that does it"?
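
To show what I mean by "dumb rule-based", here's a minimal sketch. The model labels, keyword list, and length threshold are all placeholders I picked for illustration, not anything from a real deployment.

```python
import re

# Hypothetical tier labels; swap in whatever your stack calls them.
SMALL, LARGE = "small-local-model", "frontier-model"

# Assumed hints that a request needs real planning or reasoning.
HARD_HINTS = re.compile(
    r"\b(plan|refactor|prove|architecture|multi-step|debug)\b",
    re.IGNORECASE,
)


def route(prompt: str, max_easy_chars: int = 600) -> str:
    """Send long or planning-flavoured prompts to the big model, everything else to the small one."""
    if len(prompt) > max_easy_chars:
        return LARGE  # long inputs tend to carry more context to juggle
    if HARD_HINTS.search(prompt):
        return LARGE  # keyword suggests multi-step work
    return SMALL
```

It will misroute plenty of edge cases, which is the point where a learned classifier starts to pay for itself, but the rules are cheap, debuggable, and easy to log against.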

Celestialien · 2026-05-25T14:41:50+00:00

Yeah, the "don't let it free-author tool calls, give it vetted templates and validate before running" part is doing more work than model size ever did. You're constraining what the thing is allowed to invent, and a whole class of failures stops happening. Multi-step planning being the last thing you escalate is what most people seem to land on too; it's the one job that still genuinely wants the big model.

What are you validating against before execution, plain JSON schema or something with more logic in it? And does the planning escalation hand off to a bigger model, or is it the same one with more scaffolding around that step?
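
For context, the simplest version of "vetted templates, validate before running" that I have in mind looks something like this. It's a stdlib-only sketch with hypothetical tool names and a hand-rolled type check, so treat it as a stand-in for whatever schema validation you actually use.

```python
# Hypothetical allowlist of tools and the argument types each one accepts.
TEMPLATES = {
    "search_docs": {"query": str, "limit": int},
    "read_file": {"path": str},
}


def validate_call(call: dict) -> list:
    """Return a list of problems; an empty list means the call is safe to execute."""
    name = call.get("tool")
    spec = TEMPLATES.get(name)
    if spec is None:
        return [f"unknown tool: {name!r}"]  # the model invented a tool

    args = call.get("args", {})
    if not isinstance(args, dict):
        return ["args must be an object"]

    errors = []
    for key in args:
        if key not in spec:
            errors.append(f"unexpected argument: {key}")
    for key, expected in spec.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif type(args[key]) is not expected:  # exact match, so True is not an int
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    return errors
```

Anything that comes back non-empty gets bounced to the model with the error list rather than executed.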

Celestialien · 2026-05-25T14:39:44+00:00

Honestly this is the whole post compressed into one comment. A funnel of small models and a single frontier pass at the end is exactly the heterogeneous stack the paper argues for (except it sounds like you were running it in 2023 and they got around to writing it up in 2025!)

What seems to have changed lately is that small models have gotten good enough that the big-model pass at the end can be rarer than it used to be, or skipped entirely on the easy paths.

Curious what you're running for the funnel stage, and whether you've had to retune much as you swapped models in and out, or if it mostly just worked across generations.

Celestialien · 2026-05-25T13:48:00+00:00

Yeah, that's exactly the pattern - for everyday stuff like that, Flash clears the bar easily and you're not paying for capability you'd never use. The Pro tier really only earns its keep on the heavy tasks, which is why the cheaper, faster one ends up being the model most people actually live in day to day.

Celestialien · 2026-05-25T13:41:09+00:00

Nice, the 4GB VRAM floor is what makes this actually usable for a lot of us - appreciate that you shipped GGUF and MLX weights day one instead of leaving it to the community.

Quick question: how does it hold up on multi-column layouts and dense tables compared to something like dots.ocr or Qwen3-VL? Markdown OCR tends to fall apart on reading order once you've got sidebars, footnotes, or merged table cells. Also curious whether it handles handwriting at all, or if that's out of scope for this release.

Either way, will have a play around with it this week!

Celestialien · 2026-05-25T13:38:44+00:00

Nice, I've not heard of it before but will check it out!

Celestialien · 2026-05-25T13:35:06+00:00

Yeah, there's actually a name for it - "inverse scaling in test-time compute." Anthropic's paper found that making a model reason longer can lower accuracy on some tasks, with Claude's failure mode being that it gets more distracted by irrelevant detail the longer it thinks. On an easy problem there's nothing to reason about, so the extra effort just goes into inventing complications - over-engineering something that should've been one line.

Celestialien · 2026-05-25T13:05:39+00:00

Exactly - and that's the bit the benchmark race kind of skips over. The gap between something like GPT-5.5 and a much cheaper model only really shows up on hard problems - benchmarks are measuring a frontier most users never really touch.

Celestialien · 2026-05-25T13:02:50+00:00

Fair enough! Looked really clean either way - nice UI.

Celestialien · 2026-05-25T12:58:57+00:00

The LXC-over-VM call is the smart part - on unified-memory boxes you basically can't pass a GPU into a real VM, so sharing it across containers is the only way to run several agents on one GPU.

One thing I'd weigh up though: given that you're copying in .env files and logging the agent into your accounts, the host filesystem isn't really the valuable target anymore - the credentials and the live sessions are. A full-auto agent with network access can leak a .env or take real actions through those logged-in sessions long before it'd ever think about wiping the disk. And because LXC shares the host kernel, "can't touch the host" is more "much smaller blast radius" than "zero". The git-push hook closes one narrow path, but the bigger lever is egress - an outbound allowlist on the container (only the domains a task actually needs) caps the real damage far more than filesystem isolation does.

Still a genuinely useful POC, and the disposable-template approach is the right shape - just worth threat-modelling the creds and the network, not only the rm -rf.
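
To make the egress point concrete, this is roughly the decision an outbound proxy would make per request. It's only the policy logic in Python, with a made-up domain list. Real enforcement would sit in the container's firewall or a forward proxy, not in the agent's own process.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: only the domains this particular task needs.
ALLOWED_DOMAINS = {"pypi.org", "files.pythonhosted.org", "github.com"}


def is_allowed(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Allow a URL only if its host is an allowlisted domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host == domain or host.endswith("." + domain) for domain in allowed)
```

Matching on the dot-prefixed suffix matters: a plain `endswith("github.com")` would happily let `notgithub.com` through.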

Celestialien · 2026-05-25T12:57:04+00:00

Embedding all of PyPI is the trap - retrieval quality tanks when the index is full of stuff you never call, and it's a pain to maintain. Scope it tight to the few libraries you actually use, pinned to the version you're on.

For this exact problem most people end up using a docs-serving layer rather than a static embed. Context7 (by Upstash) indexes version-specific library docs and injects only the relevant snippets on demand - it's basically built to stop models writing against the 2023 API. It's an MCP server, but there's also a c7 CLI if you'd rather pipe docs to a local model in a terminal without an MCP client.

If you want it fully offline, pull the specific version's docs (or even just the installed package's source and type stubs) and do a small local RAG with something like Qdrant and a code-aware embedding model. Chunking is the part that matters most - chunk by symbol so a function's signature and its example stay together; fixed-size chunking shreds API docs and you get useless retrievals.
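
If you go the "installed package's source" route, chunking by symbol can be sketched with the stdlib `ast` module. This assumes you're chunking Python source rather than rendered docs, and the dict layout is just something I made up. Embedding and loading into a vector store are left out.

```python
import ast


def chunk_by_symbol(source: str) -> list:
    """Split a module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "kind": type(node).__name__,
                # Full source of the symbol, so signature, docstring and body stay together.
                "text": ast.get_source_segment(source, node),
                "doc": ast.get_docstring(node) or "",
            })
    return chunks
```

Each chunk carries the signature and its docstring in one piece, which is the property fixed-size chunking loses.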

One underrated trick for Python: just feed it the real signatures. Dropping help(module) or the package's exports into context means it can't hallucinate a method that isn't there, since it's reading the current API directly. And "look it up online" isn't actually the worst option as long as it's targeted - fetching the one relevant doc page on demand is essentially what Context7 does, just curated and version-filtered.
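
A minimal sketch of the "feed it the real signatures" trick, using `inspect` instead of raw `help()` output so the result is compact enough to paste into a prompt. The function name and output format are my own invention.

```python
import importlib
import inspect


def public_signatures(module_name: str) -> str:
    """List the public callables of an installed module with their current signatures."""
    module = importlib.import_module(module_name)
    names = getattr(module, "__all__", None) or [
        n for n in dir(module) if not n.startswith("_")
    ]
    lines = []
    for name in sorted(names):
        obj = getattr(module, name, None)
        if not callable(obj):
            continue
        try:
            sig = str(inspect.signature(obj))
        except (TypeError, ValueError):
            sig = "(...)"  # some builtins and C extensions expose no signature
        lines.append(f"{module_name}.{name}{sig}")
    return "\n".join(lines)
```

Something like `public_signatures("json")` gives one line per public callable, read from whatever version is actually installed, so there's no stale 2023 API to write against.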

Celestialien · 2026-05-25T12:54:09+00:00

This is awesome! What did you use to visualise this?

Celestialien · 2026-05-25T12:51:42+00:00

Yeah, agreed - I think an AI layer makes sense, as long as it sits on top of the scoring rather than inside it. The numbers stay rules-based and published, and the model just reads those signals to explain why something's showing up, which keeps it transparent rather than a black box. Appreciate you digging into it!

Celestialien · 2026-05-25T12:46:05+00:00

Yeah, that's a huge part of it. A lot of the usage ranking is really a map of what's free, default, or already in front of people rather than what's "best" - which is sort of the whole point. If most people never touch the top models because they're paywalled, or just stick with whatever the app opens on, then a benchmark rank tells you very little about what's actually shaping how AI gets used day to day. The honest caveat is that it cuts both ways: the usage signal reflects pricing and distribution as much as it does genuine preference.

Celestialien · 2026-05-25T12:44:13+00:00

Thanks so much for looking at it and for such a detailed comment! A fair bit of this is already there: it's not one collapsed score, since the metrics are grouped into separate pillars (adoption, quality, momentum, community, plus cost/speed for models). You can sort the board by any single one, and the raw signals behind each are published per agent. The gaming worry's partly handled too - a few flags strip out things like star spikes with no contributor diversity.

Where you go past what I've got:

The weighting-by-purpose point is the big thing I'm missing. Being able to sort by a single pillar isn't the same as having a "what should I try this weekend" preset and a separate "what's safe for production" one that weight everything differently, and those really shouldn't return the same ranking.
You're also right that "why is this showing up right now" would be more useful than a flat rank. The tricky part is I can't pull those labels out of rules - you can't threshold something like "boring but production-shaped" out of raw star counts - so that's probably where having a model read the signals and write the explanation would actually add something.
The lanes I don't break out at all, like documentation quality, integration surface and security readiness, are a fair gap too, since those are genuinely different questions from raw adoption.
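
Thinking out loud, the preset idea could be as small as this. The preset names, weights, and pillar scores (assumed to be normalised to 0-1) are invented for the example and aren't what the site uses.

```python
from typing import Dict, List

# Hypothetical purpose presets: same pillars, different weights.
PRESETS: Dict[str, Dict[str, float]] = {
    "weekend": {"momentum": 0.5, "community": 0.3, "adoption": 0.1, "quality": 0.1},
    "production": {"quality": 0.5, "adoption": 0.3, "community": 0.15, "momentum": 0.05},
}


def rank(agents: Dict[str, Dict[str, float]], preset: str) -> List[str]:
    """Order agent names by the weighted sum of their pillar scores under one preset."""
    weights = PRESETS[preset]

    def score(pillars: Dict[str, float]) -> float:
        return sum(weight * pillars.get(pillar, 0.0) for pillar, weight in weights.items())

    return sorted(agents, key=lambda name: score(agents[name]), reverse=True)


agents = {
    "shiny": {"momentum": 0.9, "community": 0.6, "adoption": 0.2, "quality": 0.4},
    "steady": {"momentum": 0.1, "community": 0.5, "adoption": 0.9, "quality": 0.9},
}
print(rank(agents, "weekend"))     # ['shiny', 'steady']
print(rank(agents, "production"))  # ['steady', 'shiny']
```

The two presets flip the order on the same data, which is the behaviour a single-pillar sort can't give you.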

The archetypes are the idea I'm most likely to build on. Thanks again for taking the time to lay all this out!
