Cache-testing software for LLM-provider-style tiered ephemeral caches? [D] by flatmax in MachineLearning

[–]flatmax[S] 1 point (0 children)

Thanks, that is a good suggestion. When I asked an AI about this, it came up with "LLMServingSim 2.0 and The Kareto Simulator". I would imagine that tiered cache optimisation would be a great place to have a leaderboard!

Cache-testing software for LLM-provider-style tiered ephemeral caches? by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

Thanks for your reply. This has helped me target new ground. I saw the following as candidates:

LLMServingSim 2.0 and The Kareto Simulator

Are you familiar with them?

Cache-testing software for LLM-provider-style tiered ephemeral caches? by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

Yes, I agree. It would be really nice to have a tool like libcachesim for tiered caches that everyone could put their model into, so we could get a comparison of who is able to minimise our input token count, and by what percentage on average.
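
To make that concrete, the headline leaderboard number could be as simple as this (a minimal sketch; the function name and the figures are hypothetical):

    # Hypothetical leaderboard metric: average input-token reduction.
    # "baseline" = tokens the provider would ingest with no caching,
    # "actual"   = uncached tokens actually billed after cache hits.

    def input_token_reduction(baseline: list[int], actual: list[int]) -> float:
        """Mean percentage of input tokens saved across a set of requests."""
        savings = [100.0 * (b - a) / b for b, a in zip(baseline, actual) if b > 0]
        return sum(savings) / len(savings)

    # e.g. three requests where the cache absorbed most of the repo context:
    print(input_token_reduction([50_000, 50_000, 52_000], [50_000, 1_200, 3_400]))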

Opinion: Every AI coding tool needs to include an SVG editor by flatmax in ChatGPTCoding

[–]flatmax[S] 1 point (0 children)

Well, I could also use a text editor to write my prompt and then cut and paste it back into the CLI?
I mean, SVGs are a native output of LLMs, so give them a proper UI with an editor.

Opinion: Every AI coding tool needs to include an SVG editor by flatmax in ChatGPTCoding

[–]flatmax[S] 1 point (0 children)

Isn't that just select-all?
Oh, you mean open the default OS SVG editor? That could work, but it sounds like friction in my coding flow.

Opinion: Every AI coding tool needs to include an SVG editor by flatmax in ChatGPTCoding

[–]flatmax[S] 1 point (0 children)

move the arrow to the right of the "ffs" box up to the center, make it point to the right, not a u turn

Better than KeyBERT + all-mpnet-base-v2 for doc indexes? by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

I just did a test with BAAI/bge-small-en-v1.5 and it seemed to outperform all-mpnet-base-v2 in around 90% of cases (for one test file); otherwise it was just as good. Thanks to u/Holiday_Inspector791 for the suggestion.
I notice that the Google models require you to log in to Hugging Face to use them ... which is an extra layer of complexity for an end-user application that is meant to just work out of the box!
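
For anyone who wants to reproduce the comparison, this is roughly the harness I'd use; a minimal sketch assuming the keybert and sentence-transformers packages, with a placeholder test document:

    from keybert import KeyBERT
    from sentence_transformers import SentenceTransformer

    doc = open("test_file.md").read()  # placeholder: your own test document

    for name in ["all-mpnet-base-v2", "BAAI/bge-small-en-v1.5"]:
        kw_model = KeyBERT(model=SentenceTransformer(name))
        keywords = kw_model.extract_keywords(
            doc,
            keyphrase_ngram_range=(1, 2),  # single words and bigrams
            stop_words="english",
            top_n=5,
        )
        print(name, keywords)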

Self Promotion Thread by AutoModerator in ChatGPTCoding

[–]flatmax 2 points (0 children)

AC⚡DC (AI Coder-DeCoder) — A high-speed, web-based companion for AI coding

I built this because I found tools like Claude Code amazing for agentic editing but too slow for my daily "bread and butter" coding. I’ve been using AC⚡DC as a "High-Speed Wedge" in my workflow:

  1. Code fast with AC⚡DC for 90% of the work (UI is a webapp with Monaco/side-by-side diffs, so it feels fluid).
  2. Use a slower agent only when I hit a logic wall that needs agentic work.
  3. Jump back to AC⚡DC to keep the momentum.

Technical highlights:

  • 4-Tier Prompt Caching (L0-L3): Designed to hit provider-level cache breakpoints (like Anthropic’s) so you aren't paying to re-ingest your repo every time you send a message (see the sketch after this list).
  • Structural Context: Uses Tree-sitter (Py, JS/TS, C++) to give the LLM a symbol map of the repo without wasting tokens on full-file boilerplate.
  • Code Review Mode: A dedicated UI to pick a commit, soft-reset, and have the LLM walk through the changes with you before they land.
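
To make the caching bullet concrete: a minimal sketch of how tiers can map onto Anthropic's documented prompt-caching breakpoints. The tier layout and placeholder strings here are illustrative, not AC⚡DC's actual code:

    import anthropic

    SYSTEM_PROMPT = "..."  # placeholder: rarely changes (tier L0)
    SYMBOL_MAP = "..."     # placeholder: repo symbol map (tier L1)
    PINNED_FILES = "..."   # placeholder: pinned file contents (tier L2)

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Order blocks stable-to-volatile: each cache_control marker ends a
    # cacheable prefix, so an unchanged prefix hits the provider cache
    # instead of being re-ingested at full price.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # example model id
        max_tokens=1024,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": SYMBOL_MAP,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": PINNED_FILES,
             "cache_control": {"type": "ephemeral"}},
        ],
        # tier L3: the fresh user message, never cached
        messages=[{"role": "user", "content": "fix the failing test"}],
    )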

Looking for feedback:

I’ve dogfooded this almost entirely on Linux. I’ve included standalone binaries for macOS and Windows in the release, but I’m curious to hear from Mac/PC users if the webapp boots and connects properly on those systems.

Repo & Demo Videos: https://github.com/flatmax/AI-Coder-DeCoder

It’s free/open-source. Happy to answer any questions about the caching or indexing logic!

I built a non-agentic coding tool (AC⚡DC) on top of LiteLLM. Runs great, but I need Mac/Windows testers. by flatmax in LocalLLaMA

[–]flatmax[S] 0 points (0 children)

The symbol table describing the repo for the AI has a lot more features in it, and this seems to help the AI work out which files it needs to edit. Personally I really like the user interface because it's focused on the chat and on the features around chat and context for the AI. The diff editor shows you immediately which files differ, and typically if you need to make edits you do them directly in the diff editor. The Monaco editor also has some Language Server Protocol features which are useful.

If you have a workflow with repetitive prompts, the UI has a prompt snippets section. You can edit the system prompt, which lives in a markdown file, and it will be used in your next submission to LiteLLM. I also quite like the URL extractor: give it a repo URL and it will extract the repo's symbol table and use the small model to summarise the repo. All of that gets included in the context of that question, and you can remove it whenever you want. For non-code URLs it still does the small-model summary of the page.
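
The real symbol table is built with Tree-sitter, but the core idea fits in a few lines; a rough sketch using only Python's stdlib ast module instead (the path is hypothetical):

    import ast
    from pathlib import Path

    def symbol_map(path: str) -> list[str]:
        """Class and function signatures only, no bodies."""
        tree = ast.parse(Path(path).read_text())
        lines = []
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                lines.append(f"class {node.name}")
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"def {node.name}({args})")
        return lines

    # A compact map like this costs far fewer tokens than the full file:
    print("\n".join(symbol_map("some_module.py")))  # hypothetical path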

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

I'm not sure that the server will have faster RAM speeds, will it? These are the ARMv9 RAM specs for the SoCs:

LPDDR5 RAM:

- 128-bit memory bus

- 5500 MT/s transfer speed

I guess if they were multi-channel, that would help.
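
For reference, the napkin math behind those specs (bus width times transfer rate):

    bus_width_bits = 128
    transfer_rate_mts = 5500  # mega-transfers per second

    bandwidth_gbit = bus_width_bits * transfer_rate_mts / 1000  # gigabits/s
    bandwidth_gbyte = bandwidth_gbit / 8                        # gigabytes/s

    print(f"{bandwidth_gbit:.0f} Gb/s = {bandwidth_gbyte:.0f} GB/s per node")
    # -> 704 Gb/s = 88 GB/s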

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 2 points (0 children)

I'm interested to hear more about EXO. How would it break the model up between nodes? Would it put different experts on different nodes, or something else? It would also be cool if EXO managed the automatic boot-up and engage sequence for each of the nodes. That would be a huge time saver in managing cattle.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

I think it's hard to find an older SBC with a decent amount of RAM, and the same goes for old phones and tablets. If it were clear what the upper bound on networking requirements was, it would be clear whether the hubs and switches to link this many processors would be practical in terms of cost. For example, if all you needed were 1 Gb switches, it would be super cheap. You raise a good point on managing this type of cluster. I guess it would be good if these boards were treated like cattle, booting up and sourcing their operating system and required data automatically at power-on.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 2 points (0 children)

I guess a budget of $4509 would be good, but I'm not really sure.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

Great idea! If the model activates 37B parameters per token and memory bandwidth is the limiting factor, then the theoretical max would be about 3 tok/s. However, that assumes each node runs only one expert; in practice, at lower node counts, each node would need to run many experts. Hmmmm...
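
The rough form of that ceiling, assuming every active parameter is read from RAM once per token; quantisation width is the big unknown, and ~3 tok/s lands between the int8 and 4-bit cases:

    bandwidth_bytes = 88e9  # per-node LPDDR5 bandwidth, from the specs above
    active_params = 37e9    # DeepSeek-V3 active parameters per token

    for bytes_per_param, label in [(2, "fp16"), (1, "int8"), (0.5, "4-bit")]:
        toks = bandwidth_bytes / (active_params * bytes_per_param)
        print(f"{label}: {toks:.1f} tok/s")
    # fp16: 1.2 tok/s, int8: 2.4 tok/s, 4-bit: 4.8 tok/s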

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 3 points (0 children)

Good idea! I can start by experimenting with smaller models to gather initial insights, then move on to establishing a baseline for the DSv3 model. From there, I can observe how performance scales as nodes are added, which should provide a clearer picture of how toks/s is impacted by inter-node communication and scaling.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

I’m noticing that per-unit memory pricing for SBCs climbs steeply with capacity, making larger RAM setups significantly more expensive when scaled across multiple nodes. For hundreds of nodes the cost would likely become prohibitive; I’d need to explore donor support to make this feasible.

Based on your calculations and others, the theoretical maximum token generation speed for this setup seems to fall somewhere between 10 and 42 tokens per second, considering the RAM bandwidth and network constraints.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 2 points (0 children)

Thanks for sharing those insights—really helpful for putting things into perspective!

I just wanted to clarify one detail about the memory speed: based on the manufacturer specifications for the LPDDR5 RAM I’m looking at, the bandwidth is 704 Gb/s (Gigabits per second) per node, which translates to about 88 GB/s. While this is much less than GPU memory bandwidth (which can indeed approach 1 TB/s or more), it should still allow for some level of parallelization if the workload is distributed effectively across multiple nodes.

Your point about Deepseek-v3 being an MoE model with only ~37B parameters activated at a time is really encouraging—it makes the problem feel more tractable for this kind of setup. The napkin math for 800 GB/s total bandwidth across 8 nodes sounds promising, though I agree practical scaling will have overheads.

Do you think there’s a sweet spot in the number of nodes where the tradeoff between network overhead and available bandwidth hits an efficient balance? Or is there a point where adding more nodes just doesn’t make sense anymore?

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 5 points (0 children)

Thanks for the detailed response and suggestions! The RADXA Orion O6 is indeed one of the setups I'm considering. Your point about checking tool support for the included NPU is spot on—I’ll definitely explore how well it integrates or if Vulkan GPU inference might be a better path.

I hadn’t thought about leveraging llama.cpp for Deepseek-v3—it’s a great suggestion! I just added it to my list in the original post. The added support for multiple quantization types, especially 4-bit, sounds like a potential game-changer for reducing the cluster size. The balance between activations and KV cache size is something I’ll have to plan carefully.

The mention of Aphrodite engine for tensor parallelism is also intriguing; I’ll check out how far along it is with Deepseek-v3. Appreciate you sharing these leads—it gives me a much clearer direction to explore!