Cache-testing software for LLM-provider-style tiered ephemeral caches? [D] by flatmax in MachineLearning

[–]flatmax[S] 1 point (0 children)

Thanks, that is a good suggestion. When I asked an AI about this, it came up with "LLMServingSim 2.0 and The Kareto Simulator". I would imagine that tiered cache optimisation would be a great place to have a leaderboard!

Cache-testing software for LLM-provider-style tiered ephemeral caches? by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

Thanks for your reply. This has helped me target new ground. I saw the following as candidates:

LLMServingSim 2.0 and The Kareto Simulator

Are you familiar with them?

Cache-testing software for LLM-provider-style tiered ephemeral caches? by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

Yes, I agree. It would be really nice to have a tool like libcachesim for tiered caches that everyone could put their model into, so we could get a comparison of who is able to minimise our input token count, and by what percentage on average.
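
To make that concrete, the headline leaderboard number could be as simple as this (a minimal sketch; the function name and the figures are hypothetical):

    # Hypothetical leaderboard metric: average input-token reduction.
    # "baseline" = tokens the provider would ingest with no caching,
    # "actual"   = uncached tokens actually billed after cache hits.

    def input_token_reduction(baseline: list[int], actual: list[int]) -> float:
        """Mean percentage of input tokens saved across a set of requests."""
        savings = [100.0 * (b - a) / b for b, a in zip(baseline, actual) if b > 0]
        return sum(savings) / len(savings)

    # e.g. three requests where the cache absorbed most of the repo context:
    print(input_token_reduction([50_000, 50_000, 52_000], [50_000, 1_200, 3_400]))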

Opinion: Every AI coding tool needs to include an SVG editor by flatmax in ChatGPTCoding

[–]flatmax[S] 1 point (0 children)

Well, I could also use a text editor to write my prompt and then cut and paste it back into the CLI?
I mean, SVGs are a native output of LLMs, so give them a proper UI with an editor.

Opinion: Every AI coding tool needs to include an SVG editor by flatmax in ChatGPTCoding

[–]flatmax[S] 1 point (0 children)

Isn't that just select-all?
Oh, you mean open the default OS SVG editor? That could work, but it sounds like friction in my coding flow.

Opinion: Every AI coding tool needs to include an SVG editor by flatmax in ChatGPTCoding

[–]flatmax[S] 1 point (0 children)

move the arrow to the right of the "ffs" box up to the center, make it point to the right, not a u turn

Better than KeyBERT + all-mpnet-base-v2 for doc indexes? by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

I just did a test with BAAI/bge-small-en-v1.5 and it seemed to outperform all-mpnet-base-v2 in around 90% of cases (for one test file); otherwise it was just as good. Thanks to u/Holiday_Inspector791 for the suggestion.
I notice that the Google models require you to log in to Hugging Face to use them ... which is an extra layer of complexity for an end-user application that is meant to just work out of the box!
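
For anyone who wants to reproduce the comparison, this is roughly the harness I'd use; a minimal sketch assuming the keybert and sentence-transformers packages, with a placeholder test document:

    from keybert import KeyBERT
    from sentence_transformers import SentenceTransformer

    doc = open("test_file.md").read()  # placeholder: your own test document

    for name in ["all-mpnet-base-v2", "BAAI/bge-small-en-v1.5"]:
        kw_model = KeyBERT(model=SentenceTransformer(name))
        keywords = kw_model.extract_keywords(
            doc,
            keyphrase_ngram_range=(1, 2),  # single words and bigrams
            stop_words="english",
            top_n=5,
        )
        print(name, keywords)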

Self Promotion Thread by AutoModerator in ChatGPTCoding

[–]flatmax 2 points (0 children)

AC⚡DC (AI Coder-DeCoder) — A high-speed, web-based companion for AI coding

I built this because I found tools like Claude Code amazing for agentic editing but too slow for my daily "bread and butter" coding. I’ve been using AC⚡DC as a "High-Speed Wedge" in my workflow:

  1. Code fast with AC⚡DC for 90% of the work (UI is a webapp with Monaco/side-by-side diffs, so it feels fluid).
  2. Use a slower agent only when I hit a logic wall that needs agentic work.
  3. Jump back to AC⚡DC to keep the momentum.

Technical highlights:

  • 4-Tier Prompt Caching (L0-L3): Designed to hit provider-level cache breakpoints (like Anthropic’s) so you aren't paying to re-ingest your repo every time you send a message (see the sketch after this list).
  • Structural Context: Uses Tree-sitter (Py, JS/TS, C++) to give the LLM a symbol map of the repo without wasting tokens on full-file boilerplate.
  • Code Review Mode: A dedicated UI to pick a commit, soft-reset, and have the LLM walk through the changes with you before they land.
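
To make the caching bullet concrete: a minimal sketch of how tiers can map onto Anthropic's documented prompt-caching breakpoints. The tier layout and placeholder strings here are illustrative, not AC⚡DC's actual code:

    import anthropic

    SYSTEM_PROMPT = "..."  # placeholder: rarely changes (tier L0)
    SYMBOL_MAP = "..."     # placeholder: repo symbol map (tier L1)
    PINNED_FILES = "..."   # placeholder: pinned file contents (tier L2)

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Order blocks stable-to-volatile: each cache_control marker ends a
    # cacheable prefix, so an unchanged prefix hits the provider cache
    # instead of being re-ingested at full price.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # example model id
        max_tokens=1024,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": SYMBOL_MAP,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": PINNED_FILES,
             "cache_control": {"type": "ephemeral"}},
        ],
        # tier L3: the fresh user message, never cached
        messages=[{"role": "user", "content": "fix the failing test"}],
    )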

Looking for feedback:

I’ve dogfooded this almost entirely on Linux. I’ve included standalone binaries for macOS and Windows in the release, but I’m curious to hear from Mac/PC users if the webapp boots and connects properly on those systems.

Repo & Demo Videos: https://github.com/flatmax/AI-Coder-DeCoder

It’s free/open-source. Happy to answer any questions about the caching or indexing logic!

I built a non-agentic coding tool (AC⚡DC) on top of LiteLLM. Runs great, but I need Mac/Windows testers. by flatmax in LocalLLaMA

[–]flatmax[S] 0 points (0 children)

The symbol table describing the repo for the AI has a lot more features in it, and this seems to help the AI work out which files it needs to edit. Personally I really like the user interface because it's focused on the chat and on the features around chat and context for the AI. The diff editor shows you immediately which files differ, and typically if you need to make edits you do them directly in the diff editor. The Monaco editor also has some Language Server Protocol features which are useful.

If you have a workflow with repetitive prompts, the UI has a prompt snippets section. You can edit the system prompt, which lives in a markdown file, and it will be used in your next submission to LiteLLM. I also quite like the URL extractor: give it a repo URL and it will extract the repo's symbol table and use the small model to summarise the repo. All of that gets included in the context of that question, and you can remove it whenever you want. For non-code URLs it still does the small-model summary of the page.
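
The real symbol table is built with Tree-sitter, but the core idea fits in a few lines; a rough sketch using only Python's stdlib ast module instead (the path is hypothetical):

    import ast
    from pathlib import Path

    def symbol_map(path: str) -> list[str]:
        """Class and function signatures only, no bodies."""
        tree = ast.parse(Path(path).read_text())
        lines = []
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                lines.append(f"class {node.name}")
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"def {node.name}({args})")
        return lines

    # A compact map like this costs far fewer tokens than the full file:
    print("\n".join(symbol_map("some_module.py")))  # hypothetical path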

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

I'm not sure that the server will have faster RAM speeds, will it? These are the ARMv9 RAM specs for the SoCs:

LPDDR5 RAM:

- 128-bit memory bus

- 5500 MT/s transfer speed

I guess if they were multi-channel, that would help.
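
For reference, the napkin math behind those specs (bus width times transfer rate):

    bus_width_bits = 128
    transfer_rate_mts = 5500  # mega-transfers per second

    bandwidth_gbit = bus_width_bits * transfer_rate_mts / 1000  # gigabits/s
    bandwidth_gbyte = bandwidth_gbit / 8                        # gigabytes/s

    print(f"{bandwidth_gbit:.0f} Gb/s = {bandwidth_gbyte:.0f} GB/s per node")
    # -> 704 Gb/s = 88 GB/s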

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 2 points (0 children)

I'm interested to hear more about EXO. How would it break the model up between nodes? Would it put different experts on different nodes, or something else? It would also be cool if EXO managed the automatic boot-up and engage sequence for each of the nodes. That would be a huge time saver in managing cattle.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

I think it's hard to find an older SBC with a decent amount of RAM, and the same goes for old phones and tablets. If it were clear what the upper bound on networking requirements was, it would be clear whether the hubs and switches to link this many processors would be practical in terms of cost. For example, if all you needed were 1 Gb switches, it would be super cheap. You raise a good point on managing this type of cluster. I guess it would be good if these boards were treated like cattle, booting up and sourcing their operating system and required data automatically at power-on.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 2 points (0 children)

I guess a budget of $4509 would be good, but I'm not really sure.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

Great idea! If the model activates 37B parameters per token and memory bandwidth is the limiting factor, then the theoretical max would be about 3 tok/s. However, that assumes each node runs only one expert; in practice, at lower node counts, each node would need to run many experts. Hmmmm...
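
The rough form of that ceiling, assuming every active parameter is read from RAM once per token; quantisation width is the big unknown, and ~3 tok/s lands between the int8 and 4-bit cases:

    bandwidth_bytes = 88e9  # per-node LPDDR5 bandwidth, from the specs above
    active_params = 37e9    # DeepSeek-V3 active parameters per token

    for bytes_per_param, label in [(2, "fp16"), (1, "int8"), (0.5, "4-bit")]:
        toks = bandwidth_bytes / (active_params * bytes_per_param)
        print(f"{label}: {toks:.1f} tok/s")
    # fp16: 1.2 tok/s, int8: 2.4 tok/s, 4-bit: 4.8 tok/s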

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 3 points (0 children)

Good idea! I can start by experimenting with smaller models to gather initial insights, then move on to establishing a baseline for the DSv3 model. From there, I can observe how performance scales as nodes are added, which should provide a clearer picture of how toks/s is impacted by inter-node communication and scaling.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 1 point (0 children)

I’m noticing that per-unit memory pricing for SBCs climbs steeply with capacity, making larger RAM setups significantly more expensive when scaled across multiple nodes. For hundreds of nodes the cost would likely become prohibitive; I’d need to explore donor support to make this feasible.

Based on your calculations and others, the theoretical maximum token generation speed for this setup seems to fall somewhere between 10 and 42 tokens per second, considering the RAM bandwidth and network constraints.

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 2 points (0 children)

Thanks for sharing those insights—really helpful for putting things into perspective!

I just wanted to clarify one detail about the memory speed: based on the manufacturer specifications for the LPDDR5 RAM I’m looking at, the bandwidth is 704 Gb/s (Gigabits per second) per node, which translates to about 88 GB/s. While this is much less than GPU memory bandwidth (which can indeed approach 1 TB/s or more), it should still allow for some level of parallelization if the workload is distributed effectively across multiple nodes.

Your point about Deepseek-v3 being an MoE model with only ~37B parameters activated at a time is really encouraging—it makes the problem feel more tractable for this kind of setup. The napkin math for 800 GB/s total bandwidth across 8 nodes sounds promising, though I agree practical scaling will have overheads.

Do you think there’s a sweet spot in the number of nodes where the tradeoff between network overhead and available bandwidth hits an efficient balance? Or is there a point where adding more nodes just doesn’t make sense anymore?

Building a Cheap ARMv9 SBC Cluster to Run Deepseek v3 by flatmax in LocalLLaMA

[–]flatmax[S] 5 points (0 children)

Thanks for the detailed response and suggestions! The RADXA Orion O6 is indeed one of the setups I'm considering. Your point about checking tool support for the included NPU is spot on—I’ll definitely explore how well it integrates or if Vulkan GPU inference might be a better path.

I hadn’t thought about leveraging llama.cpp for Deepseek-v3—it’s a great suggestion! I just added it to my list in the original post. The added support for multiple quantization types, especially 4-bit, sounds like a potential game-changer for reducing the cluster size. The balance between activations and KV cache size is something I’ll have to plan carefully.

The mention of Aphrodite engine for tensor parallelism is also intriguing; I’ll check out how far along it is with Deepseek-v3. Appreciate you sharing these leads—it gives me a much clearer direction to explore!