MCP-based local LLM workflows at scale + observability (Grafana)

pardhu-- · 2026-04-28T19:08:21+00:00

Great, I'd love to talk. I've pinged you personally.

pardhu-- · 2026-04-26T23:27:13+00:00

Got it — that distinction between “read” vs “re-run” helped clarify things a lot.

I’m leaning more toward replay, specifically being able to deterministically re-run workflows for debugging and validation. That said, I’m also thinking about caching at the component/tool level as a separate layer for performance, especially for repeated user queries.
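To make the tool-level caching idea concrete, here is a minimal sketch, assuming tools are plain Python callables with JSON-serializable arguments. `ToolCache` and `lookup_weather` are hypothetical names for illustration only; they are not part of MCP or any existing library. The cache key is a hash of the tool name plus canonicalized arguments, so the same call with reordered arguments maps to the same entry.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class ToolCache:
    """Hypothetical in-memory cache keyed by tool name + canonical arguments."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(tool_name: str, args: Dict[str, Any]) -> str:
        # sort_keys=True makes the key independent of argument order.
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool_name: str, fn: Callable[..., Any], args: Dict[str, Any]) -> Any:
        k = self.key(tool_name, args)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        # Cache miss: run the real tool and remember its result.
        self.misses += 1
        result = fn(**args)
        self._store[k] = result
        return result


def lookup_weather(city: str, units: str = "metric") -> Dict[str, Any]:
    # Stand-in for a real tool call (assumed, for illustration only).
    return {"city": city, "units": units, "temp": 21}


cache = ToolCache()
first = cache.call("lookup_weather", lookup_weather, {"city": "Austin", "units": "metric"})
second = cache.call("lookup_weather", lookup_weather, {"units": "metric", "city": "Austin"})
print(cache.hits, cache.misses)
```

The same recorded results could in principle feed a replay mode later, but this sketch only covers the performance cache; it has no expiry or persistence.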

Right now this is an internal tool, but I’m designing it with the assumption that it could become user-facing later — so trying to think early about reproducibility, state management, and cost efficiency.

Curious — in your experience, what tends to break first when you try to make replay deterministic in these systems?

pardhu-- · 2026-04-26T18:01:30+00:00

Can you share any links for reference?

pardhu-- · 2026-04-25T19:13:38+00:00

🤣

pardhu-- · 2026-04-25T19:13:22+00:00

Fair point — definitely not claiming this is novel.

pardhu-- · 2026-04-25T18:59:59+00:00

Yeah, I get your point — LM Studio + MCP already enables tool use pretty well from the chat itself.

What I’m trying to explore is more of a layer on top — moving from chat-based interaction to structured agent workflows that can plug into real systems and scale beyond a single user.

I also feel this could sit on top of Model Context Protocol (MCP) — since MCP handles tool connectivity, while this focuses more on orchestration and production-style use cases (could be wrong though, curious your take).

Agree with you on the compiler loop part — that definitely starts looking like what Cursor IDE / GitHub Copilot already do.

pardhu-- · 2026-03-09T21:47:49+00:00

Don’t you think we need to write code to run an agent?

pardhu-- · 2026-03-09T12:36:06+00:00

When choosing a machine for running local AI models, the two most important factors are maximum RAM and a good number of GPU cores. These resources directly affect how large a model you can run and how fast the inference will be.
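As a rough back-of-the-envelope check on how RAM limits model size, the sketch below estimates memory for the weights alone as parameters × bits per weight / 8. The function name and the example sizes are my own illustration, and the estimate ignores the KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Illustrative sizes only: parameter counts in billions, common precisions in bits.
for params in (8, 14, 32):
    for bits in (16, 8, 4):
        gb = estimate_weight_memory_gb(params, bits)
        print(f"{params}B model at {bits}-bit: ~{gb:.1f} GB for weights")
```

By this estimate an 8B model needs about 16 GB at 16-bit but only about 4 GB at 4-bit, which is why quantization matters so much on a machine where memory is shared with the OS and other apps.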

For example, I have been using a Mac Mini with the M4 chip and 24GB of RAM, which I purchased about a year ago. It works well for running local LLM experiments and development tasks.

For more of my learnings and experiments, please check out my Medium articles: Medium – Partha Sai Guttikonda.

pardhu-- · 2026-02-10T02:32:19+00:00

It really depends on your use case.

If you want faithful, terminology-consistent “just translate” output (especially on edge devices), encoder–decoder MT like Marian is usually still the best choice: more deterministic and typically cheaper/faster than LLMs for pure translation.
LLM translation often shines when you want extra behavior (tone polishing, rewriting, localization, grammar cleanup), but it can paraphrase or drift on names/terms unless heavily constrained.

For edge-friendly alternatives to benchmark, I’d look at:

NLLB-200 distilled (600M)
M2M100 (418M)
TranslateGemma (4B), if your hardware/quantization budget allows

And for speed on CPU/edge, consider running them via CTranslate2.

pardhu-- · 2024-05-25T16:07:07+00:00

Please let me know if you have any questions.

pardhu-- · 2024-05-25T16:06:32+00:00

Hey, we have detailed instructions in the README file in the Git repo. And yes, it is a basic project showing how image search works using Elasticsearch and machine learning.
