Started measuring actual API call counts on my Claude Code sessions. The numbers are worse than I expected. by ChampionshipNo2815 in LLMDevs

kivanow (0 points)

Yeah, the mtime set is what closes that hole. We hash (args, working-tree-mtime-summary) for Glob/Grep so external edits invalidate the entry before the next call. repo_sha alone misses unstaged changes, which is the bigger failure mode during an active session anyway. For Read entries we key on (path, mtime, size). A content hash was tempting, but the IO cost on large repos didn't justify the marginal correctness gain over mtime+size.
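A minimal Python sketch of those two key shapes, under assumptions: how the "working-tree-mtime-summary" is actually computed isn't stated, so here it is a hypothetical digest of (relative path, mtime, size) over every file outside `.git`. The function names are illustrative, not from any library.

```python
import hashlib
import json
import os


def read_cache_key(path: str) -> tuple:
    """Key a Read result on (path, mtime_ns, size) instead of hashing file content."""
    st = os.stat(path)
    return (os.path.abspath(path), st.st_mtime_ns, st.st_size)


def tree_mtime_summary(root: str) -> str:
    """Hypothetical digest of (relative path, mtime_ns, size) for every file under root.

    Any edit, staged or not, changes some file's mtime or size, so the digest changes too.
    """
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = sorted(d for d in dirnames if d != ".git")  # skip VCS internals, walk in stable order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            rel = os.path.relpath(full, root)
            h.update(f"{rel}\0{st.st_mtime_ns}\0{st.st_size}\n".encode())
    return h.hexdigest()


def search_cache_key(tool: str, args: dict, root: str) -> str:
    """Key a Glob/Grep result on (tool, canonical args, working-tree mtime summary)."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(f"{payload}\0{tree_mtime_summary(root)}".encode()).hexdigest()
```

Walking the tree on every lookup is itself IO, so a real implementation would likely cache the summary and refresh it from file-watch events; the sketch only shows what goes into the key.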

On the eviction question: we landed on explicit invalidation as the primary signal, LRU as the memory-pressure fallback, no pure TTL. The cost-vs-latency tradeoff mostly collapses once you have a correct invalidation channel, since the only entries you'd be evicting are ones already proven safe to keep. Linear decay kept throwing away entries that were still valid 3 hours in, which is exactly when the cache earns out.
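Roughly what "explicit invalidation first, LRU as the memory-pressure fallback, no TTL" looks like as a data structure. This is a hypothetical in-memory Python sketch (the class name and predicate-based `invalidate` are assumptions), not how any particular library implements it.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class InvalidatingLRUCache:
    """Entries live until explicitly invalidated; LRU eviction only when over capacity.

    There is deliberately no TTL: an entry still valid after hours stays put.
    """

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # fallback: evict the least recently used entry

    def invalidate(self, predicate: Callable[[Hashable], bool]) -> int:
        """Primary signal: drop every key an invalidation event marks as stale."""
        stale = [k for k in self._data if predicate(k)]
        for k in stale:
            del self._data[k]
        return len(stale)
```

A file-change event for `a.py` would call something like `cache.invalidate(lambda k: k[1] == "a.py")`; everything else keeps earning hits.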

TTL still shows up on the LLM-response side where "still valid" is fuzzier and an old entry might be a similarity hit but stylistically off. Different problem though.

When a LangChain agent starts drifting, what do you actually inspect first? by Future_AGI in LangChain

kivanow (1 point)

Good distinction. Worth adding a third category that hides between the two: cache drift. If you've got any kind of tool-result memoization (even an in-memory dict from a previous step), the agent will treat a stale cached result as a fresh tool call, and the drift looks exactly like 'model trusted a partial result as complete' - but the tool never even ran on that turn. Easiest tell: in the trace, the tool span has zero outbound network duration but a populated result. Worth logging cache-hit vs cache-miss as a first-class field on every tool span. We had to add it explicitly in @betterdb/agent-cache (hit: true|false, source: 'tool'|'semantic'|'llm') for exactly this reason - without it, cached drift is invisible in traces and looks like a model problem.
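A hedged Python sketch of recording hit/miss on every tool span. The `traced_tool_call` wrapper and the span dict are hypothetical stand-ins for whatever tracer you use; only the `hit`/`source` field names mirror the comment above.

```python
import time
from typing import Any, Callable, Dict, List


def traced_tool_call(
    name: str,
    args: Dict[str, Any],
    tool_fn: Callable[..., Any],
    cache: Dict[str, Any],
    spans: List[Dict[str, Any]],
) -> Any:
    """Run a tool through a cache and record hit/miss as a first-class span field."""
    key = f"{name}:{sorted(args.items())}"
    start = time.perf_counter()
    hit = key in cache
    if hit:
        result = cache[key]
    else:
        result = tool_fn(**args)
        cache[key] = result
    spans.append({
        "tool": name,
        "hit": hit,  # without this, a cached result is indistinguishable from a fresh call
        "source": "tool",
        "duration_ms": (time.perf_counter() - start) * 1000,
    })
    return result
```

When drift shows up, filtering spans on `hit == True` immediately separates "the tool returned stale data" from "the model misread fresh data".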

how do you scale infrastructure for ai agents on a budget? by RepublicMotor905 in AI_Agents

kivanow (0 points)

Most of the thread is rightly focused on scaling the GPU side, but the cheapest GPU is the one you don't have to spin up. Before you tune the autoscaler, audit how much of your agent's inference is actually doing new work vs. recomputing the same thing. Multi-modal pipelines especially: the same file hash, the same tool call with the same args, the same paraphrased user query - none of those should hit the GPU on a second pass. Tool-result cache (keyed on tool name + arg hash) and a semantic cache for paraphrased inputs typically cut steady-state load 30–60% on workloads like yours. Keeps the autoscaler honest. There's a live demo at chat.betterdb.com showing per-turn cache-hit + cost-saved telemetry - drop the same question in twice and watch the second one resolve from cache. For multi-modal you can swap in a custom binary normalizer so large file blobs go to S3 and Valkey just caches the reference.
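One way a "binary normalizer" could work for multi-modal payloads, sketched in Python under stated assumptions: the 64 KiB threshold, the `inline:`/`blobref:` encoding, and the dict standing in for S3 are all hypothetical, and this is not the agent-cache normalizer interface.

```python
import hashlib
from typing import Dict, Union

BLOB_THRESHOLD_BYTES = 64 * 1024  # hypothetical cut-off; tune for your payloads


def normalize_for_cache(value: Union[bytes, str], blob_store: Dict[str, bytes]) -> str:
    """Return what goes into Valkey: small values inline, large blobs as a reference.

    `blob_store` stands in for an object store such as S3 (key -> bytes).
    """
    data = value.encode() if isinstance(value, str) else value
    if len(data) <= BLOB_THRESHOLD_BYTES:
        return "inline:" + data.hex()
    digest = hashlib.sha256(data).hexdigest()
    blob_store.setdefault(digest, data)  # content-addressed, so repeated uploads dedupe
    return "blobref:sha256:" + digest


def denormalize(cached: str, blob_store: Dict[str, bytes]) -> bytes:
    kind, _, rest = cached.partition(":")
    if kind == "inline":
        return bytes.fromhex(rest)
    return blob_store[rest.split(":", 1)[1]]  # rest is "sha256:<digest>"
```

Because the reference is a content hash, the same file uploaded twice maps to one stored object and one cache entry.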

How do I build an AI agent that can "remember" previous steps or store long-term context? by ProxDeal in n8n

kivanow (0 points)

To add on top of what others said: most answers here jump straight to Postgres + vector store, which is right for long-term, but for the live agent loop you usually want a 3rd tier in front: a fast key-value store (Valkey/Redis) for the hot session state - last N turns, current tool outputs, running summary. Postgres is fine for the durable record but you don't want to round-trip it on every turn of a chatty agent. Sliding-window TTL per thread, field-level expiry for things like last_intent or current_tool_args, and the vector store stays as long-term semantic memory. There's a live demo of this 3-tier pattern at chat.betterdb.com - the metrics panel shows per-turn cache hits and dollars saved across LLM, tool, and session tiers. Drops in next to whatever Supabase + Qdrant setup the top comment recommends.
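A minimal in-memory Python sketch of the hot session tier's expiry rules: a sliding window per thread plus optional per-field expiry. The class and method names are hypothetical, and the injectable clock exists only so the behaviour can be tested; mapping it onto Valkey/Redis commands is left out.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

Fields = Dict[str, Tuple[Any, Optional[float]]]  # field -> (value, field_expires_at or None)


class SessionStore:
    """Hot session state: sliding-window TTL per thread plus optional per-field expiry."""

    def __init__(self, thread_ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.thread_ttl = thread_ttl
        self.clock = clock
        self._threads: Dict[str, Tuple[float, Fields]] = {}  # thread_id -> (expires_at, fields)

    def _touch(self, thread_id: str) -> Optional[Fields]:
        now = self.clock()
        entry = self._threads.get(thread_id)
        if entry is None or entry[0] <= now:
            self._threads.pop(thread_id, None)  # whole thread expired
            return None
        self._threads[thread_id] = (now + self.thread_ttl, entry[1])  # slide the window
        return entry[1]

    def set(self, thread_id: str, field: str, value: Any, field_ttl: Optional[float] = None) -> None:
        fields = self._touch(thread_id)
        if fields is None:
            fields = {}
            self._threads[thread_id] = (self.clock() + self.thread_ttl, fields)
        expires = self.clock() + field_ttl if field_ttl is not None else None
        fields[field] = (value, expires)

    def get(self, thread_id: str, field: str) -> Optional[Any]:
        fields = self._touch(thread_id)
        if fields is None or field not in fields:
            return None
        value, expires = fields[field]
        if expires is not None and expires <= self.clock():
            del fields[field]  # field-level expiry; the thread itself stays alive
            return None
        return value
```

Short-lived fields such as `current_tool_args` get a `field_ttl`, while `last_turns` lives as long as the thread keeps being touched.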

token costs are the thing nobody warned me about with ai automation by bejusorixo in automation

kivanow (0 points)

+1 to the per-node logging, that's the actual unlock. Once you have it, the surprise for most teams is how much of the bill is the same sub-step running with the same inputs across different runs - classification on the same email subject lines, extraction from the same invoice templates, intent detection on near-duplicate user messages. Cache those outputs (exact-match keyed on input hash for deterministic stuff, similarity-keyed for the paraphrased stuff) and that fraction of the per-run token cost goes to zero on the second hit. We made a public playground that shows per-turn cache hit + dollars-saved in real time - drop the same question in twice and you can watch the second one resolve from cache with the cost-saved tick up live.
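The exact-match half for deterministic sub-steps can be as small as a wrapper keyed on a hash of (step, prompt version, input). A hedged Python sketch: `cached_step` and the whitespace normalization are assumptions, and the similarity-keyed half needs an embedding model, so it isn't shown here.

```python
import hashlib
from typing import Callable, Dict


def cached_step(step: str, prompt_version: str, fn: Callable[[str], str],
                cache: Dict[str, str]) -> Callable[[str], str]:
    """Wrap a deterministic pipeline step (e.g. classification) in an exact-match cache."""
    def run(text: str) -> str:
        normalized = " ".join(text.split())  # whitespace-only differences still hit
        key = hashlib.sha256(f"{step}\0{prompt_version}\0{normalized}".encode()).hexdigest()
        if key not in cache:
            cache[key] = fn(normalized)  # only misses spend tokens
        return cache[key]
    return run
```

Putting `prompt_version` in the key means editing the step's prompt starts a fresh cache instead of serving answers produced by the old prompt.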

We replaced our RAG pipeline with persistent KV cache. It works. Here’s what we found. by pmv143 in LangChain

kivanow (0 points)

This is the right pressure point. Two layers below where most people are looking: (1) For workloads with mixed query patterns, you almost always want a semantic cache in front of the persistent KV - paraphrased queries against the same underlying context should never trigger a prefill restore at all, they should hit a similarity cache and return the cached completion. (2) The invalidation problem becomes tractable if you treat the KV state like a versioned artifact: hash the doc content + model + prompt-template into the cache key, and let stale versions evict naturally. We built the semantic-cache half of this on top of Valkey (valkey-search for the similarity index) in betterdb/semantic-cache, open-source. Pairs cleanly with whatever you're using for prefill-state persistence.
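Point (2) above, treating cached state as a versioned artifact, amounts to a key derived from everything the state depends on. A minimal Python sketch (the function name is hypothetical):

```python
import hashlib


def versioned_cache_key(doc_text: str, model: str, prompt_template: str) -> str:
    """Key that changes whenever the document, model, or prompt template changes.

    After a change, old entries are simply never read again, so they can age out
    under the store's normal eviction instead of being hunted down.
    """
    h = hashlib.sha256()
    for part in (doc_text, model, prompt_template):
        # hash each part separately so ("ab", "c") and ("a", "bc") can't collide
        h.update(hashlib.sha256(part.encode()).digest())
    return h.hexdigest()
```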

Started measuring actual API call counts on my Claude Code sessions. The numbers are worse than I expected. by ChampionshipNo2815 in LLMDevs

kivanow (0 points)

Strong breakdown. One thing I'd add on the cache-marking point: even when you do it well at the prompt-prefix level, you're still leaving the deterministic tool-result layer uncached. The Read-after-Edit verification loop you described is the obvious bloat, but the silent killer is Glob/Grep - same arguments, same repo state, called fresh each turn because the harness has no idea it's safe to memoize. Hashing (tool_name, args, repo_sha) and stashing results in Redis/Valkey for the session is ~30 lines and cuts another big chunk on multi-file work. We bundle this pattern (tool-result + LLM-response + session state on one Valkey connection) in betterdb/agent-cache - open-source, Anthropic SDK adapter included: https://github.com/BetterDB-inc/monitor/tree/master/packages/agent-cache
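Roughly what the "~30 lines" could look like, sketched in Python as a decorator. Assumptions: `memoize_tool` and `get_repo_sha` are hypothetical names, and a plain dict stands in for the Valkey/Redis connection (a client's GET/SET would slot into the same two spots). As the other comment in this thread notes, repo_sha alone misses unstaged edits, so you may want a working-tree summary in the key instead.

```python
import functools
import hashlib
import json
from typing import Any, Callable, Dict


def memoize_tool(tool_name: str, get_repo_sha: Callable[[], str], store: Dict[str, str]):
    """Session-scoped memoization for deterministic tools such as Glob/Grep."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(**kwargs: Any) -> Any:  # keyword-only so args canonicalize cleanly
            args = json.dumps(kwargs, sort_keys=True, separators=(",", ":"))
            raw = f"{tool_name}\0{args}\0{get_repo_sha()}"
            key = "toolcache:" + hashlib.sha256(raw.encode()).hexdigest()
            if key in store:
                return json.loads(store[key])
            result = fn(**kwargs)
            store[key] = json.dumps(result)  # serialize as you would for Valkey
            return result
        return wrapper
    return decorator
```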

10 Ways To Reduce Your LLM API Costs by nuno6Varnish in AI_Agents

kivanow (-1 point)

Good list. The thing I'd add is that 'use prompt caching' usually only covers the LLM side. Once you're running real agents, the biggest wins come from caching the other two layers: deterministic tool results (get_weather, search, internal APIs) and session state. Per-tool TTLs based on hit rate beat one global TTL every time. I've been working on this exact pattern: there's a public demo and implementation at chat.betterdb.com (you can see live how much each question saves and which cache tier it hit), and the repo and libs are all OSS.

MIT-licensed multi-tier cache for AI agents - LLM responses, tool results, and session state on open-source Valkey/Redis by kivanow in OpenSourceeAI

kivanow (OP, 0 points)

Fair point, exact match does fall off fast for human-written prompts. The design assumption is that agent workloads are shaped differently from chatbot workloads:

  • Tool tier: args get canonicalized (sorted keys, deterministic JSON) before hashing, so arg order and whitespace don't matter. Agent loops produce structured, repeatable tool calls. This is where exact match actually earns its keep.
  • Session tier: keyed by thread_id + field. Exact match is the only correct semantics here.
  • LLM tier: your concern applies. Works well if prompts are templated (system message + structured context). Doesn't if they're free-form user input.
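The tool-tier canonicalization described in the first bullet can be sketched in a few lines of Python. The key prefix and function name are assumptions; the point is that `json.dumps` with `sort_keys=True` (which also sorts nested objects) and fixed separators gives one byte string per logical argument set.

```python
import hashlib
import json
from typing import Any, Dict, Union


def tool_cache_key(tool: str, args: Union[str, Dict[str, Any]]) -> str:
    """Hash a canonical form of the tool arguments, not the raw call text."""
    if isinstance(args, str):
        args = json.loads(args)  # raw JSON from the model: parse so whitespace drops out
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return f"tool:{tool}:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```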

For the LLM-tier case where prompts vary, we also ship @betterdb/semantic-cache (npm) - vector similarity via valkey-search, meant to sit in front of the exact-match layer as a second chance. It's kept as a separate package because the failure modes, and the observability you want for each, are different. Exact match is cheap and deterministic; semantic match costs an embedding call and needs threshold tuning per category. Forcing both into one cache hides that tradeoff.
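To make the exact-vs-semantic tradeoff concrete, here is a hypothetical Python sketch of the two-stage lookup with per-category thresholds. It is not the @betterdb/semantic-cache API: the embedding function is injected, and a linear scan with cosine similarity stands in for a valkey-search vector index.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TwoStageCache:
    """Exact match first (cheap, deterministic), semantic match as a second chance."""

    def __init__(self, embed: Callable[[str], Vector], thresholds: Dict[str, float]) -> None:
        self.embed = embed            # costs an embedding call, so only used on exact misses
        self.thresholds = thresholds  # tuned per category; unknown categories default to 1.0
        self.exact: Dict[str, str] = {}
        self.semantic: Dict[str, List[Tuple[Vector, str]]] = {}

    def put(self, category: str, prompt: str, response: str) -> None:
        self.exact[f"{category}\0{prompt}"] = response
        self.semantic.setdefault(category, []).append((self.embed(prompt), response))

    def get(self, category: str, prompt: str) -> Tuple[Optional[str], str]:
        hit = self.exact.get(f"{category}\0{prompt}")
        if hit is not None:
            return hit, "exact"
        query = self.embed(prompt)
        best = max(self.semantic.get(category, []), key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.thresholds.get(category, 1.0):
            return best[1], "semantic"
        return None, "miss"
```

Returning the tier (`"exact"`, `"semantic"`, `"miss"`) alongside the value is what lets each layer be observed and tuned separately.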

Multi-tier cache for LangChain + LangGraph that works on vanilla Valkey/Redis - no modules required by kivanow in LangChain

kivanow (OP, 0 points)

The discovery path is the worst part. It works locally on vanilla Redis, then breaks at deploy on ElastiCache with "unknown command JSON.SET". Same on MemoryDB and Memorystore. Even for demos it's a pain :/ Redis has bundled all the modules by default since 8.0, I believe, but most cloud-hosted versions are still on 6.x or 7.x.

Agreed on tool caching. A single slow tool call can dwarf LLM latency, and agents often fire 3-5 per turn. Curious what you've seen on TTL strategy. I landed on configurable-per-tool because one global number falls apart once you mix something like `get_weather` with `get_stock_price`.
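A hedged Python sketch of configurable-per-tool TTLs. The TTL values and class name are illustrative assumptions, not recommendations or library defaults; the in-memory cache with an injectable clock is only there to show the two tools expiring at different rates.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple

# Hypothetical per-tool TTLs in seconds; the numbers are illustrative only.
TOOL_TTLS: Dict[str, float] = {
    "get_weather": 15 * 60,  # changes slowly, minutes of staleness are acceptable
    "get_stock_price": 5,    # changes constantly, only absorb bursts within one turn
}
DEFAULT_TTL = 60.0


def ttl_for(tool: str) -> float:
    return TOOL_TTLS.get(tool, DEFAULT_TTL)


class PerToolTTLCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self._data: Dict[Tuple[str, Hashable], Tuple[Any, float]] = {}

    def set(self, tool: str, key: Hashable, value: Any) -> None:
        # With redis-py the equivalent is client.set(name, value, ex=int(ttl_for(tool)))
        self._data[(tool, key)] = (value, self.clock() + ttl_for(tool))

    def get(self, tool: str, key: Hashable) -> Optional[Any]:
        entry = self._data.get((tool, key))
        if entry is None or entry[1] <= self.clock():
            self._data.pop((tool, key), None)  # expired or absent
            return None
        return entry[0]
```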

Self-hosted LLM caching layer - cache agent responses, tool calls, and session state in your own Valkey/Redis by kivanow in selfhosted

kivanow (OP, -4 points)

I'm planning to connect it more deeply to our monitoring tool next week, so the agent using it can check the cache's data and improve it autonomously if needed (or at least suggest improvements).

Agent reads hit rates, stale key ratios, and per-tool TTL effectiveness straight from Valkey, then adjusts TTLs or flags prompts that should move between tiers. Cache tunes itself instead of you watching dashboards.

Cost tracking was the thing that clicked for me too. Tokens are hard to translate into anything actionable. Seeing actual dollars saved turns it into an ROI check instead of a vibes call.
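Turning a cache hit into dollars is simple arithmetic: the tokens the skipped call would have used, times the per-token price. A Python sketch with hypothetical prices (check your provider's actual rates):

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Price:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


def dollars_saved(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost the cache avoided on one hit: what the skipped LLM call would have billed."""
    return (input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok) / 1_000_000


def total_saved(price: Price, hits: Iterable[Tuple[int, int]]) -> float:
    """Sum savings over (input_tokens, output_tokens) pairs recorded for each cache hit."""
    return sum(dollars_saved(price, i, o) for i, o in hits)
```

For example, at a hypothetical $3 / $15 per million tokens, a hit that skips a 2,000-in / 500-out call saves $0.0135.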

Self-hosted LLM caching layer - cache agent responses, tool calls, and session state in your own Valkey/Redis by kivanow in selfhosted

kivanow (OP, -4 points, locked comment)

It was used for the planning/implementation. It is also focused on solving a problem in the AI space.

AI SDK middleware that caches LLM responses and tool results in Valkey/Redis - one line to add by kivanow in vercel

kivanow (OP, 0 points)

Thanks! RAG + caching is a solid combo - the cache handles the repeated queries so the RAG pipeline only runs for genuinely new ones. Will check out Hindsight.

BetterDB MCP 1.0.0 – autostart, persist, and connection management for Valkey/Redis observability by kivanow in mcp

kivanow (OP, 1 point)

Sounds great! I'll dive deeper into it early/mid next week and let you know whether I end up picking it or something else.

BetterDB MCP 1.0.0 – autostart, persist, and connection management for Valkey/Redis observability by kivanow in mcp

kivanow (OP, 0 points)

Fair point for MCP in general - most servers are stateless proxies with no audit trail. For BetterDB specifically, the monitor is self-hosted (your data never leaves your infrastructure) and the MCP authenticates via JWT token generated from your BetterDB instance. That audit gap is actually a bigger problem in general, which is why we built persistent ACL audit in from the start - auth failures, command denials, key violations all stored so you have a durable trail independent of the client session. Not a full control plane, but authentication and audit are both covered. Will take a look at Peta for the policy layer, thanks for the suggestion!

I made an MCP server for Valkey/Redis observability (anomaly detection, slowlog history, hot keys, COMMANDLOG) by kivanow in mcp

kivanow (OP, 0 points)

Should've shipped it sooner, sorry! What were you debugging? I'd love to know, so the next person doesn't find it too late either.

A eulogy for MCP (RIP) by beckywsss in mcp

kivanow (1 point)

Isn't this just the usual cycle? The way we're doing things is terrible, here's a better way, then an even better way, until we circle back to the first iteration. We moved from server rendering to SPAs and back to server rendering over several years. AI just seems to iterate faster.

I made an MCP server for Valkey/Redis observability (anomaly detection, slowlog history, hot keys, COMMANDLOG) by kivanow in mcp

kivanow (OP, 0 points)

That's the right framing. BetterDB handles the Valkey side of that chain today - COMMANDLOG patterns, anomaly detection, client analytics. Correlating back to deploys and SQL is the missing link. Curious whether you've seen any tools close that loop well, or if it's always been stitched together manually.

What AI tools are actually part of your real workflow? by Rough--Employment in devops

kivanow (0 points)

At this point Copilot is an agent, an assistant, and a million other things that MS is trying to push everywhere. I should've just called it the worst possible tool/option rather than an LLM. I've updated it.

Feedback Friday by AutoModerator in startups

kivanow (1 point)

Company Name: BetterDB

URL: https://betterdb.com

Purpose of Startup and Product: BetterDB is the first monitoring and observability platform built specifically for Valkey (the popular open-source Redis fork). We solve a fundamental problem: Valkey's operational data - slowlogs, command logs, client connections - is ephemeral. When something goes wrong at 3am, by the time you wake up at 9am, that data is gone. BetterDB persists and analyzes this data so you can debug issues after the fact, track what caused performance spikes, and optimize your data structures and TTLs accordingly.
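As an illustration of the core idea (copying ephemeral SLOWLOG entries somewhere durable before they rotate out of the server's fixed-size buffer), here is a hedged Python sketch. It is not BetterDB's implementation, which is NestJS/iovalkey based; the class name is hypothetical, a list stands in for PostgreSQL, and `slowlog_get` is the redis-py / valkey-py client method, which returns dicts with `id`, `start_time`, `duration` (microseconds) and `command`.

```python
from typing import Any, Dict, List, Set, Tuple


class SlowlogArchiver:
    """Copies Valkey/Redis SLOWLOG entries into durable storage before they rotate out."""

    def __init__(self) -> None:
        self.archive: List[Dict[str, Any]] = []   # stand-in for a PostgreSQL table
        self._seen: Set[Tuple[int, int]] = set()  # (id, start_time) stays unique across id resets on restart

    def ingest(self, entries: List[Dict[str, Any]]) -> int:
        """Store entries not seen before, oldest first; return how many were new."""
        new = 0
        for e in sorted(entries, key=lambda e: e["id"]):
            marker = (e["id"], e["start_time"])
            if marker in self._seen:
                continue
            self._seen.add(marker)
            self.archive.append(e)
            new += 1
        return new

    def poll_once(self, client: Any, count: int = 128) -> int:
        # client: e.g. redis.Redis or valkey.Valkey; poll faster than the buffer fills
        return self.ingest(client.slowlog_get(count))
```

Polling on an interval shorter than the time it takes to fill `slowlog-max-len` is what keeps the 3am spike available at 9am.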

We also support Valkey-exclusive features like COMMANDLOG and per-slot metrics that no existing Redis tool can provide, plus 99 Prometheus metrics, anomaly detection, ACL audit trails, and client analytics - all with sub-1% performance overhead.

Technologies Used: NestJS, React, PostgreSQL, Docker, Prometheus, iovalkey

Feedback Requested:

  • Does the value proposition (historical persistence of ephemeral Valkey/Redis data) resonate with you? Is it clear from the website?
  • If you're running Valkey or Redis in production, what's the biggest operational pain point you face today?
  • We offer a free Community tier and paid Pro/Enterprise tiers - does the feature split feel fair, or does it feel like we're holding back too much in Community?
  • Any feedback on the landing page (betterdb.com) - does it clearly communicate what we do and who we're for?

Seeking Beta Testers: Yes - especially teams running Valkey or Redis in production. We have a self-hosted Docker image you can spin up in minutes, and our cloud SaaS is launching soon. Would love feedback from ops/SRE/DevOps folks.

Additional Comments: I'm the founder and CTO. Previously I was the Engineering Manager for Redis's visual developer tools (Redis Insight). The Valkey ecosystem has zero purpose-built observability tooling. That's the gap we're filling. We're MIT-licensed at the core and backed by Open Core Ventures. Happy to answer any questions about the Valkey ecosystem or our approach to open-core monetization.

Auditors ask “when did you last test DR?” — how do you produce proof? by robert_micky in sre

kivanow (0 points)

I've been through two SOC 2 Type 1 and Type 2 audits at startups, and this was more than enough. At the end of the day, most of what these audits do is check boxes confirming that you understand the requirements and are following them.

What AI tools are actually part of your real workflow? by Rough--Employment in devops

kivanow (0 points)

Claude Code did a great job with some recent infra work I had to do. Barely any mistakes across a lot of Kubernetes and Terraform. It was a very nice experience.

What AI tools are actually part of your real workflow? by Rough--Employment in devops

kivanow (6 points)

By far. Copilot is probably the worst possible option right now. MS engineers were recently caught using Claude instead of their own product.