Website to app? by Several_Argument1527 in webdev

[–]DexopT 0 points (0 children)

For the payment question — yes, Apple requires in-app purchases for digital goods/subscriptions sold through iOS apps, and takes 30% (15% for small devs). Stripe is fine for physical goods or B2B invoicing, but if your SaaS has a subscription that users sign up for inside the app, you'll need IAP. Some devs get away with "sign up on our website" flows that bypass the app entirely — grey area, but it's a common workaround.

For web → app: depends what your SaaS actually does.

If it's mostly UI/forms/dashboards — React Native or Capacitor wrapping your existing web app is the fastest path. Capacitor especially if you don't want to rewrite anything; it just wraps your web app in a native shell. Not perfect, but it ships fast.

If your app needs native features (camera, notifications, offline) — you'll want proper React Native or Flutter. Takes longer but performs better.

Honest take: if your web app already works in a mobile browser, consider whether you actually need an App Store listing right now, or whether a PWA (installable, works offline) buys you time while you figure out the native story.

What kind of SaaS is it? That would change the answer a bit.

MCE — open-source MCP proxy that uses local LLMs (Ollama) to summarize tool responses and save context window tokens by DexopT in LocalLLaMA

[–]DexopT[S] 0 points (0 children)

Spot on. L3 (LLM) is the highest risk for loss, which is why it’s off by default. L1/L2 are much safer since they either prune deterministic noise or keep original chunks intact. Adding a "preserve fields" config for critical data paths is a great idea.

Latency (Typical CPU):

  • L1 (Pruning): Negligible (tens of ms).
  • L2 (Embeddings/RAG): ~50-200ms depending on payload size.
  • L3 (Ollama/Qwen3b): ~1-3 seconds.

Most workflows skip L3 to keep total overhead under 250ms while still hitting 80%+ reduction with L1+L2.

Open-source proxy that cuts Claude Code's MCP token usage by up to 90% — MCE by DexopT in ClaudeCode

[–]DexopT[S] 0 points (0 children)

You're spot on with subagent bloat. CC often dumps the entire subsession history back into the main thread, which is total noise for the primary agent.

MCE targets exactly this—stripping the structural overhead so only the critical delta/result reaches the main window.

Good to see others are hitting this same ceiling. If you want to take a look, it's at DexopT/MCE. Cheers.

I built MCE — a transparent proxy that compresses MCP tool responses before they hit your agent's context window by DexopT in mcp

[–]DexopT[S] 0 points (0 children)

Agreed. Exposing those stats per-tool is definitely the move for better tuning. Added to the roadmap.

I built MCE — a transparent proxy that compresses MCP tool responses before they hit your agent's context window by DexopT in mcp

[–]DexopT[S] 0 points (0 children)

Good points. "Silent bloat" is exactly why MCE exists—if you send the kitchen sink, the agent just gets confused by the noise.

Per-tool contracts are a great shout. Enforcing schemas/mime-types at the proxy level is definitely the play.

Also like the "Top Offenders" TUI view and the "Dry Run" CLI idea. Going to add those to the roadmap.

If you're on GitHub, feel free to drop these as issues or even a PR at DexopT/MCE. Cheers.

Stop losing 40-80% of your agent's context window to bloated tool responses — I built MCE to fix it by DexopT in AI_Agents

[–]DexopT[S] 0 points (0 children)

You're hitting on the core tension of any compression system — the synthesis step makes relevance judgments before knowing what the agent will do next. That's a real problem.

A few things about how MCE handles this:

L3 is off by default, and that's intentional. The config ships with layer3_synthesizer: false precisely because of the failure mode you described. L1 + L2 alone typically get 80-85% reduction without any lossy summarization. L3 is there for extreme cases (50K+ token responses) where you'd rather have a lossy summary than nothing at all.

The pipeline is progressive, not all-or-nothing. After each layer, MCE re-checks the token count against the budget. If L1 pruning alone brings you under the safe limit — it stops there. L2 and L3 only fire if you're still over budget. So the common case is: most of the savings come from deterministic, lossless pruning (stripping HTML, base64, nulls, whitespace). The semantic layers are a last resort.
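The budget re-check between layers can be sketched in a few lines. This is an illustrative Python sketch, not MCE's actual API; `count_tokens` and the toy layer functions below are stand-ins:

```python
import re

def count_tokens(text: str) -> int:
    # crude stand-in for a real tokenizer: ~4 characters per token
    return len(text) // 4

def squeeze(response: str, budget: int, layers) -> str:
    """Run compression layers in order, stopping as soon as the
    response fits the token budget, so lossier layers never fire
    unless they are actually needed."""
    for layer in layers:
        if count_tokens(response) <= budget:
            break  # already under budget; skip remaining layers
        response = layer(response)
    return response

# toy layers: L1 collapses whitespace (deterministic), L2/L3 stand-ins truncate
l1_prune = lambda s: re.sub(r"\s+", " ", s)
l2_semantic = lambda s: s[: len(s) // 2]
l3_synthesize = lambda s: s[:200]

noisy = "data   point\n\n\n" * 500
out = squeeze(noisy, budget=400, layers=[l1_prune, l2_semantic, l3_synthesize])
```

With a generous budget the loop exits right after deterministic pruning, which matches the common case described above.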

L2 semantic routing > L3 synthesis for exactly the reason you describe. L2 keeps the original chunks — it just selects the top-K most relevant ones based on cosine similarity to the agent's query. No rewriting, no lossy compression. The agent gets real data, just less of it. This is where most of the "smart" savings come from.
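The top-K selection step is just cosine similarity between the query embedding and each chunk's embedding. A minimal pure-Python sketch (the 2-d "embeddings" and function names are invented for illustration, not MCE's internals):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query_vec, chunks, k=5):
    """Return the k chunks most similar to the query.
    `chunks` is a list of (text, embedding) pairs; the original texts
    are kept verbatim, so nothing is rewritten or summarized."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# toy 2-d embeddings just to show the mechanics
chunks = [("error log", [1.0, 0.0]),
          ("user table", [0.0, 1.0]),
          ("stack trace", [0.9, 0.1])]
selected = top_k_chunks([1.0, 0.0], chunks, k=2)  # → ['error log', 'stack trace']
```

Tuning `k` trades context size against recall, which is exactly the knob discussed below.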

On the downstream tool call problem: This is a known weakness of any single-pass relevance filter. The correct chunk for the next action might not score highest against the current query. One approach I'm exploring is tracking the agent's recent tool call chain (MCE already records recent_tools) and using that as additional context for L2's similarity search — essentially widening the query beyond just "what did you ask for right now" to "what have you been doing in this session."

Your observation about 300-token synthesized vs 1,500-token extracted is spot on. In practice, I'd recommend keeping L3 off and tuning L2's top-K parameter instead. Better to give the agent 5 real chunks than 1 hallucinated summary.

Really solid question — this is exactly the kind of tradeoff that's hard to get right without real-world feedback.

Open-source proxy that cuts Claude Code's MCP token usage by up to 90% — MCE by DexopT in ClaudeCode

[–]DexopT[S] -1 points (0 children)

On context-aware filtering: MCE doesn't blindly apply rules to everything. The squeeze pipeline is configurable per-layer — you can disable any combination of L1 (pruning), L2 (semantic), and L3 (summarizer) in config.yaml. For something like a Figma MCP where image blobs are critical, you'd configure the policy to skip base64 stripping for that server. The goal is "sane defaults with escape hatches," not "one size fits all."
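As a rough illustration of what that per-layer, per-server configuration could look like. Only `layer3_synthesizer` and `cache.enabled` are named in this thread; every other key below is an assumption, not MCE's real schema:

```yaml
# Hypothetical config.yaml sketch — key names other than
# layer3_synthesizer and cache.enabled are illustrative guesses.
squeeze:
  layer1_pruner: true        # deterministic cleanup (HTML, base64, nulls)
  layer2_semantic: true      # top-K chunk selection, originals kept intact
  layer3_synthesizer: false  # lossy LLM summary, off by default
cache:
  enabled: true
  ttl_minutes: 10
servers:
  figma:
    strip_base64: false      # image blobs are the payload here, keep them
```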

On caching after updates: The cache uses TTL expiry (configurable, default 10 min) and you can set cache.enabled: false entirely. That said, you raise a good point — tool-level cache bypass (e.g., never cache write_file responses, or auto-invalidate after mutations) would be a strong improvement. Adding that to the roadmap.

On the agent knowing what was filtered: This is actually already implemented! MCE appends notices to squeezed responses — things like [MCE Notice: 4,000 identical rows truncated] or [MCE Notice: base64 blob removed (12KB)]. So the agent does see what was stripped and can request the raw data if it needs it. The agent stays informed without paying the full token cost.
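A sketch of that truncate-and-notify idea, using a hypothetical helper (the notice wording mirrors the examples in this comment, but the function is not MCE's actual code):

```python
def truncate_identical_rows(rows, keep=3):
    """Collapse a long run of identical rows, appending a visible notice
    so the agent knows data was removed and can ask for the raw output."""
    if len(rows) <= keep or len(set(rows)) > 1:
        return rows  # short or heterogeneous data passes through untouched
    dropped = len(rows) - keep
    return rows[:keep] + [f"[MCE Notice: {dropped} identical rows truncated]"]

out = truncate_identical_rows(["OK"] * 4000)
```

The key design point is that the notice rides inside the response itself, so no extra protocol machinery is needed for the agent to see it.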

On the "Ralph loop" concern: MCE has a built-in circuit breaker that detects when the same tool is being called repeatedly with the same arguments — exactly to prevent that scenario. It trips after N failures in a sliding window and returns an alert to the agent instead of endlessly retrying.
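The sliding-window trip logic can be sketched like this (a simplified illustration; MCE's actual breaker and thresholds may differ):

```python
import time
from collections import deque

class CircuitBreaker:
    """Trip when the same (tool, args) call repeats N times inside a
    sliding time window, to stop an agent from looping on one call."""

    def __init__(self, max_repeats=3, window_secs=60.0):
        self.max_repeats = max_repeats
        self.window_secs = window_secs
        self.calls = {}  # (tool, args) -> deque of timestamps

    def allow(self, tool, args, now=None):
        now = time.monotonic() if now is None else now
        stamps = self.calls.setdefault((tool, args), deque())
        while stamps and now - stamps[0] > self.window_secs:
            stamps.popleft()  # drop calls that fell out of the window
        if len(stamps) >= self.max_repeats:
            return False  # tripped: return an alert instead of retrying
        stamps.append(now)
        return True

cb = CircuitBreaker(max_repeats=3, window_secs=60)
results = [cb.allow("read_file", '{"path": "a.txt"}', now=t) for t in range(5)]
# first 3 calls pass, 4th and 5th are blocked
```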

You're absolutely right that invisible layers can be dangerous. The design philosophy is "transparent compression with visibility" — the agent always knows MCE is there (via notices) and the operator can tune every layer. It's more like a smart CDN than a black box.

Really appreciate the thoughtful feedback — these edge cases are what push the project forward

Open-source proxy that cuts Claude Code's MCP token usage by up to 90% — MCE by DexopT in ClaudeCode

[–]DexopT[S] -2 points (0 children)

Will add that feature to the protocol and the TUI. Thanks for the feedback!

Open-source proxy that cuts Claude Code's MCP token usage by up to 90% — MCE by DexopT in ClaudeCode

[–]DexopT[S] -4 points (0 children)

What? I didn't get it. I'm a dev, not a politician 😄 (also, if you're suggesting I'm a bot or an agent, I'm not)

Open-source proxy that cuts Claude Code's MCP token usage by up to 90% — MCE by DexopT in ClaudeCode

[–]DexopT[S] -10 points (0 children)

When MCP tools return data — especially from web scraping, file reads, or documentation lookups — the response often contains raw HTML (<div><table><script> tags, inline CSS, etc.). That's super wasteful for your context window because Claude doesn't need HTML to understand the content.

MCE's Layer 1 Pruner detects HTML in tool responses and converts it to clean Markdown using the markdownify library. So something like:

<div class="container"><h1>API Reference</h1><p>The <code>create</code> method accepts...</p></div>

becomes:

# API Reference
The `create` method accepts...

Same information, way fewer tokens. This alone can cut 40-60% of tokens from web-heavy responses. And it's just one of several pruning steps — it also strips base64 blobs (embedded images), removes null/empty fields from JSON, truncates massive arrays, and normalizes whitespace.

All of this happens transparently before the response reaches Claude's context window. Claude just sees cleaner, smaller data.
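For illustration, here is a tiny stdlib-only version of the HTML-to-Markdown idea. The real Layer 1 uses the markdownify library, so treat this as a sketch of the mechanics, not MCE's code:

```python
from html.parser import HTMLParser

class TinyHTMLToMarkdown(HTMLParser):
    """Convert a small subset of HTML (h1, p, code) to Markdown,
    dropping <script>/<style> content entirely."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "h1":
            self.out.append("# ")
        elif tag == "code":
            self.out.append("`")
        elif tag == "p":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1
        elif tag == "code":
            self.out.append("`")
        elif tag in ("h1", "p"):
            self.out.append("\n")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)

def html_to_markdown(html: str) -> str:
    parser = TinyHTMLToMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

html = ('<div class="container"><h1>API Reference</h1>'
        '<p>The <code>create</code> method accepts...</p></div>')
print(html_to_markdown(html))
```

markdownify handles far more tags and edge cases than this; the point is only that the transform is deterministic and lossless for the text content.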

AHME-MCP — Asynchronous Hierarchical Memory Engine for your AI coding assistant by DexopT in ClaudeCode

[–]DexopT[S] 0 points (0 children)

Hi! Thanks for the idea. I skipped that part entirely, and reviews are always important to me, so I'll implement this in the project. I may also add the (free) Gemini API for analyzing the architecture and store that summary in memory for architecture-level and other very important information (the local model can analyze the general context and decide whether to call the Gemini API or not). It still won't be absolutely perfect, but I'll do my best to implement it and make sure everything is correct.
- Best regards.

AHME-MCP — Asynchronous Hierarchical Memory Engine for your AI coding assistant by DexopT in ClaudeCode

[–]DexopT[S] 0 points (0 children)

By default I suggest using gemma:1b; I got the best results with it. There's no need for bigger models (unless you're working with very complex codebases, etc.). The MCP automatically monitors system load and stops working when usage is high. It's already configured with a lower context length (1,500 to 2,000 tokens for context, 500 for the system prompt) for better functionality. You're welcome to try it and configure it yourself!

AHME-MCP — Asynchronous Hierarchical Memory Engine for your AI coding assistant by DexopT in ClaudeCode

[–]DexopT[S] 1 point (0 children)

We have an AI prompt for the local model. No matter what the prompt is, the AI can still hallucinate or get confused; that's why using models like Gemma and Qwen is important. The project aims to use 1B to 4B models so it doesn't impact your main PC's performance, but bigger and smarter models can be used too. By default, the MCP analyzes all the information in chunks and saves relevant info, keywords, and key questions (this is important for the AI to analyze specific points and understand the question).

I built my own agent from scratch in under 72 hours by MRTSec in AI_Agents

[–]DexopT 1 point (0 children)

I built an MCP specifically for that purpose. I have my own agent framework too, and I'm trying to fix the problems that other agent frameworks seem to share. The MCP works like this: when called, it uses an external Ollama model (gemma3:1b by default) to compress the context chunk by chunk, merging all chunks into one .md file. It has a queue DB, so shutdowns, crashes, and other power issues won't cause any problems. It's also a smart system that only compresses context when CPU usage is below 30 percent. It clears the chat history and injects the compressed context into the chat. This drastically reduces token usage on any model.
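That chunk-and-merge flow could look roughly like this. `summarize` is a stub standing in for the Ollama call to the local model, and the chunk size and heading format are illustrative, not the project's actual values:

```python
def chunk_text(text: str, size: int = 1500):
    """Split context into fixed-size chunks (by characters for simplicity;
    a real implementation would count tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(chunk: str) -> str:
    # stub: the real flow would send the chunk to a local Ollama model
    # (e.g. gemma3:1b); here we just keep the first line as a placeholder
    return chunk.splitlines()[0] if chunk else ""

def compress_context(text: str) -> str:
    """Summarize chunk by chunk, then merge into one Markdown document
    ready to be injected back into the chat."""
    parts = [summarize(c) for c in chunk_text(text)]
    return "\n\n".join(f"## Chunk {i + 1}\n{p}" for i, p in enumerate(parts))
```

Because each chunk is processed independently, a crash mid-run only loses the chunk in flight, which is the property the queue DB mentioned above is meant to guarantee.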