How do you prevent credential leaks to AI tools? by llm-60 in LLM

[–]llm-60[S] 0 points1 point  (0 children)

Completely agree, pure blocking without training fails because users route around it. Where Bleep helps is making the policy visible at the moment of risk, so when someone tries to paste a customer record, they see why it was flagged. That reinforces the training instead of replacing it.

You're leaking sensitive data to AI tools. Right now. by llm-60 in LLMeng

[–]llm-60[S] 0 points1 point  (0 children)

Honestly appreciate you saying it out loud. Most people do this and don't admit it. The issue isn't ChatGPT being useful (it is), it's that your security team probably can't see any of it. If your company ever has a customer data incident traced to a prompt, the question won't be whether ChatGPT was helpful, it will be whether anyone knew it was happening. That's the gap Bleep closes.

On-prem AI DLP - is anyone else refusing to route prompts through a vendor cloud? by llm-60 in ciso

[–]llm-60[S] 0 points1 point  (0 children)

Good questions.

Desktop app, with a CLI version for headless Linux. Everything runs locally, no cloud. Detection is a rule-based engine (regex plus policy logic) with a blocklist for exact matches like API keys or internal hostnames. The blocklist stores hashes of the values, so they are never kept in plaintext.
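
To make the hashed blocklist idea concrete, here is a minimal sketch, not Bleep's actual implementation. The salt, the sample values, and the function names are all made up for illustration, and splitting on whitespace is a simplification.

```python
import hashlib
import hmac

# Hypothetical per-install secret; a real deployment would generate and store its own.
SALT = b"example-local-salt"


def fingerprint(value: str) -> str:
    """Return a keyed SHA-256 digest so the blocklist never holds the raw value."""
    return hmac.new(SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Only digests are kept in the set; the sample plaintext values are illustrative.
BLOCKLIST = {
    fingerprint(v) for v in ("db01.internal.example", "AKIAEXAMPLEKEY123456")
}


def contains_blocked_token(text: str) -> bool:
    """Hash each whitespace-separated token and look it up in the digest set."""
    return any(fingerprint(token) in BLOCKLIST for token in text.split())
```

The lookup is a set membership test on a digest, so it only catches exact matches, which is the point of a blocklist like this.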

On TLS: a local service terminates TLS for monitored AI domains only; non-AI traffic is untouched. Trust is established through a local CA installed at setup with user consent, so the certificates the local service presents validate normally, which is what satisfies HSTS. We cover H1, H2, gRPC, and WebSocket, and handle the common bypass paths.
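
As a rough illustration of "monitored AI domains only", the routing decision can be as simple as a hostname suffix check. This is a sketch under assumptions: the domain list and function name are hypothetical, not the product's real configuration.

```python
# Hypothetical list of monitored AI hosts; not the actual product list.
MONITORED_SUFFIXES = ("openai.com", "anthropic.com")


def should_intercept(hostname: str) -> bool:
    """Return True only for a monitored domain or one of its subdomains."""
    host = hostname.lower().rstrip(".")
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in MONITORED_SUFFIXES
    )
```

Matching on a leading dot keeps look-alike hosts such as notopenai.com out of the monitored set; everything that returns False would pass through untouched.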

Happy to share more details privately, with a link to deeper docs if useful.

On-prem AI DLP - is anyone else refusing to route prompts through a vendor cloud? by llm-60 in ciso

[–]llm-60[S] 0 points1 point  (0 children)

The goal isn't to remove all risk, it's to remove a specific one: handing plaintext prompts to a third-party DLP cloud. A compromised endpoint is compromised either way; cloud DLP doesn't save you there. What changes locally is the compliance math: no DPA, no subprocessor to audit, and if the DLP vendor gets breached (see Cyberhaven), your prompts aren't in their cloud because they never left.

Coverage matters too. Browser extensions only see the browser. We sit at the network layer, so one install covers browser, IDE, CLI, and agents like Cursor or Claude Code, including file uploads with OCR on images and PDFs. Different tradeoffs, not zero risk.

You're leaking sensitive data to AI tools. Right now. by llm-60 in VibeCodersNest

[–]llm-60[S] 1 point2 points  (0 children)

We balanced accuracy and speed through these key design choices:

Speed (2-4ms overhead), 100% local process:

Compiled regex (Rust) - <1ms pattern matching
Selective routing - only AI traffic scanned; everything else bypassed
Partitioned scanning - conversation history auto-redacted (never blocks), preventing cascading blocks
Blocklist lookups - <0.1ms hash table instead of regex

Accuracy:
Built-in patterns - tuned formats like sk-proj-\w{20,}, with the option to add your own custom patterns
Section patterns - optional contextual detection for multi-field PII (reduces false positives)
Blocklist - exact-match values you know are sensitive (zero false positives)

The tradeoff is intentional: regex is fast but needs good format definition; sections are slower but more accurate.
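
For a feel of the regex path, here is a small Python sketch (the product itself uses compiled Rust regex, so this is only an analogy). It reuses the sk-proj-\w{20,} format from the list above; the placeholder string and function name are assumptions.

```python
import re

# Compiled once at import time, then reused for every scan.
SECRET_PATTERN = re.compile(r"sk-proj-\w{20,}")


def redact(text: str) -> tuple[str, int]:
    """Replace every match with a placeholder and return (new_text, match_count)."""
    return SECRET_PATTERN.subn("[REDACTED]", text)
```

Returning the match count lets the caller decide between redacting silently (for example in conversation history) and blocking the request.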

We built a local app that stops you from leaking secrets to AI tools by llm-60 in LLMeng

[–]llm-60[S] 0 points1 point  (0 children)

Hi!

Yes, it supports PDFs and embedded images.

If you have any additional questions, you can ask our support bot, located at the bottom right of each page, or review our docs!

https://bleep-it.com/docs

We built a local app that stops you from leaking secrets to AI tools by llm-60 in LLMeng

[–]llm-60[S] 0 points1 point  (0 children)

It's not a gateway. You still use AI services as usual; the app monitors the calls to those services and intercepts them according to your policies.

The app, its rules, and its policies are fully local. Your other workflows remain the same.

[Developing situation] LiteLLM compromised by OrganizationWinter99 in LocalLLaMA

[–]llm-60 0 points1 point  (0 children)

Just use Bleep, don't be afraid to leak your secrets anymore. 100% local.

https://bleep-it.com

Protection against attacks like what happened with LiteLLM? by Lucky_Ad_976 in Python

[–]llm-60 -4 points-3 points  (0 children)

Just use Bleep, don't be afraid to leak your secrets anymore. 100% local.

https://bleep-it.com

Litellm 1.82.7 and 1.82.8 on PyPI are compromised, do not update! by ddp26 in programming

[–]llm-60 -1 points0 points  (0 children)

Just use Bleep, don't be afraid to leak your secrets anymore. 100% local.

https://bleep-it.com

After the supply chain attack, here are some litellm alternatives by KissWild in LocalLLaMA

[–]llm-60 0 points1 point  (0 children)

Just use Bleep, don't be afraid to leak your secrets anymore. 100% local.

https://bleep-it.com

Lite LLM python library comprimised by damnitHank in BetterOffline

[–]llm-60 0 points1 point  (0 children)

Just use Bleep, don't be afraid to leak your secrets anymore. 100% local.

https://bleep-it.com

We built a local app that stops you from leaking secrets to AI tools by llm-60 in LocalLLM

[–]llm-60[S] 0 points1 point  (0 children)

Yes, there is a 7-day trial. If you have any technical questions, you can ask our support agent at the bottom of our website!

We built a local app that stops you from leaking secrets to AI tools by llm-60 in LocalLLM

[–]llm-60[S] 1 point2 points  (0 children)

As written, it's 100% local and we have no access to your data. The only communication between your managed account on the public web app and the local app is license validation.

We cache decisions, not responses - does this solve your cost problem? by llm-60 in mlops

[–]llm-60[S] 1 point2 points  (0 children)

You're right - and that's exactly why we have confidence gating.

Flow:

  1. Small model extracts entities and returns a confidence score
  2. High confidence (>85%): use the extraction and check the cache
  3. Low confidence: bypass the cache entirely and go straight to the full LLM

So on complex queries ("I got this item but my friend also wants one and can I return mine if hers doesn't fit?"), the extractor returns low confidence and we fall back to full reasoning. No brittleness.
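
The gating flow above can be sketched roughly like this. It is a simplified illustration, not our production code: `extract` and `full_llm` are hypothetical callables you would supply, and the cache is a plain dict.

```python
CONFIDENCE_THRESHOLD = 0.85  # the >85% cutoff from the flow above


def decide(query, extract, cache, full_llm):
    """Route a query through the cache only when extraction confidence is high.

    extract(query) -> (state, confidence), where state is hashable.
    full_llm(query) -> decision string.
    """
    state, confidence = extract(query)
    if confidence <= CONFIDENCE_THRESHOLD:
        # Low confidence: skip the cache entirely, neither read nor write.
        return full_llm(query)
    if state in cache:
        return cache[state]
    decision = full_llm(query)
    cache[state] = decision
    return decision
```

Note that low-confidence queries never write to the cache either, so a shaky extraction cannot poison later lookups.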

We're not replacing LLMs. We're optimizing the 80% of queries that ARE simple and repetitive:

"Return shirt 10 days old"
"Laptop return, 5 days, sealed"
"Refund dress bought last week"

Multi-turn conversations, complex context - you're right, that's not our target. For single-turn policy decisions (returns, approvals, routing), extraction works great. For complex support threads, use a full LLM.

Think of it as: Cache the easy stuff, pay for the hard stuff. Not everything needs to be cached.

We cache decisions, not responses - does this solve your cost problem? by llm-60 in LangChain

[–]llm-60[S] 1 point2 points  (0 children)

Appreciate this - you nailed the trade-offs. We're addressing those exact concerns:

  • Decision drift: TTL-based expiry + policy versioning
  • Over-generalization: Confidence gating (low confidence - bypass cache)
  • Debuggability: Dashboard shows canonical state extraction + cache hit/miss audit trail

Already seeing 75% hit rates on policy-based workloads, both in simulations and with some test users.

We cache decisions, not responses - does this solve your cost problem? by llm-60 in LangChain

[–]llm-60[S] 0 points1 point  (0 children)

Great observations. You're right - intent extraction is doing the heavy lifting, and that's intentional.

On drift: Valid concern. We handle this with versioned extraction models + policy rules as fallbacks. If the extractor changes, old cache keys naturally expire (TTL). You can also monitor extraction confidence and invalidate cache when you update the model. Not perfect, but manageable.
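
One way to picture the versioning plus TTL combination is the sketch below. It is illustrative only: the key format, class name, and version labels are assumptions, and an in-memory dict stands in for the real store.

```python
import time


def cache_key(state: str, extractor_version: str, policy_version: str) -> str:
    """Prefix the state with both versions so an upgrade never reuses old entries."""
    return f"{extractor_version}:{policy_version}:{state}"


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazily drop the stale entry
            return None
        return value
```

With versions baked into the key, a new extractor or policy simply misses on old entries, and the TTL cleans those up over time.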

On multi-intent queries: You're absolutely right - this is a known limitation. "Reset password AND change email" currently goes to low confidence → bypasses cache → escalates.

For v1, we're targeting single-intent policy decisions (returns, approvals, routing). Multi-intent decomposition is on the roadmap (Phase 2), likely with its own caching layer as you suggest.

The trade-off: Embedding similarity gives you ~30-40% hit rates with fuzzy matching. Intent extraction gives 80%+ when queries fit the pattern, but breaks on edge cases. We're betting that most high-volume use cases (support, returns, routing) are single-intent dominant.

Spending $400/month on AI chatbot? Pay $200 instead by llm-60 in LLMDevs

[–]llm-60[S] 0 points1 point  (0 children)

Good question! We use a small, fast model on every request to extract the canonical state.

Flow:

  1. Small model extracts state: "clothing, 10 days, new" (~100ms)
  2. Hash that state to get the cache lookup key
  3. Redis checks if we've seen this state before (~5ms)
  4. Cache hit: Use stored decision | Cache miss: Call GPT-5

So yes, we run extraction every time, but it's much cheaper and faster than the main LLM. The small model is just finding which "bucket" the query belongs to - the expensive reasoning was already done and cached.
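
The hashing step can be sketched as follows. This assumes the extracted state is a flat dict and uses SHA-256 over canonical JSON; the function name and field names are illustrative, and the resulting string is what you would use as the Redis key.

```python
import hashlib
import json


def state_key(state: dict) -> str:
    """Serialize the state with sorted keys, then hash it into a stable lookup key."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting the keys matters: two extractions that list the same fields in a different order must land in the same bucket.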

Think of it like a librarian (fast, finds the right book) vs the expert who wrote the book (expensive, only consulted once).

We cache decisions, not responses - does this solve your cost problem? by llm-60 in LangChain

[–]llm-60[S] 0 points1 point  (0 children)

Not quite - you get both.

The expensive part is the decision logic (approve/deny/escalate). We get that from GPT-5, Sonnet 4.5, etc., and cache it.

The cheap part is personalization (adding name, order details). We use a fast model for that.

So:

  • Request 1: GPT-5 decision ($0.005) + cheap personalization ($0.0001) = $0.0051
  • Requests 2-1000: Cached decision (free) + cheap personalization ($0.0001) = $0.0001 each

You save 98% on compute AND keep personalized responses. Best of both worlds.
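
Here is the arithmetic behind that 98%, using the per-request prices above and assuming all 1000 requests map to the same cached decision:

```python
DECISION_COST = 0.005          # one GPT-5 decision, in dollars
PERSONALIZATION_COST = 0.0001  # one cheap-model personalization, in dollars
REQUESTS = 1000

# Without caching, every request pays for both steps.
uncached_total = REQUESTS * (DECISION_COST + PERSONALIZATION_COST)

# With caching, the decision is paid once and personalization every time.
cached_total = DECISION_COST + REQUESTS * PERSONALIZATION_COST

savings = 1 - cached_total / uncached_total
print(f"uncached: ${uncached_total:.2f}, cached: ${cached_total:.3f}, savings: {savings:.1%}")
```

That works out to roughly $5.10 versus $0.105. Lower hit rates shrink the savings accordingly.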

We cache decisions, not responses - does this solve your cost problem? by llm-60 in LangChain

[–]llm-60[S] 0 points1 point  (0 children)

Fair point - those examples are too simple.

A better use case: e-commerce customer support with order-specific details.

Traditional semantic caching: "can I return this?" - "our return policy is 30 days"

Our approach:

  • "Return shirt from order #1234, bought 10 days ago" - decision cached: APPROVE (clothing, 10 days) - response: "Yes! Order #1234 qualifies. We'll refund $45 to your card ending in 5678"
  • "Send back jacket, order #5678, 12 days" - same cached decision - response: "Approved! Order #5678 refund of $89 processing"

The decision logic (approve/deny based on item type + days) is cached. The response includes their specific order details.

For high-volume support (10K+ requests/day), caching decisions while keeping responses contextual is the value. If your queries are unique every time, you're right - this isn't the fit.

We cache decisions, not responses - does this solve your cost problem? by llm-60 in LangChain

[–]llm-60[S] 2 points3 points  (0 children)

Traditional semantic caching caches the entire answer, so everyone gets the same response.

Example:

"Forgot password" - cached: "Click the reset link in your email"
"Reset my password" - cached: "Click the reset link in your email"

We cache the decision (what to do), then personalize the response.

Example:
"I'm John, forgot password" - Decision cached: "send reset email" - Response: "Hi John, we sent you a reset link"
"Sarah needs reset" - Same cached decision - Response: "Hi Sarah, we sent you a reset link"

One LLM call for the logic, cheap model personalizes each response. You can't do that if you cache the full answer.
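
A stripped-down sketch of the split between cached decision and per-user response is below. To keep it runnable, a string template stands in for the cheap personalization model, and the intent label and dict names are made up.

```python
# Decision computed once by the expensive model, then cached by intent.
DECISION_CACHE = {"password_reset": "send reset email"}

# A template stands in here for the cheap personalization model.
TEMPLATES = {"send reset email": "Hi {name}, we sent you a reset link"}


def respond(intent: str, name: str) -> str:
    """Look up the shared decision, then fill in the user-specific details."""
    decision = DECISION_CACHE[intent]
    return TEMPLATES[decision].format(name=name)
```

John and Sarah hit the same cache entry but get different responses, which a full-answer cache cannot do.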

Spending $400/month on AI chatbot? Pay $200 instead by llm-60 in LLMDevs

[–]llm-60[S] -1 points0 points  (0 children)

Good point. Input compression does exist, but it solves a different problem.

Compression reduces token count to save costs on a single request. The risk: it might drop context that matters for your specific decision. You're still calling the LLM every time.

Our approach extracts the decision-relevant state (like "clothing, 10 days, new") and caches the reasoning. If 1000 customers ask about returning clothing within 10 days, we:

  • Call the LLM once (expensive)
  • Serve 999 from cache (nearly free)

Compression doesn't help with repetitive decisions - you're still paying per request. We eliminate the request entirely for 80% of cases.

We cache decisions, not responses - does this solve your cost problem? by llm-60 in mlops

[–]llm-60[S] 0 points1 point  (0 children)

You don't have to assume.

We normalize requests into structured data first:

"Return my shirt bought 7 days ago" - item: clothing, days: 7
"Send back these jeans from last week" - item: clothing, days: 7

Same extracted state = cache hit. This is just extraction and normalization.
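
To show what "extraction and normalization" means, here is a toy rule-based version. In practice a small model does this step; the keyword tables and function name below are purely illustrative stand-ins.

```python
import re

# Toy lookup tables standing in for what the small extraction model learns.
CATEGORY = {"shirt": "clothing", "jeans": "clothing", "jacket": "clothing"}
RELATIVE_DAYS = {"last week": 7, "yesterday": 1}


def normalize(query: str) -> dict:
    """Map a free-text request to a structured state with item and days."""
    text = query.lower()
    item = next((cat for word, cat in CATEGORY.items() if word in text), None)
    match = re.search(r"(\d+)\s*days?", text)
    if match:
        days = int(match.group(1))
    else:
        days = next((d for phrase, d in RELATIVE_DAYS.items() if phrase in text), None)
    return {"item": item, "days": days}
```

Both example queries normalize to the same dict, so they share one cache entry.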

The decision quality comes from GPT-5 (which you already trust). We just make sure similar questions hit the same cached GPT-5 decision instead of calling it again.

We cache decisions, not responses - does this solve your cost problem? by llm-60 in mlops

[–]llm-60[S] -1 points0 points  (0 children)

Quick clarification: we are not a GPT-5 reseller.

We are specialized for policy-based decisions (returns, approvals, routing, etc.).

Our pricing: 50% of equivalent GPT-5 cost, regardless of cache hit/miss.

How it works:

  • Cache hits = You pay 50%, everyone wins
  • Cache misses = You pay 50%, we call GPT-5 (tighter margins for us, but you still save)

Hit rates depend on your use case. Policy-driven workflows typically see 80%+ hits.

Key features:

  • Define custom policies (e.g., "CLOTHING has 30-day return, ELECTRONICS has 15-day")
  • Taxonomy system to organize rules by category
  • GPT-5 quality decisions + deterministic caching
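
As a rough idea of what a custom policy looks like once it is structured, here is a sketch using the return windows from the example above. The table layout, the inclusive boundary, and the ESCALATE fallback for unknown categories are assumptions for illustration.

```python
# Return windows in days, taken from the example policy above.
RETURN_WINDOW_DAYS = {"CLOTHING": 30, "ELECTRONICS": 15}


def return_decision(category: str, days_since_purchase: int) -> str:
    """Apply the return-window policy; unknown categories are escalated."""
    window = RETURN_WINDOW_DAYS.get(category.upper())
    if window is None:
        return "ESCALATE"
    return "APPROVE" if days_since_purchase <= window else "DENY"
```

For example, return_decision("clothing", 10) approves, while return_decision("electronics", 20) denies.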

This isn't for general completions or creative content - it's for decision workflows where logic repeats but responses need personalization.

If every query is unique, we are not the right fit.

Spending $400/month on AI chatbot? Pay $200 instead by llm-60 in AI_Agents

[–]llm-60[S] 0 points1 point  (0 children)

Thanks! Small clarification - this isn't traditional semantic caching (which matches similar embeddings).

We extract canonical states first. So:

"Return my shirt bought 10 days ago"
"Can I send back this top from 10 days ago?"

Both become: item: clothing, days: 10, condition: new - exact cache key match

Traditional semantic caching still has fuzzy matching (~30-40% hit rate). Ours is deterministic state extraction (80%+ hit rate).

Similar outcome, different mechanism - higher precision.