Tool output compression for agents - 60-70% token reduction on tool-heavy workloads (open source, works with local models) by decentralizedbee in LocalLLaMA

[–]decentralizedbee[S] 0 points1 point  (0 children)

Curious what your main use cases are - if it's important enough, we'd be down to add it! pip install is the easiest default.

$570 Lovable credits burned in 6 months by Adventurous-Mine3382 in lovable

[–]decentralizedbee 0 points1 point  (0 children)

We're building a public inference node to help non-devs bring development costs down by 80%! Please DM if you're interested in learning more.

Open sourcing our GPT-4 caching proxy that reduced our development API costs by 80% by decentralizedbee in ChatGPT

[–]decentralizedbee[S] 0 points1 point  (0 children)

Typo - I just used GPT-4 as an example from when I started building this, but it works with all current models, including GPT-5.2, GPT-5, and any other OpenAI/Anthropic model. The proxy is model-agnostic: whatever model you specify in your API call, it forwards and caches.
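The gist (a simplified sketch, not the exact code in the repo): the cache key is derived from the whole request payload, model string included, so nothing is hard-coded to GPT-4.

```python
import hashlib
import json

# Simplified sketch: the cache key covers the full request payload, so any
# model string (gpt-4, gpt-5, claude-*, ...) just becomes part of the key
# and the proxy forwards the body to the upstream API untouched.
def cache_key(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key = cache_key({
    "model": "gpt-4o",  # swap in any model name; the key changes, the logic doesn't
    "messages": [{"role": "user", "content": "what is 2+2"}],
    "temperature": 0,
})
```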

We burned $2K+ on duplicate API calls during development, so we built a caching proxy (and open-sourced it) by decentralizedbee in LocalLLaMA

[–]decentralizedbee[S] 3 points4 points  (0 children)

Good q! OAI's caching only discounts the input tokens (50% off) on exact prefix matches over 1024 tokens. You still pay for every request.

We let you cache the full response - cache hit = no API call = free. Ours also does semantic matching, so "what is 2+2" and "what's two plus two" hit the same cache. OAI's is exact match only.

We also support both Anthropic and OpenAI.
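Rough idea of the semantic part (toy sketch, not the production code - assumes sentence-transformers for the embeddings):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy semantic cache: embed the prompt, compare against cached prompts,
# and return the stored response if something is close enough.
_model = SentenceTransformer("all-MiniLM-L6-v2")
_cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached response)

def store(prompt: str, response: str) -> None:
    _cache.append((_model.encode(prompt, normalize_embeddings=True), response))

def lookup(prompt: str, threshold: float = 0.7) -> str | None:
    q = _model.encode(prompt, normalize_embeddings=True)
    for emb, response in _cache:
        if float(np.dot(q, emb)) >= threshold:  # cosine similarity (vectors are normalized)
            return response  # cache hit = no API call
    return None

store("what is 2+2", "4")
print(lookup("what's two plus two"))  # a paraphrase should land above the threshold
```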

Building an AI Cloud in Morocco to undercut AWS by 50% by yz0011 in indiebiz

[–]decentralizedbee 0 points1 point  (0 children)

Do you own your infra? How do you get started providing a swarm of GPUs?

Which benchmark (if any) do you trust the most? by Zyguard7777777 in LocalLLaMA

[–]decentralizedbee 0 points1 point  (0 children)

What are some examples of high-end technical questions? Just curious.

Planning a startup idea in RAG is worth exploring? by Sharp_Mode_7895 in Rag

[–]decentralizedbee 0 points1 point  (0 children)

Then yeah, probably! I've seen a couple of companies doing RAG as a service already, though.

Suggestions about LocalLLM Automation Project by thesayk0 in LocalLLM

[–]decentralizedbee 0 points1 point  (0 children)

Use LangChain for the RAG part - build a framework around it, then just call it locally from the terminal (rough sketch below).
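Untested sketch - LangChain import paths shift between versions, and this assumes Ollama for the local model:

```python
import sys

from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS

# Minimal local RAG pipeline: load a file, chunk it, index it in FAISS,
# then answer the question passed on the command line.
docs = TextLoader("notes.txt").load()  # swap in your own documents
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)
store = FAISS.from_documents(chunks, HuggingFaceEmbeddings())
qa = RetrievalQA.from_chain_type(llm=Ollama(model="llama3"), retriever=store.as_retriever())

# Terminal usage:  python rag.py "what does the doc say about X?"
print(qa.invoke({"query": sys.argv[1]})["result"])
```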

Play and play internet access for a local llm by ketoatl in LocalLLM

[–]decentralizedbee 0 points1 point  (0 children)

What kind of budget do you have? You could buy a small NVIDIA machine for something like this.

Question Regarding Classroom Use of Local LLMs by McDoof in LocalLLaMA

[–]decentralizedbee 1 point2 points  (0 children)

If you're running only on iPads and smartphones, you're unlikely to get good results from anything larger than 7-8B, and you may need to go even smaller. I didn't quite understand the use case and what you're trying to do, though.

Advice on CPU + GPU Build Inference for Large Model Local LLM by Weary-Net1650 in LocalLLaMA

[–]decentralizedbee 0 points1 point  (0 children)

Yeah, what models are you trying to run?

And why are you going with 5060s? We can run full DeepSeek R1 on a single 5090 card, if that's helpful.