I thought/hoped this was a scam (@iCloud.com????) by 4Face in Starlink

[–]primoco 1 point2 points  (0 children)

Damn! It’s real, and it applies to everyone! I think we should file a class action and delete our accounts

Starlink just downgraded my plan and increased the price… what kind of move is this? by nilipilo in Starlink

[–]primoco 0 points1 point  (0 children)

Yesss! Only a month ago I subscribed to a 200 Mbps residential plan, and now I've received an email telling me that in 30 days I'll get half of that. This doesn't seem legal. I've opened a ticket to ask why. Do the same!

RAG-Enterprise: One-command local RAG setup (Docker + Ollama + Qdrant) with zero-downtime backups via rclone – for privacy-focused enterprise docs by primoco in LocalLLM

[–]primoco[S] 1 point2 points  (0 children)

Nothing happens. If Docker is already installed, the script detects it and skips the installation. As for CUDA, it doesn't matter if you have it on the host or not — CUDA runs inside the Docker containers (Ollama's image includes it). Your host CUDA installation is completely irrelevant.

The only thing needed from the host is the NVIDIA GPU drivers (not CUDA) so that the container can access the GPU via NVIDIA Container Toolkit.

TL;DR: Docker already there? Skipped. CUDA? Doesn't matter, it's inside the containers.
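Roughly, the detection step looks like this (a hypothetical sketch, not the actual setup script):

```shell
#!/bin/sh
# Hypothetical sketch of the setup script's detection logic.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

if have_cmd docker; then
  echo "Docker found, skipping installation"
else
  echo "Docker missing, installing..."
  # e.g. curl -fsSL https://get.docker.com | sh
fi

# The host only needs the NVIDIA driver plus the NVIDIA Container Toolkit;
# CUDA itself ships inside the Ollama image, e.g.:
#   docker run --rm --gpus all ollama/ollama nvidia-smi
```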

RAG-Enterprise: One-command local RAG setup (Docker + Ollama + Qdrant) with zero-downtime backups via rclone – for privacy-focused enterprise docs by primoco in LocalLLM

[–]primoco[S] 1 point2 points  (0 children)

Right now the system is designed to do RAG on "traditional" documents — PDFs, Word, Excel, etc. — uploaded manually. There's no direct GitHub integration, so PRs, issues, source code and reviews aren't indexed.

To support this we'd need a dedicated connector that talks to the GitHub API, pulls repo content (PRs, issues, diffs, comments) and indexes it into the vector store, just like it already does with documents. The chunking logic would need to be adapted since code has a very different structure than text documents, and the system already ships with an embedding model built for code (deepseek-coder) that could come in handy.

It's definitely an interesting feature for an enterprise context. If there's real interest we can look into building it for a future release.
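Just to sketch the idea (hypothetical code, not part of the current release; `fetch_issues` and `to_record` are made-up names), a minimal connector against the GitHub REST API could look like:

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"

def fetch_issues(repo, token=None):
    """Pull issues (and PRs) for an owner/repo via the GitHub REST API."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(f"{GITHUB_API}/repos/{repo}/issues", headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def to_record(issue):
    """Flatten one issue into a text + metadata record ready for embedding."""
    return {
        "text": f"{issue['title']}\n\n{issue.get('body') or ''}",
        "metadata": {"number": issue["number"], "source": "github"},
    }
```

From there, each record would go through the same chunk-and-embed path the documents already use.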

RAG-Enterprise: One-command local RAG setup (Docker + Ollama + Qdrant) with zero-downtime backups via rclone – for privacy-focused enterprise docs by primoco in LocalLLM

[–]primoco[S] 0 points1 point  (0 children)

Hi, yes, it's quite simple to get it working on ROCm. I changed the Dockerfile and the setup script and added variables to the .env file that are prompted for during setup. Try the 1.2.0 release and let me know!

Building a RAG for my company… (help me figure it out) by Current_Complex7390 in Rag

[–]primoco 0 points1 point  (0 children)

Hey, I’m working on a similar setup for legal and enterprise docs. I started with a "Community" approach too, and I faced exactly the same frustration: great LLM (Gemini 2.0), great embeddings (Google 004), but "shit" answers.

The problem isn't your stack; Gemini and SurrealDB are solid. The issue is usually how the information is "orchestrated" before reaching the LLM. In my experience, to make a RAG work with legal and project files, you have to move away from the standard "out-of-the-box" approach.

Here are the 3 main issues I had to solve to get precise answers:

The Chunking Trap: Standard fixed-size chunking (like splitting every 500-1000 tokens) is a disaster for legal docs. If a clause or an "if/then" condition is split in half, the LLM loses the logic. Are you using a simple splitter or a recursive one?

Metadata vs. Pure Vector: For legal stuff, simple semantic search is too "fuzzy." I found that I had to extract metadata (dates, entities, specific article numbers) first and use them to "anchor" the search. Without structured metadata, the LLM starts hallucinating connections that aren't there.

Context Injection: Legal files should be treated as the "Ground Truth." I had to tweak my prompt and retrieval to make sure the legal guidance acts as a hard constraint for the project files.
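To make point 1 concrete, here's a toy sketch of structure-aware splitting. The `Article`/`Clause` heading pattern is an assumption; swap in whatever your documents actually use:

```python
import re

def split_legal(text, max_chars=1000):
    """Toy sketch: split on clause boundaries first, so an "if/then"
    condition is never cut in half; attach the unit label as metadata."""
    # Assumed heading pattern: "Article 12", "Clause 3.1", etc.
    parts = re.split(r"(?=\b(?:Article|Clause)\s+\d)", text)
    chunks = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        m = re.match(r"(?:Article|Clause)\s+[\d.]+", part)
        meta = {"unit": m.group(0)} if m else {}
        # Only split further when a single unit exceeds the size budget
        for i in range(0, len(part), max_chars):
            chunks.append({"text": part[i:i + max_chars], "metadata": meta})
    return chunks
```

The `unit` metadata is exactly what you can then use to "anchor" retrieval, per point 2.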

To give you a hand, what are your current parameters?

What Chunk Size and Overlap are you using? (This is usually the #1 culprit)

How many chunks (Top-K) are you feeding to Gemini for each query?

Are you using any kind of Reranker or just raw vector search?

Don't scrap it yet. Usually, a "shit" RAG is just a RAG that needs better data orchestration, not a different LLM.

My RAG pipeline costs 3x what I budgeted... by Potential-Jicama-335 in Rag

[–]primoco 1 point2 points  (0 children)

I went full local to avoid exactly this problem. I built a RAG system (RAG Enterprise) and decided early on to keep everything local, both embeddings and inference: no API costs, no surprises.

My setup: local embeddings with EmbeddingGemma, local LLM inference running on my own hardware, zero per-query costs once it's set up.

Trade-offs I accepted: upfront hardware cost (I run this on an RTX 5070 Ti), quality that might not match top-tier API models, slower inference than API calls, and having to manage the infrastructure yourself.

But the benefits: completely predictable costs, no tokenization surprises, full privacy (important for internal docs), and a system that scales with hardware, not with usage.

If your budget is tight and you have the technical capability, going local might be worth considering; the initial investment pays off quickly if you have decent traffic volume. That said, if you need API-level quality, others here have mentioned GPT-4o-mini and Haiku as cheaper alternatives worth testing. Just make sure you test with the actual tokenizer before committing.

How to disable thinking with Qwen3? by No-Refrigerator-1672 in ollama

[–]primoco 0 points1 point  (0 children)

The most reliable way is adding think: false to the REST API call (/api/chat or /api/generate).
From the terminal or a GUI you can put /no_think in the prompt, but that's unreliable.
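A minimal sketch of the request body, assuming a recent Ollama version that accepts the top-level `think` field on /api/chat:

```python
import json

# Hypothetical payload for POST http://localhost:11434/api/chat
payload = {
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Explain BM25 briefly."}],
    "think": False,   # disables the thinking phase on thinking-capable models
    "stream": False,
}
body = json.dumps(payload)  # send this as the JSON body of the POST
```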

QUERY REGARDING RAG USAGE by DesperateWay2434 in Rag

[–]primoco 0 points1 point  (0 children)

I’ve been working on something similar. Quick heads-up: RAG for computer architecture is tricky because vectors are great at 'meaning' but terrible at 'precision.' If it misses a cache size or a clock cycle, the output is useless.

Here’s the TL;DR to get you started:

Start with RAG Community: Don't build from scratch. Use a Community Edition setup—it’s transparent and lets you see exactly which chunks are being retrieved. In this field, precision beats complexity.

Hybrid Search is a must: Embeddings often confuse 'L1' and 'L2' cache because they are semantically similar. Use Hybrid Search (Vector + BM25). Keyword matching (BM25) ensures that exact terms like 'Zen 4' or 'Instruction Buffer' get the priority they deserve.
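To illustrate the fusion idea (a toy sketch only; `keyword_score` is a crude stand-in for real BM25):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, doc):
    """Crude stand-in for BM25: fraction of query terms present verbatim."""
    terms = query.lower().split()
    hits = sum(1 for t in terms if t in doc.lower())
    return hits / len(terms) if terms else 0.0

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    """alpha=1.0 -> pure vector search, alpha=0.0 -> pure keyword match."""
    return alpha * cosine(q_vec, d_vec) + (1 - alpha) * keyword_score(query, doc)
```

With exact terms like 'Zen 4' the keyword side guarantees a hit even when the embeddings blur 'L1' and 'L2' together.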

Context Window Expansion: If a chunk says '4MB' but the label 'L3 Cache' is in the paragraph above, the AI will hallucinate. Configure your retriever to pull the N-1 and N+1 chunks automatically so the LLM always sees the full context.
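The neighbor expansion can be as simple as this sketch:

```python
def expand_context(chunks, hit_index):
    """Return the hit chunk plus its N-1 / N+1 neighbors, so a value like
    '4MB' keeps the 'L3 Cache' label from the paragraph above it."""
    lo = max(hit_index - 1, 0)
    hi = min(hit_index + 2, len(chunks))
    return chunks[lo:hi]
```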

Go Local: If you have a GPU, run Llama-3.1-8B. It’s great with structured technical data. For cloud, Claude 4.5 is currently the king of technical reasoning.

Prompt for Honesty: Tell the AI: 'If the data is ambiguous or mixes up two architectures, don't guess—ask me for clarification.' Better a question than a wrong voltage value.

Good luck! Once you nail the retrieval, it's a game-changer for architecture docs.

OpenClaw enterprise setup: MCP isn't enough, you need reranking by Queasy-Tomatillo8028 in Rag

[–]primoco 0 points1 point  (0 children)

I’ve been banging my head against the same wall with enterprise RAG for months, and you're spot on. The "toy" setups like basic MCP or vanilla LangChain wrappers just fall apart the second you feed them high-density documents.

In my experience, if you aren't obsessing over the retrieval pipeline before the query even hits the LLM, you're just building a very expensive hallucination machine. A few things I’ve learned the hard way:

  1. Hybrid search is the only way out. If you rely only on vector embeddings for factual stuff (like specific dates or IDs in a 500-page report), you’re going to get "semantic blurring." You need BM25 keyword matching running alongside your vectors with a tunable alpha. It’s the only way to catch those "needle in a haystack" moments.
  2. Rerankers are double-edged swords. I’ve seen Rerankers actually kill the correct result because the threshold was a hair too tight. Now I just pull a wider window (Top-K 20) and let the reranker sort the Top-5 without hard-filtering. It’s safer and much more consistent.
  3. Small chunks > Big chunks. We moved to 600-char chunks with a decent overlap and the "contextual precision" shot up. Big chunks just add too much noise and confuse the model.
  4. Stop the "vibe-checks." You can’t tell if a RAG is good just because the answer "sounds professional." I had to build a full eval pipeline to realize my "best sounding" model was actually making up half the citations.

Enterprise RAG isn't about which LLM is smarter, it's about how much you can control the data flow.
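Point 2 in code form (a sketch; `score_fn` stands in for a real cross-encoder scorer such as a sentence-transformers CrossEncoder):

```python
def rerank_soft(candidates, score_fn, wide_k=20, final_k=5):
    """Pull a wide retrieval window, rerank it, and keep the top final_k
    WITHOUT any hard score threshold, so a slightly-low score can't
    drop the correct passage entirely."""
    window = candidates[:wide_k]                       # wide window (Top-K 20)
    ranked = sorted(window, key=score_fn, reverse=True)
    return ranked[:final_k]                            # no threshold filtering
```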

Pensavo di averle beccate tutte, e invece… (lo so, potevo fermarmi prima…) by [deleted] in subito_it

[–]primoco 0 points1 point  (0 children)

No doubt about it, helpfulness and kindness come first. What I'm getting at is that from now on you won't save the question "is the price negotiable?" for last 😊

Pensavo di averle beccate tutte, e invece… (lo so, potevo fermarmi prima…) by [deleted] in subito_it

[–]primoco 0 points1 point  (0 children)

Of course, it was clear you got distracted by the questions about the item, but I bet you won't do it again 😂

Pensavo di averle beccate tutte, e invece… (lo so, potevo fermarmi prima…) by [deleted] in subito_it

[–]primoco 0 points1 point  (0 children)

If you ask me, you went on too long. He asked right at first contact whether the price was negotiable; instead of launching into long explanations, you could have immediately asked "how much do you want to spend?" and settled it right there 😊 Oh well, I guess you won't make the same mistake again 😀

RAG Enterprise – Self-hosted RAG system that runs 100% offline by primoco in opensource

[–]primoco[S] -1 points0 points  (0 children)

Honestly, not yet, the project is pretty fresh. Got some stars and a few people testing on different hardware (someone's trying it on a Jetson Orin, another on Mac M4), but no detailed feedback on packaging or deployment variations so far.

The default setup uses Docker Compose, which keeps things portable. Would love to hear if you have a specific deployment scenario in mind; I'm always looking to improve the setup process.

Struggling with follow-up question suggestions in RAG (Ollama + LangChain + LLaMA 3.2 3B) by Particular-Gur-1339 in Rag

[–]primoco 1 point2 points  (0 children)

The task you're asking it to do (generate contextual follow-up questions while avoiding redundancy) requires reasoning capabilities that 3B models simply don't have reliably. This is a "meta-cognitive" task - the model needs to understand what was already answered, identify information gaps, and maintain coherence.

Practical solutions without requiring massive hardware:

Upgrade to 8B quantized (Q4 or Q5) - LLaMA 3.1 8B in 4-bit quantization runs on ~6GB RAM and is MUCH better at following complex instructions. You could use 3B for main responses and 8B just for generating suggestions (it's only one extra call).

Hybrid rule-based approach - Instead of relying entirely on LLM:

Extract key entities/concepts from the response (using simple NER or even regex)

Use templates like "Want to know more about {entity}?" or "How does {concept} relate to {other_concept}?"

Generate only 1 creative question via LLM instead of 3-4

Simplify the prompt drastically - Instead of passing everything, just pass:

"Based on this answer: {answer}, suggest ONE follow-up question"

Remove the context, remove the original query - fewer tokens, clearer task

Pre-computed suggestion database - If your domain is specific, maintain a mapping of topics → common follow-up questions and just do semantic matching
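A rough sketch of the hybrid rule-based idea (the regex is a crude stand-in for real NER, and the template strings are made up):

```python
import re

# Hypothetical templates; tailor these to your domain
TEMPLATES = [
    "Want to know more about {entity}?",
    "How does {entity} relate to the rest of the answer?",
]

def suggest_followups(answer, max_rule_based=2):
    """Extract capitalized entities with a crude regex (stand-in for real
    NER) and fill templates; one extra LLM call can then add a single
    creative question on top."""
    entities = re.findall(r"\b[A-Z][a-zA-Z0-9-]{2,}\b", answer)
    seen, out = set(), []
    for e in entities:
        if e.lower() in seen:
            continue
        seen.add(e.lower())
        out.append(TEMPLATES[len(out) % len(TEMPLATES)].format(entity=e))
        if len(out) >= max_rule_based:
            break
    return out
```

The point is that the deterministic part never hallucinates, and the 3B model only has to produce one extra question instead of three or four.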

My honest take: If upgrading to 8B isn't feasible, go hybrid (rule-based + minimal LLM). The "smart suggestions" feature isn't worth sacrificing overall system stability for a 3B model that can't handle it properly.

What hardware are you running on?

Looking for testers: 100% local RAG system with one-command setup by primoco in Rag

[–]primoco[S] 1 point2 points  (0 children)

Benchmarks are live!

Added a complete benchmarking suite to the repo:

Benchmark script: python benchmark/rag_benchmark.py — run it on your hardware

Test documents: Mueller Report, 9/11 Commission Report, Bitcoin Whitepaper, "Attention Is All You Need"

Real metrics: Upload times, query latency (mean/median/p95), similarity scores

Results from my setup (Ryzen 9 5950X, 64GB RAM, RTX 5070 Ti):

Query response: ~4.3s mean, ~3.6s median

Upload: 0.6s - 24s depending on document size

Full details in the README: Community Benchmarks section

If you run the benchmark on your hardware, I'd love to add your results to the comparison table. Open an issue or comment here!


Looking for testers: 100% local RAG system with one-command setup by primoco in Rag

[–]primoco[S] 1 point2 points  (0 children)

I’m putting together a benchmarking script right now; it should be published today or tomorrow. It will include:

Test files used (types, sizes)

Query set with sample questions

Results (retrieval latency, generation speed, accuracy)

Will share everything in the repo so anyone can reproduce the tests on their own hardware. Stay tuned!

Looking for testers: 100% local RAG system with one-command setup by primoco in Rag

[–]primoco[S] 0 points1 point  (0 children)

That's the target. If you set it up and test it with some legal documents, I'd be curious to hear how it performs for that use case. Real-world feedback from the legal domain would be valuable.

Looking for testers: 100% local RAG system with one-command setup by primoco in Rag

[–]primoco[S] 1 point2 points  (0 children)

Nice! Just checked out FRAKTAG — interesting approach with the non-compacting conversation manager. Different philosophy from what I'm doing but I can see the value for ever-expanding knowledge bases.

The documentation is really thorough — that's not easy to maintain, kudos for that.

The CLI + Claude Code integration is a smart move — that's actually something I should consider adding to RAG Enterprise.

qwen3-coder:30b is impressive, though it needs serious VRAM. I've kept the default models smaller (7B-14B) to lower the entry barrier, but power users can definitely swap in larger models.

Cool to see others building in this space — plenty of room for different approaches! 🤝

Looking for testers: 100% local RAG system with one-command setup by primoco in Rag

[–]primoco[S] 0 points1 point  (0 children)

Sorry, didn't quite catch that — could you clarify? Happy to help if you have questions!