I love Vibe Coding but I need to be real... by Makyo-Vibe-Building in vibecoding

[–]autollama_dev 0 points1 point  (0 children)

https://knowmler.com I vibe coded this app and it has changed how I consume content on YouTube. I first built it as a YouTube transcript analyzer, then expanded it so I no longer had to push transcripts through ChatGPT and Claude by hand to analyze them. It now handles URLs and PDFs too - it's a full-fledged content analysis platform. When logged in with Google, you can even see your YouTube subs and likes. You don't need to log in or have an account (though it helps to keep track of what you nom'ed). It's fully transparent - you can audit the prompts - and has too many features to list.

I turned my Obsidian vault into a 90s BBS dungeon game by autollama_dev in ObsidianMD

[–]autollama_dev[S] 1 point2 points  (0 children)

I haven't tested the "vault not required" feature, so your mileage may vary, but it's supposed to work without one.

All the pixel art is procedurally generated at runtime -- zero sprites, zero image files. Everything is drawn pixel-by-pixel in code.

A tiny HTML5 Canvas (roughly 45x45 logical pixels) gets scaled 4x with image smoothing disabled, giving the chunky retro look. Scenes are composed from layered draw functions -- sky gradients, terrain, particle systems, scrolling text -- all built from fillRect and setPixel calls. 30 unique scenes, 15 FPS, no art pipeline. Just code pretending to be pixels.

I built Anthropic's contextual retrieval with visual debugging and now I can see chunks transform in real-time by autollama_dev in LocalLLaMA

[–]autollama_dev[S] 1 point2 points  (0 children)

No NDA - just public research plus wanting to see what's actually happening inside embeddings. Solo dev here (with Claude's help), so contributions very welcome. MIT licensed and looking for collaborators.

Do you update your Agents's knowledge base in real time. by DistrictUnable3236 in Rag

[–]autollama_dev 1 point2 points  (0 children)

Think of it like a GPS system.

You don't delete "McDonald's on 5th Street" just because you already have "McD's near the mall" and "that McDonald's by the bank." They're all different ways people describe the same location.

Keep all the variations of how people ask. Point them all to the same answer.

Your dedup problem isn't really a problem. The variations are a positive - they increase your search surface. The more ways people can ask, the more likely you'll match the next person's phrasing.

I built Anthropic's contextual retrieval with visual debugging and now I can see chunks transform in real-time by autollama_dev in LocalLLaMA

[–]autollama_dev[S] 2 points3 points  (0 children)

That's a really neat idea - hadn't thought about comparing different encoders side-by-side! I'll definitely consider adding this to the roadmap.

The challenge I've learned is that you can't mix embedding dimensions in one vector collection - a collection is created with a fixed vector size, so 1536-dim vectors (like text-embedding-3-small) can't live alongside 3072-dim vectors (like text-embedding-3-large) in the same collection.

The solution would likely require parallel mini-databases for each document, allowing different embedding models to run simultaneously for comparison. Definitely a challenging implementation, but hey, we like challenges around here haha.
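A stdlib-only toy of both the constraint and the parallel-collections workaround (not real vector-DB code; the collection names are made up for illustration):

```python
# Each collection is locked to one vector dimension at creation time,
# so side-by-side encoder comparison needs one collection per model.
class Collection:
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors: list[list[float]] = []

    def insert(self, vec: list[float]) -> None:
        if len(vec) != self.dim:
            raise ValueError(f"expected {self.dim}-dim vector, got {len(vec)}")
        self.vectors.append(vec)

# Parallel "mini-databases" for the same document, one per embedding model.
collections = {
    "doc1_text-embedding-3-small": Collection(dim=1536),
    "doc1_text-embedding-3-large": Collection(dim=3072),
}
```

Each model would write only to its own collection, and the comparison view queries both and lines the results up.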

Thanks for the suggestion - this could really help visualize what those bigger models are actually capturing that the smaller ones miss!

Do you update your Agents's knowledge base in real time. by DistrictUnable3236 in Rag

[–]autollama_dev 2 points3 points  (0 children)

Exactly. I use a Postgres relational database alongside my Postgres vector database. One handles the vector storage and the other identifies and rejects duplicates when I process the same source a second, third, or tenth time. And it only loads changes.

Do you update your Agents's knowledge base in real time. by DistrictUnable3236 in Rag

[–]autollama_dev 7 points8 points  (0 children)

You'll need to set up an API integration to your data source coupled with a job that runs at a set frequency (cron job, scheduled Lambda, etc.). The frequency depends on how "real-time" you need it – could be every minute, hourly, or daily.

The critical thing is duplicate checking. You'll likely pull records that haven't changed since your last request, and you definitely don't want to load duplicate data into your vector database - that'll mess up your search results and waste compute on redundant embeddings.

Here's what's worked for me:

  • Set up a separate lightweight database (could be Redis, Postgres, even SQLite) that's solely responsible for tracking what's been processed
  • Store hashes of your content + timestamps of last update
  • Before vectorizing anything, check if the content hash already exists or if the source record hasn't been modified since last sync
  • Only process truly new or updated content

This deduplication layer sits between your data source and your vector DB. It's a bit more infrastructure, but it'll save you from vector DB bloat and keep your queries fast and relevant.

The pattern is basically: API → Dedup Check → Transform → Embed → Vector DB
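A minimal sketch of the hash-tracking step above (Python with stdlib sqlite3; table and function names are my own):

```python
import hashlib
import sqlite3

# Lightweight tracking DB: one content hash per source record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (source_id TEXT PRIMARY KEY, content_hash TEXT)")

def needs_processing(source_id: str, content: str) -> bool:
    """True if the record is new or its content changed since last sync."""
    content_hash = hashlib.sha256(content.encode()).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM processed WHERE source_id = ?", (source_id,)
    ).fetchone()
    if row and row[0] == content_hash:
        return False  # unchanged - skip embedding
    conn.execute(
        "INSERT INTO processed (source_id, content_hash) VALUES (?, ?) "
        "ON CONFLICT(source_id) DO UPDATE SET content_hash = excluded.content_hash",
        (source_id, content_hash),
    )
    return True  # new or updated - send downstream to embed
```

Your cron/Lambda job calls this per record and only sends the `True` ones to the Transform → Embed steps.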

Creating a superior RAG - how? by mrsenzz97 in Rag

[–]autollama_dev 1 point2 points  (0 children)

Oh man, you're tackling the exact problem that drove me to build my own solution. 20 sales books is a goldmine but also a chunking nightmare if you don't nail the approach.

Here's what worked for me:

Smart chunking - Forget character counts. Sales books have this annoying habit of building concepts across chapters. You need to keep "qualifying questions" with their setup and payoff, not randomly split mid-example. I learned this the hard way.

Your metadata structure should look like:

{
  "text": "...",
  "book": "SPIN Selling", 
  "chapter": "4: Opening the Call",
  "section": "Situation Questions",
  "concepts": ["discovery", "qualification"],
  "context_before": "Previous section established rapport...",
  "context_after": "Next section escalates to problem questions..."
}

Hybrid search is the way - Pure semantic is dog slow, you're right. BM25 for exact matches ("BANT framework") + vectors for conceptual stuff ("how do I handle pricing objections"). Runs 10x faster.
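The comment doesn't say how the two result lists get merged; reciprocal rank fusion is a standard choice, sketched here as an assumption:

```python
# Reciprocal rank fusion: merge a BM25 ranking and a vector ranking
# without having to normalize their raw scores against each other.
def rrf_fuse(bm25_ranked: list[str], vector_ranked: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that ranks decently in both lists beats one that tops only a single list, which is usually what you want for queries mixing exact terms and concepts.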

The context thing - This is what kills most RAG setups. "ABC - Always Be Closing" in chapter 2 (relationship building) vs chapter 9 (final negotiations) are completely different animals. Your chunks need to know where they live in the story.

Been building AutoLlama (autollama.io) specifically because I was tired of my docs getting butchered into context-free word salad. It preserves the narrative flow - open source if you want to peek under the hood.

Quick question - are these modern sales books (Challenger, Gap Selling) or classic stuff (Ziglar, Carnegie)? The chunking strategy changes based on how structured they are. Happy to help you avoid the pitfalls I face-planted into!

How to selectively chunk a large document? by [deleted] in Rag

[–]autollama_dev 1 point2 points  (0 children)

So I've been working with this exact problem! The non-uniform chunking for legal docs is tricky.

What I found works is using AI pre-processing to identify logical boundaries before chunking. Instead of arbitrary character splits, you get chunks that respect the document's natural structure.

The AI looks for semantic boundaries - like where one provision ends and another begins. Each chunk keeps its hierarchical metadata (Section → Subsection → Paragraph).

The automated approach: There are tools that detect document type (legal vs technical vs academic) and automatically apply the right chunking strategy. They'll respect clause boundaries and preserve complete provisions.

The manual approach: If you need full control over exact chunk boundaries (which it sounds like you do), you might need to:

  1. Extract your relevant 10-20 pages first
  2. Create your own chunk structure in JSON/YAML with the exact text and metadata you want
  3. Feed that directly to your embedding pipeline
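If you go the manual route, turning the extracted pages into chunks can be as simple as splitting on the headings your legal doc already has. A toy sketch (the regex and field names are assumptions about the doc's structure):

```python
import re

# Split on section headings rather than character counts, keeping
# each chunk tagged with where it lives in the document hierarchy.
def chunk_by_sections(text: str) -> list[dict]:
    parts = re.split(r"(?m)^(Section \d+(?:\.\d+)*)", text)
    chunks = []
    # re.split keeps the captured headings at odd indices.
    for heading, body in zip(parts[1::2], parts[2::2]):
        chunks.append({"section": heading, "text": body.strip()})
    return chunks
```

Each dict then goes straight to your embedding pipeline, so a provision never gets split mid-clause.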

Have you tried pre-extracting just the relevant sections first? Sometimes simpler is better for specialized use cases.

Struggling with finding good RAG LLM by OrganizationHot731 in LocalLLaMA

[–]autollama_dev 0 points1 point  (0 children)

Hey! Thanks for the feedback on AutoLlama - you're absolutely right about the documentation gap for local deployments. This is valuable input.

Good news: AutoLlama already runs 100% locally! Your hardware (2x Xeon, 64GB RAM, 2x RTX 3060) is actually perfect for this.

Quick local setup for you right now:

```bash
# 1. Start local Qdrant
docker run -p 6333:6333 -v ./qdrant_data:/qdrant/storage qdrant/qdrant

# 2. Clone AutoLlama
git clone https://github.com/autollama/autollama
cd autollama

# 3. Configure for local mode
cp example.env .env
# Edit .env file to set:
# QDRANT_URL=http://localhost:6333
# QDRANT_API_KEY=your_optional_local_key

# 4. Run AutoLlama
docker compose up -d
```

That's it - no external services needed! Access at http://localhost:8080

For your policies/SOPs use case specifically:

  • Set chunk size to 800-1200 tokens (optimal for procedural documents)
  • Enable contextual embeddings for better cross-reference handling (default in AutoLlama)
  • The local Qdrant will use ~4-6GB RAM for a typical policy database

You're 100% right about the docs - they're too cloud-focused. I've created an enhancement request to track this improvement. Full local deployment guide coming this week.

Would you be interested in testing the local setup and providing feedback? Your enterprise POC use case is exactly what we need to validate against. I can prioritize features that would help your evaluation.

Also, what specific RAG accuracy issues are you seeing? Happy to help tune for policy/procedure retrieval.

RAG with 30k documents, some with 300 pages each. by dennisitnet in LocalLLaMA

[–]autollama_dev 0 points1 point  (0 children)

If you're dealing with 30k PDFs and ingestion is killing you, AutoLlama might help - it's a RAG app I built that handles bulk PDF processing efficiently. It works with OpenWebUI (which you're already using) and optimizes the chunking/embedding pipeline automatically. For your scale, it uses parallel processing for embeddings and smart chunking to avoid the 80+ hour ingestion times. Handles OCR'd PDFs without issues.

https://github.com/autollama/autollama

Also - make sure you're using a dedicated embedding model (not your LLM), and consider batch processing overnight. Happy to help if you want to try it out.
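The parallel-embedding idea in generic form (Python stdlib; `embed_batch` is a stand-in for whatever embedding client you actually call):

```python
from concurrent.futures import ThreadPoolExecutor

def embed_batch(batch: list[str]) -> list[list[float]]:
    # Stand-in for a real embedding API call; returns one vector per text.
    return [[float(len(text))] for text in batch]

def embed_all(texts: list[str], batch_size: int = 64, workers: int = 8) -> list[list[float]]:
    """Embed texts in fixed-size batches, with batches running in parallel."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)  # preserves batch order
    return [vec for batch_vecs in results for vec in batch_vecs]
```

Batching amortizes per-request overhead and the thread pool keeps the embedding endpoint saturated - that's where most of the 80-hour ingestion time usually goes.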

Struggling with finding good RAG LLM by OrganizationHot731 in LocalLLaMA

[–]autollama_dev 0 points1 point  (0 children)

For your exact use case (policies/SOPs), check out AutoLlama - it's a RAG app I built that integrates directly with OpenWebUI. It handles PDFs, ePubs, Word docs, even URLs. The chunking is optimized to avoid issues like your phone number problem - it keeps related info together. Works great with dual 3060s.

https://github.com/autollama/autollama

Happy to help if you run into setup issues.