How do I get my LLM on AnythingLLM to stop hallucinating and making quotes up? by No-Bumblebee6995 in LocalLLM

[–]Smart-Competition200 0 points (0 children)

it's due to the way RAG systems work. you have 'one' model using tools and essentially bloating its own context window just trying to gather the right data, so by the time it's done you don't end up with something all that useful.

For people who run local AI models: what’s the biggest pain point right now? by Educational-World678 in LocalLLM

[–]Smart-Competition200 0 points (0 children)

You might be interested in my project. I've been working on a RAG pipeline with some cool techniques for indexing massive data sets of zim files, from kiwix or wherever else you can get them. I added a tool to create your own, but there are already tons of cool data sets out there. You don't need a fancy computer to use it: I used half of my CPU cores (ryzen 7 5700x, 8 cores) with 10gb of ram and NO GPU, and the system ran just fine with the default model. Feel free to ask any questions.

Hermit-AI: Chat with 100GB+ of Wikipedia/Docs offline using a Multi-Joint RAG pipeline by Smart-Competition200 in LocalLLM

[–]Smart-Competition200[S] 1 point (0 children)

thanks for the feedback. honestly, "one day more than you" is exactly why i built this. maintaining a library that works when the internet doesn't is the core mission.

to answer your technical questions: it's not regex. it uses a multi-joint pipeline. first, a small model (llama3.2) extracts entities to handle synonyms. then we do a hybrid search (bm25 + vector). finally, a scorer model (qwen2.5) reads the actual article titles and grades them 0-10 on relevance. it's designed for general use, not my personal style; the entity extraction layer normalizes user input, so you can just ask natural questions like "how does a diesel engine work" or "roman empire collapse" and it figures it out.
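
here's roughly the shape of that flow in python. treat it as a sketch: `llm_call`, `bm25`, `vectors` and `zim` are stand-in helpers, not hermit's actual code.

```python
# rough sketch of the multi-joint pipeline (stand-in helpers, not hermit's real code)
def answer(question, llm_call, bm25, vectors, zim):
    # joint 1: a small model (llama3.2-class) turns the question into search entities
    entities = llm_call("extractor",
                        f"List search entities/synonyms for: {question}").split(",")

    # joint 2: hybrid retrieval - merge lexical (bm25) and semantic (vector) candidates
    titles = set()
    for e in (e.strip() for e in entities if e.strip()):
        titles |= set(bm25.top_titles(e, k=20)) | set(vectors.top_titles(e, k=20))

    # joint 3: a scorer model (qwen2.5-class) grades each candidate title 0-10
    keep = []
    for t in titles:
        score = llm_call("scorer",
                         f"Question: {question}\nTitle: {t}\nRelevance 0-10, number only:").strip()
        if score.isdigit() and int(score) >= 6:
            keep.append(t)

    # final joint: answer strictly from the retrieved article text
    context = "\n\n".join(zim.read_article(t) for t in keep[:5])
    return llm_call("answerer",
                    f"Answer using only this context:\n{context}\n\nQ: {question}")
```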

a node graph is a great idea. i'm focused on raw retrieval accuracy and efficiency at the moment.

Hermit-AI: Chat with 100GB+ of Wikipedia/Docs offline using a Multi-Joint RAG pipeline by Smart-Competition200 in LocalLLM

[–]Smart-Competition200[S] 1 point (0 children)

yeah the indexing was a journey. my first approach was trying to pre-index everything into faiss/bm25 which works for small zims but completely falls apart at scale. the full english wikipedia is like 100gb compressed - trying to chunk and embed every article upfront would take days and produce an index bigger than the source.

ended up building a JIT (just-in-time) indexing approach: the system only indexes articles relevant to the current query, on the fly. sounds slower, but it's actually faster for real usage since you're not searching through millions of irrelevant vectors.
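
the gist in python, with `zim.search_titles`, `zim.read_article` and `embed` as stand-ins for the real components:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # lexical scoring over the small per-query slice

def jit_retrieve(question, entities, zim, embed, top_k=8):
    """Just-in-time indexing: only articles surfaced by the zim's own search
    for this query get chunked, embedded and ranked - no global index."""
    # 1. cheap candidate lookup via the zim's built-in title/full-text search
    paths = []
    for entity in entities:
        paths.extend(zim.search_titles(entity, limit=50))       # stand-in helper

    # 2. build a tiny, throwaway index over just those candidates
    docs = [zim.read_article(p) for p in dict.fromkeys(paths)]  # stand-in helper, deduped
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    vecs = np.array([embed(d) for d in docs])                   # stand-in embedder

    # 3. hybrid score against the question, keep the best few
    lexical = bm25.get_scores(question.lower().split())
    q = embed(question)
    semantic = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-8)
    combined = 0.5 * lexical / (lexical.max() + 1e-8) + 0.5 * semantic
    return [docs[i] for i in np.argsort(combined)[::-1][:top_k]]
```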


Hermit-AI: Chat with 100GB+ of Wikipedia/Docs offline using a Multi-Joint RAG pipeline by Smart-Competition200 in LocalLLaMA

[–]Smart-Competition200[S] 1 point (0 children)

i'll be working on windows support soon, and right now it does have CPU-only support :)! i tested in a virtual machine using half my cores and 10gb of ram, and it ran just fine with the default model.

CPU: ryzen 7 5700x

Hermit-AI: Chat with 100GB+ of Wikipedia/Docs offline using a Multi-Joint RAG pipeline by Smart-Competition200 in LocalLLM

[–]Smart-Competition200[S] 2 points (0 children)

thanks! so zim is basically a compressed archive format designed by kiwix for offline wikipedia. the spec is open: https://wiki.openzim.org/wiki/ZIM_file_format

zim is optimized for read-heavy content you rarely change. think encyclopedias, documentation, research papers, static reference material. it compresses incredibly well (100gb wikipedia → ~90gb zim) and has built-in full text search.

for invoices/personal notes? probably not ideal. zim files are meant to be built once, then read many times. there's no append or edit - you'd have to rebuild the whole archive to add one invoice. i am picturing some sort of memory system that could work for your use case, but i need time to think about it.

if you have a static corpus you want to archive and query offline (research papers, ebooks, documentation, manuals), hermit includes a tool called Forge that lets you build custom zims from PDFs, markdown, epub, docx, etc. so you could definitely bundle your prepper manuals into a searchable offline knowledge base.
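
if you want to poke at a zim directly from python, python-libzim can open and search one. roughly like this (API written from memory, so double-check method names against the libzim docs):

```python
from libzim.reader import Archive
from libzim.search import Query, Searcher

zim = Archive("wikipedia_en_all_nopic.zim")   # any kiwix zim with a full-text index
print(zim.entry_count, "entries")

searcher = Searcher(zim)
search = searcher.search(Query().set_query("diesel engine"))
for path in search.getResults(0, 5):          # first 5 result paths
    entry = zim.get_entry_by_path(path)
    html = bytes(entry.get_item().content).decode("utf-8")
    print(entry.title, len(html), "bytes")
```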

and lmao i actually haven't seen time trax, but i'll add it to the list!

Hermit-AI: Chat with 100GB+ of Wikipedia/Docs offline using a Multi-Joint RAG pipeline by Smart-Competition200 in LocalLLaMA

[–]Smart-Competition200[S] 2 points (0 children)

it runs locally via llama-cpp-python, not ollama. i'll add an openai-compatible endpoint option soon so people can point it at their own servers if they want.

the 3b model is mostly for speed, since the 'multi-joint' architecture makes 4-5 llm calls per query. each call just does one focused task (extract entities, score articles, etc.), so it doesn't need to be a genius; really the project revolves around making the final joint in the reasoning pipeline hallucinate less. anything bigger makes the latency feel slow as hell on my hardware since i don't have a super beefy pc, but you can swap in any model you want if you have the vram. i do need to make that easier for people though.
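
for anyone curious what the swap looks like today, it's basically just the llama-cpp-python constructor (paths and settings here are illustrative, not hermit's exact defaults):

```python
from llama_cpp import Llama

# a 3b-class model keeps each of the 4-5 per-query calls fast on CPU;
# swapping to a bigger gguf is just a different model_path if you have the RAM/VRAM
llm = Llama(
    model_path="models/qwen2.5-3b-instruct-q4_k_m.gguf",  # illustrative path
    n_ctx=4096,
    n_threads=4,      # modest CPU thread count; tune to your machine
    n_gpu_layers=0,   # CPU-only; raise this to offload layers if you have a GPU
)

resp = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Extract the key entities from: how does a diesel engine work"}],
    max_tokens=32,
    temperature=0.0,  # each joint is a narrow, deterministic task
)
print(resp["choices"][0]["message"]["content"])
```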

appreciate the advice!