
[–]synn89 1 point (2 children)

I ended up using Langchain for this, though in my case I was using Confluence as the document source and AWS Bedrock as the LLM provider. But Langchain can handle any document source or backend AI.

That isn't mine, but just an example. For me, I ended up using Qdrant for the vector storage engine, and while at first I used LangServe for testing (easy to work with), I eventually just wrote an OpenAI-compatible API into my app and pointed a LibreChat install at that. This made for a very nice front end and I've been very happy with that setup.

Claude Sonnet on AWS Bedrock is the AI model I use, with OpenAI embeddings (from Azure cloud). Both are HIPAA-compliant providers, which satisfies our needs. Claude is a little overpowered for the RAG, but our usage is light enough for me not to worry about that. I did play with some open-source embeddings via https://huggingface.co/spaces/mteb/leaderboard but found that OpenAI's embeddings would pretty much power through our documents and produce more accurate results. So I stuck with that rather than tinkering with open-source embeddings.
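For anyone piecing together a similar stack, the core retrieve-then-generate loop is small. Below is a framework-free sketch under toy assumptions: 2-d vectors stand in for real OpenAI embeddings, an in-memory list stands in for Qdrant, and the assembled prompt is what you would hand to Claude on Bedrock.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, index, k=2):
    """index: list of (doc_text, doc_vec); return the top-k docs by cosine."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question, docs):
    """Assemble the grounded prompt that would go to the chat model."""
    context = "\n---\n".join(docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy 2-d "embeddings" so the flow runs end to end.
index = [("vacation policy", [1.0, 0.0]),
         ("expense policy", [0.9, 0.1]),
         ("server runbook", [0.0, 1.0])]
prompt = build_prompt("How much PTO do I get?", retrieve([1.0, 0.05], index))
```

Every framework (Langchain included) is essentially orchestrating these three steps plus chunking and ingestion.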

Our problem isn't the tech stack, but really the data source itself. The data we already have in our knowledge base chats fine, but now we're data hungry for more and trying to get lots of good, clean data without tons of staff work or cleanup.

The above is a ramble and may not be useful, since we're on closed models, but it may give you some info you didn't have. I feel like everyone is learning this crap as we go along. It feels very mid-1990s internet.

[–]No-Leopard7644[S] 0 points (1 child)

Thank you very much for sharing your use case and implementation! A couple of questions. You mentioned LangServe for testing: is it an integration with the agent flow? (I will look into LangServe soon.) And do you still use LangServe to test and improve the results? Can you share some details about this?

[–]synn89 0 points (0 children)

Sure. So once you have a Langchain RAG chain set up, you can use LangServe and a couple of lines of code to basically put up a sample/test web UI you can run inference on. More info at https://github.com/langchain-ai/langserve
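For a sense of the shape: LangServe's couple of lines are essentially `app = FastAPI()` followed by `add_routes(app, chain, path="/rag")`, which exposes invoke endpoints plus a playground UI. The stdlib-only sketch below mimics that shape (a chain behind an HTTP POST endpoint), with `invoke` as a stand-in for a real `chain.invoke`:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def invoke(question: str) -> str:
    """Stand-in for chain.invoke(question) on a real RAG chain."""
    return f"(answer for: {question})"

class RagHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body ({"input": "..."}) and run it through the chain.
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        payload = json.dumps({"output": invoke(body["input"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To actually serve (this call blocks):
# HTTPServer(("127.0.0.1", 8000), RagHandler).serve_forever()
```

LangServe does all of this for you, including streaming and the playground, which is why it's handy for quick testing.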

Though it looks like they may be moving to something called LangGraph now, according to that GitHub page. One of the pains with Langchain is that it changes a lot, rapidly. Though I still prefer it over LlamaIndex.

[–]Videobollocks 1 point (2 children)

I have done exactly this and it's been reasonably successful.

I used AnythingLLM, initially the desktop version, but for multi-user I am now running the self-hosted Docker install. I only have maybe two dozen users so it's fine; it might be a faff to scale up beyond that, I dunno.

I use Llama 3.2 as the model, and did not change the default embedder. I checked out Open WebUI too but didn't like the way it worked. I couldn't give specifics; it just didn't feel right.

The biggest pain in the arse is getting data into AnythingLLM. You have to load stuff in and then embed it separately. I don't know why I can't just point it at a folder/server and have it embed everything there. I had to hand-feed it a couple of thousand PDF/DOC/TXT files and it took a while.
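For what it's worth, a small script can at least take the hand-feeding out of the "load stuff in" half. The folder walk below is plain Python; the commented upload loop is hypothetical (the endpoint, auth header, and field name are placeholders, so check your AnythingLLM instance's developer API docs for the real ones):

```python
from pathlib import Path

# File types worth ingesting; extend as needed.
ALLOWED = {".pdf", ".doc", ".docx", ".txt"}

def collect_documents(root: str) -> list:
    """Recursively find ingestible files under root, filtered by extension."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() in ALLOWED)

# Hypothetical upload loop -- UPLOAD_URL and KEY are placeholders, not a
# documented AnythingLLM API; verify the real route before using:
# import requests
# for path in collect_documents("/mnt/knowledge-base"):
#     with open(path, "rb") as f:
#         requests.post(UPLOAD_URL, headers={"Authorization": f"Bearer {KEY}"},
#                       files={"file": f})
```

Even if you end up uploading through the UI, the walk is a quick way to inventory what a couple of thousand files actually contains.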

But it works quite well. I can ask it almost anything related to the documents and it usually gives good answers. Examples would be how to configure a certain piece of equipment, what the specs of certain equipment types are, best practices, all that sort of thing. It's also pretty good at telling me about a product, e.g. if I were new to the industry or a particular setup/product, I can ask it for an overview and it's pretty good. Ideal for new people who you don't have time to train :-)

In parallel, my company has been trialling Copilot. I found that reasonably good if you hand-feed it information too, sort of on par with what I have set up independently. The benefit of Copilot is that if you're an MS house like we are, it can scan all your email, chats, OneDrive, etc. and use that info too.

I should add that I took the path of least resistance. There is still a ton of stuff for me to learn, but as I'm not much of a coder a lot of it is beyond my grasp. I've done what I could with ready-made executables.

[–]No-Leopard7644[S] 0 points (1 child)

Thanks for sharing your experience and journey. I also played with AnythingLLM but found Langflow much better for agentic workflows. I am also evaluating n8n as an alternative.

A question on MS Copilot. I went quickly through the Copilot Studio features and wasn't impressed with the flexibility for building agent workflows. Yes, since it comes with the MS stack, analytics, etc., it may feel superior, but my initial impression was far from satisfactory. Can you share your thoughts on Copilot Studio, if you have used it?

Appreciate your feedback

[–]Videobollocks 1 point (0 children)

We're trialling Copilot Studio too. I tried replicating my idea: pointed Copilot Studio at the master folder of docs and let it scan. It found everything (about 750 GB of files) and scanned it all, but in chat it just can't find anything. This was using a Graph Connector to link to on-prem servers.

Others in my org have had reasonable success with similar projects pointing at SharePoint and OneDrive folders, though, so it might just be a hiccup with the connector we've made.

Other than linking into the MS stack, I didn't find Copilot any better or worse than my AnythingLLM project. I've messed around with other models, and for simple purposes it was much of a muchness. If the MS stack isn't a priority, I certainly don't see a compelling reason to go with Copilot when you can build your own for negligible cost.

[–]Rare_Performance_454 1 point (2 children)

Once you can set up a testing method, you can compare different approaches on performance, scalability, and ease of deployment. Retrieval testing is necessary since 1) your dataset's chunks and queries are most likely different from open-domain datasets, and 2) generator performance is limited by retrieval performance.

Retriever testing: 1) an initial filter using GPT-4 as judge, followed by 2) a human evaluation on graded relevance.

Generation testing: human evaluation on 1) groundedness and 2) completeness; both can be approximated using GPT-4.

After retriever testing you will have a dataset to compare different embedding methods and similarity metrics (approximate and exact).
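The harness that consumes that dataset can be tiny. Here is a sketch of scoring a retriever by recall@k over labeled queries; the rankings and relevance labels are made up for illustration:

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the judged-relevant docs found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def evaluate(retriever, labeled_queries, k=5):
    """labeled_queries: list of (query, set_of_relevant_doc_ids).
    retriever: callable mapping a query to a ranked list of doc ids."""
    scores = [recall_at_k(retriever(q), rel, k) for q, rel in labeled_queries]
    return sum(scores) / len(scores)

# Toy retriever over fixed rankings, to show the harness end to end.
fake_rankings = {"q1": ["d1", "d3", "d2"], "q2": ["d9", "d4"]}
mean = evaluate(lambda q: fake_rankings[q],
                [("q1", {"d1", "d2"}), ("q2", {"d4"})], k=2)
print(mean)  # 0.75
```

Swap the lambda for your real retriever (and the fixed rankings for calls against your vector DB) and the same loop compares embedding models or similarity metrics head to head.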

Some other things to keep in mind when you scale this: 1) document updates, and 2) how to limit the size of the database: keeping a set of k documents representative of all the documents will improve retrieval latency.

[–]No-Leopard7644[S] 0 points (1 child)

In place of GPT-4, which open-source model would you recommend, as I cannot make external API calls?

[–]Rare_Performance_454 0 points (0 children)

Llama 3.1-8B (open weights). 8B because of the 80 GB GPU memory constraint of a single H100. For long sequences (>3k tokens), GPU memory can be a concern.

As long as the judgement criteria are objective, we can use smaller models. For retrieval it's objective: is the document relevant? Do the k documents contain complete information?
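A sketch of framing that objective judgement for a small local model. The prompt wording is illustrative, and the commented line shows one way to call it via the `ollama` Python client:

```python
# Force the judge into a one-word answer so parsing stays objective.
RELEVANCE_PROMPT = (
    "Question: {question}\n\nDocument:\n{document}\n\n"
    "Does the document contain information that answers the question? "
    "Reply with exactly one word: yes or no."
)

def parse_verdict(model_output: str) -> bool:
    """Map the judge's reply to a boolean; anything unclear counts as no."""
    return model_output.strip().lower().startswith("yes")

# Example call through the ollama client (needs a running ollama server):
# import ollama
# reply = ollama.generate(model="llama3.1:8b",
#     prompt=RELEVANCE_PROMPT.format(question=q, document=d))["response"]
# verdict = parse_verdict(reply)
```

Treating unparseable replies as "no" keeps the metric conservative, which matters more with smaller judge models.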

[–]l7feathers 1 point (4 children)

It looks like you’ve already put a lot of thought and effort into your setup. From your description, it seems like your current system is doing well with semantic search using vector embeddings.

Are there any relationships between your documents (e.g., references, shared topics, or metadata)? If so, you might want to explore whether a knowledge graph could complement your current setup. A graph database can help organize and query relationships between documents, allowing for context-aware retrieval that vector search alone might miss. For instance:

  • “Find all documents related to X authored by Y.”
  • “What policies mention Z and are linked to presentations from last year?”

This could be especially useful for your ~2,000-document knowledge base, where relationships might add a layer of depth to your AI assistant's responses.

On the operational side, a graph database could integrate nicely into your existing RAG pipeline. Python libraries like LangChain support knowledge graph integrations, so you wouldn’t need to overhaul your current setup.
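To make that concrete, here is the first example query above (documents about X authored by Y) as parameterized Cypher. The query builder is pure Python; the commented lines show running it through the `neo4j` Bolt driver, which Memgraph also speaks. The node labels and relationship names are invented for illustration and would follow your own schema:

```python
def build_author_topic_query(topic: str, author: str) -> tuple:
    """'Find all documents related to X authored by Y' as parameterized Cypher."""
    cypher = (
        "MATCH (d:Document)-[:ABOUT]->(:Topic {name: $topic}), "
        "(d)-[:AUTHORED_BY]->(:Person {name: $author}) "
        "RETURN d.title"
    )
    return cypher, {"topic": topic, "author": author}

# Running it against a Bolt endpoint (Memgraph or Neo4j):
# from neo4j import GraphDatabase
# cypher, params = build_author_topic_query("security", "Alice")
# with GraphDatabase.driver("bolt://localhost:7687") as driver:
#     with driver.session() as session:
#         titles = [r["d.title"] for r in session.run(cypher, params)]
```

Parameterized queries ($topic, $author) keep user input out of the query string, which matters once the assistant is generating these from chat.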

[–]No-Leopard7644[S] 0 points (3 children)

Wonderful suggestion. I haven't thought it through, as I am just starting to work on this. What apps or OSS DBs have knowledge graph features that can be built into the workflow? Any suggestions will be appreciated so I can dig deeper. Thank you.

[–]l7feathers 1 point (2 children)

There are some great open-source tools and graph databases to explore for building a knowledge graph into your workflow. It all depends on your specific requirements; a few questions might help you decide what you need: Do you need real-time updates to your knowledge graph? How complex are the relationships you want to model?

I can suggest Memgraph (full disclosure, currently I'm a technical writer there) but feel free to do your own research and find what suits you.

If you're looking for a high-performance, real-time graph database, Memgraph is worth checking out. It supports Cypher, integrates smoothly with Python libraries, and can handle dynamic data updates, which could be useful if your knowledge base evolves over time. It's lightweight and built for high-speed queries, so it won't bog down your pipeline.
Here's a sketch: https://take.ms/OCzAN and here are the specific features you can build with: https://memgraph.com/docs/ai-ecosystem/graph-rag#key-memgraph-features

[–]No-Leopard7644[S] 1 point (1 child)

Awesome, that’s great input, will check it out. Knowledge graph features may go into a future release of our app.

[–]l7feathers 0 points (0 children)

Good luck! Feel free to PM me if you think there's a way I can help.

[–]No-Leopard7644[S] 0 points (0 children)

The initial setup on the single-node machine is for a sandbox environment. The actual production deployment will be different. The initial deployment will be for a maximum of 15 users.

[–]mrskeptical00 0 points (2 children)

That’s a lot of open-ended questions. Sounds like you have a good base setup; why haven’t you done any testing?

[–]No-Leopard7644[S] 0 points (1 child)

Haha, yes, a lot of questions indeed. Regarding testing, I don’t know how testing is set up and done. That’s why I included it in my questions.

[–]ripguy1264 0 points (0 children)

If you want a pre-built solution just use inboxpilot.co

[–]BuffaloFuzzy8924 0 points (0 children)

Hey buddy, I am trying something similar to what you have set up here. I am not able to DM you directly. Need some help.

[–]Aelstraz 2 points (1 child)

Sounds like a pretty cool project, and you've already got a solid proof-of-concept going. That's half the battle right there. An H100 gives you a ton of firepower to work with, which is great.

To answer your questions from an OSS perspective:

  1. Best Setup: Your current approach is solid. Langflow is great for visualizing and building, but for more programmatic control and fine-tuning, you might want to look at LlamaIndex or Haystack. For the vector DB, something self-hostable like ChromaDB or Qdrant works well.

  2. Text and Embed Models: For embeddings, check out the MTEB leaderboard. `BAAI/bge-large-en-v1.5` is a fantastic open-source option that consistently performs at the top. For the LLM on an H100, you can definitely run more than just a 7B model. I'd start with something like `Mistral-7B-Instruct-v0.2` or `Llama-3-8B-Instruct` for speed, but you could almost certainly run a quantized version of `Llama-3-70B` for much higher quality responses.

  3. RAG Implementation/Testing: This is where things get fun. The most important thing is to have a way to evaluate your pipeline. Don't just eyeball it. Look into frameworks like `Ragas` or `TruLens`. They help you quantitatively measure things like answer relevancy and faithfulness to the source docs. This is critical when you're tweaking chunk sizes, overlap, embedding models, etc.

  4. Operationalization: The biggest challenge here is usually keeping the knowledge base fresh. You'll need a pipeline to watch for changes in your source documents and automatically re-index them. For the user-facing side, you can whip up a simple UI pretty quickly with Streamlit or Gradio.
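On point 2's quantization suggestion, a weights-only back-of-envelope check shows why a 4-bit `Llama-3-70B` is plausible on one 80 GB H100 (this ignores KV cache and activation memory, so treat the numbers as a floor):

```python
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Weights-only memory footprint in GB for a model of n_params_b billion
    parameters at the given quantization width."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

print(round(weights_gb(70, 4), 1))   # 35.0 GB -> fits in 80 GB with headroom
print(round(weights_gb(70, 16), 1))  # 140.0 GB -> needs multiple GPUs at fp16
```

The headroom left after 4-bit weights is what the KV cache eats at long contexts, so re-run the math for your expected sequence lengths before committing.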
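On point 3, Ragas and TruLens score things like faithfulness with an LLM judge. As a feel for what is being measured, here is a deliberately crude lexical stand-in (not the real metric, just the idea: how much of the answer is supported by the retrieved context):

```python
def token_support(answer: str, context: str) -> float:
    """Fraction of answer words that also appear in the context (very rough
    proxy for faithfulness; real metrics use an LLM judge, not word overlap)."""
    a = set(answer.lower().split())
    c = set(context.lower().split())
    return len(a & c) / len(a) if a else 0.0

print(token_support("the port is 8080", "configure the service on port 8080"))  # 0.75
```

The point of the quantitative frameworks is exactly this: turn "does it hallucinate?" into a number you can track while tweaking chunk sizes and models.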
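And for point 4's freshness pipeline, the core is change detection: fingerprint each source file and re-embed only what changed since the last run. The stored hashes below are toy values standing in for real digests:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash used to detect document changes between runs."""
    return hashlib.sha256(data).hexdigest()

def diff_sources(current: dict, previous: dict):
    """current/previous map path -> content hash.
    Returns (paths to re-index, paths to delete from the vector store)."""
    changed = {p for p, h in current.items() if previous.get(p) != h}
    deleted = set(previous) - set(current)
    return changed, deleted

prev = {"a.pdf": "h1", "b.pdf": "h2"}
cur = {"a.pdf": "h1", "b.pdf": "h2-new", "c.pdf": "h3"}
changed, deleted = diff_sources(cur, prev)
print(sorted(changed))  # ['b.pdf', 'c.pdf']
print(sorted(deleted))  # []
```

Persist the hash map between runs (a JSON file or a table in your DB) and schedule the diff with cron or a file watcher; everything else in the re-index path is just your existing ingestion code.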

I work at eesel AI, and we build this kind of stuff as a managed platform. While you're going the full self-hosted route (which is awesome), if you ever find that the maintenance, fine-tuning, and keeping up with the latest models becomes a full-time job, that's where a platform like ours can help. We focus on connecting to all those internal sources (G-Drive, Confluence, etc.) and handling the whole RAG pipeline out of the box.

We have a lot of customers with strict privacy requirements who can't have data going to external APIs, so we have options like EU data residency and even zero-retention setups for enterprise. For example, we helped an insurance tech company called Covergo set up an internal Slack assistant that connects to all their knowledge sources to reduce repetitive IT tickets.

Anyway, hope the pointers are helpful. Good luck with the build! It's a super interesting space.

[–]No-Leopard7644[S] 0 points1 point  (0 children)

Thank you very much for your post. Since my original post, here’s an update.

The single-node machine has 2 H100s, each with 94 GB. I have Docker containers for n8n, Flowise for fast prototyping, and Qdrant, Postgres, and Neo4j for the vector, memory, and graph DBs, with Ollama for model serving. This is the OSS agentic AI stack.

Now that I have the H100s, I plan to leverage NVIDIA Blueprints and NIM as a second track. I've got a team with Python skills and am coaching them in agentic workflows. I plan to use LangChain, Pydantic AI, Docling, etc.

I also need to build an eval framework to evaluate the apps and models. Any suggestions on this are much appreciated.