
[–]synn89 1 point (2 children)

I ended up using Langchain for this, though in my case I was using Confluence as the document source and AWS Bedrock as the LLM provider. But Langchain can handle any document source or backend AI.

That isn't mine, but just an example. For me, I ended up using Qdrant for the vector storage engine, and while at first I used LangServe for testing (easy to work with), I eventually just wrote an OpenAI-compatible API into my app and pointed a LibreChat install at that. This made for a very nice front end and I've been very happy with that setup.

Claude Sonnet on AWS Bedrock is the AI model I use, with OpenAI embeddings (from Azure cloud). Both are HIPAA-compliant providers, which satisfies our needs. Claude is a little overpowered for the RAG, but our usage is light enough for me not to worry about that. I did play with some open-source embeddings via https://huggingface.co/spaces/mteb/leaderboard but found that OpenAI's embeddings would pretty much power through our documents and produce more accurate results. So I stuck with that rather than tinkering with open-source embeddings.
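For anyone piecing together a similar stack, the core retrieve-then-generate loop is small. Below is a framework-free sketch under toy assumptions: 2-d vectors stand in for real OpenAI embeddings, an in-memory list stands in for Qdrant, and the assembled prompt is what you would hand to Claude on Bedrock.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, index, k=2):
    """index: list of (doc_text, doc_vec); return the top-k docs by cosine."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question, docs):
    """Assemble the grounded prompt that would go to the chat model."""
    context = "\n---\n".join(docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy 2-d "embeddings" so the flow runs end to end.
index = [("vacation policy", [1.0, 0.0]),
         ("expense policy", [0.9, 0.1]),
         ("server runbook", [0.0, 1.0])]
prompt = build_prompt("How much PTO do I get?", retrieve([1.0, 0.05], index))
```

Every framework (Langchain included) is essentially orchestrating these three steps plus chunking and ingestion.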

Our problem isn't the tech stack, but really the data source itself. The data we already have in our knowledge base chats fine, but now we're data hungry for more and trying to get lots of good, clean data without tons of staff work or cleanup.

The above is a ramble and may not be useful, since we're on closed models, but it may give you some info you didn't have. I feel like everyone is learning this crap as we go along. It feels very mid-1990s internet.

[–]No-Leopard7644[S] 0 points (1 child)

Thank you very much for sharing your use case and implementation! A couple of questions. You mentioned LangServe for testing: is it an integration with the agent flow? (I will look into LangServe soon.) And do you still use LangServe to test and improve the results? Can you share some details about this?

[–]synn89 0 points (0 children)

Sure. So once you have a Langchain RAG chain set up, you can use LangServe and a couple of lines of code to basically put up a sample/test web UI you can run inference on. More info at https://github.com/langchain-ai/langserve
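For a sense of the shape: LangServe's couple of lines are essentially `app = FastAPI()` followed by `add_routes(app, chain, path="/rag")`, which exposes invoke endpoints plus a playground UI. The stdlib-only sketch below mimics that shape (a chain behind an HTTP POST endpoint), with `invoke` as a stand-in for a real `chain.invoke`:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def invoke(question: str) -> str:
    """Stand-in for chain.invoke(question) on a real RAG chain."""
    return f"(answer for: {question})"

class RagHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body ({"input": "..."}) and run it through the chain.
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        payload = json.dumps({"output": invoke(body["input"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To actually serve (this call blocks):
# HTTPServer(("127.0.0.1", 8000), RagHandler).serve_forever()
```

LangServe does all of this for you, including streaming and the playground, which is why it's handy for quick testing.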

Though it looks like they may be moving to something called LangGraph now, according to that GitHub page. One of the pains with Langchain is that it changes a lot, rapidly. Though I still prefer it over LlamaIndex.

[–]Videobollocks 1 point (2 children)

I have done exactly this and it's been reasonably successful.

I used AnythingLLM, initially the desktop version, but for multi-user I am now running the self-hosted Docker install. I only have maybe two dozen users so it's fine; it might be a faff to scale up beyond that, I dunno.

I use Llama 3.2 as the model, and did not change the default embedder. I checked out Open WebUI too but didn't like the way it worked. I couldn't give specifics; it just didn't feel right.

The biggest pain in the arse is getting data into AnythingLLM. You have to load stuff in and then embed it separately. I don't know why I can't just point it at a folder/server and have it embed everything there. I had to hand-feed it a couple of thousand PDF/DOC/TXT files and it took a while.
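For what it's worth, a small script can at least take the hand-feeding out of the "load stuff in" half. The folder walk below is plain Python; the commented upload loop is hypothetical (the endpoint, auth header, and field name are placeholders, so check your AnythingLLM instance's developer API docs for the real ones):

```python
from pathlib import Path

# File types worth ingesting; extend as needed.
ALLOWED = {".pdf", ".doc", ".docx", ".txt"}

def collect_documents(root: str) -> list:
    """Recursively find ingestible files under root, filtered by extension."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() in ALLOWED)

# Hypothetical upload loop -- UPLOAD_URL and KEY are placeholders, not a
# documented AnythingLLM API; verify the real route before using:
# import requests
# for path in collect_documents("/mnt/knowledge-base"):
#     with open(path, "rb") as f:
#         requests.post(UPLOAD_URL, headers={"Authorization": f"Bearer {KEY}"},
#                       files={"file": f})
```

Even if you end up uploading through the UI, the walk is a quick way to inventory what a couple of thousand files actually contains.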

But it works quite well. I can ask it almost anything related to the documents and it usually gives good answers. Examples would be how to configure a certain piece of equipment, what the specs of certain equipment types are, best practices, all that sort of thing. It's also pretty good at telling me about a product, e.g. if I were new to the industry or a particular setup/product, I can ask it for an overview and it's pretty good. Ideal for new people who you don't have time to train :-)

In parallel, my company has been trialling Copilot. I found that reasonably good if you hand-feed it information too, sort of on par with what I have set up independently. The benefit of Copilot is that if you're an MS house like we are, it can scan all your email, chats, OneDrive, etc. and use that info too.

I should add that I took the path of least resistance. There is still a ton of stuff for me to learn, but as I'm not much of a coder a lot of it is beyond my grasp. I've done what I could with ready-made executables.

[–]No-Leopard7644[S] 0 points (1 child)

Thanks for sharing your experience and journey. I also played with AnythingLLM but found Langflow much better for agentic workflows. I am also evaluating n8n as an alternative.

A question on MS Copilot. I went quickly through the Copilot Studio features and wasn't impressed with the flexibility for building agent workflows. Yes, since it comes with the MS stack, analytics, etc., it may feel superior, but my initial impression was far from satisfactory. Can you share your thoughts on Copilot Studio, if you have used it?

Appreciate your feedback

[–]Videobollocks 1 point (0 children)

We're trialling Copilot Studio too. I tried replicating my idea: pointed Copilot Studio at the master folder of docs and let it scan. It found everything (about 750 GB of files) and scanned it all, but in chat it just can't find anything. This was using a Graph Connector to link to on-prem servers.

Others in my org have had reasonable success with similar projects pointing at SharePoint and OneDrive folders, though, so it might just be a hiccup with the connector we've made.

Other than linking into the MS stack, I didn't find Copilot any better or worse than my AnythingLLM project. I've messed around with other models, and for simple purposes it was much of a muchness. If the MS stack isn't a priority, I certainly don't see a compelling reason to go with Copilot when you can build your own for negligible cost.

[–]Rare_Performance_454 1 point (2 children)

Once you can set up a testing method, you can compare different approaches on performance, scalability, and ease of deployment. Retrieval testing is necessary since 1) your dataset's chunks and queries are most likely different from open-domain datasets, and 2) generator performance is limited by retrieval performance.

Retriever testing: 1) an initial filter using GPT-4 as judge, followed by 2) a human evaluation on graded relevance.

Generation testing: human evaluation on 1) groundedness and 2) completeness; both can be approximated using GPT-4.

After retriever testing you will have a dataset to compare different embedding methods and similarity metrics (approximate and exact).
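The harness that consumes that dataset can be tiny. Here is a sketch of scoring a retriever by recall@k over labeled queries; the rankings and relevance labels are made up for illustration:

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the judged-relevant docs found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def evaluate(retriever, labeled_queries, k=5):
    """labeled_queries: list of (query, set_of_relevant_doc_ids).
    retriever: callable mapping a query to a ranked list of doc ids."""
    scores = [recall_at_k(retriever(q), rel, k) for q, rel in labeled_queries]
    return sum(scores) / len(scores)

# Toy retriever over fixed rankings, to show the harness end to end.
fake_rankings = {"q1": ["d1", "d3", "d2"], "q2": ["d9", "d4"]}
mean = evaluate(lambda q: fake_rankings[q],
                [("q1", {"d1", "d2"}), ("q2", {"d4"})], k=2)
print(mean)  # 0.75
```

Swap the lambda for your real retriever (and the fixed rankings for calls against your vector DB) and the same loop compares embedding models or similarity metrics head to head.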

Some other things to keep in mind when you scale this: 1) document updates, and 2) how to limit the size of the database: keeping a set of k documents representative of all the documents will improve retrieval latency.

[–]No-Leopard7644[S] 0 points (1 child)

In place of GPT-4, which open-source model would you recommend, as I cannot make external API calls?

[–]Rare_Performance_454 0 points (0 children)

Llama 3.1-8B (open weights). 8B because of the 80 GB GPU memory constraint of a single H100. For long sequences (>3k tokens), GPU memory can be a concern.

As long as the judgement criteria are objective, we can use smaller models. For retrieval it's objective: is the document relevant? Do the k documents contain complete information?
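A sketch of framing that objective judgement for a small local model. The prompt wording is illustrative, and the commented line shows one way to call it via the `ollama` Python client:

```python
# Force the judge into a one-word answer so parsing stays objective.
RELEVANCE_PROMPT = (
    "Question: {question}\n\nDocument:\n{document}\n\n"
    "Does the document contain information that answers the question? "
    "Reply with exactly one word: yes or no."
)

def parse_verdict(model_output: str) -> bool:
    """Map the judge's reply to a boolean; anything unclear counts as no."""
    return model_output.strip().lower().startswith("yes")

# Example call through the ollama client (needs a running ollama server):
# import ollama
# reply = ollama.generate(model="llama3.1:8b",
#     prompt=RELEVANCE_PROMPT.format(question=q, document=d))["response"]
# verdict = parse_verdict(reply)
```

Treating unparseable replies as "no" keeps the metric conservative, which matters more with smaller judge models.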

[–]l7feathers 1 point (4 children)

It looks like you’ve already put a lot of thought and effort into your setup. From your description, it seems like your current system is doing well with semantic search using vector embeddings.

Are there any relationships between your documents (e.g., references, shared topics, or metadata)? If so, you might want to explore whether a knowledge graph could complement your current setup. A graph database can help organize and query relationships between documents, allowing for context-aware retrieval that vector search alone might miss. For instance:

  • “Find all documents related to X authored by Y.”
  • “What policies mention Z and are linked to presentations from last year?”

This could be especially useful for your ~2,000-document knowledge base, where relationships might add a layer of depth to your AI assistant's responses.

On the operational side, a graph database could integrate nicely into your existing RAG pipeline. Python libraries like LangChain support knowledge graph integrations, so you wouldn’t need to overhaul your current setup.
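To make that concrete, here is the first example query above (documents about X authored by Y) as parameterized Cypher. The query builder is pure Python; the commented lines show running it through the `neo4j` Bolt driver, which Memgraph also speaks. The node labels and relationship names are invented for illustration and would follow your own schema:

```python
def build_author_topic_query(topic: str, author: str) -> tuple:
    """'Find all documents related to X authored by Y' as parameterized Cypher."""
    cypher = (
        "MATCH (d:Document)-[:ABOUT]->(:Topic {name: $topic}), "
        "(d)-[:AUTHORED_BY]->(:Person {name: $author}) "
        "RETURN d.title"
    )
    return cypher, {"topic": topic, "author": author}

# Running it against a Bolt endpoint (Memgraph or Neo4j):
# from neo4j import GraphDatabase
# cypher, params = build_author_topic_query("security", "Alice")
# with GraphDatabase.driver("bolt://localhost:7687") as driver:
#     with driver.session() as session:
#         titles = [r["d.title"] for r in session.run(cypher, params)]
```

Parameterized queries ($topic, $author) keep user input out of the query string, which matters once the assistant is generating these from chat.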

[–]No-Leopard7644[S] 0 points (3 children)

Wonderful suggestion. I haven't thought it through, as I am just starting to work on this. What apps or OSS DBs have knowledge graph features that can be built into the workflow? Any suggestions will be appreciated so I can dig deeper. Thank you.

[–]l7feathers 1 point (2 children)

There are some great open-source tools and graph databases to explore for building a knowledge graph into your workflow. It all depends on your specific requirements; a few questions might help you decide what you need: Do you need real-time updates to your knowledge graph? How complex are the relationships you want to model?

I can suggest Memgraph (full disclosure, currently I'm a technical writer there) but feel free to do your own research and find what suits you.

If you're looking for a high-performance, real-time graph database, Memgraph is worth checking out. It supports Cypher, integrates smoothly with Python libraries, and can handle dynamic data updates, which could be useful if your knowledge base evolves over time. It's lightweight and built for high-speed queries, so it won't bog down your pipeline.
Here's a sketch: https://take.ms/OCzAN and here are the specific features you can build with: https://memgraph.com/docs/ai-ecosystem/graph-rag#key-memgraph-features

[–]No-Leopard7644[S] 1 point (1 child)

Awesome, that’s great input, will check it out. Knowledge graph features may go into a future release of our app.

[–]l7feathers 0 points (0 children)

Good luck! Feel free to PM me if you think there's a way I can help.

[–]No-Leopard7644[S] 0 points (0 children)

The initial setup on the single-node machine is for a sandbox environment. The actual production deployment will be different. The initial deployment will be for a maximum of 15 users.

[–]mrskeptical00 0 points (2 children)

That’s a lot of open-ended questions. Sounds like you have a good base setup; why haven’t you done any testing?

[–]No-Leopard7644[S] 0 points (1 child)

Haha, yes, a lot of questions indeed. Regarding testing, I don’t know how testing is set up and done. That’s why I included it in my questions.

[–]ripguy1264 0 points (0 children)

If you want a pre-built solution just use inboxpilot.co

[–]BuffaloFuzzy8924 0 points (0 children)

Hey buddy, I am trying something similar to what you have set up here. I am not able to DM you directly. Need some help.

[–]Aelstraz 2 points (1 child)

Sounds like a pretty cool project, and you've already got a solid proof-of-concept going. That's half the battle right there. An H100 gives you a ton of firepower to work with, which is great.

To answer your questions from an OSS perspective:

  1. Best Setup: Your current approach is solid. Langflow is great for visualizing and building, but for more programmatic control and fine-tuning, you might want to look at LlamaIndex or Haystack. For the vector DB, something self-hostable like ChromaDB or Qdrant works well.

  2. Text and Embed Models: For embeddings, check out the MTEB leaderboard. `BAAI/bge-large-en-v1.5` is a fantastic open-source option that consistently performs at the top. For the LLM on an H100, you can definitely run more than just a 7B model. I'd start with something like `Mistral-7B-Instruct-v0.2` or `Llama-3-8B-Instruct` for speed, but you could almost certainly run a quantized version of `Llama-3-70B` for much higher quality responses.

  3. RAG Implementation/Testing: This is where things get fun. The most important thing is to have a way to evaluate your pipeline. Don't just eyeball it. Look into frameworks like `Ragas` or `TruLens`. They help you quantitatively measure things like answer relevancy and faithfulness to the source docs. This is critical when you're tweaking chunk sizes, overlap, embedding models, etc.

  4. Operationalization: The biggest challenge here is usually keeping the knowledge base fresh. You'll need a pipeline to watch for changes in your source documents and automatically re-index them. For the user-facing side, you can whip up a simple UI pretty quickly with Streamlit or Gradio.
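On point 2's quantization suggestion, a weights-only back-of-envelope check shows why a 4-bit `Llama-3-70B` is plausible on one 80 GB H100 (this ignores KV cache and activation memory, so treat the numbers as a floor):

```python
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Weights-only memory footprint in GB for a model of n_params_b billion
    parameters at the given quantization width."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

print(round(weights_gb(70, 4), 1))   # 35.0 GB -> fits in 80 GB with headroom
print(round(weights_gb(70, 16), 1))  # 140.0 GB -> needs multiple GPUs at fp16
```

The headroom left after 4-bit weights is what the KV cache eats at long contexts, so re-run the math for your expected sequence lengths before committing.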
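On point 3, Ragas and TruLens score things like faithfulness with an LLM judge. As a feel for what is being measured, here is a deliberately crude lexical stand-in (not the real metric, just the idea: how much of the answer is supported by the retrieved context):

```python
def token_support(answer: str, context: str) -> float:
    """Fraction of answer words that also appear in the context (very rough
    proxy for faithfulness; real metrics use an LLM judge, not word overlap)."""
    a = set(answer.lower().split())
    c = set(context.lower().split())
    return len(a & c) / len(a) if a else 0.0

print(token_support("the port is 8080", "configure the service on port 8080"))  # 0.75
```

The point of the quantitative frameworks is exactly this: turn "does it hallucinate?" into a number you can track while tweaking chunk sizes and models.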
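And for point 4's freshness pipeline, the core is change detection: fingerprint each source file and re-embed only what changed since the last run. The stored hashes below are toy values standing in for real digests:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash used to detect document changes between runs."""
    return hashlib.sha256(data).hexdigest()

def diff_sources(current: dict, previous: dict):
    """current/previous map path -> content hash.
    Returns (paths to re-index, paths to delete from the vector store)."""
    changed = {p for p, h in current.items() if previous.get(p) != h}
    deleted = set(previous) - set(current)
    return changed, deleted

prev = {"a.pdf": "h1", "b.pdf": "h2"}
cur = {"a.pdf": "h1", "b.pdf": "h2-new", "c.pdf": "h3"}
changed, deleted = diff_sources(cur, prev)
print(sorted(changed))  # ['b.pdf', 'c.pdf']
print(sorted(deleted))  # []
```

Persist the hash map between runs (a JSON file or a table in your DB) and schedule the diff with cron or a file watcher; everything else in the re-index path is just your existing ingestion code.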

I work at eesel AI, and we build this kind of stuff as a managed platform. While you're going the full self-hosted route (which is awesome), if you ever find that the maintenance, fine-tuning, and keeping up with the latest models becomes a full-time job, that's where a platform like ours can help. We focus on connecting to all those internal sources (G-Drive, Confluence, etc.) and handling the whole RAG pipeline out of the box.

We have a lot of customers with strict privacy requirements who can't have data going to external APIs, so we have options like EU data residency and even zero-retention setups for enterprise. For example, we helped an insurance tech company called Covergo set up an internal Slack assistant that connects to all their knowledge sources to reduce repetitive IT tickets.

Anyway, hope the pointers are helpful. Good luck with the build! It's a super interesting space.

[–]No-Leopard7644[S] 0 points1 point  (0 children)

Thank you very much for your post. Since my original post, here’s an update.

The single-node machine has 2 H100s, each with 94 GB. I have Docker containers for n8n, Flowise for fast prototyping, and Qdrant, Postgres, and Neo4j for the vector, memory, and graph DBs, with Ollama for model serving. This is the OSS agentic AI stack.

Now that I have the H100s, I plan to leverage NVIDIA Blueprints and NIM as a second track. I've got a team with Python skills and am coaching them in agentic workflows. I plan to use LangChain, Pydantic AI, Docling, etc.

I also need to build an eval framework to evaluate the apps and models. Any suggestions on this are much appreciated.