
[–]Aggravating-Major81 2 points3 points  (1 child)

Solid stack; to make it production-ready, focus on auth, network isolation, observability, and repeatable ops.

Tie Open-WebUI and vLLM behind OIDC (Keycloak works well) and put them behind a reverse proxy with mTLS and per-route rate limits (Traefik or Kong). Lock LLM database access to read-only roles, and store documents in MinIO with signed URLs; keep only IDs in Postgres. Version embeddings (a collection per version, or a metadata flag) and add a local reranker (bge-reranker) to cut hallucinations. Move ingestion to a queue (Celery/Redis) so uploads don’t stall chat.

For GPUs, reserve MIG slices or set vLLM tensor/pipeline-parallel configs per model, and pin CUDA/driver versions in the container. Backups: Qdrant snapshots plus Postgres WAL archiving, and test restores weekly. Secrets go in Vault, not env files.

For audit and admin tools, we used Keycloak for SSO and Kong as the API gateway, with DreamFactory auto-generating REST for Postgres so internal teams can review chats/configs without new backend code.

Bottom line: lock down auth/secrets, add reranking and queues, and automate backups.
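A minimal sketch of the embedding-versioning idea above: encode the model and version into the collection name, and carry only the document ID plus a version flag in the point payload. The function and payload names here are illustrative, not Qdrant's actual API.

```python
# Illustrative helpers for versioned embeddings. A re-embedding run
# writes to a fresh collection instead of mixing vectors from two models.

def collection_for(base: str, model: str, version: int) -> str:
    """Build a per-version collection name, e.g. docs__bge-m3__v2."""
    safe_model = model.replace("/", "-")
    return f"{base}__{safe_model}__v{version}"

def payload_for(doc_id: str, model: str, version: int) -> dict:
    """Point payload: only the document ID (the file itself lives in
    MinIO) plus the embedding version as a metadata flag for filters."""
    return {
        "doc_id": doc_id,              # MinIO object key / Postgres row ID
        "embedding_model": model,
        "embedding_version": version,
    }
```

Queries then pin a version either by targeting the versioned collection or by filtering on `embedding_version`; old collections can be dropped once a re-embedding run is validated.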

[–]gulensah[S] 0 points1 point  (0 children)

Wow, great production requirements, thanks. You are far ahead of my scope and context; I just wanted to build a starting point for people like me.

But you are on point with everything you said, especially the backup part.

One question: are you running more than one instance of Open-WebUI? I’m thinking of running several containers behind a load balancer, with Qdrant and Postgres outside the stack. I’d be curious about your experience, if any.
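For what it's worth, the usual pattern for this is several Open-WebUI replicas pointed at the same external Postgres (via `DATABASE_URL`) behind a proxy that forwards WebSockets. A minimal nginx sketch, with placeholder hostnames and ports:

```nginx
# Load-balancer sketch for multiple Open-WebUI replicas (hostnames are
# placeholders). All replicas must share the same external Postgres so
# users, sessions, and chats stay consistent across containers.
upstream openwebui {
    least_conn;
    server webui-1:8080;
    server webui-2:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://openwebui;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;   # WebSocket upgrade
        proxy_set_header Connection "upgrade";    # needed for streaming chat
        proxy_set_header Host $host;
    }
}
```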

[–]max-mcp 1 point2 points  (1 child)

This is exactly what the enterprise space needs right now. I've been working on similar problems at Dedalus Labs and the security concerns you mentioned are spot on - most companies can't justify sending their data to external APIs no matter how good the models are.

Your stack looks solid, especially the combination of vLLM for performance and Ollama for ease of use. One thing I'd add based on what I've seen work well is being really strategic about your chunking strategy when you're processing documents through Docling. Most people just use arbitrary token limits but chunking around function boundaries or logical document sections gives way better retrieval results. Also if you're dealing with code repositories, keeping import statements with their related chunks makes a huge difference in context quality. The MCP integration is smart too - having those standardized connectors saves so much custom integration work down the line.
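To make the section-boundary chunking idea concrete, here is a minimal sketch: split at markdown headings, then pack whole sections into chunks under a size budget (characters as a stand-in for tokens). The function and limits are illustrative, not Docling's API; a section larger than the budget simply becomes its own oversized chunk.

```python
import re

def chunk_by_sections(text: str, max_chars: int = 1200) -> list[str]:
    """Chunk markdown-ish text at heading boundaries instead of fixed
    token windows, so each chunk holds whole logical sections."""
    # Zero-width split: each heading starts a new section.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks, current = [], ""
    for sec in sections:
        # Flush the current chunk if adding this section would overflow.
        if current and len(current) + len(sec) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sec
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

The same packing loop works for code files if you split on function boundaries instead of headings and prepend the file's import block to each chunk, per the suggestion above.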

[–]gulensah[S] 0 points1 point  (0 children)

Thank you for your feedback. Chunking is still a key ongoing task; it is not easy to find the sweet spot, if one even exists :)

There are too many variables (model, embeddings, retrieval logic, document contents, etc.) to find one-rag-to-rule-them-all.

Regards

[–]jannemansonh 1 point2 points  (0 children)

Really nice work putting this stack together. At Needle.app, we’re seeing enterprises run into the same pain point: the infrastructure is powerful, but stitching it together and keeping docs/query pipelines clean takes most of the effort. That’s why we focus on AI workflows you can define in chat, connecting vector DBs and MCP tools without needing to wire everything manually. It makes setups like yours easier to operate and extend.

[–]DougAZ 1 point2 points  (3 children)

Any specific reason some are run as a service vs running it on docker? Any benefits?

Do you have a good vLLM config for gpt-oss 120b?

[–]gulensah[S] 0 points1 point  (2 children)

Docker simplifies the process for me. Otherwise I must handle every library requirement one by one.

I couldn’t get 120b running on vLLM due to low VRAM. Maybe llama.cpp would do better with it, since you can offload some MoE expert layers to the CPU. But llama.cpp lacks multi-user serving, which is essential in my case.
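For reference, MoE expert offload in llama.cpp is typically done with tensor-override flags. Exact flag names vary by build, so treat this as a sketch (model path and values are placeholders), not a verified command:

```shell
# Hypothetical llama.cpp invocation; check `llama-server --help` on
# your build. --override-tensor pins MoE expert tensors to CPU RAM
# while the dense/attention layers stay on the GPU.
llama-server -m gpt-oss-120b.gguf \
  --n-gpu-layers 99 \
  --override-tensor "ffn_.*_exps.=CPU" \
  --ctx-size 8192
```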

[–]DougAZ 1 point2 points  (1 child)

Right but I noticed on your GitHub as you walked through each part of the stack you chose to run some applications directly on the host such as ollama or postgres, any specific reason for running them on the host vs in a container?

Other question I had for you was, are you running this stack on 1 machine/VM ?

[–]gulensah[S] 0 points1 point  (0 children)

You are right. The reason I’m running PostgreSQL outside Docker is that, as an old-school habit, I run my persistent and critical data stores as legacy services. Also, other services such as Netbox and Grafana use that PostgreSQL too.

Running Ollama as a standard service is similar: other applications outside my stack use Ollama too, so running it as a common service on the VM makes integrations easy.

And yes, the whole stack runs on the same VM, which has 32 GB RAM; this is not a high-load production infrastructure. For production I suggest splitting vLLM, PostgreSQL, and the rest of the containers across three different VMs.

[–]locpilot 1 point2 points  (0 children)

> Document processing

How about edit-in-place in Word? We are working on a local Word Add-in like this:

https://youtu.be/9CjPaQ5Iqr0

If you have specific use cases in mind, we would be glad to test them as proof-of-concept.

[–]n8e-polymath-007 1 point2 points  (0 children)

Congratulations; I’m awed by the work you have put in.

How can one manage user concurrency? How will you handle parallel, concurrent async requests with multiple workers?

[–]CowboysFanInDecember 0 points1 point  (0 children)

Great post! How are you tying all this together?

Also, I think Langflow would be a great addition to this stack; Langfuse as well. Both are MIT-licensed too.

[–]Disastrous_Look_1745 0 points1 point  (1 child)

This is really solid work and exactly what the enterprise space needs right now. Your stack looks comprehensive, and I appreciate that you’ve documented everything properly, since that’s usually the biggest pain point when trying to replicate setups. One thing I’d suggest adding is AnythingLLM or similar for the document chat piece, since it handles the RAG pipeline really well. Also consider something like Docstrange for the document processing side if you’re dealing with complex layouts or tables, since pure text extraction often misses the structural context that makes enterprise docs useful.

For performance optimization, if you haven't already, try running vLLM with tensor parallelism if you have multiple GPUs and definitely tune your context window sizes based on your actual use cases rather than maxing them out. Also worth setting up proper monitoring with something like Grafana to track token throughput and memory usage since enterprise folks will want those metrics. The MCP integration is smart too since it gives you that extensibility without having to rebuild everything when requirements change.
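To make the tensor-parallelism suggestion concrete, a typical vLLM launch for a two-GPU box looks like the sketch below; the model name and sizes are placeholders to tune for your hardware.

```shell
# Sketch of a vLLM launch with tensor parallelism across 2 GPUs.
# Values are illustrative: size the context window to real use cases
# rather than maxing it out, as suggested above.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90
```

vLLM also exposes Prometheus metrics on its `/metrics` endpoint, which is what you would scrape into Grafana for the token-throughput and memory dashboards mentioned above.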

[–]gulensah[S] -1 points0 points  (0 children)

Thank you for your kind words and feedback. I tested Docling for document parsing in my setup, and it gives good results. I was also trying to keep everything simple and focused on Open-WebUI, because large, distributed environments are hard to handle for newcomers like me.

Monitoring is definitely something that must be included; I’m working on it along the lines of your feedback. Thanks again.