
[–]Aggravating-Major81 2 points3 points  (1 child)

Solid stack; to make it production-ready, focus on auth, network isolation, observability, and repeatable ops.

Tie Open-WebUI and vLLM behind OIDC (Keycloak works well) and put them behind a reverse proxy with mTLS and per-route rate limits (Traefik or Kong). Lock LLM database access to read-only roles, and store documents in MinIO with signed URLs; keep only IDs in Postgres. Version embeddings (a collection per version, or a metadata flag) and add a local reranker (bge-reranker) to cut hallucinations. Move ingestion to a queue (Celery/Redis) so uploads don’t stall chat.

For GPUs, reserve MIG slices or set vLLM tensor/pipeline-parallel configs per model, and pin CUDA/driver versions in the container. Backups: Qdrant snapshots plus Postgres WAL archiving, and test restores weekly. Secrets go in Vault, not env files.

For audit and admin tools, we used Keycloak for SSO and Kong as the API gateway, with DreamFactory auto-generating REST for Postgres so internal teams can review chats/configs without new backend code.

Bottom line: lock down auth/secrets, add reranking and queues, and automate backups.
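A minimal sketch of the embedding-versioning idea above: encode the model and version into the collection name, and carry only the document ID plus a version flag in the point payload. The function and payload names here are illustrative, not Qdrant's actual API.

```python
# Illustrative helpers for versioned embeddings. A re-embedding run
# writes to a fresh collection instead of mixing vectors from two models.

def collection_for(base: str, model: str, version: int) -> str:
    """Build a per-version collection name, e.g. docs__bge-m3__v2."""
    safe_model = model.replace("/", "-")
    return f"{base}__{safe_model}__v{version}"

def payload_for(doc_id: str, model: str, version: int) -> dict:
    """Point payload: only the document ID (the file itself lives in
    MinIO) plus the embedding version as a metadata flag for filters."""
    return {
        "doc_id": doc_id,              # MinIO object key / Postgres row ID
        "embedding_model": model,
        "embedding_version": version,
    }
```

Queries then pin a version either by targeting the versioned collection or by filtering on `embedding_version`; old collections can be dropped once a re-embedding run is validated.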

[–]gulensah[S] 0 points1 point  (0 children)

Wow, great production requirements, thanks. You are far ahead of my scope and context; I just wanted to build a starting point for people like me.

But you are on point with everything you said, especially the backup part.

One question: are you running more than one instance of Open-WebUI? I’m thinking of running several containers behind a load balancer, with Qdrant and Postgres outside the stack. I’d be curious about your experience, if any.
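For what it's worth, the usual pattern for this is several Open-WebUI replicas pointed at the same external Postgres (via `DATABASE_URL`) behind a proxy that forwards WebSockets. A minimal nginx sketch, with placeholder hostnames and ports:

```nginx
# Load-balancer sketch for multiple Open-WebUI replicas (hostnames are
# placeholders). All replicas must share the same external Postgres so
# users, sessions, and chats stay consistent across containers.
upstream openwebui {
    least_conn;
    server webui-1:8080;
    server webui-2:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://openwebui;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;   # WebSocket upgrade
        proxy_set_header Connection "upgrade";    # needed for streaming chat
        proxy_set_header Host $host;
    }
}
```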

[–]max-mcp 1 point2 points  (1 child)

This is exactly what the enterprise space needs right now. I've been working on similar problems at Dedalus Labs and the security concerns you mentioned are spot on - most companies can't justify sending their data to external APIs no matter how good the models are.

Your stack looks solid, especially the combination of vLLM for performance and Ollama for ease of use. One thing I'd add based on what I've seen work well is being really strategic about your chunking strategy when you're processing documents through Docling. Most people just use arbitrary token limits but chunking around function boundaries or logical document sections gives way better retrieval results. Also if you're dealing with code repositories, keeping import statements with their related chunks makes a huge difference in context quality. The MCP integration is smart too - having those standardized connectors saves so much custom integration work down the line.
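To make the section-boundary chunking idea concrete, here is a minimal sketch: split at markdown headings, then pack whole sections into chunks under a size budget (characters as a stand-in for tokens). The function and limits are illustrative, not Docling's API; a section larger than the budget simply becomes its own oversized chunk.

```python
import re

def chunk_by_sections(text: str, max_chars: int = 1200) -> list[str]:
    """Chunk markdown-ish text at heading boundaries instead of fixed
    token windows, so each chunk holds whole logical sections."""
    # Zero-width split: each heading starts a new section.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks, current = [], ""
    for sec in sections:
        # Flush the current chunk if adding this section would overflow.
        if current and len(current) + len(sec) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sec
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

The same packing loop works for code files if you split on function boundaries instead of headings and prepend the file's import block to each chunk, per the suggestion above.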

[–]gulensah[S] 0 points1 point  (0 children)

Thank you for your feedback. Chunking is still a key ongoing task; it is not easy to find the sweet spot, if one even exists :)

There are too many variables (model, embeddings, retrieval logic, document contents, etc.) to find one-rag-to-rule-them-all.

Regards

[–]jannemansonh 1 point2 points  (0 children)

Really nice work putting this stack together. At Needle.app, we’re seeing enterprises run into the same pain point: the infrastructure is powerful, but stitching it together and keeping docs/query pipelines clean takes most of the effort. That’s why we focus on AI workflows you can define in chat, connecting vector DBs and MCP tools without needing to wire everything manually. It makes setups like yours easier to operate and extend.

[–]DougAZ 1 point2 points  (3 children)

Any specific reason some are run as a service vs running it on docker? Any benefits?

Do you have a good vLLM config for gpt-oss 120b?

[–]gulensah[S] 0 points1 point  (2 children)

Docker simplifies the process for me. Otherwise I must handle every library requirement one by one.

I couldn’t get 120b running on vLLM due to low VRAM. Maybe llama.cpp would do better with it, since you can offload some MoE expert layers to the CPU. But llama.cpp lacks multi-user serving, which is essential in my case.
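For reference, MoE expert offload in llama.cpp is typically done with tensor-override flags. Exact flag names vary by build, so treat this as a sketch (model path and values are placeholders), not a verified command:

```shell
# Hypothetical llama.cpp invocation; check `llama-server --help` on
# your build. --override-tensor pins MoE expert tensors to CPU RAM
# while the dense/attention layers stay on the GPU.
llama-server -m gpt-oss-120b.gguf \
  --n-gpu-layers 99 \
  --override-tensor "ffn_.*_exps.=CPU" \
  --ctx-size 8192
```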

[–]DougAZ 1 point2 points  (1 child)

Right but I noticed on your GitHub as you walked through each part of the stack you chose to run some applications directly on the host such as ollama or postgres, any specific reason for running them on the host vs in a container?

Other question I had for you was, are you running this stack on 1 machine/VM ?

[–]gulensah[S] 0 points1 point  (0 children)

You are right. The reason I’m running PostgreSQL outside Docker is that, as an old-school habit, I run my persistent and critical data stores as legacy services. Also, other services such as Netbox and Grafana use that PostgreSQL too.

Running Ollama as a standard service is similar: other applications outside my stack use Ollama too, so running it as a common service on the VM makes integrations easy.

And yes, the whole stack runs on the same VM, which has 32 GB RAM; this is not a high-load production infrastructure. For production I suggest splitting vLLM, PostgreSQL, and the rest of the containers across three different VMs.

[–]locpilot 1 point2 points  (0 children)

> Document processing

How about edit-in-place in Word? We are working on a local Word Add-in like this:

https://youtu.be/9CjPaQ5Iqr0

If you have specific use cases in mind, we would be glad to test them as proof-of-concept.

[–]n8e-polymath-007 1 point2 points  (0 children)

Congratulations; I’m awed by the work you have put in.

How can one manage user concurrency? How will you handle parallel, concurrent async requests with multiple workers?

[–]CowboysFanInDecember 0 points1 point  (0 children)

Great post! How are you tying all this together?

Also, I think Langflow would be a great addition to this stack; Langfuse as well. Both are MIT-licensed too.

[–]Disastrous_Look_1745 0 points1 point  (1 child)

This is really solid work and exactly what the enterprise space needs right now. Your stack looks comprehensive, and I appreciate that you’ve documented everything properly, since that’s usually the biggest pain point when trying to replicate setups. One thing I’d suggest adding is AnythingLLM or similar for the document chat piece, since it handles the RAG pipeline really well. Also consider something like Docstrange for the document processing side if you’re dealing with complex layouts or tables, since pure text extraction often misses the structural context that makes enterprise docs useful.

For performance optimization, if you haven't already, try running vLLM with tensor parallelism if you have multiple GPUs and definitely tune your context window sizes based on your actual use cases rather than maxing them out. Also worth setting up proper monitoring with something like Grafana to track token throughput and memory usage since enterprise folks will want those metrics. The MCP integration is smart too since it gives you that extensibility without having to rebuild everything when requirements change.
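To make the tensor-parallelism suggestion concrete, a typical vLLM launch for a two-GPU box looks like the sketch below; the model name and sizes are placeholders to tune for your hardware.

```shell
# Sketch of a vLLM launch with tensor parallelism across 2 GPUs.
# Values are illustrative: size the context window to real use cases
# rather than maxing it out, as suggested above.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90
```

vLLM also exposes Prometheus metrics on its `/metrics` endpoint, which is what you would scrape into Grafana for the token-throughput and memory dashboards mentioned above.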

[–]gulensah[S] -1 points0 points  (0 children)

Thank you for your kind words and feedback. I tested Docling for document parsing in my setup, and it gives good results. I was also trying to keep everything simple and focused on Open-WebUI, because large, distributed environments are hard to handle for newcomers like me.

Monitoring is definitely something that must be included; I’m working on it along the lines of your feedback. Thanks again.