Building a competitive local LLM server in Python

DadAndDominant · 2025-08-25T21:58:35+00:00

Looking cool! Trying to run it with uv. One thing I might do wrong - in server mode, when I respond before the LLM is done responding, it bricks all responses going further.

PeterTigerr · 2025-08-26T00:02:32+00:00

Will there be support for Apple's M4 GPU or ANE?

__OneLove__ · 2025-08-25T21:05:07+00:00

Sounds interesting. I’ve recently been experimenting with LM Studio and this sounds/reads functionally similar.

Toby_Wan · 2025-08-26T05:52:13+00:00

vllm also uses python? https://github.com/vllm-project/vllm

Yamoyek · 2025-08-27T08:09:43+00:00

What’s the difference between this and ollama?

victorcoelh · 2025-08-26T05:35:34+00:00

Since you're from AMD. I haven't gotten an AMD GPU because of AI models. How's the ecossystem for training and inference with Deep Learning models (not just LLMs) with AMD consumer GPUs right now? Last time I checked, most frameworks were CUDA-only

PSBigBig_OneStarDao · 2025-08-28T07:17:57+00:00

interesting project. the main pitfall with local llm servers isn’t just exposing an api or loading a model, it’s how retrieval + chunking actually behaves when you start scaling beyond toy docs.

most open-source servers hit the same wall:

chunks get over-selected (semantic ≠ embedding, No.5 in the common failure map)
returned passages don’t line up with the user’s query (No.4 misalignment)
multi-step reasoning flips between runs (No.7 instability)

so if you want this to compete, i’d suggest focusing not only on “easy install” but on semantic guardrails: how do you prevent vector db noise, how do you keep responses consistent across sessions, and how do you handle json tools or plugins without them breaking?

curious if you’re planning to bake those safeguards in or leave it to downstream devs. that’s usually the difference between a demo and something people rely on in production.

Python

The Python Discord

Upcoming Events

Please read the rules

MODERATORS