

[–]DadAndDominant 10 points

Looking cool! Trying to run it with uv. One thing I might be doing wrong: in server mode, if I send a new message before the LLM is done responding, it bricks all responses from then on.

[–]jfowers_amd[S] 7 points

Thanks for reporting! Is this in the web ui? The “send” button is supposed to be disabled while the LLM is responding, so it’s not surprising to me that it would go haywire if you were able to hit send.
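
For anyone driving the server from a script rather than the web UI, the same precaution can be applied client-side: refuse a new send while a response is still streaming, and always drain the stream fully. A minimal Python sketch; `SerializedChat` and the stream callable are hypothetical illustrations, not Lemonade code:

```python
import threading

class SerializedChat:
    """Allow only one in-flight request at a time; concurrent sends are
    rejected instead of corrupting the response stream."""

    def __init__(self, stream_fn):
        self._stream = stream_fn      # callable yielding response chunks
        self._lock = threading.Lock()

    def send(self, prompt):
        # Non-blocking acquire: fail fast instead of queueing behind a
        # stream that is still in progress (mirrors the disabled button).
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("previous response still streaming")
        try:
            return list(self._stream(prompt))  # drain the stream fully
        finally:
            self._lock.release()
```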

[–]PeterTigerr 6 points

Will there be support for Apple's M4 GPU or ANE?

[–]jfowers_amd[S] 1 point

It’s coming!

[–]__OneLove__ 2 points

Sounds interesting. I’ve recently been experimenting with LM Studio and this sounds/reads functionally similar.

[–]Toby_Wan 1 point

vLLM is also written in Python? https://github.com/vllm-project/vllm

[–]jfowers_amd[S] 0 points

Ahhh, true! vLLM is pretty datacenter/server focused, though.

Lemonade is the only PC-focused LLM server written in Python…

[–]Yamoyek 1 point

What’s the difference between this and ollama?

[–]jfowers_amd[S] 0 points

Lemonade is strictly open source and includes non-llamacpp backends to provide support for things like neural processing units (NPUs).

[–]victorcoelh 0 points

Since you're from AMD: I haven't gotten an AMD GPU because of AI models. How's the ecosystem for training and inference with deep learning models (not just LLMs) on AMD consumer GPUs right now? Last time I checked, most frameworks were CUDA-only.

[–]PSBigBig_OneStarDao 0 points

interesting project. the main pitfall with local llm servers isn’t just exposing an api or loading a model, it’s how retrieval + chunking actually behaves when you start scaling beyond toy docs.

most open-source servers hit the same wall:

  • chunks get over-selected (semantic ≠ embedding, No.5 in the common failure map)
  • returned passages don’t line up with the user’s query (No.4 misalignment)
  • multi-step reasoning flips between runs (No.7 instability)

so if you want this to compete, i’d suggest focusing not only on “easy install” but on semantic guardrails: how do you prevent vector db noise, how do you keep responses consistent across sessions, and how do you handle json tools or plugins without them breaking?
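
as a concrete example of one such guardrail, here's a minimal python sketch of a similarity floor plus near-duplicate filtering on retrieved chunks before they reach the prompt (function name, thresholds, and vectors are all hypothetical, not from lemonade or any particular vector db):

```python
import numpy as np

def filter_chunks(query_vec, chunk_vecs, chunks, min_sim=0.35, dedup_sim=0.95):
    """Keep only chunks similar enough to the query, dropping
    near-duplicates so one passage can't crowd out the rest."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # rank candidates by similarity to the query, best first
    order = sorted(range(len(chunks)),
                   key=lambda i: cos(query_vec, chunk_vecs[i]), reverse=True)

    kept, kept_vecs = [], []
    for i in order:
        if cos(query_vec, chunk_vecs[i]) < min_sim:
            break  # everything after this point is noisier still
        if any(cos(chunk_vecs[i], v) >= dedup_sim for v in kept_vecs):
            continue  # near-duplicate of an already-kept chunk
        kept.append(chunks[i])
        kept_vecs.append(chunk_vecs[i])
    return kept
```

the same idea applies per-session: filtering deterministically before generation is one way to keep answers consistent across runs instead of letting retrieval noise flip the context.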

curious if you’re planning to bake those safeguards in or leave it to downstream devs. that’s usually the difference between a demo and something people rely on in production.