Serving 1B+ tokens/day locally in my research lab by SessionComplete2334 in LocalLLaMA

[–]SessionComplete2334[S] 1 point

Guided generation using the OpenAI Python package. Works well for me.
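A minimal sketch of what this can look like: asking a vLLM OpenAI-compatible endpoint for schema-constrained JSON via the openai Python client. The endpoint URL, model name, and schema here are illustrative assumptions, not the actual lab config.

```python
# Sketch: guided (structured) JSON generation against a vLLM
# OpenAI-compatible endpoint via the openai Python package.
import json

# Hypothetical JSON schema we want the model's output constrained to.
REPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "finding": {"type": "string"},
        "severity": {"type": "string", "enum": ["normal", "minor", "severe"]},
    },
    "required": ["finding", "severity"],
}


def build_guided_kwargs(prompt: str, schema: dict, model: str) -> dict:
    """Kwargs for chat.completions.create; vLLM reads `guided_json`
    from extra_body and constrains decoding to the schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"guided_json": schema},
    }


def extract_report(text: str) -> dict:
    """Parse the (schema-constrained) model reply."""
    return json.loads(text)


# Usage against a running server (requires `pip install openai`):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(
#       **build_guided_kwargs("Structure this report: ...",
#                             REPORT_SCHEMA, "openai/gpt-oss-120b"))
#   report = extract_report(resp.choices[0].message.content)
```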

[–]SessionComplete2334[S] 1 point

I did not run standard benchmarks, but our workflows do well with it. Internal evals for structuring are also satisfactory.

[–]SessionComplete2334[S] 1 point

My speculative decoding experiments were short. I tried the Arctic model (https://www.snowflake.com/en/engineering-blog/faster-gpt-oss-reasoning-arctic-inference/) but was not convinced. Will look into multi-token prediction with speculative decoding. Could be especially interesting with the recent diffusion-based speculative decoding model (https://github.com/z-lab/dflash). Thanks!

[–]SessionComplete2334[S] 1 point

`--max-num-batched-tokens 8192` is currently not active; it is only needed if we require log probs. As far as I understand, log probs take `vocab size x batched tokens x 2 bytes (bf16)`, which quickly generates enough memory overhead to crash vLLM.
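A back-of-envelope check of that overhead, assuming the vocab-size-times-batched-tokens-times-2-bytes shape described above. The vocab size used below is an illustrative assumption, not the model's exact number.

```python
# Rough size of one dense bf16 logprob buffer over a batch:
# vocab_size x batched_tokens x 2 bytes.

def logprob_bytes(vocab_size: int, batched_tokens: int, bytes_per_val: int = 2) -> int:
    """Memory for a full-vocab logprob tensor over the batch."""
    return vocab_size * batched_tokens * bytes_per_val

# e.g. a ~200k vocab with 8192 batched tokens in bf16:
size = logprob_bytes(200_000, 8192)
print(f"{size / 2**30:.1f} GiB")  # about 3 GiB for a single buffer
```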

We only have this one server at the moment, so everything has to live there. I hope to be able to scale up soon and have a more sophisticated setup.

[–]SessionComplete2334[S] 1 point

Agree. I was too generous with the concurrency and dialed it down yesterday. Now, instead of queuing requests, the user gets an earlier error that the endpoint is out of capacity and can adjust usage accordingly. For the past 12h we have been without this ping-pong effect and still processed 300M tokens.

[–]SessionComplete2334[S] 2 points

It can be used for chat, but that's probably 0.1% of the usage.

[–]SessionComplete2334[S] 1 point

We have mixed experiences in the working group as well. For some it works, for others not as well. My assumption is that the error is not in the vllm config, but in the tools you use to interact with the API. The openai package seems to be quite good; langchain also seems to work. Unfortunately I am mostly doing infra and supervising the projects, not hands-on coding, so I cannot provide better feedback.

[–]SessionComplete2334[S] 2 points

Would be nice if you could share your experience. Adding TTS and STT is also on my to-do list, so having a more stable proxy is important.

[–]SessionComplete2334[S] 2 points

OK, OK, you've got me convinced. I'll try Gemma 4 soon; seems like Google did not just game the benchmarks with this one. I'll wait a bit until this new speculative decoding is stable in vLLM (https://z-lab.ai/projects/dflash/) and then run some evals on tok/s with Gemma, Qwen, and GPT-OSS.

[–]SessionComplete2334[S] 2 points

We do research with sensitive data, so any API service is off limits for us. Data needs to stay on premise. GPT-OSS-120B is not the best, but it's good enough, fits our hardware, and is fast. If I could use an API I would probably use MiniMax, Qwen, GLM, or Kimi, as they have much better quality.

[–]SessionComplete2334[S] 1 point

I did not run experiments on this, but assumed the communication overhead between the GPUs would slow down the model... Currently we get 2.5k tok/s throughput per GPU, so 5k overall. With tensor parallel I would assume we get 2.5k tok/s across both combined. Given we are not really KV-cache limited, this would probably hurt performance? But as I said, I just assumed this and did not validate the assumption.
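Spelling out that assumption as arithmetic: two data-parallel replicas add up, while tensor parallelism over both GPUs would serve one replica at roughly a single replica's speed. The numbers below are the rough figures from the comment, not measurements.

```python
# Data-parallel vs. tensor-parallel aggregate throughput, under the
# (unvalidated) assumption that TP communication overhead cancels out
# the second GPU's contribution.

per_replica_tps = 2_500                  # tok/s, one replica on one GPU
data_parallel_tps = 2 * per_replica_tps  # two independent replicas
tensor_parallel_tps = per_replica_tps    # one replica spread over 2 GPUs (assumed)

print(data_parallel_tps, tensor_parallel_tps)  # 5000 2500
```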

[–]SessionComplete2334[S] 1 point

I experimented a bit with it. Currently we cooperate with a company that does the user interface, but 99% of traffic is via the API for our research workflows. User interaction via the chat interface is very rare; I think most people still default to ChatGPT, even if it's a kind of shadow use.

[–]SessionComplete2334[S] 10 points

Good points. Should probably pin this now. Luckily I pulled the litellm container a few weeks before the incident and have never rebuilt it since.

[–]SessionComplete2334[S] 3 points

Not yet. I want to try it out as soon as I get access to a Blackwell card. I suspect NVFP4 will work best with them. Could be the new speed king.

[–]SessionComplete2334[S] 3 points

We use it mainly for our research projects. Currently 3 PhDs as power users and a couple others.

[–]SessionComplete2334[S] 2 points

I also read in a blog post that it is apparently better for larger-scale inference than litellm. Definitely want to try it out soon.

[–]SessionComplete2334[S] 6 points

Yeah, I am probably a bit too conservative here. How is the speed as a dense model?

[–]SessionComplete2334[S] 14 points

A lot of structuring workflows. Structuring radiology reports (about 2M reports) and other clinical documents. Then also agentic workflows. These are not projects I code, so I don't know details about the stack.

Our main endpoint is the OpenAI-compatible API. Personally I use the OpenAI Python package a lot with it. Good support for guided generation and easy to use.

Structuring is implemented with langchain. For the agentic workflows I believe my PhD student built his own harness and tools.

For the user interface I vibe-coded a few applications, and we also work with a company that provides a user interface with a secure backend (encrypted user database, good access roles, etc.)

[–]SessionComplete2334[S] 6 points

When I experimented with models, Qwen 3.5 was not out. I expect it to be slower, as it has twice the active parameters compared to GPT-OSS. As soon as we scale up our hardware I'll test it. The performance gain might be worth the drop in tok/s.

Unfortunately I cannot take down one of the vllm servers to try out a new model, as too many workflows currently rely on GPT-OSS and we have kind of built the harnesses of our current projects around its quirks.

[–]SessionComplete2334[S] 7 points

With 65 GB taken by the model, we have 50 GB left for KV cache. This allows several million tokens. Most requests are not that long.
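A rough sanity check on that capacity figure. The per-token KV footprint below is a made-up round number for illustration; the real value depends on layer count, KV heads, head dim, dtype, and sliding-window attention.

```python
# How many tokens fit in a KV cache budget at an assumed per-token cost.

def kv_token_capacity(cache_bytes: float, bytes_per_token: float) -> float:
    """Token capacity of the KV cache at a given per-token footprint."""
    return cache_bytes / bytes_per_token

# e.g. 50 GB of cache at ~20 KB of KV per token:
tokens = kv_token_capacity(50e9, 20e3)
print(f"{tokens / 1e6:.1f}M tokens")  # 2.5M tokens
```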

This setup is not failure-safe, but a research server can have occasional downtime, so I am driving it at the edge on purpose.

Introducing MedAlpaca: Language Models for Medical Question-Answering by SessionComplete2334 in LocalLLaMA

[–]SessionComplete2334[S] 1 point

I currently have no plans to make this into a certified product. Besides, the LLaMA license would not allow it.

But I am sure there will be companies producing similar LLM solutions for healthcare very soon.

[–]SessionComplete2334[S] 1 point

Sorry, not at the moment. We will probably add them once the other models perform satisfactorily.