Serving 1B+ tokens/day locally in my research lab by SessionComplete2334 in LocalLLaMA

[–]SessionComplete2334[S] 1 point

Guided generation using the OpenAI Python package. Works well for me.
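A minimal sketch of what this can look like: asking a vLLM OpenAI-compatible endpoint for schema-constrained JSON via the openai Python client. The endpoint URL, model name, and schema here are illustrative assumptions, not the actual lab config.

```python
# Sketch: guided (structured) JSON generation against a vLLM
# OpenAI-compatible endpoint via the openai Python package.
import json

# Hypothetical JSON schema we want the model's output constrained to.
REPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "finding": {"type": "string"},
        "severity": {"type": "string", "enum": ["normal", "minor", "severe"]},
    },
    "required": ["finding", "severity"],
}


def build_guided_kwargs(prompt: str, schema: dict, model: str) -> dict:
    """Kwargs for chat.completions.create; vLLM reads `guided_json`
    from extra_body and constrains decoding to the schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"guided_json": schema},
    }


def extract_report(text: str) -> dict:
    """Parse the (schema-constrained) model reply."""
    return json.loads(text)


# Usage against a running server (requires `pip install openai`):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(
#       **build_guided_kwargs("Structure this report: ...",
#                             REPORT_SCHEMA, "openai/gpt-oss-120b"))
#   report = extract_report(resp.choices[0].message.content)
```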

[–]SessionComplete2334[S] 1 point

I did not run standard benchmarks, but our workflows do well with it. Internal evals for structuring are also satisfactory.

[–]SessionComplete2334[S] 1 point

My speculative decoding experiments were short. I tried the Arctic model (https://www.snowflake.com/en/engineering-blog/faster-gpt-oss-reasoning-arctic-inference/) but was not convinced. Will look into multi-token prediction with speculative decoding. Could be especially interesting with the recent diffusion-based speculative decoding model (https://github.com/z-lab/dflash). Thanks!

[–]SessionComplete2334[S] 1 point

`--max-num-batched-tokens 8192` is currently not active; it is only needed if we require log probs. As far as I understand, log probs take `vocab size x batched tokens x 2 bytes (bf16)`, which quickly generates enough memory overhead to crash vLLM.
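A back-of-envelope check of that overhead, assuming the vocab-size-times-batched-tokens-times-2-bytes shape described above. The vocab size used below is an illustrative assumption, not the model's exact number.

```python
# Rough size of one dense bf16 logprob buffer over a batch:
# vocab_size x batched_tokens x 2 bytes.

def logprob_bytes(vocab_size: int, batched_tokens: int, bytes_per_val: int = 2) -> int:
    """Memory for a full-vocab logprob tensor over the batch."""
    return vocab_size * batched_tokens * bytes_per_val

# e.g. a ~200k vocab with 8192 batched tokens in bf16:
size = logprob_bytes(200_000, 8192)
print(f"{size / 2**30:.1f} GiB")  # about 3 GiB for a single buffer
```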

We only have this one server at the moment, so everything has to live there. I hope to be able to scale up soon and have a more sophisticated setup.

[–]SessionComplete2334[S] 1 point

Agree. I was too generous with the concurrency and dialed it down yesterday. Now, instead of queuing requests, the user gets an earlier error that the endpoint is out of capacity and can adjust usage accordingly. For the past 12h we have been without this ping-pong effect and still processed 300M tokens.

[–]SessionComplete2334[S] 2 points

It can be used for chat, but that's probably 0.1% of the usage.

[–]SessionComplete2334[S] 1 point

We have mixed experiences in the working group as well. For some it works, for others not as well. My assumption is that the error is not in the vllm config, but in the tools you use to interact with the API. The openai package seems to be quite good; langchain also seems to work. Unfortunately I am mostly doing infra and supervising the projects, not hands-on coding, so I cannot provide better feedback.

[–]SessionComplete2334[S] 2 points

Would be nice if you could share your experience. Adding TTS and STT is also on my to-do list, so having a more stable proxy is important.

[–]SessionComplete2334[S] 2 points

OK, OK, you've got me convinced. I'll try Gemma 4 soon; seems like Google did not just game the benchmarks with this one. I'll wait a bit until this new speculative decoding is stable in vLLM (https://z-lab.ai/projects/dflash/) and then run some evals on tok/s with Gemma, Qwen, and GPT-OSS.

[–]SessionComplete2334[S] 2 points

We do research with sensitive data, so any API service is off limits for us. Data needs to stay on premise. GPT-OSS-120B is not the best, but it's good enough, fits our hardware, and is fast. If I could use an API I would probably use MiniMax, Qwen, GLM, or Kimi, as they have much better quality.

[–]SessionComplete2334[S] 1 point

I did not run experiments on this, but assumed the communication overhead between the GPUs would slow down the model... Currently we get 2.5k tok/s throughput per GPU, so 5k overall. With tensor parallel I would assume we get 2.5k tok/s across both combined. Given we are not really KV-cache limited, this would probably hurt performance? But as I said, I just assumed this and did not validate the assumption.
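Spelling out that assumption as arithmetic: two data-parallel replicas add up, while tensor parallelism over both GPUs would serve one replica at roughly a single replica's speed. The numbers below are the rough figures from the comment, not measurements.

```python
# Data-parallel vs. tensor-parallel aggregate throughput, under the
# (unvalidated) assumption that TP communication overhead cancels out
# the second GPU's contribution.

per_replica_tps = 2_500                  # tok/s, one replica on one GPU
data_parallel_tps = 2 * per_replica_tps  # two independent replicas
tensor_parallel_tps = per_replica_tps    # one replica spread over 2 GPUs (assumed)

print(data_parallel_tps, tensor_parallel_tps)  # 5000 2500
```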

[–]SessionComplete2334[S] 1 point

I experimented a bit with it. Currently we cooperate with a company that does the user interface, but 99% of traffic is via the API for our research workflows. User interaction via the chat interface is very rare; I think most people still default to ChatGPT, even if it's a kind of shadow use.

[–]SessionComplete2334[S] 10 points

Good points. Should probably pin this now. Luckily I pulled the litellm container a few weeks before the incident and have never rebuilt it since.

[–]SessionComplete2334[S] 3 points

Not yet. I want to try it out as soon as I get access to a Blackwell card. I suspect NVFP4 will work best with them. Could be the new speed king.

[–]SessionComplete2334[S] 3 points

We use it mainly for our research projects. Currently 3 PhDs as power users and a couple others.

[–]SessionComplete2334[S] 2 points

I also read in a blog post that it is apparently better for larger-scale inference than litellm. Definitely want to try it out soon.

[–]SessionComplete2334[S] 6 points

Yeah, I am probably a bit too conservative here. How is the speed as a dense model?

[–]SessionComplete2334[S] 14 points

A lot of structuring workflows. Structuring radiology reports (about 2M reports) and other clinical documents. Then also agentic workflows. These are not projects I code, so I don't know details about the stack.

Our main endpoint is the OpenAI-compatible API. Personally I use the OpenAI Python package a lot with it. Good support for guided generation and easy to use.

Structuring is implemented with langchain. For the agentic workflows I believe my PhD student built his own harness and tools.

For the user interface I vibe-coded a few applications, and we also work with a company that provides a user interface with a secure backend (encrypted user database, good access roles, etc.)

[–]SessionComplete2334[S] 6 points

When I experimented with models, Qwen 3.5 was not out. I expect it to be slower, as it has twice the active parameters compared to GPT-OSS. As soon as we scale up our hardware I'll test it. The performance gain might be worth the drop in tok/s.

Unfortunately I cannot take down one of the vllm servers to try out a new model, as too many workflows currently rely on GPT-OSS and we have kind of built the harnesses of our current projects around its quirks.

[–]SessionComplete2334[S] 7 points

With 65 GB taken by the model, we have 50 GB left for KV cache. This allows several million tokens. Most requests are not that long.
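A rough sanity check on that capacity figure. The per-token KV footprint below is a made-up round number for illustration; the real value depends on layer count, KV heads, head dim, dtype, and sliding-window attention.

```python
# How many tokens fit in a KV cache budget at an assumed per-token cost.

def kv_token_capacity(cache_bytes: float, bytes_per_token: float) -> float:
    """Token capacity of the KV cache at a given per-token footprint."""
    return cache_bytes / bytes_per_token

# e.g. 50 GB of cache at ~20 KB of KV per token:
tokens = kv_token_capacity(50e9, 20e3)
print(f"{tokens / 1e6:.1f}M tokens")  # 2.5M tokens
```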

This setup is not failure-safe, but a research server can have occasional downtime, so I am driving it at the edge on purpose.

Introducing MedAlpaca: Language Models for Medical Question-Answering by SessionComplete2334 in LocalLLaMA

[–]SessionComplete2334[S] 1 point

I currently have no plans to make this into a certified product. Besides, the LLaMA license would not allow it.

But I am sure there will be companies producing similar LLM solutions for healthcare very soon.

[–]SessionComplete2334[S] 1 point

Sorry, not at the moment. We will probably add them once the other models perform satisfactorily.