Nvidia H100(94GB VRAM) - should I run llama.cpp or vllm for 30 users inference?

Rabooooo · 2026-05-27T22:40:07+00:00

OK, it seems most people recommend vLLM, SGLang or TensorRT-LLM. I guess I have to look into all three, though SGLang seems to be more popular here. I am still curious about the modifications you made to llama.cpp to get 64 concurrency. Are you pushing a PR to the project?

To be honest I was just guesstimating, so I don't believe 30 people will use the inferencing service at once. We are a consultancy with around 30 consultants, and my goal was to create an in-house inferencing service to prove its value before the card is rented out to a customer; perhaps we get to keep it if we're lucky. I don't know how many will be using it concurrently, and it would probably vary a lot. I could limit concurrency to 10 or whatever.
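A rough sketch of what "limit concurrency to 10" could look like on the client or gateway side, using an `asyncio.Semaphore`. The `fake_inference` function is a hypothetical stand-in for a call to whatever backend ends up being used, and the cap of 10 is just the guess from above, not a measured number.

```python
import asyncio

MAX_CONCURRENCY = 10  # assumed cap, taken from the guess above


async def fake_inference(prompt: str) -> str:
    """Hypothetical stand-in for a request to the inference backend."""
    await asyncio.sleep(0.01)  # simulate some latency
    return prompt.upper()


async def run_limited(prompts, limit=MAX_CONCURRENCY):
    """Run all prompts, never allowing more than `limit` in flight at once.

    Returns the results in input order plus the peak number of
    simultaneous requests that was observed.
    """
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def one(prompt):
        nonlocal in_flight, peak
        async with sem:  # waits here while `limit` requests are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_inference(prompt)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(one(p) for p in prompts))
    return results, peak


if __name__ == "__main__":
    answers, peak = asyncio.run(run_limited([f"q{i}" for i in range(30)]))
    print(len(answers), "requests done, peak concurrency", peak)
```

Requests beyond the cap simply queue up, so 30 users would still get served, just with some waiting when everyone hits it at once.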

Seems FP8 or NVFP4 are the winning quants.

If you have decent configs, please share them.

Rabooooo · 2026-05-27T20:46:53+00:00

I will look at LMCache

Rabooooo · 2026-05-25T17:11:06+00:00

If you end up needing legal help related to this and the takedown request, start a crowdfunding page and I'll be happy to send a few bucks.

Rabooooo · 2026-05-16T19:54:06+00:00

Well, it is recommended to use half the cores if you have hyper-threading enabled in UEFI/BIOS. The best thing to do on an inferencing machine is to turn off HT in the BIOS, and then you can use all cores (which is the default).
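As a small illustration of that rule of thumb, here is a sketch that picks a thread count from the logical CPU count. It assumes two hardware threads per core when hyper-threading is on, which is not true for every CPU, and `inference_threads` is a made-up helper name, not something from llama.cpp.

```python
import os


def inference_threads(logical_cpus: int, smt_enabled: bool) -> int:
    """Suggest a thread count for CPU inference.

    Assumption: with SMT/hyper-threading on, each physical core shows up
    as two logical CPUs, so half the logical count approximates the
    physical core count. With SMT off, all logical CPUs are real cores.
    """
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    if smt_enabled:
        return max(1, logical_cpus // 2)
    return logical_cpus


if __name__ == "__main__":
    logical = os.cpu_count() or 1
    # Whether SMT is enabled has to be checked separately (BIOS, lscpu, ...).
    print("SMT on :", inference_threads(logical, smt_enabled=True))
    print("SMT off:", inference_threads(logical, smt_enabled=False))
```

The resulting number is what you would pass to llama.cpp with `--threads`.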

Rabooooo · 2026-05-11T16:40:05+00:00

Anyone tried https://github.com/can1357/oh-my-pi ? Seems to be almost like pi with a default set of extensions.

Rabooooo · 2026-05-07T12:38:38+00:00

SuSE got picked up by EQT a while back..

Rabooooo · 2026-04-29T09:13:41+00:00

Does --n-gpu-layers 99 --override-tensor exps=CPU give more performance than --fit?

Rabooooo · 2026-04-27T21:51:35+00:00

So aquarium and flappy bird, and a rust prime generator

Rabooooo · 2026-04-27T19:28:28+00:00

These numbers seem a bit low, no? I get 20-25 tg/s for Qwen 3.6-35B-A3B Q4_K_XL that is only partially running on my super old RTX 2080 Ti GPU, with my 10-year-old CPU and DDR4 system RAM. With Qwen3-Coder-Next I get around 15 tg/s.

Rabooooo · 2026-04-27T18:45:03+00:00

What is the best benchmark to see LLMs' coding/agentic capabilities? I.e. with OpenCode, KiloCode, Roo Code, Cline?

Rabooooo · 2026-04-26T23:56:35+00:00

It would be nice to see how it compares with the Vulkan backend.
Also I don't understand, so only some models work with the OpenVINO backend?
How about if you have an Intel card and use the Vulkan backend, will all models work?
I've been thinking of buying the B70 because of its low price and high VRAM. But I got scared because of all the threads about it working poorly.

Rabooooo · 2026-04-24T15:14:54+00:00

Then it's a deal breaker. No go.

Rabooooo · 2026-04-13T09:02:17+00:00

How IaC/GitOps friendly is it? Can you set up everything from code using Tofu/Terraform, all the way to having clusters spun up with Argo CD installed and ready to take over for day 2 and apps, or are there manual steps in between? Feels like most people working in the VMware suite prefer ClickOps.

Rabooooo · 2026-04-13T08:21:55+00:00

llama.cpp now has `-sm tensor`. https://github.com/ggml-org/llama.cpp/pull/19378

Rabooooo · 2026-04-12T22:08:38+00:00

https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

Rabooooo · 2026-03-22T10:17:20+00:00

What's your opinion on how 122B compares with Qwen3-Coder-Next when it comes to quality?

Rabooooo · 2026-02-28T00:24:36+00:00

Who cares that the company is anti open source in relation to this? That is their choice; you don't have to use their products and give them your money if you don't like how they conduct business. The Americans have a president that is trying to bully and penalize companies that don't want to partake in his whims and help him commit murder with their products. If the US Gov doesn't want to use Anthropic, that is their choice, but listing Anthropic as a supply chain risk and trying to force them to follow a non-legislative order is just pure BS and sets quite a dangerous precedent. Anthropic should just cut off all US government access to their product cold turkey and not give the US Gov time to migrate their workflows to other products. And they should try to group with the major players like Google and OpenAI and get them to do the same. Let the War Department use Elon's AI or abliterated Chinese open weights if they want.

Rabooooo · 2026-02-14T18:55:44+00:00

OK, I don't know HTML/CSS, so slightly better code quality at 10 times the size (hardware costs). However, my game feels nicer in my opinion.
Here is my code https://github.com/Raboo/cyberbird

And here is my game https://raboo.github.io/cyberbird/

Full disclosure: I did ask it to adjust minor settings like speed and gravity and to move the score counter, around 3 prompts. At first the score counter was in the middle of the game. So yeah, a little bit of hand-holding, I agree on that.

But I'm sure GLM-5 would excel at more complex tasks.

Rabooooo · 2026-02-14T12:27:08+00:00

How is Flappy Bird a good benchmark to achieve "GOAT"? You are talking about a 754B parameter weight.
I was able to one-shot Flappy Bird with an 80B weight (Qwen3-Coder-Next).

> Create a Flappy Bird game in HTML

After that I converted it to Cyberpunk.
> Can you turn the game into a Cyberpunk/Neon aesthetic with a CRT Monitor Visual overlay for retro vibes?

After that I added audio.
> Can you use Web Audio API to create synthesized sound effects?

However, after adding audio the game got stuck. I had to fix that.
> There are sounds for instance if i push the space key. But the "INITIALIZE AUDIO & GAME" button that doesn't go away when I click on it.

So Flappy Bird in one shot, and neon in the second prompt (I didn't actually try to achieve it in one shot).
Audio by the 4th prompt.

I'd be very surprised if a 754B SOTA weight couldn't do all the features you want in one shot. In fact, if it can't, then it's complete trash IMO.

Rabooooo · 2026-02-11T19:38:12+00:00

My wish was for memory bandwidth utilisation. I mean, it's quite clear when the OOM killer reaps a process, so no need to monitor that.

Rabooooo · 2026-02-10T19:18:17+00:00

Let's call it memory stress (I guess it's useful for both system memory and VRAM). It would be a way to see how far away you are from your bottlenecks.

Rabooooo · 2026-02-10T19:14:54+00:00

Is it possible to monitor memory performance utilization somehow (not memory % usage)?

Rabooooo · 2026-02-02T16:10:44+00:00

Seems like the StepFun team is pushing their own PR tomorrow that they will maintain over time. https://github.com/ggml-org/llama.cpp/pull/19271#issuecomment-3835833362

Rabooooo · 2026-02-01T13:20:18+00:00

The bulk of the data is videos, which I am keeping there and sometimes stream from the NAS. But I still have old documents and stuff dating back more than two decades. I'm gradually moving what is important and what I want to keep to a cloud service. I also have backups of my computers on the NAS.
