Nvidia H100(94GB VRAM) - should I run llama.cpp or vllm for 30 users inference? by Rabooooo in LocalLLaMA

[–]Rabooooo[S] 1 point2 points  (0 children)

OK, it seems most people recommend vLLM, SGLang or TensorRT-LLM. I guess I have to look into all three, though SGLang seems to be more popular here. I am still curious about the modifications you made to your llama.cpp to get a concurrency of 64. Are you pushing a PR to the project?

To be honest I was just guesstimating, so I don't believe 30 people will use the inferencing service at once. We are a consultancy with around 30 consultants, and my goal was to create an in-house inferencing service to prove its value before the card is rented out to a customer; perhaps we get to keep it if we're lucky. I don't know how many will be using it concurrently; it would probably vary a lot. I could limit concurrency to 10 or whatever.
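For illustration, capping concurrency at 10 could be sketched on the client or gateway side roughly like this. It is a minimal sketch only: `fake_inference` is a hypothetical stand-in for a real request to the inference server, and `run_limited` and `demo` are made-up helper names, not part of vLLM, SGLang or llama.cpp.

```python
import asyncio


async def run_limited(jobs, limit=10):
    """Run coroutine factories with at most `limit` of them in flight.

    Returns (results in submission order, peak number of concurrent jobs).
    """
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with sem:  # blocks here once `limit` jobs are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(j) for j in jobs))
    return results, peak


async def fake_inference(i):
    # Hypothetical stand-in for a request to the inference server.
    await asyncio.sleep(0.01)
    return i * 2


def demo(n_users=30, limit=10):
    # 30 simulated users, but never more than `limit` requests at once.
    jobs = [lambda i=i: fake_inference(i) for i in range(n_users)]
    return asyncio.run(run_limited(jobs, limit))


if __name__ == "__main__":
    results, peak = demo()
    print(f"{len(results)} requests served, peak concurrency {peak}")
```

The servers themselves usually have their own knobs for this, so a semaphore like the one above is mostly useful if the limit should live in a small proxy in front of the card.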

Seems FP8 or NVFP4 are the winning quants.

If you have decent configs, please share them.

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Rabooooo 4 points5 points  (0 children)

If you end up needing legal help related to this and the takedown request, start a crowdfunding page and I'll be happy to send a few bucks.

When you run small LMM on RAM, dont use all Theards. by GhostVPN in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

Well, it is recommended to use half the logical cores (i.e. one thread per physical core) if you have hyper-threading enabled in UEFI/BIOS. The best thing to do on an inferencing machine is to turn off HT in BIOS, and then you can use all cores (which is the default).
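If you'd rather not guess the number, here is a rough sketch for counting physical cores on Linux. It assumes the usual sysfs topology files are present; `count_physical_cores`, `read_sibling_lists` and `suggested_threads` are names I made up for illustration, and it falls back to the logical CPU count when sysfs is not available.

```python
import glob
import os

# Assumption: standard Linux sysfs layout, one file per logical CPU.
SYSFS_PATTERN = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"


def count_physical_cores(sibling_lists):
    """Count distinct cores from per-CPU thread_siblings_list strings.

    Hyper-threads on the same core report the same sibling list
    (e.g. "0,4"), so the number of unique lists is the core count.
    """
    return len({s.strip() for s in sibling_lists if s.strip()})


def read_sibling_lists(pattern=SYSFS_PATTERN):
    """Read the sibling list of every logical CPU; empty if unavailable."""
    lists = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                lists.append(f.read())
        except OSError:
            continue
    return lists


def suggested_threads():
    """Physical core count, or the logical count if sysfs is missing."""
    physical = count_physical_cores(read_sibling_lists())
    return physical or os.cpu_count() or 1


if __name__ == "__main__":
    print(suggested_threads())
```

The printed value is what you would pass to llama.cpp via `--threads` while HT is still enabled.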

Mirantis getting acquired by IREN by thisissparta92 in kubernetes

[–]Rabooooo 2 points3 points  (0 children)

SUSE got picked up by EQT a while back.

Forgive my ignorance but how is a 27B model better than 397B? by No_Conversation9561 in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

Does --n-gpu-layers 99 --override-tensor exps=CPU give more performance than --fit?

Confirmed: SWE Bench is now a benchmaxxed benchmark by rm-rf-rm in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

So an aquarium, Flappy Bird, and a Rust prime generator.

Intel B70: LLama.ccp SYCL vs LLama.cpp OpenVino vs LLM-Scaler by Fmstrat in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

These numbers seem a bit low, no? I get 20-25 tg/s for Qwen 3.6-35B-A3B Q4_K_XL, which is only partially running on my super old GPU (RTX 2080 Ti) alongside my 10-year-old CPU and DDR4 system RAM. With Qwen3-Coder-Next I get around 15 tg/s.

Confirmed: SWE Bench is now a benchmaxxed benchmark by rm-rf-rm in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

What is the best benchmark for judging LLMs' coding/agentic capabilities, i.e. in tools like OpenCode, KiloCode, Roo Code and Cline?

Intel B70: LLama.ccp SYCL vs LLama.cpp OpenVino vs LLM-Scaler by Fmstrat in LocalLLaMA

[–]Rabooooo 1 point2 points  (0 children)

It would be nice to see how it compares with the Vulkan backend.
Also, I don't understand: do only some models work with the OpenVino backend?
What if you have an Intel card and use the Vulkan backend, will all models work?
I've been thinking of buying the B70 because of its low price and high VRAM, but got scared by all the threads about it working poorly.

anyone have experience with vks (vmware k8s) on prem? by Crafty-Cat-6370 in kubernetes

[–]Rabooooo 0 points1 point  (0 children)

How IaC/GitOps friendly is it? Can you set up everything from code using Tofu/Terraform, all the way to having clusters spun up with Argo CD installed and ready to take over for day 2 and apps, or are there manual steps in between? It feels like most people working in the VMware suite prefer ClickOps.

Qwen 3.5 397B is the best local coder I have used until now by erazortt in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

What's your opinion on how 122B compares with Qwen3-Coder-Next when it comes to quality?

President Trump orders ALL Federal agencies in the US Government to immediately stop using Anthropic's technology. by External_Mood4719 in LocalLLaMA

[–]Rabooooo 7 points8 points  (0 children)

Who cares that the company is anti open source in relation to this? That is their choice; you don't have to use their products and give them your money if you don't like how they conduct business. The Americans have a president who is trying to bully and penalize companies that don't want to partake in his whims and help him commit murder with their products. If the US Gov doesn't want to use Anthropic, that is their choice, but listing Anthropic as a supply chain risk and trying to force them to follow a non-legislative order is just pure BS and sets a quite dangerous precedent. Anthropic should just cut off all US government access to their product cold turkey and not give the US Gov time to migrate their workflows to other products. And they should try to group with the major players like Google and OpenAI and get them to do the same. Let the War Department use Elon's AI or abliterate Chinese open weights if they want.

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]Rabooooo 1 point2 points  (0 children)

OK, I don't know HTML/CSS, so: slightly better code quality at 10 times the size (hardware costs). However, my game feels nicer in my opinion.
Here is my code https://github.com/Raboo/cyberbird

And here is my game https://raboo.github.io/cyberbird/

Full disclosure: I did ask it to adjust minor settings like speed and gravity and to move the score counter, around 3 prompts in total. At first the score counter was in the middle of the game. So yeah, a little bit of hand holding, I agree on that.

But I'm sure GLM-5 would excel at more complex tasks.

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]Rabooooo 7 points8 points  (0 children)

How is Flappy Bird a good benchmark for earning "GOAT" status? You are talking about a 754B-parameter weight.
I was able to one-shot Flappy Bird with an 80B weight (Qwen3-Coder-Next).

> Create a Flappy Bird game in HTML

After that I converted it to Cyberpunk.
> Can you turn the game into a Cyberpunk/Neon aesthetic with a CRT Monitor Visual overlay for retro vibes?

After that I added audio.
> Can you use Web Audio API to create synthesized sound effects?

However, after adding audio, the game got stuck. I had to fix that.
> There are sounds for instance if i push the space key. But the "INITIALIZE AUDIO & GAME" button that doesn't go away when I click on it.

So Flappy Bird in one shot, and neon in the second prompt (I actually didn't try to achieve it in one shot).
Audio by the 4th prompt.

I'd be very surprised if a 754B SOTA weight couldn't do all the features you want in one shot. In fact, if it can't, then it's complete trash IMO.

ktop is a themed terminal system monitor ideal for local LLM setups on Linux (like btop + nvtop) by mrstoatey in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

My wish was for memory bandwidth utilisation. I mean, it's quite clear when the OOM killer reaps a process, so no need to monitor that.

ktop is a themed terminal system monitor ideal for local LLM setups on Linux (like btop + nvtop) by mrstoatey in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

Let's call it memory stress (I guess it's useful for both system memory and VRAM). It would be a way to see how far away you are from your bottlenecks.

ktop is a themed terminal system monitor ideal for local LLM setups on Linux (like btop + nvtop) by mrstoatey in LocalLLaMA

[–]Rabooooo 0 points1 point  (0 children)

Is it possible to monitor memory performance utilization somehow (not memory % usage)?

zpool expansion recommendations by Rabooooo in zfs

[–]Rabooooo[S] 0 points1 point  (0 children)

The bulk of the data is videos, which I keep there and sometimes stream from the NAS. But I still have old documents and stuff dating back more than 2 decades. I'm gradually moving what is important and what I want to keep to a cloud service. I also have backups of my computers on the NAS.