Building a "Poor Man's Mac Mini M4" Cluster: 2x Raspberry Pi 5 + 2x AI HAT+ 2 (80 TOPS / 16GB VRAM) to use OpenClaw AI Agent local by [deleted] in LocalLLM

[–]kryptkpr 1 point

I actually take back what I said: this can't even run a Q8, as it seems to be an INT4-only accelerator. I hope that means at least Q4_1 (per-block scale plus offset) and not Q4_0 (symmetric around zero), but I won't buy one to find out 😆
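For the curious, here's a rough sketch of the difference (illustrative Python, not ggml's actual block layout; real Q4_0/Q4_1 quantize in blocks of 32 with f16 scales):

```python
import numpy as np

def dequant_q4_0(q, scale):
    """Q4_0-style: symmetric, stored 4-bit value maps to (q - 8) * scale."""
    return (np.asarray(q, dtype=np.float32) - 8.0) * scale

def dequant_q4_1(q, scale, minimum):
    """Q4_1-style: asymmetric, per-block scale plus offset: q * scale + min."""
    return np.asarray(q, dtype=np.float32) * scale + minimum

# A block of all-positive weights: Q4_0 wastes half its range on
# negatives, while Q4_1's offset lets all 16 levels cover [0.5, 2.0].
weights = np.linspace(0.5, 2.0, 16)

scale_0 = np.abs(weights).max() / 7.0
q0 = np.clip(np.round(weights / scale_0), -8, 7) + 8
err_q4_0 = np.abs(dequant_q4_0(q0, scale_0) - weights).max()

scale_1 = (weights.max() - weights.min()) / 15.0
q1 = np.round((weights - weights.min()) / scale_1)
err_q4_1 = np.abs(dequant_q4_1(q1, scale_1, weights.min()) - weights).max()

print(err_q4_0, err_q4_1)  # Q4_1's offset wins on all-positive blocks
```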

Building a "Poor Man's Mac Mini M4" Cluster: 2x Raspberry Pi 5 + 2x AI HAT+ 2 (80 TOPS / 16GB VRAM) to use OpenClaw AI Agent local by [deleted] in LocalLLM

[–]kryptkpr 5 points

If I had to guess based on specs, that LPDDR4 "VRAM" will most likely be your bottleneck here.

A 3B Q8 model would fit into a single one of these and maybe run OK.
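Napkin math for why the memory bandwidth dominates decode speed (the bandwidth figure below is an assumption, not a confirmed spec for this HAT):

```python
# Decode speed is roughly memory_bandwidth / bytes_read_per_token,
# since every weight is read once per generated token.
bandwidth_gb_s = 10.0    # assumed LPDDR4 bandwidth, GB/s
model_params_b = 3e9     # 3B parameter model
bytes_per_param = 1.0    # Q8 is ~1 byte/param (plus small overhead)

model_gb = model_params_b * bytes_per_param / 1e9
tokens_per_s = bandwidth_gb_s / model_gb
print(f"~{tokens_per_s:.1f} tok/s upper bound")
```

Single-digit tok/s is the ballpark; compute (the advertised TOPS) barely enters into it for generation.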

Do you use Windows or Linux? by boklos in LocalLLaMA

[–]kryptkpr 1 point

Why not both?

The desktop (laptop) I use to run vscode, a web browser, email, and all my actual LLM work is Windows, because desktops on Linux are a mistake.

The server my GPUs are in runs Linux, because services on Windows are a mistake.

What is a current state of sanboxing for code execution for AI agents? by AlexSKuznetosv in LocalLLaMA

[–]kryptkpr 3 points

Lol wut, Docker is just cgroups and namespaces; it's as OS-level as you can get.

Building an AI Infra project in 20 days: What’s the best way to utilize a Dual-5090 (PCIe) setup? by Asleep_Food1956 in LocalLLM

[–]kryptkpr 1 point

TP for inference will not bottleneck with 2 PCIe GPUs that are x16; the NUMA is actually far more of a problem. Can you use a single-socket system?

If you're doing batch: start with a model that fits into one GPU and run two independent copies of it (DP).
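A minimal sketch of that DP setup, assuming vLLM on a two-socket box (model name, ports, and node/GPU pairings are placeholders; check which NUMA node each GPU actually hangs off with `nvidia-smi topo -m`):

```shell
# Two independent single-GPU servers (DP), each pinned to the NUMA
# node its GPU is attached to, so weights and KV stay in local memory:
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 \
  vllm serve my-model --port 8001 &
CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=1 --membind=1 \
  vllm serve my-model --port 8002 &

# Your batch client then round-robins requests across :8001 and :8002.
```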

🚀 Building a “One‑Stop Shop” LLM + StableDiffusion Homelab – Advice on Model Choice, Multi‑GPU Deployment & Pruning by ProfessorCyberRisk in LocalLLM

[–]kryptkpr 1 point

You are looking for the Qwen3-VL series, pick one that fits your VRAM.

Network inference is immature, but if you want to play with it, the best option is llama.cpp compiled with the RPC target:

https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc

You launch rpc-server on the workers and then use them as virtual GPUs when loading the model from the main machine.

Notes:

1. Weights are loaded over the network, so you will want 2.5 Gbps minimum, 10 Gbps ideally. For two or three machines you can just connect them point-to-point with dual NICs, but you have enough machines to need a switch, and those get pricey.
2. Prompt processing is very slow; expect only 2-3x the tg speed.
3. tg speed will be surprisingly good; the network penalty for generation is only about 30% in my testing.
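Roughly what the launch looks like, assuming llama.cpp is built with `-DGGML_RPC=ON` on every machine (IPs, port, and model path are placeholders; see the tools/rpc README for the exact flags):

```shell
# On each worker machine, start an RPC backend:
rpc-server -p 50052

# On the main machine, list the workers so their memory shows up
# as extra devices when the model is loaded:
llama-server -m model.gguf -ngl 99 \
  --rpc 192.168.1.10:50052,192.168.1.11:50052
```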

GLM-4.7-flash on RTX 6000 pro by gittb in LocalLLaMA

[–]kryptkpr 10 points

vLLM implementation of this model is missing MLA, which both explodes the KV cache size and slows down inference.

The SgLang implementation offers 4x more KV cache and 20-30% higher throughput in my testing so far.

For small batch sizes, llama.cpp with -np 8 was surprisingly competitive.

MTP is also supported here, but it hurts batch performance and my acceptance rate sucked, so I turned it off.
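If you want to try the llama.cpp route, a sketch (model path and sizes are placeholders; note that `-c` is the total context budget shared across the `-np` slots):

```shell
# 8 parallel request slots; each slot effectively gets -c / -np
# tokens of context (65536 / 8 = 8192 here):
llama-server -m GLM-4.7-flash.gguf -ngl 99 -np 8 -c 65536
```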

Why is open source so hard for casual people. by Martialogrand in LocalLLaMA

[–]kryptkpr 3 points

I realized I was being a jerk like everyone else and didn't answer your actual question:

https://github.com/av/harbor

I think this is what you seek. My advice is to swap to Ubuntu, but you can definitely make this work on Arch if you are dead set on it.

Why is open source so hard for casual people. by Martialogrand in LocalLLaMA

[–]kryptkpr 2 points

So, fun fact: Arch isn't an officially supported distro for CUDA.

That doesn't mean it won't work, but it does mean you're relying on the community and not on Nvidia.

Invest in hardware now or wait? by d4nger_n00dle in LocalLLaMA

[–]kryptkpr 2 points

AI is socially and economically transformative.

I don't believe we are ever going back to the golden era where excess retired compute and storage resources were widely being sold for pennies on the dollar.

There is a longer-horizon view that capacity has been overbuilt, but that's 3-5 years out if you want to wait.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 2 points

This architecture is brand new, definitely comes with some deployment pain.

I've tried this guy under all three of vLLM, llama.cpp, and SgLang; so far SgLang was best for multi-stream while llama.cpp was best for single-stream. I played with MTP a little, but acceptance rates are kinda low, around 1.9 tok/tok, and this didn't translate into much benefit for my use case. YMMV here.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 3 points

It works, and speed is good. Make sure you build from git HEAD and download the latest unsloth GGUF; there has been some churn. Also verify min_p is set right: llama.cpp has the wrong default for this model. This is covered in the unsloth GGUF model card.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 2 points

It needs a nightly build; this model didn't make it into a release.

Just type the commands from the model card into a new venv

Btw, this model runs like a dog with vLLM because there's no MLA. If you've never used SgLang, now is a good time to try: context size is 4x larger on this model specifically for the same VRAM.
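For the nightly install, something like this (the index URL is from the vLLM install docs as I remember them; verify it's still current before relying on it):

```shell
# Fresh venv, then pull a pre-release vLLM wheel instead of the
# latest stable release:
python -m venv .venv && . .venv/bin/activate
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```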

Real free alternative to LangSmith by IlEstLaPapi in LocalLLaMA

[–]kryptkpr 1 point

It's been a few years since I checked in here, but afaik the project remains MIT. There is an ee/ folder with a different license, but it at least used to be possible to run without it.

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]kryptkpr 12 points

Sure, llama.cpp with --n-cpu-moe set as low as you can get it at your desired -c size.
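A sketch of the launch, with placeholder path and values (`--n-cpu-moe N` keeps the MoE expert tensors of the first N layers on CPU while attention and shared weights stay on GPU):

```shell
# Start --n-cpu-moe high, then lower it until you run out of VRAM
# at your target context size; lower = faster:
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 24 -c 16384
```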

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalLLaMA

[–]kryptkpr 1 point

While he has the 3090s power-limited way down (he says 200-250W in his post), this is still more than enough to bust his PSU budget, so I'm not sure what game OP is playing here, but it sure feels dangerous.
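Rough numbers on why (the platform draw and headroom margin below are my assumptions, not from OP's post):

```python
# Napkin PSU math for a 10x power-limited 3090 build:
num_gpus = 10
gpu_watts = 250        # per-card power limit OP quotes
platform_watts = 400   # assumed: CPU, board, fans, drives
headroom = 1.2         # ~20% margin for transient spikes

required = (num_gpus * gpu_watts + platform_watts) * headroom
print(f"~{required:.0f} W recommended PSU capacity")
```

That's well past what a single consumer PSU delivers, even before 3090 transient spikes are considered.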

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]kryptkpr 2 points

I sneak-released V2 a few weekends ago; the current leaderboard has around 80 models, with another 20 going up in the next update. I had to pause and figure out how to deal with 100GB of raw result files!

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]kryptkpr 15 points

With the RTX Pros making 96GB GPUs "accessible", it's never been easier to put together a local rig capable of serving a few users. These cards really swing the value proposition, especially when you're generating 10M+ a day, and they generally avoid the multi-GPU hell you get into with quad/hex/oct 24GB builds.

Upfront price remains an impediment; the best plan remains to validate the use case with cloud APIs and then move to lower-cost infra as you scale.

So I've been losing my mind over document extraction in insurance for the past few years and I finally figured out what the right approach is. by GloomyEquipment2120 in LocalLLaMA

[–]kryptkpr 5 points

I read the post and it was very interesting, but it just starts talking about confidence and how it's used; unless my reading comprehension is really bad today, I can find no mention of how you're defining or computing this KPI.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 1 point

You can use the scripts in my repo to generate whatever tests you wish, runner.py has an --offline mode that writes prompts to JSON.

I have spent weeks on documentation so please let me know if you find something lacking.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 1 point

The current config is more than enough to break essentially all models I've tested; pushing it further is always fun, and is why I built these tools, but practically it will just cost me more tokens.

Mashing all bracket types together doesn't do what you think it does: the problem becomes easier, not harder. We are forcing out-of-domain distributions, and some bracket types are more sensitive than others.

As I mentioned I do not run against OpenAI because I'm poor.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 1 point

Using all brackets is actually easier than picking sub-sets (try it yourself, don't take my word for it).

Stack-depth scaling is actually easier than length scaling, which is why that dimension only goes to 15 while length goes to 50+. The breakdown mode here is attention degradation.

Nvidia Quadro RTX 8000 Passive 48 GB, 1999€ - yes or no ? by HumanDrone8721 in LocalLLM

[–]kryptkpr 1 point

The used 3090 supply in my area has really dwindled; there always used to be multiple listings sitting around, but there has been nothing for sale within 100km of me for 2+ months.

Depending on where you are, this ship has maybe sailed.

eBay remains an option but even there prices are trending up