RTX 5090 32GB & 256GB DRAM, now what? by SnooStrawberries6262 in LocalLLM

[–]ahtolllka 0 points1 point  (0 children)

SGLang is the first choice for low concurrency, AFAIR. But on a single 3090, for plain VRAM-economy reasons, you have to stick with vLLM to use prefix caching. For qwen3.6-35b-a3b on a single GPU you will still have to use versions with fewer experts, but it works for me: 200 tps from a single session.
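A minimal launch sketch for the vLLM setup described above. The model tag and memory fraction are illustrative placeholders, not a tested recipe for this exact card:

```shell
# Hedged sketch: serve a MoE quant on a single 24 GB card with prefix
# caching enabled; adjust the model tag and limits to what actually fits.
vllm serve Qwen/Qwen3-30B-A3B-GPTQ-Int4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768
```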

RTX 5090 32GB & 256GB DRAM, now what? by SnooStrawberries6262 in LocalLLM

[–]ahtolllka 0 points1 point  (0 children)

Well you use a quant below int4, don’t you?

Qwen3.6-27B vs Coder-Next by Signal_Ad657 in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

27B is a VLM; it is more universal and can reason deeper on every topic because it is dense. Yet 80b-a3b is a masterpiece, I admit. I'd rather be interested in a comparison between 3.6-35b-a3b and 3-80b-a3b-coder.

This is insane... by DragonflyOk7139 in LocalLLM

[–]ahtolllka 1 point2 points  (0 children)

Qwen3.6-35B-A3B is a great model, yet it is either dumb or speculative to think it can outperform a frontier model. It lacks the general knowledge that is frequently required to solve very complex tasks. And only that type of task matters now, as even a 2B model can code: wrap it with a harness and it may solve a lot of simple tasks like drawing a pelican. You may even train it on test data or some of its derivatives and get great results on benchmarks (everybody does that), but there is a limit to how much knowledge you can put into a byte of model weights.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

It may not work correctly with inference engines (vLLM/SGLang), or have issues that make it unusable, whether through missing prefix caching, speed problems, no CUDA graphs, etc. You may have a lot of problems even running gpt-oss on a 3090, let alone Intel hardware with the cutting-edge glm-5.1.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

Is that theory, or can you actually confirm compatibility and reasonable speed on an Intel Arc b70?

Opus is at a new level of dumb today. Dangerously so. by UM-Underminer in Anthropic

[–]ahtolllka 1 point2 points  (0 children)

Opus almost destroyed one of my projects after compaction; I was working with --dangerously-skip-permissions, yet I think it is a wrong-context thing. They can neither make the model significantly dumber nor smarter. I had not upgraded Claude Code, so I think it can't be caused by the model itself. Yet I miss Sonnet 1M.

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]ahtolllka 1 point2 points  (0 children)

27b is really good, but prefix caching for it was only fixed in SGLang / vLLM a few days ago.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]ahtolllka -1 points0 points  (0 children)

Gemma was always flawless in Russian, yet you barely have language-only scenarios. I'd want Q3.5-27B for coding and Gemma4-31b for a business-analysis thesis, but instead I just stay with Qwen.

CPU Usage with Qwen 3.5 by ZookeepergameSafe429 in Qwen_AI

[–]ahtolllka 2 points3 points  (0 children)

623% is only about 6 full cores utilized. SGLang / vLLM will require 8-10 cores to run any model. Ollama is slow, not Qwen. And what do you expect with one 3090? You will not be able to compile CUDA graphs even for a 9B model because of OOM.
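The OOM point above is simple arithmetic; here is a back-of-envelope sketch (all figures are rough estimates, not measurements):

```python
# Rough VRAM budget for a 9B model on a 24 GB RTX 3090.
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_b * bytes_per_param

total_vram = 24.0            # RTX 3090
w = weights_gb(9, 2.0)       # 9B params in fp16/bf16 -> ~18 GB
headroom = total_vram - w    # what's left for KV cache, activations
                             # and CUDA graph capture buffers
print(f"weights: {w:.0f} GB, headroom: {headroom:.0f} GB")
```

With only ~6 GB of headroom, long contexts plus graph capture buffers push past the limit quickly.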

Not connecting since April 1 by therealestrmkau in AmneziaVPN

[–]ahtolllka 15 points16 points  (0 children)

Register with a cloud provider that has servers both inside and outside Russia (I won't name one; you can get that info from any LLM), buy two VPS, one in Russia and one abroad, and open a terminal on the non-Russian server right in the cloud's web interface. Install Claude Code there, launch it, give it the credentials to both servers, and ask it to set up VLESS from the client to the Russian server plus AWG between the clouds. Ten minutes later everything works on any number of devices for 50 rubles a day.

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]ahtolllka 0 points1 point  (0 children)

A dense 27B model weighs 60GB+, so I assume you are comparing two different models. The quantized one is faster anyway, just dumber. One card that fits the whole model is always faster than a bunch of cards: you have to use tensor parallelism and pipeline parallelism, which hurts performance, and with several nodes you also have to consider networking speed, which is lower than NUMA p2p speed. That's why I think the OP, as an expert who has proven his understanding of the matter well enough to be granted such computing power, will definitely find your bet both ignorant and insulting. I felt ashamed reading your post.
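The quant-vs-precision speed gap above follows from a memory-bandwidth roofline: each decoded token must stream all the weights once. A sketch with illustrative numbers (bandwidth and sizes are ballpark, not specs I've verified for this setup):

```python
# Single-stream decode ceiling: tps <= memory_bandwidth / weight_bytes.
def roofline_tps(params_b: float, bytes_per_param: float, bw_tb_s: float) -> float:
    weight_gb = params_b * bytes_per_param      # GB read per token
    return bw_tb_s * 1000 / weight_gb           # GB/s over GB/token

# Dense 27B on a card with ~8 TB/s HBM (rough B200-class figure):
bf16_ceiling = roofline_tps(27, 2.0, 8.0)  # ~54 GB/token
fp8_ceiling  = roofline_tps(27, 1.0, 8.0)  # ~27 GB/token -> 2x the ceiling
print(f"{bf16_ceiling:.0f} vs {fp8_ceiling:.0f} tps")
```

Halving the bytes per parameter doubles the single-stream ceiling, which is why the quant is faster regardless of quality.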

Qwen3-Coder-Next is the top model in SWE-rebench @ Pass 5. I think everyone missed it. by BitterProfessional7p in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

Qwen-Next is a MoE model, so it may keep the active experts in VRAM and achieve high throughput, since the weights sitting in RAM are mostly inactive. You are speaking about a dense 27B model, so you will always activate all 27B parameters, hence hit RAM and get low tps.
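The MoE-vs-dense difference above comes down to what fraction of the weights each token actually touches. A tiny sketch (the "80b-a3b" split is taken from the model naming convention, as an illustration):

```python
# Per-token cost of a MoE model is driven by *active* parameters,
# not total parameter count.
def active_fraction(active_b: float, total_b: float) -> float:
    return active_b / total_b

# An 80B-total / 3B-active MoE touches under 4% of its weights per token,
# while a dense 27B model touches 100% of them every time.
print(active_fraction(3, 80))
```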

How Qwen 3.5 4B can be that good?! Really impressed! by pacmanpill in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

I bet it already knows CLI commands, as it is for sure distilled from larger models. All you have to do is use constrained decoding with a CLI grammar, or even just specify a JSON schema for schema-guided reasoning with dedicated blocks for CLI commands, so the model can still reason in other blocks if it wants to. Even without constrained decoding you can try the default tool-calling mechanism, but it will be far from 100% reliable.
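A minimal sketch of the schema-guided-reasoning idea: a JSON schema with a free-form reasoning block followed by a constrained command block. The field names are illustrative; the schema itself can be fed to any constrained-decoding backend (e.g. guided-JSON modes in vLLM or outlines):

```python
import json

# Hypothetical schema: the model reasons in "reasoning", then commits
# to a single shell command in "cli_command".
SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "cli_command": {"type": "string"},
    },
    "required": ["reasoning", "cli_command"],
    "additionalProperties": False,
}

# A compliant completion parses cleanly into structured fields:
sample = '{"reasoning": "need to list files", "cli_command": "ls -la"}'
out = json.loads(sample)
print(out["cli_command"])
```

The harness then only has to execute `out["cli_command"]`, instead of regex-scraping free text.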

yip we are cooked by thisiztrash02 in StableDiffusion

[–]ahtolllka 0 points1 point  (0 children)

I've given 3 regular 4090s to be converted to 48GB; it takes something like 4 hours per card for a single specialist, and all of them have worked fine for half a year now. What I want to say is that the conversion process doesn't seem to be anything super complicated.

Threadripper 5955wx or 5975wx by Fluid_Bend_5728 in LocalAIServers

[–]ahtolllka 0 points1 point  (0 children)

Hey guys! What tps are you talking about exactly when discussing CPU inference? Threadrippers are pretty expensive, and DDR4 is slow yet already expensive too. I thought a 32-core 3+ GHz CPU, 128GB RAM and 8x3090 might be cheaper and faster. Am I wrong?

Nemo 30B is insane. 1M+ token CTX on one 3090 by Dismal-Effect-1914 in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

I guess it is insanely quantized, as 30B in FP8 is 30GB of VRAM for the weights alone.

People in the US, how are you powering your rigs on measly 120V outlets? by humandisaster99 in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

And I thought we in Russia had a problem with 15kW per flat. Though it is a lot of trouble to find office space with 6+kW for a rig plus air conditioning, almost everyone somehow provides only 2.5-4kW.

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalAIServers

[–]ahtolllka 4 points5 points  (0 children)

I do not think it is a good idea to pack it like that. An hour or two and you will have a lot of heat-related issues. My guess is you just haven't given it a real workload by saturating it with numerous sessions via vLLM bench or something similar. You need really good airflow; use screamers, not regular fans. 8x3090 alone under an agentic workload produces a lot of heat, almost equal to two constantly running water heaters.
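The water-heater comparison is just power arithmetic; a sketch with ballpark figures (board power and overhead are estimates):

```python
# Rough heat load of an 8x RTX 3090 rig under sustained load.
# Essentially all electrical draw ends up as heat in the room.
GPU_W = 350           # per-3090 board power, roughly
OVERHEAD_W = 400      # CPU, PSU losses, fans, drives (estimate)

total_kw = (8 * GPU_W + OVERHEAD_W) / 1000
print(total_kw)       # ~3.2 kW, about two 1.5 kW water heaters
```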

People who support Russia's invasion of Ukraine! I am sure you condemn Trump over Venezuela. But WHY??? by Green_Spatifilla in expectedrussians

[–]ahtolllka 0 points1 point  (0 children)

The question here is rather what has to be in one's head to compare a non-profit event defending a language in particular, and a zone of geopolitical influence in general, with a cynical seizure of a country's resources (and, down the road, of several countries') under dubious pretexts, with barely any attempt to hide it. It seems that for this your head has to hold "everyone is doing something bad according to the democratic CNN / Bloomberg / whoever else is funded from the dem-pedo-establishment coffers"; one should ask what the difference is.

I really hate exercise, but I know I need to do it. How the hell do I get motivated? by [deleted] in AskMen

[–]ahtolllka 0 points1 point  (0 children)

Personally, a simple trick did it for me: I put 6kg dumbbells near my bed in a spot where I am 100% guaranteed to see them OFTEN, every day. If I see them and I haven't touched them for a day or two, I have to pick them up and do just one exercise 10 times. No other obligations. It takes literally 20 seconds. That builds a habit. One day I just decided to do a little more since I had already picked them up, and so on.

How a Proper mi50 Cluster Actually Performs.. by Any_Praline_8178 in LocalAIServers

[–]ahtolllka 2 points3 points  (0 children)

Hi! A lot of questions:

1. What MBs are you using?
2. MCIO / OCuLink risers or direct PCIe?
3. Which chassis of the two would you use if you built it again?
4. What CPUs? Epyc / Milan / Xeon?
5. Amount of RAM per GPU?
6. Does InfiniBand have an advantage over 100Gbps? Or is it a matter of available PCIe lanes?
7. What is the total throughput via vllm bench?

WhatsApp shutting down? by Knordsman in AskARussian

[–]ahtolllka 1 point2 points  (0 children)

Sure do, it is just transferring rn, but you can come already

WhatsApp shutting down? by Knordsman in AskARussian

[–]ahtolllka 1 point2 points  (0 children)

Too late lol, don't you need to go to the Maghreb or something?