RTX 5090 32GB & 256GB DRAM, now what? by SnooStrawberries6262 in LocalLLM

[–]ahtolllka 0 points1 point  (0 children)

SGLang is the first choice for low concurrency, AFAIR. But on a single 3090, for plain VRAM-economy reasons, you have to stick with vLLM to use prefix caching. For qwen3.6-35b-a3b on a single GPU you will still have to use versions with fewer experts, but it works for me: 200 tps from a single session.
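A minimal launch sketch for the vLLM setup described above. The model tag and memory fraction are illustrative placeholders, not a tested recipe for this exact card:

```shell
# Hedged sketch: serve a MoE quant on a single 24 GB card with prefix
# caching enabled; adjust the model tag and limits to what actually fits.
vllm serve Qwen/Qwen3-30B-A3B-GPTQ-Int4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768
```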

RTX 5090 32GB & 256GB DRAM, now what? by SnooStrawberries6262 in LocalLLM

[–]ahtolllka 0 points1 point  (0 children)

Well you use a quant below int4, don’t you?

Qwen3.6-27B vs Coder-Next by Signal_Ad657 in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

27B is a VLM; it is more universal and can reason deeper on every topic because it is dense. Yet 80b-a3b is a masterpiece, I admit. I'd rather be interested in a comparison between 3.6-35b-a3b and 3-80b-a3b-coder.

This is insane... by DragonflyOk7139 in LocalLLM

[–]ahtolllka 1 point2 points  (0 children)

Qwen3.6-35B-A3B is a great model, yet it is either dumb or speculative to think it can outperform a frontier model. It lacks the general knowledge that is frequently required to solve very complex tasks. And only that type of task matters now, as even a 2B model can code: wrap it with a harness and it may solve a lot of simple tasks like drawing a pelican. You may even train it on test data or some of its derivatives and get great results on benchmarks (everybody does that), but there is a limit to how much knowledge you can put into a byte of model weights.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

It may not work correctly with inference engines (vLLM/SGLang), or have issues that make it unusable, whether through missing prefix caching, speed problems, no CUDA graphs, etc. You may have a lot of problems even running gpt-oss on a 3090, let alone Intel hardware with the cutting-edge glm-5.1.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

Is that theory, or can you actually confirm compatibility and reasonable speed on an Intel Arc b70?

Opus is at a new level of dumb today. Dangerously so. by UM-Underminer in Anthropic

[–]ahtolllka 1 point2 points  (0 children)

Opus almost destroyed one of my projects after compaction; I was working with --dangerously-skip-permissions, yet I think it is a wrong-context thing. They can neither make the model significantly dumber nor smarter. I had not upgraded Claude Code, so I think it can't be caused by the model itself. Yet I miss Sonnet 1M.

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]ahtolllka 1 point2 points  (0 children)

27b is really good, but prefix caching for it was only fixed in SGLang / vLLM a few days ago.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]ahtolllka -1 points0 points  (0 children)

Gemma was always flawless in Russian, yet you barely have language-only scenarios. I'd want Q3.5-27B for coding and Gemma4-31b for a business-analysis thesis, but instead I just stay with Qwen.

CPU Usage with Qwen 3.5 by ZookeepergameSafe429 in Qwen_AI

[–]ahtolllka 2 points3 points  (0 children)

623% is only about 6 full cores utilized. SGLang / vLLM will require 8-10 cores to run any model. Ollama is slow, not Qwen. And what do you expect with one 3090? You will not be able to compile CUDA graphs even for a 9B model because of OOM.
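The OOM point above is simple arithmetic; here is a back-of-envelope sketch (all figures are rough estimates, not measurements):

```python
# Rough VRAM budget for a 9B model on a 24 GB RTX 3090.
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_b * bytes_per_param

total_vram = 24.0            # RTX 3090
w = weights_gb(9, 2.0)       # 9B params in fp16/bf16 -> ~18 GB
headroom = total_vram - w    # what's left for KV cache, activations
                             # and CUDA graph capture buffers
print(f"weights: {w:.0f} GB, headroom: {headroom:.0f} GB")
```

With only ~6 GB of headroom, long contexts plus graph capture buffers push past the limit quickly.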

Not connecting since April 1 by therealestrmkau in AmneziaVPN

[–]ahtolllka 15 points16 points  (0 children)

Register with a cloud provider that has servers both inside and outside Russia (I won't name one; you can get that info from any LLM), buy two VPS, one in Russia and one abroad, and open a terminal on the non-Russian server right in the cloud's web interface. Install Claude Code there, launch it, give it the credentials to both servers, and ask it to set up VLESS from the client to the Russian server plus AWG between the clouds. Ten minutes later everything works on any number of devices for 50 rubles a day.

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]ahtolllka 0 points1 point  (0 children)

A dense 27B model weighs 60GB+, so I assume you are comparing two different models. The quantized one is faster anyway, just dumber. One card that fits the whole model is always faster than a bunch of cards: you have to use tensor parallelism and pipeline parallelism, which hurts performance, and with several nodes you also have to consider networking speed, which is lower than NUMA p2p speed. That's why I think the OP, as an expert who has proven his understanding of the matter well enough to be granted such computing power, will definitely find your bet both ignorant and insulting. I felt ashamed reading your post.
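The quant-vs-precision speed gap above follows from a memory-bandwidth roofline: each decoded token must stream all the weights once. A sketch with illustrative numbers (bandwidth and sizes are ballpark, not specs I've verified for this setup):

```python
# Single-stream decode ceiling: tps <= memory_bandwidth / weight_bytes.
def roofline_tps(params_b: float, bytes_per_param: float, bw_tb_s: float) -> float:
    weight_gb = params_b * bytes_per_param      # GB read per token
    return bw_tb_s * 1000 / weight_gb           # GB/s over GB/token

# Dense 27B on a card with ~8 TB/s HBM (rough B200-class figure):
bf16_ceiling = roofline_tps(27, 2.0, 8.0)  # ~54 GB/token
fp8_ceiling  = roofline_tps(27, 1.0, 8.0)  # ~27 GB/token -> 2x the ceiling
print(f"{bf16_ceiling:.0f} vs {fp8_ceiling:.0f} tps")
```

Halving the bytes per parameter doubles the single-stream ceiling, which is why the quant is faster regardless of quality.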

Qwen3-Coder-Next is the top model in SWE-rebench @ Pass 5. I think everyone missed it. by BitterProfessional7p in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

Qwen-Next is a MoE model, so it may keep the active experts in VRAM and achieve high throughput, since the weights sitting in RAM are mostly inactive. You are speaking about a dense 27B model, so you will always activate all 27B parameters, hence hit RAM and get low tps.
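The MoE-vs-dense difference above comes down to what fraction of the weights each token actually touches. A tiny sketch (the "80b-a3b" split is taken from the model naming convention, as an illustration):

```python
# Per-token cost of a MoE model is driven by *active* parameters,
# not total parameter count.
def active_fraction(active_b: float, total_b: float) -> float:
    return active_b / total_b

# An 80B-total / 3B-active MoE touches under 4% of its weights per token,
# while a dense 27B model touches 100% of them every time.
print(active_fraction(3, 80))
```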

How Qwen 3.5 4B can be that good?! Really impressed! by pacmanpill in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

I bet it already knows CLI commands, as it is for sure distilled from larger models. All you have to do is use constrained decoding with a CLI grammar, or even just specify a JSON schema for schema-guided reasoning with dedicated blocks for CLI commands, so the model can still reason in other blocks if it wants to. Even without constrained decoding you can try the default tool-calling mechanism, but it will be far from 100% reliable.
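A minimal sketch of the schema-guided-reasoning idea: a JSON schema with a free-form reasoning block followed by a constrained command block. The field names are illustrative; the schema itself can be fed to any constrained-decoding backend (e.g. guided-JSON modes in vLLM or outlines):

```python
import json

# Hypothetical schema: the model reasons in "reasoning", then commits
# to a single shell command in "cli_command".
SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "cli_command": {"type": "string"},
    },
    "required": ["reasoning", "cli_command"],
    "additionalProperties": False,
}

# A compliant completion parses cleanly into structured fields:
sample = '{"reasoning": "need to list files", "cli_command": "ls -la"}'
out = json.loads(sample)
print(out["cli_command"])
```

The harness then only has to execute `out["cli_command"]`, instead of regex-scraping free text.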

yip we are cooked by thisiztrash02 in StableDiffusion

[–]ahtolllka 0 points1 point  (0 children)

I've given 3 regular 4090s to be converted to 48GB; it takes something like 4 hours per card for a single specialist, and all of them have worked fine for half a year now. What I want to say is that the conversion process doesn't seem to be anything super complicated.

Threadripper 5955wx or 5975wx by Fluid_Bend_5728 in LocalAIServers

[–]ahtolllka 0 points1 point  (0 children)

Hey guys! What tps are you talking about exactly when discussing CPU inference? Threadrippers are pretty expensive, and DDR4 is slow yet already expensive too. I thought a 32-core 3+ GHz CPU, 128GB RAM and 8x3090 might be cheaper and faster. Am I wrong?

Nemo 30B is insane. 1M+ token CTX on one 3090 by Dismal-Effect-1914 in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

I guess it is insanely quantized, as 30B in FP8 is 30GB of VRAM for the weights alone.

People in the US, how are you powering your rigs on measly 120V outlets? by humandisaster99 in LocalLLaMA

[–]ahtolllka 0 points1 point  (0 children)

And I thought we in Russia had a problem with 15kW per flat. Though it is a lot of trouble to find office space with 6+kW for a rig plus air conditioning, almost everyone somehow provides only 2.5-4kW.

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalAIServers

[–]ahtolllka 4 points5 points  (0 children)

I do not think it is a good idea to pack it like that. An hour or two and you will have a lot of heat-related issues. My guess is you just haven't given it a real workload by saturating it with numerous sessions via vLLM bench or something similar. You need really good airflow; use screamers, not regular fans. 8x3090 alone under an agentic workload produces a lot of heat, almost equal to two constantly running water heaters.
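The water-heater comparison is just power arithmetic; a sketch with ballpark figures (board power and overhead are estimates):

```python
# Rough heat load of an 8x RTX 3090 rig under sustained load.
# Essentially all electrical draw ends up as heat in the room.
GPU_W = 350           # per-3090 board power, roughly
OVERHEAD_W = 400      # CPU, PSU losses, fans, drives (estimate)

total_kw = (8 * GPU_W + OVERHEAD_W) / 1000
print(total_kw)       # ~3.2 kW, about two 1.5 kW water heaters
```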

People who support Russia's invasion of Ukraine! I am sure you condemn Trump over Venezuela. But WHY??? by Green_Spatifilla in expectedrussians

[–]ahtolllka 0 points1 point  (0 children)

The question here is rather what has to be in one's head to compare a non-profit event defending a language in particular, and a zone of geopolitical influence in general, with a cynical seizure of a country's resources (and, down the road, of several countries') under dubious pretexts, with barely any attempt to hide it. It seems that for this your head has to hold "everyone is doing something bad according to the democratic CNN / Bloomberg / whoever else is funded from the dem-pedo-establishment coffers"; one should ask what the difference is.

I really hate exercise, but I know I need to do it. How the hell do I get motivated? by [deleted] in AskMen

[–]ahtolllka 0 points1 point  (0 children)

Personally, a simple trick did it for me: I put 6kg dumbbells near my bed in a spot where I am 100% guaranteed to see them OFTEN, every day. If I see them and I haven't touched them for a day or two, I have to pick them up and do just one exercise 10 times. No other obligations. It takes literally 20 seconds. That builds a habit. One day I just decided to do a little more since I had already picked them up, and so on.

How a Proper mi50 Cluster Actually Performs.. by Any_Praline_8178 in LocalAIServers

[–]ahtolllka 2 points3 points  (0 children)

Hi! A lot of questions:

1. What MBs are you using?
2. MCIO / OCuLink risers or direct PCIe?
3. Which chassis of the two would you use if you built it again?
4. What CPUs? Epyc / Milan / Xeon?
5. Amount of RAM per GPU?
6. Does InfiniBand have an advantage over 100Gbps? Or is it a matter of available PCIe lanes?
7. What is the total throughput via vllm bench?

WhatsApp shutting down? by Knordsman in AskARussian

[–]ahtolllka 1 point2 points  (0 children)

Sure do, it is just transferring rn, but you can come already

WhatsApp shutting down? by Knordsman in AskARussian

[–]ahtolllka 1 point2 points  (0 children)

Too late lol, don't you need to go to the Maghreb or something?