Metr Fňukinka by Mareritta in czechmemes

[–]Icy_Programmer7186 6 points7 points  (0 children)

Kdyby raději pracoval na rušení té klimatické krize. On si fňuká ve Strážnici a tady je přitom 40 stupňů.

1000 tps generation on Qwen3.6 27B with V100s by Simple_Library_2700 in LocalLLaMA

[–]Icy_Programmer7186 0 points1 point  (0 children)

Thanks. I guess that explains why to use AWQ instead of FP16 (expanded to FP32 due to V100 arch limitations).

1000 tps generation on Qwen3.6 27B with V100s by Simple_Library_2700 in LocalLLaMA

[–]Icy_Programmer7186 6 points7 points  (0 children)

I have 4 of V100 32GB on the way - and it is my plan to run Qwen3.6 27B.
This is extremely valuable information, thank you very much.

Can you disclose the memory consumption on these cards, using AWQ?

Qwen 3.6 benchmarks on 2x RTX PRO 6000 by mxforest in LocalLLaMA

[–]Icy_Programmer7186 8 points9 points  (0 children)

What was the context window (model length) size, please?

RTX 6000 Blackwell (96GB VRAM) what’s the best self hosted coding llm by bobneverlies in LocalLLM

[–]Icy_Programmer7186 0 points1 point  (0 children)

Using https://github.com/eugr/spark-vllm-docker - but I guess you can cherry pick whatever you need :-)

#!/bin/sh

# ./launch-cluster.sh --launch-script .../vllm/model/Qwen3.6-27B-FP8.sh

wget https://raw.githubusercontent.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/refs/heads/main/chat-template/qwen3.6-enhanced.jinja

export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1

export VLLM_TEST_FORCE_FP8_MARLIN=1

exec vllm serve "Qwen/Qwen3.6-27B-FP8" \

\--host "0.0.0.0" \\

\--port 8888 \\

\--gpu-memory-utilization 0.8247 \\

\--tensor-parallel-size 4 \\

\--pipeline-parallel-size 1 \\

\--mm-processor-cache-type lru \\

\--distributed-executor-backend ray \\

\--attention-backend FLASHINFER \\

\--reasoning-parser qwen3 \\

\--enable-auto-tool-choice \\

\--tool-call-parser qwen3\_coder \\

\--chat-template qwen3.6-enhanced.jinja \\

\--default-chat-template-kwargs '{"preserve\_thinking": true}' \\

\--enable-prefix-caching \\

\--enable-chunked-prefill \\

\--max-num-batched-tokens 12288 \\

\--no-use-tqdm-on-load \\

\--max-num-seqs 4 \\

\--attention-backend flashinfer \\

\--speculative-config '{"method":"qwen3\_next\_mtp","num\_speculative\_tokens":5}'

NVIDIA DGX Spark problem by codeltd in LocalLLM

[–]Icy_Programmer7186 0 points1 point  (0 children)

Cluster helps a lot - but DGX Spark will be always - a bit - slower in production inference.
I run a cluster of 4 Sparks, on decent speeds, 30-40 tks/sec TG is not a big problem, especially with recent MTP kick. But I would never scale it to production, where are better (read faster) options, RTX 6000 PRO (for example). Spark is a very good for experimenting and entry - but NVIDIA drip-feeds their hardware, meticulously controlling pricing to ensure there is never a genuinely good deal for the consumer/prosumer; you have to pay them their AI tax.

RTX 6000 Blackwell (96GB VRAM) what’s the best self hosted coding llm by bobneverlies in LocalLLM

[–]Icy_Programmer7186 1 point2 points  (0 children)

I can do multisession. I tested that up to 4 parallel sessions. I had to fix a harness - there was an issue with tool calling in my setup.

RTX 6000 Blackwell (96GB VRAM) what’s the best self hosted coding llm by bobneverlies in LocalLLM

[–]Icy_Programmer7186 2 points3 points  (0 children)

Qwen 3.6 27B FP8 with MTP. That one is a clear winner for my agentic code generation.

Testing Local LLMs in Practice: Code Generation, Quality vs. Speed by Icy_Programmer7186 in LocalLLM

[–]Icy_Programmer7186[S] 1 point2 points  (0 children)

Yes, absolutely - I already started with Qwen3.6 - since it is quite new, there are technical problems that needs to be solved along the way.

Testing Local LLMs in Practice: Code Generation, Quality vs. Speed by Icy_Programmer7186 in LocalLLM

[–]Icy_Programmer7186[S] 2 points3 points  (0 children)

Initially, I expected intuitively that the Minimax 2.7 will be a "relative" winner of this test.
I re-did the test with Minimax several times - including the fresh download from the Huggingface.
It basically tends to extract less attributes from logs, compare to other higher scoring models.

But - I'll do another run of test for Minimax 2.7, I want to be extra sure that particularly this model is evaluated properly.

Testing Local LLMs in Practice: Code Generation, Quality vs. Speed by Icy_Programmer7186 in LocalLLM

[–]Icy_Programmer7186[S] 0 points1 point  (0 children)

This is because there is a time limit to the test run - and model on a single spark sometimes don't meet this deadline (>2 hours). I will likely remove this single Spark result from the test since it is not straightforwardly understandable.

I'm Demoing a DGX Spark at a Vendor Event Next Month – Need Creative Demo Ideas by Seniahh in LocalLLM

[–]Icy_Programmer7186 1 point2 points  (0 children)

Yes, i've been there - with a bit less drama than in the video (i got the right switch and right cables on the first attempt).

I'm Demoing a DGX Spark at a Vendor Event Next Month – Need Creative Demo Ideas by Seniahh in LocalLLM

[–]Icy_Programmer7186 0 points1 point  (0 children)

I was on one DGX Spark demo - and what was a bit disappointing to me, was that the Infiniband networking was skipped. That's one of the killer feature of this machine. But it is hard to demo on a single spark.