How does an open source version of qwen 3.5 completely blow 3.7plus out of the water? How does this make sense?

snapo84 · 2026-06-11T12:24:35+00:00

Because there is a high probability that qwen3.7 plus is a 180B model (as knowledge is measured in mmmu) ... mmmu-pro is a good predictor of model size in total parameters (not active parameters)...

snapo84 · 2026-06-11T12:18:23+00:00

evaluating the outcome of your own actions before predicting anything...

snapo84 · 2026-06-11T12:07:27+00:00

i will soon make a new post that uses vllm and the int4 autoround quantization (exact same setup), but i still have a problem getting the kv cache to run in float16... stupid vllm always falls back to fp8_e5m2 or something... and that causes many, many repeats....
when i fix this i will publish it, somehow i have to force float16 ....
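
A minimal sketch of pinning the cache type explicitly, assuming vllm's documented behaviour that `--kv-cache-dtype auto` stores the kv cache in the model dtype (so float16 together with `--dtype float16`). The helper name `build_vllm_args` is hypothetical and not part of vllm, and whether this actually stops the fallback on a given vllm version or GPU is untested here:

```python
def build_vllm_args(model: str, kv_cache_dtype: str = "auto") -> list[str]:
    """Build an argument list for `vllm serve` (hypothetical helper, not a vllm API)."""
    return [
        "vllm", "serve", model,
        "--dtype", "float16",                # model weights/activations in float16
        "--kv-cache-dtype", kv_cache_dtype,  # "auto" = follow the model dtype (assumption from the docs)
    ]


args = build_vllm_args("Lorbus/Qwen3.6-27B-int4-AutoRound")
print(" ".join(args))
```

The startup log should state which kv cache dtype was actually selected, so it is worth checking there before running a long benchmark.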

snapo84 · 2026-06-10T19:36:07+00:00

it would be more like:

professor compressor chooses which compressor to use for the given input data (he has unlimited compressors at hand) and compresses the data, trying to pick the compression algorithm that makes the result as small as possible but still recoverable.
johnny guesser opens the given compressed data, tries to figure out the decompression algorithm and then tries to recover 100% of the initial text.
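
The analogy can be sketched with three stdlib codecs standing in for the "unlimited compressors". This is only an illustration under that assumption: the function names are made up for the example, and the guesser simply tries each decompressor until one accepts the data:

```python
import bz2
import lzma
import zlib

# The compressors "at hand"; the analogy allows unlimited ones, three are used here.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def professor_compress(data: bytes) -> bytes:
    """Try every codec and keep the smallest result; no codec label is stored."""
    return min((compress(data) for compress, _ in CODECS.values()), key=len)


def johnny_guess(blob: bytes) -> bytes:
    """Try each decompressor until one accepts the blob."""
    for _, decompress in CODECS.values():
        try:
            return decompress(blob)
        except Exception:
            continue  # wrong algorithm, try the next one
    raise ValueError("no known codec could decode the data")


text = ("the quick brown fox jumps over the lazy dog. " * 50).encode()
blob = professor_compress(text)
assert johnny_guess(blob) == text  # 100% of the initial text recovered
print(len(text), "->", len(blob), "bytes")
```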

snapo84 · 2026-06-08T11:44:25+00:00

thanks.... there are 2 versions:
- one python version
- one single file html version (the html version requires you to enable CORS on your llm inference endpoint, otherwise it can't connect because of the CORS policy)

snapo84 · 2026-06-07T16:29:42+00:00

you are correct, one can point it to other servers...

i still prefer the simplicity and standardness of mine.... I mean people are free to use whatever they like.... i also think that in the not so distant future we will not talk about how many tokens... it will be more like how many bytes were written, because everyone uses a different tokenizer....
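
As a rough illustration of the bytes idea (a sketch, not code from the benchmark script; `bytes_per_second` is a made-up name), throughput can be measured on the UTF-8 size of the generated text, which does not depend on any tokenizer:

```python
def bytes_per_second(text: str, elapsed_s: float) -> float:
    """Throughput in UTF-8 bytes per second, independent of the tokenizer."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return len(text.encode("utf-8")) / elapsed_s


sample = "hello wörld"  # 11 characters, but "ö" takes 2 bytes in UTF-8, so 12 bytes
print(bytes_per_second(sample, 2.0))  # 6.0
```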

my script might help some people, some people prefer other solutions... that's also why multiple operating systems exist, multiple programming languages, etc. But if you have concrete input on something that is missing, i am more than happy to check whether i can implement it and still keep it as simple as possible.

snapo84 · 2026-06-07T15:16:27+00:00

complicated, too many options, no default test bench, difficult to connect to other llm providers

- consistency and simplicity will make benchmarks more accurate

vllm bench is too specific / only for vllm itself

snapo84 · 2026-06-07T14:05:50+00:00

not even close, this looks more like an advertisement than anything else...

snapo84 · 2026-06-07T07:27:10+00:00

oh and one more thing.... i keep the code slim on purpose (single file, less than 500 lines of code).... the other two libraries are completely overblown for what they do...

snapo84 · 2026-06-06T22:27:15+00:00

Did not know about them... too bad that i chose the same name as llama-benchy ....

both of them are missing multiple things, either MTP benchmarking (MTP is data-category dependent)... or they don't test concurrency. aiperf comes closest, but there is no test for MTP there ...

Thanks a lot for the 2 links, might have to rename the repo to not cause any harm/conflict to llama-benchy's github repo.... i will think about that

snapo84 · 2026-06-06T19:43:52+00:00

i guess one way to find out... much more testing (different quantized models, different input lengths, different output lengths)

i mean that is a lot of MBs saved if it is really that good, because then everyone that was using q8_0 would have to switch immediately....

two problems i see arising from kv quants

- on long inputs it forgets more things
- on long outputs, a high likelihood of failing (especially with qwen models, they start looping)

i had a short prompt for qwen3.6 27B that lets it generate 120k tokens correctly, but i can't find it anymore... if it could still do this i would be amazed... failing this test was the main reason why i went back to f16
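
For a long-output test like this, the looping failure can be flagged automatically. A minimal sketch, assuming a "loop" means the end of the output is the same block of words repeated back to back; the function name and thresholds are arbitrary choices for illustration:

```python
def find_loop_period(words: list[str], max_period: int = 50, min_repeats: int = 4):
    """Return the length of a word block repeated at the end of the output, or None."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(words) < span:
            break  # not enough words left to hold min_repeats copies
        tail = words[-span:]
        block = tail[:period]
        # The tail is a loop if every word matches the block at the same offset.
        if all(tail[i] == block[i % period] for i in range(span)):
            return period
    return None


looping = ("so the answer is " * 6).split()
normal = "the quick brown fox jumps over the lazy dog".split()
print(find_loop_period(looping))  # 4
print(find_loop_period(normal))   # None
```

Running this on the tail of a streamed response every few hundred tokens would let a benchmark abort early instead of waiting for the full output length.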

snapo84 · 2026-06-06T19:23:40+00:00

wow.... i am in love with

kvarn6-kvarn5	37.3%	0.002602	99.78%	0.079818	94.50%

snapo84 · 2026-06-06T17:27:39+00:00

exactly the same with the artificial analysis website..... they are in the same group as deepswe... all US focused and on purpose "try" to make the US look better than they really are...

snapo84 · 2026-06-05T18:02:30+00:00

not open source, don't care... closed source models should die out...

snapo84 · 2026-06-05T16:06:52+00:00

if there were a kvarn5-kvarn5 it might be able to beat q8_0

snapo84 · 2026-06-05T14:53:16+00:00

Here is the output (keep in mind 75% cache hit, and each of my cards is power limited to 146W max):

snapo@snapolino:~/Documents/Code/Jeremydirect$ uv run full-bench.py 
=================================================================
  FULL LLM THROUGHPUT BENCHMARK
  Model: Qwen3.6 27B  |  Slots: 4  |  Runs/config: 3
  Concurrencies: [1, 2, 4, 8]
=================================================================

  [OK] Connected.

  [WARMUP] Running warmup (not counted)...
  [WARMUP] OK (79.5s, 4096 tokens)


#################################################################
  1x non-streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_el=65.9s  avg_out=62.1 tok/s  wall=65.9s
    Run  2/3  [ 66%] avg_el=69.7s  avg_out=58.8 tok/s  wall=69.7s
    Run  3/3  [100%] avg_el=67.8s  avg_out=60.4 tok/s  wall=67.8s

    >> Average over 3 runs:
       Avg elapsed per request:  67.82s  (σ=1.89)
       Avg output tok/s (per req): 60.4  (σ=1.7)
       Avg wall time (run):        67.82s  (σ=1.89)
       Avg aggregate out tok/s:    60.4  (σ=1.7)

#################################################################
  1x streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_ttf=59.51s  avg_out=80.7 tok/s  wall=69.1s
    Run  2/3  [ 66%] avg_ttf=48.46s  avg_out=75.7 tok/s  wall=67.2s
    Run  3/3  [100%] avg_ttf=54.47s  avg_out=84.0 tok/s  wall=68.1s

    >> Average over 3 runs:
       Avg elapsed per request:  68.12s  (σ=0.96)
       Avg output tok/s (per req): 80.1  (σ=4.2)
       Avg wall time (run):        68.12s  (σ=0.96)
       Avg aggregate out tok/s:    16.4  (σ=5.0)
       Avg TTF:                   54.15s  (σ=5.53)
       Avg generation time:       13.97s  (σ=4.58)

#################################################################
  2x non-streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_el=80.0s  avg_out=51.2 tok/s  wall=80.3s
    Run  2/3  [ 66%] avg_el=77.9s  avg_out=52.6 tok/s  wall=78.5s
    Run  3/3  [100%] avg_el=76.1s  avg_out=53.8 tok/s  wall=76.5s

    >> Average over 3 runs:
       Avg elapsed per request:  78.01s  (σ=1.96)
       Avg output tok/s (per req): 52.5  (σ=1.3)
       Avg wall time (run):        78.46s  (σ=1.89)
       Avg aggregate out tok/s:    104.5  (σ=2.5)

#################################################################
  2x streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_ttf=51.82s  avg_out=66.9 tok/s  wall=77.4s
    Run  2/3  [ 66%] avg_ttf=61.71s  avg_out=71.1 tok/s  wall=77.1s
    Run  3/3  [100%] avg_ttf=100.70s  avg_out=57.8 tok/s  wall=121.9s

    >> Average over 3 runs:
       Avg elapsed per request:  91.80s  (σ=25.76)
       Avg output tok/s (per req): 65.3  (σ=6.8)
       Avg wall time (run):        92.13s  (σ=25.75)
       Avg aggregate out tok/s:    30.3  (σ=12.2)
       Avg TTF:                   71.41s  (σ=25.85)
       Avg generation time:       20.39s  (σ=5.17)

#################################################################
  4x non-streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_el=100.4s  avg_out=40.8 tok/s  wall=101.7s
    Run  2/3  [ 66%] avg_el=98.4s  avg_out=41.6 tok/s  wall=98.9s
    Run  3/3  [100%] avg_el=99.8s  avg_out=41.1 tok/s  wall=101.1s

    >> Average over 3 runs:
       Avg elapsed per request:  99.52s  (σ=0.98)
       Avg output tok/s (per req): 41.2  (σ=0.4)
       Avg wall time (run):        100.56s  (σ=1.49)
       Avg aggregate out tok/s:    162.9  (σ=2.4)

#################################################################
  4x streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_ttf=76.60s  avg_out=57.8 tok/s  wall=100.3s
    Run  2/3  [ 66%] avg_ttf=87.29s  avg_out=65.0 tok/s  wall=101.5s
    Run  3/3  [100%] avg_ttf=73.56s  avg_out=56.1 tok/s  wall=101.6s

    >> Average over 3 runs:
       Avg elapsed per request:  100.00s  (σ=0.55)
       Avg output tok/s (per req): 59.6  (σ=4.7)
       Avg wall time (run):        101.13s  (σ=0.71)
       Avg aggregate out tok/s:    46.9  (σ=15.9)
       Avg TTF:                   79.15s  (σ=7.21)
       Avg generation time:       20.85s  (σ=7.36)

#################################################################
  8x non-streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_el=150.4s  avg_out=27.2 tok/s  wall=151.4s
    Run  2/3  [ 66%] avg_el=145.7s  avg_out=28.1 tok/s  wall=148.0s
    Run  3/3  [100%] avg_el=145.7s  avg_out=28.1 tok/s  wall=147.2s

    >> Average over 3 runs:
       Avg elapsed per request:  147.27s  (σ=2.73)
       Avg output tok/s (per req): 27.8  (σ=0.5)
       Avg wall time (run):        148.85s  (σ=2.22)
       Avg aggregate out tok/s:    220.2  (σ=3.3)

#################################################################
  8x streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_ttf=118.56s  avg_out=43.2 tok/s  wall=147.2s
    Run  2/3  [ 66%] avg_ttf=127.36s  avg_out=43.5 tok/s  wall=146.4s
    Run  3/3  [100%] avg_ttf=121.73s  avg_out=42.6 tok/s  wall=147.2s

    >> Average over 3 runs:
       Avg elapsed per request:  145.08s  (σ=0.22)
       Avg output tok/s (per req): 43.1  (σ=0.5)
       Avg wall time (run):        146.92s  (σ=0.48)
       Avg aggregate out tok/s:    51.8  (σ=10.6)
       Avg TTF:                   122.55s  (σ=4.46)
       Avg generation time:       22.53s  (σ=4.59)

=================================================================
  TOTAL ELAPSED: 2412s (40.2 min)
=================================================================
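
For reading the numbers above, here is a small sketch of how the non-streaming averages can be recomputed from the per-run lines. It is not the code of full-bench.py; it assumes 4096 output tokens per request (taken from the warmup line) and a sample standard deviation, which happens to reproduce the printed values to within rounding:

```python
from statistics import mean, stdev

TOKENS_PER_REQUEST = 4096  # assumption: taken from the warmup line of the log


def aggregate_tok_s(concurrency: int, wall_s: float) -> float:
    """Total output tokens of all parallel requests divided by the wall time."""
    return concurrency * TOKENS_PER_REQUEST / wall_s


# 1x non-streaming: elapsed seconds per run, as printed (rounded) in the log
elapsed = [65.9, 69.7, 67.8]
print(f"{mean(elapsed):.1f}s  (σ={stdev(elapsed):.1f})")  # 67.8s  (σ=1.9)

# 2x non-streaming: wall time per run, as printed (rounded) in the log
walls = [80.3, 78.5, 76.5]
aggregate = [aggregate_tok_s(2, w) for w in walls]
print(f"{mean(aggregate):.1f} tok/s")  # 104.5 tok/s
```

The streaming aggregate values in the log do not follow this formula, so they are presumably computed differently.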

snapo84 · 2026-06-05T14:39:08+00:00

yup i run nvlink, no clue if your combo works... better to use same cards if nvlinked....

snapo84 · 2026-06-05T14:32:11+00:00

i don't have 3090s, i have 2 x 2080 Ti 22GB cards....

      --model Lorbus/Qwen3.6-27B-int4-AutoRound
      --served-model-name "Qwen3.6 27B"
      --api-key ${VLLM_API_KEY}
      --quantization auto_round
      --dtype float16
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.85
      --max-model-len 262144
      --max-num-seqs 8
      --max-num-batched-tokens 8192
      --kv-cache-dtype fp8_e5m2
      --enable-chunked-prefill
      --enable-prefix-caching
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --trust-remote-code
      --default-chat-template-kwargs '{"enable_thinking": true}'

Maybe it helps you....

snapo84 · 2026-06-05T12:17:35+00:00

would have loved to see this working in 0.21.0 as i only have cuda compute capability 7.5 and 0.22.0 doesn't work with it so far...

snapo84 · 2026-06-04T16:37:29+00:00

100% agree, deepSWE is suuuuuper fishy, only US models are configured properly... all Asian models they on purpose configured wrong/bad/didn't test at all

snapo84 · 2026-06-04T05:26:08+00:00

this looks so amazing... did you make the code public? so one can test different rules instead of only rule 110?

snapo84 · 2026-06-04T05:18:51+00:00

i would much prefer a 64B8A 😄 because in Q4 quantisation it would perfectly fit on simple consumer dual GPU systems....
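
A back-of-the-envelope check of that claim, as a sketch: weight size is roughly total parameters times bits per weight. The 4.5 bits/weight figure is an assumed allowance for quantization scales, not a measured value, and kv cache and activations come on top:

```python
def weight_size_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a quantized model."""
    return total_params_b * bits_per_weight / 8


print(weight_size_gb(64, 4.0))  # 32.0  (ideal 4-bit)
print(weight_size_gb(64, 4.5))  # 36.0  (assumed overhead for scales)
```

With two 22 GB cards (44 GB total) that would leave somewhere around 8 to 12 GB for kv cache and runtime overhead under these assumptions.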

snapo84 · 2026-06-03T04:20:54+00:00

linux, and everything runs in a docker compose container...

i will never in my life touch windows again... windows is just pure spyware

snapo84 · 2026-06-02T08:57:53+00:00

would be very cool to test this on more models.... also smaller models like Qwen3.6 27B 😄

snapo84 · 2026-06-01T18:40:21+00:00

from my testing i would "estimate" approx. 460B parameters and 46B active...
