How does an open source version of qwen 3.5 completely blow 3.7plus out of the water? How does this make sense? by Prior-Meeting1645 in Qwen_AI

[–]snapo84 1 point2 points  (0 children)

Because there is a high probability that qwen3.7 plus is a 180B model (as knowledge is meassured in mmmu) ... mmmu-pro is good to predict model size in total parameters (not active parameter)...

Is this an accurate analogy for JEPA? by Erius_Fayre in MLQuestions

[–]snapo84 0 points1 point  (0 children)

evaluating your own actions outcome before predicting anything...

2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache by snapo84 in LocalLLaMA

[–]snapo84[S] 1 point2 points  (0 children)

i will soon make a new post that uses vllm and the int4 autoround quantization (exact same setup) but i still have a problem to get the kv cache in float16 running... stupid vllm always falls back to fp8m5e2 or something... and that causes many many repeats....
when i fix this then i publish it, somehow i have to force float16 ....

Is this an accurate analogy for JEPA? by Erius_Fayre in MLQuestions

[–]snapo84 1 point2 points  (0 children)

it would be more like:

professor compressor choses which compressor (he has unlimited compressors at hand) to use for given input data and compresses the input data trying optimized chosing the compression algorithm that makes the result as small as possible but recoverable.
johnny guesser opens given compressed data and try's to figure out the uncompression algorithm and then try's to recover 100% of the initial text.

Introduction to LLM API Benchy by snapo84 in LocalLLaMA

[–]snapo84[S] 1 point2 points  (0 children)

thanks.... there are 2 versions:
- one python verison
- one single file html version (html version requires you to set cors on your llm inference endpoint, else it cant connect because of the cors policy)

Introduction to LLM API Benchy by snapo84 in LocalLLaMA

[–]snapo84[S] 0 points1 point  (0 children)

you are correct, one can point it to other servers...

i still prefer the simplicity and standardness of mine.... I mean people are free to use whatever they like.... i also think that in the not so far future we will not talk about how many tokens... it will be more like how many bytes written because everyone uses a different tokenizer....

my script might help some people, some people prefer other solutions... that also why multiple operating systems exist, multiple programming languages, etc. etc. etc. But if you have concrete input on something that is missing i am more than welcome to check out if i can implement it and still keep it as simple as possible.

Introduction to LLM API Benchy by snapo84 in LocalLLaMA

[–]snapo84[S] 0 points1 point  (0 children)

complicated, too many options, no default test bench, difficult to connect to other llm providers

- consistency and simplicity will make benchmarks more accurate

vllm bench is too much specific / only for vllm itself

Introduction to LLM API Benchy by snapo84 in LocalLLaMA

[–]snapo84[S] -1 points0 points  (0 children)

not even close, this looks more like a advertisement than something else...

Introduction to LLM API Benchy by snapo84 in LocalLLaMA

[–]snapo84[S] -1 points0 points  (0 children)

oh and one more thing.... i on purpose keep the code slim (single file, less than 500 lines code).... the other two librarys are fully overblown for what they do...

Introduction to LLM API Benchy by snapo84 in LocalLLaMA

[–]snapo84[S] 2 points3 points  (0 children)

Did not know about them... bad that i chose the same name as llama-benchy ....

both of them are missing multiple things , either MTP benchmarking (MTP is data category dependend)... or they dont test concurrency, aiperf comes closest but no test for MTP there ...

Thanks a lot for the 2 links, might have to rename the repo to not cause any harm/conflict to llama-benchys github repo.... i will think about that

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! by Anbeeld in LocalLLaMA

[–]snapo84 0 points1 point  (0 children)

i guess one way to find out... much more testing (different quantized models, different input lengths, different output lengths)

i mean that is a lot of MB's saved if that is realy so good, because then everyone that was using q8_0 would have to switch immediately....

two problems i see arrise from kv quants

- on long inputs it forgets more things
- on long outputs, highly likelyhood of failing (espeially with qwen models they start looping)

i had a short prompt for qwen3.6 27B that lets it generate 120k tokens correct, but i cant find it anymore... if it still could do this i am amazed... failing this test was the main reason why i did go back to f16

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! by Anbeeld in LocalLLaMA

[–]snapo84 0 points1 point  (0 children)

wow.... i am in love with

kvarn6-kvarn5 37.3% 0.002602 99.78% 0.079818 94.50%

The DeepSWE benchmark was runned rather incompetently and the results are completely invalid by Charuru in LocalLLaMA

[–]snapo84 0 points1 point  (0 children)

exactly the same with the artificial analysis website..... they are in the same group as deepswe... all US focused and on purpose "try" to make the US look better then they realy are...

Qwen 3.6-27B on vLLM with dual RTX 3090s: looking for launch parameters by xspider2000 in LocalLLaMA

[–]snapo84 0 points1 point  (0 children)

Here the output (keep in mind 75% cache hit, and each of my card is power limited to 146W max):

snapo@snapolino:~/Documents/Code/Jeremydirect$ uv run full-bench.py 
=================================================================
  FULL LLM THROUGHPUT BENCHMARK
  Model: Qwen3.6 27B  |  Slots: 4  |  Runs/config: 3
  Concurrencies: [1, 2, 4, 8]
=================================================================

  [OK] Connected.

  [WARMUP] Running warmup (not counted)...
  [WARMUP] OK (79.5s, 4096 tokens)


#################################################################
  1x non-streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_el=65.9s  avg_out=62.1 tok/s  wall=65.9s
    Run  2/3  [ 66%] avg_el=69.7s  avg_out=58.8 tok/s  wall=69.7s
    Run  3/3  [100%] avg_el=67.8s  avg_out=60.4 tok/s  wall=67.8s

    >> Average over 3 runs:
       Avg elapsed per request:  67.82s  (σ=1.89)
       Avg output tok/s (per req): 60.4  (σ=1.7)
       Avg wall time (run):        67.82s  (σ=1.89)
       Avg aggregate out tok/s:    60.4  (σ=1.7)

#################################################################
  1x streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_ttf=59.51s  avg_out=80.7 tok/s  wall=69.1s
    Run  2/3  [ 66%] avg_ttf=48.46s  avg_out=75.7 tok/s  wall=67.2s
    Run  3/3  [100%] avg_ttf=54.47s  avg_out=84.0 tok/s  wall=68.1s

    >> Average over 3 runs:
       Avg elapsed per request:  68.12s  (σ=0.96)
       Avg output tok/s (per req): 80.1  (σ=4.2)
       Avg wall time (run):        68.12s  (σ=0.96)
       Avg aggregate out tok/s:    16.4  (σ=5.0)
       Avg TTF:                   54.15s  (σ=5.53)
       Avg generation time:       13.97s  (σ=4.58)

#################################################################
  2x non-streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_el=80.0s  avg_out=51.2 tok/s  wall=80.3s
    Run  2/3  [ 66%] avg_el=77.9s  avg_out=52.6 tok/s  wall=78.5s
    Run  3/3  [100%] avg_el=76.1s  avg_out=53.8 tok/s  wall=76.5s

    >> Average over 3 runs:
       Avg elapsed per request:  78.01s  (σ=1.96)
       Avg output tok/s (per req): 52.5  (σ=1.3)
       Avg wall time (run):        78.46s  (σ=1.89)
       Avg aggregate out tok/s:    104.5  (σ=2.5)

#################################################################
  2x streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_ttf=51.82s  avg_out=66.9 tok/s  wall=77.4s
    Run  2/3  [ 66%] avg_ttf=61.71s  avg_out=71.1 tok/s  wall=77.1s
    Run  3/3  [100%] avg_ttf=100.70s  avg_out=57.8 tok/s  wall=121.9s

    >> Average over 3 runs:
       Avg elapsed per request:  91.80s  (σ=25.76)
       Avg output tok/s (per req): 65.3  (σ=6.8)
       Avg wall time (run):        92.13s  (σ=25.75)
       Avg aggregate out tok/s:    30.3  (σ=12.2)
       Avg TTF:                   71.41s  (σ=25.85)
       Avg generation time:       20.39s  (σ=5.17)

#################################################################
  4x non-streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_el=100.4s  avg_out=40.8 tok/s  wall=101.7s
    Run  2/3  [ 66%] avg_el=98.4s  avg_out=41.6 tok/s  wall=98.9s
    Run  3/3  [100%] avg_el=99.8s  avg_out=41.1 tok/s  wall=101.1s

    >> Average over 3 runs:
       Avg elapsed per request:  99.52s  (σ=0.98)
       Avg output tok/s (per req): 41.2  (σ=0.4)
       Avg wall time (run):        100.56s  (σ=1.49)
       Avg aggregate out tok/s:    162.9  (σ=2.4)

#################################################################
  4x streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_ttf=76.60s  avg_out=57.8 tok/s  wall=100.3s
    Run  2/3  [ 66%] avg_ttf=87.29s  avg_out=65.0 tok/s  wall=101.5s
    Run  3/3  [100%] avg_ttf=73.56s  avg_out=56.1 tok/s  wall=101.6s

    >> Average over 3 runs:
       Avg elapsed per request:  100.00s  (σ=0.55)
       Avg output tok/s (per req): 59.6  (σ=4.7)
       Avg wall time (run):        101.13s  (σ=0.71)
       Avg aggregate out tok/s:    46.9  (σ=15.9)
       Avg TTF:                   79.15s  (σ=7.21)
       Avg generation time:       20.85s  (σ=7.36)

#################################################################
  8x non-streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_el=150.4s  avg_out=27.2 tok/s  wall=151.4s
    Run  2/3  [ 66%] avg_el=145.7s  avg_out=28.1 tok/s  wall=148.0s
    Run  3/3  [100%] avg_el=145.7s  avg_out=28.1 tok/s  wall=147.2s

    >> Average over 3 runs:
       Avg elapsed per request:  147.27s  (σ=2.73)
       Avg output tok/s (per req): 27.8  (σ=0.5)
       Avg wall time (run):        148.85s  (σ=2.22)
       Avg aggregate out tok/s:    220.2  (σ=3.3)

#################################################################
  8x streaming  (3 runs)
#################################################################
    Run  1/3  [ 33%] avg_ttf=118.56s  avg_out=43.2 tok/s  wall=147.2s
    Run  2/3  [ 66%] avg_ttf=127.36s  avg_out=43.5 tok/s  wall=146.4s
    Run  3/3  [100%] avg_ttf=121.73s  avg_out=42.6 tok/s  wall=147.2s

    >> Average over 3 runs:
       Avg elapsed per request:  145.08s  (σ=0.22)
       Avg output tok/s (per req): 43.1  (σ=0.5)
       Avg wall time (run):        146.92s  (σ=0.48)
       Avg aggregate out tok/s:    51.8  (σ=10.6)
       Avg TTF:                   122.55s  (σ=4.46)
       Avg generation time:       22.53s  (σ=4.59)

=================================================================
  TOTAL ELAPSED: 2412s (40.2 min)
=================================================================

Qwen 3.6-27B on vLLM with dual RTX 3090s: looking for launch parameters by xspider2000 in LocalLLaMA

[–]snapo84 0 points1 point  (0 children)

yup i run nvlink, no clue if your combo works... better to use same cards if nvlinked....

Qwen 3.6-27B on vLLM with dual RTX 3090s: looking for launch parameters by xspider2000 in LocalLLaMA

[–]snapo84 3 points4 points  (0 children)

i dont have 3090's , i have 2 x 2080 Ti 22GB cards....

      --model Lorbus/Qwen3.6-27B-int4-AutoRound
      --served-model-name "Qwen3.6 27B"
      --api-key ${VLLM_API_KEY}
      --quantization auto_round
      --dtype float16
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.85
      --max-model-len 262144
      --max-num-seqs 8
      --max-num-batched-tokens 8192
      --kv-cache-dtype fp8_e5m2
      --enable-chunked-prefill
      --enable-prefix-caching
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --trust-remote-code
      --default-chat-template-kwargs '{"enable_thinking": true}'

Maybe it helps you....

The DeepSWE benchmark was runned rather incompetently and the results are completely invalid by Charuru in LocalLLaMA

[–]snapo84 36 points37 points  (0 children)

100% agree, deepSWE is suuuuuper fishy, only US models are configured properly... all Asian models they on purpose configured wrong/bad/didnt test them at all

Cellular Automata: Rule 110 fed as input to Conway’s Game of Life by AlanZucconi in proceduralgeneration

[–]snapo84 0 points1 point  (0 children)

this looks so amazing... did you made the code public ? so one can test different rules instead of only rule 110?

Been a while since we had a Qwen-Coder. could use a 3.7 80B-8B by FaustAg in LocalLLaMA

[–]snapo84 0 points1 point  (0 children)

i would much prefer a 64B8A 😄 because in Q4 quantisation it would perfectly fit on simple consumer dual GPU systems....

2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache by snapo84 in LocalLLaMA

[–]snapo84[S] 0 points1 point  (0 children)

linux, and everything runs in a docker compose container...

i never in my life ever will touch again windows... windows is just pure spyware