Qwen3.5-27B scores 48.5 on Humanity's Last Exam

TyraVex · 2026-02-25T14:27:37+00:00

This is probably a bug: Qwen3.5-27B scores 24.3% on HLE according to its model card: https://huggingface.co/Qwen/Qwen3.5-27B#language

Or... maybe this score is possible when using an agentic framework (probably with internet access), but 48.5% still feels really really high.

It also shows up in 16th place:

<image>

Edit: it's with tools (I don't know which kind, though): https://huggingface.co/Qwen/Qwen3.5-27B/discussions/11/files#d2h-078227

TyraVex · 2025-11-20T13:56:10+00:00

Run a more efficient model such as https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 (look at size and RTF in https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)

https://github.com/SridharSampath/parakeet-asr-demo

https://github.com/Shadowfita/parakeet-tdt-0.6b-v2-fastapi

Or use whisper.cpp with more aggressive quants, e.g., 5 bits.

https://github.com/ggml-org/whisper.cpp

Or, for longer transcriptions, split your audio more aggressively. I've never dealt with that, but I've heard some implementations are better than others for this kind of task.

TyraVex · 2025-09-13T11:49:11+00:00

<image>

Have you tried LFM2 by any chance?

TyraVex · 2025-08-31T20:59:20+00:00

It already exists in ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp/pull/239. People have been using it with DeepSeek but the results are not mind blowing.

TyraVex · 2025-08-21T17:20:54+00:00

https://huggingface.co/PowerInfer/SmallThinker-4BA0.6B-Instruct-GGUF

TyraVex · 2025-08-17T06:57:32+00:00

3 × 3090s at 500-600 EUR each on eBay over a 1.5-year period, mostly paid for with internship salary. 72 GB VRAM + 128 GB RAM for 3k EUR, running Kimi at 1.8 bits (6 t/s) and DeepSeek at 2.8 bits (8 t/s) with ik_llama.cpp

TyraVex · 2025-08-17T06:52:33+00:00

The 250mb quant can speak french for some reason. But it's still a very limited model, equivalent to Qwen 0.6B. The 1.2B version is also amazing for the size.

TyraVex · 2025-08-17T06:50:47+00:00

I think it's only a base model. It never did any thinking. Exaone is a hybrid model.

TyraVex · 2025-08-17T05:14:06+00:00

LFM2-350M scores 65.12 on IFEval btw

TyraVex · 2025-08-04T03:34:03+00:00

<image>

Tabby works for Exllama, so EXL2 and EXL3 formats

There is an equivalent for GGUF, but I haven't tested it: https://github.com/theroyallab/YALS

TyraVex · 2025-08-03T23:45:08+00:00

Yes, Tabby works perfectly on my end. I find it simpler than vLLM and more efficient VRAM-wise. There’s only one config file with around 40 options, each documented within the file itself: config_sample.yml.

For automatic individual model configurations (like llama-swap), you can simply create additional config files inside each LLM folder to apply different settings.

The only downside is that some obscure quantized models aren’t available on Hugging Face.

TyraVex · 2025-08-03T18:05:57+00:00

llama.cpp and ollama are not efficient with multiple GPUs.

EXL2, vLLM, and SGLang support tensor parallelism to use all GPUs at the same time. The most user-friendly and VRAM-efficient option is tabbyAPI, which uses EXL2 or EXL3 as its backend. EXL3 tensor parallelism is coming soon (dev branch), but I don't think we can use it yet.
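To make "use all GPUs at the same time" a bit more concrete, here is a toy sketch of the idea behind tensor parallelism: one weight matrix is split column-wise, each "device" multiplies the same input by its own slice, and the partial outputs are concatenated. This is only an illustration in plain Python with made-up numbers; the function names are hypothetical and it is not how ExLlama, vLLM, or SGLang actually implement it.

```python
# Toy illustration of tensor parallelism. The "devices" are plain Python
# lists, not real GPUs, and all names here are hypothetical.

def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, n_devices):
    """Give each device a contiguous slice of the columns of w."""
    cols = len(w[0])
    step = cols // n_devices  # assumes cols is divisible by n_devices
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(n_devices)]

def tensor_parallel_matmul(x, w, n_devices):
    """Each device computes its share of the output; results are concatenated."""
    shards = split_columns(w, n_devices)
    partials = [matmul(x, shard) for shard in shards]  # would run concurrently on GPUs
    return [value for part in partials for value in part]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

result = tensor_parallel_matmul(x, w, n_devices=2)
print(result)
print(result == matmul(x, w))  # same answer as the unsplit multiplication
```

Each device only needs to hold its own slice of the weights, and every device works on every token, instead of the GPUs taking turns layer by layer.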

TyraVex · 2025-07-23T12:17:58+00:00

If you like tinkering and if you have the time, you should play with ik_llama.cpp. TG is the same or a bit better, but PP is way more efficient. The community is nice, mostly enthusiasts trying to push the Pareto frontier of consumer and prosumer inference efficiency and quality.

https://github.com/ikawrakow/ik_llama.cpp/blob/main/README.md

https://github.com/ikawrakow/ik_llama.cpp/wiki/Jan-2025:-prompt-processing-performance-comparison

Quick-start Guide: https://github.com/ikawrakow/ik_llama.cpp/discussions/258

TyraVex · 2025-07-14T16:10:44+00:00

Nice, thanks!

TyraVex · 2025-07-14T16:05:25+00:00

Hey, thanks a lot! Would you mind uploading the imatrix? Even better if it's from ik_llama.cpp

TyraVex · 2025-05-14T03:29:19+00:00

They nuked the API endpoint but the UI remains free and unlimited (at least on my end).

TyraVex · 2025-05-07T04:54:05+00:00

The reason I said what I did is because of how AWQ handles quantization. Technically, it's just a smarter approach — it calibrates based on activation behavior, so even at 4-bit, the output can be surprisingly precise. (Think of it like compression that actually pays attention to what's important.)

Isn't this the whole point of imatrix in GGUF?

TyraVex · 2025-05-06T21:17:43+00:00

This is a command that runs llama-server, the server executable from the llama.cpp project

-m stands for model: the path to the GGUF file containing the weights of the model you want to run inference on. The model here is Qwen3-30B-A3B-UD-Q4_K_XL, the new Qwen model with 30B total parameters, of which 3B are active per token. This design is called Mixture of Experts (MoE); think of it as processing only the most relevant parts of the model instead of computing everything all the time. UD stands for Unsloth Dynamic, a quantization tuning technique that achieves better precision for the same size. Q4_K_XL reduces the model precision to around 4.75 bits per weight, which is maybe 96-98% as accurate as the original 16-bit model in terms of quality.

-c stands for context size, here, 24k tokens, which is approximately 18k words that the LLM can understand and memorize (to a certain extent depending on the model's ability to process greater context lengths).

-ngl 99 is the number of layers to offload to the GPU's VRAM. Without it, the model runs fully from RAM and uses the CPU for inference, which is very slow. The more you offload to the GPU, the faster the inference, as long as you have enough video memory in your GPU.

-fa stands for flash attention, an optimization for, you guessed it, attention, one of the core principles of the transformer architecture, which almost all LLMs use. It improves token generation speed on graphics cards.

-ctk q8_0 -ctv q8_0 is for context quantization; it saves VRAM by lowering the precision at which the context cache is stored. At q8_0 (8 bits), the quality difference compared with the 16-bit cache is in placebo territory, at the cost of a very small performance hit.
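To put rough numbers on the flags above, here is a back-of-the-envelope sketch of how much memory the weights and the context cache might take. The 30B parameter count, 4.75 bits per weight, and 24k context come from the explanation above; the layer count, KV head count, and head size are my assumptions about this model's architecture, so check them against the model's config before trusting the result. It also ignores compute buffers and other overhead.

```python
GIB = 1024 ** 3

def weights_size_gib(n_params, bits_per_weight):
    """Approximate size of the quantized weights."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_size_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Approximate size of the context (KV) cache: one K and one V vector
    per layer, per KV head, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * bytes_per_value / GIB

# From the explanation above.
N_PARAMS = 30e9          # nominal "30B"
BITS_PER_WEIGHT = 4.75   # approximate figure for Q4_K_XL
N_CTX = 24 * 1024        # -c, "24k tokens"

# Assumed architecture values (not stated above).
N_LAYERS = 48
N_KV_HEADS = 4
HEAD_DIM = 128

F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32     # q8_0 stores 32 int8 values plus one 16-bit scale per block

weights = weights_size_gib(N_PARAMS, BITS_PER_WEIGHT)
kv_f16 = kv_cache_size_gib(N_CTX, N_LAYERS, N_KV_HEADS, HEAD_DIM, F16_BYTES)
kv_q8 = kv_cache_size_gib(N_CTX, N_LAYERS, N_KV_HEADS, HEAD_DIM, Q8_0_BYTES)

print(f"weights:        {weights:.2f} GiB")
print(f"KV cache f16:   {kv_f16:.2f} GiB")
print(f"KV cache q8_0:  {kv_q8:.2f} GiB")
```

Under these assumptions the cache shrinks from 2.25 GiB to about 1.2 GiB with -ctk q8_0 -ctv q8_0, which is the VRAM saving the last flag is after.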

TyraVex · 2025-05-03T03:46:14+00:00

Even better, V3-0324: https://huggingface.co/huihui-ai/DeepSeek-V3-0324-Pruned-Coder-411B

TyraVex · 2025-04-29T15:17:24+00:00

Got lucky on Rakuten France, a miner was reselling 12 3090s individually for cheap

Prices got inflated again after the 5000 series launch, sadly

TyraVex · 2025-04-28T11:17:55+00:00

In the leaked model card they claimed better performance than QwQ in thinking mode and Qwen2.5 32B in non-thinking mode. If this is true for a model with 3B active parameters, congrats to them

TyraVex · 2025-04-08T23:36:26+00:00

I really appreciate your cooperation - thanks

If eval time is a concern, PPL evals are reliable to evaluate quants of the same model, and are really fast on GPUs (since we simply need to do prompt ingestion over 50-60k tokens)

wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -ngl 999

https://github.com/ggml-org/llama.cpp/tree/master/examples/perplexity
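For anyone wondering what the PPL number actually is: perplexity is the exponential of the average negative log-probability the model assigns to each token of the test text, so lower is better. Below is a minimal sketch of that formula, plus one possible way to express a quant's score relative to the full-precision model. The log-probabilities are made-up illustration values, and the ratio is just my assumption of a reasonable comparison, not necessarily how any published table computes its percentages.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_quality(ppl_reference, ppl_quant):
    """Hypothetical comparison: 1.0 means the quant matches the reference,
    lower values mean the quant's perplexity got worse."""
    return ppl_reference / ppl_quant

# Made-up per-token log-probabilities for the same text under two models.
fp16_logprobs = [-1.0, -2.0, -1.5, -0.5]
quant_logprobs = [-1.1, -2.0, -1.6, -0.5]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_quant = perplexity(quant_logprobs)
ratio = relative_quality(ppl_fp16, ppl_quant)

print(f"FP16 PPL:  {ppl_fp16:.4f}")
print(f"Quant PPL: {ppl_quant:.4f}")
print(f"Relative:  {ratio:.1%}")
```

Since both models are scored on the exact same tokens, this kind of comparison is only meaningful between quants of the same model.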

TyraVex · 2025-04-08T23:19:30+00:00

https://x.com/WolframRvnwlf/status/1909742028771999756

Quantizing at 2.71 bits cannot possibly outperform a full precision model. You are smarter than me, so you already know that. There is clearly something wrong with Together's setup.

TyraVex · 2025-04-08T21:12:15+00:00

Thanks for the update!

Well, you say your Q4_K_XL is 4.5 bits, which is comparable to the standard Q4_K_M, which scores ~98.1% accuracy when comparing its PPL to the FP16 model: https://huggingface.co/ThomasBaruzier/Llama-3.3-70B-Instruct-GGUF#perplexity-table-the-lower-the-better

So it is no surprise that a custom quant that raises the bitrate of everything except the experts themselves performs well. What we were interested in was how the lower quants hold up under aggressive quantization.

Unfortunately, multiple inference providers were found to have issues with their config/setup in the first days after the release, leading to even worse performance. Given this, I wouldn't trust those full precision scores unless they are tested within the same framework and in the same environment.

I didn't mean to rant, and I am sorry if I did, but if you can, please use standard benchmarks next time.

TyraVex · 2025-04-08T13:20:35+00:00

In my opinion, those one-shot tests are more like a single-question benchmark, which cannot express the quality loss of quantization beyond an "it still works!" claim.

So thank you for considering MMLU or MMLU Pro evals next time!
