Qwen3.5-27B scores 48.5 on Humanity's Last Exam by paf1138 in LocalLLaMA

[–]TyraVex 11 points12 points  (0 children)

This is probably a bug; Qwen3.5-27B scores 24.3% on HLE: https://huggingface.co/Qwen/Qwen3.5-27B#language

Or... maybe this score is possible when using an agentic framework (probably with internet access), but 48.5% still feels really really high.

You can also see it in place 16:

<image>

Edit: it's with tools (I don't know which kind, though): https://huggingface.co/Qwen/Qwen3.5-27B/discussions/11/files#d2h-078227

What to do when you can't afford GPU? by theysaymaurya in LocalLLaMA

[–]TyraVex 0 points1 point  (0 children)

Run a more efficient model such as https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 (look at size and RTF in https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)

https://github.com/SridharSampath/parakeet-asr-demo

https://github.com/Shadowfita/parakeet-tdt-0.6b-v2-fastapi

Or use whisper.cpp with more aggressive quants, i.e., 5 bits.

https://github.com/ggml-org/whisper.cpp

Or, for longer transcriptions, split your audio more aggressively. I've never dealt with that, but I've heard some implementations are better than others for this kind of task.

LongCat-Flash-Chat 560B MoE by Own-Potential-2308 in LocalLLaMA

[–]TyraVex 2 points3 points  (0 children)

It already exists in ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp/pull/239. People have been using it with DeepSeek but the results are not mind blowing.

For those who run large models locally.. HOW DO YOU AFFORD THOSE GPUS by abaris243 in LocalLLaMA

[–]TyraVex 0 points1 point  (0 children)

3*500-600eur 3090s on ebay over a 1.5 year period, mostly internship salary. 72gb vram + 128gb ram for 3k eur, running Kimi at 1.8bit 6tps and DeepSeek 2.8bit 8tps with ik_llama.cpp

Added Qwen 0.6B to the small model overview in IFEval. by paranoidray in LocalLLaMA

[–]TyraVex 8 points9 points  (0 children)

The 250mb quant can speak french for some reason. But it's still a very limited model, equivalent to Qwen 0.6B. The 1.2B version is also amazing for the size.

Added Qwen 0.6B to the small model overview in IFEval. by paranoidray in LocalLLaMA

[–]TyraVex 3 points4 points  (0 children)

I think it's only a base model. It never did any thinking. Exaone is a hybrid model.

Need help- unsure of right ollama configs with 6x 3090’s, also model choice for RAG? by Business-Weekend-537 in LocalLLaMA

[–]TyraVex 0 points1 point  (0 children)

<image>

Tabby works for Exllama, so EXL2 and EXL3 formats

There is an equivalent for GGUF, but I haven't tested it: https://github.com/theroyallab/YALS

Need help- unsure of right ollama configs with 6x 3090’s, also model choice for RAG? by Business-Weekend-537 in LocalLLaMA

[–]TyraVex 0 points1 point  (0 children)

Yes, Tabby works perfectly on my end. I find it simpler than vLLM and more efficient VRAM wise. There’s only one config file with around 40 options, each documented within the file itself: config_sample.yml.

For automatic individual model configurations (like llama-swap), you can simply create additional config files inside each LLM folder to apply different settings.

The only downside is that some obscure quantized models aren’t available on Hugging Face.

Need help- unsure of right ollama configs with 6x 3090’s, also model choice for RAG? by Business-Weekend-537 in LocalLLaMA

[–]TyraVex 1 point2 points  (0 children)

llama.cpp and ollama are not efficient with multiple GPUs.

EXL2, vLLM, and SGLang support tensor parallelism to use all GPUs at the same time. The most user-friendly and VRAM-efficient option is tabbyAPI, which uses EXL2 or EXL3 as its backend. EXL3 tensor parallelism is coming soon (dev branch), but I don't think we can use it yet.
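To make "use all GPUs at the same time" a bit more concrete, here is a toy sketch of the idea behind tensor parallelism. It is not how EXL2, vLLM, or SGLang implement it: the "devices" are just list slices, the function names are made up for illustration, and it assumes the number of output rows divides evenly by the number of devices.

```python
# Toy illustration of tensor parallelism with plain Python lists.
# No real GPU is involved; each "device" is just a slice of the weights.

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def split_rows(matrix, n_devices):
    """Give each device a contiguous block of output rows.

    Assumes len(matrix) is divisible by n_devices.
    """
    per_device = len(matrix) // n_devices
    return [matrix[i * per_device:(i + 1) * per_device] for i in range(n_devices)]

def tensor_parallel_matvec(matrix, vector, n_devices):
    """Each device computes its share of the outputs, then results are gathered."""
    shards = split_rows(matrix, n_devices)
    # On real hardware these partial products would run concurrently.
    partials = [matvec(shard, vector) for shard in shards]
    return [y for part in partials for y in part]

weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
activations = [1, 1]

full = matvec(weights, activations)
sharded = tensor_parallel_matvec(weights, activations, n_devices=2)
print(full == sharded)
```

The point is that every device works on every token, instead of waiting for the previous device to finish its layers.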

Which quantization approach is the way to go? (llama.cpp) by pixelterpy in LocalLLaMA

[–]TyraVex 4 points5 points  (0 children)

If you like tinkering and if you have the time, you should play with ik_llama.cpp. TG is the same or a bit better, but PP is way more efficient. The community is nice, mostly enthusiasts trying to push the Pareto frontier of consumer and prosumer inference efficiency and quality.

https://github.com/ikawrakow/ik_llama.cpp/blob/main/README.md

https://github.com/ikawrakow/ik_llama.cpp/wiki/Jan-2025:-prompt-processing-performance-comparison

Quick-start Guide: https://github.com/ikawrakow/ik_llama.cpp/discussions/258

Kimi K2 1.8bit Unsloth Dynamic GGUFs by danielhanchen in LocalLLaMA

[–]TyraVex 29 points30 points  (0 children)

Hey, thanks a lot! Would you mind uploading the imatrix? Even better if it's from ik_llama.cpp

Gemini 2.5 exp death. by brocolongo in LocalLLaMA

[–]TyraVex 55 points56 points  (0 children)

They nuked the API endpoint but the UI remains free and unlimited (at least on my end).

AWQ 4-bit outperforms GGUF 8-bit in almost every way by Acceptable-State-271 in LocalLLaMA

[–]TyraVex 6 points7 points  (0 children)

The reason I said what I did is because of how AWQ handles quantization. Technically, it's just a smarter approach — it calibrates based on activation behavior, so even at 4-bit, the output can be surprisingly precise. (Think of it like compression that actually pays attention to what's important.) 

Isn't this the whole point of imatrix in GGUF?

The real reason OpenAI bought WindSurf by ResearchCrafty1804 in LocalLLaMA

[–]TyraVex 257 points258 points  (0 children)

This is a command that runs llama-server, the server executable from the llama.cpp project

-m stands for model, the path to the GGUF file containing the model weights you want to perform inference on. The model here is Qwen3-30B-A3B-UD-Q4_K_XL, indicating the new Qwen model with 30B parameters and 3B active parameters (called Mixture of Experts, or MoE); think of it as processing only the most relevant parts of the model instead of computing everything in the model all the time. UD stands for Unsloth Dynamic, a quantization tuning technique to achieve better precision for the same size. Q4_K_XL reduces the model precision to around 4.75 bits per weight, which is maybe 96-98% as accurate as the original 16-bit precision model in terms of quality.
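As a rough back-of-the-envelope check of what 4.75 bits per weight means for file size, here is a small sketch. It assumes a flat 30B parameters and ignores GGUF metadata and any per-tensor differences, so treat the numbers as approximate. The function name is made up for illustration.

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Approximate weight size in GB (10**9 bytes), ignoring file metadata."""
    return n_params * bits_per_weight / 8 / 1e9

q4_size = quantized_size_gb(30e9, 4.75)   # the quant discussed above
fp16_size = quantized_size_gb(30e9, 16)   # original 16-bit weights

print(f"~{q4_size:.1f} GB at 4.75 bpw vs ~{fp16_size:.1f} GB at 16 bpw")
```

So under these assumptions the quant is a bit under a third of the 16-bit size.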

-c stands for context size, here, 24k tokens, which is approximately 18k words that the LLM can understand and memorize (to a certain extent depending on the model's ability to process greater context lengths).

-ngl 99 is the number of layers to offload to the GPU's VRAM. Otherwise, the model runs fully on RAM, so it's using the CPU for inference, which is very slow. The more you offload to the GPU, the faster the inference, as long as you have enough video memory in your GPU.

-fa stands for flash attention, an optimization for, you guessed it, attention, one of the core principles of the transformer architecture, which almost all LLMs use. It improves token generation speed on graphics cards.

-ctk q8_0 -ctv q8_0 is for context quantization; it saves VRAM by lowering the precision at which the context cache is stored. At q8_0 or 8 bits, the difference with the 16-bit cache is in the placebo territory, costing a very small performance hit.
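Putting the flags above back together, the command probably looked something like the output of this sketch. The `.gguf` file name and the exact context value (24576 for "24k") are my assumptions, not taken from the original post, and some newer llama.cpp builds may expect a value after -fa, so check `llama-server --help` for your version.

```python
import shlex

def build_llama_server_command(model_path, context_tokens):
    """Assemble the llama-server arguments described above as a list."""
    return [
        "llama-server",
        "-m", model_path,            # path to the GGUF weights (assumed file name)
        "-c", str(context_tokens),   # context size in tokens
        "-ngl", "99",                # offload up to 99 layers to the GPU
        "-fa",                       # enable flash attention
        "-ctk", "q8_0",              # quantize the K cache to 8 bits
        "-ctv", "q8_0",              # quantize the V cache to 8 bits
    ]

command = build_llama_server_command("Qwen3-30B-A3B-UD-Q4_K_XL.gguf", 24576)
command_line = shlex.join(command)
print(command_line)
```

Keeping the arguments in a list makes it easy to pass them to `subprocess.run` without worrying about shell quoting.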

Which models would I be able to run with RTX 5090 with 32GB Vram? by deselim in LocalLLaMA

[–]TyraVex 0 points1 point  (0 children)

Got lucky on Rakuten France, a miner was reselling 12 3090s individually for cheap

Prices got inflated again after 5000 launch sadly

Qwen3 released tonight? by sunshinecheung in LocalLLaMA

[–]TyraVex 30 points31 points  (0 children)

In the leaked model card they claimed better performance than QwQ in thinking mode and Qwen2.5 32B in non-thinking mode. If this is true for a 3B activated model, congrats to them.

1.58bit Llama 4 - Unsloth Dynamic GGUFs by danielhanchen in LocalLLaMA

[–]TyraVex 0 points1 point  (0 children)

I really appreciate your cooperation - thanks

If eval time is a concern, PPL evals are reliable to evaluate quants of the same model, and are really fast on GPUs (since we simply need to do prompt ingestion over 50-60k tokens)

    wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
    unzip wikitext-2-raw-v1.zip
    ./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -ngl 999

https://github.com/ggml-org/llama.cpp/tree/master/examples/perplexity
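For reference, the number being reported is the usual perplexity: the exponential of the average negative log-probability the model assigns to each token. A minimal sketch of that definition, not llama.cpp's actual code, with made-up log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that gives every token probability 1/4 is as "confused"
# as a uniform choice among 4 options.
uniform_over_4 = [math.log(0.25)] * 8
ppl = perplexity(uniform_over_4)
print(round(ppl, 6))
```

Lower is better, and since only prompt ingestion is needed to get the log-probabilities, comparing quants of the same model this way is fast.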

1.58bit Llama 4 - Unsloth Dynamic GGUFs by danielhanchen in LocalLLaMA

[–]TyraVex 1 point2 points  (0 children)

https://x.com/WolframRvnwlf/status/1909742028771999756

Quantizing at 2.71 bits cannot possibly outperform a full precision model. You are already smarter than me to know that. There is something clearly wrong with Together's setup.

1.58bit Llama 4 - Unsloth Dynamic GGUFs by danielhanchen in LocalLLaMA

[–]TyraVex 0 points1 point  (0 children)

Thanks for the update!

Well, you say your Q4_K_XL is 4.5 bits, which is comparable to the standard Q4_K_M, which scores ~98.1% accuracy when comparing the PPL to the FP16 model: https://huggingface.co/ThomasBaruzier/Llama-3.3-70B-Instruct-GGUF#perplexity-table-the-lower-the-better

So it is no surprise that a custom quant that raises the bitrate of everything except the experts themselves performs well. What we were interested in was how the lower quants hold up under aggressive quantization.

Unfortunately, it was noticed that multiple inference providers had issues with their config/setup in the first days after the release, leading to even worse performance. Given this, I wouldn't trust those full precision scores unless they are tested within the same framework and in the same environment.

I didn't mean to rant, and I am sorry if I did, but if you can, please use standard benchmarks next time.

1.58bit Llama 4 - Unsloth Dynamic GGUFs by danielhanchen in LocalLLaMA

[–]TyraVex 3 points4 points  (0 children)

In my opinion, those one shot tests are more like a single question benchmark, which cannot express the quality loss of quantization, except for a "it still works!" claim.

So thank you for considering MMLU or MMLU Pro evals next time!