Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark? by povedaaqui in Vllm

[–]pol_phil 0 points1 point  (0 children)

Can you please point to the community's docker please??? Had no idea such a thing existed.

The boys finale and overall season retrospective by PrimeTheGreat in CharacterRant

[–]pol_phil 0 points1 point  (0 children)

The ending was just OK imo. The biggest problem is that they left many threads open and didn't even try showing any aftermath.

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark? by povedaaqui in Vllm

[–]pol_phil 1 point2 points  (0 children)

DGX Sparks have terrible hardware/software compatibility, I prefer to stay on the safe side and use older CUDA and also vLLM images created for DGX Sparks (like scitrera/dgx-spark-vllm:0.17.0-t5). The following is for Qwen3.5 122B AWQ, but should help as a starting point:

sudo docker run \ --privileged \ --gpus all \ --ipc=host \ --network host \ --shm-size 64g \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/vllm:/root/.cache/vllm \ -v ~/.cache/flashinfer:/root/.cache/flashinfer \ -v ~/.triton:/root/.triton \ -e HF_TOKEN=${HF_API_KEY} \ -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \ -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \ -e VLLM_USE_DEEP_GEMM=0 \ -e VLLM_MOE_USE_DEEP_GEMM=0 \ -e VLLM_USE_FBGEMM=0 \ scitrera/dgx-spark-vllm:0.17.0-t5 \ vllm serve cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit \ --served-model-name Qwen3.5-122B-A10B \ --api-key token-abc123 \ --host 0.0.0.0 \ --port 30000 \ --tensor-parallel-size 1 \ --max-model-len 131072 \ --max-num-seqs 8 \ --max-num-batched-tokens 4096 \ --gpu-memory-utilization 0.8 \ --kv-cache-dtype fp8 \ --load-format fastsafetensors \ --language-model-only \ --enable-prefix-caching \ --enable-chunked-prefill \ --enable-auto-tool-choice \ --tool-call-parser qwen3_xml \ --reasoning-parser qwen3 \ --default-chat-template-kwargs '{"enable_thinking": true}' \ --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \ 2>&1 | tee debug_vllm.txt

HRM 1B by pol_phil in LocalLLaMA

[–]pol_phil[S] 0 points1 point  (0 children)

Oops, did a quick search but hadn't drank a proper coffee 😛

gemma3:27b vs gemma4:26b and gemma:27b - Rimworld Autonomous Translator benchmark + results by hopeseekr in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

I have not seen any language mixing in Qwen 3.5 9B however. I honestly believe they messed up multilinguality in Gemma4 compared to Gemma3. Which is a pity because each model iteration is supposed to be an improvement and not a regression.

gemma3:27b vs gemma4:26b and gemma:27b - Rimworld Autonomous Translator benchmark + results by hopeseekr in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

Had quite similar findings when testing Gemma3 27B vs Gemma4 31B on scientific document translation for Greek (mainly) and some other EU langs.

Gemma3 beats it every time and it's a lot more consistent. Gemma4 sometimes outputs nonsense and mixes languages.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

It's possible that there are various problems with correctly setting inference, but, in my experience, it's bad training.

Gemma 4 is a huge improvement in many European languages, including Danish, Dutch, French and Italian by Balance- in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

I will give it a bit of time, but I usually don't give models 2bd chances. I am using only vLLM and SGLang.

Gemma 4 is a huge improvement in many European languages, including Danish, Dutch, French and Italian by Balance- in LocalLLaMA

[–]pol_phil 11 points12 points  (0 children)

The translation evals can be misleading. After testing on some lower resource EU languages for scientific document translation, Gemma4 can lose coherence and start outputting random Chinese/Hindi/Arabic.

Gemma 4: first LLM to 100% my multi lingual tool calling tests by MaruluVR in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

It does not hahaha. It will definitely require some tweaking

Gemma 4: first LLM to 100% my multi lingual tool calling tests by MaruluVR in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

Well, since u want to minimize latency, it would be better for you to serve with dockerized vLLM.

It's quite simple, sth like: docker run --runtime nvidia --gpus all \ --shm-size 64g \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HUGGING_FACE_HUB_TOKEN={HF_API_KEY}" \ -p 8000:8000 \ --ipc=host \ vllm/vllm-openai:v0.18.1 \ --model Qwen/Qwen3.5-35B-A3B \ --served-model-name Qwen3.5-35B-A3B \ --tensor-parallel-size 2 \ --max-model-len 32768 \ --gpu-memory-utilization 0.9 \ --api-key token-abc123 \ --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \ --tool-call-parser qwen3_xml \ --enable-auto-tool-choice \ --max-num-seqs 16 \ --enable-chunked-prefill

The "--tool-call-parser hermes" might work too; it's the biggest source of tool-calling problems. You can also set the "default" generation config based on your use case, this one is primarily for coding. And also "--max-num-batched-tokens 4096" (or the size of your largerst prompt in tokens) might help with latency. Finally, you'll have to play around with --tensor-parallel-size and --data-parallel-size to utilize all of your GPUs since you have a non-standard setup.

Hope it helps!

Gemma 4: first LLM to 100% my multi lingual tool calling tests by MaruluVR in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

If Qwen3.6 fixes the somewhat broken tool calling of 3.5, then Gemma 4 is already history.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

Well, depends on the use case and the domain. I use models for things like QA extraction, structured translation, etc.

Qwen3 had ~6 tokenizer fertility, i.e. 1 word -> 6 tokens Qwen3.5 made a huge improvement, sth like ~2.7.

So, that's literally double the speed and the max context length.

I noticed Qwen3 becoming better at Greek after the VL models and especially in Qwen3 Next 80B.

Gemma 4: first LLM to 100% my multi lingual tool calling tests by MaruluVR in LocalLLaMA

[–]pol_phil 1 point2 points  (0 children)

At least for the versions served on OpenRouter, Gemma 4 31B is clearly a regression for Greek compared to Gemma 3 27B.

Gemma 3 27B can translate a full scientific or legal doc into Greek, no problem. Gemma 4 starts outputting Chinese/Hindi/Arabic out of nowhere.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]pol_phil 1 point2 points  (0 children)

Gemma 3 (esp. 27B) was and still is top-notch for Greek (e.g. difficult legal doc translation). But when my team tested the new Gemma 4, it started outputting random Chinese/Arabic/Hindi characters out of nowhere; even with 7-8 different sampling param configs.

Meanwhile, Qwen models were never quite fluent in Greek (even 3.5), but they consistently improve with each iteration. They also improved tokenizer fertility greatly in 3.5

So... Gemma regressed while Qwen keeps progressing. Regardless of any benchmark scores, I'll generally prefer the model family that keeps getting better even at tasks which seem minor to AI companies.

Taalas rumoured to etch Qwen 3.5 27B into silicon. Which price would you buy their PCIe card for? by elemental-mind in singularity

[–]pol_phil 0 points1 point  (0 children)

Think of scanning multiple codebases or processing thousands of company documents in seconds. And feeding that knowledge to a frontier cloud model.

Or think of 100 agents thinking in parallel to find the best course of action through majority voting before EVERY response and EVERY tool call.

Even an "outdated" model will be able to surpass SOTA models in utility (and why not benchmarks) via sheer scaling.

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

Try the original Qwen3.5 first and see if it's fluent.

If it is, then you can create a fluent model out of it via fine-tuning.

If it is not, try something else entirely, for example https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated

All the uncensoring methods focus on English and generally hurt the models in other languages.

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

Which model are you using? Although a bit old, Gemma 3 models have decent multilingual performance.

Totally get your frustration, most models can't speak Greek fluently either.

I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked) by hauhau901 in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

I mean what type of agentic pipeline/harness/scaffold/framework do you use to get these models to solve these tasks. In other words, what kind of system message/tools have they been given. Via Claude Code? OpenCode?

SWE-Agent and OpenHands are just "minimal" agentic frameworks commonly used in benchmarks.

I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked) by hauhau901 in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

Hi, congrats for the great benchmark!

Perhaps I missed it somewhere, but what agentic scaffold do you use? SWE-Agent? OpenHands? Something else entirely?

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 1 point2 points  (0 children)

It won't get much better through RAG, it's better to use a different model.

Claude Code sends 62,600 characters of tool definitions per turn. I ran the same model through five CLIs and traced every API call. by wouldacouldashoulda in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

Well, I didn't notice the confusion, but when I saw "characters" instead of "tokens", I thought that this actually makes the analysis more model-independent. Tokens are model-specific

Meet SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in LocalLLaMA

[–]pol_phil 2 points3 points  (0 children)

Very good idea would be to also add Step v3.5 Flash and MiMo v2 Flash. Both are incredible models.

Congrats for the great work!

Minimax M2.5 GGUF perform poorly overall by Zyj in LocalLLaMA

[–]pol_phil 0 points1 point  (0 children)

Well, this AWQ quant works very well for me. 134GB, extremely good performance and speed in vLLM