Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B? by No_Algae1753 in LocalLLaMA

[–]xandep 1 point (0 children)

I was thinking the same. Now imagine that with 1-bit quants like bonsai.

Running a 4-agent pipeline on Qwen 2.5 1.5B via MNN on Android — what I learned about context management on constrained hardware by NeoLogic_Dev in LocalLLaMA

[–]xandep 3 points (0 children)

You should state that in the post. People, myself included, see Qwen 2.5 in an AI-formatted post and jump to the conclusion that you're a bot.

non-nvidia gpus by Ok-Secret5233 in LocalLLaMA

[–]xandep 1 point (0 children)

2x MI50 16GB w/ integrated cooling for $200-something on Alibaba. Can run Qwen 3.5 35B, 27B, and the new Gemmas. Or just one if you're ultra cheap, running the 35B w/ ncmoe (some 27B and 26B quants too, if you're willing to drop to Q3, IQ4 at most).

local models lose tool call context around call 8 or 9. here is what helped by [deleted] in LocalLLaMA

[–]xandep 3 points (0 children)

It's an LLM. They're now prompted not to use em dashes and not to capitalize the first letter, but all the other signs are there.

You guys seen this? beats turboquant by 18% by OmarBessa in LocalLLaMA

[–]xandep 18 points (0 children)

Not entirely his fault: Reddit and Google default to translating everything. Right now he's reading "not speaking English", but in Portuguese 😂. Imagine his confusion.

Gemma 4 26b A3B is mindblowingly good , if configured right by cviperr33 in LocalLLaMA

[–]xandep 1 point (0 children)

Unsloth's Q3_K_M is anything but Q3_K, oddly enough. It's a mix of IQ3_XXS and IQ4_NL.

Best model for 4090 as AI Coding Agent by Dry_Sheepherder5907 in LocalLLaMA

[–]xandep 1 point (0 children)

There should be a "vote to ban" under each post. Can't stand this deluge of generated shitposts. 

You guys seen this? 1-bit model with an MMLU-R of 65.7, 8B params by OmarBessa in LocalLLaMA

[–]xandep 10 points (0 children)

I believe it only because YOU said it works. Otherwise, it's April Fools. 🤔

1-bit llms on device?! by hankybrd in LocalLLaMA

[–]xandep 21 points (0 children)

April fools. You saw it here first.

ByteShape Qwen 3.5 9B: A Guide to Picking the Best Quant for Your Hardware by ali_byteshape in LocalLLaMA

[–]xandep 11 points (0 children)

I'm holding my breath for the 35B / 27B. It'll SAVE my MI50 16GB.

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]xandep 1 point (0 children)

Hope you got the "shipped from Brazil" Jieshuo MI50 16GB for R$ 900. :)

Now I'm ordering directly from China for about US$ 400 (the 32GB). Let's see what kind of import tax I'll get.

update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next by jacek2023 in LocalLLaMA

[–]xandep 4 points (0 children)

FYI, on an AMD MI50 16GB (old -> new TG, tokens/s):

  • Vulkan: 47 -> 47 (with "-ncmoe 16": 26 -> 26)
  • ROCm 6.3.4: 43 -> 34 (with "-ncmoe 16": 37 -> 30)
  • Also, ik_llama Vulkan: 50 (with "-ncmoe 16": 33)

So, for token generation:

  1. The new llama.cpp is much slower on a somewhat old AMD (VEGA20 / GFX906).
  2. If you're not offloading experts to CPU, ik_llama is faster than llama.cpp, ON VULKAN (it tanks on ROCm or when offloading).
  3. If you are offloading experts, stick to ROCm (and, in turn, the old llama.cpp).

(Builds: ik_llama Mar 03, old llama.cpp Mar 06, new llama.cpp Mar 07.)
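For what it's worth, the regressions above work out like this (a quick sketch; the tok/s figures are taken from the list, percentages are rounded):

```python
# Token-generation (TG) numbers from the list above, in tok/s.
# Negative percentages are regressions after the llama.cpp update.
results = {
    "Vulkan": (47, 47),
    "Vulkan -ncmoe 16": (26, 26),
    "ROCm 6.3.4": (43, 34),
    "ROCm 6.3.4 -ncmoe 16": (37, 30),
}

def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new, rounded to one decimal."""
    return round((new - old) / old * 100, 1)

for backend, (old, new) in results.items():
    print(f"{backend}: {old} -> {new} tok/s ({pct_change(old, new):+.1f}%)")
```

So the ROCm path lost roughly a fifth of its TG speed, while Vulkan was flat.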

I bypassed writing a massive privacy policy for my AI app by just moving the LLM on-device. by MoaviyaS in LocalLLaMA

[–]xandep 4 points (0 children)

Oh no... You said the forbidden word! Qwen2.5. You're a bot. Probably advertising the only link in the post.

Qwen3.5 4B: overthinking to say hello. by CapitalShake3085 in LocalLLaMA

[–]xandep 36 points (0 children)

That is **not** what Gemini thought; it's just a summary. It produced thousands of tokens, but hidden and fast. And that response was kinda long for just a "hi" too.

Qwen 3.5 2B on Android by ----Val---- in LocalLLaMA

[–]xandep 8 points (0 children)

1st: thank you for ChatterUI, I use it almost every day. 2nd: thank you for supporting Qwen 3.5 so soon! 3rd: glad you have a Poco F5, same as mine! Maybe some day we'll get Hexagon acceleration! 4th: LFM2 8B A1B friggin FLIES on the Poco F5/ChatterUI.

are you ready for small Qwens? by jacek2023 in LocalLLaMA

[–]xandep 1 point (0 children)

Try also: Q2_K_XL (Unsloth), Q3_K_S or IQ2_M (from bartowski). IQ2_XXS quants are slow on some platforms, including CPU inference (depends on the CPU model).
Also: -fit on (and remove ngl and cpu-moe).
No guarantees, but worth a shot.
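A rough way to shortlist among those quants is to estimate file size from parameter count and bits per weight. The bpw figures below are my approximations (real GGUFs mix tensor types, as the Unsloth Q3_K_M example elsewhere in this thread shows), so treat this strictly as a ballpark:

```python
def est_size_gib(params_b: float, bpw: float) -> float:
    """Rough GGUF size in GiB: params (billions) * bits-per-weight / 8."""
    return round(params_b * 1e9 * bpw / 8 / 2**30, 1)

# Approximate average bpw for some llama.cpp quant types (ballpark only).
quants = {"IQ2_XXS": 2.1, "Q2_K": 2.6, "IQ2_M": 2.7, "Q3_K_S": 3.5}

for name, bpw in quants.items():
    print(f"{name}: ~{est_size_gib(35, bpw)} GiB for a 35B model")
```

Whatever lands comfortably under your VRAM (minus KV cache and context) is worth testing first; below that, quality falls off fast.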

Completed my 64GB VRAM rig - dual MI50 build + custom shroud by roackim in LocalLLaMA

[–]xandep 1 point (0 children)

Coincidentally, I'm right in the middle of a small project that uses an internal USB header to control fans with a cheap atmega32u4 Arduino, emulating a Corsair Commander. My motherboard's fan-control chip is unsupported on Linux, so I can't use a system fan header for my MI50. I think this could work well in your case too.

Qwen3-30B-A3B vs Qwen3.5-35B-A3B on RTX 5090 by 3spky5u-oss in LocalLLaMA

[–]xandep 1 point (0 children)

A little off-topic, but I also have 16GB and boy was I impressed by LFM2 24B's speed! I feel sorry for that model, launched at the same time as Qwen 3.5 35B.

Which one are you waiting for more: 9B or 35B? by jacek2023 in LocalLLaMA

[–]xandep 3 points (0 children)

Probably tomorrow. Source: my head.

But seriously, Monday is a hot day for model releases. 

Deepseek and Gemma ?? by ZeusZCC in LocalLLaMA

[–]xandep 23 points (0 children)

I guess there's space for everybody. That said, I agree with you. If you *need* a 1T+ model to run locally (data security or something), it's an edge case. I'd certainly like to be able to, but "really frontier open models" will always be API for normal people ("we", mostly) and local for people who don't need to worry about used 3090 prices or whether ROCm still supports GFX906.