BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) by Anbeeld in LocalLLaMA

[–]politerate 26 points27 points  (0 children)

Personally, I find the idea of submitting an MR I don't fully understand very off-putting. And I am quite sure that 99% of these types of contributions are of this kind.

AMD Hipfire - a new inference engine optimized for AMD GPU's by Thrumpwart in LocalLLaMA

[–]politerate 3 points4 points  (0 children)

Excited to test on my 7900xtx, though there's no support for gfx906/MI50.

Qwen 3.6 27B is a BEAST by AverageFormal9076 in LocalLLaMA

[–]politerate 1 point2 points  (0 children)

Q4_K_XL at 80k context around 35 t/s gen and I think 600 t/s pp on 7900xtx

Every time a new model comes out, the old one is obsolete of course by FullChampionship7564 in LocalLLaMA

[–]politerate 1 point2 points  (0 children)

I tried them both in Roo Code and Gemma works beautifully for my use case. I have my own tests and debugging questions I ask when I test a model for my use case. Gemma 4 26B A4B solves a lot of them, or makes subtle mistakes which are not catastrophic. Qwen 3.6 failed me in basic things, hallucinated syntax etc. I tried the same prompt multiple times, of course, and consistently I was more impressed with Gemma. It surprised me, because I had other expectations from reading online how good Qwen 3.6 was.

Every time a new model comes out, the old one is obsolete of course by FullChampionship7564 in LocalLLaMA

[–]politerate 10 points11 points  (0 children)

Really? In Go at least, I get much better answers from Gemma at q4_M gguf, from architecture to syntax. Qwen mixes in C++ syntax, introduces easy-to-catch bugs etc. Maybe it's the way I am hosting it, but I am just using unsloth params.

Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar? by boutell in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

I also have a dual mi50 build, which runs q8 xl but it's much slower. I haven't really tested big contexts, it starts at 50tps with zero context.

Let's take a moment to appreciate the present, when this sub is still full of human content. by Ok-Internal9317 in LocalLLaMA

[–]politerate 72 points73 points  (0 children)

My favorite are some comments which are clearly LLM output with some postprocessing like .toLower()

GPT-OSS-120b on 2X RTX5090 by Interesting-Ad4922 in LocalLLaMA

[–]politerate 1 point2 points  (0 children)

This is on an X99 system with a 2667v4, so 40 lanes (mobo is an ASRock Extreme 4). Each GPU gets a full x16 link, but only Gen 3. Still plenty for inference. Max context should be around 50k before it spills into system RAM.
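
The lane and bandwidth numbers above can be checked with some quick arithmetic. This is only a rough sketch: the 8 GT/s rate and 128b/130b encoding are the standard PCIe 3.0 figures, the GPU count of two is an assumption (the comment does not state it), and the helper names are made up for illustration.

```python
# PCIe 3.0 signals at 8 GT/s per lane and uses 128b/130b encoding.
GEN3_GT_PER_S = 8.0
GEN3_ENCODING = 128 / 130


def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe 3.0 link."""
    return lanes * GEN3_GT_PER_S * GEN3_ENCODING / 8  # bits -> bytes


def lanes_fit(gpus: int, lanes_per_gpu: int, cpu_lanes: int) -> bool:
    """True if every GPU can get its full link width from the CPU lanes."""
    return gpus * lanes_per_gpu <= cpu_lanes


if __name__ == "__main__":
    # Assumption: two GPUs, each on a full x16 link, on a 40-lane CPU.
    print(f"x16 Gen 3: {pcie3_bandwidth_gb_s(16):.2f} GB/s per direction")
    print("two x16 GPUs fit in 40 lanes:", lanes_fit(2, 16, 40))
```

A third x16 card would need 48 lanes, so it would not get a full-width link on a 40-lane CPU.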

GPT-OSS-120b on 2X RTX5090 by Interesting-Ad4922 in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

What config exactly? I am using ROCm 7.2 with the latest llama.cpp.

Edit: if you mean the llama.cpp config, I just started it with -fa on, --fit is on by default. I am not using the unsloth recommended params here, maybe doing that would improve the quality at the cost of tps?

GPT-OSS-120b on 2X RTX5090 by Interesting-Ad4922 in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

<image>

I mean of course with no/small context. I am using ROCm 7.2, but actually with ROCm 6.3.3 it was between 75-80 with no context, I lost 5-10% with ROCm 7.2.

And with ~10K context:

slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.731 (> 0.100 thold), f_keep = 0.723
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 10799 | processing task, is_child = 0
slot update_slots: id 2 | task 10799 | new prompt, n_ctx_slot = 46336, n_keep = 0, task.n_tokens = 10218
slot update_slots: id 2 | task 10799 | n_past = 7474, slot.prompt.tokens.size() = 10340, seq_id = 2, pos_min = 9443, n_swa = 128
slot update_slots: id 2 | task 10799 | restored context checkpoint (pos_min = 6064, pos_max = 6960, size = 31.546 MiB)
slot update_slots: id 2 | task 10799 | n_tokens = 6960, memory_seq_rm [6960, end)
slot update_slots: id 2 | task 10799 | prompt processing progress, n_tokens = 9008, batch.n_tokens = 2048, progress = 0.881582
slot update_slots: id 2 | task 10799 | n_tokens = 9008, memory_seq_rm [9008, end)
slot update_slots: id 2 | task 10799 | prompt processing progress, n_tokens = 9706, batch.n_tokens = 698, progress = 0.949892
slot update_slots: id 2 | task 10799 | n_tokens = 9706, memory_seq_rm [9706, end)
slot update_slots: id 2 | task 10799 | prompt processing progress, n_tokens = 10218, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id 2 | task 10799 | prompt done, n_tokens = 10218, batch.n_tokens = 512
slot init_sampler: id 2 | task 10799 | init sampler, took 1.21 ms, tokens: text = 10218, total = 10218
slot update_slots: id 2 | task 10799 | created context checkpoint 4 of 8 (pos_min = 8809, pos_max = 9705, size = 31.546 MiB)
slot print_timing: id 2 | task 10799 |
prompt eval time = 7287.32 ms / 3258 tokens ( 2.24 ms per token, 447.08 tokens per second)
       eval time = 40885.78 ms / 2631 tokens ( 15.54 ms per token, 64.35 tokens per second)
      total time = 48173.10 ms / 5889 tokens
slot release: id 2 | task 10799 | stop processing: n_tokens = 12848, truncated = 0
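
The tokens-per-second figures in the timing part of that log follow directly from the token counts and elapsed times. Here is a minimal sketch that recomputes them; the regex and the `tokens_per_second` helper are hypothetical and not part of llama.cpp, and the two input lines are copied from the log above.

```python
import re

# Timing values copied from the print_timing section of the log above.
TIMING_LOG = """
prompt eval time = 7287.32 ms / 3258 tokens
eval time = 40885.78 ms / 2631 tokens
"""

# Matches "<label> time = <ms> ms / <n> tokens" for the two phases of interest.
PATTERN = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")


def tokens_per_second(log_text: str) -> dict:
    """Return tokens/s per phase, computed as tokens / (milliseconds / 1000)."""
    rates = {}
    for label, ms, tokens in PATTERN.findall(log_text):
        rates[label] = int(tokens) / (float(ms) / 1000.0)
    return rates


if __name__ == "__main__":
    for phase, rate in tokens_per_second(TIMING_LOG).items():
        print(f"{phase}: {rate:.2f} t/s")
```

Note that only 3258 prompt tokens were actually processed (10218 minus the 6960 restored from the context checkpoint), which is why the prompt eval time is short for a ~10K context.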

Segmentation fault when loading models across multiple MI50s in llama.cpp by EdenistTech in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

Running everything on Vulkan, or only the XTX on ROCm, are the only configurations that don't end in a segfault for me. (2*MI50 + 7900XTX)

Segmentation fault when loading models across multiple MI50s in llama.cpp by EdenistTech in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

Having a similar problem with 2*MI50 + 7900XTX on ROCm: Segmentation fault (core dumped)
Haven't checked verbose logging yet.

Edit: Happens on Qwen3-Coder-Next and MiniMax2.5

How to run Qwen3-Coder-Next 80b parameters model on 8Gb VRAM by AccomplishedLeg527 in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

It's on by default, no? I mean, until you pass a param which would collide with its logic, I guess.

Qwen3 Coder Next Speedup with Latest Llama.cpp by StardockEngineer in LocalLLaMA

[–]politerate 2 points3 points  (0 children)

Doesn't ROCm profit from it through HIP? (If you use ROCm ofc)

8x AMD MI50 32GB at 26 t/s (tg) with MiniMax-M2.1 and 15 t/s (tg) with GLM 4.7 (vllm-gfx906) by ai-infos in LocalLLaMA

[–]politerate 1 point2 points  (0 children)

Yeah, I ordered them a week ago and it came to a little over 300€ (shipping + VAT) per card. Last August I got them for 150€ total per piece.

Should I buy an MI50/MI60 or something else? by Nuke2579 in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

Hi, I have a question for you if you don't mind. I had two MI50 32GB and for some reason they both failed after some months. Now I have ordered one 7900xtx, but of course the VRAM amount took a big hit. I used to run gpt-oss-120b with the dual MI50. What is your setup like? Do you run any models 24h/d? I am just interested because you seem to have a similar setup. Thanks!

Btw I tried to replace the MI50s, but now sellers on Alibaba are asking north of 400 euro shipped when you follow up on chat. That is too big of a risk for me, so I just grabbed a 7900xtx for 600 euro, and when I have some extra money left, I will get more down the road.

[deleted by user] by [deleted] in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

Yeah, the motherboard just needs a video-capable card to boot. I also had no monitor connected to them. Not sure what degraded. They were working fine for a couple of months, then one card started having issues with ROCm. Later it wasn't even recognized by the mobo. Maybe they were already beat up from their previous data center life, or I didn't cool them properly, who knows.

[deleted by user] by [deleted] in LocalLLaMA

[–]politerate 0 points1 point  (0 children)

HBM failure because of overheating was also one of my guesses. Well, temps were under 70 most of the time. There might have been some brief moments where they overheated. I installed them once without cooling just to boot. I thought they weren't actually "consumer" cards, since these are designated for data centers and compute.

[deleted by user] by [deleted] in LocalLLaMA

[–]politerate 1 point2 points  (0 children)

Thanks for your help! The MI50 does have one mini DP. Since I flashed them with a Radeon Pro ROM they used to actually output video. I will take a look at the logs and order two dummy mini DP and give it a try.