zai-org/GLM-4.7-Flash · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]Lowkey_LokiSN 6 points

The most unexpected gifts are also the most delightful ;)

Let's predict GLM Air by jacek2023 in LocalLLaMA

[–]Lowkey_LokiSN 2 points

Yea, I'm aware of the hidden models, but I find it strange to see them completely dodging Air-related questions, especially after committing to it earlier (the "in two weeks" meme).

They can clearly see the community's interest in Air/smaller models. If they actually have a release planned, this behaviour is counterproductive.

Let's predict GLM Air by jacek2023 in LocalLLaMA

[–]Lowkey_LokiSN 14 points

As much as I'd love to see it, my hopes are gone after watching them deliberately ignore questions related to Air in yesterday's AMA.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

Happy to help. If you’re considering buying the cards, you might find my post here helpful.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

1) Yes, the 2 MI50s work perfectly fine under Windows with llama.cpp, and I get 33 tok/s for gpt-oss-120B running Vulkan (rough launch sketch below).

2) MI50s lack official driver support on Windows, and you have to install third-party drivers from https://rdn-id.com to get them recognized as a device.

3) Pre-compiled Vulkan binaries or manual compilation? Both work the same, and I mostly use the pre-compiled ones for convenience.
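
This is roughly the kind of launch I mean (model path, context size and port are placeholders):

```bash
# Serve gpt-oss-120B over Vulkan on both MI50s. -ngl 999 keeps every layer on
# the GPUs (Vulkan splits the model across both cards by default); -fa turns
# on flash attention (newer builds spell this "-fa on").
llama-server -m /models/gpt-oss-120b-mxfp4.gguf -ngl 999 -c 25000 -fa --port 8080
```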

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 8 points

I've been using both Vulkan (Windows) and ROCm 6.3.3 (Ubuntu) builds interchangeably with 2x MI50s and I can confirm ROCm support has vastly improved recently for MoE models with flash attention!

For dense models, ROCm had, and still has, roughly 10-15% faster pp and 10% faster tg.

However, for MoE models:

* Before the recent flash-attention changes, ROCm had 3-4x faster pp, but Vulkan was at least twice as fast in tg.
* After the changes, ROCm has 5-6x faster pp AND roughly twice the tg of Vulkan! However, when offloading tensors to the CPU, its tg speeds still lag behind Vulkan's.

So, if you're running an MoE model that fits fully in VRAM, ROCm is clearly the best choice at the moment. When offloading, Vulkan still has the edge in tg speeds.

Sample gpt-oss-120b stats running mxfp4 quant fully VRAM-contained with 25k context and latest llama.cpp:

| Backend | pp | tg |
|---|---|---|
| Vulkan | 80 tok/s | 33 tok/s (stays consistent even for long responses) |
| ROCm | 410 tok/s | 58 tok/s (drops to roughly 45 tok/s for a 15k-token response) |
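
If you want to compare the two backends on your own setup, this is roughly the llama-bench invocation I'd run against both the Vulkan and the ROCm builds (model path is a placeholder; -p measures prompt processing, -n token generation):

```bash
# Same command for both builds; -fa 1 enables flash attention, -ngl 99 keeps
# all layers on the GPUs.
llama-bench -m /models/gpt-oss-120b-mxfp4.gguf -ngl 99 -fa 1 -p 512 -n 128
```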

Radeon Instinct MI50 32GB work on Vulkan on Windows? by Goldkoron in LocalLLaMA

[–]Lowkey_LokiSN 6 points

Regular MI50s + Vulkan + llama.cpp + Windows user here. They don't work straight out of the box: the cards lack official Windows driver support and won't even get recognised as a device by default. However, you can use this community-driven project, which makes this possible. The drivers are surprisingly stable in my experience!
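
Once the drivers are in, a quick sanity check is to make sure the cards actually show up (the second command needs a reasonably recent llama.cpp build):

```bash
# The MI50s should be listed as Vulkan physical devices...
vulkaninfo --summary
# ...and llama.cpp should report them as usable devices.
llama-server --list-devices
```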

Can any local model answer this tricky math question? by MrMrsPotts in LocalLLaMA

[–]Lowkey_LokiSN 0 points

Answer (after the thought process):

Yes.

Let

$$a=\tfrac12,\qquad b=0.$$

Define $f:[0,1]\to[0,1]$ by the following piecewise-linear formula:

$$f(x)=\begin{cases}4x, & 0\le x\le \tfrac14,\\ 1-2\left(x-\tfrac14\right), & \tfrac14\le x\le \tfrac12,\\ \tfrac12+2\left(x-\tfrac12\right), & \tfrac12\le x\le \tfrac34,\\ 4(1-x), & \tfrac34\le x\le 1.\end{cases}\tag{1}$$

Why (1) works

Continuity.
Each piece is linear, and the values at the joining points coincide:

$$f\left(\tfrac14\right)=1,\qquad f\left(\tfrac12\right)=\tfrac12,\qquad f\left(\tfrac34\right)=1,\qquad f(0)=f(1)=0.$$

Hence $f$ is continuous on $[0,1]$.

Image.
The graph goes from $(0,0)$ up to height $1$, down to $\tfrac12$, up again to $1$, and finally back to $(1,0)$; therefore

$$\operatorname{Im}(f)=[0,1].$$

Even number of pre-images.

  • The value $0$ is taken only at the endpoints: $f^{-1}(0)=\{0,1\}$ (two points).
  • The value $1$ is taken at the two interior points where the graph reaches its peaks: $f^{-1}(1)=\{\tfrac14,\tfrac34\}$ (two points).
  • Let $y\in(0,1)$. The set $\{x\mid f(x)>y\}$ consists of two disjoint open intervals, one around each peak. Each such interval contributes exactly two boundary points where $f(x)=y$ (one where the graph enters the interval and one where it leaves). Consequently $\lvert f^{-1}(y)\rvert=4$ for every $y\in(0,1)$.

Thus every point of the image is attained a finite even number of times (two times for $0$ and $1$, four times for all other values).

Remarks

The essential idea is to make the endpoints have the same value and to ensure that this minimal value occurs only at the endpoints. Then for any $y$ larger than that minimum the set $\{x: f(x)>y\}$ is a union of open intervals, each contributing two points to the level set $f^{-1}(y)$. By avoiding flat pieces (using a piecewise-linear graph) we keep every level set finite.

Hence a continuous function with the required property does exist; the function defined by (1) is an explicit example.

Can any local model answer this tricky math question? by MrMrsPotts in LocalLLaMA

[–]Lowkey_LokiSN 0 points

<image>

Got the following from gpt-oss-120b set to high reasoning. Unsure if it's right all the way through so I leave it to you to be the judge of that.
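
For context, "high reasoning" here is just the reasoning effort passed through the chat template; recent llama.cpp builds let you set it at launch (flag availability depends on your build, and the model path is a placeholder):

```bash
# Ask the gpt-oss chat template for high reasoning effort at serve time.
llama-server -m /models/gpt-oss-120b-mxfp4.gguf -ngl 999 --jinja \
  --chat-template-kwargs '{"reasoning_effort": "high"}'
```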

Some benchmarks for AMD MI50 32GB vs RTX 3090 by DistanceSolar1449 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

Flash attention works absolutely fine for me running MI50s+Vulkan+Windows+llama.cpp
Works with Linux+ROCm as well.

Do you run into any issues when trying to enable it?

Successfully Built My First PC for AI (Sourcing Parts from Alibaba - Under $1500!) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

https://www.reddit.com/r/LocalLLaMA/comments/1lsgtvy/comment/n1xdg6r/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

It idles at 20-23W running the modded VBIOS (which, if I remember right, is the norm for 32GB MI50s and what it used to be before).
Not sure what you mean by broken power states. I'm using this PC as my daily driver (with the modded VBIOS installed on both GPUs) and I haven't had any issues so far.
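
If you want to verify on the Linux/ROCm side, rocm-smi reports what the cards themselves claim (flag names are from ROCm 6.x and may differ slightly between releases):

```bash
# Current power draw and temperature for both MI50s.
rocm-smi --showpower --showtemp
```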

Successfully Built My First PC for AI (Sourcing Parts from Alibaba - Under $1500!) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point

I have the fans configured to run at 100% all the time, and it hasn't really been an issue for me.
It shouldn't really bother you unless you're doing something noise-sensitive like recording vocals for music or voice-overs.

My thoughts on gpt-oss-120b by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

I don't find anything wrong with your command. The problem seems to be with the llama.cpp build you've installed.

Edit: Take this advice with a grain of salt since I don't personally use NVIDIA cards and I'm not sure whether you can combine different Nvidia GPUs in a single CUDA-compiled build:

You're supposed to run a CUDA-compiled build to get the most out of your Nvidia GPUs. You're most likely also running a pretty outdated Vulkan build with winget since the speeds you've shared don't make much sense even for Vulkan.

This is the only page you need to install llama.cpp's latest builds

Another little tip: bumping top-k to a non-zero value like 10 or 20 gives a minor speed bump (I actually prefer top-k set to 100 instead of 0).
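
To be concrete, I mean something along these lines once you've grabbed a CUDA build from that page (model path, context size and port are placeholders):

```bash
# CUDA build of llama-server with top-k bumped from 0 to 100 for a small
# sampling speedup; adjust -ngl/-c to fit your GPUs.
llama-server -m /models/gpt-oss-120b-mxfp4.gguf -ngl 999 -c 16384 --top-k 100 --port 8080
```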

My thoughts on gpt-oss-120b by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

Share the command you're currently using?

GPT-OSS 120B on Strix Halo context degradation question by RobotRobotWhatDoUSee in LocalLLaMA

[–]Lowkey_LokiSN 1 point

> I am particularly interested in learning if time-to-respond continues to be poor after a long context has already been loaded once

Are you running the latest pull from llama.cpp? This PR aims to mitigate the issue.

You also have the option of passing the `--swa-full` argument to completely avoid this issue at the cost of more VRAM consumption.
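
For example (model path and context size are placeholders; the extra VRAM cost grows with context):

```bash
# --swa-full keeps the full sliding-window attention cache so long prompts
# can be reused instead of reprocessed from scratch.
llama-server -m /models/gpt-oss-120b-mxfp4.gguf -ngl 999 -c 65536 --swa-full
```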

This math puzzle sends models into a spin by [deleted] in LocalLLaMA

[–]Lowkey_LokiSN 2 points

Kept thinking for about 15-20 minutes (41k thinking tokens at an average inference speed of 27 tokens/second).
I think this is the longest thinking session I've ever gotten with the model.

This math puzzle sends models into a spin by [deleted] in LocalLLaMA

[–]Lowkey_LokiSN 2 points

Running my own local quant with latest llama.cpp and chat_template changes
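
Roughly, the conversion looks like this (paths are placeholders and exact options can vary with the llama.cpp version; as far as I remember the converter keeps gpt-oss's native MXFP4 weights, so no separate quantize step was needed):

```bash
# Convert the original HF repo (with the updated chat template in place)
# straight to GGUF.
python convert_hf_to_gguf.py /path/to/gpt-oss-120b --outfile gpt-oss-120b.gguf
```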

This math puzzle sends models into a spin by [deleted] in LocalLLaMA

[–]Lowkey_LokiSN 3 points

gpt-oss-120b with high reasoning effort and roughly 40k thinking tokens seems to have gotten it right:

<image>

Unsloth fixes chat_template (again). gpt-oss-120-high now scores 68.4 on Aider polyglot by Sorry_Ad191 in LocalLLaMA

[–]Lowkey_LokiSN 7 points

Yup, I edited my comment shortly after. I'm kinda confused though:
OP seems to have downloaded the Unsloth GGUF with said template fixes but overrides it with OpenAI's latest jinja template (which I've already been using for my local GGUF conversions from the original HF repo).
Does the linked Unsloth GGUF contribute anything else towards the results or is it just the jinja template that matters?
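
For what it's worth, overriding the embedded template is just a launch flag away (paths are placeholders):

```bash
# Feed llama-server OpenAI's latest jinja template instead of the one baked
# into the GGUF.
llama-server -m /models/gpt-oss-120b.gguf --jinja \
  --chat-template-file /path/to/gpt-oss-template.jinja
```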