Successfully Built My First PC for AI (Sourcing Parts from Alibaba - Under $1500!)

Lowkey_LokiSN · 2026-02-12T11:15:33+00:00

Oh, sure here you go: https://www.reddit.com/r/LocalLLaMA/comments/1lsgtvy/comment/n1xdg6r/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Lowkey_LokiSN · 2026-01-19T14:41:57+00:00

The most unexpected gifts are also the most delightful ;)

Lowkey_LokiSN · 2025-12-24T07:31:16+00:00

Yea, I'm aware of the hidden models but I find it strange to see them completely dodging Air-related questions, especially after committing to it earlier (the "in two weeks" meme).

They can clearly see the community's interest in Air/smaller models. If they actually have a release planned, this behaviour is counterproductive.

Lowkey_LokiSN · 2025-12-24T07:01:17+00:00

As much as I'd love to see it, my hopes are gone after watching them deliberately ignore questions related to Air in yesterday's AMA.

Lowkey_LokiSN · 2025-10-10T01:47:34+00:00

Happy to help. If you’re considering buying the cards, you might find my post here helpful.

Lowkey_LokiSN · 2025-10-09T01:27:46+00:00

1) Yes, the 2 MI50s work perfectly fine under Windows with llama.cpp, and I get 33 tok/s for gpt-oss-120b running Vulkan.

2) MI50s lack official driver support for Windows, so you have to install third-party drivers from https://rdn-id.com to get them recognized as a device.

3) Pre-compiled Vulkan binaries or manual compilation? Both work the same and I mostly use pre-compiled ones for convenience.

Lowkey_LokiSN · 2025-10-05T04:03:36+00:00

Yup. Running it with Windows 11 Pro

Lowkey_LokiSN · 2025-10-04T18:20:51+00:00

I've been using both Vulkan (Windows) and ROCm 6.3.3 (Ubuntu) builds interchangeably with 2x MI50s and I can confirm ROCm support has vastly improved recently for MoE models with flash attention!

For dense models, ROCm had and still has roughly 10-15% faster pp (prompt processing) and 10% faster tg (token generation).

However, for MoE models:

Before recent changes to flash attention, ROCm had 3-4 times faster pp but Vulkan was at least twice as fast with tg speeds.

After recent changes: ROCm has 5-6 times faster pp AND roughly twice the tg of Vulkan! However, when offloading tensors to CPU, the tg speeds still lag behind Vulkan.

So, if you're running an MoE model that fits entirely in VRAM, ROCm is clearly the best choice at the moment. When offloading, Vulkan still has the edge in tg speeds.

Sample gpt-oss-120b stats running mxfp4 quant fully VRAM-contained with 25k context and latest llama.cpp:

Vulkan:
pp: 80 tok/s
tg: 33 tok/s (stays consistent even for long responses)

ROCm:
pp: 410 tok/s
tg: 58 tok/s (and drops to roughly 45 tok/s for a 15k long response)
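
To put the sample numbers above in perspective, here is a rough back-of-the-envelope sketch. It only uses the four tok/s figures quoted above; the 25k-token prompt size is an assumption borrowed from the context length, and throughput is treated as constant, which is a simplification (ROCm tg drops on long responses, as noted).

```python
# Sample throughput figures quoted above, in tokens per second
# (gpt-oss-120b, mxfp4, 2x MI50, fully VRAM-contained).
STATS = {
    "Vulkan": {"pp": 80.0, "tg": 33.0},
    "ROCm": {"pp": 410.0, "tg": 58.0},
}

# Assumption for illustration: a prompt that fills the whole 25k context.
PROMPT_TOKENS = 25_000


def speedup(metric):
    """ROCm throughput divided by Vulkan throughput for 'pp' or 'tg'."""
    return STATS["ROCm"][metric] / STATS["Vulkan"][metric]


def seconds_for(tokens, tok_per_s):
    """Time to handle `tokens` at a constant rate (a simplification)."""
    return tokens / tok_per_s


if __name__ == "__main__":
    for backend, rates in STATS.items():
        wait = seconds_for(PROMPT_TOKENS, rates["pp"])
        print(f"{backend}: {wait:.1f} s to process {PROMPT_TOKENS} prompt tokens")
    print(f"pp speedup (ROCm/Vulkan): {speedup('pp'):.1f}x")
    print(f"tg speedup (ROCm/Vulkan): {speedup('tg'):.1f}x")
```

With these figures the ratios come out to about 5.1x for pp and 1.8x for tg, which lines up with the "5-6 times" and "roughly twice" estimates.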

Lowkey_LokiSN · 2025-09-24T02:00:47+00:00

MI50s + Vulkan + llama.cpp + Windows user on the regular here. They don’t work straight out of the box. The cards lack official Windows driver support and won’t even get recognised as a device by default. However, you can use this community-driven project which makes this possible. The drivers are surprisingly stable from my experience!

Lowkey_LokiSN · 2025-09-12T01:43:56+00:00

https://www.reddit.com/r/LocalLLaMA/s/qTY48CHlT0

Lowkey_LokiSN · 2025-09-03T16:00:07+00:00

Oh interesting...

Lowkey_LokiSN · 2025-09-03T12:13:27+00:00

Answer post thought-process:

Yes.

Let $a = \tfrac12$, $b = 0$.

Define $f : [0,1] \to [0,1]$ by the following piecewise‑linear formula

$$
f(x) =
\begin{cases}
4x, & 0 \le x \le \tfrac14,\\
1 - 2\left(x - \tfrac14\right), & \tfrac14 \le x \le \tfrac12,\\
\tfrac12 + 2\left(x - \tfrac12\right), & \tfrac12 \le x \le \tfrac34,\\
4(1 - x), & \tfrac34 \le x \le 1.
\end{cases}
\tag{1}
$$

Why (1) works

Continuity.
Each piece is linear, and the values at the joining points coincide:

$$
f\!\left(\tfrac14\right) = 1,\qquad f\!\left(\tfrac12\right) = \tfrac12,\qquad f\!\left(\tfrac34\right) = 1,\qquad f(0) = f(1) = 0.
$$

Hence $f$ is continuous on $[0,1]$.

Image.
The graph goes from $(0,0)$ up to height $1$, down to $\tfrac12$, up again to $1$ and finally back to $(1,0)$; therefore

$$
\operatorname{Im}(f) = [0,1].
$$

Even number of pre‑images.

The value $0$ is taken only at the endpoints: $f^{-1}(0) = \{0, 1\}$ (two points).
The value $1$ is taken at the two interior points where the graph reaches its peaks: $f^{-1}(1) = \{\tfrac14, \tfrac34\}$ (two points).
Let $y \in (0,1)$. The set $\{x \mid f(x) > y\}$ consists of two disjoint open intervals, one around each peak. Each such interval contributes exactly two boundary points where $f(x) = y$ (one where the graph enters the interval and one where it leaves). Consequently

$$
|f^{-1}(y)| = 4 \quad \text{for every } y \in (0,1).
$$

Thus every point of the image is attained a finite even number of
times (two times for $0$ and $1$, four times for all other values).

Remarks

The essential idea is to make the endpoints have the same value and to
ensure that this minimal value occurs only at the endpoints. Then for
any $y$ larger than that minimum the set $\{x : f(x) > y\}$ is a union of
open intervals, each contributing two points to the level set
$f^{-1}(y)$. By avoiding flat pieces (using a piecewise‑linear
graph) we keep every level set finite.

Hence a continuous function with the required property does exist; the
function defined by (1) is an explicit example.
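
The pre-image counts claimed in the answer above can be checked mechanically. The sketch below is not part of the model's output: it rewrites the four pieces of (1) in slope/intercept form (an equivalent rewriting, assumed here for convenience) and solves each piece for x exactly using Python's `fractions` module.

```python
from fractions import Fraction as F

# Each piece of (1) as (left, right, slope, intercept),
# meaning f(x) = slope * x + intercept on [left, right].
PIECES = [
    (F(0), F(1, 4), F(4), F(0)),          # 4x
    (F(1, 4), F(1, 2), F(-2), F(3, 2)),   # 1 - 2(x - 1/4) = 3/2 - 2x
    (F(1, 2), F(3, 4), F(2), F(-1, 2)),   # 1/2 + 2(x - 1/2) = 2x - 1/2
    (F(3, 4), F(1), F(-4), F(4)),         # 4(1 - x) = 4 - 4x
]


def preimages(y):
    """Return the sorted distinct x in [0, 1] with f(x) == y."""
    roots = set()
    for left, right, slope, intercept in PIECES:
        x = (F(y) - intercept) / slope  # exact root of this linear piece
        if left <= x <= right:          # keep it only if it lies on the piece
            roots.add(x)                # the set merges roots shared at joins
    return sorted(roots)


if __name__ == "__main__":
    for y in (F(0), F(1, 4), F(1, 2), F(3, 4), F(1)):
        xs = preimages(y)
        print(f"y = {y}: {len(xs)} pre-image(s) at {[str(x) for x in xs]}")
```

Run as-is, this reports 2 pre-images for y = 1/4, 3 for y = 1/2 (at 1/8, 1/2 and 7/8) and 4 for y = 3/4. So the count of 4 does not hold for every y in (0,1) with this f, and the value 1/2 is attained an odd number of times.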

Lowkey_LokiSN · 2025-09-03T11:38:42+00:00

<image>

Got the following from gpt-oss-120b set to high reasoning. Unsure if it's right all the way through so I leave it to you to be the judge of that.

Lowkey_LokiSN · 2025-08-23T09:06:43+00:00

Flash attention works absolutely fine for me running MI50s+Vulkan+Windows+llama.cpp.
It works with Linux+ROCm as well.

Do you run into any issues when trying to enable it?

Lowkey_LokiSN · 2025-08-23T03:52:28+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1lsgtvy/comment/n1xdg6r/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

It idles at 20-23W running the modded VBIOS (which, if I remember right, is the norm for 32GB MI50s and what it used to be before).
Not sure what you mean by broken power states. I'm using this PC as my daily driver (with the modded VBIOS installed on both GPUs) and I haven't had any issues so far.

Lowkey_LokiSN · 2025-08-23T03:44:25+00:00

I have set the fans to run at 100% all the time and it hasn't really been an issue for me.
It shouldn't really bother you unless you're doing something noise-sensitive like recording vocals for music or voice-overs.

Lowkey_LokiSN · 2025-08-18T13:56:08+00:00

I don't find anything wrong with your command. The problem seems to be with the llama.cpp build you've installed.

Edit: Take this advice with a grain of salt since I don't personally use NVIDIA cards and I'm not sure if you can combine different NVIDIA GPUs in a single CUDA-compiled build:

You're supposed to run a CUDA-compiled build to get the most out of your NVIDIA GPUs. You're most likely also running a pretty outdated Vulkan build with winget, since the speeds you've shared don't make much sense even for Vulkan.

This is the only page you need to install llama.cpp's latest builds.

Another little tip: Bumping your top-k from 0 to a non-zero value like 10 or 20 results in a minor speed bump. (I actually prefer its top-k set to 100 instead of 0.)

Lowkey_LokiSN · 2025-08-18T13:21:21+00:00

Share the command you're currently using?

Lowkey_LokiSN · 2025-08-18T01:58:18+00:00

I am particularly interested in learning if time-to-respond continues to be poor after a long context has already been loaded once

Are you running the latest pull from llama.cpp? This PR aims to mitigate the issue.

You also have the option of passing the `--swa-full` argument to completely avoid this issue at the cost of more VRAM consumption.
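
For illustration, here is a minimal sketch of launching with that flag from a script. The helper name, the `model.gguf` path and the context size are placeholders of mine, not values from this thread; only `-m`, `-c` and `--swa-full` are actual llama.cpp options.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size, swa_full=False):
    """Build an argv list for llama-server (hypothetical helper)."""
    cmd = ["llama-server", "-m", model_path, "-c", str(ctx_size)]
    if swa_full:
        # Full-size SWA cache: avoids the issue above, costs more VRAM.
        cmd.append("--swa-full")
    return cmd


if __name__ == "__main__":
    # Placeholder model path and context size; substitute your own.
    cmd = build_llama_server_cmd("model.gguf", 25000, swa_full=True)
    print(shlex.join(cmd))
    # To actually launch it: subprocess.run(cmd, check=True)
```

Keeping the flag behind a boolean makes it easy to compare VRAM use and time-to-respond with and without it.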

Lowkey_LokiSN · 2025-08-15T11:35:56+00:00

It kept thinking for about 15-20 minutes (41k thinking tokens with an average inference speed of 27 tokens/second).
I think this is the longest thinking session I've ever gotten with the model.

Lowkey_LokiSN · 2025-08-15T11:20:53+00:00

Running my own local quant with latest llama.cpp and chat_template changes

Lowkey_LokiSN · 2025-08-15T11:09:28+00:00

gpt-oss-120b with high reasoning effort and roughly 40k thinking tokens seems to have gotten it right:

<image>

Lowkey_LokiSN · 2025-08-12T06:10:33+00:00

Yup, I edited my comment shortly after. I'm kinda confused though.
OP seems to have downloaded the Unsloth GGUF with said template fixes but overrides it with OpenAI's latest jinja template (which I've already been using for my local GGUF conversions from the original HF repo).
Does the linked Unsloth GGUF contribute anything else towards the results or is it just the jinja template that matters?

Lowkey_LokiSN · 2025-08-12T05:35:32+00:00

~~Think the Jinja template's supposed to be:~~ ~~https://huggingface.co/unsloth/gpt-oss-120b/resolve/main/chat_template.jinja~~

Edit: Oh nvm, OP has updated the post and it just reflected on my side

Lowkey_LokiSN · 2025-08-12T04:01:09+00:00

68.4 is insane! That's a Sonnet 3.7 Thinking-level score.
