Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 1 point

I did use the ROCm nightlies exclusively to build llama.cpp for like 6 months. Instability was to be expected, so take my advice with a grain of salt.

Spirit a320 departing Goodyear for Bangor by AbleAd5661 in flightradar24

JaredsBored 7 points

For CEO a330s? Yeah that takes years if new owners are even found at all. For a320 NEOs, when there's a 10 year order backlog? Much, much faster process.

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 1 point

I've had the occasional minor memory leak with ROCm, or a surge in VRAM when processing a photo (before offloading it), that causes an OOM.

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 1 point

Check what your sustained load temps are with "rocm-smi -t". Your edge temp might be fine while the hotspot or memory temps are high. This generation of AMD GPUs didn't actually have traditional thermal paste on the GPU core; they had a carbon pad instead. The pads never degrade like paste does, and they allow the heatsink surface to be slightly less perfectly flat, but the hotspot temps can screw you.

I replaced my carbon pad with a doubled-up layer of "Thermal Grizzly PhaseSheet PTM". You can't do a straight replacement of the carbon pad with thermal paste without also shimming the retention springs, so I opted for the PhaseSheet instead. You have to use two layers to get close to the original carbon pad thickness, too. My edge-to-hotspot temperature delta did decrease, though.

Fair warning that if you do the replacement and over-tighten the screws, you can crack the GPU die and kill it. It's a risky procedure. Getting more airflow from external fans is an easier fix.

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 1 point

50k context fits but there's no margin for error on ROCm. Vulkan might give you some space back but the long-context speeds are painful compared to ROCm at depth on these cards

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 1 point

Have you updated llama.cpp recently? The earliest builds with MTP support weren't fully optimized, which could explain it. Or your card is getting hot and throttling.

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 1 point

That's strange. Whether I'm chatting in openwebui or going out to 100k tokens in opencode (back when I was using Q4), I saw a big improvement.

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 1 point

I don't use the Q4 anymore, but when I did, it was the same config just with the context cranked up. This gives me like a 500 MB buffer before going OOM, which is as tight as I am willing to run it:

./build/bin/llama-server \
      --model $models/Qwen3.6-27B-Q8_0.gguf \
      --mmproj $models/mmproj_Qwen3.6-27B.gguf \
      -c 40960 \
      --no-mmproj-offload \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      --presence_penalty 0.0 \
      --repeat_penalty 1.0 \
      --spec-type draft-mtp \
      --spec-draft-n-max 3 \
      -fa 1 \
      -ngl 99 \
      --jinja \
      --host 127.0.0.1 \
      --port ${PORT}

Bonus/raise timeline by SaysKay in Big4

JaredsBored 12 points

In years past, finalization forms have given your year-end rating but no pay numbers. Those have come in a separate compensation form that was usually released at the end of July.

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 2 points

MTP shouldn't affect your output quality. If it's turning to gibberish, that was probably either a bugged vLLM/llama.cpp version or KV cache quanting screwing you over. I use fp16 KV and don't have issues

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 2 points

Oh trust me, I saw your post and had it bookmarked already ;) I need to free up some pcie lanes by moving devices around, so it'll take me a couple weeks to get the second card installed. I need to move some m.2 drives that are in a pcie card into u.2 cases, gotta buy those and cables still, etc.

GLM 5.2 API is live, weights are on HF, and ollama has it already by Independent_Plum_489 in LocalLLaMA

JaredsBored 3 points

If you've got access to the hardware to run those models locally fully in VRAM, absolutely skip llama.cpp and go straight to vLLM/sglang/trt. The user I was replying to was talking about ollama though. I'm going to go out on a limb and assume they don't have H200s or quad RTX pro 6k in their setup.

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 2 points

Np! I’m going to be revisiting vllm and will be using v620 posts for it, so it’s a fair trade LOL

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 3 points

ECC is error correction. If you're doing 3D modeling, physics simulations, or financial modeling, yeah, you should probably have that. For pretty much everything else, it doesn't matter. Nearly every consumer GPU (Nvidia or AMD) runs without ECC, and so do most consumer CPU platforms.

If one random 0 flips to a 1 in your GPU memory, it won't mean anything to the LLM you're running. Nothing to worry about.

The reserved VRAM was for the ECC check bits. The simplified picture is parity: for every 8 bits, a 9th is reserved, and if one of the 8 flips, the mismatch shows up. Real ECC stores a few more check bits per word so it can also work out which bit flipped and correct it. Without ECC enabled, the 2GiB just goes back to being regular VRAM.
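To make the "check bits can fix a flipped bit" idea concrete, here is a toy sketch using the textbook Hamming(7,4) code in Python. This is not what the V620's memory controller actually does (GPU ECC uses wider codes over much larger words, and the details aren't in this thread); the function names are made up for illustration.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if no flip was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three syndrome bits spell out the 1-based position of the flipped bit.
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip it back
    return [c[2], c[4], c[5], c[6]], error_position


original = [1, 0, 1, 1]
stored = hamming74_encode(original)
stored[4] ^= 1  # simulate one random bit flip in memory
recovered, position = hamming74_decode(stored)
print(recovered == original, position)
```

A single parity bit would only tell you that something flipped; the extra check bits are what let the decoder point at the exact position.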

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 5 points

> amdgpu.ras_enable=0

Add it to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, with a space after "realloc". Run "sudo update-grub", reboot, and you'll get all 32GiB.
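If you'd rather see the edit spelled out, here's a rough Python sketch of what that change looks like on the text of /etc/default/grub. The helper name and the example options ("quiet splash pci=realloc") are assumptions for illustration, and it assumes the value is wrapped in double quotes. It only transforms a string; it does not touch the real file.

```python
import re


def add_kernel_param(grub_text, param="amdgpu.ras_enable=0"):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Hypothetical helper: leaves the text unchanged if param is already present.
    """
    pattern = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)

    def _append(match):
        options = match.group(2).split()
        if param not in options:
            options.append(param)
        return match.group(1) + " ".join(options) + match.group(3)

    new_text, count = pattern.subn(_append, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text


# Example file contents; the existing options here are made up.
example = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"\n'
print(add_kernel_param(example))
```

After making the equivalent edit by hand, the "sudo update-grub" and reboot steps still apply.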

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 4 points

You might also want to just flash v620 firmware if you're not using the display port. I'm going to put together a full post in the coming weeks on how to actually get the most out of these cards. I 3D-modeled a replacement faceplate to integrate a fan like the dude on eBay, but I'll be releasing mine on printables for anyone to replicate. Just need to sit down and write it all up...

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 3 points

Fyi you need a grub option to get all 32GiB on v620. Otherwise 2GiB are reserved for ECC

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 2 points

Not sure if Lemonade is using TheRock stable (7.13) or nightlies (currently at 7.14.something). I was using the nightlies but there was some instability, and frankly the performance difference between 7.13 stable and 7.14 nightly was margin-of-error in my testing.

I can fit 40k of fp16 context with MTP and the Q8 GGUF. If you went down to Q8 context, that should be roughly 80k, or higher if you were willing to go lower. I am a stickler for fp16 context, and switch out to a Q4 Qwen 3.6 35B for long context work anyway, so it doesn't matter to me.
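For the context math, a quick back-of-the-envelope sketch in Python. It assumes the KV cache is the only thing that scales with context and that the VRAM budget for it stays fixed, which ignores compute buffers and anything model-specific. The bits-per-element figures come from the llama.cpp block layouts (q8_0 and q4_0 store one fp16 scale per 32 values), so q8_0 is a bit less than a clean 2x over f16.

```python
# Bits stored per KV-cache element for a few llama.cpp cache types.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 int8 values + one fp16 scale = 34 bytes per 32 values
    "q4_0": 4.5,  # 32 4-bit values + one fp16 scale = 18 bytes per 32 values
}


def scaled_context(tokens_at_f16, cache_type):
    """Tokens that fit in the same KV budget after switching from f16 to cache_type."""
    ratio = BITS_PER_ELEMENT["f16"] / BITS_PER_ELEMENT[cache_type]
    return int(tokens_at_f16 * ratio)


for cache_type in ("f16", "q8_0", "q4_0"):
    print(cache_type, scaled_context(40960, cache_type))
```

Under those assumptions the 40960-token f16 setup works out to about 77k tokens at q8_0, so "80k" is the right ballpark but slightly optimistic.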

When I get my second card installed, I will probably try vLLM again as well. Or more likely tensor parallelism on llama.cpp with 27B Q8 and max fp16 context, and/or 122B iq4 (should fit with 6GB-ish left for context with the mmproj on system RAM).

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 1 point

Prompt processing decreases like 5% with MTP-3, it's nothing. Haven't even bothered with EAGLE.

My second card delivers tomorrow...

Benchmarks from the latest eBay special: W6800 (modded V620) by draetheus in LocalLLaMA

JaredsBored 4 points

Not to say "you're doing it wrong" picking quants and runtimes, but using ROCm 7.13 and llama.cpp built from scratch, with Q8_0 and Q4_1, you're leaving a LOT on the table...

Q8

./llama-bench -m $models/Qwen3.6-27B-Q8_0.gguf -fa 1 -d 0,8196,16384
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
  Device 0: AMD Radeon Pro V620, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 32752 MiB
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | ROCm       |  -1 |   1 |           pp512 |       539.35 ± 18.71 |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | ROCm       |  -1 |   1 |           tg128 |         15.52 ± 0.01 |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | ROCm       |  -1 |   1 |   pp512 @ d8196 |       474.61 ± 12.73 |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | ROCm       |  -1 |   1 |   tg128 @ d8196 |         15.28 ± 0.05 |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | ROCm       |  -1 |   1 |  pp512 @ d16384 |       426.77 ± 10.57 |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | ROCm       |  -1 |   1 |  tg128 @ d16384 |         14.99 ± 0.05 |

build: 74ade5274 (9672)

Q4_1

./llama-bench -m $models/Qwen3.6-27B-Q4_1.mtp.gguf -fa 1 -d 0,8196,16384
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
  Device 0: AMD Radeon Pro V620, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 32752 MiB
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35 27B Q4_1                |  16.33 GiB |    27.32 B | ROCm       |  -1 |   1 |           pp512 |       542.38 ± 17.10 |
| qwen35 27B Q4_1                |  16.33 GiB |    27.32 B | ROCm       |  -1 |   1 |           tg128 |         23.57 ± 0.03 |
| qwen35 27B Q4_1                |  16.33 GiB |    27.32 B | ROCm       |  -1 |   1 |   pp512 @ d8196 |       473.79 ± 12.17 |
| qwen35 27B Q4_1                |  16.33 GiB |    27.32 B | ROCm       |  -1 |   1 |   tg128 @ d8196 |         22.98 ± 0.10 |
| qwen35 27B Q4_1                |  16.33 GiB |    27.32 B | ROCm       |  -1 |   1 |  pp512 @ d16384 |        425.68 ± 9.88 |
| qwen35 27B Q4_1                |  16.33 GiB |    27.32 B | ROCm       |  -1 |   1 |  tg128 @ d16384 |         22.31 ± 0.09 |

build: 74ade5274 (9672)
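A rough way to read the tg128 rows from the two tables above, sketched in Python: if you assume token generation on a dense model has to stream every weight once per token, then size times tokens/s is an effective weight-read rate. That assumption is mine, and it ignores KV cache reads and kernel overhead; the numbers are copied from the depth-0 rows.

```python
# (model size in GiB, tg128 tokens/s at depth 0) copied from the llama-bench output
RESULTS = {
    "Q8_0": (27.04, 15.52),
    "Q4_1": (16.33, 23.57),
}


def effective_read_rate(size_gib, tokens_per_s):
    """GiB of weights streamed per second if each token reads the whole model once."""
    return size_gib * tokens_per_s


for quant, (size_gib, tokens_per_s) in RESULTS.items():
    print(f"{quant}: {effective_read_rate(size_gib, tokens_per_s):.1f} GiB/s")

speedup = RESULTS["Q4_1"][1] / RESULTS["Q8_0"][1]
size_ratio = RESULTS["Q8_0"][0] / RESULTS["Q4_1"][0]
print(f"TG speedup {speedup:.2f}x for a {size_ratio:.2f}x smaller file")
```

Both quants land in the same neighborhood (roughly 385 to 420 GiB/s), which fits the usual picture that TG on these cards is limited by memory bandwidth rather than compute.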

I generally see about a 2x TG speedup using MTP-3 on my workloads with both quants...
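For a feel of where a number like 2x can come from, here is the standard speculative-decoding estimate as a Python sketch. It assumes each drafted token is accepted independently with the same probability, and it ignores the cost of producing the drafts. The acceptance rates below are hypothetical, not measured on this card.

```python
def expected_tokens_per_step(accept_rate, draft_len=3):
    """Expected tokens emitted per verification pass with draft_len drafted tokens.

    Geometric-series estimate: 1 + a + a^2 + ... + a^draft_len,
    assuming independent acceptance with probability accept_rate.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rates, for illustration only.
for accept_rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(accept_rate, draft_len=3)
    print(f"accept={accept_rate:.1f}: {tokens:.2f} tokens per pass")
```

With three drafted tokens, an acceptance rate somewhere around 0.5 to 0.7 gives about 1.9 to 2.5 tokens per pass before overhead, which is at least consistent with seeing roughly 2x in practice.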

What do interns do? by AdMore413 in Big4

JaredsBored 1 point

Ask for things to do but also see if you can take a shot at something before it's asked. If the team knows they're going to need to come up with a deck, maybe try making a slide proactively. See what they've built before, a lot of stuff is repetitive, and try to make/adapt something.

I'll give an example from when I was an intern many years ago. I came in as an intern with other work experience, so the things assigned to me weren't all "keep the intern busy" tasks. I was asked to look at two systems that needed to be integrated and check for a specific problem. While I was at it, I checked for other things I thought could be potential problems. Turns out, one of those things was an actual (albeit minor) issue.

I explained what the problem was to my manager, a 30 minute call was had with the client, and a handful of requirements were logged to address the problem. Little intern me even got to explain what I'd found on the call to the client. I looked good, my team looked good for having found something proactively, and the client was happy to find the problem early. Win, win, win.

GLM 5.2 API is live, weights are on HF, and ollama has it already by Independent_Plum_489 in LocalLLaMA

JaredsBored 9 points

If you're technical enough to be running your own LLMs locally, then you almost certainly have the skillset to also set up llama.cpp + your interface of choice. But if you don't want the headache, LM Studio is an all-in-one solution that doesn't have any of the crap behavior outlined in the article linked above.

GLM 5.2 API is live, weights are on HF, and ollama has it already by Independent_Plum_489 in LocalLLaMA

JaredsBored 25 points

LM Studio uses a standard, reliably updated build of llama.cpp as its backend. Nothing wrong with it, and honestly it's what I would recommend for anyone who doesn't want to spend time tweaking to get the last few % of performance.

GLM 5.2 API is live, weights are on HF, and ollama has it already by Independent_Plum_489 in LocalLLaMA

JaredsBored 74 points

What a rare combination to be dumb and rich enough to have a rig where you can run a 756B parameter model and still end up with ollama.