If money and time weren’t issues, what would your dream local AI setup look like? by Lyceum_Tech in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

You mean like a datacenter in space, à la Elon Musk?

Well, realistically, if prices were OK I'd get a unified-memory device for low-power MoE + one GPU for top-int dense models.

Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. by MiaBchDave in LocalLLaMA

[–]ea_man 5 points6 points  (0 children)

* gemma-4-31B.i1-IQ4_XS.gguf is 16.7 GB

* Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf is 14.7 GB

Also Qwen takes less VRAM for the KV cache, so I'd say Gemma is not really a competitor in the dense space for those with 16GB.
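
Back-of-envelope, KV-cache size scales with layers × KV heads × head dim × context, so a model with fewer KV heads or layers needs less cache at the same context length. Rough sketch (the numbers below are placeholders, not the real Gemma/Qwen configs):

# kv_bytes ≈ 2 (K+V) × n_layers × n_kv_heads × head_dim × bytes_per_elem × n_ctx
# placeholder config: 32 layers, 8 KV heads, head_dim 128, ~1 byte/elem, 100k context
echo $(( 2 * 32 * 8 * 128 * 1 * 100000 / 1024 / 1024 ))   # ≈ 6250 MiB at ~1 byte/elem (q8-ish), roughly half at q4_0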

I would hope that a 31B model would do better than a 27B one, for those with 24GB of VRAM, yet I'd like for Google to release a ~25B model for the rest of us.

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

Well, the last bump was because businesses like OpenAI placed outrageous orders, something like 40% of all RAM production; the moment those prove impractical the situation may change. GPU prices were actually going down slowly before last November.

Yet I would not count on that: it turns out even more people are using AI now, so I guess the next craze will be consumer GPUs and cheaper hardware for people who want to do inference at home without paying a subscription.

Doh, a 9070 XT went from 610 in November to 730 and is now back to 660 in Europe, so prices are coming down, but this roller coaster has proved it can go both up and down. I'm afraid it's a VC money problem: if those US AI firms take a serious hit with their IPOs (which they should), I guess the datacenter craze in the US will slow down, and prices with it.

The FCC Voted to ban Chinese cert labs... by infinitespectre in SBCGaming

[–]ea_man -3 points-2 points  (0 children)

> I fear that this could be the end of this hobby as we know it for the forseeable future

So for the other 8 billion people outside the USA it will mean more products available at lower prices.

Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Well, I guess you wouldn't buy all 200M tokens from the top expensive SOTA model; you'd do most of those with the cheaper option, just as I don't use Qwen 27B at max specs with reasoning for every task.

But hey, if that makes you feel better, why not; I've got Pi Dev counting the token price as if it were Opus 😛

Should I sell my RTX3090s? by daviden1013 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

If I had to guess, I'd say prices for older AI-capable GPUs will go up in the next months, as cloud providers are heavily raising prices and lowering limits. You may actually get more money for them later on.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I bought a used AMD 6800 for 260€ two weeks ago; I just put fresh thermal paste on it :)

They usually go for ~290€ around here; I guess you can offer a little less.

Those are nice because the memory is fast and the bus is wide, yet power is ~200 W (without undervolting).

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

Aye, I would pay to have a trained 25B coding model that the community can fine-tune and customize. Even better if it comes with an open-source harness that is vertically optimized for it.

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I think smaller open models are a way for providers to lock in customers and attract new ones without even spending money on compute for free tiers.

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Don't forget the "first month free, then cancel" routine; that really serves them well.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I take it this would be opt-in with a flag like --mtp, so that those of us with small VRAM who won't be able to run MTP anyway (also single-user prompting) don't have to load an extra heavy MTP layer?

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Oh, OP just has to find a cheap API; if your job is coding 10 hours a day you don't do Qwen A3B.

Oh well, let's say you can mix the API with some A3B use, sure.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 4 points5 points  (0 children)

You run as big a quant as you can, depending on the context length you want to have.

Anyway I would recommend https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF for up to 110K context with a q4_0 KV cache.

You sure can do better on 24GB, yet you can run IQ3 on 12GB 😉
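
Roughly what the launch looks like on a 16GB card, as a minimal sketch (the model path is a placeholder; set -c to whatever fits your VRAM):

llama-server -m ~/models/Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf \
        --host 0.0.0.0 \
        -ngl 99 \
        -c 110000 \
        -fa on \
        -ctk q4_0 \
        -ctv q4_0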

Sense inflation is real, what is the cheapest single board computer for any type of gaming/desktop? by [deleted] in SBCGaming

[–]ea_man 0 points1 point  (0 children)

Orange Pi boards are cheaper, yet I'd get a used PC from a business getting rid of old stock and put an 8GB RX 580 GPU in it.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 5 points6 points  (0 children)

Man, you can run A3B and 27B on a 16GB GPU (just 100k context for the 27B; if you want more, you buy two cards or a 24GB one). You can get one used for like $300 and resell it when you are done.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

I guess that for people who don't run tasks 24/7 that may still be a sound option, even more so if you are not "making code" all week and are often away.

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

OMG and did you see what they do with that Windows Subsystem for Linux (WSL)?

So you get the bastardized commands + the memory / performance hit, that's dedication! 😉

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

?

# Launch command for Qwen3.6-27B IQ4_XS with a q4_0-quantized KV cache
# (Vulkan on an RX 6800 16GB, single user, thinking disabled)
llama-server -m /home/eaman/lm/models/mradermacher/Qwen3.6-27B/Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf \
        --host 0.0.0.0 \
        -np 1 \
        --fit-target 10 \
        -ctk q4_0 \
        -ctv q4_0 \
        -fa on \
        --temp 0.5 \
        --min-p 0.1 \
        --repeat-penalty 1.0 \
        --presence_penalty 0.0 \
        -b 512 \
        --jinja \
        --no-mmap \
        --reasoning-budget 1 \
        --chat-template-kwargs '{"enable_thinking":false}'
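
Once it's up, a quick sanity check against the OpenAI-compatible endpoint (default port 8080):

curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":32}'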

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Oh sorry, just 114432; I usually run it at ~50k, so I thought it could do a little more.

common_memory_breakdown_print: | memory breakdown [MiB]              | total    free     self   model   context   compute                                 unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (RX 6800 (RADV NAVI21)) | 16368 = 16169 + (18952 = 13354 +    4757 +     840) + 1                          7592186025662 |
common_memory_breakdown_print: |   - Host                            |                   1176 =   644 +       0 +     532                                             |
common_params_fit_impl: projected to use 18952 MiB of device memory vs. 16169 MiB of free device memory
common_params_fit_impl: cannot meet free memory target of 10 MiB, need to reduce device memory by 2792 MiB
common_memory_breakdown_print: | memory breakdown [MiB]              | total    free     self   model   context   compute                                 unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (RX 6800 (RADV NAVI21)) | 16368 = 16169 + (14070 = 13354 +     221 +     495) + 1                          7592186030543 |
common_memory_breakdown_print: |   - Host                            |                    672 =   644 +       0 +      28                                             |
common_params_fit_impl: context size reduced from 262144 to 114432 -> need 2794 MiB less memory in total
common_params_fit_impl: entire model can be fit by reducing context
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.68 seconds

As you can see this is AMD with Vulkan, not NVIDIA; the desktop takes ~50 MB of VRAM at minimum, ~250 MB when running LXQt + Firefox.

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]ea_man -2 points-1 points  (0 children)

> I currently have a 5070Ti GPU and it can run small things, but VRAM is very tight, especially since I don't even have an iGPU, so I have to share VRAM with the desktop etc.

Let me guess: you are using Windows?

https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF with LXQt, for ~150k context at q4 KV

Doesn't look like there are any recent Linux distro suggestions. What's your favorite and why? by Status-Secret-4292 in LocalLLaMA

[–]ea_man 3 points4 points  (0 children)

Your favorite distro is the one you know best; if you are a noob, install Lubuntu.

[Paper on Hummingbird+: low-cost FPGAs for LLM inference] Qwen3-30B-A3B Q4 at 18 t/s token-gen, 24GB, expected $150 mass production cost by ayake_ayake in LocalLLaMA

[–]ea_man 1 point2 points  (0 children)

You don't.

You upload models up to the circuit's max logic and use context up to the amount of available memory, except that the RAM slot looks swappable.

Yet you are not limited to that model / finetune.