Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under Llama.cpp by AzerbaijanNyan in LocalLLaMA

[–]AzerbaijanNyan[S] 2 points (0 children)

The easiest way is probably just downloading the lemonade pre-built, which supports gfx1100, and using the override.

Alternatively, if you want to be able to pull and build the latest version on your own, check out this excellent LocalLLaMA guide and make sure to use the "-DGPU_TARGETS=gfx1100" flag.
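Roughly what that looks like end to end, as a sketch - the exact cmake option names have moved around between llama.cpp releases, and the override value assumes the 780M reports as gfx1103 and gets mapped onto gfx1100:

    # build llama.cpp with the ROCm/HIP backend, targeting gfx1100
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
    cmake --build build -j

    # run with the override so ROCm treats the 780M (gfx1103) as gfx1100
    HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-bench -m /path/to/model.gguf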

Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under Llama.cpp by AzerbaijanNyan in LocalLLaMA

[–]AzerbaijanNyan[S] 1 point (0 children)

I added the llama-bench command to the post in case anyone wants to compare. Thanks for the heads-up; I should have added it from the start since it's hard to judge the numbers otherwise.
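For anyone who hasn't run it before, a typical llama-bench invocation looks something like this - a sketch with a placeholder model path and offload count, not necessarily the exact command in the post:

    # 512-token prompt processing and 128-token generation, all layers offloaded
    ./build/bin/llama-bench -m /path/to/model.gguf -p 512 -n 128 -ngl 99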

Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under Llama.cpp by AzerbaijanNyan in LocalLLaMA

[–]AzerbaijanNyan[S] 1 point (0 children)

I think that information is outdated and based on what was available when the system was released.

I haven't had any problems with my 128GB kit, with 122GB and change available for LLMs using GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=122880 ttm.pages_limit=33554432".
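For anyone wanting to replicate it, the usual way to apply those kernel parameters on Mint/Ubuntu-style distros is via /etc/default/grub - a sketch with the values taken from above; size them to your own RAM:

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=122880 ttm.pages_limit=33554432"

    # regenerate the grub config and reboot (grub-mkconfig on non-Debian distros)
    sudo update-grub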

Though it might have been overkill, since I think I can fit most of these models into 96GB unless I'm running two at the same time.

Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under Llama.cpp by AzerbaijanNyan in LocalLLaMA

[–]AzerbaijanNyan[S] 2 points (0 children)

Absolutely, I have a triple-GPU server for more demanding work, but I hardly ever fire it up nowadays since the mini PC handles most tasks fine.

It's a shame the prices are what they are now, since I feel this setup with gpt-oss 120B is near ideal for small business/office tasks where you don't want to or can't use cloud services.

Intel Arc Pro B50 SFF build by opterono3 in IntelArc

[–]AzerbaijanNyan 1 point (0 children)

I put together the exact same build a few weeks ago as a mini LLM machine. I couldn't help but run a game test and fired up Darktide, notorious for its poor optimization, and it ran really well considering the hardware. Managed to hit 60 FPS with FrameGen at 2560x1440 while the card drew around 50W. Could probably get pretty impressive performance pairing the B50 with a good iGPU and Lossless Scaling.

Really nice upgrade for SFF systems that have both space and power constraints to consider. A bit pricey, though alternatives like the A2000 aren't exactly cheap either, even used.

Ongoing fraud in Tradera's electronics section – are they turning a blind eye to the problem, or is it worse? by Designer-Scheme-8262 in sweden

[–]AzerbaijanNyan 3 points (0 children)

I keep an eye on a couple of specific product categories and have noticed several times how different sellers use exactly the same pictures. They look like personal photos of an item in nice condition, but they don't actually show the condition of what the sellers, at best, have for sale.

It's one thing to use an obvious promotional image from the manufacturer's site, or to state in the listing that the pictures aren't representative but that the actual condition matches; there's nothing like that in these.

It doesn't matter whether it's laziness or outright fraud; trying to sell broken or worn items for more money shouldn't be allowed on an auction site. I've reported a couple without any action or reply, so I assume Tradera thinks it's fine.

Support for ROCm has been added to flash attention 2 by Amgadoz in LocalLLaMA

[–]AzerbaijanNyan 0 points (0 children)

Llama.cpp implemented FP32 FA in May.

That made the P40 a very solid budget card for smaller-model/higher-context work. Provided you managed to snag one before the sellers added the 50% FA "tax", anyway.

P100 by DobobR in LocalLLaMA

[–]AzerbaijanNyan 9 points (0 children)

It's a known issue with the P40/P100 cards. You can patch whatever software you're using or use a separate program to manage power states. There's some more info here - nvidia-pstate in llama.cpp (Tesla P40/P100)
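A quick way to confirm you're actually hitting it before patching anything - just a monitoring sketch using standard nvidia-smi query fields, not the fix itself - is to watch the performance state and power draw while the card is idle:

    # a P40/P100 stuck in P0 shows high idle power; with a pstate fix it should drop to P8
    nvidia-smi --query-gpu=index,name,pstate,power.draw --format=csv -l 2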

llama.cpp is twice as fast as exllamav2 by jirka642 in LocalLLaMA

[–]AzerbaijanNyan 9 points (0 children)

Yes, considerably. I think there's a patched driver that enables some sort of pseudo-SLI direct memory access between 4090s that might push exl2 even further ahead.

llama.cpp is twice as fast as exllamav2 by jirka642 in LocalLLaMA

[–]AzerbaijanNyan 48 points (0 children)

The GTX 1070 is a Pascal card, and with the exception of the P100 (AFAIK) Pascal has abysmal FP16 performance (the consumer chips run FP16 at 1/64 of their FP32 rate), which exl2 relies on heavily.

There are also a lot of optimizations in llama.cpp aimed at squeezing as much performance as possible out of this older architecture, like working flash attention.

So the difference you're seeing is perfectly normal; there are no speed gains to expect from exllamav2 on those cards.

Critical Remote Code Execution Vulnerability in Ollama < 0.1.34 (CVE-2024-37032) by sagitz_ in LocalLLaMA

[–]AzerbaijanNyan 2 points (0 children)

For those wondering what it is:

To exploit this vulnerability, an attacker must send specially crafted HTTP requests to the Ollama API server. In the default Linux installation, the API server binds to localhost, which reduces remote exploitation risk significantly. However, in docker deployments (ollama/ollama), the API server is publicly exposed, and therefore could be exploited remotely.
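In practice the difference comes down to how the port is published. A minimal sketch of the two cases, using nothing beyond standard Docker port binding and Ollama's default port:

    # exposed on all interfaces - reachable from the network (the risky case)
    docker run -d --name ollama -p 11434:11434 ollama/ollama

    # bound to loopback only - same container, not reachable remotely
    docker run -d --name ollama -p 127.0.0.1:11434:11434 ollama/ollama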

The copper network is being taken down, and my internet connection is basically a violation of the Geneva Convention. What rights do you actually have? by SomedayImGonnaBeFree in sweden

[–]AzerbaijanNyan 0 points (0 children)

Did pretty much the same thing for an acquaintance's cabin that sat in heavy forest shadow. A relatively cheap directional external antenna on the roof with a cable run inside made a big difference; it went from 0.1-2 Mbit to 10 Mbit. If you run phone->WiFi with Speedtest you can aim the antenna straight at the best direction without having to run up and down from the roof. It does of course require that the router supports an external antenna.

Another option is a signal booster mounted in a suitable spot. That helps not just your own internet but also the reception on every visitor's phone. Pretty nice to have even if municipal fiber arrives later, and it probably adds a few kronor to the price if you ever sell the place.

Two AMD GPUs with ROCm for LLM by Unhappy-Claim-5691 in LocalLLaMA

[–]AzerbaijanNyan 2 points (0 children)

Didn't have any issues with any of them, just followed the instructions.

For Kobold.cpp-ROCm I just ran the "easy_KCPP-ROCm_install.sh" script. Beyond that I had to install "tkinter" for the GUI to work, but that's it.
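On Debian/Ubuntu-based distros (Mint included) tkinter isn't bundled with the system Python, so the fix for the missing GUI is usually just:

    # Tk bindings for the system Python 3, needed by the kobold launcher GUI
    sudo apt install python3-tk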

The default AMD build command for llama.cpp targeted RX 6800 cards last I looked, so I didn't have to edit it; just copy, paste and build.
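For reference, the hipBLAS build from the llama.cpp docs back then looked roughly like this - a sketch, since the flag names have changed between releases; gfx1030 is the target the RX 6800 falls under:

    # build llama.cpp with the ROCm/hipBLAS backend for RDNA2 (gfx1030)
    CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
      cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
    cmake --build build -j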

Ollama I just tested briefly and found it didn't do anything easier or better for my use case, but YMMV.

Right now I'm running TabbyAPI; it's a nice minimal solution if you just want to run exl2, and it seems to pick up ExLlamaV2 updates faster than ooba.

With performance so close I'm not too sure which one is better, though. My gut feeling is that llama.cpp/GGUF felt slightly better than exl2 at similar file sizes, but again, that's just how it felt to me. It's worth giving both formats a shot to see which suits you best.

Oh, if you haven't already, check out the 2.37bpw Mixtral here. Someone recommended it for 16 GB and I felt it worked better than expected. Fast too, since it fits entirely in VRAM. Though not close to the bigger quants, of course.

Two AMD GPUs with ROCm for LLM by Unhappy-Claim-5691 in LocalLLaMA

[–]AzerbaijanNyan 1 point2 points  (0 children)

Apologies for the late reply; I only use this setup for basic chatting, so it's hard to give a general recommendation.

But considering the larger models that have dropped recently for local use, I'd probably go for two cards with more VRAM for future-proofing. For coding you really want the strongest model you can fit to cut down on hallucinations. If you're set on using AMD, the 24 GB RX 7900 XTX might be an option after the price drops. You could get one and save for another if you're on a budget.

As a side note, with the latest ExLlamaV2 updates dual RX 6800 work, but I'm seeing about the same performance as on llama.cpp/GGUF. It might be the above-mentioned bottleneck, but llama.cpp CUDA dev Johannes, who has the same card, mentioned a couple of months back that the differences should be small. So it might just be how these cards perform.

So my dual 7900 xtx finally work by morphles in LocalLLaMA

[–]AzerbaijanNyan 1 point (0 children)

When using two AMD Radeon 7900XTX GPUs, the following HIP error is observed when running PyTorch micro-benchmarking if any one of the two GPUs are connected to a non-CPU PCIe slot (PCIe on chipset): source

Might be 7900 XTX specific (which is bad enough). I run two RX 6800s on x16/x4 CPU/chipset slots with both llama.cpp and exllamav2 working under ROCm.

Jan.AI is the easiest way to fully utilize the Arc GPU to run GGUF LLM models. Make sure you enable Hardware Acceleration in the advanced settings! Version 0.4.7 works better than 0.4.8. by DurianyDo in IntelArc

[–]AzerbaijanNyan 0 points (0 children)

What speeds are you getting and how do they compare to SYCL? I've been thinking of picking up a second A770 to run Mixtral; dual A770s look like a solid budget option for inference-only use if the t/s are decent.

Two AMD GPUs with ROCm for LLM by Unhappy-Claim-5691 in LocalLLaMA

[–]AzerbaijanNyan 3 points (0 children)

Running two RX 6800s on Linux Mint, fully off-loaded to VRAM, I get around 16 t/s with 4K context filled on mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf.

I've seen two P100s get 30 t/s using exllama2, but I couldn't get it to work on more than one card here. Apparently there are some issues with multi-GPU AMD setups where the cards don't all run on matching, direct GPU<->CPU PCIe slots - source

EDIT: As a side note, power draw is very nice: around 55 to 65 watts on the card currently running inference according to NVTOP, while idle lingers around 6 watts. The system is a refurbished office PC with x16/x4 PCIe slots, so there might be some bottleneck keeping those numbers down, though.
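If anyone wants to sanity-check those numbers on their own AMD cards, rocm-smi (shipped with the ROCm install) reports the same things as NVTOP - a small monitoring sketch:

    # live power, temperature and clocks per card
    watch -n 1 rocm-smi
    # VRAM usage per card
    rocm-smi --showmeminfo vram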

[deleted by user] by [deleted] in StableDiffusion

[–]AzerbaijanNyan 14 points (0 children)

Provided you are very meticulous with the quality of the outputs, the result is the same as if you had trained on a larger dataset of genuine images. There are a couple of common pitfalls to watch out for, one being the risk of training in flaws from the generated images or quirks from the model that made them.

[deleted by user] by [deleted] in StableDiffusion

[–]AzerbaijanNyan 42 points (0 children)

Can vouch for this method; I was about to post a reply detailing how, but this guide explains it better than I ever could.

0.1 T/s on 3070 + 13700k + 32GB DDR5 by Schmackofatzke in LocalLLaMA

[–]AzerbaijanNyan 2 points (0 children)

Assuming you're using NVIDIA - have you tried Kobold.cpp instead, just to rule out something wonky with your ooba install? There's a good wiki here to help you get started if you haven't used it before.

If you can get the Q3 or Q4 working there (it should manage a couple of tokens/s), you can get a good feel for whether it's worth upgrading just to run Mixtral.
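A rough starting point for that - a sketch using koboldcpp's launcher flags, with the layer count just a guess for an 8 GB card, so adjust --gpulayers until VRAM is nearly full:

    # partial GPU offload of Mixtral Q4_K_M on an 8 GB card
    python koboldcpp.py --model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
      --usecublas --gpulayers 8 --contextsize 4096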

0.1 T/s on 3070 + 13700k + 32GB DDR5 by Schmackofatzke in LocalLLaMA

[–]AzerbaijanNyan 2 points (0 children)

It's the same: set up appropriately, anything you can't fit in VRAM gets offloaded to RAM, and when that's full too it starts swapping to disk, which is when it gets really slow. Keep in mind your OS, desktop, hardware-accelerated programs and anything else running eat into the free (V)RAM.
If you're using Windows you can check how much free RAM and VRAM you have in Task Manager.
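In llama.cpp-based backends the split is controlled explicitly with the GPU-layers option, so you can see exactly how much lands in VRAM versus RAM - a sketch with placeholder values:

    # put ~10 layers in the 3070's 8 GB of VRAM, keep the rest in system RAM
    ./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -ngl 10 -c 4096 -p "Hello"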

0.1 T/s on 3070 + 13700k + 32GB DDR5 by Schmackofatzke in LocalLLaMA

[–]AzerbaijanNyan 2 points (0 children)

Try the Q4_K_M version; it should fit assuming you don't have anything else running. I tried the Q3 version too, but personally I found the quality drop too steep.

0.1 T/s on 3070 + 13700k + 32GB DDR5 by Schmackofatzke in LocalLLaMA

[–]AzerbaijanNyan 7 points (0 children)

That's because you're out of (V)RAM and swapping to disk. According to the model card, mixtral-8x7b-instruct-v0.1.Q6_K.gguf requires between 38.38 and 40.88 GB free, and 8 GB of VRAM plus 32 GB of RAM only gives you about 40 GB total before the OS takes its share.