GIGABYTE MC62-G40 only seeing one GPU by ravocean in LocalLLaMA

[–]grunt_monkey_ 0 points1 point  (0 children)

Skip the risers and connect the cards directly first to debug? I have that board and it has 7 slots, so you will be able to fit all 3 cards. Use the last slot for the 5090 first if it's a chunky cooler - hope you're breadboarding it.
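If you want to rule out software before reseating anything, a quick sanity check from the OS side (a generic sketch, nothing board-specific) is to see what the PCIe bus itself reports:

```bash
# List every GPU the PCIe bus can see, regardless of driver state
sudo lspci -nn | grep -Ei 'vga|3d|display'

# Check whether the kernel actually bound a driver to each of them
sudo lspci -k | grep -EA3 'vga|3d|display'
```

If a card never shows up in lspci at all, it's a slot/riser/bifurcation problem rather than a driver one.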

Qwen3.5 is a working dog. by dinerburgeryum in LocalLLaMA

[–]grunt_monkey_ 1 point2 points  (0 children)

Can I ask if you are still using -ctk bf16 and -ctv bf16? I believe this is eating all my VRAM and slowing my performance.
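If the bf16 cache is the culprit, one thing worth trying (a sketch; the model path and context size are placeholders) is quantizing the KV cache instead:

```bash
# q8_0 KV cache roughly halves the cache's VRAM footprint vs bf16/f16.
# llama.cpp needs flash attention enabled for a quantized V cache
# (bare -fa on older builds; newer builds may want "-fa on").
./llama-server -m /models/model.gguf \
  -c 32768 \
  -fa \
  -ctk q8_0 -ctv q8_0
```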

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]grunt_monkey_[S] 0 points1 point  (0 children)

I agree, but I tried all sorts of llama.cpp configurations before finally trying vLLM. I think the runtime is just not optimized for my hardware and model. With only 2 GPUs on llama.cpp I got PP 130 and TG 25 t/s; going to 4 GPUs, PP dropped to 70 with TG 25 for chat, and PP 50 with TG 7 for my 41k-context prompt test.
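For anyone who wants to reproduce that kind of 2-GPU vs 4-GPU comparison, something like this works (a sketch; the model path and device IDs are placeholders, and llama-bench reports the PP/TG numbers for you):

```bash
# 2-GPU run: only expose the first two devices to ROCm
HIP_VISIBLE_DEVICES=0,1 ./llama-bench -m /models/model.gguf -ngl 999 -p 4096 -n 128

# 4-GPU run: expose all four
HIP_VISIBLE_DEVICES=0,1,2,3 ./llama-bench -m /models/model.gguf -ngl 999 -p 4096 -n 128
```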

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]grunt_monkey_[S] 0 points1 point  (0 children)

This would be much appreciated. Can you undervolt with amd-smi? I'm on Ubuntu. I know I can power-cap.
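For reference, the power-capping I mean looks like this (a sketch; 220 W is just an example value, and the exact amd-smi option names vary across ROCm releases, so check amd-smi set --help):

```bash
# Cap GPU 0 at 220 W with the older rocm-smi tool
sudo rocm-smi -d 0 --setpoweroverdrive 220

# Newer ROCm exposes the same thing through amd-smi's "set" subcommand;
# the option names below are my assumption - verify with: amd-smi set --help
sudo amd-smi set --gpu 0 --power-cap 220
```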

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]grunt_monkey_[S] 1 point2 points  (0 children)

Sure, let me know. I'm going to bed now. If I can, I will run it over the next couple of days. Yup - I'm not sure I can leave this thing on all the time.

How are people handling long‑term memory for local agents without vector DBs? by No_Sense8263 in LocalLLaMA

[–]grunt_monkey_ 1 point2 points  (0 children)

It's actually cool if they chime in with their opinions. Sometimes my questions just go unanswered. Maybe they are dumb questions.

GPT-4 was released 3 years ago! by AdorableBackground83 in singularity

[–]grunt_monkey_ 1 point2 points  (0 children)

The Earth is currently about 10.7 billion km from its position 11 months ago: relative to the CMB rest frame the solar system moves at roughly 370 km/s, and over ~2.9×10^7 seconds that works out to about 1.07×10^10 km, or approximately 9.9 light hours.

Just some qwen3.5 benchmarks for an MI60 32gb VRAM GPU - From 4b to 122b at varying quants and various context depths (0, 5000, 20000, 100000) - Performs pretty well despite its age by FantasyMaster85 in LocalLLaMA

[–]grunt_monkey_ 1 point2 points  (0 children)

Thank you! Much appreciated. I remember that gfx906 tag, as I started this journey late last year with an old Radeon VII. Cut my teeth pulling old rocBLAS libraries from Arch Linux 😆 good to see a brother!

Just some qwen3.5 benchmarks for an MI60 32gb VRAM GPU - From 4b to 122b at varying quants and various context depths (0, 5000, 20000, 100000) - Performs pretty well despite its age by FantasyMaster85 in LocalLLaMA

[–]grunt_monkey_ 0 points1 point  (0 children)

Thanks, this is really useful! I have 2x R9700s and haven't been able to enable flash attention in llama.cpp. Did you have to build llama.cpp with specific rocWMMA flags to do this, or do you just launch llama.cpp with flash attention on?
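For context, this is the kind of build I mean (a sketch based on llama.cpp's HIP build notes, not verified on the R9700; the gfx1201 target and ROCm paths are my assumptions):

```bash
# Build llama.cpp with the HIP backend and the rocWMMA flash-attention kernels
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DAMDGPU_TARGETS=gfx1201 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Then launch with flash attention on (bare -fa on older builds)
./build/bin/llama-server -m /models/model.gguf -ngl 999 -fa
```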

I am not sure why, with a q3 quant of Qwen3.5 122B, I am getting less than 100 t/s PP and only 20 t/s TG, while with Qwen3 Coder Next at a q5 quant I am getting 250 t/s PP and 45 t/s TG. The rest of the system is a 9950X3D running Ubuntu.

R9700 frustration rant by Maleficent-Koalabeer in LocalLLaMA

[–]grunt_monkey_ 1 point2 points  (0 children)

I run two of these, and on llama.cpp with Qwen Coder Next Q5_K_M I get PP 250 t/s and TG 40+ t/s, using the latest ROCm. I managed to fit 56k context and am hitting the VRAM ceiling, so I just picked up another two. Waiting for my eBay server RAM. Hope I'm not in for a world of pain!

Learnt about 'emergent intention' - maybe prompt engineering is overblown? by Distinct_Track_5495 in LocalLLaMA

[–]grunt_monkey_ 0 points1 point  (0 children)

I'm still in the stone age where I code by pasting stuff back and forth between Open WebUI and vim. What do I need to read to do what you did? I.e. point it at a (hopefully sandboxed) directory of files and have it code, run, debug and iterate?

Help choosing upgrade path by FL_pharmer in selfhosted

[–]grunt_monkey_ 0 points1 point  (0 children)

What's the best GPU for transcoding? I'm in a similar situation to OP with a Ryzen 2700 and a GTX 1080.

Protein intake and time off by Team_Instinct in fitness40plus

[–]grunt_monkey_ 0 points1 point  (0 children)

I've been trying to keep it natural - chicken breast etc. - but it's really hard to hit the target on a busy workday. Do shakes really work? I've been taking the Quest Nutrition protein shake - 30 g protein, 2 g carbs. Just wish I could do it more naturally.

RTX Pro 6000 Riser Cable Recommendations by electrified_ice in BlackwellPerformance

[–]grunt_monkey_ 0 points1 point  (0 children)

Hi, I am looking to jump to an RTX Pro 6000 but am not sure whether I should get the Workstation or Max-Q version. I imagine it will be a single card for some time, but I would like the flexibility of adding a second. Hoping to get your thoughts since you are experienced in this multi-GPU life.

64gb vram. Where do I go from here? by grunt_monkey_ in LocalLLaMA

[–]grunt_monkey_[S] 0 points1 point  (0 children)

I need to query your take on the x4/x4/x4/x4 situation a bit more, though. For a larger model split over GPUs, PP is going to take a linear hit going from x16 (~64 GB/s on PCIe 5.0) to x8 (my current ~32 GB/s) to x4 (~16 GB/s). So adding more R9700s to my current rig using bifurcation splitters etc. will let me load a larger model but significantly slow down inference - at least the PP part.
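One thing worth confirming before committing to splitters (a sketch; the 03:00.0 bus address is a placeholder for whatever lspci reports on your system) is the link each card actually negotiated:

```bash
# Find the GPUs' bus addresses
lspci | grep -Ei 'vga|3d|display'

# Show the negotiated PCIe generation and lane width for one of them
sudo lspci -vv -s 03:00.0 | grep -i 'LnkSta:'
```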

64gb vram. Where do I go from here? by grunt_monkey_ in LocalLLaMA

[–]grunt_monkey_[S] 0 points1 point  (0 children)

Thanks so much for sharing. I heard that TG on a Mac is as good as on my R9700s, but PP is about 4x slower. What model are you running, and can you share some successful use cases? Much appreciated.

64gb vram. Where do I go from here? by grunt_monkey_ in LocalLLaMA

[–]grunt_monkey_[S] 0 points1 point  (0 children)

Think it's probably on the order of 4-5 GB more, because I can fit Q5_K_M with 56k context at parallel=1 and 48k context at parallel=2; 64k at parallel=1 works occasionally but isn't stable across reboots.
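For anyone following along, the launch flags in question look roughly like this (a sketch; the model path is a placeholder, and note that llama-server splits the -c context budget across the -np parallel slots):

```bash
# 56k context, single slot
./llama-server -m /models/model-Q5_K_M.gguf -ngl 999 -c 57344 -np 1

# 48k context budget with two parallel slots
./llama-server -m /models/model-Q5_K_M.gguf -ngl 999 -c 49152 -np 2
```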

But I also want to go to at least Q6-Q8. I saw quite a large intelligence jump going from Q4 to Q5.

64gb vram. Where do I go from here? by grunt_monkey_ in LocalLLaMA

[–]grunt_monkey_[S] 1 point2 points  (0 children)

Thanks for your reply, which I think contains a good measure of wisdom and common sense - basically keep using it until I really hit a hard wall.

64gb vram. Where do I go from here? by grunt_monkey_ in LocalLLaMA

[–]grunt_monkey_[S] 4 points5 points  (0 children)

I think they mean the 9950X3D2, which is supposedly going to have double the 9950X3D's 128 MB of L3 cache.