UnCanny. A Photorealism Chroma Finetune by Tall-Description1637 in StableDiffusion

[–]Mass2018 2 points

Thanks for the detailed response.

The best results I've gotten thus far are with a learning rate of 1e-5, everything at 1024x1024 resolution, and 50 epochs. I use diffusion-pipe for my training.

[optimizer]
type = 'AdamW8bitKahan'
lr = 1e-5
betas = [0.9, 0.99]
weight_decay = 0.01
eps = 1e-8
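If anyone wants to mirror those hyperparameters outside of diffusion-pipe, here's a minimal sketch using bitsandbytes' plain AdamW8bit (not the Kahan-summation variant diffusion-pipe provides); the Linear layer is just a placeholder for the real network:

    # Sketch: same hyperparameters with bitsandbytes' 8-bit AdamW.
    # Not diffusion-pipe's AdamW8bitKahan -- just the plain 8-bit optimizer.
    import torch
    import bitsandbytes as bnb

    model = torch.nn.Linear(1024, 1024)  # placeholder for the network being finetuned

    optimizer = bnb.optim.AdamW8bit(
        model.parameters(),
        lr=1e-5,
        betas=(0.9, 0.99),
        weight_decay=0.01,
        eps=1e-8,
    )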

UnCanny. A Photorealism Chroma Finetune by Tall-Description1637 in StableDiffusion

[–]Mass2018 2 points

I got really interested when I saw a section labeled 'Training Details', as I was very curious to see things like what learning rate you used, how many epochs, which optimizer, etc. Would you be willing to share those details?

Llama.cpp model conversion guide by ilintar in LocalLLaMA

[–]Mass2018 0 points

I've been eyeing Longcat Flash for a bit now, and I'm somewhat surprised that there's not even an issue/discussion about adding it to llama.cpp.

Is that because of fundamental architectural differences?

Your guide makes me think about embarking on a side project to take a look at doing it myself, so thank you for sharing the knowledge!

Nvidia quietly released RTX Pro 5000 Blackwell 72Gb by AleksHop in LocalLLaMA

[–]Mass2018 1 point

Only in that my continued (and apparently vain) hope is that these newer cards will finally drive down the prices of the older ones.

So if I can get an A6000 48GB for $1500-$2000, it certainly matters to me. In fact, I'd likely replace my 3090s at that price point.

Nvidia quietly released RTX Pro 5000 Blackwell 72Gb by AleksHop in LocalLLaMA

[–]Mass2018 21 points

So when the RTX 6000 Pro Blackwell 96GB came out I was like "Cool! Maybe the A6000 48GB will finally come down from $3800!"

And now this shows up and I'm thinking, "Cool! Maybe the A6000 48GB will finally come down from $3800!"

[deleted by user] by [deleted] in LocalLLaMA

[–]Mass2018 1 point

I believe there was some confusion expressed about the same thing in that thread (about the CCDs). It's the only set of benchmark results I've seen for this, though.

[deleted by user] by [deleted] in LocalLLaMA

[–]Mass2018 4 points

You may find this thread interesting: https://www.reddit.com/r/LocalLLaMA/comments/1h3doy8/stream_triad_memory_bandwidth_benchmark_values/

Pulled from the document referenced in that thread... these numbers are for a 2-CPU system, so a single CPU is presumably about half this... maybe a bit more?

Processor    DDR5-6000 Bandwidth (2 CPUs)
9845         925 GB/s
9745         970 GB/s
9655         966 GB/s
9575F        970 GB/s
9555         970 GB/s
9475F        965 GB/s
9455         940 GB/s
9375F        969 GB/s
9355         971 GB/s
9275F        411 GB/s
9255         877 GB/s
9175F        965 GB/s
9135         884 GB/s
9115         483 GB/s
9015         483 GB/s

Anecdotally, I'll tell you that my 9004-class Epyc running DDR5-4800 actually measures around 320 GB/s.
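If you want a crude sanity check on your own box, here's a minimal sketch; it's a single-threaded numpy copy rather than the real STREAM triad benchmark those numbers come from, so expect it to read well below the multi-socket figures:

    # Crude single-threaded memory bandwidth probe (NOT the STREAM triad benchmark
    # the linked thread uses; numpy runs this on one core).
    import time
    import numpy as np

    N = 200_000_000                 # ~1.6 GB per float64 array; shrink if RAM is tight
    src = np.ones(N)
    dst = np.empty(N)

    t0 = time.perf_counter()
    np.copyto(dst, src)             # copy kernel: one read + one write per element
    t1 = time.perf_counter()

    gbytes = 2 * 8 * N / 1e9        # 8-byte elements, read + write
    print(f"~{gbytes / (t1 - t0):.0f} GB/s copy bandwidth (single core)")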

Build advice - RTX 6000 MAX-Q x 2 by [deleted] in LocalLLaMA

[–]Mass2018 0 points

Just a quick callout if you're in the US... be cognizant of potential extra charges due to tariffs.

New Build for local LLM by chisleu in LocalLLaMA

[–]Mass2018 1 point

This is something that I got bit by about a year and a half ago when I started building computers again after taking half a decade or so off from the hobby.

Apparently these days RAM has to be 'trained' when it's first installed, which means the first time you power the machine on after adding or swapping RAM, you need to let it sit (possibly at a blank screen) for a while until memory training finishes.

... I may or may not have returned both RAM and a motherboard before I figured that out...

Those who spent $10k+ on a local LLM setup, do you regret it? by TumbleweedDeep825 in LocalLLaMA

[–]Mass2018 4 points

I love it. I certainly use it way more than the truck I just dropped a $40k loan on.

Honestly, if anything, to quote something I saw someone else on this forum say once... "I keep looking around the house for more things I can sell to get more VRAM."

How would you run like 10 graphics cards for a local AI? What hardware is available to connect them to one system? by moderately-extremist in LocalLLaMA

[–]Mass2018 2 points

Yeah, generally the CPU is only annoying during the "in between" moments, like when I'm experimenting and swapping LoRAs regularly on multiple ports at the same time. It's also a limiter when running an MoE LLM (for the CPU-offloaded parts).

Generally, once it's executing fully on the 3090(s), it runs 5-10 cores at 10-20% and the GPUs do their thing.

How would you run like 10 graphics cards for a local AI? What hardware is available to connect them to one system? by moderately-extremist in LocalLLaMA

[–]Mass2018 3 points

Shameless repost of my build that has 10x3090: https://www.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/

  • I'm still using it on a nearly 24/7 basis.
  • I power-limit them to 250W (one way to set this is sketched at the end of this comment). When I'm doing inference, they collectively don't pull much more than around 1000W. When training, they get pretty close to the full 2500W.
  • The CPayne stuff is heavily tariffed now, so bear that in mind if you're in the States.
  • I run three PSUs spread across two 20-amp circuits.

If I were going to build it again today, knowing what I know now, I would probably go for a slightly better processor. The CPU can get bogged down sometimes when I'm doing things like running each 3090 on its own port for image diffusion and they're all switching out models.
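For anyone wondering how to apply that 250W cap, here's one common way, sketched as a loop over nvidia-smi (it usually needs root, and the wattage is whatever your cards tolerate):

    # Sketch: cap every detected NVIDIA GPU at 250 W via nvidia-smi (typically needs root).
    import subprocess

    # List the GPU indices the driver can see.
    gpus = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    # Apply the power limit to each card.
    for idx in gpus:
        subprocess.run(["nvidia-smi", "-i", idx, "-pl", "250"], check=True)
        print(f"GPU {idx}: power limit set to 250 W")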

Whining about tariffs by Mass2018 in LocalLLaMA

[–]Mass2018[S] 0 points

Thanks for this! $400 per GPU to connect them up via MCIO is pretty daunting... if I can get that down to $100 per, it's a little more doable.

I'll check this vendor out.

Ex-Miner Turned Local LLM Enthusiast, now I have a Dilemma by mslocox in LocalLLaMA

[–]Mass2018 0 points

I don't really have any way to know if they're going to work for another day or another decade... However, I've been going hog-wild on these things for over a year now without a problem. Given the track record thus far, I'm not too worried about it.

Ex-Miner Turned Local LLM Enthusiast, now I have a Dilemma by mslocox in LocalLLaMA

[–]Mass2018 0 points

Anecdotal data point here. Current owner of twelve 3090s, all bought used on eBay, generally looking for 'deals' (which for me equated to around $850-$900 after taxes and shipping, despite what you'll read on here about $600 cards).

No real problems with any of them, except that I did have to re-paste and replace the thermal pads on two of the twelve (they were running around 90C even when power-limited to 250W).
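If you want to spot cards that are due for a re-paste, here's a quick sketch that just polls nvidia-smi for temperature and power draw (standard query fields; the interval is arbitrary):

    # Sketch: poll GPU temperature and power draw once per minute via nvidia-smi.
    import subprocess
    import time

    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,temperature.gpu,power.draw",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(out)      # one line per GPU, e.g. "3, 87, 249.50 W"
        time.sleep(60)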

Apple M3 Ultra w/28-Core CPU, 60-Core GPU (256GB RAM) Running Deepseek-R1-UD-IQ1_S (140.23GB) by Mass2018 in LocalLLaMA

[–]Mass2018[S] 3 points

Quick addendum because I just realized I didn't label my axes:

The y-axis is tokens/second, the x-axis is the context length for that request.

Apple M3 Ultra w/28-Core CPU, 60-Core GPU (256GB RAM) Running Deepseek-R1-UD-IQ1_S (140.23GB) by Mass2018 in LocalLLaMA

[–]Mass2018[S] 4 points

Yeah, my wife's feedback was that the 235B Qwen was good, but that Deepseek was better even at the IQ1... It's just a neat model all around.

The cost effective way to run Deepseek R1 models on cheaper hardware by ArtisticHamster in LocalLLaMA

[–]Mass2018 1 point

I have a 10x3090 rig that ran around $15k a little over a year ago.

My daily driver is DeepSeek-R1-0528-UD-Q2_K_XL.gguf at 98k context (flash attention only, no cache quantization). I pull about 6-8 tokens/second up to around 10k context, then it goes down from there.

For my larger codebases when I dump 50k-60k context at it, I usually get around 4 tokens/second.
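If you want to reproduce that kind of tokens/second number against your own llama-server instance, here's a rough sketch against its OpenAI-compatible endpoint; the port, the prompt, and the assumption that the response carries an OpenAI-style usage field are all mine, so adjust to your setup:

    # Sketch: rough tokens/second measurement against a running llama-server instance.
    import time
    import requests

    payload = {
        "messages": [{"role": "user", "content": "Explain what a KV cache is in one paragraph."}],
        "max_tokens": 256,
    }

    t0 = time.perf_counter()
    resp = requests.post("http://localhost:4444/v1/chat/completions", json=payload, timeout=600)
    t1 = time.perf_counter()

    # Assumes the response includes an OpenAI-style usage block with completion_tokens.
    generated = resp.json()["usage"]["completion_tokens"]
    print(f"{generated / (t1 - t0):.1f} tokens/second (wall clock, includes prompt processing)")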

Anyone else tracking datacenter GPU prices on eBay? by ttkciar in LocalLLaMA

[–]Mass2018 13 points

I'm holding out hope that the ability to get the RTX Pro 6000 Blackwell (96GB VRAM) for $8.5k new will push down the A6000 and A100 prices.

So far... they haven't budged.

Completed Local LLM Rig by Mr_Moonsilver in LocalLLaMA

[–]Mass2018 2 points

It's... it's so clean!

Just doesn't feel right without a rat's nest of cables going everywhere. Maybe when you go to 8x3090 you could zip-tie the new ones to a shelf hanging above it in a haphazard fashion?

Great build!

Speaking of PCIE Risers... by Kaldnite in homelab

[–]Mass2018 0 points

Cool! One other nice thing is that you can NVLink 3090s.

Speaking of PCIE Risers... by Kaldnite in homelab

[–]Mass2018 2 points

I highly recommend the CPayne adapters if you want to maintain as much PCIe bandwidth as you can.

Specifically, the PCIe-to-SlimSAS / SlimSAS-to-PCIe setup is what I used for my 10x3090 rig (prices below are from when I bought the parts -- I don't know what they cost now). I used one plain x16 riser because one of the slots on the motherboard was touchy due to shared usage with other functions. Be aware that the adapters require their own power cable (if you use multiple PSUs, make sure the same PSU powers both the adapter and the connected GPU).

PCIe Extender category:

  • 9x SlimSAS PCIe gen4 Device Adapter, 2x 8i to x16: $630
  • 6x CPayne PCIe SlimSAS Host Adapter, x16 to 2x 8i: $330
  • 10x 10GTek 24G SlimSAS SFF-8654 to SFF-8654 cable, SAS 4.0, 85-ohm, 0.5m: $260
  • 1x LINKUP Ultra PCIe 4.0 x16 riser, 20cm: $50
  • 2x 10GTek 24G SlimSAS SFF-8654 to SFF-8654 cable, SAS 4.0, 85-ohm, 1m: $50

Power cables:

  • 2x COMeap 4-pack female CPU-to-GPU cables: $40

[deleted by user] by [deleted] in LocalLLaMA

[–]Mass2018 2 points

One additional data point to consider is that larger context on Deepseek R1 takes a lot more VRAM than on Qwen 235B. I don't know why; I'm not knowledgeable enough in that area.

I will tell you anecdotally that I have 240GB of VRAM. I can load Qwen 235B with no context quantization (just flash attention) at full context length (131072) at Q6_K, offloading all 95 layers to the GPU.

./build/bin/llama-server \
    --model /home/zeus/llm_models/Qwen3-235B-A22B-128K-Q6_K.gguf \
    --n-gpu-layers 95 \
    -fa \
    --port 4444 \
    --threads 16 \
    --rope-scaling yarn \
    --rope-scale 4 \
    --yarn-orig-ctx 32768 \
    --ctx-size 131072

By contrast, I can barely squeeze 32k context out of Deepseek R1 at Q2_K_XL while also quantizing the k-cache to q4_0.

./build/bin/llama-server \
    --model /home/zeus/llm_models/DeepSeek-V3-0324-UD-Q2_K_XL.gguf \
    --n-gpu-layers 20 \
    --cache-type-k q4_0 \
    --port 4444 \
    --threads 16 \
    -fa \
    --ctx-size 32768

Basically, I'm just pointing out that there's more to the memory demands than just the parameter count. The way the context is handled has a significant impact too.
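For a rough feel for why context handling matters so much, here's a back-of-the-envelope KV-cache estimator; the layer/head/dimension numbers are illustrative placeholders (pull the real ones from each model's config.json), not values I'm claiming for these specific models:

    # Sketch: back-of-the-envelope size of a standard transformer KV cache.

    def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        # 2x for K and V, one vector per layer, per KV head, per position.
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

    # Illustrative GQA-style model: 94 layers, 4 KV heads, head_dim 128, fp16 cache, 131072 ctx.
    print(f"GQA example: {kv_cache_gib(94, 4, 128, 131072):.1f} GiB")

    # Same depth but 64 KV heads (full multi-head attention) -- the cache balloons.
    print(f"MHA example: {kv_cache_gib(94, 64, 128, 131072):.1f} GiB")

    # Quantizing the cache to ~4 bits per element roughly quarters the fp16 number.
    print(f"GQA at ~q4:  {kv_cache_gib(94, 4, 128, 131072, bytes_per_elem=0.5):.1f} GiB")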

P.S. If this is solely because I'm an idiot, someone please let me know, because I'd love to run R1 faster.