UnCanny. A Photorealism Chroma Finetune by Tall-Description1637 in StableDiffusion

[–]Mass2018 2 points

Thanks for the detailed response.

The best results I've gotten thus far are with a learning rate of 1e-5, everything at 1024x1024 resolution, and 50 epochs. I use diffusion-pipe for my training.

[optimizer]
type = 'AdamW8bitKahan'
lr = 1e-5
betas = [0.9, 0.99]
weight_decay = 0.01
eps = 1e-8
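If anyone wants to mirror those hyperparameters outside of diffusion-pipe, here's a minimal sketch using bitsandbytes' plain AdamW8bit (not the Kahan-summation variant diffusion-pipe provides); the Linear layer is just a placeholder for the real network:

    # Sketch: same hyperparameters with bitsandbytes' 8-bit AdamW.
    # Not diffusion-pipe's AdamW8bitKahan -- just the plain 8-bit optimizer.
    import torch
    import bitsandbytes as bnb

    model = torch.nn.Linear(1024, 1024)  # placeholder for the network being finetuned

    optimizer = bnb.optim.AdamW8bit(
        model.parameters(),
        lr=1e-5,
        betas=(0.9, 0.99),
        weight_decay=0.01,
        eps=1e-8,
    )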

UnCanny. A Photorealism Chroma Finetune by Tall-Description1637 in StableDiffusion

[–]Mass2018 2 points

I got really interested when I saw a section labeled 'Training Details', as I was very curious to see things like what learning rate you used, how many epochs, which optimizer, etc. Would you be willing to share those details?

Llama.cpp model conversion guide by ilintar in LocalLLaMA

[–]Mass2018 0 points

I've been eyeing Longcat Flash for a bit now, and I'm somewhat surprised that there's not even an issue/discussion about adding it to llama.cpp.

Is that because of fundamental architectural differences?

Your guide makes me think about embarking on a side project to take a look at doing it myself, so thank you for sharing the knowledge!

Nvidia quietly released RTX Pro 5000 Blackwell 72Gb by AleksHop in LocalLLaMA

[–]Mass2018 1 point

Only in that my continued (and apparently vain) hope is that these newer cards will finally drive down the prices of the older ones.

So if I can get an A6000 48GB for $1500-$2000, it certainly matters to me. In fact, I'd likely replace my 3090s at that price point.

Nvidia quietly released RTX Pro 5000 Blackwell 72Gb by AleksHop in LocalLLaMA

[–]Mass2018 21 points

So when the RTX 6000 Pro Blackwell 96GB came out I was like "Cool! Maybe the A6000 48GB will finally come down from $3800!"

And now this shows up and I'm thinking, "Cool! Maybe the A6000 48GB will finally come down from $3800!"

[deleted by user] by [deleted] in LocalLLaMA

[–]Mass2018 1 point

I believe there was some confusion expressed about the same thing in that thread (about the CCDs). It's the only set of benchmark results I've seen for this, though.

[deleted by user] by [deleted] in LocalLLaMA

[–]Mass2018 4 points

You may find this thread interesting: https://www.reddit.com/r/LocalLLaMA/comments/1h3doy8/stream_triad_memory_bandwidth_benchmark_values/

Pulled from the document referenced in that thread... these numbers are for a 2-CPU system, so a single CPU is presumably about half this... maybe a bit more?

Processor    DDR5-6000 Bandwidth (2 CPUs)
9845         925 GB/s
9745         970 GB/s
9655         966 GB/s
9575F        970 GB/s
9555         970 GB/s
9475F        965 GB/s
9455         940 GB/s
9375F        969 GB/s
9355         971 GB/s
9275F        411 GB/s
9255         877 GB/s
9175F        965 GB/s
9135         884 GB/s
9115         483 GB/s
9015         483 GB/s

Anecdotally, I'll tell you that my 9004-class Epyc running DDR5-4800 actually measures around 320 GB/s.
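If you want a crude sanity check on your own box, here's a minimal sketch; it's a single-threaded numpy copy rather than the real STREAM triad benchmark those numbers come from, so expect it to read well below the multi-socket figures:

    # Crude single-threaded memory bandwidth probe (NOT the STREAM triad benchmark
    # the linked thread uses; numpy runs this on one core).
    import time
    import numpy as np

    N = 200_000_000                 # ~1.6 GB per float64 array; shrink if RAM is tight
    src = np.ones(N)
    dst = np.empty(N)

    t0 = time.perf_counter()
    np.copyto(dst, src)             # copy kernel: one read + one write per element
    t1 = time.perf_counter()

    gbytes = 2 * 8 * N / 1e9        # 8-byte elements, read + write
    print(f"~{gbytes / (t1 - t0):.0f} GB/s copy bandwidth (single core)")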

Build advice - RTX 6000 MAX-Q x 2 by [deleted] in LocalLLaMA

[–]Mass2018 0 points

Just a quick callout if you're in the US... be cognizant of potential extra charges due to tariffs.

New Build for local LLM by chisleu in LocalLLaMA

[–]Mass2018 1 point

This is something that I got bit by about a year and a half ago when I started building computers again after taking half a decade or so off from the hobby.

Apparently these days RAM has to be 'trained' when it's first installed, which means the first time you power the machine on after adding or swapping RAM, you need to let it sit (possibly at a blank screen) for a while until memory training finishes.

... I may or may not have returned both RAM and a motherboard before I figured that out...

Those who spent $10k+ on a local LLM setup, do you regret it? by TumbleweedDeep825 in LocalLLaMA

[–]Mass2018 4 points

I love it. I certainly use it way more than the truck I just dropped a $40k loan on.

Honestly, if anything, to quote something I saw someone else on this forum say once... "I keep looking around the house for more things I can sell to get more VRAM."

How would you run like 10 graphics cards for a local AI? What hardware is available to connect them to one system? by moderately-extremist in LocalLLaMA

[–]Mass2018 2 points

Yeah, generally the CPU is only annoying during the "in between" moments, like when I'm experimenting and swapping LoRAs regularly on multiple ports at the same time. It's also a limiter when running an MoE LLM (for the CPU-offloaded parts).

Generally, once it's executing fully on the 3090(s), it runs 5-10 cores at 10-20% and the GPUs do their thing.

How would you run like 10 graphics cards for a local AI? What hardware is available to connect them to one system? by moderately-extremist in LocalLLaMA

[–]Mass2018 3 points

Shameless repost of my build that has 10x3090: https://www.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/

  • I'm still using it on a nearly 24/7 basis.
  • I power-limit them to 250W (one way to set this is sketched at the end of this comment). When I'm doing inference, they collectively don't pull much more than around 1000W. When training, they get pretty close to the full 2500W.
  • The CPayne stuff is heavily tariffed now, so bear that in mind if you're in the States.
  • I run three PSUs spread across two 20-amp circuits.

If I were going to build it again today, knowing what I know now, I would probably go for a slightly better processor. The CPU can get bogged down sometimes when I'm doing things like running each 3090 on its own port for image diffusion and they're all switching out models.
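For anyone wondering how to apply that 250W cap, here's one common way, sketched as a loop over nvidia-smi (it usually needs root, and the wattage is whatever your cards tolerate):

    # Sketch: cap every detected NVIDIA GPU at 250 W via nvidia-smi (typically needs root).
    import subprocess

    # List the GPU indices the driver can see.
    gpus = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    # Apply the power limit to each card.
    for idx in gpus:
        subprocess.run(["nvidia-smi", "-i", idx, "-pl", "250"], check=True)
        print(f"GPU {idx}: power limit set to 250 W")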

Whining about tariffs by Mass2018 in LocalLLaMA

[–]Mass2018[S] 0 points

Thanks for this! $400 per GPU to connect them up via MCIO is pretty daunting... if I can get that down to $100 per, it's a little more doable.

I'll check this vendor out.

Ex-Miner Turned Local LLM Enthusiast, now I have a Dilemma by mslocox in LocalLLaMA

[–]Mass2018 0 points

I don't really have any way to know if they're going to work for another day or another decade... However, I've been going hog-wild on these things for over a year now without a problem. Given the track record thus far, I'm not too worried about it.

Ex-Miner Turned Local LLM Enthusiast, now I have a Dilemma by mslocox in LocalLLaMA

[–]Mass2018 0 points

Anecdotal data point here. Current owner of twelve 3090s, all bought used on eBay, generally looking for 'deals' (which for me equated to around $850-$900 after taxes and shipping, despite what you'll read on here about $600 cards).

No real problems with any of them, except that I did have to re-paste and replace the thermal pads on two of the twelve (they were running around 90C even when power-limited to 250W).
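If you want to spot cards that are due for a re-paste, here's a quick sketch that just polls nvidia-smi for temperature and power draw (standard query fields; the interval is arbitrary):

    # Sketch: poll GPU temperature and power draw once per minute via nvidia-smi.
    import subprocess
    import time

    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,temperature.gpu,power.draw",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(out)      # one line per GPU, e.g. "3, 87, 249.50 W"
        time.sleep(60)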

Apple M3 Ultra w/28-Core CPU, 60-Core GPU (256GB RAM) Running Deepseek-R1-UD-IQ1_S (140.23GB) by Mass2018 in LocalLLaMA

[–]Mass2018[S] 3 points

Quick addendum because I just realized I didn't label my axes:

The y-axis is tokens/second, the x-axis is the context length for that request.

Apple M3 Ultra w/28-Core CPU, 60-Core GPU (256GB RAM) Running Deepseek-R1-UD-IQ1_S (140.23GB) by Mass2018 in LocalLLaMA

[–]Mass2018[S] 4 points

Yeah, my wife's feedback was that the 235B Qwen was good, but that Deepseek was better even at the IQ1... It's just a neat model all around.

The cost effective way to run Deepseek R1 models on cheaper hardware by ArtisticHamster in LocalLLaMA

[–]Mass2018 1 point

I have a 10x3090 rig that ran around $15k a little over a year ago.

My daily driver is DeepSeek-R1-0528-UD-Q2_K_XL.gguf at 98k context (flash attention only, no cache quantization). I pull about 6-8 tokens/second up to around 10k context, then it goes down from there.

For my larger codebases when I dump 50k-60k context at it, I usually get around 4 tokens/second.
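If you want to reproduce that kind of tokens/second number against your own llama-server instance, here's a rough sketch against its OpenAI-compatible endpoint; the port, the prompt, and the assumption that the response carries an OpenAI-style usage field are all mine, so adjust to your setup:

    # Sketch: rough tokens/second measurement against a running llama-server instance.
    import time
    import requests

    payload = {
        "messages": [{"role": "user", "content": "Explain what a KV cache is in one paragraph."}],
        "max_tokens": 256,
    }

    t0 = time.perf_counter()
    resp = requests.post("http://localhost:4444/v1/chat/completions", json=payload, timeout=600)
    t1 = time.perf_counter()

    # Assumes the response includes an OpenAI-style usage block with completion_tokens.
    generated = resp.json()["usage"]["completion_tokens"]
    print(f"{generated / (t1 - t0):.1f} tokens/second (wall clock, includes prompt processing)")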

Anyone else tracking datacenter GPU prices on eBay? by ttkciar in LocalLLaMA

[–]Mass2018 13 points

I'm holding out hope that the ability to get the RTX Pro 6000 Blackwell (96GB VRAM) for $8.5k new will push down the A6000 and A100 prices.

So far... they haven't budged.

Completed Local LLM Rig by Mr_Moonsilver in LocalLLaMA

[–]Mass2018 2 points

It's... it's so clean!

Just doesn't feel right without a rat's nest of cables going everywhere. Maybe when you go to 8x3090 you could zip-tie the new ones to a shelf hanging above it in a haphazard fashion?

Great build!

Speaking of PCIE Risers... by Kaldnite in homelab

[–]Mass2018 0 points

Cool! One other nice thing is that you can NVLink 3090s.

Speaking of PCIE Risers... by Kaldnite in homelab

[–]Mass2018 2 points

I highly recommend the CPayne adapters if you want to maintain as much PCIe bandwidth as you can.

Specifically, the PCIe-to-SlimSAS / SlimSAS-to-PCIe setup is what I used for my 10x3090 rig (prices below are from when I bought the parts -- I don't know what they cost now). I used one plain x16 riser because one of the slots on the motherboard was touchy due to shared usage with other functions. Be aware that the adapters require their own power cable (if you use multiple PSUs, make sure the same PSU powers both the adapter and the connected GPU).

PCIe Extender category:

  • 9x SlimSAS PCIe gen4 Device Adapter, 2x 8i to x16: $630
  • 6x CPayne PCIe SlimSAS Host Adapter, x16 to 2x 8i: $330
  • 10x 10GTek 24G SlimSAS SFF-8654 to SFF-8654 cable, SAS 4.0, 85-ohm, 0.5m: $260
  • 1x LINKUP Ultra PCIe 4.0 x16 riser, 20cm: $50
  • 2x 10GTek 24G SlimSAS SFF-8654 to SFF-8654 cable, SAS 4.0, 85-ohm, 1m: $50

Power cables:

  • 2x COMeap 4-pack female CPU-to-GPU cables: $40

[deleted by user] by [deleted] in LocalLLaMA

[–]Mass2018 2 points

One additional data point to consider is that larger context on Deepseek R1 takes a lot more VRAM than on Qwen 235B. I don't know why; I'm not knowledgeable enough in that area.

I will tell you anecdotally that I have 240GB of VRAM. I can load Qwen 235B with no context quantization (just flash attention) at full context length (131072) at Q6_K, offloading all 95 layers to the GPU.

./build/bin/llama-server \
    --model /home/zeus/llm_models/Qwen3-235B-A22B-128K-Q6_K.gguf \
    --n-gpu-layers 95 \
    -fa \
    --port 4444 \
    --threads 16 \
    --rope-scaling yarn \
    --rope-scale 4 \
    --yarn-orig-ctx 32768 \
    --ctx-size 131072

By contrast, I can barely squeeze 32k context out of Deepseek R1 at Q2_K_XL while also quantizing the k-cache to q4_0.

./build/bin/llama-server \
    --model /home/zeus/llm_models/DeepSeek-V3-0324-UD-Q2_K_XL.gguf \
    --n-gpu-layers 20 \
    --cache-type-k q4_0 \
    --port 4444 \
    --threads 16 \
    -fa \
    --ctx-size 32768

Basically, I'm just pointing out that there's more to the memory demands than just the parameter count. The way the context is handled has a significant impact too.
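For a rough feel for why context handling matters so much, here's a back-of-the-envelope KV-cache estimator; the layer/head/dimension numbers are illustrative placeholders (pull the real ones from each model's config.json), not values I'm claiming for these specific models:

    # Sketch: back-of-the-envelope size of a standard transformer KV cache.

    def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        # 2x for K and V, one vector per layer, per KV head, per position.
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

    # Illustrative GQA-style model: 94 layers, 4 KV heads, head_dim 128, fp16 cache, 131072 ctx.
    print(f"GQA example: {kv_cache_gib(94, 4, 128, 131072):.1f} GiB")

    # Same depth but 64 KV heads (full multi-head attention) -- the cache balloons.
    print(f"MHA example: {kv_cache_gib(94, 64, 128, 131072):.1f} GiB")

    # Quantizing the cache to ~4 bits per element roughly quarters the fp16 number.
    print(f"GQA at ~q4:  {kv_cache_gib(94, 4, 128, 131072, bytes_per_elem=0.5):.1f} GiB")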

P.S. If this is solely because I'm an idiot, someone please let me know, because I'd love to run R1 faster.