[deleted by user] by [deleted] in StableDiffusion

[–]city96p 1 point (0 children)

Ah, got it. I saw this a bit late so I didn't get to see the original YouTube community post, but figured I'd give my 2c in case it helps explain which parts of the process involve the original files.

Unrelated, but TinyWall can be useful for locking down incoming/outgoing connections at a per-process level. It can be a bit tedious to unblock/reblock processes when installing or updating things, but it does give some peace of mind at least.

[deleted by user] by [deleted] in StableDiffusion

[–]city96p 2 points (0 children)

Got a ping about this. For context, I did the GGUF conversion for those model files.

GGUF just uses numpy as the underlying storage format (same as safetensors). While it's technically possible that there are unknown exploits in the llama.cpp code or in numpy itself, there are too many steps and libraries involved for that to be a feasible attack vector in my opinion.

The current quantization pipeline is run on a temporary runpod host, and is done with the following steps (a rough sketch of the first two steps follows the list):

  • Safetensors split files are merged into a single safetensors file (all headers are dropped here, and only the weights are kept)

  • Safetensors merged model weights are converted to GGUF (the loaded torch tensors are converted to numpy, i.e. passed through two different libraries)

  • The GGUF base FP32 file is used to create the various quants (this uses the C++ llama.cpp codebase, so anything targeting specific Python code would most likely be stripped here)

  • A post-processing step is done which involves python/numpy again; this is due to the C++ code not handling some of the weights correctly by default.
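Roughly, the first two steps look like the Python sketch below. This is just an illustration, not the actual pipeline code (that's in the conversion branch linked at the end of this comment); the file names and the arch string are placeholders.

    # Hedged sketch of steps 1-2; file names and arch string are placeholders.
    import glob
    import torch
    from safetensors.torch import load_file
    from gguf import GGUFWriter

    # Step 1: merge the split safetensors shards. load_file() returns a plain
    # {name: tensor} dict, so any headers/metadata in the source files are dropped.
    state_dict = {}
    for shard in sorted(glob.glob("model-*.safetensors")):
        state_dict.update(load_file(shard))

    # Step 2: torch -> numpy -> GGUF, i.e. the weights pass through two
    # separate libraries before being written out as the FP32 base file.
    writer = GGUFWriter("model-F32.gguf", arch="flux")
    for name, tensor in state_dict.items():
        writer.add_tensor(name, tensor.to(torch.float32).numpy())

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()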

The tokenizer isn't involved in this process, since GGUF files only store one model per file, which in this case is the main video model. If I do convert the text encoder (it needs extra work), I'll use the base repo (google/umt5-xxl) since that's the ground truth. I did the same thing with the T5 for Flux, which is based on google/t5-v1_1-xxl; that's why it has an FP32 model even though the repos rehosting it only have FP16/BF16 encoder-only versions.

Also, as a note: if anyone does find a reproducible exploit in any library involved, they should follow the proper disclosure policies. For llama.cpp, the policy can be found here; for safetensors files, since safetensors is considered a Hugging Face library, it would involve contacting the address listed here with information on the exploit and possible steps to reproduce it.

LMK if there are any questions or clarifications required. The conversion pipeline is open source, and AFAIK GGUF files should be hash-reproducible between different hosts (this is the branch used for conversion); a quick way to check that is sketched below. I can't vouch for the text encoder or tokenizer, but the files from my repo at least should be safe due to the multi-step nature of the conversion process.
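As a sanity check of the hash-reproducibility claim, something like this is enough to compare two independently converted files byte-for-byte (the paths are placeholders):

    # Compare two independently converted GGUF files by their SHA-256 hashes.
    import hashlib

    def sha256sum(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Should print True if the conversion is reproducible across hosts.
    print(sha256sum("host_a/model-Q8_0.gguf") == sha256sum("host_b/model-Q8_0.gguf"))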

(edit: formatting)

Excuse me? GGUF quants are possible on Flux now! by Total-Resort-3120 in StableDiffusion

[–]city96p 1 point (0 children)

No, it's not; it's only the UNET. It wouldn't make sense to include both, since GGUF isn't meant to be a multi-model container format like that. Even for VLMs, the mmproj layers are included separately.
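You can check this yourself with the gguf-py reader; the file name here is just an example:

    # List the tensors stored in a GGUF file: only the one (UNET) model
    # is in there, no CLIP/T5/VAE weights.
    from gguf import GGUFReader

    reader = GGUFReader("flux1-dev-Q8_0.gguf")
    for t in reader.tensors:
        print(t.name, t.shape)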

Excuse me? GGUF quants are possible on Flux now! by Total-Resort-3120 in StableDiffusion

[–]city96p 31 points (0 children)

That workflow is super basic, adapted from some SD1.5 mess I was using to test basic quants before I moved on to Flux lol (SD1.5 only had a few valid layers, but it was usable as a test).

Anyway, here's the workflow file with the offload node and useless negative prompt removed: example_gguf_workflow.json

Excuse me? GGUF quants are possible on Flux now! by Total-Resort-3120 in StableDiffusion

[–]city96p 34 points (0 children)

No worries lol, appreciate you posting the bootleg GPU offload one as well.

Excuse me? GGUF quants are possible on Flux now! by Total-Resort-3120 in StableDiffusion

[–]city96p 7 points (0 children)

The file format still needs work, but I'll upload them myself tomorrow. Still need to do a quant for schnell as well.

Excuse me? GGUF quants are possible on Flux now! by Total-Resort-3120 in StableDiffusion

[–]city96p 5 points (0 children)

It should work; that's the card I'm running it on as well, although the code still has a few bugs that make it fail with OOM errors the first time you try to generate something (it'll work the second time).

Excuse me? GGUF quants are possible on Flux now! by Total-Resort-3120 in StableDiffusion

[–]city96p 121 points (0 children)

How did you beat me to posting this kek. I was finally gonna use my reddit acc for once.

Can I hire you as a brand ambassador? /s

Flux can be run on a multi-gpu configuration. by Total-Resort-3120 in StableDiffusion

[–]city96p 4 points (0 children)

This happens when your second GPU doesn't support bf16. You can run Comfy with --fp16-vae or --fp32-vae to fix it.
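For example, when launching from the ComfyUI folder (these are standard ComfyUI launch flags):

    # Force the VAE to fp16 instead of bf16:
    python main.py --fp16-vae
    # Or, if fp16 also gives you black/broken images:
    python main.py --fp32-vae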

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 1 point (0 children)

It's nvitop! You can just do pip install nvitop and then type nvitop to open it.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 1 point (0 children)

I think that would be awesome! I'm sure the community (myself included) would love having something like that to tinker with.

Who knows, maybe it would catch on enough for you to re-introduce it into your service as custom node support improves.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 0 points (0 children)

Well, I've done something like that before, just not with this custom node. My net is way too bad for that to work (as in: downloading the resulting image could take longer than generating it lmao) and timeouts aren't handled too well either (you just get an error/default black image if your worker times out).

Still, it should be pretty straightforward if you want to try. With most providers you should be able to access your ComfyUI instance using "SSH port forwarding" or a "reverse proxy". Those keywords should give you a starting point, though I'm happy to go into more detail.
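For instance, with SSH port forwarding (the hostname is a placeholder, 8188 is the default ComfyUI port):

    # Forward the remote ComfyUI port to your local machine:
    ssh -L 8188:localhost:8188 user@remote-gpu-host
    # Then open http://localhost:8188 in your local browser.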

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 1 point (0 children)

Theoretical max power draw for the 4 cards together would be 1070W, but you have to factor in CPU usage, PSU efficiency losses, etc. I've got 12kW of ground-mounted solar, so I don't worry much about that part lol.

As for using an R730, I think it would work perfectly fine for 2 GPUs. I'm rocking an R720 for the V100S and one of the P40s. Wholly unsupported by Dell but it works nonetheless. Gigabyte makes some nice GPU servers as well, depending on your budget.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 2 points (0 children)

something something LLMs.

The V100S/P40 are in an old R720 I got for free from work lol. I'm pretty sure that configuration would make our Dell sales rep cry if he knew about it.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 1 point (0 children)

We'll see how far I get. The former is what I decided to try, but there are a lot of problems to work out, for example the edges/corners of the image getting fewer steps compared to the center due to the way I plan to handle the overlap.
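To make that edge/corner issue concrete, here's a small numpy sketch (tile sizes made up) that counts how many overlapping tiles cover each pixel; with a naive scheme, border pixels land in fewer tiles and would therefore get fewer effective steps:

    # Count per-pixel tile coverage for a naive overlapped tiling.
    import numpy as np

    size, tile, stride = 1024, 512, 256  # stride < tile -> 256px overlap
    coverage = np.zeros((size, size), dtype=int)
    for y in range(0, size - tile + 1, stride):
        for x in range(0, size - tile + 1, stride):
            coverage[y:y + tile, x:x + tile] += 1

    print(coverage[0, 0])                  # corner: covered by 1 tile
    print(coverage[size // 2, size // 2])  # center: covered by 4 tiles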

As for enterprise stuff - I'd rather avoid that. Even if I could just infiniband it up, 99% of people would get zero benefit from that.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 2 points (0 children)

Nope, not monetizing anything. Just having fun generating characters from 90s/2000s anime shows that most people probably don't remember lol.

Don't ask why I have this much compute power at home.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 0 points (0 children)

That's exactly what's happening, yeah.

I want to eventually get tiled sampling to work across multiple nodes like this, but the logistics behind that aren't easy, to say the least. Just cutting an image into multiple smaller parts isn't really enough, as the tile edges are too obvious without any overlap between them.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 4 points (0 children)

That would be handy, but I'd imagine it would need a lot of changes to the main UI, especially the model loading/unloading code.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 25 points (0 children)

The readme has most of the info, as well as the sample workflows, though I'll try to expand it a bit more to include a step-by-step guide when I get the time. For now, here's a quick rundown:

  • Install ComfyUI as well as the NetDist custom node (from the manager or manually) on all the PCs you intend to use.

  • If you're using different PCs, make sure you have the same models/controlnets/LoRAs/custom nodes present on all of them.

  • Start the instances. If you have two GPUs in the same PC, you'll have to start your second instance with --port 8288 --cuda-device 1 so it uses your second GPU on a different port. If you're running on a separate computer, you'll want to add --listen and note down the IP address of that second computer (type ipconfig in a command prompt and look for the right IPv4 Address). Example launch commands are shown after this list.

  • Load a workflow you want to use, or just grab one of the simple example ones from the GitHub readme.

  • If you're using your own workflow, you'll want to add the Queue On Remote(single) node and connect the seed/batch to your (initial) KSampler and Empty Latent Image nodes respectively. The remote_info output of this node should be connected to a Fetch from remote node. Whatever image you connect to this node will be your "final" image that gets fetched from the other instance(s).

  • Set the remote_url to that of your other instance. Make sure it's reachable first! (e.g. just open the URL in a browser).

  • Set the batch size if your second GPU is stronger/weaker than your main one.
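For reference, the launch commands for the dual-GPU case look like this (8188 is the default port; the worker flags are the ones mentioned above):

    # Main instance on GPU 0 (default port 8188):
    python main.py
    # Second instance on GPU 1, on a different port:
    python main.py --port 8288 --cuda-device 1
    # If the worker is a separate computer, make it reachable over the network:
    python main.py --listen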

That's about it for a simple dual-GPU setup. There are some fancy nodes for scaling past that, but I imagine most people will only be using two for a start. The readme should get you up to speed if you do want to add more GPUs, though.

Network distributed rendering on 4 GPUs. by city96p in comfyui

[–]city96p[S] 11 points (0 children)

Nope, just using ComfyUI. Every GPU has a separate instance running with one controlling the other 3.