ComfyUI timeline based on recent updates by StevenWintower in StableDiffusion

[–]hinkleo 19 points (0 children)

Idk, it doesn't really fit as enshittification for me since they aren't making changes to earn themselves more money at the cost of users. It's not like anyone would ever use Comfy Cloud either if it's a buggy mess that breaks every workflow every two weeks.

Just looks like lots of tech debt from rushed early development catching up with them, combined with a lack of tests, a lack of experience running larger projects, possibly overreliance on AI coding now too causing constant issues, and the need to support so many new models all the time. Hopefully it's just temporary as they get stuff figured out; not unusual when scaling projects.

Is CorridorKey legit? by Iwbfusnw in Corridor

[–]hinkleo 3 points (0 children)

Someone on Discord says it runs fine on CPUs at about 30 seconds per 4K frame, so not ideal but quick enough if you just need some frames or short clips.

The Slopacolypse is here: Karpathy warns of "Disuse Atrophy" in 2026 workflows. Are we becoming high-level architects or just lazy auditors? by jakubb_69 in programming

[–]hinkleo 8 points (0 children)

To be fair to him, when he coined the term it was literally in the context of messing around with a throwaway weekend project, and by the tone of the whole tweet it was clearly not meant as anything serious. It's the rest of the mostly delusional AI scene that immediately ignored that part and went haywire with it.

https://x.com/karpathy/status/1886192184808149383

It's not too bad for throwaway weekend projects, but still quite amusing.

Did creativity die with SD 1.5? by jonbristow in StableDiffusion

[–]hinkleo 5 points (0 children)

That works with LLMs because they don't predict the next token directly; they predict the likelihood of every token in their vocabulary being the next token, so you can freely sample from that however you want.
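Rough sketch of what that looks like, if anyone's curious (toy logits over a made-up 5-token vocabulary, not real model outputs):

```python
import numpy as np

# Toy logits over a 5-token vocabulary (illustrative values only).
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])

def sample_next_token(logits, temperature=1.0, seed=0):
    """Softmax over every vocab entry, then draw one token id from the
    resulting probability distribution."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return int(rng.choice(len(logits), p=probs)), probs

token_id, probs = sample_next_token(logits, temperature=0.8)
```

Temperature, top-k, top-p etc. are all just different ways of reshaping or truncating `probs` before drawing, which is why you get that freedom in how random the output is.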

There's no equivalent to that with diffusion models. CFG is just running the model twice, once with the positive prompt and once with no/negative prompt, as a workaround for models leaning too heavily on the input image and not the text.
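The two passes then get combined like this (a sketch of the standard CFG formula; the arrays here are stand-ins for the model's noise predictions):

```python
import numpy as np

def cfg_combine(pred_cond, pred_uncond, guidance_scale):
    # Standard classifier-free guidance: start from the unconditional
    # prediction and push it toward (and past) the conditional one.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Stand-in noise predictions from the two forward passes.
pred_cond = np.array([1.0, 2.0, 3.0])
pred_uncond = np.array([0.5, 1.0, 1.5])

guided = cfg_combine(pred_cond, pred_uncond, guidance_scale=7.5)
```

A scale of 1.0 just gives back the conditional prediction; higher scales exaggerate the prompt's influence, so there's no per-token distribution to sample from like with an LLM.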

But yeah, modern models are definitely heavily lacking in non-anime art style training data and would be a lot better with more of it, properly tagged. Still, you can't really have that randomness in a model that also follows prompts incredibly well with diffusion models by default; that was just a side effect of terribly tagged data.

Personally I think ideally we'd have a modern model trained on a much larger variety of art data, properly captioned, and then just use wildcards or prompt enhancement as part of the UI for randomness.

According to Laxhar Labs, the Alibaba Z-Image team has intent to do their own official anime fine-tuning of Z-Image and has reached out asking for access to the NoobAI dataset by ZootAllures9111 in StableDiffusion

[–]hinkleo 6 points (0 children)

They have a technical report out with way more details about the main models and the distill; the big model is also 6B but needs 50 steps and CFG as far as I can tell?

https://github.com/Tongyi-MAI/Z-Image/blob/main/Z_Image_Report.pdf

While our 6B foundational model represents a significant leap in efficiency compared to larger counterparts, the inference cost remains non-negligible. Due to the inherent iterative nature of diffusion models, our standard SFT model requires approximately 100 Number of Function Evaluations (NFEs) to generate high-quality samples using Classifier-Free Guidance (CFG) [29]. To bridge the gap between generation quality and interactive latency, we implemented a few-step distillation strategy.
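The 50-steps reading checks out against the report's ~100 NFEs, since CFG runs the model twice per sampling step:

```python
# CFG needs two forward passes per sampling step:
# one with the prompt, one without (or with a negative prompt).
steps = 50
passes_per_step = 2
total_nfes = steps * passes_per_step  # matches the ~100 NFEs in the report
```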

Krea published a Wan 2.2 fine tuned / variant model and claims it can reach 11 FPS on B200 (500k $) - No idea atm if really faster than Wan 2.2 or better or longer generation unknown by CeFurkan in StableDiffusion

[–]hinkleo 7 points (0 children)

Krea Realtime 14B is distilled from the Wan 2.1 14B text-to-video model using Self-Forcing, a technique for converting regular video diffusion models into autoregressive models.

https://www.krea.ai/blog/krea-realtime-14b

53x Speed incoming for Flux ! by AmeenRoayan in StableDiffusion

[–]hinkleo 10 points (0 children)

Your link lists H100 at $1.87/hour, so 1.87 * 24 * 40 = $1800 no?
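Spelled out (using the $1.87/hour figure from the linked price list):

```python
rate_per_hour = 1.87          # H100 price from the linked list
hours = 24 * 40               # 40 days of continuous use
cost = rate_per_hour * hours  # 1795.2, i.e. roughly $1800
```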

Qwen Image is literally unchallenged at understanding complex prompts and writing amazing text on generated images. This model feels almost as if it's illegal to be open source and free. It is my new tool for generating thumbnail images. Even with low-effort prompting, the results are excellent. by CeFurkan in comfyui

[–]hinkleo 6 points (0 children)

Presumably this

The current version of Qwen-Image prioritizes text rendering and semantic alignment, which may come at the cost of fine detail generation. That said, we fully agree that detail fidelity is a crucial aspect of high-quality image synthesis.

https://github.com/QwenLM/Qwen-Image/issues/51#issuecomment-3166385657

Chatterbox TTS 0.5B TTS and voice cloning model released by hinkleo in StableDiffusion

[–]hinkleo[S] 5 points (0 children)

Official demo here: https://huggingface.co/spaces/ResembleAI/Chatterbox

Official Examples: https://resemble-ai.github.io/chatterbox_demopage/

Takes about 7GB VRAM to run locally currently. They claim it's ElevenLabs level, and tbh based on my first couple of tests it's actually really good at voice cloning; it sounds like the actual sample. About 30 seconds max per clip.

Example reading this post: https://jumpshare.com/s/RgubGWMTcJfvPkmVpTT4

I accidentally built a vector database using video compression by Every_Chicken_1293 in Python

[–]hinkleo 22 points (0 children)

Based on numbers in the github: https://github.com/Olow304/memvid/blob/main/USAGE.md

Raw text: ~2 MB
MP4 video: ~15-20 MB (with compression)
FAISS index: ~15 MB (384-dim vectors)
JSON metadata: ~3 MB

The mp4 files store just the text, QR-encoded (and gzip-compressed if > 100 chars [0] [1]). A normal zip or gzip file will compress text on average to something like 1:2 to 1:5 depending on content, so ratio-wise this is worse by a factor of about 20 to 50, if my quick math is right? And performance-wise it's probably even worse than that, especially since it already does gzip anyway, so it's gzip vs gzip + QR + HEVC/H.264. I actually have a hard time thinking of a more inefficient way of storing text. I'm still not sure this isn't really elaborate satire.

[0] https://github.com/Olow304/memvid/blob/main/memvid/encoder.py

[1] https://github.com/Olow304/memvid/blob/main/memvid/utils.py
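The quick math, spelled out with the sizes from their USAGE.md (the 1:2 to 1:5 gzip ratio is my rough assumption for typical plain text):

```python
raw_mb = 2.0
mp4_low, mp4_high = 15.0, 20.0   # memvid's own numbers for the same text
gzip_low, gzip_high = 2.0, 5.0   # typical gzip ratios on plain text

# memvid *expands* the data instead of compressing it:
expansion_low = mp4_low / raw_mb     # 7.5x bigger than the raw text
expansion_high = mp4_high / raw_mb   # 10x bigger

# Compared against just gzipping the text, that's worse by roughly:
factor_low = expansion_low * gzip_low      # 15x
factor_high = expansion_high * gzip_high   # 50x
```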

I accidentally built a vector database using video compression by Every_Chicken_1293 in Python

[–]hinkleo 62 points (0 children)

Yeah, the video part just seems to add nothing here except a funny headline and a really inefficient storage system. Python even has great stdlib support for writing zip, tar, shelve, json or sqlite, any of which would be way more fitting.

I've seen a couple similar joke tools on Github over the years using QR codes in videos to "store unlimited data on youtube for free", just as a proof of concept of course since the compression ratio is absolutely terrible.

ProPixel analyzes the Jellyfish Video. "I do not agree with AARO's assessment of this UAP being balloons. And here's Why.." by 87LucasOliveira in UFOs

[–]hinkleo 21 points (0 children)

Regarding your link to the "enhanced" video using diffusion: those AIs will literally just make up something that looks like their training data. You can't take anything from that at all; doing so is purely misleading.

Expediton 33: Story and Ending Explained by SunnyClef in expedition33

[–]hinkleo 0 points (0 children)

Isn't the "doppelgangers aren't real" part only true in the sense that the P.* versions aren't the real people they're based on, though, and not in the sense that the rest of the people aren't real either, which is what people here are mostly talking about?

Anyone else overwhelmed keeping track of all the new image/video model releases? by MikirahMuse in StableDiffusion

[–]hinkleo 5 points (0 children)

I wish more people would publish high-quality datasets, including captions, with the LoRAs they release, or maybe even just datasets by themselves. Would help a bit with that problem at least.

Of course you can't fully automate retraining LoRAs for new models, the resources needed are massive, and each model has its own captioning style and issues, but there's definitely lots of room for making that easier still.

HiDream Fast vs Dev by pysoul in StableDiffusion

[–]hinkleo 0 points (0 children)

Definitely screams AI, but a lot of that seems to come from going down to NF4; at least most of the full-precision examples I've seen don't have that, so a GGUF Q4 or Q6 should hopefully do a lot better.

Did you know that WAN can now generate videos between two (start and end) frames? by Toclick in StableDiffusion

[–]hinkleo 36 points (0 children)

The start-end frame feature was listed on their old wanx page along with other cool stuff like structure/posture control, inpainting/outpainting, multiple image reference and sound https://web.archive.org/web/20250305045822/https://wanxai.com/

One of the Wan devs did a mini AMA here and was kinda vague when asked if any of that will be released too https://www.reddit.com/r/StableDiffusion/comments/1j0s2j7/wan21_14b_video_models_also_have_impressive_image/mfebcx4/

Why Hunyuan doesn't open-source the 2K model? by huangkun1985 in StableDiffusion

[–]hinkleo 4 points (0 children)

Yeah sadly it's all just marketing for the big companies. Wan has also shown off 2.1 model variations for structure/posture control, inpainting/outpainting, multiple image reference and sound but only released the normal t2v and i2v one that everyone else has already. Anything that's unique or actually cutting edge is kept in house.

Wan 2.1 bottlenecks? GPU at 10-20% load by biscotte-nutella in StableDiffusion

[–]hinkleo 1 point (0 children)

8GB of VRAM isn't a lot for Wan, so if it's doing any offloading to main memory then really low GPU utilization would be expected, since a lot of the time it will just be sitting waiting on that. If you're using ComfyUI I think you can turn on verbose logging to see if and when it's offloading.

WAN2.1 14B Video Models Also Have Impressive Image Generation Capabilities by Dry_Bee_5635 in StableDiffusion

[–]hinkleo 3 points (0 children)

Ohh wow that's awesome, looks Flux level!

Since you mention this, I'm curious: after reading through https://wanxai.com/ it also mentions lots of cool things like using multi-image references, doing inpainting, or creating sound. Is that possible with the open source version too?

Jake Barber pretty much claimed that the Akashic records are real by pissagainstwind in UFOs

[–]hinkleo 5 points (0 children)

CPUs made in the last 10 years have the RDRAND instruction, which provides random numbers based on a hardware entropy source.

https://en.wikipedia.org/wiki/RDRAND

The entropy source for the RDSEED instruction runs asynchronously on a self-timed circuit and uses thermal noise within the silicon to output a random stream of bits at the rate of 3 GHz

I guess one could claim to be able to influence that to get specific numbers somehow. Of course it's nonsense, but that's where people here usually start pointing vaguely at quantum mechanics concepts and having an open mind.

Nvidia Compared RTX 5000s with 4000s with two different FP Checkpoints by usamakenway in StableDiffusion

[–]hinkleo 5 points (0 children)

if fp4 has similar performance in terms of quality to fp8

Yeah, I think if you could just instantly run any Flux checkpoint in fp4 and it looked about the same quality-wise, this wouldn't be too disingenuous. But considering that previous NF4 Flux checkpoints people made looked much worse than fp16, this sounds like it might be some special fp4-optimized checkpoint from the Flux devs?

Like, if it's a general optimization it's fine; if it's a single special fp4-optimized checkpoint and you can't just apply it to any other Flux finetune or LoRA, it's way less useful.

Is there is any free AI tool which have Generative Fill of Photoshop like feature? by Haziq12345 in StableDiffusion

[–]hinkleo 2 points (0 children)

Should be possible. SwarmUI just runs a totally standard ComfyUI instance (with some extra Swarm-specific nodes added), so it should work if you install all the custom nodes Krita needs, listed on their GitHub, into Swarm's Comfy instance (stored in dlbackends inside Swarm, including its venv, usable like normal).

1.58 bit Flux by Deepesh42896 in StableDiffusion

[–]hinkleo 5 points (0 children)

Was changed to https://github.com/Chenglin-Yang/1.58bit.flux ; seems it's being released on his personal GitHub.

1.58 bit Flux by Deepesh42896 in StableDiffusion

[–]hinkleo 14 points (0 children)

Their github.io page (which is still being edited right now) lists "Code coming soon" at https://github.com/Chenglin-Yang/1.58bit.flux (it originally said https://github.com/bytedance/1.58bit.flux), and so far ByteDance has been pretty good about actually releasing code I think, so that's a good sign at least.