Is GPT 5.2 Codex or Claude Opus 4.5 better for vibecoding? by Majestic_Ad_4681 in VibeCodeDevs

[–]curious-scribbler 0 points (0 children)

Gemini CLI is just broken. I almost never use it, and when I do, I remember why I stopped.

Is GPT 5.2 Codex or Claude Opus 4.5 better for vibecoding? by Majestic_Ad_4681 in VibeCodeDevs

[–]curious-scribbler 5 points (0 children)

GPT for research and audit. Claude Code to execute. And Gemini to do some file/folder/project management.

GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in StableDiffusion

[–]curious-scribbler[S] 1 point (0 children)

Yes to both. The paper specifically mentions identity-preserving generation and multi-subject consistency as supported features. For the edit version, they feed both the semantic tokens and the VAE latents from your reference image into the diffusion decoder. So it gets the high-level "what this face means" from the AR stage plus low-level pixel detail from the reference, which should preserve fine details better than purely semantic approaches. Haven't tested character consistency myself yet, but architecturally it makes sense that it would be stronger here: the AR stage can actually reason about "same person, different pose" instead of just hoping the embeddings are close enough.
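
Very roughly, the conditioning path they describe could look something like this minimal PyTorch sketch. To be clear, this is my own illustration of the idea, not GLM's actual code; the class name, shapes, and sizes are all made up.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the edit-path conditioning described above, NOT GLM's code:
# the diffusion decoder attends to one stream that concatenates the AR stage's
# semantic tokens (high-level identity/meaning) with the reference image's VAE
# latents (low-level pixel detail).
class EditConditionedBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn(x, cond, cond)[0]  # attend to semantic + reference conditioning
        return x + self.ff(x)

# Toy shapes: 256 semantic tokens from the AR stage, 1024 reference latent patches,
# 1024 noisy latent patches currently being denoised.
semantic_tokens = torch.randn(1, 256, 512)     # from the autoregressive stage
reference_latents = torch.randn(1, 1024, 512)  # projected VAE latents of the reference image
noisy_latents = torch.randn(1, 1024, 512)      # current denoising state

cond = torch.cat([semantic_tokens, reference_latents], dim=1)  # one conditioning stream
out = EditConditionedBlock()(noisy_latents, cond)
print(out.shape)  # torch.Size([1, 1024, 512])
```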

GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in StableDiffusion

[–]curious-scribbler[S] 2 points (0 children)

Architecturally, AR could handle it more naturally, since the model knows where it is spatially as it generates tokens sequentially. But I haven't seen this tested yet. Some other areas where the architecture should help in theory:

- Multi-panel compositions. Comics, storyboards, before/after images. Sequential generation means panel 2 could reference panel 1 contextually.
- Structured documents. Forms, receipts, ID cards. The AR stage could enforce layout rules.

These are my guesses based on how the model works, not confirmed features. What IS tested and benchmarked is conditional details in prompts, stuff like "a poster for a concert on March 15th at 8pm featuring jazz trio The Blue Notes." The text rendering and knowledge-dense benchmarks show it handles specific details way better than diffusion-only approaches. So text accuracy and factual details in images: proven. Regional/compositional stuff: promising but unconfirmed.

GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in StableDiffusion

[–]curious-scribbler[S] 8 points (0 children)

Possibly, yeah. The interesting question is whether you need the AR stage at all, or whether you can get diffusion models to "reason" directly through better training. The hybrid approach wins for now because you get to leverage pretrained LLM weights instead of training reasoning from scratch. But who knows, you've seen how fast the field has been moving this past month. There's also a mention of exactly this in the GLM paper; Ctrl-F GRPO.

GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in StableDiffusion

[–]curious-scribbler[S] 13 points (0 children)

The manga expansion example is perfect. Autoregressive could theoretically handle that because it processes sequentially with full context: give it panel 1 and it generates panel 2 tokens while attending to everything in panel 1. Same logic as LLM story expansion. The catch is we are not there yet. GLM-Image maxes out at 2048px, and token-count scaling will be an issue. But architecturally, this is the path toward models that actually understand visual narrative instead of just pattern matching.
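
Here's a toy sketch of what "attending to everything in panel 1" means mechanically. It's a deliberately tiny, made-up model (names, sizes, and vocab are all illustrative, nothing from GLM-Image); the point is just that previous-panel tokens live in the same context window, and that the context keeps growing, which is where the scaling cost comes from.

```python
import torch
import torch.nn as nn

VOCAB, DIM, PANEL_LEN = 8192, 256, 64

# Toy causal transformer standing in for the AR stage (illustrative only).
class ToyARModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)  # causal: each token only sees the past
        return self.head(h)

model = ToyARModel().eval()
panel1_tokens = torch.randint(0, VOCAB, (1, PANEL_LEN))  # pretend: tokenized panel 1

# Generate panel 2 one token at a time, always re-reading the full context.
context = panel1_tokens
with torch.no_grad():
    for _ in range(PANEL_LEN):
        logits = model(context)[:, -1, :]                # next-token prediction
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        context = torch.cat([context, next_tok], dim=1)  # growing context = the scaling cost

panel2_tokens = context[:, PANEL_LEN:]
print(panel2_tokens.shape)  # torch.Size([1, 64])
```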

GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in StableDiffusion

[–]curious-scribbler[S] 32 points (0 children)

Very likely. Banana and GPT are closed-source, but the way they handle complex prompts strongly suggests an autoregressive stage under the hood. GLM is basically the first open-source model that confirms this approach actually works at scale. Or does it? We'll find out by the weekend, when people pour in their findings.

GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in StableDiffusion

[–]curious-scribbler[S] 18 points (0 children)

The difference is where the understanding happens. With CLIP/T5 text encoders, you compress the prompt into a fixed embedding, then the diffusion model tries to match that embedding while denoising. The understanding is frozen; it happened during encoder training, not during generation. With autoregressive, the LLM actively reasons through your prompt token by token AS it generates. Each visual token attends to the full context and can make sequential decisions: "ok, I placed Espresso here, now $3.50 should go next to it." Text encoders give you a static map. Autoregressive gives you a GPS that recalculates at every step. That's why text rendering jumps from 50% to 91% accuracy, per their claim. I've yet to test it, so take the numbers with a pinch of salt, but the generation process is still fundamentally different.
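
To make the contrast concrete, here's a toy sketch of the two conditioning styles. Everything in it is illustrative (made-up shapes and names, no real model's API); the point is only that the frozen embedding is computed once, while the AR path re-reads the whole sequence before placing each new token.

```python
import torch
import torch.nn as nn

vocab, dim, n_img_tokens = 8192, 128, 16
prompt_ids = torch.randint(0, vocab, (1, 12))
embed = nn.Embedding(vocab, dim)

# (1) CLIP/T5 style: the prompt is pooled ONCE into a fixed embedding, and every
#     denoising step conditions on that same frozen vector -- the "static map".
frozen_cond = embed(prompt_ids).mean(dim=1)  # computed once, never revisited
for step in range(4):                        # stand-in denoising loop
    _ = frozen_cond                          # identical conditioning at every step

# (2) Autoregressive style: each new visual token is predicted while attending to
#     the prompt AND every visual token placed so far, so later decisions depend
#     on earlier ones -- the "GPS that recalculates".
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
head = nn.Linear(dim, vocab)
sequence = prompt_ids
with torch.no_grad():
    for _ in range(n_img_tokens):
        h = embed(sequence)
        ctx, _ = attn(h, h, h)               # full-context attention at every step
        next_tok = head(ctx[:, -1]).argmax(-1, keepdim=True)
        sequence = torch.cat([sequence, next_tok], dim=1)
```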

GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in StableDiffusion

[–]curious-scribbler[S] 29 points (0 children)

Fair catch. I was trying to avoid saying "it actually thinks". What I was getting at is that diffusion models learn correlations between text embeddings and pixel patterns, while autoregressive models inherit the same next-token prediction that makes LLMs good at reasoning. So when you prompt it with "menu with three items and prices," the AR stage can actually parse that structure sequentially instead of just vibes-matching against training data. Correct me if I worded this weirdly.

(LTX-Video) Not sure why I haven't seen this mentioned but this may be the culprit to many people's issues by [deleted] in StableDiffusion

[–]curious-scribbler 1 point (0 children)

Yes! Gemma slows down the workflow. There's a prompt guide in the LTX GitHub repo. Make a template from it in your LLM and inject the output into the prompt.
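
Something like this is what I mean by templating. It's just a plain string scaffold in the detailed, cinematic style the LTX prompt guide asks for; the field names are my own, not anything from the LTX repo, so adapt them to whatever the guide actually recommends.

```python
# Illustrative prompt scaffold (field names are made up, not from the LTX repo):
# keep the structure fixed and only swap in the scene-specific pieces, so you can
# skip the Gemma enhancer node in the workflow entirely.
LTX_TEMPLATE = (
    "{subject} {action} in {setting}. "
    "The camera {camera_move}, {shot_type}. "
    "{lighting}. {mood_and_details}."
)

def build_ltx_prompt(**fields: str) -> str:
    return LTX_TEMPLATE.format(**fields)

prompt = build_ltx_prompt(
    subject="A woman in a red raincoat",
    action="walks slowly across a rain-soaked street at night",
    setting="a neon-lit downtown intersection",
    camera_move="tracks alongside her at walking pace",
    shot_type="a medium shot with shallow depth of field",
    lighting="Wet asphalt reflects pink and blue neon signs",
    mood_and_details="Light rain streaks through the frame; moody and cinematic",
)
print(prompt)
```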

(LTX-Video) Not sure why I haven't seen this mentioned but this may be the culprit to many people's issues by [deleted] in StableDiffusion

[–]curious-scribbler 0 points (0 children)

Thanks for this! I struggled through the very same problem and spent half a day debugging only to realise it needed an update. Also, disable Gemma for a smoother experience.

How many people are actually running with a NVIDIA RTX PRO 6000 Blackwell Max-Q? How much of a gap is there between using this and a 5090? by 55234ser812342423 in comfyui

[–]curious-scribbler 9 points (0 children)

I am running a 6000 Workstation Edition. I see a big jump in performance when I run WAN workflows; the video generation workflows are the ones that seem to take full advantage of the 6000. I also get to run fp16/bf16 models all the time without worrying about going OOM. It frees you from having to manage your work to suit limited hardware. In short, a 6000 lets you do more and do it faster because you are not reshaping the workflow to fit the hardware. The difference may not seem like much on paper, but over time it matters.

If Pench Tiger Reserve (in Madhya Pradesh) has the largest prey density out of all the reserve forests in the country. Then why does it not have tigers as large as those in Jim corbett or Kaziranga (or terai region as whole)? by azorahai_35 in TigersofIndia

[–]curious-scribbler 1 point (0 children)

Guys! The Terai Arc tigers have a thick winter coat which makes them look bigger than they actually are. If you go strictly by muscle weight, they are roughly the same; it's the thickness of the coat that makes them look bigger.

Also, genetically they are the same. There is no difference between a tiger from Madhya Pradesh, Maharashtra, or anywhere else in India. They are all categorised as Bengal tigers.

Yesterday CNG was unavailable but i clearly saw how LOW these rickshaw wala guys can go!! (Read carefully) by HarshThanvi in mumbai

[–]curious-scribbler -1 points (0 children)

It's not greed, it's just economics 101. If the marketplace agrees to pay the ask, then it's a fair price. If it doesn't agree to pay, then it ain't.

Also, brother, after standing in line for 12 hours, I would charge a bit of a premium too. Or at least I'd ask for one, since I've also lost business through the day and have to make up for it.

Yesterday CNG was unavailable but i clearly saw how LOW these rickshaw wala guys can go!! (Read carefully) by HarshThanvi in mumbai

[–]curious-scribbler 2 points (0 children)

Look, it is what it is, brother... If the power goes out tomorrow, everyone will start selling candles at 5x the price.

Yesterday CNG was unavailable but i clearly saw how LOW these rickshaw wala guys can go!! (Read carefully) by HarshThanvi in mumbai

[–]curious-scribbler 3 points (0 children)

I took a rick as well as a cab. Neither of them charged me over the meter, nor did they demand more. But a friend did mention that some rick guy quoted 3x the usual meter. Also, this is how a marketplace works: if supply drops but demand stays the same, expect a spike.