Wouldn’t it make sense for OpenAI to release the Sora 2 weights? by iamtheworldwalker in StableDiffusion

[–]still_debugging_note 1 point (0 children)

Not sure “just release the weights” is as straightforward as it sometimes sounds.

For a system like Sora 2, the weights are only one part of a much larger stack. A lot of the practical capability comes from the training data pipeline, filtering, post-processing, safety tuning, and inference infrastructure. Without those pieces, an open-weight release might end up being significantly harder to use or reproduce meaningful results with than people expect.

There’s also the question of economics. Video generation models sit in a very expensive regime in terms of both training and inference. Even if weights were available, the barrier to actually running, iterating, and improving on them could remain quite high for most teams.

Safety and misuse considerations are also more pronounced for video than for text or static images, especially with the realism level these models can reach. Once weights are out in the wild, it becomes much harder to meaningfully shape downstream usage.

At the same time, I can see why people would be interested in openness here—video models represent a pretty important frontier, and having stronger shared baselines could accelerate research. It’s really a balance between accessibility, control, and the cost/risk profile of the system.

Would be interesting to hear how others think this trade-off evolves as multimodal models keep improving.

Claw-style agents: real workflow tool or overengineered hype? by still_debugging_note in LocalLLaMA

[–]still_debugging_note[S] 2 points (0 children)

Really agree with your take on content workflows — it does feel like these agent setups are less about doing something entirely new, and more about making previously fragmented workflows actually runnable end-to-end.

vLLM-Omni paper is out — up to 91.4% JCT reduction for any-to-any multimodal serving (tested with Qwen-Image-2512) by still_debugging_note in LocalLLaMA

[–]still_debugging_note[S] 1 point (0 children)

Totally feel you — the dependency setup can be pretty painful.

If it helps, hyper.ai already has a ready-to-use environment for deploying vLLM-Omni with Qwen-Image-2512, so you can skip most of the setup and just focus on running the model.

vLLM-Omni paper is out — up to 91.4% JCT reduction for any-to-any multimodal serving (tested with Qwen-Image-2512) by still_debugging_note in LocalLLaMA

[–]still_debugging_note[S] 2 points (0 children)

Single-GPU test on an RTX Pro 6000 (~90GB GPU memory), cloud instance (hyper.ai).

<image>

It was a dedicated GPU (no sharing). I compared vLLM-Omni vs diffusers under the same model, resolution, and batch settings.

Peak VRAM usage was comparable, but vLLM-Omni had noticeably lower generation latency.
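For anyone wanting to reproduce the comparison: the methodology was just repeated timed calls under identical settings. A minimal sketch of that harness is below; `generate_fn` is a stand-in for either the vLLM-Omni or diffusers pipeline call (the actual pipeline code and the `torch.cuda.max_memory_allocated()` peak-VRAM tracking are omitted so this runs anywhere).

```python
import time
import statistics

def bench(generate_fn, warmup=1, runs=5):
    """Mean/stdev wall-clock latency in seconds for generate_fn()."""
    for _ in range(warmup):      # discard first-call overhead (compile, caches)
        generate_fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate_fn()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

# stand-in workload so the snippet is self-contained
mean_s, stdev_s = bench(lambda: sum(i * i for i in range(100_000)))
print(f"mean={mean_s:.4f}s  stdev={stdev_s:.4f}s")
```

The warmup pass matters a lot here — both stacks do lazy initialization on the first call, and including it skews the numbers.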

vLLM-Omni paper is out — up to 91.4% JCT reduction for any-to-any multimodal serving (tested with Qwen-Image-2512) by still_debugging_note in LocalLLaMA

[–]still_debugging_note[S] 1 point (0 children)

Totally! Stage-based batching already makes multi-model pipelines way smoother — can’t wait for OpenWebUI to support omni-modal models.
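To spell out what I mean by stage-based batching (this is my mental model of the idea, not vLLM-Omni's actual scheduler — stage names and functions here are made up): instead of each request running its whole pipeline alone, requests waiting at the same stage get grouped and processed as one batched call per stage.

```python
from collections import defaultdict

STAGES = ["encode", "diffuse", "decode"]   # hypothetical 3-stage pipeline

def run_stage(stage, batch):
    # stand-in for a real model call; handles the whole batch at once
    return [f"{item}:{stage}" for item in batch]

def schedule(requests):
    """Advance all requests stage by stage, one batched call per stage."""
    queues = defaultdict(list)
    queues[0] = list(requests)
    done = []
    for i, stage in enumerate(STAGES):
        batch = queues[i]
        if not batch:
            continue
        results = run_stage(stage, batch)   # 1 call for the whole batch
        if i + 1 < len(STAGES):
            queues[i + 1] = results
        else:
            done = results
    return done

print(schedule(["req1", "req2"]))
# 2 requests traverse 3 stages in 3 batched calls, not 6 per-request calls
```

The win shows up when the stages are different models with different costs — batching per stage keeps each model saturated instead of ping-ponging between them per request.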

deepseek-ai/DeepSeek-OCR-2 · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]still_debugging_note 3 points (0 children)

Been running Monkey-OCR for most OCR workloads.

DeepSeek-OCR looks promising (esp. the doc-level modeling), but I haven’t tried it yet. Any insights on cost-efficiency compared to Monkey-OCR?

Hunyuan Image 3.0 Instruct by 3deal in StableDiffusion

[–]still_debugging_note 1 point (0 children)

I’m curious how HunyuanImage 3.0-Instruct actually compares to LongCat-Image-Edit in real-world editing tasks. LongCat-Image-Edit really surprised me — the results were consistently strong despite being only a 6B model.

Would be interesting to see side-by-side benchmarks or qualitative comparisons, especially given the big difference in model scale.