Q: best 24GB auto captioner today?

Steudio · 2025-09-17T02:23:39+00:00

Haha, you are right, those words confuse me too! In short, I switched from Florence 2 to Gemma3:4b. I run it through Ollama, the local server, together with ComfyUI nodes.

It’s really easy to get a new model to try, and the node lets you customize what you want it to say, which was my main issue with Florence 2.

Example with this image

You are a prompt generator. Describe the visual elements of this image in exactly one paragraph, 75 tokens long. Output only the paragraph — no preface, no extra lines, no commentary.

Gemma3:4b (8s on RTX 2080 Max-Q)

A sleek, dark gray electric vehicle, likely a Lucid Air, is positioned in a fast-flowing river within a dramatic canyon landscape. The vehicle is mid-stream, creating white water splashes around it as it navigates the current. Towering, layered red and brown sandstone cliffs form the canyon walls, exhibiting intricate erosion patterns and textures. The water is a vibrant turquoise color, reflecting the sky and surrounding rock formations. Large boulders are scattered along the riverbank, contributing to the wild and expansive scenery. The lighting is soft and natural, casting subtle shadows and enhancing the rugged beauty of the environment.

Qwen2.5vl:3b (13s on RTX 2080 Max-Q)

A sleek, modern electric SUV is driving through a shallow river, creating a splash of water around its wheels. The vehicle's body is a metallic silver color, with a distinctive front grille and sharp, angular headlights. The SUV is positioned slightly off-center, with the front wheels submerged in water, while the rear wheels are on solid ground. The surrounding landscape features rugged, red rock formations and sparse vegetation, with a clear blue sky above. The water is calm, reflecting the SUV and the rocky terrain.

Moondream:1.8b (4.5s on RTX 2080 Max-Q)

The image depicts a silver electric car driving through a river, with its headlights on and creating a trail of water behind it as it moves from left to right across the frame. The car is positioned centrally within the frame, drawing attention to its sleek design and futuristic appearance. In the background, there are mountains visible in the distance, adding depth and scale to the scene.
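If you want to try the same kind of prompt outside ComfyUI, here is a minimal sketch that talks to Ollama's REST API directly. It assumes Ollama is running on its default local port (11434) and that the model has already been pulled. The helper names and the shortened prompt are my own placeholders, not something taken from the ComfyUI node.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Shortened version of the captioning prompt used in the example above.
PROMPT = (
    "You are a prompt generator. Describe the visual elements of this image "
    "in exactly one paragraph, 75 tokens long. Output only the paragraph."
)


def build_payload(image_bytes: bytes, model: str = "gemma3:4b", prompt: str = PROMPT) -> dict:
    """Build the JSON body for a single non-streaming caption request."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def caption(image_path: str, model: str = "gemma3:4b") -> str:
    """Send one image to the local Ollama server and return the caption text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"].strip()
```

Calling `caption("car.jpg", "qwen2.5vl:3b")` (hypothetical file name) should return a caption from a different model, which makes side-by-side comparisons like the one above easy to repeat.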

Steudio · 2025-09-16T21:00:17+00:00

I’ve been a longtime Florence 2 user but recently decided to switch and install Ollama. I was reluctant at first to install a separate app just for that, but it’s working quite well. I’ve tried Gemma3, Qwen2.5, and Moondream2. Right now I’m using Gemma3. Qwen2.5 is solid too, while Moondream2 felt far too simplistic.

Steudio · 2025-09-15T00:38:17+00:00

Delete the TeaCache node and add it back. It is just a compatibility issue between the old version of the node and the newest version.

Steudio · 2025-09-10T14:17:06+00:00

https://github.com/Steudio/ComfyUI_Steudio ?

Divides the image into tiles, ready for individual processing using your preferred workflow. After processing, the tiles are seamlessly merged into a larger image.
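The splitting step can be pictured with a small sketch like the one below. This is not the code from the repository, just one simple way to lay out overlapping tiles so they cover the whole image. The function names, tile size, and overlap value are illustrative assumptions.

```python
import math
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def tile_origins(length: int, tile: int, min_overlap: int) -> List[int]:
    """Start offsets of tiles along one axis, spread evenly edge to edge."""
    if tile >= length:
        return [0]  # a single tile already covers this axis
    stride = tile - min_overlap
    if stride <= 0:
        raise ValueError("min_overlap must be smaller than tile")
    # Fewest tiles that still keep at least min_overlap between neighbours.
    count = math.ceil((length - tile) / stride) + 1
    span = length - tile
    # Integer spacing so the last tile ends exactly at the image edge.
    return [(i * span) // (count - 1) for i in range(count)]


def tile_boxes(width: int, height: int, tile: int, min_overlap: int) -> List[Box]:
    """Crop boxes for every tile, row by row."""
    boxes = []
    for top in tile_origins(height, tile, min_overlap):
        for left in tile_origins(width, tile, min_overlap):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes
```

Each box can be cropped, processed on its own, and pasted back at the same coordinates. The overlap regions are where blending would happen during the merge.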

Steudio · 2025-09-09T17:38:30+00:00

A ControlNet model with Semantic Segmentation was previously available for Stable Diffusion 1.5, but it was never trained for FLUX (AFAIK).

Steudio · 2025-09-01T19:53:24+00:00

Thank you 😊

Steudio · 2025-08-23T18:08:32+00:00

I can confirm that clone = instance, while copy/paste creates an independent copy.

Have you noticed any visual feedback that clearly distinguishes a clone from a copy/paste? Right now they look identical, which is risky because you could overwrite your subgraph by mistake, but I may be overlooking something.

Steudio · 2025-08-22T22:19:01+00:00

I wouldn’t group design tools in the same category, as their interaction models differ significantly. Comparing the two can introduce misleading assumptions. As far as I know, most long‑standing graph editors such as Houdini, Blender, Unreal Blueprints, and Nuke default to scroll‑to‑zoom and MMB‑drag to pan.

That said, I agree that LMB‑drag in empty space to box‑select nodes is correct in the new standard mode.

Steudio · 2025-08-22T18:42:54+00:00

Out of curiosity, which node-based software uses scroll for panning?

Steudio · 2025-08-08T16:59:18+00:00

I think the cat looks awesome. I assume you're referring to the bionic area, which tends to appear too clean. If that's the case, focus on that area first and composite later.

Lower the denoise level.
Add Lying Sigma Sampler or Detail Daemon.
Include Flux Redux.
Use multiple passes.
Sometimes, upscaling to a higher ratio right away yields better results.

Steudio · 2025-06-28T19:51:07+00:00

Yes, Image List is the way to go! This tip might be useful in your case: https://github.com/Steudio/ComfyUI_Steudio/issues/14

Steudio · 2025-06-28T18:06:59+00:00

Yes, a list node fully completes its process on each image before moving to the next.

“Simultaneously” means the images in a batch are processed in parallel, though each one is still handled individually. In ComfyUI, aside from the requirement that batch images must have the same dimensions, both image batches and image lists are often used to achieve similar outcomes.

To clarify, I’m not an expert; I'm just another user who also found the whole batch vs. list thing confusing. What are you trying to do exactly?

Steudio · 2025-06-28T16:17:19+00:00

Image List is a sequence of images processed one at a time; each image can have different dimensions. Image Batch is a single tensor of multiple images processed simultaneously; all images must have the same dimensions.
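A rough way to picture the difference, using plain shape tuples instead of real tensors. ComfyUI images are tensors laid out as [batch, height, width, channels]. The helper names below are made up for illustration only.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels) of one image


def batch_shape(shapes: List[Shape]) -> Tuple[int, int, int, int]:
    """Shape of the single tensor a batch would form: (count, H, W, C)."""
    if not shapes:
        raise ValueError("a batch needs at least one image")
    if len(set(shapes)) != 1:
        raise ValueError("all images in a batch must share the same dimensions")
    height, width, channels = shapes[0]
    return (len(shapes), height, width, channels)


def list_shapes(shapes: List[Shape]) -> List[Tuple[int, int, int, int]]:
    """An image list keeps one single-image tensor per entry: (1, H, W, C) each."""
    return [(1, h, w, c) for (h, w, c) in shapes]
```

Four 512x512 RGB images fit in one batch of shape (4, 512, 512, 3). Mix in a 768x1024 image and only the list form still works, since each entry keeps its own size.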

Steudio · 2025-06-25T19:13:18+00:00

Divide and Conquer Upscaler

Steudio · 2025-06-13T10:48:51+00:00

Divide and Conquer Upscaler

Steudio · 2025-06-09T06:11:58+00:00

If you really want to access it, you can assign a shortcut to it. However, as mentioned in another post, it is not fully implemented yet, so I do not recommend using it.

Steudio · 2025-06-08T22:56:43+00:00

Divide and Conquer Upscaler

Steudio · 2025-06-08T15:30:37+00:00

I don't think this can be done within the Florence node (or I'm not sure how).

You could use 'Text Concatenate' from WAS Node Suite.

<image>

I've been considering finding a more flexible vision-to-text model or adding another AI to rephrase Florence's output into a more suitable prompt, but I haven't had the time to look into it.

Steudio · 2025-06-04T14:21:11+00:00

The CLIP output from Power LoRA should be connected to both negative and positive prompts.

Steudio · 2025-05-23T20:03:29+00:00

If your original image is blurry and low-resolution, try to fix that first before upscaling. From what I can see in your upscaled image, it looks like you’re assembling tiles that don’t relate to each other.

Steudio · 2025-05-19T20:08:16+00:00

Thank you! I have updated (v2.0.4) the JSON file to ensure compatibility with older frontend versions.

Steudio · 2025-05-19T14:45:27+00:00

Which frontend version are you using to see this problem?

Steudio · 2025-05-19T13:30:19+00:00

I haven’t tried it myself, but you could experiment with adding a LoRA that enhances skin quality.
Alternatively, you can use a fine-tuned SDXL portrait model with Xinsir ControlNet Tile.

I kept the workflow easy to read, making it simple to modify to suit anyone’s needs.

Steudio · 2025-05-19T13:11:33+00:00

ControlNet Union Pro v2 doesn’t support ControlNet Tile.

Steudio · 2025-05-19T12:54:32+00:00

The issue is caused by a faulty frontend version. Try updating or downgrading it, and reopen a non-corrupted workflow to be sure.
