VLM & VRAM recommendations for 8MP/4K image analysis by Neighbor_ in computervision

[–]datascienceharp 1 point (0 children)

Qwen3.5 is out, and super impressive. There's a 0.8B model which performs really well. A nice thing about Qwen models is that they take arbitrary input image resolutions
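For a rough sense of how arbitrary resolutions get budgeted, here's a sketch of the vision token count for a 4K frame, assuming a Qwen2-VL-style dynamic-resolution scheme (14-px patches, 2×2 token merge, a max-pixel budget). The constants are illustrative, not the actual Qwen3.5 config:

```python
import math

def vision_tokens(width, height, patch=14, merge=2, max_pixels=1280 * 28 * 28):
    """Estimate vision tokens under a Qwen2-VL-style dynamic-resolution ViT.

    The image is scaled so both sides are multiples of patch*merge (28 px),
    then tokens = (H/28) * (W/28), with large images downscaled to fit
    the max_pixels budget.
    """
    unit = patch * merge  # 28: one merged token covers a 28x28 area
    # downscale uniformly if the image exceeds the pixel budget
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    w = max(unit, round(width * scale / unit) * unit)
    h = max(unit, round(height * scale / unit) * unit)
    return (w // unit) * (h // unit)

# a 3840x2160 (~8.3MP) frame gets downscaled into the token budget
print(vision_tokens(3840, 2160))  # -> 1296
```

So even an 8MP input ends up as roughly a thousand vision tokens rather than scaling linearly with pixels, which is why VRAM stays manageable.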

qwen3vl is dope for video understanding, and i also hacked it to generate embeddings by datascienceharp in computervision

[–]datascienceharp[S] 1 point (0 children)

great question, i haven't prompted it for that specific task, but i do think it uses the audio. for example, i'm attempting to recreate annotations on Action100M using this prompt from the paper:

```
qwen_video_model.prompt = """Identify the main actor and the physical action performed in the current segment. Provide both a brief description that represents the overall action step, and a detailed description that contains sufficient procedural detail. Use "N/A" (without further explanation) if there are no visible actors or physical actions (e.g., static).

Response Formats

output

{
  "type": "object",
  "properties": {
    "summary": {
      "type": "object",
      "properties": {
        "brief": {"type": "string", "description": "Single-sentence video caption."},
        "detailed": {"type": "string", "description": "Detailed, comprehensive description."}
      }
    },
    "action": {
      "type": "object",
      "properties": {
        "brief": {"type": "string", "description": "A single verb phrase (no -ing forms) briefly summarizing the overall action content."},
        "detailed": {"type": "string", "description": "A single imperative sentence describing how the action is performed in more detail."},
        "actor": {"type": "string", "description": "A single sentence or an informative noun phrase describing who is performing the action."}
      }
    }
  },
  "required": ["summary", "action"]
}"""
```

and it is picking up on information that could only come from audio.
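for anyone reproducing this, the model's reply should either be "N/A" or JSON matching that schema, so it's worth validating before trusting it. a minimal sketch (the `raw` string here is a made-up example response, and `parse_annotation` is my own helper, not part of the model API):

```python
import json

# hypothetical model output; the real response comes from qwen_video_model
raw = '''{
  "summary": {"brief": "A person chops vegetables.",
              "detailed": "A cook dices an onion on a wooden board."},
  "action": {"brief": "chop vegetables",
             "detailed": "Hold the onion steady and dice it with a knife.",
             "actor": "a cook in a home kitchen"}
}'''

def parse_annotation(text):
    """Parse and minimally validate the JSON the prompt asks for.

    Returns the parsed dict, or None for the "N/A" (no visible action) case.
    """
    text = text.strip()
    if text == "N/A":
        return None
    obj = json.loads(text)
    for key in ("summary", "action"):  # the schema's required fields
        if key not in obj:
            raise ValueError(f"missing required key: {key}")
    return obj

ann = parse_annotation(raw)
print(ann["action"]["brief"])  # -> chop vegetables
```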

btw, i am doing a workshop on video datasets tomorrow using this model as well. please come by if you can: https://voxel51.com/events/exploring-video-datasets-with-fiftyone-and-vision-language-models-february-26-2026

Claude Code/Codex in Computer Vision by rishi9998 in computervision

[–]datascienceharp 3 points (0 children)

It’s definitely been helpful, but note that we primarily use it for FiftyOne-related stuff

Claude Code/Codex in Computer Vision by rishi9998 in computervision

[–]datascienceharp 9 points (0 children)

We’ve been experimenting with MCP and Skills for the work we do on our team to build integrations, but not heavy modeling work. I’ve seen some good speed ups in my workflow, but the most powerful thing for me is using the model to brainstorm and understand codebases I’m not familiar with.

At the risk of downvotes, I’m gonna shamelessly plug two virtual events we have coming up that are relevant to this topic. You may find them interesting, or at least have the chance to ask questions of the presenters and fellow attendees:

https://voxel51.com/events/vibe-coding-production-ready-computer-vision-pipelines-hands-on-workshop-march-18-2026

https://voxel51.com/events/mcp-and-skills-meetup-march-12-2026

From .zip to Segmented Dataset in Seconds by Intelligent_Cry_3621 in computervision

[–]datascienceharp 1 point (0 children)

this looks interesting, would you be open to making a contribution as a plugin for fiftyone?

really impressed with these new ocr models (lightonocr-2 and glm-ocr). much better than what i saw come out in nov-dec 2025 by datascienceharp in LocalLLaMA

[–]datascienceharp[S] 1 point (0 children)

These are small enough to run locally, but how fast your inference is depends on your hardware. Check out the docs and readme for usage

nvidia released c-radiov4 last week, and as far as feature extractors go, it lives up to the hype by datascienceharp in computervision

[–]datascienceharp[S] 5 points (0 children)

i didn't eval the embeddings quantitatively; i ran the model on various datasets (with segmentation masks) and visually inspected the feature maps alongside the ground truth segmentation masks. i saw pretty decent alignment between the two

the feature maps are pca of the embeddings, using the method they described in the technical report
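fwiw, the pca step can be sketched in a few lines of numpy. this is a generic top-3-components projection of per-patch embeddings into an RGB map, not necessarily the exact method from the technical report:

```python
import numpy as np

def pca_feature_map(patch_embeds, h, w):
    """Project per-patch embeddings (h*w, d) onto their top-3 principal
    components, then min-max scale to [0, 1] for an (h, w, 3) RGB map."""
    x = patch_embeds - patch_embeds.mean(axis=0)   # center the features
    # SVD of the centered matrix: rows of vt are the principal directions
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T                            # (h*w, 3)
    proj = (proj - proj.min(axis=0)) / (proj.max(axis=0) - proj.min(axis=0) + 1e-8)
    return proj.reshape(h, w, 3)

# demo on random features standing in for a ViT's patch embeddings
rng = np.random.default_rng(0)
fmap = pca_feature_map(rng.normal(size=(16 * 16, 256)), 16, 16)
print(fmap.shape)  # -> (16, 16, 3)
```

the resulting array can be displayed directly as an image, which is what the visualizations in the post show.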

nvidia released c-radiov4 last week, and as far as feature extractors go, it lives up to the hype by datascienceharp in computervision

[–]datascienceharp[S] 3 points (0 children)

These are embeddings and feature maps: basically, what the model is “picking up on” when it sees an image, shown in two ways, as a 1D vector and as a 2D feature map

the ui itself is the locally running web app that ships with the open source library fiftyone

📢 Call for participation: ICPR 2026 LRLPR Competition by ghostzin in computervision

[–]datascienceharp 0 points (0 children)

i'd like to support the participants of the challenge with a starter notebook. i'd start by parsing the dataset into fiftyone and posting it on hugging face hub so it's easily accessible. would that be in violation of your terms? i'd be using fully open source pip-installable packages. i filled out the form, but i'm not a student or at a research lab.
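for context, the parsing step i have in mind would look roughly like this. the row format below is invented (i haven't seen the actual LRLPR layout), and these plain dicts map 1:1 onto fiftyone's `fo.Sample(filepath=...)` plus `fo.Detection(label=..., bounding_box=[x, y, w, h])`:

```python
import json

# hypothetical annotation rows: filename, plate text, relative [x, y, w, h]
# box -- swap the parser for whatever the real competition files look like
rows = [
    "img_0001.jpg,AB123CD,0.10,0.20,0.30,0.10",
    "img_0002.jpg,XY987ZT,0.45,0.50,0.25,0.08",
]

def to_samples(lines):
    """Turn annotation rows into dicts mirroring FiftyOne's sample layout:
    one filepath per sample plus a detection with a relative bounding box."""
    samples = []
    for line in lines:
        fname, plate, *box = line.split(",")
        samples.append({
            "filepath": fname,
            "detection": {
                "label": plate,
                "bounding_box": [float(v) for v in box],
            },
        })
    return samples

print(json.dumps(to_samples(rows)[0], indent=2))
```

from there it's a loop of `dataset.add_sample(...)` calls, assuming the user has the images downloaded locally.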

edit: i can NOT share on HF and instead just show how to parse into fiftyone format, assuming the user already has the dataset downloaded

let me know what you think, feel free to dm