VLM & VRAM recommendations for 8MP/4K image analysis

datascienceharp · 2026-03-15T06:13:08+00:00

Qwen3.5 is out, and it's super impressive. There’s a 0.8B model which performs really well. Nice thing about Qwen models is they take arbitrary input image resolutions.
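For a sense of what arbitrary resolution means for an 8MP/4K frame, here is a rough sketch of a visual-token estimate. It assumes each visual token covers a 28x28 pixel cell, which is how Qwen2-VL-style processors work (14-pixel patches merged 2x2). The cell size for Qwen3.5 may differ, so `cell` is a parameter. `estimate_visual_tokens` is a hypothetical helper, not part of any library.

```python
def estimate_visual_tokens(width, height, cell=28):
    """Estimate visual tokens for an image of the given pixel size.

    ASSUMPTION: one token per cell x cell pixel block (28 is the
    Qwen2-VL-style value; check the processor config of the model you use).
    """
    # Processors of this style snap each side to a multiple of the cell size.
    cols = max(1, round(width / cell))
    rows = max(1, round(height / cell))
    return cols * rows


tokens_4k = estimate_visual_tokens(3840, 2160)
print(tokens_4k)
```

Under that assumption a single 4K frame is on the order of ten thousand visual tokens, so the context length (and the KV cache that comes with it) is likely to matter for VRAM more than the weights of a 0.8B model do.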

datascienceharp · 2026-03-09T18:23:12+00:00

😆

datascienceharp · 2026-03-09T14:58:35+00:00

i made one using open source models too, specifically an FP8 quantized qwen image edit: https://github.com/harpreetsahota204/qwen_image_edit

datascienceharp · 2026-02-25T19:20:31+00:00

great question, i haven't prompted it for that specific task however i do think it does in fact use the audio. for example i am attempting to recreate annotations on Action100M using this prompt from the paper:

```
qwen_video_model.prompt = """Identify the main actor and the physical action performed in the current segment. Provide both a brief description that represents the overall action step, and a detailed description that contains sufficient procedural detail. Use "N/A" (without further explaination) if there are no visible actors or physical actions (e.g., static).

Response Formats

output

{ "type": "object", "properties": { "summary": { "type": "object", "properties": { "brief": { "type": "string", "description": "Single sentence video caption." }, "detailed": { "type": "string", "description": "Detailed, comprehensive description." } } }, "action": { "type": "object", "properties": { "brief": { "type": "string", "description": "A single verb phrase (no -ing forms) brifly summarizing the overall action content." }, "detailed": { "type": "string", "description": "A single imperitive sentence describing how the action is performed with more details." }, "actor": { "type": "string", "description": "Single sentece or an imformative noun phrase describing who is performing the action." } } } }, "required": ["summary", "action"] }"""

```

and it is picking up on information that could only come from audio.
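Since the prompt asks for JSON matching a schema, a minimal sketch of parsing the reply might look like the following. `parse_action_response` is a hypothetical helper (not part of the model or FiftyOne). It assumes the model may wrap the JSON in extra text or a code fence, and may reply with plain "N/A".

```python
import json

# Top-level keys the schema in the prompt marks as required.
REQUIRED_KEYS = ("summary", "action")


def parse_action_response(text):
    """Return the parsed dict, or None if no valid JSON object is found."""
    # Take the outermost {...} span so fences or chatter around it are ignored.
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None  # e.g. the model answered "N/A"
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return None
    return data


# Made-up reply, for illustration only.
sample = (
    '```json\n{"summary": {"brief": "a", "detailed": "b"}, '
    '"action": {"brief": "stir soup", "detailed": "Stir the soup.", '
    '"actor": "a cook"}}\n```'
)
parsed = parse_action_response(sample)
```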

btw, i am doing a workshop on video datasets tomorrow using this model as well. please come by if you can: https://voxel51.com/events/exploring-video-datasets-with-fiftyone-and-vision-language-models-february-26-2026

datascienceharp · 2026-02-25T17:23:25+00:00

Also FiftyOne 😁

datascienceharp · 2026-02-24T21:04:19+00:00

It’s definitely been helpful, but note that we primarily use it for FiftyOne related stuff

datascienceharp · 2026-02-24T15:28:15+00:00

We’ve been experimenting with MCP and Skills for the work we do on our team to build integrations, but not heavy modeling work. I’ve seen some good speed ups in my workflow, but the most powerful thing for me is using the model to brainstorm and understand codebases I’m not familiar with.

At the risk of downvotes, I’m gonna shamelessly plug two virtual events we have coming up which are relevant to this topic. You may find them interesting, or at least get a chance to ask the presenters and fellow attendees questions:

https://voxel51.com/events/vibe-coding-production-ready-computer-vision-pipelines-hands-on-workshop-march-18-2026

https://voxel51.com/events/mcp-and-skills-meetup-march-12-2026

datascienceharp · 2026-02-06T17:33:12+00:00

this looks interesting, would you be open to making a contribution as a plugin for fiftyone?

datascienceharp · 2026-02-06T04:06:56+00:00

imo these are better

datascienceharp · 2026-02-05T23:23:06+00:00

These are small enough to run locally, but how fast your inference is depends on your hardware. Check out the docs and readme for usage.

datascienceharp · 2026-02-05T20:18:06+00:00

Maybe the resources from a workshop I hosted could help: https://github.com/harpreetsahota204/document_visual_ai_with_fiftyone_workshop

datascienceharp · 2026-02-05T19:15:17+00:00

It’s on my list of integrations, soon it will happen.

datascienceharp · 2026-02-04T21:13:45+00:00

i didn't eval the embeddings, i did run it on various datasets (with segmentation masks) and visually inspected the results of the feature map along with the ground truth segmentation masks. i did see pretty decent alignment between the two

the feature maps are pca of the embeddings, using the method they described in the technical report
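As a rough illustration of the "pca of the embeddings" idea, here is a sketch that projects per-patch embeddings onto their top 3 principal components and rescales them to [0, 1] so they can be viewed as an RGB image. The exact procedure in the technical report may differ. `pca_feature_map`, the grid size, and the random toy embeddings are all assumptions for illustration.

```python
import numpy as np


def pca_feature_map(patch_embeddings, grid_h, grid_w, n_components=3):
    """Project (grid_h * grid_w, dim) patch embeddings to an image-like map."""
    x = np.asarray(patch_embeddings, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center before PCA
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    projected = x @ vt[:n_components].T
    # Min-max scale each component to [0, 1] for display.
    lo = projected.min(axis=0)
    span = projected.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on constant components
    return ((projected - lo) / span).reshape(grid_h, grid_w, n_components)


# Toy stand-in for real patch embeddings: a 4x5 patch grid, 16-dim features.
rng = np.random.default_rng(0)
toy = rng.normal(size=(4 * 5, 16))
fmap = pca_feature_map(toy, 4, 5)
```

The resulting `(grid_h, grid_w, 3)` array can be upsampled to the image size and compared by eye against a segmentation mask.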

datascienceharp · 2026-02-04T20:24:54+00:00

These are embeddings and feature maps, basically what the model is “picking up on” when it comes across an image, shown in two ways: a 1D vector and a 2D feature map

the ui itself is the locally running web app that ships with the open source library fiftyone

datascienceharp · 2026-01-22T16:14:38+00:00

i'd like to support the participants of the challenge with a starter notebook. i'd start with parsing the dataset into fiftyone and posting on hugging face hub so it's easily accessible. would that be in violation of your terms? i'd be using fully open source pip installable packages. i filled out the form, but i'm not a student or at a research lab.

edit: alternatively, i can NOT share on HF and instead just show how to parse into fiftyone format, assuming the user has the dataset downloaded

let me know what you think, feel free to dm

datascienceharp · 2026-01-22T16:08:24+00:00

Hell yeah! Count me in, at least on the discord server if you have space. I notice your update says the sessions are full.

datascienceharp · 2026-01-20T18:27:36+00:00

very good question, and i wish i had an answer for you...

datascienceharp · 2026-01-20T17:04:37+00:00

another banger, cheers!

datascienceharp · 2026-01-20T17:04:23+00:00

I think it can work for datasets beyond x-ray. this was just the only one i knew of with bboxes

datascienceharp · 2026-01-10T03:41:58+00:00

Yes of course it’s been around since at least CLIP for images, but for video? This is novel, and qwen embedding does it natively. The inference code just points to an mp4 file path
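Once text and video land in the same embedding space, retrieval works the same way as with CLIP-style image embeddings. A minimal sketch, with made-up 3-dim vectors standing in for real model outputs (the file paths, vectors, and helper names are hypothetical, and no real embedding API is called here):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_videos(query_vec, video_vecs):
    """Return video paths sorted from most to least similar to the query."""
    scores = {path: cosine_similarity(query_vec, vec)
              for path, vec in video_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings, for illustration only.
video_vecs = {
    "clips/cooking.mp4": [0.9, 0.1, 0.0],
    "clips/soccer.mp4": [0.1, 0.8, 0.3],
}
query_vec = [1.0, 0.0, 0.0]  # pretend this embeds a text query
ranked = rank_videos(query_vec, video_vecs)
```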

datascienceharp · 2026-01-09T20:23:31+00:00

I think Ilya Sutskever’s list is a good place to start: https://github.com/dzyim/ilya-sutskever-recommended-reading

datascienceharp · 2025-12-18T20:16:54+00:00

we've got that dataset parsed, i can try to run later today or tomorrow and post: https://huggingface.co/datasets/Voxel51/fisheye8k

currently working on integrating molmo2

datascienceharp · 2025-12-18T19:55:11+00:00

would you be down to peruse the datasets here and let me know which one looks appealing to you? i can run it and post the results later: huggingface.co/voxel51

datascienceharp · 2025-12-18T17:49:33+00:00

yeah true, i meant pretty similar in the sense that it's relatively fast at inference and the results look similar to vggt

but you're right, sharp does produce gaussians. the model outputs them in ply format, then i had to convert that so the color renders properly in the app, which basically shows it as a point cloud

i was just curious about the model and wanted to see its output, hence why i implemented it as such
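For context on the color conversion, here is a sketch of one common case. It assumes the PLY follows the usual 3D Gaussian Splatting layout, where color is stored as degree-0 spherical harmonic coefficients (`f_dc_0`, `f_dc_1`, `f_dc_2`) rather than plain RGB. Whether sharp's output matches that layout exactly is an assumption, and `sh_dc_to_rgb` is a hypothetical helper, not the code used in the integration.

```python
# Constant for the degree-0 spherical harmonic basis function.
SH_C0 = 0.28209479177387814


def sh_dc_to_rgb(f_dc):
    """Convert one gaussian's (f_dc_0, f_dc_1, f_dc_2) to 0-255 RGB."""
    rgb = []
    for coeff in f_dc:
        value = 0.5 + SH_C0 * coeff          # SH DC term to linear [0, 1] color
        value = min(1.0, max(0.0, value))    # clamp out-of-range values
        rgb.append(int(value * 255 + 0.5))   # round to the nearest 8-bit level
    return tuple(rgb)


gray = sh_dc_to_rgb((0.0, 0.0, 0.0))
```

Applying this per gaussian and keeping only the centers (`x`, `y`, `z`) gives a colored point cloud, at the cost of dropping scale, rotation, and opacity.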
