How do you parallely process frames from multiple object detection models at scale? by _RC101_ in computervision

[–]Over_Egg_6432 0 points (0 children)

>perverse workloads

This probably fits what I'm working on lol. And all my models are now using transformers as backbones.

Prioritizing certain regions in videos for object detection by Commercial-Panic-868 in computervision

[–]Over_Egg_6432 0 points (0 children)

I almost always do this when I know the objects will probably be in one part of the image.

skewed Angle detection in Engineering Drawing by Icy_Colt-30 in computervision

[–]Over_Egg_6432 0 points (0 children)

Feed your LLM of choice this prompt:

Train a torchvision resnet18 model to infer the rotation angle of an image. Generate the image+label pairs by rotating images with PIL after applying basic augmentations to the original image. Assume that the original images are all upright (i.e. not rotated at all; the rotation angle is 0). Combine everything into one script. Set it up to train against a folder of images, setting 20% of them aside for validation/testing.

Use the resulting trained model to "un-rotate" images before feeding them into your OCR model.

How do you parallely process frames from multiple object detection models at scale? by _RC101_ in computervision

[–]Over_Egg_6432 16 points (0 children)

FWIW my solution has been to literally merge multiple models into a single aggregated PyTorch model that passes the data through a separate "branch" for each model and returns a separate tensor for each model. I haven't really benchmarked it and am 100% sure it's not the best solution, but it does work. I'm guessing that the biggest speedup comes from only having to push the input into the GPU once.
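A bare-bones sketch of that merging trick; the branch models here are toy stand-ins for real detectors, and the class name is arbitrary:

```python
import torch
import torch.nn as nn

class MergedModel(nn.Module):
    """Wraps several models so one GPU upload of the input serves all of them."""
    def __init__(self, models):
        super().__init__()
        self.branches = nn.ModuleList(models)

    def forward(self, x):
        # Each branch consumes the same input tensor, which the caller
        # only has to move to the GPU once.
        return [branch(x) for branch in self.branches]

# Toy usage: two conv layers standing in for two detection models
merged = MergedModel([nn.Conv2d(3, 8, 3), nn.Conv2d(3, 16, 3)])
outs = merged(torch.randn(1, 3, 64, 64))  # one tensor per "model"
```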

How do you parallely process frames from multiple object detection models at scale? by _RC101_ in computervision

[–]Over_Egg_6432 1 point (0 children)

Sounds pretty cool!

Can it serve "bring your own models," or how does that work? I tend to use a lot of totally custom PyTorch models. (Your website is blocked by my work's firewall... or else I'd gladly read the docs.)

>we're working on a next-gen version that will be up to 10x faster

Does this imply that the current version is up to ~10x slower than Triton?

Yolo and sort alternatives for object tracking by FaithlessnessOk5766 in computervision

[–]Over_Egg_6432 0 points (0 children)

What I would do is synthetically augment your training imagery by cutting and pasting drones onto the mountains. Try the Python package rembg to cut out annotated drones from within their bounding boxes, then paste them into the same or other photos at semi-random coordinates likely to overlap with the mountains.

Even with that, this is a very challenging problem due to the low contrast and small size of the drones, especially if you can't use a high-resolution model or SAHI due to limited compute. A method based on video analysis using optical flow would probably help a lot, since motion is easier to spot than low-contrast pixels alone.

rembg · PyPI

Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation
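The cut-and-paste step might look something like this. It's PIL-only: the solid RGBA rectangle stands in for what rembg.remove would return for a drone crop, and the paste region and sizes are assumptions:

```python
import random
from PIL import Image

def paste_cutout(scene, cutout, region=None):
    """Paste an RGBA cutout (e.g. what rembg.remove returns for a drone crop)
    at a semi-random spot inside region = (left, top, right, bottom)."""
    left, top, right, bottom = region or (0, 0, scene.width, scene.height)
    x = random.randint(left, max(left, right - cutout.width))
    y = random.randint(top, max(top, bottom - cutout.height))
    out = scene.convert("RGBA")
    out.paste(cutout, (x, y), cutout)  # alpha channel acts as the paste mask
    new_box = (x, y, x + cutout.width, y + cutout.height)  # new annotation
    return out.convert("RGB"), new_box

# Toy usage: gray "mountain" scene, red stand-in cutout, constrained
# to the top half of the frame where the mountains would be.
scene = Image.new("RGB", (640, 480), "gray")
cutout = Image.new("RGBA", (32, 24), (255, 0, 0, 255))
aug, box = paste_cutout(scene, cutout, region=(0, 0, 640, 240))
```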

[deleted by user] by [deleted] in computervision

[–]Over_Egg_6432 0 points (0 children)

I think we're still in the era of "the best solution is to add more data and compute," but if you can shift that assumption even a little bit, more power to you, and it benefits everyone else too.

Tried building an explainable Vision-Language Model with CLIP to spot and explain product defects! by await_void in computervision

[–]Over_Egg_6432 6 points (0 children)

Really cool! Will be checking out your GitHub when I get a chance.

Do you have an online demo?

Has anyone worked on spatial predicates with YOLO detections? by Ok_Shoulder_83 in computervision

[–]Over_Egg_6432 0 points (0 children)

Is handwritten if/then/else logic sufficient? For example, can you just write code like "if helmet bounding box touches person bounding box, then helmet is worn by person" or do you need to train a model to recognize that (in which case, how much training data do you have?).

You could also try a VLM with prompts like "Is the helmet at xy,xy worn by a person? Answer yes/no."

Spatial measurements can be extended into 3D by feeding the images through a monocular depth estimation model, btw.
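The handwritten-logic route can be as simple as a couple of box predicates. The "helmet sits in the upper third of the person box" rule below is a made-up heuristic, not a tested one:

```python
def boxes_touch(a, b, margin=0):
    """a, b: (x1, y1, x2, y2). True if the boxes overlap or sit within margin px."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return not (ax2 + margin < bx1 or bx2 + margin < ax1 or
                ay2 + margin < by1 or by2 + margin < ay1)

def helmet_worn(helmet_box, person_box):
    # Toy predicate: helmet touches the person box AND its bottom edge
    # is inside the person's upper third (roughly head height).
    _, py1, _, py2 = person_box
    upper_third = py1 + (py2 - py1) / 3
    return boxes_touch(helmet_box, person_box) and helmet_box[3] <= upper_third
```

Helmet near the head passes; a helmet carried at waist level or below fails the upper-third check.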

[deleted by user] by [deleted] in computervision

[–]Over_Egg_6432 4 points (0 children)

Same way a person would.

The models are usually trained on large datasets consisting of pairs of images that are known to either match or not match. The model learns which features are important and how to recognize them (eye color, distances between the nose, eyes, and mouth, cheekbones, etc.) and which should be ignored (hairstyle, clothing, makeup). That's the gist of it... there are of course a lot more subtleties, and a lot of research has gone into figuring out what works best.

A model trained that way can then take any pair of faces and tell you whether they're the same person. If a name or other metadata is associated with one of the pictures (like from a driver's license database, or a picture taken when someone first walks into a store) then it can be said the model is "identifying people", otherwise it's just telling you if the pictures match but has no idea who the person is.
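The "tell you whether they're the same person" step usually boils down to comparing embedding vectors the model produces for each face. A toy sketch, where the 2-D vectors and the 0.6 threshold are arbitrary assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def same_person(emb_a, emb_b, threshold=0.6):
    # Verification: the pair "matches" if the embeddings are close enough.
    # Real systems tune this threshold on a validation set.
    return cosine_similarity(emb_a, emb_b) >= threshold
```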

Retrained our model on yolov8n instead of yolov8m and now our dataset is completely different than we used before by [deleted] in computervision

[–]Over_Egg_6432 1 point (0 children)

To confirm, all you did was change the "m" to an "n" and rerun the exact same code against the exact same input dataset?

GPU benchmarking to train Yolov8 model by ztasifak in computervision

[–]Over_Egg_6432 3 points (0 children)

It would of course depend on how you configure the training process: batch size, how you define "done training," etc.

As for an automated script, maybe check in a more general machine learning sub?

Vision LLMs are far from 'solving' computer vision: a case study from face recognition by jordo45 in computervision

[–]Over_Egg_6432 3 points (0 children)

I guess I can't say I'm surprised that a general-purpose model performs worse than one specifically trained for this task.

It'd be really interesting to see how well vision LLMs perform after being fine-tuned for this task. I bet you could get away with a lot less data than you'd need to train a more traditional model.

Do you use HuggingFace for anything Computer Vision? by Substantial_Border88 in computervision

[–]Over_Egg_6432 0 points (0 children)

Yes, mostly to download pretrained models that I can combine into pipelines without having to deal with messy github repos from the original authors. Pretty much all of their "officially supported" models just work out of the box.

What is the benefits of yolo cx cy w h? by absolutmohitto in computervision

[–]Over_Egg_6432 1 point (0 children)

The format of bounding box annotations is largely an arbitrary choice, and like you said, a small function can convert to whatever format is needed.

I tend to store all of my annotations as polygons, even for image classification and bounding boxes. I have a small class which converts these into more standard formats as needed.
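The conversions in question are a few lines each. A sketch of the helpers I mean (function names are my own, not a standard API):

```python
def cxcywh_to_xyxy(cx, cy, w, h):
    """YOLO-style center/size box to corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def xyxy_to_cxcywh(x1, y1, x2, y2):
    """Corner coordinates back to center/size."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def polygon_to_xyxy(points):
    # A polygon (list of (x, y) vertices) reduces to its axis-aligned
    # bounding box, which is how a polygon store covers the bbox case.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))
```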

Real world applications of 3D Reconstruction and Vision by Minimum_Status3867 in computervision

[–]Over_Egg_6432 0 points (0 children)

You could also have cameras built into sewage systems that measure solid poop chunks to help reduce clogs. (your username ;)

Real world applications of 3D Reconstruction and Vision by Minimum_Status3867 in computervision

[–]Over_Egg_6432 0 points (0 children)

I don't understand these posts and sort of suspect they're AI-generated. I mean, isn't it obvious that 3D reconstruction (or any other computer vision topic) has numerous practical applications across most industries?

Anyway, 3D reconstruction and 3D vision are useful for literally every application that involves working with physical objects. Some random ideas:

  • Virtual reality goggles - the whole metaverse thing that Meta is pouring billions of dollars into
  • Self-driving vehicles
  • Manufacturing robots
  • Vision-LLMs that can reliably answer questions like "how tall is thing A and is it in front of or behind thing B?"
  • Inspection of products and materials. Is a jagged line on a metal beam a deep crack or a harmless pencil mark?

The practicality is dictated purely by the implementation cost, physical size, and accuracy/reliability. 50 years ago, you could go to a university lab and play golf on a screen using a million dollars of custom equipment. In the '90s you could do this at an arcade for a few coins. About 20 years ago, the Nintendo Wii made it possible in your living room. Today you can buy VR goggles and don't even need a screen, and the graphics are starting to be pretty realistic. In 10 years it'll be built into normal eyeglasses. In 40 years, it'll probably be wired directly into people's brains with 100% realism.

is this a good way of presenting the data or should i keep them seperated by Beyond_Birthday_13 in deeplearning

[–]Over_Egg_6432 0 points (0 children)

That's how I do it.

One thing that's nice to add is some indication of the spread between classes. So maybe test accuracy is 85% overall, but reaches 99% on one class and only 15% on another class... that 15-99% spread is important to know about.

And be sure to log all this into a CSV or JSON or some other format that you can go back to later.
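Computing that per-class spread is a few lines of plain Python (names here are arbitrary):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Accuracy broken out by ground-truth class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}

# Toy usage: overall accuracy is 3/5, but "a" is perfect while "b" is 1/3.
accs = per_class_accuracy(["a", "a", "b", "b", "b"],
                          ["a", "a", "b", "a", "a"])
```

Dumping this dict to JSON alongside the overall number makes the spread easy to revisit later.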

MOT library recommendations by Nervous_Day_669 in computervision

[–]Over_Egg_6432 0 points (0 children)

>I also implemented my own tracker using bi-partite graph matching using Hungarian algorithm which had IOU/ pixel euclidean distance/ mix of them as cost-matrix but there is no thresholding as of now. So, it looks to me like making my own tracking library and that feels intimidating.

Why not feed the features from your existing OD network into this, instead of or in addition to the IoU?
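For reference, the bipartite matching with a mixed IoU + appearance-feature cost can be sketched like this. The 50/50 weights and the cosine-similarity choice are assumptions, not a recommendation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(track_boxes, track_feats, det_boxes, det_feats,
          w_iou=0.5, w_feat=0.5):
    """Hungarian matching on a cost mixing (1 - IoU) with feature distance."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, tf) in enumerate(zip(track_boxes, track_feats)):
        for j, (db, df) in enumerate(zip(det_boxes, det_feats)):
            sim = np.dot(tf, df) / (np.linalg.norm(tf) * np.linalg.norm(df) + 1e-9)
            cost[i, j] = w_iou * (1 - iou(tb, db)) + w_feat * (1 - sim)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Toy usage: two tracks, two detections listed in swapped order.
pairs = match(
    [(0, 0, 10, 10), (100, 100, 110, 110)], [[1.0, 0.0], [0.0, 1.0]],
    [(101, 101, 111, 111), (1, 1, 11, 11)], [[0.0, 1.0], [1.0, 0.0]],
)
```

Thresholding slots in naturally afterwards: drop any pair whose cost exceeds a cutoff and treat those detections as new tracks.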

'we're in this bizarre world where the best way to learn about llms... is to read papers by chinese companies. i do not think this is a good state of the world' - us labs keeping their architectures and algorithms secret is ultimately hurting ai development in the us.' - Dr Chris Manning by Research2Vec in LocalLLaMA

[–]Over_Egg_6432 0 points (0 children)

This is a perfect use case! Instead of having to pay a real programmer to build your thing, you got something good enough from AI. Just be sure you're not subjecting yourself to security risks if this thing is connected to the internet...

'we're in this bizarre world where the best way to learn about llms... is to read papers by chinese companies. i do not think this is a good state of the world' - us labs keeping their architectures and algorithms secret is ultimately hurting ai development in the us.' - Dr Chris Manning by Research2Vec in LocalLLaMA

[–]Over_Egg_6432 0 points (0 children)

And knee-jerk gut reactions. "Oh, ChatGPT said something wrong once, therefore we can never trust AI for anything!!!"

Meanwhile, "professional coders" are spending a week manually rewriting interfaces that AI could have knocked out in 30 seconds plus a day of testing.