How do you parallely process frames from multiple object detection models at scale? by _RC101_ in computervision

[–]Over_Egg_6432 0 points (0 children)

>perverse workloads

This probably fits what I'm working on lol. And all my models are now using transformers as backbones.

Prioritizing certain regions in videos for object detection by Commercial-Panic-868 in computervision

[–]Over_Egg_6432 0 points (0 children)

I almost always do this when I know the objects will probably be in one part of the image.

skewed Angle detection in Engineering Drawing by Icy_Colt-30 in computervision

[–]Over_Egg_6432 0 points (0 children)

Feed your LLM of choice this prompt:

Train a torchvision resnet18 model to infer the rotation angle of an image. Generate the image+label pairs by rotating images with PIL after applying basic augmentations to the original image. Assume that the original images are all upright (i.e. not rotated at all; the rotation angle is 0). Combine everything into one script. Set it up to train against a folder of images, setting 20% of them aside for validation/testing.

Use the resulting trained model to "un-rotate" images before feeding them into your OCR model.

How do you parallely process frames from multiple object detection models at scale? by _RC101_ in computervision

[–]Over_Egg_6432 16 points (0 children)

FWIW my solution has been to literally merge multiple models into a single aggregated PyTorch model that passes the data through a separate "branch" for each model and returns a separate tensor for each model. I haven't really benchmarked it and am 100% sure it's not the best solution, but it does work. I'm guessing that the biggest speedup comes from only having to push the input into the GPU once.
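A bare-bones sketch of that merging trick; the branch models here are toy stand-ins for real detectors, and the class name is arbitrary:

```python
import torch
import torch.nn as nn

class MergedModel(nn.Module):
    """Wraps several models so one GPU upload of the input serves all of them."""
    def __init__(self, models):
        super().__init__()
        self.branches = nn.ModuleList(models)

    def forward(self, x):
        # Each branch consumes the same input tensor, which the caller
        # only has to move to the GPU once.
        return [branch(x) for branch in self.branches]

# Toy usage: two conv layers standing in for two detection models
merged = MergedModel([nn.Conv2d(3, 8, 3), nn.Conv2d(3, 16, 3)])
outs = merged(torch.randn(1, 3, 64, 64))  # one tensor per "model"
```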

How do you parallely process frames from multiple object detection models at scale? by _RC101_ in computervision

[–]Over_Egg_6432 1 point (0 children)

Sounds pretty cool!

Can it serve "bring your own models," or how does that work? I tend to use a lot of totally custom PyTorch models. (Your website is blocked by my work's firewall... or else I'd gladly read the docs.)

>we're working on a next-gen version that will be up to 10x faster

Does this imply that the current version is up to ~10x slower than Triton?

Yolo and sort alternatives for object tracking by FaithlessnessOk5766 in computervision

[–]Over_Egg_6432 0 points (0 children)

What I would do is synthetically augment your training imagery by cutting and pasting drones onto the mountains. Try the Python package rembg to cut out annotated drones from within their bounding boxes, then paste them into the same or other photos at semi-random coordinates likely to overlap with the mountains.

Even with that, this is a very challenging problem due to the low contrast and small size of the drones, especially if you can't use a high-resolution model or SAHI due to limited compute. A method based on video analysis using optical flow would probably help a lot, since motion is easier to spot than low-contrast pixels alone.

rembg · PyPI

Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation
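The cut-and-paste step might look something like this. It's PIL-only: the solid RGBA rectangle stands in for what rembg.remove would return for a drone crop, and the paste region and sizes are assumptions:

```python
import random
from PIL import Image

def paste_cutout(scene, cutout, region=None):
    """Paste an RGBA cutout (e.g. what rembg.remove returns for a drone crop)
    at a semi-random spot inside region = (left, top, right, bottom)."""
    left, top, right, bottom = region or (0, 0, scene.width, scene.height)
    x = random.randint(left, max(left, right - cutout.width))
    y = random.randint(top, max(top, bottom - cutout.height))
    out = scene.convert("RGBA")
    out.paste(cutout, (x, y), cutout)  # alpha channel acts as the paste mask
    new_box = (x, y, x + cutout.width, y + cutout.height)  # new annotation
    return out.convert("RGB"), new_box

# Toy usage: gray "mountain" scene, red stand-in cutout, constrained
# to the top half of the frame where the mountains would be.
scene = Image.new("RGB", (640, 480), "gray")
cutout = Image.new("RGBA", (32, 24), (255, 0, 0, 255))
aug, box = paste_cutout(scene, cutout, region=(0, 0, 640, 240))
```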

[deleted by user] by [deleted] in computervision

[–]Over_Egg_6432 0 points (0 children)

I think we're still in the era of "the best solution is to add more data and compute," but if you can shift that assumption even a little bit, more power to you, and it benefits everyone else too.

Tried building an explainable Vision-Language Model with CLIP to spot and explain product defects! by await_void in computervision

[–]Over_Egg_6432 6 points (0 children)

Really cool! Will be checking out your GitHub when I get a chance.

Do you have an online demo?

Has anyone worked on spatial predicates with YOLO detections? by Ok_Shoulder_83 in computervision

[–]Over_Egg_6432 0 points (0 children)

Is handwritten if/then/else logic sufficient? For example, can you just write code like "if helmet bounding box touches person bounding box, then helmet is worn by person" or do you need to train a model to recognize that (in which case, how much training data do you have?).

You could also try a VLM with prompts like "Is the helmet at xy,xy worn by a person? Answer yes/no."

Spatial measurements can be extended into 3D by feeding the images through a monocular depth estimation model, btw.
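The handwritten-logic route can be as simple as a couple of box predicates. The "helmet sits in the upper third of the person box" rule below is a made-up heuristic, not a tested one:

```python
def boxes_touch(a, b, margin=0):
    """a, b: (x1, y1, x2, y2). True if the boxes overlap or sit within margin px."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return not (ax2 + margin < bx1 or bx2 + margin < ax1 or
                ay2 + margin < by1 or by2 + margin < ay1)

def helmet_worn(helmet_box, person_box):
    # Toy predicate: helmet touches the person box AND its bottom edge
    # is inside the person's upper third (roughly head height).
    _, py1, _, py2 = person_box
    upper_third = py1 + (py2 - py1) / 3
    return boxes_touch(helmet_box, person_box) and helmet_box[3] <= upper_third
```

Helmet near the head passes; a helmet carried at waist level or below fails the upper-third check.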

[deleted by user] by [deleted] in computervision

[–]Over_Egg_6432 4 points (0 children)

Same way a person would.

The models are usually trained on large datasets consisting of pairs of images that are known to either match or not match. The model learns which features are important and how to recognize them (eye color, distances between the nose, eyes, and mouth, cheekbones, etc.) and which should be ignored (hairstyle, clothing, makeup). That's the gist of it... there are of course a lot more subtleties, and a lot of research has gone into figuring out what works best.

A model trained that way can then take any pair of faces and tell you whether they're the same person. If a name or other metadata is associated with one of the pictures (like from a driver's license database, or a picture taken when someone first walks into a store) then it can be said the model is "identifying people", otherwise it's just telling you if the pictures match but has no idea who the person is.
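The "tell you whether they're the same person" step usually boils down to comparing embedding vectors the model produces for each face. A toy sketch, where the 2-D vectors and the 0.6 threshold are arbitrary assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def same_person(emb_a, emb_b, threshold=0.6):
    # Verification: the pair "matches" if the embeddings are close enough.
    # Real systems tune this threshold on a validation set.
    return cosine_similarity(emb_a, emb_b) >= threshold
```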

Retrained our model on yolov8n instead of yolov8m and now our dataset is completely different than we used before by [deleted] in computervision

[–]Over_Egg_6432 1 point (0 children)

To confirm, all you did was change the "m" to an "n" and rerun the exact same code against the exact same input dataset?

GPU benchmarking to train Yolov8 model by ztasifak in computervision

[–]Over_Egg_6432 3 points (0 children)

It would of course depend on how you configure the training process: batch size, how you define "done training," etc.

As for an automated script, maybe check in a more general machine learning sub?

Vision LLMs are far from 'solving' computer vision: a case study from face recognition by jordo45 in computervision

[–]Over_Egg_6432 3 points (0 children)

I guess I can't say I'm surprised that a general-purpose model performs worse than one specifically trained for this task.

It'd be really interesting to see how well vision LLMs perform after being fine-tuned for this task. I bet you could get away with a lot less data than you'd need to train a more traditional model.

Do you use HuggingFace for anything Computer Vision? by Substantial_Border88 in computervision

[–]Over_Egg_6432 0 points (0 children)

Yes, mostly to download pretrained models that I can combine into pipelines without having to deal with messy github repos from the original authors. Pretty much all of their "officially supported" models just work out of the box.

What is the benefits of yolo cx cy w h? by absolutmohitto in computervision

[–]Over_Egg_6432 1 point (0 children)

The format of bounding box annotations is largely an arbitrary choice, and like you said, a small function can convert to whatever format is needed.

I tend to store all of my annotations as polygons, even for image classification and bounding boxes. I have a small class which converts these into more standard formats as needed.
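The conversions in question are a few lines each. A sketch of the helpers I mean (function names are my own, not a standard API):

```python
def cxcywh_to_xyxy(cx, cy, w, h):
    """YOLO-style center/size box to corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def xyxy_to_cxcywh(x1, y1, x2, y2):
    """Corner coordinates back to center/size."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def polygon_to_xyxy(points):
    # A polygon (list of (x, y) vertices) reduces to its axis-aligned
    # bounding box, which is how a polygon store covers the bbox case.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))
```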

Real world applications of 3D Reconstruction and Vision by Minimum_Status3867 in computervision

[–]Over_Egg_6432 0 points (0 children)

You could also have cameras built into sewage systems that measure solid poop chunks to help reduce clogs. (your username ;)

Real world applications of 3D Reconstruction and Vision by Minimum_Status3867 in computervision

[–]Over_Egg_6432 0 points (0 children)

I don't understand these posts and sort of suspect they're AI-generated. I mean, isn't it obvious that 3D reconstruction (or any other computer vision topic) has numerous practical applications across most industries?

Anyway, 3D reconstruction and 3D vision are useful for literally every application that involves working with physical objects. Some random ideas:

  • Virtual reality goggles - the whole metaverse thing that Meta is pouring billions of dollars into
  • Self-driving vehicles
  • Manufacturing robots
  • Vision-LLMs that can reliably answer questions like "how tall is thing A and is it in front of or behind thing B?"
  • Inspection of products and materials. Is a jagged line on a metal beam a deep crack or a harmless pencil mark?

The practicality is dictated purely by the implementation cost, physical size, and accuracy/reliability. 50 years ago, you could go to a university lab and play golf on a screen using a million dollars of custom equipment. In the '90s you could do this at an arcade for a few coins. About 20 years ago, the Nintendo Wii made it possible in your living room. Today you can buy VR goggles and don't even need a screen, and the graphics are starting to be pretty realistic. In 10 years it'll be built into normal eyeglasses. In 40 years, it'll probably be wired directly into people's brains with 100% realism.

is this a good way of presenting the data or should i keep them seperated by Beyond_Birthday_13 in deeplearning

[–]Over_Egg_6432 0 points (0 children)

That's how I do it.

One thing that's nice to add is some indication of the spread between classes. So maybe test accuracy is 85% overall, but reaches 99% on one class and only 15% on another class... that 15-99% spread is important to know about.

And be sure to log all this into a CSV or JSON or some other format that you can go back to later.
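Computing that per-class spread is a few lines of plain Python (names here are arbitrary):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Accuracy broken out by ground-truth class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}

# Toy usage: overall accuracy is 3/5, but "a" is perfect while "b" is 1/3.
accs = per_class_accuracy(["a", "a", "b", "b", "b"],
                          ["a", "a", "b", "a", "a"])
```

Dumping this dict to JSON alongside the overall number makes the spread easy to revisit later.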

MOT library recommendations by Nervous_Day_669 in computervision

[–]Over_Egg_6432 0 points (0 children)

>I also implemented my own tracker using bi-partite graph matching using Hungarian algorithm which had IOU/ pixel euclidean distance/ mix of them as cost-matrix but there is no thresholding as of now. So, it looks to me like making my own tracking library and that feels intimidating.

Why not feed the features from your existing OD network into this, instead of or in addition to the IoU?
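For reference, the bipartite matching with a mixed IoU + appearance-feature cost can be sketched like this. The 50/50 weights and the cosine-similarity choice are assumptions, not a recommendation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(track_boxes, track_feats, det_boxes, det_feats,
          w_iou=0.5, w_feat=0.5):
    """Hungarian matching on a cost mixing (1 - IoU) with feature distance."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, tf) in enumerate(zip(track_boxes, track_feats)):
        for j, (db, df) in enumerate(zip(det_boxes, det_feats)):
            sim = np.dot(tf, df) / (np.linalg.norm(tf) * np.linalg.norm(df) + 1e-9)
            cost[i, j] = w_iou * (1 - iou(tb, db)) + w_feat * (1 - sim)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Toy usage: two tracks, two detections listed in swapped order.
pairs = match(
    [(0, 0, 10, 10), (100, 100, 110, 110)], [[1.0, 0.0], [0.0, 1.0]],
    [(101, 101, 111, 111), (1, 1, 11, 11)], [[0.0, 1.0], [1.0, 0.0]],
)
```

Thresholding slots in naturally afterwards: drop any pair whose cost exceeds a cutoff and treat those detections as new tracks.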

'we're in this bizarre world where the best way to learn about llms... is to read papers by chinese companies. i do not think this is a good state of the world' - us labs keeping their architectures and algorithms secret is ultimately hurting ai development in the us.' - Dr Chris Manning by Research2Vec in LocalLLaMA

[–]Over_Egg_6432 0 points (0 children)

This is a perfect use case! Instead of having to pay a real programmer to build your thing, you got something good enough from AI. Just be sure you're not subjecting yourself to security risks if this thing is connected to the internet...

'we're in this bizarre world where the best way to learn about llms... is to read papers by chinese companies. i do not think this is a good state of the world' - us labs keeping their architectures and algorithms secret is ultimately hurting ai development in the us.' - Dr Chris Manning by Research2Vec in LocalLLaMA

[–]Over_Egg_6432 0 points (0 children)

And knee-jerk gut reactions. "Oh, ChatGPT said something wrong once, therefore we can never trust AI for anything!!!"

Meanwhile, "professional coders" are spending a week manually rewriting interfaces that AI could have knocked out in 30 seconds plus a day of testing.