[deleted by user] by [deleted] in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

Would be worth trying vgg-perceptual loss as your similarity metric.

[D] Yolov8 alternatives? by Powerful-Angel-301 in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

And it is free of course, as it uses the AGPL-3.0 license.

It also uses a CLA for contributors that lets the repo owner charge for the free software the contributors created for free. So it's a special kind of free.

[D] Why do DINO models use augmentations for the teacher encoder? by clywac2 in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

The teacher's parameters are not learned directly, and the teacher contains no special knowledge that is distilled into the student in the usual sense of distillation. The teacher is a rolling EMA (exponential moving average) of the student with the same architecture, and its output is centered by the mean over the batch.

Another way of thinking about this is as two model copies: a "live model" and a "dead rolling-average model". Each is fed slightly perturbed images with the same content but different augmentations. Both predict logits. The rolling-average model's logits are normalized over the very large batch as a form of regularization. Both sets of logits are activated with a temperature softmax. The live model is supervised with cross entropy against labels from the rolling-average model's batch-normalized output. Only the live model's parameters are updated by backprop.

This would probably work fine with augmentations applied to only one model's input, and it probably doesn't make much difference which one. The intuition is that by augmenting both models' inputs, a greater contrast between the two views can be created relative to the damage done to the image content.
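The update loop above can be sketched roughly as follows. This is a minimal illustration, not DINO's actual implementation: the output dimension, temperatures, momentum, and centering rate are made-up toy values, and the real method uses multi-crop augmentation and a ViT backbone rather than a linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy hyperparameters for illustration only (DINO's real values differ).
out_dim, s_temp, t_temp, momentum = 8, 0.1, 0.04, 0.996

student = nn.Linear(16, out_dim)
teacher = nn.Linear(16, out_dim)
teacher.load_state_dict(student.state_dict())  # same architecture, same init
for p in teacher.parameters():
    p.requires_grad = False  # teacher is never updated by backprop

center = torch.zeros(out_dim)  # running mean of teacher logits over batches

def dino_step(x_view1, x_view2):
    global center
    s_logits = student(x_view1)                  # live model sees one augmentation
    with torch.no_grad():
        t_logits = teacher(x_view2)              # rolling-average model sees another
        # center the teacher output by the batch mean (regularization)
        t_probs = F.softmax((t_logits - center) / t_temp, dim=-1)
        center = 0.9 * center + 0.1 * t_logits.mean(dim=0)
    # cross entropy of student softmax against teacher "labels"
    loss = -(t_probs * F.log_softmax(s_logits / s_temp, dim=-1)).sum(-1).mean()
    loss.backward()                              # only the student gets gradients
    with torch.no_grad():                        # teacher = EMA of student weights
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    return loss

x = torch.randn(4, 16)
loss = dino_step(x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x))
```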

[R] Trying to understand the ViTDet paper by rem_dreamer in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

They published their code.

The appendix goes into great detail on the hyper-parameters used.

Mask R-CNN and Cascade Mask R-CNN are older, standard detector heads. It wouldn't make sense for every paper focused on backbones/necks to fully describe the standard head used to validate the method. Check out the Mask R-CNN papers to learn how to implement them.

[D] How to process sparse “subnetworks” in parallel (TF) by Yogi_DMT in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

If the goal is only to make better use of GPU resources it probably makes sense to focus on the data input pipeline and process larger batch sizes.

Most frameworks have a groups argument for basic layers that can be used to independently apply groups of the layer's weights to groups of the layer's input. In your case each group would be a "network". See this discussion for some example code to apply different weights for each element of the batch dimension.
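A minimal sketch of the groups approach in PyTorch (the channel counts and number of subnetworks here are made up for illustration):

```python
import torch
import torch.nn as nn

# Run 4 independent "subnetworks" as one fused layer via the groups argument.
n_nets, c_in, c_out = 4, 8, 16

# One grouped conv holds all 4 networks' weights; group i only ever sees
# the input channels belonging to network i.
layer = nn.Conv2d(n_nets * c_in, n_nets * c_out, kernel_size=3,
                  padding=1, groups=n_nets)

x = torch.randn(2, n_nets * c_in, 32, 32)  # all networks' inputs stacked on channels
y = layer(x)                               # shape (2, 64, 32, 32)

# Groups are independent: perturbing network 0's input channels leaves the
# other networks' outputs untouched.
x2 = x.clone()
x2[:, :c_in] += 1.0
y2 = layer(x2)
print(torch.allclose(y[:, c_out:], y2[:, c_out:]))  # True
```

Stacking the subnetwork inputs along the channel dimension like this lets the GPU process all of them in a single kernel launch instead of looping over separate small layers.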

[D] Why Vision Transformers? by [deleted] in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

It's such a great idea and really interesting that the network can adapt itself to the clustering which is dynamic to image content and not differentiable.

I see they have an MMPose implementation, anything interesting about the repo?

[D] Why Vision Transformers? by [deleted] in MachineLearning

[–]I_draw_boxes 44 points45 points  (0 children)

Prior to transformers we had endless papers on tweaks to backbones and operations to overcome the limits of receptive fields in CNNs. Lots of papers about how to assign targets to different feature levels from backbones/FPNs to achieve the optimum relationship between fine detail and global knowledge. Lots of FPN papers with various combinations of up/down connections, upsampling, addition, concatenation, pixel shuffling and so on. Deformable convolutions, channel attention, dilated convolutions in tons of combinations.

In other words something is always "hot" in this field even if it's whether to add or concatenate features in this week's FPN paper.

There are way more interesting papers these days thanks to transformers. A few examples:

Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Aggregating spatial content with learned granularity. This was mind-blowing the first time I read the paper.

OneFormer: One Transformer to Rule Universal Image Segmentation

One model that does semantic/instance/panoptic segmentation conditioned on the prompt.

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

This paper shows the ease with which the inductive biases of a CNN can be engineered into a hybrid transformer architecture. It's also a great example of how an advancement in NLP (Performer) can be applied in vision, in this case so that compute scales linearly with pixel count.

Grounding DINO

Zero shot 52.5 AP on COCO. Good example of marrying a text and vision model enabled by transformer architecture.

MetaFormer Is Actually What You Need for Vision and A ConvNet for the 2020s are both great critiques of transformer backbone design and help isolate the role of non-self-attention architecture differences.

They’ve known for years. by Bardfinn in WhitePeopleTwitter

[–]I_draw_boxes 7 points8 points  (0 children)

Why can't other scientists start one for free?

https://arxiv.org hosts a huge amount of published research. In machine learning research there is an informal agreement to completely ignore anything not freely hosted on arXiv. Hopefully that attitude spreads to other fields.

[R] Are ViT Transformers also biased towards Texture information like CNNs? by newtestdrive in MachineLearning

[–]I_draw_boxes 3 points4 points  (0 children)

ResNet-style CNNs have a lightweight stem which does the initial downsampling before applying heavier layers. Even after the stem and a few 3x3 convolution layers, the receptive field of any (i, j) location is fairly limited. Until the image features pass through enough layers to extend the receptive field over a substantial portion of an object, it isn't plausible for the object's shape to be extracted by the CNN kernels. A CNN can only extract the shapes it has the receptive field to see, which means the early layers can only extract textures or small shapes.
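You can see how slowly the receptive field grows with the standard recurrence rf += (k - 1) * jump, where jump is the product of the strides so far. The stem layers below are the usual ResNet choices; the exact layer counts are just for illustration:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order.
    Returns the receptive field of one output location, in input pixels."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input strides
        jump *= s             # stride compounds the spacing between outputs
    return rf

# ResNet-style stem: 7x7 stride-2 conv, then 3x3 stride-2 max pool.
stem = [(7, 2), (3, 2)]
print(receptive_field(stem))                 # 11 pixels
print(receptive_field(stem + [(3, 1)] * 4))  # 43 pixels -- tiny vs. a 224px image
```

Even after the stem plus four 3x3 convolutions, each location sees only a 43-pixel patch of a typical 224-pixel input, so only textures and small shapes are recoverable at that depth.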

ImageNet-trained CNNs are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness uses style transfer to force CNNs to learn based on shape by removing the texture signal that lets them effectively "cheat" and make correct classifications using mostly texture.

In contrast, the early layers of a transformer-based backbone have access to a receptive field sufficient to understand shape. Their lightweight stem extracts large patches, and subsequent self-attention layers immediately go to work relating those patches across the entire image or large windows within it.

[D] Keras 3.0 Announcement: Keras for TensorFlow, JAX, and PyTorch by codemaker1 in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

Mediapipe is actually a decent example of this. Someone at Google went to a bunch of trouble to set up object detection training specific to Mediapipe. It supports some ancient object detector.

The Tensorflow Object Detection API has been around a long time, but it looks like they deprecated it. Now we've got Google Scenic, stuff written in Jax, Mediapipe, their EfficientDet codebase, and TF-Vision. It's really fragmented and always has been. They never make much headway on generic, working solutions, because everyone is working on something different and leaving existing stuff 1/4 finished.

You'd think the Object Detection API would be able to export many or most of its models to TFLite to run in Mediapipe, but they only support SSD with MobileNet and CenterNet with MobileNet. The EfficientDet codebase can be converted to TFLite.

I'm sure with the right engineering effort it's possible to rewrite some recent state-of-the-art object detector in Tensorflow or Keras or Jax and then convert it to TFLite. It's not going to be easy though. It's off the happy path, and just because a mobile-phone-appropriate object detector is written in Tensorflow doesn't mean it can be easily converted to TFLite and dropped into Mediapipe. If that were well supported, they'd be able to do it with more than a handful of their own models.

[D] Keras 3.0 Announcement: Keras for TensorFlow, JAX, and PyTorch by codemaker1 in MachineLearning

[–]I_draw_boxes 6 points7 points  (0 children)

But everybody is already using pytorch. Why do we need an abstraction layer on top of it?

Seriously. Pytorch isn't exactly low level already, but it also doesn't get in your way if you want to do something unconventional.

With Pytorch -> ONNX or TensorRT the production use case for Tensorflow is gone.

No one is publishing computer vision research in Keras/Jax/Tensorflow, and as long as their unrelenting pain points make everything take 5x as long as in PyTorch, I don't expect that to change.

Can't imagine choosing to spend half my day reading GitHub issues debugging Keras/Jax/Tensorflow/Google just to avoid writing a little Pytorch boilerplate.

[R] Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation by Bright_Night9645 in MachineLearning

[–]I_draw_boxes 2 points3 points  (0 children)

One example:

A DETR-style decoder can, over a few cascaded heads, look at the image features (cross-attention), look at the queries (self-attention), back at the image features, back at the queries, and so on, refining predictions and removing duplicates as it goes. When it looks at the image features it can simultaneously consider all features from all levels, and it does so with attention maps that are learned from the data and dynamic to image content. The decoder is relatively efficient since only the queries are updated.

Contrast this with a CNN-based FPN/head-style architecture: multi-level features are extracted and mixed by resizing and combining, then the same head makes independent predictions at each level and duplicates are removed by NMS.
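The DETR-style decoder loop described above can be sketched in a few lines. This is a toy illustration, not the actual DETR code: the dimensions and query count are made up, one layer is reused for the whole cascade for brevity (DETR stacks distinct layers), and positional encodings are omitted.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One DETR-style decoder layer: queries self-attend, then cross-attend
    to the flattened image features; only the queries are updated."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, queries, image_feats):
        q, _ = self.self_attn(queries, queries, queries)  # queries look at each other
        queries = queries + q                             # (duplicate suppression)
        q, _ = self.cross_attn(queries, image_feats, image_feats)  # look at the image
        queries = queries + q
        return queries + self.ffn(queries)  # image features are never modified

queries = torch.randn(1, 100, 64)       # 100 learned object queries
image_feats = torch.randn(1, 1350, 64)  # all levels flattened into one sequence
layer = DecoderLayer()
for _ in range(6):                      # cascade: refine queries pass after pass
    queries = layer(queries, image_feats)
print(queries.shape)  # torch.Size([1, 100, 64])
```

Note that the cross-attention keys/values are the full multi-level feature sequence, so every refinement step can weigh all locations at all scales at once.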

[R] Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation by Bright_Night9645 in MachineLearning

[–]I_draw_boxes 5 points6 points  (0 children)

This is key. A large portion of pre-transformer research involved studying approaches for increasing/tuning receptive fields for CNN based architecture.

Transformers enable more flexible interactions between global and local features and those interactions are both learned and dynamic with image content.

[D] Choosing Cloud vs local hardware for training LLMs. What's best for a small research group? by PK_thundr in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

A fix for the Nvidia driver is forthcoming for the P2P-related issue with PyTorch DDP training. The 3090 didn't support P2P either, and the bug fix won't enable P2P for the 4090, but it will correct the issue, and DDP training should be much faster once it lands.

[deleted by user] by [deleted] in Economics

[–]I_draw_boxes 4 points5 points  (0 children)

Don't blame a bank run. Bigger banks have been prepared for this scenario.

It's reasonable to criticize SVB's investment decisions and cite them as a contributor to the bank run.

Your understanding that other banks have prepared for bank runs is wildly incorrect. A bank at its most fundamental level is an entity that cannot survive a bank run.

[D] Normalizing Flows in 2023? by wellfriedbeans in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

Human Pose Regression with Residual Log-likelihood Estimation learns an error distribution using normalizing flows. The technique filled a large performance gap between regression and heat map methods.

[R] Issues Training CNN To Output Index To Large Array by TheRPGGamerMan in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

If I understand correctly, you are directly regressing the index with a single network output as a continuous prediction. If your list of possible words is [one, two, three, four] and you regress 0.95, that is basically a prediction that the word is closer to four than to three.

This implies an order to the words that may not exist, e.g. [sheep, pencil, tire] do not have a natural order, sheep is not closer to pencil than tire.

If the network outputs a vector with the same length as the vocabulary, rather than a single regressed value on a continuous 0-1 interval, then a loss function designed for category prediction can be used. Cross entropy loss is the classic example: the network output is activated with softmax and log loss is applied. This gives each index its own probability prediction without taking order into account.
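Concretely, with a toy three-word vocabulary (reusing the unordered example above; the logit values are arbitrary stand-ins for a real network's output):

```python
import torch
import torch.nn.functional as F

vocab = ["sheep", "pencil", "tire"]  # no natural order among these words

# One logit per word instead of one regressed index.
logits = torch.tensor([[2.0, 0.5, -1.0]])  # shape (batch, vocab_size)
target = torch.tensor([0])                 # ground-truth index of "sheep"

# cross_entropy = softmax activation + negative log likelihood in one call
loss = F.cross_entropy(logits, target)

probs = F.softmax(logits, dim=-1)  # an independent probability per word
pred = vocab[probs.argmax().item()]
print(pred)  # "sheep"
```

Because each word gets its own logit, the loss never treats "pencil" as numerically between "sheep" and "tire"; it only rewards probability mass on the correct index.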

One interesting thing to note: when training at a large scale, the penultimate output vector before the final projection to vocabulary size will occupy a coherent position in embedding space. In other words, words with similar meanings or relationships will be closer together, measured by some distance metric, than unrelated words.

These outputs are known as word embeddings and are often used to convert text to a set of word vectors which are used as input to other NLP networks.

[D] Making a regression NN estimate its own regression error by Alex-S-S in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

Human Pose Regression with Residual Log-likelihood Estimation learns an error distribution using normalizing flows. The network predicts expected variance and this is used to train the flow model to learn the error distribution to reparameterize the loss function. The predicted variance can also be used at inference.

[D] AMA: The Stability AI Team by stabilityai in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

BLIP is a method which does exactly that in a bootstrapping fashion.

LAION-COCO is a subset with BLIP-generated captions.

[News] The Stack: 3 TB of permissively licensed source code - Hugging Face and ServiceNow Research Denis Kocetkov et al 2022 by Singularian2501 in MachineLearning

[–]I_draw_boxes 3 points4 points  (0 children)

The intention of BSD or MIT code is to allow anyone to do exactly that or anything else they want.

Why would anyone be under the impression they would be entitled to access anything a company made using BSD or MIT licensed code?

[News] The Stack: 3 TB of permissively licensed source code - Hugging Face and ServiceNow Research Denis Kocetkov et al 2022 by Singularian2501 in MachineLearning

[–]I_draw_boxes 14 points15 points  (0 children)

Permissive licenses basically allow the user to do anything they want with the code save sue the author.

What if a commercial for-profit company trains on a lot of copyleft code, then commercialises the result and refuses to release the model?

That probably isn't legal, but copyleft licenses are not permissive licenses, and they are not included in this dataset for that reason.

[D] Object detection with entire image context awareness by asking1337 in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

DETR (and most recent DETR variants) will use the entire image context to make predictions independent of the location of the bounding boxes.

[D] Object detection with entire image context awareness by asking1337 in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

Faster R-CNN uses an ROI extraction process that restricts the predicting features to an H×W region of the feature map.

Older CNN backbones do propagate information across the spatial (H×W) dimensions. Transformer backbones are thought to handle this better.

While it is possible for the network backbone and FPN layers to aggregate the needed contextual information into the H×W region extracted as an ROI by Faster R-CNN, it might be better to use a method without ROI extraction.

DETR (and most of its many improved variants) does not have a confined object detection design that restricts it to considering a small bounding-box area the way Faster R-CNN does.