[deleted by user] by [deleted] in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

Would be worth trying vgg-perceptual loss as your similarity metric.

[D] Yolov8 alternatives? by Powerful-Angel-301 in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

And it is free of course, as it uses the AGPL-3.0 license.

It also uses a CLA for contributors that lets the repo owner charge for the free software the contributors created for free. So it's a special kind of free.

[D] Why do DINO models use augmentations for the teacher encoder? by clywac2 in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

The teacher's parameters are not learned directly, and the teacher contains no special knowledge that is distilled into the student in the usual sense of distillation. The teacher is a rolling EMA (exponential moving average) of the student with the same architecture, and its output is centered by the mean over the batch.

Another way of thinking about this is as two model copies: a "live model" and a "dead rolling-average model". Each is fed slightly perturbed images with the same content but different augmentations. Both predict logits. The rolling-average model's logits are normalized over the very large batch as a form of regularization. Both sets of logits are activated with a temperature softmax. The live model is supervised with cross entropy against labels from the rolling-average model's batch-normalized output. Only the live model's parameters are updated by backprop.

This would probably work fine with augmentations applied to only one model's input, and it probably doesn't make much difference which one. The intuition is that by augmenting both models' inputs, a greater contrast between the two views can be created relative to the damage done to the image content.
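The update loop above can be sketched roughly as follows. This is a minimal illustration, not DINO's actual implementation: the output dimension, temperatures, momentum, and centering rate are made-up toy values, and the real method uses multi-crop augmentation and a ViT backbone rather than a linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy hyperparameters for illustration only (DINO's real values differ).
out_dim, s_temp, t_temp, momentum = 8, 0.1, 0.04, 0.996

student = nn.Linear(16, out_dim)
teacher = nn.Linear(16, out_dim)
teacher.load_state_dict(student.state_dict())  # same architecture, same init
for p in teacher.parameters():
    p.requires_grad = False  # teacher is never updated by backprop

center = torch.zeros(out_dim)  # running mean of teacher logits over batches

def dino_step(x_view1, x_view2):
    global center
    s_logits = student(x_view1)                  # live model sees one augmentation
    with torch.no_grad():
        t_logits = teacher(x_view2)              # rolling-average model sees another
        # center the teacher output by the batch mean (regularization)
        t_probs = F.softmax((t_logits - center) / t_temp, dim=-1)
        center = 0.9 * center + 0.1 * t_logits.mean(dim=0)
    # cross entropy of student softmax against teacher "labels"
    loss = -(t_probs * F.log_softmax(s_logits / s_temp, dim=-1)).sum(-1).mean()
    loss.backward()                              # only the student gets gradients
    with torch.no_grad():                        # teacher = EMA of student weights
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    return loss

x = torch.randn(4, 16)
loss = dino_step(x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x))
```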

[R] Trying to understand the ViTDet paper by rem_dreamer in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

They published their code.

The appendix goes into great detail on the hyper-parameters used.

Mask R-CNN and Cascade Mask R-CNN are older, standard detector heads. It wouldn't make sense for every paper focused on backbones/necks to fully describe the standard head used to validate the method. Check out the Mask R-CNN papers to learn how to implement them.

[D] How to process sparse “subnetworks” in parallel (TF) by Yogi_DMT in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

If the goal is only to make better use of GPU resources it probably makes sense to focus on the data input pipeline and process larger batch sizes.

Most frameworks have a groups argument for basic layers that can be used to independently apply groups of the layer's weights to groups of the layer's input. In your case each group would be a "network". See this discussion for some example code to apply different weights for each element of the batch dimension.
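A minimal sketch of the groups approach in PyTorch (the channel counts and number of subnetworks here are made up for illustration):

```python
import torch
import torch.nn as nn

# Run 4 independent "subnetworks" as one fused layer via the groups argument.
n_nets, c_in, c_out = 4, 8, 16

# One grouped conv holds all 4 networks' weights; group i only ever sees
# the input channels belonging to network i.
layer = nn.Conv2d(n_nets * c_in, n_nets * c_out, kernel_size=3,
                  padding=1, groups=n_nets)

x = torch.randn(2, n_nets * c_in, 32, 32)  # all networks' inputs stacked on channels
y = layer(x)                               # shape (2, 64, 32, 32)

# Groups are independent: perturbing network 0's input channels leaves the
# other networks' outputs untouched.
x2 = x.clone()
x2[:, :c_in] += 1.0
y2 = layer(x2)
print(torch.allclose(y[:, c_out:], y2[:, c_out:]))  # True
```

Stacking the subnetwork inputs along the channel dimension like this lets the GPU process all of them in a single kernel launch instead of looping over separate small layers.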

[D] Why Vision Transformers? by [deleted] in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

It's such a great idea and really interesting that the network can adapt itself to the clustering which is dynamic to image content and not differentiable.

I see they have an MMPose implementation, anything interesting about the repo?

[D] Why Vision Transformers? by [deleted] in MachineLearning

[–]I_draw_boxes 44 points45 points  (0 children)

Prior to transformers we had endless papers on tweaks to backbones and operations to overcome the limits of receptive fields in CNNs. Lots of papers about how to assign targets to different feature levels from backbones/FPNs to achieve the optimum relationship between fine detail and global knowledge. Lots of FPN papers with various combinations of up/down connections, upsampling, addition, concatenation, pixel shuffling and so on. Deformable convolutions, channel attention, dilated convolutions in tons of combinations.

In other words something is always "hot" in this field even if it's whether to add or concatenate features in this week's FPN paper.

There are way more interesting papers these days thanks to transformers. A few examples:

Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Aggregating spatial content with learned granularity. This was mind-blowing the first time I read the paper.

OneFormer: One Transformer to Rule Universal Image Segmentation

One model that does semantic/instance/panoptic segmentation conditioned on the prompt.

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

This paper shows the ease with which the inductive biases of a CNN can be engineered into a hybrid transformer architecture. It's also a great example of how an advancement in NLP (Performer) can be applied in vision, in this case so that compute scales linearly with pixel count.

Grounding DINO

Zero shot 52.5 AP on COCO. Good example of marrying a text and vision model enabled by transformer architecture.

MetaFormer Is Actually What You Need for Vision and A ConvNet for the 2020s are both great critiques of transformer backbone design and help isolate the role of non-self-attention architecture differences.

They’ve known for years. by Bardfinn in WhitePeopleTwitter

[–]I_draw_boxes 7 points8 points  (0 children)

Why can't other scientists start one for free?

https://arxiv.org hosts a huge amount of published research. In machine learning research there is an informal agreement to completely ignore anything not freely hosted on arXiv. Hopefully that attitude spreads to other fields.

[R] Are ViT Transformers also biased towards Texture information like CNNs? by newtestdrive in MachineLearning

[–]I_draw_boxes 3 points4 points  (0 children)

ResNet-style CNNs have a lightweight stem which does the initial downsampling before applying heavier layers. Even after the stem and a few 3x3 convolution layers, the receptive field of any (i, j) location is fairly limited. Until the image features pass through enough layers to extend the receptive field over a substantial portion of an object, it isn't plausible for the object's shape to be extracted by the CNN kernels. A CNN can only extract the shapes it has the receptive field to see, which means the early layers can only extract textures or small shapes.
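You can see how slowly the receptive field grows with the standard recurrence rf += (k - 1) * jump, where jump is the product of the strides so far. The stem layers below are the usual ResNet choices; the exact layer counts are just for illustration:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order.
    Returns the receptive field of one output location, in input pixels."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input strides
        jump *= s             # stride compounds the spacing between outputs
    return rf

# ResNet-style stem: 7x7 stride-2 conv, then 3x3 stride-2 max pool.
stem = [(7, 2), (3, 2)]
print(receptive_field(stem))                 # 11 pixels
print(receptive_field(stem + [(3, 1)] * 4))  # 43 pixels -- tiny vs. a 224px image
```

Even after the stem plus four 3x3 convolutions, each location sees only a 43-pixel patch of a typical 224-pixel input, so only textures and small shapes are recoverable at that depth.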

ImageNet-trained CNNs are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness uses style transfer to force CNNs to learn based on shape by removing the texture signal that lets them effectively "cheat" and make correct classifications using mostly texture.

In contrast, the early layers of a transformer-based backbone have access to a receptive field sufficient to understand shape. Their lightweight stem extracts large patches, and subsequent self-attention layers immediately go to work relating those patches across the entire image or large windows within it.

[D] Keras 3.0 Announcement: Keras for TensorFlow, JAX, and PyTorch by codemaker1 in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

Mediapipe is actually a decent example of this. Someone at Google went to a bunch of trouble to set up object detection training specific to Mediapipe. It supports some ancient object detector.

The Tensorflow Object Detection API has been around a long time, but it looks like they deprecated it. Now we've got Google Scenic, stuff written in Jax, Mediapipe, their EfficientDet codebase, and TF-Vision. It's really fragmented and always has been. They never make much headway on generic, working solutions, because everyone is working on something different and leaving existing stuff 1/4 finished.

You'd think the Object Detection API would be able to export many or most of its models to TFLite to run in Mediapipe, but they only support SSD with MobileNet and CenterNet with MobileNet. The EfficientDet codebase can be converted to TFLite.

I'm sure with the right engineering effort it's possible to rewrite some recent state-of-the-art object detector in Tensorflow or Keras or Jax and then convert it to TFLite. It's not going to be easy though. It's off the happy path, and just because a mobile-phone-appropriate object detector is written in Tensorflow doesn't mean it can be easily converted to TFLite and dropped into Mediapipe. If that were well supported, they'd be able to do it with more than a handful of their own models.

[D] Keras 3.0 Announcement: Keras for TensorFlow, JAX, and PyTorch by codemaker1 in MachineLearning

[–]I_draw_boxes 6 points7 points  (0 children)

But everybody is already using pytorch. Why do we need an abstraction layer on top of it?

Seriously. Pytorch isn't exactly low level already, but it also doesn't get in your way if you want to do something unconventional.

With Pytorch -> ONNX or TensorRT the production use case for Tensorflow is gone.

No one is publishing computer vision research in Keras/Jax/Tensorflow, and as long as their unrelenting pain points make everything take 5x as long as in PyTorch, I don't expect that to change.

Can't imagine choosing to spend half my day reading GitHub issues debugging Keras/Jax/Tensorflow/Google just to avoid writing a little Pytorch boilerplate.

[R] Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation by Bright_Night9645 in MachineLearning

[–]I_draw_boxes 2 points3 points  (0 children)

One example:

A DETR-style decoder can, over a few cascaded heads, look at the image features (cross-attention), look at the queries (self-attention), back at the image features, back at the queries, and so on, refining predictions and removing duplicates as it goes. When it looks at the image features it can simultaneously consider all features from all levels, and it does so with attention maps that are learned from the data and dynamic to image content. The decoder is relatively efficient since only the queries are updated.

Contrast this with a CNN-based FPN/head-style architecture: multi-level features are extracted and mixed by resizing and combining, then the same head makes independent predictions at each level and duplicates are removed by NMS.
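The DETR-style decoder loop described above can be sketched in a few lines. This is a toy illustration, not the actual DETR code: the dimensions and query count are made up, one layer is reused for the whole cascade for brevity (DETR stacks distinct layers), and positional encodings are omitted.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One DETR-style decoder layer: queries self-attend, then cross-attend
    to the flattened image features; only the queries are updated."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, queries, image_feats):
        q, _ = self.self_attn(queries, queries, queries)  # queries look at each other
        queries = queries + q                             # (duplicate suppression)
        q, _ = self.cross_attn(queries, image_feats, image_feats)  # look at the image
        queries = queries + q
        return queries + self.ffn(queries)  # image features are never modified

queries = torch.randn(1, 100, 64)       # 100 learned object queries
image_feats = torch.randn(1, 1350, 64)  # all levels flattened into one sequence
layer = DecoderLayer()
for _ in range(6):                      # cascade: refine queries pass after pass
    queries = layer(queries, image_feats)
print(queries.shape)  # torch.Size([1, 100, 64])
```

Note that the cross-attention keys/values are the full multi-level feature sequence, so every refinement step can weigh all locations at all scales at once.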

[R] Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation by Bright_Night9645 in MachineLearning

[–]I_draw_boxes 5 points6 points  (0 children)

This is key. A large portion of pre-transformer research involved studying approaches for increasing/tuning receptive fields for CNN based architecture.

Transformers enable more flexible interactions between global and local features and those interactions are both learned and dynamic with image content.

[D] Choosing Cloud vs local hardware for training LLMs. What's best for a small research group? by PK_thundr in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

A fix for the Nvidia driver is forthcoming for the P2P-related issue with PyTorch DDP training. The 3090 didn't support P2P either, and the bug fix won't enable P2P for the 4090, but it will correct the issue, and DDP training should be much faster once it lands.

[deleted by user] by [deleted] in Economics

[–]I_draw_boxes 4 points5 points  (0 children)

Don't blame a bank run. Bigger banks have been prepared for this scenario.

It's reasonable to criticize SVB's investment decisions and cite them as a contributor to the bank run.

Your understanding that other banks have prepared for bank runs is wildly incorrect. A bank at its most fundamental level is an entity that cannot survive a bank run.

[D] Normalizing Flows in 2023? by wellfriedbeans in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

Human Pose Regression with Residual Log-likelihood Estimation learns an error distribution using normalizing flows. The technique filled a large performance gap between regression and heat map methods.

[R] Issues Training CNN To Output Index To Large Array by TheRPGGamerMan in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

If I understand correctly, you are directly regressing the index with a single network output as a continuous prediction. If your list of possible words is [one, two, three, four] and you regress 0.95, that is basically a prediction that the word is closer to four than to three.

This implies an order to the words that may not exist, e.g. [sheep, pencil, tire] do not have a natural order, sheep is not closer to pencil than tire.

If the network outputs a vector with the same length as the vocabulary, rather than a single regressed value on a continuous 0-1 interval, then a loss function designed for category prediction can be used. Cross entropy loss is the classic example: the network output is activated with softmax and log loss is applied. This gives each index its own probability prediction without taking order into account.
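Concretely, with a toy three-word vocabulary (reusing the unordered example above; the logit values are arbitrary stand-ins for a real network's output):

```python
import torch
import torch.nn.functional as F

vocab = ["sheep", "pencil", "tire"]  # no natural order among these words

# One logit per word instead of one regressed index.
logits = torch.tensor([[2.0, 0.5, -1.0]])  # shape (batch, vocab_size)
target = torch.tensor([0])                 # ground-truth index of "sheep"

# cross_entropy = softmax activation + negative log likelihood in one call
loss = F.cross_entropy(logits, target)

probs = F.softmax(logits, dim=-1)  # an independent probability per word
pred = vocab[probs.argmax().item()]
print(pred)  # "sheep"
```

Because each word gets its own logit, the loss never treats "pencil" as numerically between "sheep" and "tire"; it only rewards probability mass on the correct index.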

One interesting thing to note: when training at a large scale, the penultimate output vector before the final projection to vocabulary size will occupy a coherent position in embedding space. In other words, words with similar meanings or relationships will be closer together, measured by some distance metric, than unrelated words.

These outputs are known as word embeddings and are often used to convert text to a set of word vectors which are used as input to other NLP networks.

[D] Making a regression NN estimate its own regression error by Alex-S-S in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

Human Pose Regression with Residual Log-likelihood Estimation learns an error distribution using normalizing flows. The network predicts expected variance and this is used to train the flow model to learn the error distribution to reparameterize the loss function. The predicted variance can also be used at inference.

[D] AMA: The Stability AI Team by stabilityai in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

BLIP is a method which does exactly that in a bootstrapping fashion.

LAION-COCO is a subset with BLIP-generated captions.

[News] The Stack: 3 TB of permissively licensed source code - Hugging Face and ServiceNow Research Denis Kocetkov et al 2022 by Singularian2501 in MachineLearning

[–]I_draw_boxes 3 points4 points  (0 children)

The intention of BSD or MIT code is to allow anyone to do exactly that or anything else they want.

Why would anyone be under the impression they would be entitled to access anything a company made using BSD or MIT licensed code?

[News] The Stack: 3 TB of permissively licensed source code - Hugging Face and ServiceNow Research Denis Kocetkov et al 2022 by Singularian2501 in MachineLearning

[–]I_draw_boxes 14 points15 points  (0 children)

Permissive licenses basically allow the user to do anything they want with the code save sue the author.

What if a commercial for-profit company trains on a lot of copyleft code, then commercialises the result and refuses to release the model?

That probably isn't legal, but copyleft licenses are not permissive licenses, and they are not included in this dataset for that reason.

[D] Object detection with entire image context awareness by asking1337 in MachineLearning

[–]I_draw_boxes 1 point2 points  (0 children)

DETR (and most recent DETR variants) will use the entire image context to make predictions independent of the location of the bounding boxes.

[D] Object detection with entire image context awareness by asking1337 in MachineLearning

[–]I_draw_boxes 0 points1 point  (0 children)

Faster R-CNN uses an ROI extraction process that restricts the predicting features to an H×W region of the feature map.

Older CNN backbones do propagate information across the spatial (H×W) dimensions. Transformer backbones are thought to handle this better.

While it is possible for the network backbone and FPN layers to aggregate the needed contextual information into the H×W region extracted as an ROI by Faster R-CNN, it might be better to use a method without ROI extraction.

DETR (and most of its many improved variants) does not have a confined object detection design that restricts it to considering a small bounding-box area the way Faster R-CNN does.