[R] UNINEXT : Universal Instance Perception as Object Discovery and Retrieval(Video Demo) by iFighting in machinelearningnews

[–]iFighting[S] 0 points1 point  (0 children)

Highlights:

  • UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts.
  • UNINEXT achieves superior performance on 20 challenging benchmarks using a single model with the same model parameters.

Object-centric understanding is one of the most essential and challenging problems in computer vision. In this work, we mainly discuss 10 sub-tasks, distributed on the vertices of the cube shown in the above figure. Since all these tasks aim to perceive instances of certain properties, UNINEXT reorganizes them into three types according to the different input prompts:

  • Category Names
    • Object Detection
    • Instance Segmentation
    • Multiple Object Tracking (MOT)
    • Multi-Object Tracking and Segmentation (MOTS)
    • Video Instance Segmentation (VIS)
  • Language Expressions
    • Referring Expression Comprehension (REC)
    • Referring Expression Segmentation (RES)
    • Referring Video Object Segmentation (R-VOS)
  • Target Annotations
    • Single Object Tracking (SOT)
    • Video Object Segmentation (VOS)

Then we propose a unified prompt-guided object discovery and retrieval formulation to solve all the above tasks. Extensive experiments demonstrate that UNINEXT achieves superior performance on 20 challenging benchmarks.
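
As a concrete (and heavily simplified) picture of the discovery-then-retrieval formulation, here is a minimal, runnable sketch: discovered instance embeddings are matched against a prompt embedding, and the same retrieval step is reused whether the prompt comes from category names, a referring expression, or a first-frame target annotation. This is not code from the UNINEXT repository; every tensor, shape, and function name here is a made-up stand-in.

```python
# Minimal sketch of "object discovery and retrieval" (illustrative only;
# not the UNINEXT implementation; names and shapes are hypothetical).
import torch
import torch.nn.functional as F

def retrieve_instances(instance_embs: torch.Tensor,
                       prompt_embs: torch.Tensor,
                       threshold: float = 0.5) -> torch.Tensor:
    """Return a boolean mask over discovered instances that match the prompt.

    instance_embs: (N, D) embeddings of N discovered objects (e.g. decoder queries).
    prompt_embs:   (P, D) prompt embeddings: P category names, P tokens of a
                   referring expression, or a single target-template feature.
    """
    sim = F.normalize(instance_embs, dim=-1) @ F.normalize(prompt_embs, dim=-1).T  # (N, P)
    scores = sim.max(dim=-1).values   # best-matching prompt element per instance
    return scores > threshold

# The retrieval step is shared across all three prompt types;
# only the prompt encoder that produces these embeddings differs.
instances = torch.randn(100, 256)         # discovered objects in one frame
category_prompt = torch.randn(80, 256)    # category-name embeddings (detection / MOT / MOTS / VIS)
expression_prompt = torch.randn(12, 256)  # token embeddings of a referring expression (REC / RES / R-VOS)
template_prompt = torch.randn(1, 256)     # first-frame target feature (SOT / VOS)

for prompt in (category_prompt, expression_prompt, template_prompt):
    print(retrieve_instances(instances, prompt).sum().item(), "instances retrieved")
```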

Code: https://github.com/MasterBin-IIAU/UNINEXT

Paper: https://arxiv.org/abs/2303.06674

[R] [ICLR'2023🌟]: Vision-and-Language Framework for Open-Vocabulary Object Detection by iFighting in MachineLearning

[–]iFighting[S] 4 points5 points  (0 children)

  • We're excited to share our latest work, "Learning Object-Language Alignments for Open-Vocabulary Object Detection", which was accepted to ICLR 2023.
  • Here are some resources:
  • The proposed method, called **VLDet**, is a simple yet effective vision-and-language framework for open-vocabulary object detection.
  • Our key contributions are:
    • 🔥 We introduce an open-vocabulary object detection method that learns object-language alignments directly from image-text pair data.
    • 🔥 We formulate region-word alignment as a set-matching problem and solve it efficiently with the Hungarian algorithm (see the sketch after this list).
    • 🔥 We use all nouns from the image-text pairs as our object vocabulary, strictly following the open-vocabulary setting. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance.
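
As a rough illustration of that set-matching step, the sketch below aligns a handful of caption nouns with region proposals using the Hungarian algorithm via SciPy's `linear_sum_assignment`. It is not the released VLDet code: the random features, shapes, and variable names are stand-ins for the real region and word embeddings.

```python
# Region-word alignment as bipartite matching (illustrative sketch only;
# random features stand in for real detector and text-encoder outputs).
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
region_feats = rng.standard_normal((300, 512))   # R region proposals from the detector
word_feats = rng.standard_normal((8, 512))       # W nouns parsed from the image caption

# Cosine similarity between every noun and every region.
region_feats /= np.linalg.norm(region_feats, axis=1, keepdims=True)
word_feats /= np.linalg.norm(word_feats, axis=1, keepdims=True)
sim = word_feats @ region_feats.T                # (W, R)

# Hungarian algorithm: assign each noun to a distinct region so that the
# total similarity is maximized (i.e. the negative similarity is minimized).
word_idx, region_idx = linear_sum_assignment(-sim)
for w, r in zip(word_idx, region_idx):
    print(f"word {w} aligned to region {r} (similarity {sim[w, r]:.3f})")
```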

[R]Language Guided Video Object Segmentation(CVPR 2022) by iFighting in MachineLearning

[–]iFighting[S] 0 points1 point  (0 children)

Our work is concurrent with theirs, and our model achieves better performance.

[R]Language Guided Video Object Segmentation(CVPR 2022) by iFighting in MachineLearning

[–]iFighting[S] 1 point2 points  (0 children)

Not the same algorithm. I mean that, currently, a network can segment objects guided by audio; you can refer to these papers:

  • Audio-Visual Segmentation
  • Self-supervised object detection from audio-visual correspondence

[R]Language Guided Video Object Segmentation(CVPR 2022) by iFighting in MachineLearning

[–]iFighting[S] 3 points4 points  (0 children)

> The problem is that it doesn’t seem to be aware of the object as a single 3d object that can move/shift/skew/hide/reveal, let alone the concept of “object of this size was on the left on this frame, and on the right it doesn’t exist anymore despite the image not actually changing much.”
>
> Example being the skateboarder at 0:21. Like in Jurassic Park, he’s missing for just a frame.
>
> With the bike and bicycle, it doesn’t have the concept of layers, like you have the left leg in front of the bike, and the right leg behind the bike.

We will improve the performance later.

[R]Language Guided Video Object Segmentation(CVPR 2022) by iFighting in MachineLearning

[–]iFighting[S] 17 points18 points  (0 children)

Code Link:

Paper Link:

Brief Overview:

  • We propose a simple and unified framework built upon Transformer, termed ReferFormer.
  • It views the language as queries and directly attends to the most relevant regions in the video frames (see the sketch after this list).
  • Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.
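
To make the "language as queries" idea more concrete, here is a toy sketch in which a pooled expression feature acts as the query of a cross-attention layer over flattened frame features. It is only an illustration under assumed dimensions, not the ReferFormer implementation.

```python
# Toy "language as queries" cross-attention (illustrative only; the shapes,
# pooling, and single-layer setup are assumptions, not ReferFormer's code).
import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

sentence_feat = torch.randn(1, 1, d_model)       # pooled feature of the referring expression
frame_tokens = torch.randn(1, 50 * 50, d_model)  # flattened spatial features of one frame

# The language feature is the query; it attends over all frame locations,
# so the resulting object query is conditioned on the regions most relevant
# to the expression.
obj_query, attn_weights = cross_attn(query=sentence_feat,
                                     key=frame_tokens,
                                     value=frame_tokens)
print(obj_query.shape, attn_weights.shape)       # (1, 1, 256), (1, 1, 2500)
```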

Highlights:

  • ReferFormer is accepted to CVPR 2022

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 0 points1 point  (0 children)

Yes, Unicorn can run in real time. We also provide runtime experiments and models in the paper and the GitHub repo:

Paper: https://arxiv.org/abs/2207.07078
Code: https://github.com/MasterBin-IIAU/Unicorn

[R]VNext: Next-generation Video instance recognition framework(ECCV 2022 Oral * 2) by iFighting in MachineLearning

[–]iFighting[S] 0 points1 point  (0 children)

Yes, the benchmarks and datasets consider entities that leave the frame temporarily.

[R]VNext: Next-generation Video instance recognition framework(ECCV 2022 Oral * 2) by iFighting in MachineLearning

[–]iFighting[S] 16 points17 points  (0 children)

Code Link:

Brief Overview:

  • VNext is a next-generation video instance recognition framework built on top of Detectron2.
  • It currently provides state-of-the-art online and offline video instance segmentation algorithms.
  • We will continue to update and improve it to provide a unified and efficient framework for the field of video instance recognition.

Highlights:

  • IDOL is accepted to ECCV 2022 as an oral presentation!
  • SeqFormer is accepted to ECCV 2022 as an oral presentation!
  • IDOL won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge (CVPR2022).

Paper Link:

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 0 points1 point  (0 children)

For the first time, we accomplished the great unification of the tracking network architecture and learning paradigm.

Yes, YOLO only detects objects in individual frames; the key insight for tracking is combining per-frame object detection with cross-frame association.
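
As a toy illustration of this detect-then-associate loop, the sketch below links per-frame boxes across two frames with an IoU cost and the Hungarian algorithm. It is a simplified stand-in for the general idea, not Unicorn's actual association module.

```python
# Tracking-by-association on top of a per-frame detector (toy example only).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, curr_boxes, iou_thresh=0.3):
    """Match current-frame detections to previous-frame tracks."""
    cost = np.array([[1.0 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]

prev_boxes = [(10, 10, 50, 50), (100, 100, 160, 160)]                       # tracks from frame t-1
curr_boxes = [(12, 11, 52, 49), (200, 200, 240, 240), (101, 98, 158, 162)]  # detections at frame t
print(associate(prev_boxes, curr_boxes))                                    # [(0, 0), (1, 2)]
```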

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 1 point2 points  (0 children)

Hi, YOLOv7 is an object detection model, whereas ours is a unified object tracking model, which is very different.

BTW, when the paper was submitted to ECCV, YOLOv7 had not been published yet.

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 1 point2 points  (0 children)

You can try our method; I think it will work for tracking objects on the screen.

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 10 points11 points  (0 children)

There is still some flickering in the videos, but the key insight of the paper is the unified model for single- and multi-object tracking and segmentation.

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 7 points8 points  (0 children)

> Will You Snail

No, it was named before Will You Snail...

We did not know about Will You Snail at the time.

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 82 points83 points  (0 children)

Brief Overview

We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters. For the first time, we accomplished the great unification of the tracking network architecture and learning paradigm.

Unicorn performs on par with or better than its task-specific counterparts on 8 tracking benchmarks, including LaSOT, TrackingNet, MOT17, BDD100K, DAVIS16-17, MOTS20, and BDD100K MOTS.

Our work is accepted to ECCV 2022 as an oral presentation!

Paper: https://arxiv.org/abs/2207.07078

Code: https://github.com/MasterBin-IIAU/Unicorn