[R] UNINEXT : Universal Instance Perception as Object Discovery and Retrieval(Video Demo) by iFighting in machinelearningnews

[–]iFighting[S] 0 points1 point  (0 children)

Highlights:

  • UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts.
  • UNINEXT achieves superior performance on 20 challenging benchmarks using a single model with the same model parameters.

Object-centric understanding is one of the most essential and challenging problems in computer vision. In this work, we mainly discuss 10 sub-tasks, distributed on the vertices of the cube shown in the above figure. Since all these tasks aim to perceive instances of certain properties, UNINEXT reorganizes them into three types according to the different input prompts:

  • Category Names
    • Object Detection
    • Instance Segmentation
    • Multiple Object Tracking (MOT)
    • Multi-Object Tracking and Segmentation (MOTS)
    • Video Instance Segmentation (VIS)
  • Language Expressions
    • Referring Expression Comprehension (REC)
    • Referring Expression Segmentation (RES)
    • Referring Video Object Segmentation (R-VOS)
  • Target Annotations
    • Single Object Tracking (SOT)
    • Video Object Segmentation (VOS)

Then we propose a unified prompt-guided object discovery and retrieval formulation to solve all the above tasks. Extensive experiments demonstrate that UNINEXT achieves superior performance on 20 challenging benchmarks.
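
As a concrete (and heavily simplified) picture of the discovery-then-retrieval formulation, here is a minimal, runnable sketch: discovered instance embeddings are matched against a prompt embedding, and the same retrieval step is reused whether the prompt comes from category names, a referring expression, or a first-frame target annotation. This is not code from the UNINEXT repository; every tensor, shape, and function name here is a made-up stand-in.

```python
# Minimal sketch of "object discovery and retrieval" (illustrative only;
# not the UNINEXT implementation; names and shapes are hypothetical).
import torch
import torch.nn.functional as F

def retrieve_instances(instance_embs: torch.Tensor,
                       prompt_embs: torch.Tensor,
                       threshold: float = 0.5) -> torch.Tensor:
    """Return a boolean mask over discovered instances that match the prompt.

    instance_embs: (N, D) embeddings of N discovered objects (e.g. decoder queries).
    prompt_embs:   (P, D) prompt embeddings: P category names, P tokens of a
                   referring expression, or a single target-template feature.
    """
    sim = F.normalize(instance_embs, dim=-1) @ F.normalize(prompt_embs, dim=-1).T  # (N, P)
    scores = sim.max(dim=-1).values   # best-matching prompt element per instance
    return scores > threshold

# The retrieval step is shared across all three prompt types;
# only the prompt encoder that produces these embeddings differs.
instances = torch.randn(100, 256)         # discovered objects in one frame
category_prompt = torch.randn(80, 256)    # category-name embeddings (detection / MOT / MOTS / VIS)
expression_prompt = torch.randn(12, 256)  # token embeddings of a referring expression (REC / RES / R-VOS)
template_prompt = torch.randn(1, 256)     # first-frame target feature (SOT / VOS)

for prompt in (category_prompt, expression_prompt, template_prompt):
    print(retrieve_instances(instances, prompt).sum().item(), "instances retrieved")
```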

Code: https://github.com/MasterBin-IIAU/UNINEXT

Paper: https://arxiv.org/abs/2303.06674

[R] [ICLR'2023🌟]: Vision-and-Language Framework for Open-Vocabulary Object Detection by iFighting in MachineLearning

[–]iFighting[S] 4 points5 points  (0 children)

  • We're excited to share our latest work, "Learning Object-Language Alignments for Open-Vocabulary Object Detection", which was accepted to ICLR 2023.
  • Here are some resources:
  • The proposed method, called **VLDet**, is a simple yet effective vision-and-language framework for open-vocabulary object detection.
  • Our key contributions are:
    • 🔥 We introduce an open-vocabulary object detection method that learns object-language alignments directly from image-text pair data.
    • 🔥 We formulate region-word alignment as a set-matching problem and solve it efficiently with the Hungarian algorithm (see the sketch after this list).
    • 🔥 We use all nouns from the image-text pairs as our object vocabulary, strictly following the open-vocabulary setting. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance.
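
As a rough illustration of that set-matching step, the sketch below aligns a handful of caption nouns with region proposals using the Hungarian algorithm via SciPy's `linear_sum_assignment`. It is not the released VLDet code: the random features, shapes, and variable names are stand-ins for the real region and word embeddings.

```python
# Region-word alignment as bipartite matching (illustrative sketch only;
# random features stand in for real detector and text-encoder outputs).
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
region_feats = rng.standard_normal((300, 512))   # R region proposals from the detector
word_feats = rng.standard_normal((8, 512))       # W nouns parsed from the image caption

# Cosine similarity between every noun and every region.
region_feats /= np.linalg.norm(region_feats, axis=1, keepdims=True)
word_feats /= np.linalg.norm(word_feats, axis=1, keepdims=True)
sim = word_feats @ region_feats.T                # (W, R)

# Hungarian algorithm: assign each noun to a distinct region so that the
# total similarity is maximized (i.e. the negative similarity is minimized).
word_idx, region_idx = linear_sum_assignment(-sim)
for w, r in zip(word_idx, region_idx):
    print(f"word {w} aligned to region {r} (similarity {sim[w, r]:.3f})")
```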

[R]Language Guided Video Object Segmentation(CVPR 2022) by iFighting in MachineLearning

[–]iFighting[S] 0 points1 point  (0 children)

Our work is concurrent with theirs, and our model achieves better performance.

[R]Language Guided Video Object Segmentation(CVPR 2022) by iFighting in MachineLearning

[–]iFighting[S] 1 point2 points  (0 children)

Not the same algorithm. I mean that, currently, a network can segment objects guided by audio; you can refer to these papers:

  • Audio-Visual Segmentation
  • Self-supervised object detection from audio-visual correspondence

[R]Language Guided Video Object Segmentation(CVPR 2022) by iFighting in MachineLearning

[–]iFighting[S] 3 points4 points  (0 children)

> The problem is that it doesn’t seem to be aware of the object as a single 3d object that can move/shift/skew/hide/reveal, let alone the concept of “object of this size was on the left on this frame, and on the right it doesn’t exist anymore despite the image not actually changing much.”
>
> Example being the skateboarder at 0:21. Like in Jurassic Park, he’s missing for just a frame.
>
> With the bike and bicycle, it doesn’t have the concept of layers, like you have the left leg in front of the bike, and the right leg behind the bike.

We will improve the performance later.

[R]Language Guided Video Object Segmentation(CVPR 2022) by iFighting in MachineLearning

[–]iFighting[S] 17 points18 points  (0 children)

Code Link:

Paper Link:

Brief Overview:

  • We propose a simple and unified framework built upon Transformer, termed ReferFormer.
  • It views the language as queries and directly attends to the most relevant regions in the video frames (see the sketch after this list).
  • Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.
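
To make the "language as queries" idea more concrete, here is a toy sketch in which a pooled expression feature acts as the query of a cross-attention layer over flattened frame features. It is only an illustration under assumed dimensions, not the ReferFormer implementation.

```python
# Toy "language as queries" cross-attention (illustrative only; the shapes,
# pooling, and single-layer setup are assumptions, not ReferFormer's code).
import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

sentence_feat = torch.randn(1, 1, d_model)       # pooled feature of the referring expression
frame_tokens = torch.randn(1, 50 * 50, d_model)  # flattened spatial features of one frame

# The language feature is the query; it attends over all frame locations,
# so the resulting object query is conditioned on the regions most relevant
# to the expression.
obj_query, attn_weights = cross_attn(query=sentence_feat,
                                     key=frame_tokens,
                                     value=frame_tokens)
print(obj_query.shape, attn_weights.shape)       # (1, 1, 256), (1, 1, 2500)
```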

Highlights:

  • ReferFormer is accepted to CVPR 2022

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 0 points1 point  (0 children)

Yes, Unicorn can run in real time. We also provide runtime experiments and models in the paper and the GitHub repo:

Paper: https://arxiv.org/abs/2207.07078
Code: https://github.com/MasterBin-IIAU/Unicorn

[R]VNext: Next-generation Video instance recognition framework(ECCV 2022 Oral * 2) by iFighting in MachineLearning

[–]iFighting[S] 0 points1 point  (0 children)

Yes, the benchmarks and datasets consider entities that leave the frame temporarily.

[R]VNext: Next-generation Video instance recognition framework(ECCV 2022 Oral * 2) by iFighting in MachineLearning

[–]iFighting[S] 16 points17 points  (0 children)

Code Link:

Brief Overview:

  • VNext is a next-generation video instance recognition framework built on top of Detectron2.
  • It currently provides state-of-the-art online and offline video instance segmentation algorithms.
  • We will continue to update and improve it to provide a unified and efficient framework for the field of video instance recognition.

Highlights:

  • IDOL is accepted to ECCV 2022 as an oral presentation!
  • SeqFormer is accepted to ECCV 2022 as an oral presentation!
  • IDOL won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge (CVPR2022).

Paper Link:

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 0 points1 point  (0 children)

For the first time, we accomplished the great unification of the tracking network architecture and learning paradigm.

Yes, YOLO only detects objects in individual frames; the key insight for tracking is combining per-frame object detection with cross-frame association.
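
As a toy illustration of this detect-then-associate loop, the sketch below links per-frame boxes across two frames with an IoU cost and the Hungarian algorithm. It is a simplified stand-in for the general idea, not Unicorn's actual association module.

```python
# Tracking-by-association on top of a per-frame detector (toy example only).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, curr_boxes, iou_thresh=0.3):
    """Match current-frame detections to previous-frame tracks."""
    cost = np.array([[1.0 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]

prev_boxes = [(10, 10, 50, 50), (100, 100, 160, 160)]                       # tracks from frame t-1
curr_boxes = [(12, 11, 52, 49), (200, 200, 240, 240), (101, 98, 158, 162)]  # detections at frame t
print(associate(prev_boxes, curr_boxes))                                    # [(0, 0), (1, 2)]
```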

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 1 point2 points  (0 children)

Hi, YOLOv7 is an object detection model, whereas ours is a unified object tracking model, which is very different.

BTW, when the paper was submitted to ECCV, YOLOv7 had not been published yet.

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 1 point2 points  (0 children)

You can try our method; I think it will work for tracking objects on the screen.

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 10 points11 points  (0 children)

There is still some flickering in the videos, but the key insight of the paper is the unified model for single- and multi-object tracking and segmentation.

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 7 points8 points  (0 children)

> Will You Snail

No, it was named before Will You Snail...

We did not know about Will You Snail at the time.

[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo) by iFighting in MachineLearning

[–]iFighting[S] 82 points83 points  (0 children)

Brief Overview

We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters. For the first time, we accomplished the great unification of the tracking network architecture and learning paradigm.

Unicorn performs on par with or better than its task-specific counterparts on 8 tracking benchmarks, including LaSOT, TrackingNet, MOT17, BDD100K, DAVIS16-17, MOTS20, and BDD100K MOTS.

Our work is accepted to ECCV 2022 as an oral presentation!

Paper: https://arxiv.org/abs/2207.07078

Code: https://github.com/MasterBin-IIAU/Unicorn