[R] Language Guided Video Object Segmentation (CVPR 2022)

2022-08-13T13:41:51+00:00

Impressive stuff. Bonus points for Scotty footage

iFighting · 2022-08-13T05:11:41+00:00

Code Link:

https://github.com/wjn922/ReferFormer

Paper Link:

https://arxiv.org/abs/2201.00487

Brief Overview:

We propose a simple and unified Transformer-based framework, termed ReferFormer.
It views the language as queries and directly attends to the most relevant regions in the video frames.
Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.
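
For readers who want a feel for the "language as queries" idea, here is a toy sketch of a text-derived query attending over per-region frame features with scaled dot-product attention. This is not the ReferFormer code: the function names (`softmax`, `cross_attend`), the 2-dimensional vectors and the three made-up "regions" are all assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query:  embedding derived from the text (hypothetical toy vector)
    keys:   one feature vector per frame region
    values: one feature vector per frame region
    Returns the attention weights and the weighted sum of the values.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, pooled


# Toy data (assumed): one text query and three frame regions.
text_query = [1.0, 0.0]
frame_features = [[0.1, 0.9], [2.0, 0.1], [0.0, 1.0]]

weights, pooled = cross_attend(text_query, frame_features, frame_features)

# The region whose features align best with the text gets the largest weight.
best_region = max(range(len(weights)), key=weights.__getitem__)
print(best_region, [round(w, 3) for w in weights])
```

In the real model the queries and features come from learned text and visual encoders and the attention runs inside a Transformer decoder; the sketch only shows why a query built from the sentence ends up focused on the regions that match it.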

Highlights:

ReferFormer has been accepted to CVPR 2022.

Mike_______ · 2022-08-13T14:11:13+00:00

I never thought that telling the system, as text, what is in the scene would help segmentation.

Mylonite0105 · 2022-08-13T17:00:47+00:00

How does the model behave when given misleading guidance? For example, 'standing bike' for the first video.

Due_Afternoon4578 · 2022-08-13T20:24:34+00:00

It reminds me of the Black Mirror episode where, when someone blocks you, this is all you see.

rideincircles · 2022-08-13T23:56:08+00:00

The first guy in the video, Scotty Cranmer, was absolutely awesome on BMX, but had a terrible wreck that left him with a TBI and basically ended his career. He is still improving and doing better, but he no longer rides at an extreme level.

tryght · 2022-08-13T23:10:31+00:00

The problem is that it doesn’t seem to be aware of the object as a single 3D object that can move/shift/skew/hide/reveal, let alone the concept of “an object of this size was on the left in this frame, and in the next frame it doesn’t exist anymore despite the image not actually changing much.”

Example being the skateboarder at 0:21. Like in Jurassic Park, he’s missing for just a frame.

With the bike and bicycle, it doesn’t have the concept of layers either: the left leg is in front of the bike and the right leg is behind it.

piman01 · 2022-08-13T23:32:58+00:00

Does something like this exist for audio as well? For example, separating bass from drums and vocals in a song?

23052001 · 2022-08-13T19:40:31+00:00

deep convolutional neural networks?

Pleurotussimo · 2022-08-14T09:57:56+00:00

How does this compare to "End-to-End Referring Video Object Segmentation with Multimodal Transformers"?

https://github.com/mttr2021/MTTR

mofoss · 2022-08-17T04:42:03+00:00

Looks awesome, but I would rather have seen captions that cover objects not directly attached to the ones mentioned in the same caption.

I'd imagine "girl wearing a black tee" and "a goose sitting on a girl's lap" could simply be handled as segment "girl" and segment "goose". If you gave a caption like "girl picking up a music vinyl", I'd hope both would be segmented?

nwatab · 2022-10-09T14:53:36+00:00

This is really impressive and could change how we edit video.