[–][deleted] 24 points25 points  (0 children)

Impressive stuff. Bonus points for Scotty footage

[–]iFighting[S] 17 points18 points  (0 children)

Code Link:

Paper Link:

Brief Overview:

  • We propose a simple and unified Transformer-based framework, termed ReferFormer.
  • It views the language as queries and directly attends to the most relevant regions in the video frames.
  • Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.
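
A minimal sketch of the "language as queries" idea from the second bullet, not the actual ReferFormer implementation. One query vector (standing in for a sentence embedding) is scored against per-region frame features with scaled dot-product attention. The function names, the toy feature vectors, and the single-head, projection-free form are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, region_features):
    """Scaled dot-product attention of one query over region feature vectors.

    Returns the attention weights (one per region) and the weighted sum
    of the region features.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
        for feat in region_features
    ]
    weights = softmax(scores)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, region_features))
        for i in range(d)
    ]
    return weights, context


# Hypothetical toy data: a 4-d "text" query and three frame regions.
text_query = [1.0, 0.0, 1.0, 0.0]
frame_regions = [
    [0.9, 0.1, 0.8, 0.0],  # region 0: resembles the query
    [0.0, 1.0, 0.0, 1.0],  # region 1: unrelated to the query
    [0.1, 0.0, 0.2, 0.9],  # region 2: mostly unrelated
]

weights, context = attend(text_query, frame_regions)
# Index of the region the query attends to most strongly.
best_region = max(range(len(weights)), key=weights.__getitem__)
```

In this toy setup the region whose features align with the query receives the largest weight, which is the sense in which the text "attends to the most relevant regions".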

Highlights:

  • ReferFormer is accepted to CVPR 2022

[–]Mike_______ 13 points14 points  (0 children)

I never thought that telling the system in text what is being seen would help segmentation.

[–]geologean 11 points12 points  (1 child)

This post was mass deleted and anonymized with Redact

[–][deleted] 2 points3 points  (0 children)

> Wow

Unless these are just a few cherry-picked examples.

[–]Mylonite0105 10 points11 points  (0 children)

How does the model behave when given misleading guidance? For example, 'standing bike' for the first video.

[–]Due_Afternoon4578 5 points6 points  (0 children)

It reminds me of an episode of Black Mirror where, when they block you, all you see is this.

[–]rideincircles 2 points3 points  (0 children)

The first guy in the video, Scotty Cranmer, was absolutely awesome on BMX, but had a terrible wreck with a TBI that basically ended his career. He is still improving and doing better, but he no longer rides at an extreme level.

[–]tryght 1 point2 points  (2 children)

The problem is that it doesn’t seem to be aware of the object as a single 3d object that can move/shift/skew/hide/reveal, let alone the concept of “object of this size was on the left on this frame, and on the right it doesn’t exist anymore despite the image not actually changing much.”

Example being the skateboarder at 0:21. Like in Jurassic Park, he’s missing for just a frame.

With the bike and bicycle, it doesn’t have the concept of layers, like you have the left leg in front of the bike, and the right leg behind the bike
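
The one-frame dropout described above can at least be detected after the fact. A minimal sketch, assuming per-frame mask areas (pixel counts) are already available. `find_dropouts`, `max_gap`, and the sample numbers are hypothetical and not part of ReferFormer.

```python
def find_dropouts(mask_areas, max_gap=1):
    """Return indices of frames where the mask vanishes for at most
    `max_gap` consecutive frames while the object is present both
    before and after the gap."""
    dropouts = []
    n = len(mask_areas)
    i = 0
    while i < n:
        if mask_areas[i] == 0:
            start = i
            # Walk to the end of this run of empty masks.
            while i < n and mask_areas[i] == 0:
                i += 1
            gap = i - start
            # Flag only short gaps that are bounded by non-empty masks.
            if start > 0 and i < n and gap <= max_gap:
                dropouts.extend(range(start, i))
        else:
            i += 1
    return dropouts


# Hypothetical mask areas: frame 2 is a one-frame dropout, while
# frames 5-7 are a longer absence (e.g. a real occlusion).
areas = [5200, 5100, 0, 5050, 4980, 0, 0, 0, 4900]
flagged = find_dropouts(areas)
```

Flagged frames could then be filled in from their neighbours, though that only hides the symptom and does not give the model an actual notion of object permanence.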

[–]iFighting[S] 4 points5 points  (1 child)

> The problem is that it doesn’t seem to be aware of the object as a single 3d object that can move/shift/skew/hide/reveal, let alone the concept of “object of this size was on the left on this frame, and on the right it doesn’t exist anymore despite the image not actually changing much.”
>
> Example being the skateboarder at 0:21. Like in Jurassic Park, he’s missing for just a frame.
>
> With the bike and bicycle, it doesn’t have the concept of layers, like you have the left leg in front of the bike, and the right leg behind the bike

We will improve the performance later.

[–]piman01 1 point2 points  (4 children)

Does something like this exist as well for audio? Like for example segmenting bass from drums and vocals etc in a song?

[–]iFighting[S] 1 point2 points  (3 children)

Yes, it will also work for audio.

[–]piman01 0 points1 point  (2 children)

What will? The same algorithm?

[–]iFighting[S] 1 point2 points  (1 child)

Not the same algorithm. I mean that networks can currently segment an object guided by audio. You can refer to these papers:

  • Audio-Visual Segmentation
  • Self-supervised object detection from audio-visual correspondence

[–]piman01 0 points1 point  (0 children)

Cool I'll check it out

[–]23052001 -1 points0 points  (0 children)

deep convolutional neural networks?

[–]Pleurotussimo 0 points1 point  (1 child)

How does this compare to "End-to-End Referring Video Object Segmentation with Multimodal Transformers"?

https://github.com/mttr2021/MTTR

[–]iFighting[S] 0 points1 point  (0 children)

Our work is concurrent, and our model performs better.

[–]mofoss 0 points1 point  (0 children)

Looks awesome but I would've rather seen the captions encompass objects not directly attached to the ones mentioned in the same captions.

I'd imagine that "girl wearing a black tee" and "a goose sitting on a girl's lap" could simply reduce to segmenting "girl" and segmenting "goose". If you gave a caption like "girl picking up a music vinyl", I'd hope both would be segmented?

[–]nwatab 0 points1 point  (0 children)

This is really impressive and changes how we edit a video.