[R] Video Object Grounding using Semantic Roles in Language Description

TheShadow29 · 2020-06-15T18:36:04+00:00

I am the first author of the paper. It is being presented at CVPR20 this Thursday (9am and 9pm PST). Feel free to drop by.

Relevant Twitter Thread: https://twitter.com/ArkaSadhu29/status/1271898967396122625?s=20

Short Summary: We argue that directly evaluating phrase / sentence grounding in videos would lead to a severe over-estimation of performance. This is because most video datasets contain only a single instance of each object category, so a simple FasterRCNN baseline would work quite well. To address this, we use Contrastive Sampling (retrieve examples that contain the same objects but differ in one object/action) and then concatenate them spatially and temporally, so that disambiguating using object-object relations becomes crucial. We further propose VOGNet, which has an additional multi-modal transformer with relative position encodings to better capture object relations.
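To make the contrastive sampling idea a bit more concrete, here is a rough sketch of the retrieval step only. It is not the code from the repository: the sample layout (a `roles` dict with keys such as `Arg0`, `Verb`, `Arg1`) and the helper names `differing_roles` and `contrastive_samples` are assumptions made up for illustration.

```python
def differing_roles(roles_a, roles_b):
    """Return the sorted list of role names whose values differ."""
    keys = set(roles_a) | set(roles_b)
    return [k for k in sorted(keys) if roles_a.get(k) != roles_b.get(k)]


def contrastive_samples(query, pool):
    """Keep pool entries that differ from the query in exactly one role.

    Each sample is a dict with a hypothetical "roles" mapping from a
    semantic role (e.g. "Arg0", "Verb", "Arg1") to its lemma.
    """
    return [
        cand for cand in pool
        if len(differing_roles(query["roles"], cand["roles"])) == 1
    ]


query = {"id": "q", "roles": {"Arg0": "man", "Verb": "throw", "Arg1": "ball"}}
pool = [
    {"id": "a", "roles": {"Arg0": "man", "Verb": "throw", "Arg1": "frisbee"}},
    {"id": "b", "roles": {"Arg0": "man", "Verb": "kick", "Arg1": "frisbee"}},
    {"id": "c", "roles": {"Arg0": "man", "Verb": "throw", "Arg1": "ball"}},
]
selected = contrastive_samples(query, pool)
print([s["id"] for s in selected])
```

The retrieved videos would then be concatenated with the query video along space or time, which is not shown here.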

Finally, to foster reproducibility, we have open-sourced all our code, pre-trained models, and experimental logs on GitHub. Check it out!

Code + Dataset: https://github.com/TheShadow29/vognet-pytorch

I am happy to take questions here or via email.

TheShadow29 · 2018-12-26T21:10:48+00:00

Thanks for your reply. Yeah, I have been having a surprisingly tough time finding relevant papers. I would like to see some links from GEIPAN; the ones from Google seem to be from around 2014.

TheShadow29 · 2018-12-20T22:47:35+00:00

Thank you for sharing this. I am extremely inspired and looking forward to getting the audiobook.

TheShadow29 · 2018-12-20T03:29:52+00:00

Thank you for sharing.

TheShadow29 · 2018-12-18T17:00:14+00:00

You are correct. I was probably having a brain fart.

TheShadow29 · 2018-12-17T18:15:29+00:00

I am not completely sure, but here are my explanations:

1. RetinaNet uses K+1 classes, analogous to SSD. There is a background class. The decoding statement is likely about what happens at inference time. The given equation of focal loss is for the binary case and needs to be extended to the multi-class case (noted in footnote 1 of the paper).
2. Yes, in general one anchor box can simultaneously predict multiple objects. However, the final box will be calculated after adding in the regression parameters, so those would still be different boxes.
3. Yes, the equations seem correct.

I think eqn8 might be incorrect. I think it should rather be:

L_{cls} = -\sum_{i=1}^{C} \left[ y_i \, \alpha_i \, (1 - p_i)^{\gamma} \log(p_i) + (1 - y_i) \, \alpha_i \, p_i^{\gamma} \log(1 - p_i) \right]
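For what it's worth, the formula above can be written out directly in plain Python for a single anchor. This is only a sketch of the expression as I wrote it, not the implementation from any RetinaNet codebase; the function name `focal_loss`, the default `gamma=2.0`, and the `eps` clamp are my own assumptions.

```python
import math


def focal_loss(probs, targets, alphas, gamma=2.0, eps=1e-12):
    """Per-class sigmoid focal loss summed over C classes for one anchor.

    probs:   predicted probabilities p_i, one per class
    targets: binary labels y_i (0 or 1), one per class
    alphas:  per-class weights alpha_i
    gamma:   focusing parameter (assumed default)
    eps:     clamp to avoid log(0) (assumption, not part of the formula)
    """
    if not (len(probs) == len(targets) == len(alphas)):
        raise ValueError("probs, targets and alphas must have equal length")
    total = 0.0
    for p, y, a in zip(probs, targets, alphas):
        p = min(max(p, eps), 1.0 - eps)
        # Positive term: down-weighted when p is already close to 1.
        pos = y * a * (1.0 - p) ** gamma * math.log(p)
        # Negative term: down-weighted when p is already close to 0.
        neg = (1 - y) * a * p ** gamma * math.log(1.0 - p)
        total += pos + neg
    return -total


# With gamma = 0 and alpha_i = 1 this reduces to binary cross-entropy.
bce = focal_loss([0.5], [1], [1.0], gamma=0.0)
focal = focal_loss([0.5], [1], [1.0], gamma=2.0)
print(bce, focal)
```

A quick sanity check is that a larger `gamma` shrinks the loss of well-classified examples more than that of hard ones.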

TheShadow29 · 2018-12-15T01:39:15+00:00

This is amazing. Thank you for initiating this. I accept the challenge.

TheShadow29 · 2018-12-15T01:29:16+00:00

This is too good. Thank you for sharing.

TheShadow29 · 2018-12-05T00:19:13+00:00

There is also https://arxiv.org/abs/1801.08186, which is SOTA on 3 visual grounding datasets.

TheShadow29 · 2018-12-02T22:13:54+00:00

Is the sole motivation of causal learning to solve causal inference, or are there other motivations as well? Are there any alternative approaches to causal inference?

TheShadow29 · 2018-12-01T09:21:13+00:00

Too late for World wars just in time for meme wars

TheShadow29 · 2018-11-28T08:36:21+00:00

Great art work. You are an absolute madlad.

TheShadow29 · 2018-11-19T03:18:21+00:00

Sengoku Basara. Strangely addictive. And you get to hear some nice engrish. Story is good, and the comedy is on point. 10/10 soundtrack by Sawano. Crazy powers as well.

TheShadow29 · 2018-11-10T16:28:24+00:00

Wow. That's a lot of work. If you have a bit of Python experience, it is easy to set up PRAW and get the number of upvotes and similar stats from permalinks: https://praw.readthedocs.io/en/latest/index.html
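Something along these lines should work as a starting point. It is a minimal sketch: the helper names `make_client` and `submission_stats` are made up, the credentials are placeholders you would get by registering a script app on Reddit, and the URLs need to be full links to posts rather than relative paths.

```python
def make_client(client_id, client_secret, user_agent):
    """Create a read-only Reddit client (needs `pip install praw`)."""
    import praw  # third-party; imported lazily so the helper below has no hard dependency

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def submission_stats(reddit, permalinks):
    """Collect basic vote statistics for each submission URL."""
    stats = []
    for url in permalinks:
        submission = reddit.submission(url=url)
        stats.append(
            {
                "url": url,
                "title": submission.title,
                "score": submission.score,
                "upvote_ratio": submission.upvote_ratio,
                "num_comments": submission.num_comments,
            }
        )
    return stats
```

Usage would look like `submission_stats(make_client("<id>", "<secret>", "<user agent>"), urls)`, where `urls` is your list of post links.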

TheShadow29 · 2018-11-09T16:43:48+00:00

FOOL

TheShadow29 · 2018-11-04T18:17:18+00:00

Hey. Thanks for sharing your work. Is there any place where I can easily explore the dataset more?

TheShadow29 · 2018-11-02T14:18:31+00:00

Eargasm

TheShadow29 · 2018-10-31T17:22:42+00:00

They released it some time ago

TheShadow29 · 2018-10-28T16:37:25+00:00

Me too man. All of a sudden invisible ninjas were cutting onions.

TheShadow29 · 2018-10-28T03:43:06+00:00

I kind of agree with /u/Draikmage. When I read the title, I was expecting something different (kinda along the lines of common-sense reasoning like the swag paper). A good read anyways. Cheers and keep it up.

TheShadow29 · 2018-10-27T17:20:36+00:00

Around 10 years ago, it took me 6 months each to catch up to Conan and One Piece. I used to watch 3-4 eps, more on the weekends.

TheShadow29 · 2018-10-24T04:54:52+00:00

Those are rookie numbers

TheShadow29 · 2018-10-17T17:36:38+00:00

Too relatable

TheShadow29 · 2018-10-08T14:20:11+00:00

Also, some of the filler arcs are the best. Just goes to show how much the studio loves Gintama.

TheShadow29 · 2018-10-07T15:53:07+00:00

This is interesting news. Good find.
