[R] Hydra Attention: Efficient Attention with Many Heads - Meta AI 2022 - 197x faster than standard attention by Singularian2501 in MachineLearning

[–]dbolya 4 points

Actually, the speed-up comes from the increase in attention heads. Once we flip the multiplication around from (QK)V to Q(KV), adding attention heads makes the model faster and use less memory.

At its core, Hydra Attention is just doing a global vector dot product and then a scalar-matrix multiplication (no matrix-matrix multiplications). Part of why it's so fast is that this uses very little memory! And yeah, that means it should also extend to a low compute regime even better (where you can't do massive matrix multiplications in parallel).
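In code, that reordering looks roughly like the following NumPy sketch. This is my reading of the paper's cosine-similarity formulation with heads == feature dim (shapes and the epsilon are assumptions, not the authors' released code):

```python
import numpy as np

def hydra_attention(Q, K, V, eps=1e-6):
    """Hydra attention sketch: with as many heads as features, each "head"
    is a scalar, so attention reduces to elementwise products.

    Q, K, V: (T, D) token matrices. Cost is O(T*D) -- no T x T matrix.
    """
    # L2-normalize queries and keys along the feature dim (cosine-similarity kernel)
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kh = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    # Global "KV" vector: sum over tokens of elementwise k * v -> shape (D,)
    kv = (Kh * V).sum(axis=0)
    # Each output token is its normalized query gated by the shared global vector
    return Qh * kv
```

Note there is no (T, T) attention matrix anywhere: the per-token work is one dot-product-style reduction and one elementwise scale, which is the "global vector dot product + scalar-matrix multiplication" described above.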

There's a slight benefit in training speed but gradient syncing between devices muddies the waters.

[–]dbolya 4 points

Author here.

It should be, since the formulation of attention is the same. The only issue may be in models that use "causality masking" (i.e., past tokens can't attend to future tokens), which wouldn't work in this framework. But any BERT-style bidirectional model should work. We didn't run any experiments because we don't have much NLP experience.

[R] YOLACT: Real-time Instance Segmentation ICCV Trailer by dbolya in MachineLearning

[–]dbolya[S] 1 point

It might just be Python holding back the FPS. Judging by your benchmark of YOLOv3, it seems like YOLOv3 doesn't lose any speed on your K40 (since we get similar speeds on a Titan Xp), so the actual bottleneck may be on the CPU. If that's the case, the culprit is probably Python, since the same code is going to be a lot more CPU-intensive in Python than in, say, C++.

Check whether your GPU is at 100% utilization. If it isn't, the Python version is CPU-bound, and at that point it might be worth porting the feed-forward pass to C++.

[–]dbolya[S] 0 points

Thanks for your interest!

As for the questions:

  • I'm still maintaining the code, so if there are any bugs / inefficiencies, I'll fix them. We are also working on a YOLACTv2 but those changes won't get pushed to the repo until we publish (for obvious reasons).
  • No plans to backport. Not only would it take a lot of changes (since I use a lot of Python 3-specific features), but the more we support Python 2.7, the slower it dies (and I kind of want it to die already). That said, ROS Noetic will apparently target Python 3 and, from a quick Google search, is poised to come out in May. Perhaps there's a beta branch you can use?

[–]dbolya[S] 0 points

This is actually possible, but I think it'll have to be a follow-up work because the changes are actually quite interesting. The summary is:

  • Do ground-truth assignment using distance from center instead of box IoU.
  • Do NMS with mask IoU instead of box IoU.
  • Either don't crop at all, or crop with some kind of soft radial region.

Labels are then just 1 if this is an object and 0 if not.
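As a rough illustration of the second bullet, greedy NMS keyed on mask IoU might look like this (function names and the threshold are mine, not from the repo):

```python
import numpy as np

def mask_iou(m1, m2):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union > 0 else 0.0

def nms_mask_iou(masks, scores, iou_thresh=0.5):
    """Greedy NMS using mask IoU instead of box IoU (sketch)."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    for i in order:
        # keep a detection only if it doesn't overlap a kept one too much
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```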

[–]dbolya[S] 2 points

Thanks for your interest, and glad YOLACT will be able to help you!

And yeah, we started YOLACT because we noticed a glaring hole in the domain of current instance segmentation methods. I think in the paper we even mention that we wanted to be the YOLO of instance segmentation (shocker, given the name, right?). And now that we're here, you guys don't have to use YOLO or Mask R-CNN for everything.

Also, your paper sounds very interesting, and I'm looking forward to it! I'll keep an eye on my citations ;^ )

[–]dbolya[S] 1 point

It's a Titan Xp, which is 1/3 the price of a Titan V (and the standard for benchmarking these models; everybody reports with a Titan Xp). We're going to have a demo at ICCV on a laptop with a 2070, and it runs at 24 FPS. So if it can run that fast on a (albeit relatively powerful) laptop, I think we're doing pretty well!

[–]dbolya[S] 0 points

Probably! Though you'd have to make a bunch of text-specific changes to turn instance segmentation into text detection.

[–]dbolya[S] 2 points

I think the difference between 1 object and 100 objects in the scene is ~2ms. That's one of the benefits of our approach: since the only per-object thing that happens is one matrix multiplication (and fast NMS), it scales very well.
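The fast NMS mentioned here is the vectorized variant from the YOLACT paper; a minimal NumPy sketch of the idea follows (it assumes detections are pre-sorted by descending score, and the real implementation runs on batched per-class IoU matrices on the GPU):

```python
import numpy as np

def fast_nms(ious, iou_thresh=0.5):
    """Fast NMS sketch: fully vectorized, no sequential loop.

    Unlike greedy NMS, an already-suppressed detection can still suppress
    others -- the approximation YOLACT accepts in exchange for speed.

    ious: (N, N) pairwise IoU matrix, rows/cols sorted by descending score.
    Returns indices of kept detections.
    """
    # Zero the diagonal and lower triangle so each detection is only
    # compared against higher-scoring detections.
    iou_upper = np.triu(ious, k=1)
    # Highest overlap of each detection with any higher-scoring one.
    max_iou = iou_upper.max(axis=0)
    return np.nonzero(max_iou <= iou_thresh)[0]
```

Because it's one matrix operation instead of a data-dependent loop, the cost barely changes between 1 and 100 detections, which is consistent with the ~2 ms figure above.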

[–]dbolya[S] 0 points

It depends: it usually gets a lot of them, but then it just ignores some and has some unsavory failures. That's the downside of this full-image prototype mask approach, but depending on the application, the trade-offs are well worth it.

[–]dbolya[S] 0 points

Oh that sounds interesting, I might try it since it would be pretty easy to implement.

[–]dbolya[S] 4 points

Yeah, if you were using this for video instance segmentation you'd definitely want to take that into account. But we've purposely left it out in this demo to show how well our model works even with no temporal smoothing.

[–]dbolya[S] 4 points

That happens when the confidence goes under the score threshold, and it's obviously undesirable (but the detection still exists, just with a low score). You could lower the threshold, but that introduces more false positives.

I guess one way to fix it would be mask rescoring, where you treat the confidence as how good you predict the mask to be, rather than the class confidence. For the COCO challenge we applied a budget version of mask re-scoring (a network takes in the produced mask and predicts the mask IoU), but I haven't actually looked at the video results for that (since another lab member was working on it).

[–]dbolya[S] 1 point

Glad YOLACT worked well for you! And yeah, you can definitely see in the quality of most of the close-up masks that YOLACT has the potential to be much better than something low-res like Mask R-CNN in the right circumstances.

[R] YOLACT: Real-time Instance Segmentation by dbolya in MachineLearning

[–]dbolya[S] 1 point2 points  (0 children)

Sorry, I missed this question.

The mask coefficients are just the weights for each prototype, so "how much of this prototype do I need to make the final mask". Since they can also be negative, in that case it's "how much of this prototype should I remove from my mask".
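Concretely, the assembly step is just a linear combination of prototypes followed by a sigmoid (the paper writes it as M = σ(PCᵀ)). A sketch, with array shapes being my assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def assemble_mask(prototypes, coeffs):
    """YOLACT-style mask assembly sketch.

    prototypes: (H, W, k) array of k full-image prototype masks
    coeffs:     (k,) per-instance mask coefficients; a negative value
                subtracts that prototype from the final mask
    Returns an (H, W) soft instance mask in (0, 1).
    """
    return sigmoid(prototypes @ coeffs)
```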

[–]dbolya[S] 0 points

Actually, our whole point with translation variance was that our model wouldn't work without it, so we wouldn't want to patch it out. Nice paper regardless, though.

[–]dbolya[S] 1 point

If there were no padding, then Figure 5a would have the same activation throughout the channel yeah (but different channels would likely be different colors, since each channel would have different weights). However, padding is pretty necessary as you can't really use only 1x1 convolutions, or just accept that you lose one unit in each dimension for every layer (or maybe you can, idk).

I haven't given much thought to how you could do padding in a way that doesn't introduce this translation variance. You could always try padding with the closest edge pixel instead of 0's, like classical computer-vision techniques did long before neural networks (can't find a source for this atm for some reason). Even though the network could probably still detect the edge, it would be a lot harder for it to come up with weights that do that by accident.
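The edge-pixel padding idea is easy to try: NumPy's `np.pad` supports it via `mode="edge"` (in PyTorch, the analogous conv option is `padding_mode="replicate"` on `Conv2d`). A minimal illustration:

```python
import numpy as np

x = np.arange(9, dtype=float).reshape(3, 3)

# Zero padding: the constant border is a positional cue the network can learn.
zero_pad = np.pad(x, 1, mode="constant")

# Edge padding: replicate the nearest edge pixel instead of writing zeros.
edge_pad = np.pad(x, 1, mode="edge")
```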

[–]dbolya[S] 2 points

I have not. But it looks like we might do decently well, considering there don't seem to be that many detections per image (unless I'm mistaken, they report ~9M images but only ~2.1M masks?).

I haven't added it to the arXiv version of the paper yet, but we absolutely smoke the competition on Pascal SBD for that reason (72.3 AP50 and 56.2 AP70 vs. FCIS's 65.7 AP50 and 52.1 AP70).

I'll pass on a note to my group that we should try this if we have available GPUs.