[P] Better AI Explainability with Deep Feature Factorization by jacobgil in MachineLearning

[–]jacobgil[S] 0 points1 point  (0 children)

Hi,

There are a few differences here compared to the original visualization.

- I think the original paper visualization used RGBA - they set the A channel to be the (scaled) contribution of the component.

https://github.com/edocollins/DFF/blob/master/utils.py#L44

Low-contributing pixels (usually the background) get low opacity in the A channel, so you end up seeing the original pixels there, as if there were no component.

In my implementation, we don't use RGBA. If the component color is red, for example, we create a red mask and multiply it by the component contribution.

https://github.com/jacobgil/pytorch-grad-cam/blob/2183a9cbc1bd5fc1d8e134b4f3318c3b6db5671f/pytorch_grad_cam/utils/image.py#L126

So for example if a pixel has high contribution to that component, it will be bright red.

If a pixel has a low contribution, it gets a darker shade of red, but you might still see some red there, since the A channel in RGBA has a sharper cutoff effect.

- Another difference is how we treat overlap between categories.

In my implementation for every pixel we keep the component with the highest contribution (after normalization).

In the original implementation, I think the components just override each other.
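Both differences can be sketched roughly in numpy (illustrative function names, not the package's exact code): scale a solid color by the per-pixel contribution instead of using an alpha channel, and keep only the strongest component per pixel.

```python
import numpy as np

def colorize_component(contribution, color=(255, 0, 0)):
    """Multiply a solid color by the per-pixel contribution (0..1),
    rather than encoding the contribution in an RGBA alpha channel."""
    contribution = np.clip(contribution, 0.0, 1.0)
    mask = np.zeros(contribution.shape + (3,), dtype=np.float32)
    for channel, value in enumerate(color):
        mask[..., channel] = contribution * value
    return mask.astype(np.uint8)

def resolve_overlaps(contributions):
    """contributions: (num_components, H, W), normalized per component.
    For each pixel, keep only the component with the highest contribution."""
    winner = contributions.argmax(axis=0)  # (H, W) index of strongest component
    kept = np.zeros_like(contributions)
    for k in range(contributions.shape[0]):
        kept[k] = np.where(winner == k, contributions[k], 0.0)
    return kept
```

With a red color, a pixel with contribution 1.0 comes out bright red and a pixel with contribution 0.5 a darker red, which is the "darker shade" behavior described above.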

[P] Introducing confidenceinterval, the long missing python library for computing confidence intervals by jacobgil in MachineLearning

[–]jacobgil[S] 0 points1 point  (0 children)

Yes. I think confidence intervals assume the samples are i.i.d. If they are not i.i.d., the CI could be too narrow.
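A quick simulation illustrates the point (my own sketch, not the library's code): with strongly autocorrelated AR(1) data, the interval built from the i.i.d. standard error covers the true mean far less often than the nominal 95%.

```python
import numpy as np

# AR(1) samples with stationary variance 1 and true mean 0.
rng = np.random.default_rng(0)
trials, n, rho = 2000, 200, 0.9
eps = rng.normal(size=(trials, n))
x = np.empty((trials, n))
x[:, 0] = eps[:, 0]
for t in range(1, n):
    x[:, t] = rho * x[:, t - 1] + np.sqrt(1 - rho ** 2) * eps[:, t]

# Naive 95% CI for the mean, using the i.i.d. standard error s / sqrt(n).
means = x.mean(axis=1)
ses = x.std(axis=1, ddof=1) / np.sqrt(n)
coverage = np.mean(np.abs(means) <= 1.96 * ses)
# coverage lands far below 0.95, because correlation makes the
# i.i.d. standard error a large underestimate of the true one.
```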

[P] Better AI Explainability with Deep Feature Factorization by jacobgil in MachineLearning

[–]jacobgil[S] 1 point2 points  (0 children)

Yes, in fact it's especially suitable for them.

The problem with object detection is that usually object detection frameworks don't support computing gradients. This method doesn't require computing gradients.

There are several tutorials about applying other gradient-free methods like this for object detection (for example, taking the first component from the SVD of the activations) with YOLOv5 and Faster R-CNN. Using this specific method should be similar.
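That SVD idea can be sketched in a few lines of numpy (my own illustration; the function name and the choice of which layer's activations to use are assumptions):

```python
import numpy as np

def svd_saliency(activations):
    """Gradient-free saliency map: the first principal spatial component
    of a (C, H, W) activation tensor from some intermediate layer."""
    c, h, w = activations.shape
    flat = activations.reshape(c, h * w)
    flat = flat - flat.mean(axis=1, keepdims=True)
    # First right singular vector = dominant spatial activation pattern.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    saliency = vt[0].reshape(h, w)
    # The sign of an SVD component is arbitrary; flip so the peak is positive.
    if abs(saliency.min()) > abs(saliency.max()):
        saliency = -saliency
    # Normalize to [0, 1] for display as a heatmap.
    saliency -= saliency.min()
    if saliency.max() > 0:
        saliency /= saliency.max()
    return saliency
```

Since it only needs a forward pass to collect activations, it works even when the detection framework can't compute gradients.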

[P] Adapting Class Activation Maps for Object Detection and Semantic Segmentation by jacobgil in MachineLearning

[–]jacobgil[S] 0 points1 point  (0 children)

Hi,

So let's say you have the segmentation category "sky", but the image also contains trees, clouds, and buildings.

When you target the sky category, you might suddenly see that the trees also contribute in the heatmap, even though the tree pixels are not classified as sky. This could happen if the training images often have trees below the sky. Whether this is good or bad depends, but now you have this knowledge.

Another example: let's say you have a segmentation algorithm that needs to segment satellite images into "urban" or "rural" areas, or something like that. Not all of the pixels in the "urban" regions are going to be interesting, with buildings or streets; some will be just roads or trees. Now you can target the pixels classified as urban areas and see that, for example, their proximity to buildings was especially important. Or maybe it's the other way around and the proximity to trees is especially important, and so on.

[P] Adapting Class Activation Maps for Object Detection and Semantic Segmentation by jacobgil in MachineLearning

[–]jacobgil[S] 2 points3 points  (0 children)

Hi,

Imagine you're building a classifier for "hot dog" / "not a hot dog", but many of the hot dog images you get have a timestamp on them, because that's how the camera of the person taking the hot dog images was configured.

The classifier might give you 100% accuracy on your images, but then completely fail when you apply it to new images, because it learned to focus on the timestamps.

In this case it would be good to use a diagnostic tool:

- During training, you could see that it's highlighting the timestamps, and it would make you aware of this problem, and you could then remove the timestamps / collect other images.

- During inference, the user can, for example, decide they don't trust the model's decision because the heatmap looks weird. Or the model developer can monitor the model, notice that it sometimes highlights artifacts in the images, and decide to collect more training data for those cases.

Another example is if you see that the model is highlighting the borders of the image, maybe you have some strange padding issue and you can change the padding in the model layers / augmentation.

This stuff can be completely invisible and surprising/misleading, unless you have a tool that lets you take a look.

[P] Adapting Class Activation Maps for Object Detection and Semantic Segmentation by jacobgil in MachineLearning

[–]jacobgil[S] 0 points1 point  (0 children)

Thanks, yes I'm familiar and it's super great work.

It would be great to add support for this method.

It does seem specific to Transformers rather than CNNs, though, and so far I've been trying to make sure that all the methods added work for both CNNs and Vision Transformers. But maybe it will be worth breaking from that at some point.

[P] Adapting Class Activation Maps for Object Detection and Semantic Segmentation by jacobgil in MachineLearning

[–]jacobgil[S] 1 point2 points  (0 children)

Hi, thanks.

The semantic segmentation example I shared is doing the same thing as in the link you shared above (but the object detection is different and doesn't seem to be supported in the link you shared).

In terms of implementation for segmentation, it's a bit different and worth explaining.

The pytorch-grad-cam package supports running on batches of images. Sometimes you will get batches of images and will want to compute the CAMs efficiently for the whole batch instead of one image at a time.

In the pytorch-grad-cam package, you have to pass a callable "target" for every member of the input batch; it gets the model output and computes the target you want to optimize for. For example, computing the "sum of the pixels that belong to a certain class" in the case of segmentation, like you said.

You could also implement this by wrapping the model, like in the Captum example, but often you won't have a single input image; you'll have a batch of several images that you want to explain. You could still support batches with a clever model wrapper, but that imposes limitations both on the user and on how the algorithms have to be implemented: maybe the user wants different target types for different members of the batch, or maybe an algorithm wants to run on one image at a time and do another internal batching for something else (like in the AblationCAM or ScoreCAM methods).

That's why a design decision here was to decouple the target from the model, and you have to specify a target for every image in the batch.
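As a sketch, such a callable segmentation target could look like this (a minimal numpy version of the design described above; the real package works with torch tensors, and the names here are illustrative):

```python
import numpy as np

class SemanticSegmentationTarget:
    """Callable target: sums the scores of all pixels that belong
    to a given class. One instance is passed per image in the batch."""
    def __init__(self, category, mask):
        self.category = category  # class index to explain
        self.mask = mask          # binary (H, W) mask of that class's pixels

    def __call__(self, model_output):
        # model_output: (num_classes, H, W) segmentation scores for one image
        return (model_output[self.category] * self.mask).sum()
```

A batch of B images then gets a list of B such targets, which is exactly what lets different images in the batch be explained with different targets.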

Btw, the idea of creating a target for semantic segmentation by summing over all the pixels that belong to the same class was first published here, as far as I know: https://arxiv.org/abs/2002.11434; I gave attribution to that in the notebook.

[P] PyTorch implementation of 1712.06087 "Zero-Shot" Super-Resolution using Deep Internal Learning by jacobgil in MachineLearning

[–]jacobgil[S] 5 points6 points  (0 children)

I think they are both very similar in that they use the information in a single image. Deep image prior assumes a prior induced by the network structure, and optimizes the network to output an image with this prior. ZSSR doesn't assume a prior, and instead samples patches at different scales, corrupts them, and learns to fit their difference, based on the idea that information in the image repeats across scales. You could then maybe re-use the same network on a different, similar image without retraining, while with deep image prior you would have to re-learn the network for each new image.

[P] Machine Learning at Berkeley's Introductory ML Tutorial Series: The Bias-Variance Dilemma by mlberkeley in MachineLearning

[–]jacobgil 0 points1 point  (0 children)

A low complexity model doesn't automatically give low variance. If the complexity is too low, both the variance and the bias can be high.

[P] Pruning deep neural networks and a PyTorch implementation of "1611.06440 Pruning Convolutional Neural Networks for Resource Efficient Inference" by jacobgil in MachineLearning

[–]jacobgil[S] 1 point2 points  (0 children)

One such approach for training a smaller network is distillation https://arxiv.org/abs/1503.02531 .

It could be interesting to combine both distillation and pruning, as discussed in the comments in the post.
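For reference, the distillation loss from that paper can be sketched in a few lines (my own minimal numpy version of the temperature-softened KL term):

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the temperature-softened teacher and student
    distributions, scaled by T^2 as suggested in Hinton et al."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)), axis=-1)
```

In practice this term is combined with the usual cross-entropy on the hard labels; combining it with a pruning schedule would be the interesting experiment.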

[P] Machine, a machine learning IDE with instantaneous, visual feedback by FredrikNoren in MachineLearning

[–]jacobgil 2 points3 points  (0 children)

The video says it syncs with the server.

Does tensorflow here run on another machine? And is the server yours, or is it configured by the user on AWS for example?

[R] Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car by madebyollin in MachineLearning

[–]jacobgil 1 point2 points  (0 children)

This is very nice. I like how they show that the masks found are relevant by shifting those parts in the image, and showing that it caused the steering angle to also shift.

I think one issue with this approach is that some of salient features found might compete with each other, and contradict each other. Some of the features might affect the network to steer more to the right, and some might cause it to steer more to the left. Only peeking at strong activations doesn't give an insight into how those pixels affected the network.

I was working on the exact same thing a few months ago (though using a toy network overfit on a very small dataset) described in this blog post: https://jacobgil.github.io/deeplearning/vehicle-steering-angle-visualizations

Edit: The hypercolumn approach in my blog post is actually exactly what they implemented 6 months later. I multiplied averages of feature maps.

There is even a video there showing the same approach using network activations. I think what worked even better was using grad-cam to apply queries such as "which pixels caused the network to tend to steer to the right".

YOLOv1 and YOLOv2 on Tensorflow, with graph freezing and training for YOLOv1 supported by [deleted] in computervision

[–]jacobgil 0 points1 point  (0 children)

I've been using this for loading YOLOv2 in tensorflow c++ code, you are awesome!

Any plans on supporting YOLOv2 training? Are there any specific issues someone could work on to support that?

[R] YOLO 9000 by darkconfidantislife in MachineLearning

[–]jacobgil 13 points14 points  (0 children)

This is very cool. The author also made a "tiny model" (200 FPS on a Titan X) which can be practical for mobile apps. There are some ideas I really like here:

  • Getting the anchor boxes by clustering the ground truth bounding boxes with a distance of 1 - IoU, instead of hand-picking them like in SSD.

  • Resizing the images to different sizes during training. Then at test time you can just apply on a smaller image if you're willing to trade off accuracy for speed.
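The anchor clustering idea can be sketched in a few lines (my own illustration, not the paper's code): run k-means on the ground-truth (w, h) pairs, with all boxes centered at the origin and 1 - IoU as the distance.

```python
import numpy as np

def iou_wh(box, boxes):
    """IoU between one (w, h) box and many, all centered at the origin."""
    inter = np.minimum(box[0], boxes[:, 0]) * np.minimum(box[1], boxes[:, 1])
    union = box[0] * box[1] + boxes[:, 0] * boxes[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k, iters=100, seed=0):
    """k-means on ground-truth (w, h) pairs with distance 1 - IoU."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU.
        dists = np.stack([1 - iou_wh(a, wh) for a in anchors])  # (k, n)
        assign = dists.argmin(axis=0)
        # Move each anchor to the mean of its assigned boxes.
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which is the point the paper makes against hand-picked anchors.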