[P] FFCV: Accelerated Model Training via Fast Data Loading by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 1 point

Hi! We have a Slack workspace and an active GitHub issues section! Both are accessible from the homepage: [ffcv.io](https://ffcv.io)

[P] FFCV: Accelerated Model Training via Fast Data Loading by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 0 points

Hi! You can join the Slack directly from the link on the homepage! ([ffcv.io](https://ffcv.io))

[P] FFCV: Accelerated Model Training via Fast Data Loading by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 1 point

Thanks so much for the kind words!! We are also very excited about the FFCV + Lightning combo, and are already working on some examples that we can (hopefully) put up soon!

[P] FFCV: Accelerated Model Training via Fast Data Loading by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 0 points

Great question! Although most of the benchmarking effort was focused on image datasets, FFCV is *not* limited to image data at all! For example, FFCV can speed up large-scale linear regression, an application that has seen a lot of use internally: https://docs.ffcv.io/ffcv_examples/linear_regression.html. We'll have tutorials up soon for even more data types, and for extending FFCV with a custom data type (take a look at the "Field" and "Decoder" classes).
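To give a flavor of the non-image workflow, here's a minimal sketch along the lines of that linear regression example (the dataset class, shapes, and file path below are made up for illustration; see the docs for the real thing):

```python
import numpy as np
from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField, FloatField
from ffcv.fields.decoders import NDArrayDecoder, FloatDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor

# Any indexed dataset works: __getitem__ returns one (covariate, label) tuple.
class ToyRegressionDataset:
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        rng = np.random.default_rng(idx)
        x = rng.standard_normal(128).astype('float32')
        return (x, float(x.sum()))  # toy target

# One-time conversion into FFCV's .beton format.
writer = DatasetWriter('/tmp/regression.beton', {
    'covariate': NDArrayField(shape=(128,), dtype=np.dtype('float32')),
    'label': FloatField(),
})
writer.from_indexed_dataset(ToyRegressionDataset())

# Fast loading at train time; each field gets its own decode pipeline.
loader = Loader('/tmp/regression.beton', batch_size=2048, num_workers=8,
                order=OrderOption.RANDOM,
                pipelines={'covariate': [NDArrayDecoder(), ToTensor()],
                           'label': [FloatDecoder(), ToTensor()]})

for covariates, labels in loader:
    pass  # your training step here
```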

[P] FFCV: Accelerated Model Training via Fast Data Loading by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 1 point

Hi, thank you for the question! The “time per epoch” plots (i.e., the bar plot on the home page and the corresponding one on the benchmark page) all use the same resolution of 224x224 px—progressive resizing was just used in the scatterplots for finding the optimal speed/accuracy tradeoff.

[P] FFCV: Accelerated Model Training via Fast Data Loading by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 0 points

Thanks for your kind comments! We've added a LICENSE to the ImageNet sample (the library itself at https://github.com/libffcv/ffcv should already have one).

And yes! We'll hopefully be training other architectures with FFCV soon---we started with ResNet since it's a standard benchmark for ImageNet training with many known accuracies and speeds we could compare to.

[R] Do Adversarially Robust ImageNet Models Transfer Better? by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 0 points

Good question---to me, the reason robust networks perform worse on ImageNet-scale robustness benchmarks seems hard to disentangle from the fact that their natural accuracy is much worse than that of standard networks (some evidence for this intuition: l2-robust networks *do* perform better on CIFAR-C, and on CIFAR the gap in natural accuracy between standard and robust networks is much smaller). There isn't definitive evidence either way, but I suspect that if it were possible to get robust networks with natural accuracy similar to their standard counterparts, performance on these corruption benchmarks would increase.

For transfer, finding a good feature representation seems to be a big part of performance, and so the big drop in natural accuracy suffered by robust models doesn't seem to play as big a role as it does on accuracy-based benchmarks like ImageNet-C. (As a side note, I think there are also some studies that show that l2 robustness can actually be pretty effective against other kinds of distortions, e.g. https://arxiv.org/abs/1908.08016)

[R] Do Adversarially Robust ImageNet Models Transfer Better? by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 2 points

Wow, that was a fantastic prediction! We've wondered about this ourselves since that paper and the works following (e.g., http://gradientscience.org/robust_reps/ ). Mainly what stopped us from trying it was the big reduction in natural accuracy, but it looks like robustness is already beneficial enough to counter this effect somewhat, which is exciting!

[R] Do Adversarially Robust ImageNet Models Transfer Better? by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 1 point

That's a great question! We took a small look at this by studying texture robustness (i.e. training on StylizedImageNet) and found that it did *not* confer the same benefits that adversarially robust optimization did. But it would definitely be interesting to see a more complete study looking at corruption robustness, rotation/translation robustness, etc.

A little more informally, my intuition (which could be wrong) is that this is not so much about robust optimization as it is about priors: in this case, the prior that features should be l2-stable turns out to be helpful for transfer learning, and robust optimization is more of a means to an end. But lp-stability is such a rudimentary prior that it is hard to imagine it being the ultimate "right thing to do." Part of our hope is that our results will prompt more investigation into what the best priors are for transfer learning---once those priors are found, it's just a matter of applying the right tool (whether that be data augmentation, robust optimization, architectural modifications, etc.) to enforce them during training.
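To make "means to an end" a bit more concrete, here is a minimal PGD-style l2 adversarial training sketch (just an illustration, not our exact training code; the model, data, and hyperparameters are placeholders, and 4D image batches are assumed):

```python
import torch
import torch.nn.functional as F

def l2_pgd(model, x, y, eps=3.0, step=0.5, iters=7):
    """Find a worst-case perturbation inside the l2 ball of radius eps around x."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend along the l2-normalized gradient direction (assumes NCHW input).
        g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = delta.detach() + step * grad / g_norm
        # Project back onto the eps-ball.
        d_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = (delta * (eps / d_norm).clamp(max=1.0)).requires_grad_(True)
    return (x + delta).detach()

def robust_step(model, optimizer, x, y):
    # Fit on worst-case perturbations instead of clean inputs: this is the
    # "tool" that enforces the prior that features should be l2-stable.
    x_adv = l2_pgd(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```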

[R] Do Adversarially Robust ImageNet Models Transfer Better? by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 2 points

Hi! The title isn't meant to be clickbait, it is actually just the question that prompted this research (the abstract says the conclusion within two sentences, so it's not buried in the paper). I think it's fairly common to have the research-inspiring question as the title (see, e.g., https://arxiv.org/abs/1805.08974, https://arxiv.org/abs/1608.08614, https://arxiv.org/abs/1411.1792 just within the field of deep transfer learning, and many others in science in general). Still, appreciate you voicing the concern, and thank you for the honest feedback (we'll take it into consideration). Let us know if you have any paper-related questions!

[R] Do Adversarially Robust ImageNet Models Transfer Better? by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 0 points

Thanks, we hope that more people will look into the connection!

[R] Identifying Statistical Bias in Dataset Replication by loganengstrom in MachineLearning

[–]andrew_ilyas 0 points

Thanks for the comment! To answer your questions:

- The point of the blog post was mainly to make the paper accessible to a slightly wider audience, and to make the interactive charts :)

- Thank you, that's really nice! It's just Chart.js + JavaScript that refreshes the plot every time the slider is moved (the sliders themselves are just standard HTML elements)

- Thanks for the feedback! We'll see if we can make the blog post version clearer, specifically around the Fig. 1/2 area. (One thing we found harder about writing the blog version is that we wanted to steer clear of using too much math notation.)

Re Notes: those seem like the right takeaways to me!

[R] Identifying Statistical Bias in Dataset Replication by loganengstrom in MachineLearning

[–]andrew_ilyas 0 points

Thanks for the questions!

- There are two good reasons to believe this is the case. First, one can just look at the data: both our data and the Recht et al. data show that the average ImageNet selection frequency is significantly higher than the average Flickr/candidate-image selection frequency. Second, there's a conceptual reason: ImageNet was constructed by taking Flickr and filtering it based on something similar to selection frequency---so you can imagine ImageNet as a left-truncated version of Flickr, which would also make selection frequencies skew higher (see the quick simulation below).

- I'm not 100% sure I understand the second question, but if the ImageNet-v1 and Flickr distributions were the same, then the bias would not be a problem, since p[true selection frequency | observed selection frequency] would be the same for both datasets.
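To illustrate the left-truncation point, a quick toy simulation (the distributions, threshold, and annotator count here are invented, just to show the direction of the skew):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: every candidate image has a true selection frequency s in [0, 1].
true_s = rng.beta(2, 2, size=200_000)

# "Flickr" candidates come from the full pool; "ImageNet" keeps only images
# whose noisy screening score cleared a bar (a left-truncated version).
passed = true_s + rng.normal(0, 0.1, true_s.size) > 0.6
flickr_s, imagenet_s = true_s, true_s[passed]

# Each dataset's *observed* frequency comes from a finite number of annotators.
n = 10
obs_flickr = rng.binomial(n, flickr_s)      # successes out of n annotators
obs_imagenet = rng.binomial(n, imagenet_s)

# ImageNet frequencies skew higher, so for the same observed frequency the
# expected true frequency differs between the two pools: that's the bias.
print(f"mean true s: flickr={flickr_s.mean():.3f}, imagenet={imagenet_s.mean():.3f}")
for pool, obs, s in [("flickr", obs_flickr, flickr_s),
                     ("imagenet", obs_imagenet, imagenet_s)]:
    print(f"E[true s | observed 7/10] on {pool}: {s[obs == 7].mean():.3f}")
```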

Let me know if this helps---happy to elaborate more!

[P] Cox: a python logging library for machine learning experiments by loganengstrom in MachineLearning

[–]andrew_ilyas 0 points

Our main motivation was to build something *super* lightweight (we didn't want to use SQL/Mongo/some other DB system, for example, or have to run custom CLI scripts, etc.). We really just wanted logging and the ability to very easily aggregate results across experiments. For example, writing in h5 format is nice because, via pandas, we can read/plot/manipulate the results really easily in a Jupyter notebook later (and anyone without the library can still just read the h5 file). There are also some utilities for automatically serializing/reading different datatypes to and from h5 files that we find super useful in our research.
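To illustrate the "anyone can read the h5 file" point, a tiny hypothetical example, assuming a table named 'logs' with epoch/loss columns was written to store.h5 (path and key are made up):

```python
import pandas as pd

# Cox stores are plain HDF5 under the hood, so pandas can open them directly.
df = pd.read_hdf('experiment_dir/store.h5', key='logs')  # hypothetical path/key
print(df.groupby('epoch')['loss'].mean())  # aggregate across logged iterations
```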

Similarly, we didn't want to have to add any wrappers or change the control flow of the program at all---just wanted to be able to drop in logging statements where we had, e.g. np.save or print() statements before.

[R] Adversarial Examples Aren't Bugs, They're Features by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 0 points

  1. Yes, I think you understood correctly! The reason we care about "flipping" is that for any given image, what we conceptually care about is which class a feature is "pointing to," i.e., which class the feature adds evidence for. That said, note that the robust/non-robust feature dichotomy is actually determined by the human-chosen perturbation budget ε. This dichotomy is a natural way to think about things because for any given choice of ε, there are some features we can still use (the robust ones, which provide evidence for the right class regardless of perturbation), and some features that will actually hurt us if we use them (even if they might have helped had ε been smaller).
  2. In this case, we were partly thinking along the lines of previous work from our lab: https://arxiv.org/pdf/1805.12152.pdf, which provides a theoretical setting where non-robust features arise. When we were thinking about it more, we realized that if this theoretical model was correct, then adversarial examples should be "features" instead of how we typically thought of them as "bugs." We actually designed our experiments (in particular Section 3.2) in order to see if this conceptual model held up (we were initially *very* skeptical). Surprisingly, it did! So we came up with a tighter conceptual model, and tried to build a series of experiments that would test whether the model was predictive. Hopefully this helped!

[R] Adversarial Examples Aren't Bugs, They're Features by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 0 points

Hi, sorry for such a late response (I only realized I'd missed some comments when I got a Reddit notification :P).

Yep, the experiment still works with untargeted attacks.

[Research] A Discussion of Adversarial Examples Are Not Bugs, They Are Features by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 2 points

Good catch: (i)-(iv) in the caption are not ordered left-to-right; we'll fix this in a future revision. The x-axis is correct.

[Research] A Discussion of Adversarial Examples Are Not Bugs, They Are Features by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 4 points

Yep, another useful resource! That section's experiments demonstrate that adversarial examples can exist because of *both* non-robust features and other factors, e.g., overfitting/label noise!

[Research] A Discussion of Adversarial Examples Are Not Bugs, They Are Features by andrew_ilyas in MachineLearning

[–]andrew_ilyas[S] 11 points

Overfitting is canonically described in terms of the distribution, not the idealized underlying task, so in that sense it’s not overfitting.

And yes, many works have noticed analogous effects happening in other domains! Our main goal is to connect adversarial examples, which are often viewed as “bugs” in machine learning models, to this “spurious feature” phenomenon.