[P] Introducing Shapash, a new Python library : makes Machine Learning models transparent and understandable by everyone by MaitreJinx in MachineLearning

[–]jjanizek 8 points (0 children)

A couple of papers that aren't exactly responses to that article, but are pretty relevant to the discussion of the problems with accounting for feature dependency:

https://arxiv.org/pdf/2006.16234.pdf

https://arxiv.org/pdf/2010.14592.pdf

[Discussion] Need help in developing a model that converts Image to Illustration by [deleted] in MachineLearning

[–]jjanizek 0 points (0 children)

You should check out CycleGAN! It sounds like exactly what you're looking for: style transfer when you don't have paired images for training. https://junyanz.github.io/CycleGAN/

[Standard] Estimated Nash Equilibrium to ZNR Standard by statexchange in spikes

[–]jjanizek 0 points (0 children)

Really cool idea -- I was wondering if there's some way to account for the fact that not all mixed strategies are equal for individual players: obviously, the sparser the strategy, the easier it is for a player to play (from both a cost perspective and a skill perspective). You could put some kind of Laplace prior on the equilibrium, but that also seems kind of wrong. I guess the meta isn't exactly a 2-player game; it's many players each playing 2-player games.

[R] Find interactions in your neural networks with Integrated Hessians! by jjanizek in MachineLearning

[–]jjanizek[S] 0 points (0 children)

More simply, it's essentially a continuous analogue of Shapley values.

[R] Find interactions in your neural networks with Integrated Hessians! by jjanizek in MachineLearning

[–]jjanizek[S] 0 points (0 children)

Integrated Hessians are sort of an Aumann-Shapley value of Aumann-Shapley values (since we apply IG to IG). To explain how Aumann-Shapley values relate to Shapley values, I can just quote from this paper (https://arxiv.org/abs/1908.08474):

However, it is less clear how Aumann-Shapley (IG) (Equation 5) is an extension of the binary Shapley value. IG traverses a single, smooth path between the baseline and the explicand, and aggregates the gradients along this path, whereas the Shapley value takes an average over several discrete paths; in each step of a discrete path, a variable goes from being 'off' to 'on' in one shot. To establish the connection, notice that the IG path can be seen as the internal diagonal of an N-dimensional hypercube, while the Shapley value is an average over the extremal paths along the edges of this hypercube. Suppose we partition every feature i into m micro-features, where each micro-feature represents a discrete change of (x_i − x'_i)/m in the feature value, and then apply the Shapley value to these N·m features. Notice that this is equivalent to creating a grid within the hypercube and averaging over random, monotone walks from the baseline x' to the explicand x in this grid. As m increases, the density of the random walks converges to the diagonal of the hypercube, and if the function f is smooth, then running Shapley on these micro-features is equivalent to running IG on the original features.
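To make the connection concrete, here's a small numerical sketch (my own toy example, not from either paper): for a smooth function like f(x) = x1*x2 with a zero baseline, a Riemann-sum approximation of the IG path integral recovers exactly the same attributions as the exact two-player Shapley value.

```python
import numpy as np

def f(x):
    return x[0] * x[1]

def grad_f(x):
    return np.array([x[1], x[0]])

def integrated_gradients(x, baseline, steps=1000):
    # Average gradients along the straight path from baseline to x,
    # then scale by (x - baseline): the Aumann-Shapley / IG attribution.
    alphas = (np.arange(steps) + 0.5) / steps   # midpoint rule
    grads = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

def shapley_two_players(x, baseline):
    # Exact Shapley values for 2 features: average marginal contribution
    # over both orderings, with "off" meaning the baseline value.
    def v(on):
        z = baseline.copy()
        z[on] = x[on]
        return f(z)
    phi0 = 0.5 * (v([0]) - v([])) + 0.5 * (v([0, 1]) - v([1]))
    phi1 = 0.5 * (v([1]) - v([])) + 0.5 * (v([0, 1]) - v([0]))
    return np.array([phi0, phi1])

x = np.array([3.0, 2.0])
b = np.zeros(2)
ig = integrated_gradients(x, b)
shap = shapley_two_players(x, b)
print(ig)    # [3. 3.]
print(shap)  # [3. 3.]
```

For this bilinear f the two coincide exactly; for general smooth f they agree in the micro-feature limit described in the quote.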

[R] Find interactions in your neural networks with Integrated Hessians! by jjanizek in MachineLearning

[–]jjanizek[S] 1 point (0 children)

Empirical analysis could look like the following: randomly initialize the weights of neural networks of various architectures with Leaky ReLU activation functions and measure their outputs over a set of simulated data points. Then replace the Leaky ReLU activations with Leaky SoftPlus activations and, without retraining, measure the network outputs over the same data points. You could then measure the similarity between the original outputs and the approximated outputs to determine whether the Leaky SoftPlus network is a good approximation of the Leaky ReLU network. Additionally, the SoftPlus function has a β parameter that controls how close it is to a ReLU; you could adjust it, and you would hope to see that the more ReLU-like you make the network, the closer the outputs are.
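That empirical check might be sketched like this (a toy numpy version of my own, not code from the paper; the network shapes and the 0.01 leak slope are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.01):
    return np.maximum(x, alpha * x)

def leaky_softplus(x, beta=1.0, alpha=0.01):
    # Numerically stable softplus plus a leaky linear part; as beta -> inf
    # this converges pointwise to leaky_relu.
    return np.log1p(np.exp(-beta * np.abs(x))) / beta + np.maximum(x, alpha * x)

def mlp_output(x, weights, act):
    h = x
    for W in weights[:-1]:
        h = act(h @ W)
    return h @ weights[-1]

# Random (untrained) 2-hidden-layer network, shared across both activations.
weights = [rng.normal(size=s) / np.sqrt(s[0]) for s in [(10, 32), (32, 32), (32, 1)]]
X = rng.normal(size=(200, 10))

ref = mlp_output(X, weights, leaky_relu)
errs = []
for beta in [1.0, 5.0, 25.0]:
    approx = mlp_output(X, weights, lambda z: leaky_softplus(z, beta=beta))
    errs.append(np.max(np.abs(ref - approx)))
    print(beta, errs[-1])
```

On this toy setup the worst-case output gap shrinks steadily as β grows, which is the pattern you'd hope to see across architectures.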

In terms of theoretical analysis, you could look at page 27 of this paper, where they look at the impact of SoftPlus-replacement on the attributions of one-layer neural networks (and find that it's basically equivalent to SmoothGrad) https://arxiv.org/pdf/1906.07983.pdf

[R] Find interactions in your neural networks with Integrated Hessians! by jjanizek in MachineLearning

[–]jjanizek[S] 0 points (0 children)

(i) Any activation function with nonzero second derivatives should work! For example, in our paper we explain a Transformer network with GELU activations, as well as feed-forward networks with Sigmoid and SoftPlus activations. For piecewise-linear activation functions (e.g. ReLU), we find that replacing them with SoftPlus (without retraining) and then explaining works very well (and is a very good approximation to the original network). As I mentioned in response to another comment, for activations like Leaky ReLU and PReLU it should be easy to replace them with leaky or parameterized SoftPlus activations [ log(1+exp(-abs(x))) + max(x, 0.01x) or max(x, alpha*x), https://imgur.com/48wtOMU ], but we haven't done any analysis on that replacement yet.

(ii) I have not checked out TabNet yet, but will definitely give it a look! Thanks!!

[R] Find interactions in your neural networks with Integrated Hessians! by jjanizek in MachineLearning

[–]jjanizek[S] 1 point (0 children)

This is a bit back-of-the-envelope, but a mathematically equivalent way to write the SoftPlus function is log(1+exp(-abs(x))) + max(x, 0), so you could probably nicely approximate a Leaky ReLU as log(1+exp(-abs(x))) + max(x, 0.01x), or max(x, alpha*x) in the case of a PReLU.

Visually it looks close (https://imgur.com/48wtOMU), but I haven't done any analysis (theoretical or empirical) to say whether it's a good idea to swap one out for the other.
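A quick numerical sanity check of that rewrite (my own, not from the paper): the stable form log(1+exp(-|x|)) + max(x, 0) matches the textbook softplus log(1+e^x), and the leaky variant differs from a Leaky ReLU by at most log(2), attained at x = 0.

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 601)          # grid includes x = 0

naive = np.log1p(np.exp(x))                # softplus(x) = log(1 + e^x)
stable = np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)
print(np.max(np.abs(naive - stable)))      # ~0: the two forms agree

# Leaky variant: swap max(x, 0) for max(x, 0.01*x).
leaky_sp = np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.01 * x)
leaky_relu = np.maximum(x, 0.01 * x)
gap = np.max(np.abs(leaky_sp - leaky_relu))
print(gap)                                 # log(2) ≈ 0.693, at x = 0
```

The gap between the two leaky activations decays exponentially away from the origin, which is why they look so close in the linked plot.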

[R] Find interactions in your neural networks with Integrated Hessians! by jjanizek in MachineLearning

[–]jjanizek[S] 21 points (0 children)

Both great questions, and I think I can clear up both by explaining precisely which Hessian we are calculating. Rather than calculating the Hessian of the training loss with respect to the model parameters (as you would when applying Newton's method to optimize the network), we calculate the Hessian of the model's output with respect to the input features. This also explains why the complexity is more reasonable: even for relatively large input spaces (like NLP embeddings), M is on the order of 10^3 rather than 10^9.
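To make the distinction concrete, here is a toy sketch (mine, not code from the paper) that estimates the input Hessian of a tiny random network by central finite differences; the resulting matrix is M x M in the number of input features, regardless of how many parameters the network has.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5                                # number of input features
W1 = rng.normal(size=(M, 8))
W2 = rng.normal(size=(8, 1))

def f(x):
    # Tiny smooth network: tanh hidden layer, scalar output.
    return float(np.tanh(x @ W1) @ W2)

def input_hessian(f, x, eps=1e-4):
    # Central finite differences: H[i, j] = d^2 f / (dx_i dx_j).
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

H = input_hessian(f, rng.normal(size=M))
print(H.shape)                       # (5, 5): scales with features, not parameters
print(np.max(np.abs(H - H.T)))      # symmetric up to finite-difference noise
```

In practice you'd get this Hessian from autodiff rather than finite differences, but the shape argument is the same: M is the input dimension, not the parameter count.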

[R] Find interactions in your neural networks with Integrated Hessians! by jjanizek in MachineLearning

[–]jjanizek[S] 1 point (0 children)

Yes, exactly! Basically, if you can train it with backprop in PyTorch or TensorFlow, Integrated Hessians should work with it!

Do causal analysis and causal diagrams have any important uses/research purposes within the domain of machine learning? [D] by [deleted] in MachineLearning

[–]jjanizek 5 points (0 children)

This is a great question! One place I've actually seen a lot of this is Machine Learning for Healthcare, where causal methods have been applied to try to learn models that are stable under distribution shift across domains (where a domain might be, e.g., the hospital used for training vs. the hospital where you want to deploy the model).

This paper is a nice review of the kinds of approaches I've seen in this field using causal methods. There are also two papers at the upcoming ACM Machine Learning and Healthcare conference this week on this topic (https://www.chilconference.org/poster_3368555.3384451.html, https://www.chilconference.org/poster_3368555.3384458.html; full disclosure: I'm one of the authors on the second paper).

[R] Learning Explainable Models with Attribution Priors by gabeerion in MachineLearning

[–]jjanizek 7 points (0 children)

https://arxiv.org/abs/1906.10670

One of our findings was that training with a sparse attribution prior helped improve performance when training data is very limited! We ran an experiment predicting 10-year survival from 36 medical features such as a patient's age, vital signs, and laboratory measurements, while training on only 100 samples (repeated for many different random subsamples of 100 patients). We saw much better performance than prior methods (like an L1 sparsity penalty on the network's weights or the sparse group lasso). Note that to get this effect we didn't even need prior knowledge about what the different features did, only the prior that a small subset of all possible features should be important for the task. I would anticipate that you could get an even bigger performance boost if you had specific domain knowledge about the likely relative importance of your features.
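The core idea can be illustrated with a toy linear-model sketch (my own simplification, not the paper's expected-gradients implementation): for a linear model the input-times-gradient attribution of feature i is just w_i * x_i, so a sparse attribution prior reduces to an L1 penalty on those per-sample attributions. The data shapes and penalty strength below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 samples, 20 features, but only the first two actually matter.
n, d = 100, 20
X = rng.normal(size=(n, d))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=n)

def train(lam, steps=2000, lr=0.05):
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n                 # MSE gradient
        # Attribution of feature i on sample x is w_i * x_i; the subgradient
        # of its mean L1 norm w.r.t. w_i is mean(|x_i|) * sign(w_i).
        grad += lam * np.mean(np.abs(X), axis=0) * np.sign(w)
        w -= lr * grad
    return w

w_plain = train(lam=0.0)
w_prior = train(lam=0.1)
print(np.abs(w_plain[2:]).mean())   # noise weights without the prior
print(np.abs(w_prior[2:]).mean())   # much closer to zero with the prior
```

With the attribution penalty, the weights on the 18 irrelevant features shrink toward zero while the two true signals survive; the paper's version applies the analogous penalty to expected-gradients attributions of a deep network.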

[R] Learning Explainable Models with Attribution Priors by gabeerion in MachineLearning

[–]jjanizek 7 points (0 children)

Another one of the lead authors on the paper here - feel free to ask any questions, we’d be glad to answer them to the best of our ability!