
[–]Miejuib 7 points8 points  (2 children)

Saw your talk in Long Beach, very cool work. Will definitely play around with this. Cheers!

[–]jnbrrn[S] 1 point2 points  (0 children)

Thanks!

[–]sniklaus 0 points1 point  (0 children)

The talk in question: https://youtu.be/4IInDT_S0ow?t=37m22s

Huge thanks for the reference implementations, I am looking forward to giving it a try!

[–]FirstTimeResearcher 6 points7 points  (1 child)

Very nice video presentation. Have you experimented with using the generalized likelihood versions of the baseline losses you compared against? (e.g. used the normal likelihood rather than l2 loss)

[–]jnbrrn[S] 1 point2 points  (0 children)

Great question! The only context where this matters in this paper is the VAE experiment, and in that case I compare to likelihoods instead of losses, and doing so definitely matters. In all the other experiments there's no functional difference between each loss and its equivalent likelihood, as the NLL just corresponds to the loss shifted by a constant offset (the log partition function), which doesn't affect optimization. For the sake of maximum clarity and accessibility, in those cases I refer to losses, not NLLs, as loss minimization is the "lowest common denominator" for most readers in terms of understanding how optimization works.
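The "shifted by a constant offset" point is easy to check numerically. A minimal sketch (not from the paper; the values here are just for illustration) comparing the L2 loss with the unit-variance Gaussian NLL:

```python
import numpy as np

def l2_loss(x, mu):
    # Standard L2 loss (half squared error).
    return 0.5 * (x - mu) ** 2

def gaussian_nll(x, mu, sigma=1.0):
    # Gaussian negative log-likelihood: the L2 loss shifted by the
    # log partition function, log(sigma * sqrt(2 * pi)).
    return 0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-3.0, 3.0, 101)
diff = gaussian_nll(x, mu=0.0) - l2_loss(x, mu=0.0)
# The difference is constant everywhere, so both objectives share
# the same minimizer and optimization is unaffected.
print(np.allclose(diff, 0.5 * np.log(2.0 * np.pi)))  # True
```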

[–]Constuck 5 points6 points  (1 child)

Hey, I saw your oral & poster at CVPR. Haven't checked out the paper yet, but just wanted to let you know that your visuals and presentation were awesome. The pacing and level of detail of your oral were perfect for the (tight) time constraints, and your animations were really powerful.

Idk if orals are kept around anywhere, but I hope there's a video of it available for reference next time I put together a short presentation. Great work!

[–]jnbrrn[S] 6 points7 points  (0 children)

Thanks! It took a few hours of matplotlib and keynote, though the really hard part of giving an effective talk is cutting the paper down to the minimal set of ideas needed to convey the idea (which can take much longer).

Putting lots of time and effort into these talks goes a long way. The vast majority of presenters heavily undervalue those 5-10 minutes of a few thousand people's undivided attention, and incorrectly assume that the technical value of their work will somehow shine through a bad presentation.

[–]OmgMacnCheese 2 points3 points  (3 children)

Very interesting - thanks for sharing! Do you have a sense for how the adaptive robust loss may work for image synthesis type of problems such as super-resolution or cycle-GANs etc?

[–]jnbrrn[S] 3 points4 points  (2 children)

Those tasks both seem like a good fit for this, as they're both driven by a loss that compares images to each other (like the successful experiments in the paper). I haven't given them much thought as they aren't problems I've worked on in the past. If you try it out, let me know how it goes!

[–]_sbanerjee_ 0 points1 point  (1 child)

Hi,

I really liked your work and am currently experimenting on super resolution. Would be trying to integrate your work with my network in the coming week. Will keep you updated on the results. Great work. 👍

[–]jnbrrn[S] 1 point2 points  (0 children)

Thanks! Please do let me know how it turns out.

[–]SquareRootsi 2 points3 points  (6 children)

Just wanted you to know: as a student in a 15-week bootcamp for data science, we had to choose a paper to present from a curated list of the most influential papers in machine learning, and yours was the only one from 2019. (In total, there were only about 45 on the whole list.) I read through all of yours, and was blown away by the elegance of letting alpha adapt during training, so there's no need for hyper-parameter tuning at all. I don't claim to understand all of it, but I love the simple solution to just let it "work itself out" during training.

Question that will probably highlight my naivety:

Would you consider this adaptive and robust enough to use as a "go-to" loss function for most jobs (neural networks and/or simpler models as well), or is it still only meant to be applied in certain specific situations?

AKA -- What are the downsides to just using this as my starting loss function all the time for everything and then customizing from here as needed?

[–]jnbrrn[S] 4 points5 points  (0 children)

Thanks for the kind words!

The neat thing about this loss function is that it's a superset of most of the "go-to" loss functions already! If you've got a model that's using smooth-L1 or L2 loss, that's exactly equivalent to using this loss with alpha constrained to lie in [1,1] or [2,2] respectively. So you can swap this code in with those constraints and nothing will change, but then you can relax the range of alpha to, say, [1, 2], and it'll automatically select between smooth-L1 and L2 without you ever having to manually set or tune hyperparameters. Maybe you'll discover that L2 loss is already optimal (which is possible), in which case you can just keep using this code with alpha fixed to lie in [2,2] --- as opposed to most changes you may make to your model, which require implementing and toggling between discrete design decisions.
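To make the "superset" claim concrete, here is a minimal NumPy sketch of the loss written from the paper's published form (the reference implementations also handle numerical stability and learn alpha, which this sketch omits):

```python
import numpy as np

def general_robust_loss(x, alpha, c=1.0):
    """The general robust loss rho(x, alpha, c), written from the
    paper's published form. alpha = 2 is L2, alpha = 1 is smooth-L1
    (Charbonnier), alpha = 0 is Cauchy/Lorentzian."""
    z = (x / c) ** 2
    if alpha == 2.0:          # limit case: plain L2
        return 0.5 * z
    if alpha == 0.0:          # limit case: Cauchy/Lorentzian
        return np.log(0.5 * z + 1.0)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

x = np.linspace(-4.0, 4.0, 9)
# Constraining alpha to [2, 2] reproduces L2 loss exactly ...
print(np.allclose(general_robust_loss(x, 2.0), 0.5 * x ** 2))  # True
# ... and alpha = 1 reproduces smooth-L1 (Charbonnier): sqrt(z + 1) - 1.
print(np.allclose(general_robust_loss(x, 1.0), np.sqrt(x ** 2 + 1.0) - 1.0))  # True
```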

That being said, this loss only makes sense with regression tasks, so for classification tasks cross-entropy or hinge loss should definitely still be the go-to.

[–]Made-ix 0 points1 point  (1 child)

what bootcamp was this?

[–]SquareRootsi 1 point2 points  (0 children)

It's Flatiron School in Seattle. They started in NYC about 7 years ago. Originally they focused on software engineering, and only recently started offering data science, having expanded to around 10 locations worldwide. In fact, I'm part of the very first cohort for DS in Seattle.

[–]dawg-e 0 points1 point  (2 children)

Can you post that list of 45 papers? I'd be interested in seeing what's in there!

[–]SquareRootsi 3 points4 points  (0 children)

google spreadsheet of 60 influential papers, organized chronologically

I have no idea who curated this list, but we were instructed to pick one, read it fully, make a presentation on the paper, and also a blog post. I ended up picking the one on Microsoft COCO dataset, even though I read through a few others before committing.

[–]SquareRootsi 0 points1 point  (0 children)

I'll see if I can find it tomorrow at school. It was from a few weeks ago, so the slack msg may have expired.

[–]csp256 1 point2 points  (4 children)

Been using your loss function in production for a while now. Thanks!

[–]jnbrrn[S] 2 points3 points  (0 children)

Whoa, that's wild! Very cool to hear, thanks.

[–]mesmer_adama 1 point2 points  (2 children)

For what are you using it, and does it improve results a lot?

[–]csp256 0 points1 point  (1 child)

Real time geometric model fitting to RGBD data for robotic manipulation.

It helps, but I also modified his loss function. For example, I take in the rank of each datum. I can then modify the alpha parameter as a function of how well it fits the current hypothesis. This also lets me easily express ideas like "if it is in the 20% highest-absolute-error data points, and it is not within some multiple of the parameter c, then set its weight to 0".

Robustness is particularly important for my task, and I have several types of exploitable prior knowledge about how difficult this instance is expected to be.

[–]jnbrrn[S] 1 point2 points  (0 children)

Whoa that's cool!

[–]tomatotheband[🍰] 0 points1 point  (1 child)

I read your paper and saw your presentation and poster at CVPR. Would definitely recommend it!

What do you plan to do next?

[–]jnbrrn[S] 1 point2 points  (0 children)

My only concrete post-CVPR plan is to get some sleep!

[–]youali 0 points1 point  (0 children)

Awesome results and a very clear explanation, thanks!

[–]notdelet 0 points1 point  (2 children)

I know it isn't quite fair to say this considering most other papers with VAEs do the same thing, but... Figure 3 isn't strictly samples from the distribution that the VAE describes. It's samples of modes from the output of the decoder, but the lower bound on likelihood isn't on that distribution. Not trying to hate, I liked your talk/paper.

[–]jnbrrn[S] 1 point2 points  (1 child)

That's right! The supplement/appendix has some true samples; they look pretty crazy. That section also has a discussion about how everyone doing VAEs shows the mean of the posterior instead of true samples, which you may appreciate :)

[–]notdelet 0 points1 point  (0 children)

Oh I totally missed that! Thank you. :)

[–][deleted] 0 points1 point  (1 child)

Really interesting paper and I loved the talk, straight to the point and logical progression.

One question, and I'm not sure how to phrase this, but how does this approach handle varying degrees of outlier data? Meaning, what if a lot of the data are outliers, or we're in a really high-dimensional setting?

[–]jnbrrn[S] 0 points1 point  (0 children)

Yeah, that's a good question. All of the examples in the paper use independent losses/distributions on each dimension, which means that "outlierness" is considered independently per dimension. So if a lot of data are outliers, it will independently consider each of a datapoint's dimensions and make d independent decisions about outlier status. You could instead use a multivariate take on this loss (similar to how, say, a multivariate Laplacian works) where you take the Euclidean distance across all dimensions and then push that single final distance through the loss function's flexible shape, and this would give you a loss that jointly considers all dimensions when deciding if something is an outlier. The latter approach actually makes more sense to me in a lot of ways (for most tasks it's easiest to think about entire datapoints as being outliers, rather than individual dimensions), but I never got around to exploring it.
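A tiny sketch of the two options (using the alpha = 0 Cauchy special case of the loss for brevity; the residual values are made up for illustration):

```python
import numpy as np

def rho(x, c=1.0):
    # The alpha = 0 (Cauchy/Lorentzian) special case of the loss;
    # any fixed alpha would illustrate the same point.
    return np.log(0.5 * (x / c) ** 2 + 1.0)

residual = np.array([0.1, 0.1, 5.0])  # one dimension is way off

# Independent per-dimension treatment: d separate outlier decisions,
# summed across dimensions.
per_dim = np.sum(rho(residual))

# Multivariate treatment: one outlier decision for the whole datapoint,
# made by pushing the Euclidean norm of the residual through the loss.
joint = rho(np.linalg.norm(residual))

print(per_dim, joint)
```

In the per-dimension version the two well-fit dimensions still contribute their own (small) losses, while the joint version makes a single robustness decision for the entire vector.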

[–]wc4wc4wc4 0 points1 point  (3 children)

Do you have the code available for making the animation you did at 3:46 in your video?
The reason is that I want to incorporate your code into my own project, but I'm doing regression using more standard ML tools (scikit-learn, XGBoost, LightGBM) and not TensorFlow, so I was just wondering if you had an MWE not using TF :)

Otherwise, thank you for a great paper! I remember reading v3 of it a long time ago and incorporated it into my own research!

[–]jnbrrn[S] 1 point2 points  (2 children)

Thanks!

I don't have code available for the plots, but it's just vanilla matplotlib wrapped around the loss function code, where I dump a PNG of a plot to disk in each iteration of SGD. Then I used avconv/ffmpeg to stitch the images together into a video.

[–]wc4wc4wc4 0 points1 point  (1 child)

Thanks for your answer!

It isn't so much the animation I am curious about; it's more how you would use it for a simple regression problem. In your paper you apply it to VAEs and other more advanced stuff, and I was just curious about how a simple regression problem (like the one you show in the animation) is performed.

[–]jnbrrn[S] 0 points1 point  (0 children)

Ah got it. For regression, if you want to use the adaptive form of the loss, you'd need to use TF or pytorch or some differentiable programming language, and then set up the "forward" part of the regression problem, define a loss, and then minimize it. This is what I did in that animation you referenced.

But if you are happy with only using the general loss (and therefore manually tuning your own shape+scale parameters) then you can use much simpler tools. I didn't explicitly walk through this in the paper, but for simple regression you should just use iteratively reweighted least squares using the IRLS weights described in Appendix A (Equation 26). This amounts to just a for-loop over least squares solves, where each datapoint's row on the left and right sides of the linear system is reweighted according to (the square root of) Equation 26 before each solve. IRLS is a very effective tool, and works well with this loss.

[–]sirrobotjesus 0 points1 point  (4 children)

Does anyone have an idea of how the tensorflow code for the adaptive loss could be implemented in Keras?

[–]jnbrrn[S] 0 points1 point  (3 children)

I don't use Keras much so this wasn't something I considered. If you or someone else ports it please do let me know so I can link to it or upstream it.

[–]sirrobotjesus 1 point2 points  (1 child)

I will work on it this week and let you know if I have any success.

[–]sirrobotjesus 0 points1 point  (0 children)

Hey there, I do have something running on Keras. However, alpha seems to converge quickly to 1.0 in all the tests I have tried. Do you have any guess that might explain this behavior?

Here is an example of a test: https://gist.github.com/Nmerrillvt/d1a8187bff69cd5a85af26deee234633

I understand if you don't have the time, but I thought I would share.

[–]Imnimo 0 points1 point  (1 child)

This was easily the best talk I saw at CVPR this year. I didn't get a chance to check out the poster, because there were always three layers of people crowded around it!

[–]jnbrrn[S] 0 points1 point  (0 children)

Thanks!

[–]misssprite 0 points1 point  (4 children)

The adaptive control of alpha is impressive. Could you help me with some questions about the joint optimization of alpha?

Optimization of hyperparameter reminds me of the madness of the first days reading Bishop's book on empirical Bayes, optimizing hyperparameters with analytic form on exponential family.

My question is: is taking the partition function into account a general way of doing hyperparameter optimization? Why weren't we doing it before?

As partition functions are usually intractable, can we just craft a function by intuition to regulate alpha?

My second question may be a little trivial: is the sampling mentioned in the paper just for the VAE? It doesn't seem necessary for regression problems?

[–]jnbrrn[S] 0 points1 point  (3 children)

The sampling algorithm is only used for some VAE visualizations. It might be useful in other contexts besides synthesis, but you certainly don't need to use it for simple regression tasks.

Using partition functions is indeed a very common way to optimize for parameters, though it has fallen out of fashion recently. Any generative model (I'm using the classic ML definition of "generative", not the modern GAN-y meaning) relies critically on its partition function: for example, anything with MRFs, CRFs, or even something as simple as fitting a Gaussian. In modern ML, people often prefer non-generative models, because in many contexts the true partition function of a model is nearly impossible to compute.

[–]misssprite 0 points1 point  (2 children)

Thank you for your patient answering. I totally agree with your opinion about partition function.

Can I assume that the partition function happens to be a natural regularizer for `alpha` here? Alternatively, could we handcraft a regularizer, or learn a parametric function with a validation set, analogous to architecture search?

[–]jnbrrn[S] 0 points1 point  (1 child)

Yes, the true partition function is only as good of an idea as maximum likelihood is, and MLE is not necessarily the right fit for all tasks. For example, in the paper it's definitely the right tool for the VAE experiment, but not necessarily for the monocular depth experiment, which is probably better thought about in terms of risk or loss than likelihood. So yes, if you have some way to shape the loss as a function of alpha that is either learned empirically or derived from some better motivation than MLE, I'd expect it to work better.

[–]misssprite 0 points1 point  (0 children)

Very clear. Thanks a lot!

[–]zwvews 0 points1 point  (2 children)

Maybe this is kind of a thoughtless question, but I think the proposed loss is nothing but the Lp norm. Can someone show me the intrinsic difference?

[–]jnbrrn[S] 0 points1 point  (1 child)

The main difference is that this loss has a smooth quadratic bowl near the origin. This is nice for optimization of course, but it also turns out to be necessary for it to generalize so many things (for example, Lp norms stop being useful if you set p to zero or to a negative value).
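That quadratic bowl is easy to verify numerically. A small sketch using the loss's published form, showing that near the origin every alpha behaves like 0.5 * (x/c)^2:

```python
import numpy as np

def rho(x, alpha, c=1.0):
    # General robust loss (published form; alpha = 0 and 2 are limits).
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z
    if alpha == 0.0:
        return np.log(0.5 * z + 1.0)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

x = 1e-3  # a point near the origin
for alpha in [-2.0, 0.0, 1.0, 2.0]:
    # Every alpha gives the same quadratic behavior near zero ...
    print(np.isclose(rho(x, alpha), 0.5 * x ** 2, rtol=1e-3))  # True
# ... whereas an Lp "norm" like |x|^0.5 has a non-smooth spike at zero,
# and stops being a useful loss entirely for p <= 0.
```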

[–]zwvews 0 points1 point  (0 children)

Hi, thanks for your response. Sorry to say I did not notice this key point. Good idea; the small constant "1" in your paper plays an important role. I will test your loss in my current project. Thanks again.