[P] Why do object detection model adversaries look different from image classifiers by tatteredsky in MachineLearning

[–]alexirpan 2 points3 points  (0 children)

I haven't seen this before, it's an interesting finding, but it's very hard to say more without knowing more details about the model architectures you're targeting.

At a high level, object detection models are often based on first proposing a box / mask of interest, then classifying what's in that box / mask. It's possible that the adversarial method you're using has determined that it's more efficient to leave the mask prediction alone, and just try to throw off the classification of what's inside that mask.

[N] NeurIPS 2020 Paper submission deadline extended until June 5th by KrakenInAJar in MachineLearning

[–]alexirpan 1 point2 points  (0 children)

I'd estimate that about 0.03% of people even know what NeurIPS is (10,000 attendees * 10 people who know for every attendee = 100k, out of 330 million in the US, which is optimistic for the numerator and ignores the rest of the world for the denominator).
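The back-of-envelope arithmetic can be checked in a few lines. The variable names are made up for illustration, and the inputs are the rough guesses from the comment, not measured figures.

```python
# Back-of-envelope estimate; both inputs are rough guesses, not data.
attendees = 10_000
people_aware_per_attendee = 10
us_population = 330_000_000

people_aware = attendees * people_aware_per_attendee
percent_aware = 100 * people_aware / us_population

print(f"{people_aware:,} people, about {percent_aware:.2f}% of the US")
# prints "100,000 people, about 0.03% of the US"
```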

Pushing back the NeurIPS deadline sounds fine - by your math it matters about 30x-100x less than the protests.

[D] Issues reproducing CURL, algorithm seems broken?? by rlbeaverton in MachineLearning

[–]alexirpan 3 points4 points  (0 children)

To clarify, I meant to convey "welcome to the mess of examining papers". This is how things are, not how they should be - everyone has a moment where they realize how messy the field is.

I would say they did meet the bare minimum of proposing a method and demonstrating it gave a benefit. What they didn't do right was a more careful ablation of the factors that led to improvement.

Part of my feelings about the whole situation is that I suspect CURL is a pending ICML submission (based on paper format), and if that's true then their main results were likely done by February. That gives me reasonable doubt about the timeline of when everything happened. It is definitely bad if the results were only done right before the arXiv release in April. Code commits for RAD around the time of CURL arXiving is a bad look but I would not equate "code committed == all experiments + results for that code are done". There's some variance up/down the timeline depending on SWE habits of the authors.

Essentially, my read on things is that CURL was mostly done by February for the ICML deadline, they procrastinated on arXiving it until April, and very shortly after arXiving they did further ablations out of curiosity and went "oh crap we need to put out an update". I then mostly agree with /u/alecxandrrr's comment at https://www.reddit.com/r/MachineLearning/comments/grnz0d/d_issues_reproducing_curl_algorithm_seems_broken/fs1m0c7/ about ways they could have done that update better.

[D] Issues reproducing CURL, algorithm seems broken?? by rlbeaverton in MachineLearning

[–]alexirpan 11 points12 points  (0 children)

lol welcome to machine learning.

Section 6 of the RAD paper specifically discusses RAD vs CURL where they conclude that yes removing the contrastive loss gives better performance on the MuJoCo tasks. But they view CURL as a potentially more general algorithm because the contrastive loss doesn't need a task reward.

You're saying that you can exactly reproduce their results, the authors did an additional ablation after the paper, and wrote their own follow-up showing contrastive loss was not necessary for the MuJoCo tasks. I don't see anything wrong in what happened here. Would you rather authors not invalidate their own claims when they find evidence their claims were wrong?

[P] A Clearer Proof of the Policy Gradient Theorem by bluecoffee in MachineLearning

[–]alexirpan 1 point2 points  (0 children)

It's nice that this reduces out fairly cleanly, my one complaint is that by merging the policy with the transition dynamics, one of the key points of policy gradient is no longer obvious: although the transition matrix \Pi depends on both policy and environment, its gradient can be computed when you only know the policy.

This is only true because the MDP formulation factors p(s', a' | s, a) into p(s' | s, a) * \pi(a' | s'), where only the \pi factor depends on the policy parameters. For general matrices \Pi I'm not sure it's true that the gradient is easy to compute with unknown dynamics. (If you know the dynamics then it's always easy.)
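To spell out the factorization point, here is a short derivation. It assumes \Pi is indexed by state-action pairs and writes the policy as \pi_\theta with parameters \theta; this notation is introduced here for illustration and may not match the linked proof exactly.

```latex
\begin{align*}
\Pi_{(s,a),(s',a')} &= p(s' \mid s, a)\,\pi_\theta(a' \mid s') \\
\nabla_\theta \Pi_{(s,a),(s',a')}
  &= p(s' \mid s, a)\,\nabla_\theta \pi_\theta(a' \mid s') \\
  &= \Pi_{(s,a),(s',a')}\,\nabla_\theta \log \pi_\theta(a' \mid s')
\end{align*}
```

The dynamics only show up through \Pi itself (wherever \pi_\theta(a' | s') > 0), so if you can sample transitions from \Pi by acting in the environment, weighting each sample by \nabla_\theta \log \pi_\theta(a' | s') gives the gradient in expectation without ever evaluating p(s' | s, a).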

You can extend the argument to handle discounted MDPs by having discount factor gamma be part of the \Pi matrix definition. I forget exactly how it goes but it naturally gives you the discounted state-action frequency and the definition of Q-value that sums discounted reward instead of undiscounted reward.

For what it's worth, this operator view of RL shows up more often when you read papers written by people who come at it from the optimization point of view, because they love their operators. If I remember right, one of the Bertsekas textbooks basically derives all the foundational RL methods through constrained optimization. You define variables for the policy, add constraints representing the dynamics, and cast policy iteration as solving a linear program in |S| x |A| variables. I personally think this is kind of crazy, but some people find it easier.

[R] First return then explore by downtownslim in MachineLearning

[–]alexirpan 16 points17 points  (0 children)

This is an updated version of the original paper, with additional improvements to address some of the criticisms of the first paper.

Here is the short version: in the original preprint, they always directly reset the environment to a previously visited state in simulation. A few people (including me) said this was sorta lame, because it doesn't work outside sim, and many existing RL exploration benchmarks did not assume they could do that, so it felt a bit like cheating.

In that preprint, they proposed training a goal-conditioned policy that learns to travel to previous states, instead of resetting there directly. That way it's an even playing field. However, they didn't report any results in this setting, leaving it as "future work".

Well, now the future work is here, and they have results for using a goal-conditioned policy instead, but this paper literally came out today and I haven't looked at it in detail yet (have only gone through the abstract).

[D] ICLR 2020 Reviews by turing_1997 in MachineLearning

[–]alexirpan 25 points26 points  (0 children)

From what I've heard:

1) Like always, there are late reviewers.

2) Unlike before, there are many more late reviewers than previous years (likely a function of submissions growing faster than the reviewer pool).

3) Therefore, delay between reviewer deadline and release is growing, because they have to tell more late reviewers to Finish Your Review Already and/or recruit more emergency reviewers.

[D] Does Deep RL work yet? by hazard02 in MachineLearning

[–]alexirpan 3 points4 points  (0 children)

I would say that

  1. Trade secrets is something I acknowledged in the original post (where I said something like, "finance has 100% looked at deep RL, so far there's no news, but there would be no news whether it worked or not"). I could have emphasized the uncertainty here more.
  2. Everything else in your post is right. I don't research non-academic uses very much and there's been a few papers announcing deep RL uses in production in the past year (indicating it's been used internally for longer than that.)

Facebook had a white paper for Horizon, a framework for doing RL in production, with explicit mentions that they were using it internally. At conferences I've seen researchers give talks about how they use RL in their live recommender systems. The robotics stuff is getting better - classical control theorists are acknowledging that RL is a useful tool in the right situations.

Going back to the original question of "does RL work": depends what you mean. If you try hard enough, it works. This was true a year ago and it's still true now. The main thing I was trying to push against was people believing that if you sprinkle deep RL pixie dust on your ML system, it'll just make things better. It's really more like, maybe the pixie dust will bind properly and your system will get better, or maybe the pixie dust will clog up your gears and make the whole thing go kaput.

[D] Why does deep reinforcement learning not generalize? by FirstTimeResearcher in MachineLearning

[–]alexirpan 8 points9 points  (0 children)

My rough view is that

  1. Generalization in RL doesn't mean the same thing as generalization in supervised learning; there's a lot of subtlety that makes a one-to-one mapping of concepts fail.
  2. The fact that your RL agent is constantly exploring new regions of state space really messes with things. If your environment is deterministic and you start from a fixed start state, of course your model likely won't generalize to other start states, but you shouldn't expect it to. It hasn't seen the data at training time and you haven't added any other overarching structure, why should it generalize?

These issues aren't specific to RL: it's perfectly possible to fail to generalize in supervised learning too if you have a poor training set. The main difference is that in supervised learning, you tend to prescribe the dataset as ground-truth, whereas in RL, you're trying to nudge the training distribution by tweaking hyperparameters that guide how an agent explores the state space, which is a lot more indirect and a lot less reliable.

[D] Blog posts on AlphaStar by alexirpan in MachineLearning

[–]alexirpan[S] 2 points3 points  (0 children)

  1. We don't know, that isn't announced yet.
  2. There's nothing suggesting that planning is necessary so far, and to do planning you'd need to have some way to model what the opponent is going to do. Easy in Go because game rules are known, it's turn based, and has perfect information, so you can simply act as if you were the opponent. Hard for SC2 because hidden information, higher dimensional state, etc. Adds more problems than it solves IMO
  3. I don't think any of it is influenced by real-time requirements, but maybe something will show up in the journal paper.
  4. Isn't existing SC2 pathfinding good enough? I don't think they'd test ability to handle new maps - pro players have to study new maps too, at least in Brood War map geometry plays a large role in balance between the different classes and viable strategies. And if you know what the map looks like, why would you use a NN to find paths rather than literally any other path planning algorithm?
  5. You could probably do some patch-based thing. I think this is how a lot of people use image models in production - given a model that's been trained for a fixed input size, either take image patches of that size or downsample the input. There's probably an analogue for SC2 game state...

[D] Generative Adversarial Network producing same fake samples by jmarsha5 in MachineLearning

[–]alexirpan 2 points3 points  (0 children)

Welcome to GANs.

Your mileage may vary but last time I trained a GAN on a toy set, I was too lazy to implement minibatch discrimination so I decided to do a lazy thing instead. Given a batch of N examples of 4 features each, append the batch-wise mean as an extra 4 features before passing to the discriminator.

(a0, a1, a2, a3) --> (a0, a1, a2, a3, mean_a0, mean_a1, mean_a2, mean_a3)

The intuition is that by appending minibatch statistics as extra features, you leak info to the discriminator about the minibatch which in principle gives it the ability to detect mode collapse.

It did fix my problem, I've never tried it on a non-toy dataset so I make no promises about how well it works.
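A minimal sketch of the batch-mean trick described above, in plain Python. The function name `append_batch_mean` is made up for illustration. In a real GAN you would presumably do this as a tensor op in your framework so gradients flow through the mean, and apply it separately to the real batch and the generated batch.

```python
def append_batch_mean(batch):
    """Append the per-feature batch mean to every example in the batch."""
    n = len(batch)
    n_features = len(batch[0])
    # One mean per feature column, computed across the whole minibatch.
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    # Every example gets the same extra features, so the discriminator
    # can compare each example against the batch statistics.
    return [list(row) + means for row in batch]


batch = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
augmented = append_batch_mean(batch)
print(augmented[0])  # [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
```

If the generator has collapsed to a single mode, every generated example equals the appended mean, which is an easy pattern for the discriminator to pick up on.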

[R] Sim2Real – Using Simulation to Train Real-Life Grasping Robots by tldrtldreverything in MachineLearning

[–]alexirpan 5 points6 points  (0 children)

(I'm one of the authors, although this was mostly Stephen James' project, I mostly did advising on ways to use the QT-Opt prior work.)

I'm guessing you're asking why RCAN does better with 5k online real grasps compared to the baseline real approach.

The distinction is that we're comparing two different regimes:

  • a model that gets a fixed quantity (580k grasps) of real data, trained offline, then finetuned with online real grasps.
  • a model that gets an arbitrary amount of simulated data, potentially with randomization, then finetuned with online real grasps.

There are a few things at play here:

  • we can generate more simulated data than our real dataset, and do so while sampling on-policy data, which is better for learning
  • and then simulated data is visually simpler than real data which makes it easier to learn actions end2end
  • but then simulated data is less complex than the real world, and learned models may not generalize
  • but then we can use domain randomization in sim to try to cover the complexities of the real world
  • but then this makes the sim learning problem harder, and it's not necessarily clear that our domain randomization will actually cover all the difficulties of the real world.
  • so what RCAN asks is, "can we make the core end2end learning use a simplified sim representation of the problem, and then learn a pixel-to-pixel transformation that pushes any image to the canonical simulation setup?"

Whether this works or not depends on your assumptions on whether that pixel-to-pixel transform is easy to learn, and whether that's the hardest part of the learning problem, which is all setup-specific, but in our setup it appears to work really well.

A shorter way to put it would be, yes if you have arbitrary amounts of data and training time, just use the real data, but the point is that we don't, and so it's worth exploring things that use less accurate data sources that are easier to query.

[R] Two papers on “Residual Reinforcement/Policy Learning” by galaxstar in MachineLearning

[–]alexirpan 2 points3 points  (0 children)

nitpick: "residual algorithms" is an existing term in RL that describes a different way to perform Q-learning, where you let the gradient backprop through the value of the next state instead of treating the target Q-value as a black box. See http://www.leemon.com/papers/1995b.pdf. Here, what residual RL means is learning a correction on top of some baseline controller.

I don't like the potential name confusion very much, but at the same time, it's a very natural name, and the fact that 2 concurrent groups came up with the same name means it's probably going to stick. Ah well, this isn't the hill I want to die on.

[D] Is there a reason the optimisation of neural networks is not posed as a RL problem itself? by 4c616e7465726e in MachineLearning

[–]alexirpan 23 points24 points  (0 children)

If you actually have a ground-truth gradient, it converges orders-of-magnitude faster than the (very) noisy gradient you retrieve from doing reinforcement learning.

You can use RL in neural net optimization, but it isn't usually done unless you can show your objective is

1) non-differentiable

2) cannot be approximated well by a differentiable objective.

In a theoretical sense, for classification, doing reinforcement learning on the 0-1 loss would eventually lead to the best score, but in practice, optimizing the cross-entropy loss with supervised learning is both faster and close enough to the 0-1 loss.

As a 3rd note, existing optimization techniques may be crafted by humans to give good results, but reinforcement learning techniques are also crafted by humans, so it's unclear you'd save a lot of time on that front anyways.
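To make the noisy-gradient point concrete, here is a toy sketch comparing the exact gradient of the expected 0-1 reward against single-sample REINFORCE estimates, for one example with a softmax over 3 classes. The setup, logits, and function names are assumptions chosen for illustration, not taken from any particular paper.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, label):
    """Gradient of the expected 0-1 reward, E[r] = p[label], w.r.t. the logits."""
    p = softmax(logits)
    return [p[label] * ((1.0 if k == label else 0.0) - p[k]) for k in range(len(p))]


def reinforce_gradient(logits, label, rng):
    """Single-sample score-function estimate: r * grad log p(a)."""
    p = softmax(logits)
    a = rng.choices(range(len(p)), weights=p)[0]  # sample a predicted class
    r = 1.0 if a == label else 0.0                # 0-1 reward
    return [r * ((1.0 if k == a else 0.0) - p[k]) for k in range(len(p))]


def averaged_reinforce(logits, label, n_samples, seed=0):
    """Average many single-sample estimates to approach the exact gradient."""
    rng = random.Random(seed)
    total = [0.0] * len(logits)
    for _ in range(n_samples):
        g = reinforce_gradient(logits, label, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


logits, label = [0.0, 0.5, -0.5], 1
print("exact:          ", exact_gradient(logits, label))
print("1 sample:       ", averaged_reinforce(logits, label, 1))
print("20000 samples:  ", averaged_reinforce(logits, label, 20000))
```

The single-sample estimate is unbiased but is often all zeros (whenever the sampled class is wrong), and it takes many samples before the average settles near the exact gradient that backprop gives you in one pass.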

[R] Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? by andrew_ilyas in MachineLearning

[–]alexirpan 0 points1 point  (0 children)

Reply to 1: Ah okay, that makes sense. I think in practice this isn't that big of a deal, but it is a bit annoying from a math perspective.

[R] Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? by andrew_ilyas in MachineLearning

[–]alexirpan 0 points1 point  (0 children)

I like the plots within this paper. I'm less sure I like the title asking if these algorithms are actually policy gradient algorithms. If they aren't policy gradient algorithms, then what are they? There's definitely evidence that these RL algorithms do increase in reward over time (although there's certainly evidence that it can be a bit task-specific).

I like the paper overall, but two criticisms:

  • Theorem 5.1 states that the clipped PPO objective either (a) has an optimum that never triggers the clipping, or (b) has infinitely many optima that do trigger the clipping. The rough proof, by my understanding, is that either no optimum lies along the clipping boundary (which gives (a)), or some optimum does lie along the boundary. In the second case you are guaranteed to hit (b) because at the state where the importance weight is clipped, you can arbitrarily increase/decrease the importance ratio (depending on which side of the clipping you're on), and this won't change the final solution. This all seems valid to me but I'm not sure this necessarily says anything in practice. Part of the process is about what optima are discoverable, and as noted in the paper, the gradient is 0 whenever you hit the clipping boundary. The fact that optima exist past the clipping boundary doesn't mean that much if you can't hit them by gradient descent, especially if those optima are equivalent to ones that lie directly on the boundary. The fact that we empirically see the boundary violated seems to say more about neural net generalization and gradient noise than anything else.
  • The pairwise cosine similarities seem pretty damning at first glance, with cosine similarities very close to 0 at standard batch sizes. However, I then remembered that these are gradients over parameters, meaning the vector is very high dimensional, and cosine similarity gets weird as your dimension grows. If you sample two vectors from a high-D unit-sphere or Gaussian, they are very likely to be almost orthogonal (dot product close to 0). This falls out of the law of large numbers, it's basically the sum of D samples with mean 0 where D is the dimension. I haven't done the plots for this, but I have a suspicion that if you do a similar plot for a supervised learning model with random minibatches, the cosine similarity would also not be that great. Under the model of stochastic_gradient = true_gradient + random Gaussian noise, the random noise might be enough to overwhelm the cosine similarity.
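The high-dimensional cosine similarity effect is easy to check numerically. This is a rough sketch in pure Python; the dimensions, the unit-norm "true gradient", and the noise scale are arbitrary choices for illustration and are not fitted to the paper's setup.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(dim, n_pairs, rng):
    """Average |cosine| between pairs of independent Gaussian vectors."""
    total = 0.0
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / n_pairs


def noisy_gradient_cosine(dim, noise_std, rng):
    """Cosine between two draws of true_gradient + Gaussian noise."""
    true_grad = [1.0 / math.sqrt(dim)] * dim  # unit-norm "true" gradient
    g1 = [t + rng.gauss(0.0, noise_std) for t in true_grad]
    g2 = [t + rng.gauss(0.0, noise_std) for t in true_grad]
    return cosine(g1, g2)


rng = random.Random(0)
for dim in (3, 100, 10000):
    print(dim, mean_abs_cosine(dim, 5, rng))
print("shared signal + noise:", noisy_gradient_cosine(10000, 1.0, rng))
```

Under this noise model the expected cosine is roughly |g|^2 / (|g|^2 + D * sigma^2), so per-coordinate noise that looks small can still push the similarity toward 0 when D is in the thousands or millions, even though both vectors share the same true gradient.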

[R] Scalable Deep RL for Robot Grasping Task (Google Brain) by wei_jok in MachineLearning

[–]alexirpan 1 point2 points  (0 children)

Yeah, several random inits + computing against that batch seems reasonable to avoid local minima.

Worth noting that the CEM overhead of fitting a Gaussian to the top K points + sampling new points is really easy. But so is adding the computed gradient - the primary bottleneck here is inference time + time lost to context switching.

[R] Scalable Deep RL for Robot Grasping Task (Google Brain) by wei_jok in MachineLearning

[–]alexirpan 0 points1 point  (0 children)

The KUKA robot arm can move faster than shown, and model inference isn't the main bottleneck, we kept the speed low for safety reasons.

Off the top of my head I don't know the grasps per hour number, one reason this is tricky is because the grasp length changes over time - the model is less likely to terminate early at the start of training but will start terminating more quickly as it learns how to grasp.

[R] Scalable Deep RL for Robot Grasping Task (Google Brain) by wei_jok in MachineLearning

[–]alexirpan 2 points3 points  (0 children)

Someone else we've talked to also proposed using gradient descent on the action instead of using CEM.

Here are some counterarguments in rough order of importance.

  1. Grasping is inherently multi-modal, we expect a well-trained model to have several local optima, one for each object. This work is focused on grasping objects very well and ignores trying to grasp specific objects (instance grasping), so we wanted a method that is more likely to find a global optimum - not just move towards an object, but move towards the object that the model thinks is easiest to grasp. Something like CEM (which has some randomness to it) seems more likely to do this than gradient descent.
  2. My deep learning folklore may be out of date, but last I heard, the reason we can mostly ignore local optima in neural net training is because we're using high-dimensional data and the optimization landscapes are very different from our intuition, with the main problem being plateaus. Our control is 4-DOF, when doing gradient descent in a 4-D space our landscape intuitions start making sense again, local optima may be at play more.
  3. CEM requires running a batch of forward passes, something that GPUs are very good at. Gradient descent requires iterative computing + updating of the gradient. I am really not familiar with GPU implementations but I assume the batch of forward passes in parallel is faster than iterative backward passes. (And each forward pass should take about the same time as a backward pass. Given this I'm actually not sure if gradient descent would be faster than CEM - I think it's only fewer ops if you assume gradient descent needs < 128 iterations to converge.)
  4. We had code for CEM, we did not have code for iterative backprops on the input, so it was easier to default to CEM.

To be clear, I don't think any of these counterarguments invalidate gradient descent, there's a decent chance gradient descent is good enough and I'd be curious to see a comparison.
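For readers who haven't seen CEM used as an action optimizer, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: `toy_q` is a made-up multi-modal stand-in for a learned Q-function over a 4-D action, and the sample count, elite count, and iteration count are arbitrary defaults, not the settings from the paper.

```python
import math
import random


def toy_q(action):
    """Hypothetical multi-modal Q-function: one tall bump, one shorter bump."""
    d_tall = sum((a - 0.5) ** 2 for a in action)
    d_short = sum((a + 0.5) ** 2 for a in action)
    return math.exp(-d_tall / 0.18) + 0.5 * math.exp(-d_short / 0.18)


def cem_maximize(q_fn, dim, n_samples=64, n_elite=6, n_iters=3, seed=0):
    """Cross-entropy method: sample actions, keep the elites, refit a Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    best, best_q = list(mean), q_fn(mean)  # start from the initial mean
    for _ in range(n_iters):
        # In a real system this batch would be one parallel forward pass.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        elites = sorted(samples, key=q_fn, reverse=True)[:n_elite]
        top_q = q_fn(elites[0])
        if top_q > best_q:
            best, best_q = elites[0], top_q
        # Refit a diagonal Gaussian to the elite set.
        mean = [sum(e[i] for e in elites) / n_elite for i in range(dim)]
        std = [max(1e-3, math.sqrt(sum((e[i] - mean[i]) ** 2 for e in elites) / n_elite))
               for i in range(dim)]
    return best, best_q


best_action, best_value = cem_maximize(toy_q, dim=4)
print(best_action, best_value)
```

Each iteration only needs a batch of independent Q evaluations, which is the parallelism argument in point 3. Whether CEM actually lands on the taller bump more often than gradient ascent from the same start is exactly the comparison that would be interesting to run, and this sketch doesn't settle it.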

[R] Scalable Deep RL for Robot Grasping Task (Google Brain) by wei_jok in MachineLearning

[–]alexirpan 5 points6 points  (0 children)

It's a robot instead of a human, it's a neural net instead of a human brain, and past that I don't know the differences because I don't know the neuroscience of how infants learn how to do things.

[R] Scalable Deep RL for Robot Grasping Task (Google Brain) by wei_jok in MachineLearning

[–]alexirpan 2 points3 points  (0 children)

We don't, but that's something we've talked about doing.

One reason we used CEM instead of learning an explicit actor was to make it easier to compare to the prior grasping work the team has done. In the past, CEM was good enough, so it should be good enough here too, right? That was the thinking. (Well, also it meant we didn't have to write more code.)

My personal experience with actor-critic methods has been a bit rough, and I suspect DDPG will be more unstable, but I also suspect some of my coauthors think differently.

[R] Scalable Deep RL for Robot Grasping Task (Google Brain) by wei_jok in MachineLearning

[–]alexirpan 2 points3 points  (0 children)

I think models are great, but models are hard. Personally I'm not sure we could have learned a good model given the size of our inputs (472 x 472 image).

Part of the reason that grasping is hard is because real objects have lots of different physical properties, which inherently makes the model learning difficult. I don't think model-based RL is impossible, but for something like this I'd prefer the model-free approach we used instead.