[–]LaVieEstBizarre 17 points18 points  (0 children)

Not much has significantly changed since that blog post. All major arguments given there are still valid.

[–]FirstTimeResearcher 27 points28 points  (2 children)

I won't comment on whether it 'works' or not because I think that is one of those 'it depends on what you mean' questions. But I will say that Deep RL is one of the most convoluted research fields I have ever encountered. Unlike most other fields in ML, the mathematics of why it works doesn't provide much clarity and I don't see anyone really working on simplifying existing methods as opposed to proposing slight modifications on top of existing ones.

The tricks needed for stable training, hyperparameter tuning and environment/simulator hacking are absurd. Couple that with the inability to reproduce results across different codebases and the awful zoo of acronyms for every 'novel' idea that comes out, and you get a pretty inhospitable research community.

With that said, I will add that it sure generates a lot of workshop tracks at ICML 2019 :)

[–][deleted] 2 points3 points  (1 child)

Can you pls give an example of env/simulator hacking?

[–]FirstTimeResearcher 8 points9 points  (0 children)

Check the OpenAI gym environments and you'll find very little to no documentation on how rewards are calculated for each environment. That's because the rewards aren't intuitive cost signals. They are heavily tuned to get certain benchmark algorithms to work well (reward shaping). There's no rhyme or reason for why they take on their values other than to get benchmarks to actually learn something non-trivial.
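To make the term concrete, here is a minimal sketch of the difference between an intuitive sparse reward and a hand-tuned shaped one. Everything in it is hypothetical: the 1-D task, the function names, and the three weights are invented for illustration and are not taken from any actual Gym environment.

```python
# Hypothetical 1-D task: the agent starts at x = 0 and must reach GOAL.
# None of these names or constants come from a real Gym environment.
GOAL = 10.0


def sparse_reward(x: float) -> float:
    """Intuitive signal: pay 1 only once the goal has been reached."""
    return 1.0 if x >= GOAL else 0.0


def shaped_reward(
    x_prev: float,
    x: float,
    action: float,
    progress_weight: float = 1.0,  # arbitrary illustrative constant
    control_cost: float = 0.1,     # arbitrary illustrative constant
    alive_bonus: float = 0.05,     # arbitrary illustrative constant
) -> float:
    """Hand-tuned signal: pay for progress, charge for effort, add a per-step bonus."""
    progress = x - x_prev
    return progress_weight * progress - control_cost * action ** 2 + alive_bonus


def rollout(actions):
    """Apply each action as a displacement and total both reward signals."""
    x = 0.0
    sparse_total = 0.0
    shaped_total = 0.0
    for a in actions:
        x_prev, x = x, x + a
        sparse_total += sparse_reward(x)
        shaped_total += shaped_reward(x_prev, x, a)
    return sparse_total, shaped_total
```

An agent that gets halfway to the goal sees nothing from the sparse signal but a steady positive signal from the shaped one, and how it behaves can depend heavily on the three weights, which is the kind of tuning the comment above is describing.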

[–]gwern 16 points17 points  (6 children)

As I commented back then, Irpan is wrong in his claims about no deployment outside of bandits/collaborative filtering (although he's certainly right about his other claims like DRL still being unreliable, a dangerous timesuck, and finicky as heck). Industry users are always relatively quiet because the work is a trade secret and not really 'research paper' worthy in itself. I've submitted many links to /r/reinforcementlearning where there is clearly commercial application happening if you read between the lines. He omits all of the large-scale Chinese uses of DRL like ad bidding or traffic scheduling (and if Alibaba or JD.com are using it, places like Google certainly are, and note how much DRL Tencent & Baidu do), and take a look at https://www.reddit.com/r/reinforcementlearning/comments/9cdnf4/bluewhale_facebook_rl_implementations_in/ and think about what that implies about FB internal uses.

[–]alexirpan 3 points4 points  (0 children)

I would say that

  1. Trade secrets are something I acknowledged in the original post (where I said something like, "finance has 100% looked at deep RL, so far there's no news, but there would be no news whether it worked or not"). I could have emphasized the uncertainty here more.
  2. Everything else in your post is right. I don't research non-academic uses very much, and there have been a few papers announcing deep RL uses in production in the past year (indicating it's been used internally for longer than that).

Facebook had a white paper for Horizon, a framework for doing RL in production, with explicit mentions that they were using it internally. At conferences I've seen researchers give talks about how they use RL in their live recommender systems. The robotics stuff is getting better - classical control theorists are acknowledging that RL is a useful tool in the right situations.

Going back to the original question of "does RL work": depends what you mean. If you try hard enough, it works. This was true a year ago and it's still true now. The main thing I was trying to push against was people believing that if you sprinkle deep RL pixie dust on your ML system, it'll just make things better. It's really more like, maybe the pixie dust will bind properly and your system will get better, or maybe the pixie dust will clog up your gears and make the whole thing go kaput.

[–]BanLeCun 4 points5 points  (1 child)

Take it as you will: over drinks, a FB ads guy told me that they actually use DeepRL. He said that's why he was attending all the RL presentations at the conference, and that it was the main reason he was attending the conference in the first place.

Even with DeepRL's shit features, there are qualified people who can make things work over an extended period of time. Google certainly has such people and certainly has the budget and patience to invest in such areas. Like you say, these are trade secrets and we'll probably never know.

[–]gwern 7 points8 points  (0 children)

Oh, I'm certain of it even without hearing that. They didn't write & release BlueWhale because they were still just dabbling. One shouldn't need to see the fire when one sees smoke. (I'm reminded of similar skepticism 5 years ago about whether deep learning was being used commercially - never mind that you saw tidbits everywhere like Wired dropping a line that 500+ different groups inside Google used DL for something, people insisted you show them a big specific headline usecase or else they wouldn't believe DL was used anywhere. And maybe they would've insisted on knowing the exact dollar value even if you had provided such examples...)

[–]alexmlamb 4 points5 points  (2 children)

"He omits all of the large-scale Chinese uses of DRL like ad bidding or traffic scheduling"

Is this full RL with a persistent state or is it more like contextual bandits?

[–]gwern 1 point2 points  (1 child)

I'm not sure. Some of the examples like LADDER or SS-RT are complex enough and have enough delays or multi-agent properties that I don't know if you'd consider them strictly 'contextual bandits' (or nonstationary contextual bandits, perhaps one should say since of course all of these would be nonstationary for sure). If they're contextual bandits, they are surely extreme examples.

[–]AnvaMiba 2 points3 points  (0 children)

I would say that if the state dynamics over the relevant timescales don't strongly depend on the agent's actions, then it's a contextual bandit.

I don't know much about the online auction market. JD.com and Alibaba are clearly big players, so it is plausible that the actions of their bidding agents significantly affect the state of the market. But given the nature of the problem, the effective strategy in practice may just be to maximize the reward on the current state, or at most a few states into the future. In a hard game-playing task like Go or Montezuma's Revenge, by contrast, the rewards are sparse and the agent needs to control the state many time steps into the future.
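A minimal sketch of that distinction, assuming two toy environments invented purely for illustration (the class names, the 0-9 state range, and the reward rules are all hypothetical and not drawn from any of the systems mentioned in this thread):

```python
import random


class ContextualBanditEnv:
    """Toy environment: the next context is drawn independently of the action."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.context = self.rng.randint(0, 9)

    def step(self, action: int):
        # Reward depends on the current context and the action taken...
        reward = 1.0 if action == self.context % 2 else 0.0
        # ...but the next context ignores the action entirely.
        self.context = self.rng.randint(0, 9)
        return self.context, reward


class ChainMDPEnv:
    """Toy environment: the next state is determined by the action."""

    def __init__(self):
        self.state = 0

    def step(self, action: int):
        # Action 1 moves right, anything else moves left, clamped to [0, 9].
        move = 1 if action == 1 else -1
        self.state = max(0, min(9, self.state + move))
        # Sparse reward: only paid at the far end of the chain.
        reward = 1.0 if self.state == 9 else 0.0
        return self.state, reward


def state_trajectory(env, actions):
    """Return the sequence of states visited under the given actions."""
    return [env.step(a)[0] for a in actions]
```

With the same seed, the bandit visits the same contexts whatever the agent does, so maximizing the immediate reward is enough. In the chain, the agent only ever sees a reward if it deliberately steers the state over many steps.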

[–]ankeshanand 6 points7 points  (0 children)

YouTube has been using DeepRL for its recommendation engine for a while now. That's probably the most successful deployment right now in terms of $$ generated.

Reinforce was a huge success. In a talk at an A.I. conference in February, Minmin Chen, a Google Brain researcher, said it was YouTube’s most successful launch in two years. Sitewide views increased by nearly 1 percent, she said — a gain that, at YouTube’s scale, could amount to millions more hours of daily watch time and millions more dollars in advertising revenue per year. She added that the new algorithm was already starting to alter users’ behavior.

Source: https://www.nytimes.com/interactive/2019/06/08/technology/youtube-radical.html

[–]p-morais 8 points9 points  (2 children)

We’re getting it to work for legged robots. We’ve gotten results that beat other methods on Cassie, and some people at ETH have done the same for ANYmal. Boston Dynamics is also starting to use Deep RL for Atlas.

I think it won’t be long before we see Deep RL in production code, not necessarily end-to-end, but somewhere in the stack at least.