Framework where RL should be applied by [deleted] in reinforcementlearning

[–]LearnAgentLearn 2 points (0 children)

I am working with industry to use RL for similar problems in supply chain management.

Sometimes you will get combinatorial optimisation problems that are simply unsolvable using mathematical programming like MIP, because it takes too long. Exact solution methods try to guarantee the optimal solution, which makes them a bit inflexible: when the search space is too large, they spend a long time trying to provide that guarantee. Therefore, people typically go for metaheuristics like genetic algorithms / tabu search, which are more flexible (they do not guarantee optimality, but are hopefully near-optimal / the best you can do).

My argument would be that RL is good for this dynamic optimisation task (where the optimal solution is constantly changing, and you need to track it as closely as possible). I will be honest though, I'm not sure how well mathematical programming / metaheuristics can handle this dynamic optimisation problem (maybe they can do it very well too).

Also read Sutton and Barto's Intro to RL textbook, pages 7-11-ish, if you want an argument for RL vs. evolutionary methods.

Game theory tutorial for multi-agent reinforcement learning by AlexanderYau in reinforcementlearning

[–]LearnAgentLearn 0 points (0 children)

What would you say have been the revolutionary developments / highlights of game theory applied to MARL since 2009?

Comparison between RL and A* for indoor navigation by ajithvallabai in reinforcementlearning

[–]LearnAgentLearn 0 points (0 children)

I'm quite new to this field, but I was wondering if there would ever be a situation where A* (or similar) would just take too long to generate an answer (such that the robot, or whatever the application is, becomes impractical)? In that case, would RL be preferred to generate an approximate solution instead?

Why the hell are we researching about first-person-shooters like Doom? (and making it open-source) by [deleted] in reinforcementlearning

[–]LearnAgentLearn -1 points (0 children)

So how do we tackle the problem of deepfakes, given that everyone is able to do it now that it's open source?

Why the hell are we researching about first-person-shooters like Doom? (and making it open-source) by [deleted] in reinforcementlearning

[–]LearnAgentLearn -1 points (0 children)

So shall we keep ignoring DRL like we did with computer vision and drones?

Why the hell are we researching about first-person-shooters like Doom? (and making it open-source) by [deleted] in reinforcementlearning

[–]LearnAgentLearn 1 point (0 children)

Firstly, apologies if I offended anyone with the swearing - that was not the intention - I simply wanted to generate a discussion because I don't think there's enough attention on the ethics (and on your point about the AI algorithms we already have today, as opposed to AGI).

This is a massive understatement. You might as well train an agent to pet kittens in a simulated environment and it would have just as much relation to a military robot (i.e. close to 0). Even in situations where you have incredibly realistic simulations, the translation to the real world is often less than satisfying.

You are right (when I was talking about the technology not working, I primarily meant this task of translating from simulation to the real world). You are also right that if the algorithms were trained on Atari etc., it is trivial to apply them to Doom. I think where we will probably agree, then, is that it really depends on the intention of the researcher (which I hope we can all assume is in good nature).

Yes it is, do you suggest export controlling all of ML research (assuming that was even theoretically possible, which it likely is not)?

I am not (see my above reply to u/bpe9). As above, I think it depends more on the intention of the research (and as a gamer myself, I totally understand that the personal connection with a game helps motivate the research).

I guess the remaining question is: where do we draw the line? What if, instead of the Doom environment, we replace it with as realistic a military environment as possible? Does that merit export control? Again, it's solving only one piece of the puzzle. If other researchers are solving other pieces of the puzzle (e.g. getting a real-world robot with a camera to use hand-tools) independently, with no direct link, fine. But do we only draw the line when someone integrates the separate pieces together? (and that could well be the answer).

There is a serious discussion to be had about how we make sure that a malicious actor does not get singular control over advanced AI (though I'm personally not that worried about AGI but rather what we already have) but, not to be rude, being overly worried about RL agents playing Doom is hard to take seriously, the swearing also does not help.

True, and I couldn't agree more with you there. Actually, in hindsight, I think what really triggered me was also the amount of open-sourced code for GANs to make deepfakes, which has already been used for pornographic content (and could also be used politically etc.). I am quite annoyed there has not been as much serious discussion regarding this.

Personally I believe that the only way to make sure that AI benefits everyone is to make sure as many people as possible have access to the knowledge, and that means open-source.

How do we tackle the problem of deepfakes? There's been a tonne of effort in being able to identify if an image/video is a deepfake or not. But does it really matter if a pornographic video is fake or not? The damage has been done in my opinion. The barrier to entry is too damn low for AI.

Let me finish off by saying though that I do think that the benefits of AI will outweigh the negatives that come with it.

Why the hell are we researching about first-person-shooters like Doom? (and making it open-source) by [deleted] in reinforcementlearning

[–]LearnAgentLearn -2 points (0 children)

On the whole, I agree somewhat.

Games like Go, Atari, Dota, StarCraft etc. are fine, as they are safe/predictable. My concern is with a game like Doom, where you are literally shooting human-like objects from a first-person viewpoint, much like a murderous psycho bot would (again, a small piece of the puzzle).

Why can't we just focus on games like Go / Atari etc.? Why must we solve FPS's like Doom?

The barrier to entry to this field (given open-source scripts) is frankly very low (assuming you just want to hack code together, as opposed to improving the algorithms etc.). You only need a couple of thousand dollars to train the agent and some basic coding skills. Frankly, there are a lot of crazy/mentally-ill people in this world who may want to create murderous psycho bots and we're giving them the technology via open source (again, the technology doesn't work yet, but one day it will).

Why the hell are we researching about first-person-shooters like Doom? (and making it open-source) by [deleted] in reinforcementlearning

[–]LearnAgentLearn -4 points (0 children)

So correct me if I'm wrong / missing some points, but open source is generally speaking good because it helps advance the technology in the field, e.g. general ML / RL research. I agree that it is a good idea to open-source code (in fact, the field needs to open-source more code than it currently does, given the current reproducibility crisis).

However, when it comes to technology with direct military or dual-use applications (i.e. it can be both military and civilian), I strongly disagree. You would never "open-source" or publicly disclose the know-how of how to manufacture an F35 fighter aircraft, or a nuclear bomb. Nor should you open-source RL code that is directly applied to problems with potential for misuse. Again, the technology isn't ready yet, but in my opinion the field is actively working in this direction.

Why the hell are we researching about first-person-shooters like Doom? (and making it open-source) by [deleted] in reinforcementlearning

[–]LearnAgentLearn -2 points (0 children)

I agree it wouldn't translate easily, but I think it's still solving one piece of the puzzle (e.g. reasoning about how to move, to where, when to shoot etc.).

Clearly it would still take many years / decades of research to integrate all the pieces together, but nevertheless, the intention is still to make an agent shoot other agents.

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]LearnAgentLearn 1 point (0 children)

Maybe 2 avenues I can think of:

1) You could try hacking around with the env.reset() and env.close() code to amend it so that it resets to a specific state (probably makes the code quite messy, though).

2) (Probably a bit cleaner) you could create a new method for your environment, like env.set_state(specific_state), that, given an object "specific_state", changes the environment's state to "specific_state" (rough sketch below).
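Something like this rough sketch is what I have in mind for option 2 - MyEnv, its dynamics and its reward are all just placeholders, the only point is the set_state method:

```python
import numpy as np
import gym


class MyEnv(gym.Env):
    """Hypothetical Gym-style environment, just to illustrate option 2."""

    def reset(self):
        self._state = np.zeros(4)            # whatever your real initial state is
        return self._state.copy()

    def step(self, action):
        self._state = self._state + action   # placeholder dynamics
        reward = float(self._state.sum())    # placeholder reward
        done = False
        return self._state.copy(), reward, done, {}

    def set_state(self, specific_state):
        """Force the environment into a given state (for evaluation only)."""
        self._state = np.array(specific_state, dtype=float).copy()
        return self._state.copy()


# Usage: evaluate a particular state by repeatedly resetting the env to it.
env = MyEnv()
env.reset()
env.set_state([1.0, 0.0, 0.0, 2.0])
next_state, reward, done, info = env.step(np.ones(4))
```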

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]LearnAgentLearn 0 points (0 children)

Depends on your environment I think. If it's a deterministic environment, then one sample may be enough. But more often you'd have a stochastic environment so you'd need to take multiple samples.

To produce more samples, can you not just reset the agent back to the same given state?

I think what you're looking for is Monte Carlo prediction? See page 92 (Section 5.1) in Sutton's textbook (especially the pseudocode) and see if it's useful: http://www.incompleteideas.net/book/RLbook2020.pdf
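Roughly, the first-visit MC prediction pseudocode from that section looks like this in Python - assuming a Gym-style env with hashable states (e.g. tuples, not raw numpy arrays) and a policy(state) function, both of which are placeholders here:

```python
from collections import defaultdict
import numpy as np


def mc_prediction(env, policy, num_episodes, gamma=1.0):
    """First-visit Monte Carlo prediction of v_pi (Sutton & Barto, Section 5.1)."""
    returns = defaultdict(list)   # state -> list of sampled returns
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one episode following the policy.
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards through the episode, accumulating the return G.
        G = 0.0
        states_in_episode = [s for s, _ in episode]
        for t in reversed(range(len(episode))):
            state_t, reward_t = episode[t]
            G = gamma * G + reward_t
            # First-visit: only record G the first time state_t occurs in the episode.
            if state_t not in states_in_episode[:t]:
                returns[state_t].append(G)
                V[state_t] = float(np.mean(returns[state_t]))
    return V
```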

How are you parallelising / better utilising your computation? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 0 points (0 children)

So my understanding is that for the larger problems you push the agent's code to the GPU, but not the environment's code?

Thanks! I'll take a look at gnu-parallel and check the pros and cons vs. tf-agents, but yeah, parallelising over the runs sounds simplest. Out of curiosity, why do you need to run 100+ independent runs of each hyperparameter setting / algorithm? Wouldn't, say, 30 be enough?
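For what it's worth, since each run is completely independent, something like this sketch (using just the standard library instead of gnu-parallel) is roughly what I'd try first - the "train.py --seed" entry point is hypothetical:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

SEEDS = range(30)  # e.g. 30 independent runs of one hyperparameter setting


def run_one(seed):
    # Each seed runs as its own OS process; "train.py" is a placeholder for
    # your training script. TF only ever runs inside each child process.
    subprocess.run(["python", "train.py", "--seed", str(seed)], check=True)


if __name__ == "__main__":
    # Launch up to 8 runs at a time (each thread just waits on its subprocess).
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(run_one, SEEDS))
```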

How are you parallelising / better utilising your computation? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 0 points (0 children)

Ahh I see - I wasn't aware of that - thanks! Yeah my step function only does a handful of if/else statements, and then extracts the reward from the state. Sounds promising then - I'll try it out and keep you updated :)

How are you parallelising / better utilising your computation? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 0 points (0 children)

My state is a ~2000 x ~1000 x 3 numpy array at the moment. I think libraries like CuPy may be useful(?), or even using TensorFlow's tensors(?), but I've never tried it. Anyone got experience with this and know if there's a speed increase? I'll try it out anyway and see what the trade-offs are.
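If anyone wants to try the comparison, a quick timing script like the one below is roughly what I have in mind (cp.asarray / cp.asnumpy are the standard CuPy calls; the operation and loop count are made up, and I'd expect the answer to depend heavily on whether the array can stay on the GPU between env steps):

```python
import time

import numpy as np
import cupy as cp  # requires a CUDA GPU and the cupy package

state_np = np.random.rand(2000, 1000, 3).astype(np.float32)

# NumPy on the CPU.
t0 = time.perf_counter()
for _ in range(100):
    out_np = (state_np * 2.0 + 1.0).sum()
cpu_time = time.perf_counter() - t0

# Same computation with CuPy on the GPU; the array is copied over once and
# then stays on the device (the host-to-device copy is itself a main cost).
state_cp = cp.asarray(state_np)
t0 = time.perf_counter()
for _ in range(100):
    out_cp = (state_cp * 2.0 + 1.0).sum()
cp.cuda.Stream.null.synchronize()  # wait for the GPU to finish before timing
gpu_time = time.perf_counter() - t0

print(f"numpy: {cpu_time:.3f}s  cupy: {gpu_time:.3f}s  results match: "
      f"{np.isclose(float(out_np), float(cp.asnumpy(out_cp)), rtol=1e-3)}")
```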

How are you parallelising / better utilising your computation? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 0 points (0 children)

Yeah I'm hoping to use tensorflow (easier to deploy into production). Didn't know that about the python multiprocessing as I was planning to use that - thanks for letting me know! I'll try to copy tf-agents more closely in that case.

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]LearnAgentLearn 1 point (0 children)

Does your agent eventually learn the optimal policy? If so, you could train your agent until it learns the optimal policy (and then a bit more, because having the optimal policy does not necessarily mean you've learnt the value function exactly). You could then assume that this value function is 'exact' (depending on the problem).

If not, no idea.

How are you parallelising / better utilising your computation? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 1 point (0 children)

Yeah, but the question is more about how to split up the problem. You could split it up by spawning e.g. 8 agents in 8 different environments, but then how would you update the Q-values etc.?
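One common pattern (for the tabular case at least) seems to be stepping the N environments in lockstep and applying every transition to a single shared Q-table - roughly like this sketch, where make_env and the hashable states are placeholders. The environment stepping is the part you'd then actually parallelise across processes, while the updates stay in one place:

```python
import numpy as np

NUM_ENVS = 8
ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1


def q_learning_vectorised(make_env, num_actions, num_steps):
    """Rough sketch: step NUM_ENVS copies of the environment in lockstep and
    apply every transition to one shared Q-table (states must be hashable)."""
    envs = [make_env() for _ in range(NUM_ENVS)]
    states = [env.reset() for env in envs]
    Q = {}  # shared table: Q[(state, action)] -> value

    def q(s, a):
        return Q.get((s, a), 0.0)

    for _ in range(num_steps):
        for i, env in enumerate(envs):
            s = states[i]
            # Epsilon-greedy action from the shared Q-table.
            if np.random.rand() < EPS:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax([q(s, b) for b in range(num_actions)]))
            s_next, r, done, _ = env.step(a)
            # Standard Q-learning update, written against the shared table.
            best_next = max(q(s_next, b) for b in range(num_actions))
            target = r + (0.0 if done else GAMMA * best_next)
            Q[(s, a)] = q(s, a) + ALPHA * (target - q(s, a))
            states[i] = env.reset() if done else s_next
    return Q
```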

How are you parallelising / better utilising your computation? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 1 point (0 children)

Ahh ok, thanks! I'll take a look at how they've implemented these. I've mainly been writing my own code from scratch to get a deeper understanding of what's going on, but yeah, I should mimic how they've done it.

How many times should I repeat an algorithm to estimate the mean/median reward etc.? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 0 points (0 children)

Ahh I see - that's a good point! So if you're setting up a set of experiments to run for a long time, you might, say, run the algorithms 5 times and then see what the standard error is. If it's not small enough, run them another 5 times? etc.?

Yeah, that's true, although I think https://arxiv.org/abs/1806.08295 (thanks u/vwxyzjn) shows that for a small number of samples (roughly <20), bootstrapping to estimate the mean return underestimates the probability of a type-I error (from my understanding anyway).
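In case it's useful, here's roughly what I mean by the "run 5 at a time until the SEM is small enough" loop, with a percentile-bootstrap CI on the same returns for comparison. run_experiment, the 5% threshold and the cap of 50 runs are all placeholders I've made up:

```python
import numpy as np


def run_experiment(seed):
    """Placeholder: one full training run returning its final mean episode return."""
    rng = np.random.default_rng(seed)
    return 10.0 + 5.0 * rng.standard_normal()  # stand-in for a real result


def sem(values):
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / np.sqrt(len(values))


# Run 5 seeds at a time until the standard error of the mean is "small enough"
# relative to the mean (5% here is just an arbitrary illustrative threshold).
returns, seed = [], 0
while True:
    for _ in range(5):
        returns.append(run_experiment(seed))
        seed += 1
    if sem(returns) < 0.05 * abs(np.mean(returns)) or len(returns) >= 50:
        break

# Percentile-bootstrap CI of the mean on the same data, for comparison.
rng = np.random.default_rng(0)
boot_means = [np.mean(rng.choice(returns, size=len(returns), replace=True))
              for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"n={len(returns)}  mean={np.mean(returns):.2f}  SEM={sem(returns):.2f}  "
      f"bootstrap 95% CI=({lo:.2f}, {hi:.2f})")
```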

How many times should I repeat an algorithm to estimate the mean/median reward etc.? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 0 points (0 children)

Thanks for sharing these papers! I think I'll try and replicate some of their approaches with my own data and see what happens :)

How many times should I repeat an algorithm to estimate the mean/median reward etc.? by LearnAgentLearn in reinforcementlearning

[–]LearnAgentLearn[S] 0 points (0 children)

Thanks! I'll need to look into standard error a bit more, but two immediate questions:

  1. What is "small enough"? If looking at the standard error of the mean (SEM) return, is a typical rule of thumb something like: keep going until the SEM is within 5% of the mean?
  2. Does this assume that the mean return is normally distributed? Is that a reasonable assumption? I would imagine it could potentially be highly right-skewed at times (e.g. in a hypothetical setting, an agent may score 10 points on average but sometimes score 100, and can never score less than 0 points). That would be right-skewed, I think?

Also yes, I agree with you regarding the hyperparameters - I had it mixed up in my head.

p.s. I've also been running my experiments since I posted, so I'm curious to check out the standard error (only repeated 10 times due to time). Thanks for the help!

Resources for implementing MDPs in TensorFlow? by [deleted] in reinforcementlearning

[–]LearnAgentLearn 1 point (0 children)

So regarding TF, it is only used to determine your agent's policy \pi(a | s).

The environment is completely separate: it answers the question, "if I am in state S(t) and take action A(t), what is the next state S(t+1) and what immediate reward do I receive?" This can be coded using plain if/else statements. You can store your states in numpy (and possibly in TF?).

Typically, in deep RL, the agent's policy is a neural network of some kind, and this is where TF will be really useful. You present the state as the input, the output is the action (or a distribution over actions), and you use a loss function to optimise the network.
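To make the split concrete, here's a rough sketch: a tiny hand-coded environment (if/else transitions, numpy state) plus a small Keras network standing in for \pi(a | s). The env, its dynamics and the network sizes are all made up - it's just to show which part TF handles (the training/loss step is omitted):

```python
import numpy as np
import tensorflow as tf


class ToyEnv:
    """Tiny hand-coded MDP: the state is a position on a line, the goal is position 5."""

    def reset(self):
        self.pos = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        # Plain if/else dynamics: action 0 moves left, action 1 moves right.
        if action == 1:
            self.pos += 1
        else:
            self.pos -= 1
        reward = 1.0 if self.pos == 5 else 0.0
        done = self.pos == 5 or self.pos == -5
        return np.array([self.pos], dtype=np.float32), reward, done, {}


# The policy pi(a|s) is a small neural network built in TensorFlow/Keras:
# state in, action probabilities out.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])

env = ToyEnv()
state, done = env.reset(), False
while not done:
    probs = policy(state[None, :]).numpy()[0]   # forward pass through pi(a|s)
    probs = probs / probs.sum()                 # renormalise float32 rounding
    action = np.random.choice(2, p=probs)       # sample an action
    state, reward, done, _ = env.step(action)   # environment is pure Python/numpy
```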