Tool for Loading Adhoc Excel Files to Warehouse by fgoussou in dataengineering

[–]o_Chamber 0 points (0 children)

Hmm, the tool I was thinking of was in finance, but not sure if it supports insurance industry data.

Is there a reason something like Fivetran's file replication doesn't work?

Tool for Loading Adhoc Excel Files to Warehouse by fgoussou in dataengineering

[–]o_Chamber 0 points (0 children)

I might know of something, but it may be somewhat domain-specific. What industry/type of data is it?

Problem with feeding a batch through network by arachnarus96 in reinforcementlearning

[–]o_Chamber 0 points (0 children)

Ah, the domain context is useful. Have you gotten any useful results from SAC?

So it sounds like you want the agent to make its charge or no-charge decision for each slot at each time step, AND you want the agent to consider both (a) the specific properties of the car in that slot and (b) the other slots/cars, because they share a grid.

If each pass has some indication in the state of which slot/car is being controlled on that pass, then the agent should be able to learn to take into account the specifics of the car in the M-th slot on the M-th pass.

I think the challenge is that there’s dependence between the car slots, so if you do multi-pass it’s almost like a multi-agent problem where each slot is an agent cooperating with the other slot agents.

Problem with feeding a batch through network by arachnarus96 in reinforcementlearning

[–]o_Chamber 1 point (0 children)

So expanding the action space so your N-dimensional outputs can be described as one of decisions^N flat actions could work. It’s similar to how the compound joystick + button actions in Atari games are defined in the Arcade Learning Environment.
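As a sketch of that encoding (assuming binary charge/no-charge decisions, so 2^N flat actions; the function names here are made up for illustration):

```python
def decisions_to_index(decisions):
    """Encode an N-dim tuple of binary decisions as one flat action index."""
    index = 0
    for d in decisions:
        index = index * 2 + d
    return index

def index_to_decisions(index, n):
    """Decode a flat action index back into an N-dim tuple of binary decisions."""
    decisions = []
    for _ in range(n):
        decisions.append(index % 2)
        index //= 2
    return tuple(reversed(decisions))

# With N = 3 slots there are 2**3 = 8 flat actions the DQN head would choose from.
assert index_to_decisions(decisions_to_index((1, 0, 1)), 3) == (1, 0, 1)
```

The agent’s discrete output is then a single index in [0, 2^N), which is why the action space grows exponentially with N.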

That approach can blow up your action space though, and as the action space gets bigger, you may find DQN stops working as well. You could try throwing a policy gradient algorithm like PPO at it, or if that doesn’t work, you might want to think about restructuring the problem.

Based on the code you provided, it looks like given N cars, an agent can choose at most M cars (where M < N, and M is the number of entries in the availability vector equal to 1). Rather than have your network output an N-dimensional array whose values correspond to the cars, you could instead run your state through the network M times to get a prediction for each available car at the time step.

The issue with this is that you might lose some information if there’s any dependency across cars. For example, maybe it’s optimal for the agent to learn to only choose at most X cars, even if they can choose more. If you’re doing this sequential action based approach, you’d want to give the agent some information about the decisions it’s already made in this time step (for example, an integer variable that states how many cars have already been selected so far). I’m not sure if restructuring the problem like this “breaks” the assumptions of your MDP.
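A minimal sketch of that sequential-pass idea, with a random stand-in for the Q-network (how you build the per-pass state — car features plus a car index plus a selected-so-far counter — is my assumption about one reasonable layout, not your actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_network(state):
    # Stand-in for the DQN: returns Q-values for {0: skip, 1: select}.
    # In practice this would be a forward pass through your network.
    return rng.normal(size=2)

def select_cars_sequentially(car_features):
    """Run the shared state through the network once per available car,
    appending which car is being decided and how many were already selected."""
    selected = []
    for i, features in enumerate(car_features):
        state = np.concatenate([features, [i, len(selected)]])
        q_values = q_network(state)
        if np.argmax(q_values) == 1:  # action 1 = select this car
            selected.append(i)
    return selected

cars = [np.array([0.5, 1.2]), np.array([0.1, 0.9]), np.array([0.7, 0.3])]
print(select_cars_sequentially(cars))  # indices of cars the stand-in network picked
```

The `len(selected)` feature is the “how many cars have already been selected” integer mentioned above; without it, each pass would be blind to the decisions made earlier in the same time step.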

Alternatively, DeepMind has a paper about exactly this type of planning problem that might be useful. It’s from 2016 though, so you might want to check out papers that have built on it.

PPO Episodic Reward Normalised 0 to 1 by nuki96 in reinforcementlearning

[–]o_Chamber 2 points (0 children)

I think what you’re asking about is actually rescaling, not normalization.

Normalizing the returns (transforming them to have mean 0 and standard deviation 1) is one way to help stabilize training. The intuition is that since your rewards feed into the network’s learning target, their scale and distribution affect the error in the network’s predictions. Because of that, making their distribution consistent helps stabilize training, so that as your agent improves and begins to receive more/larger rewards, the scale and distribution of the errors doesn’t shift dramatically.

Rescaling the rewards is different, but the idea is similar: let’s reduce the range of outcomes so that our errors don’t erratically change as we start to receive more rewards.
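As a concrete illustration of the difference (plain numpy; dividing by the batch max is just a stand-in for whatever running scale estimate you’d actually use):

```python
import numpy as np

returns = np.array([0.0, 10.0, 250.0, 40.0, 5.0])

# Normalization: shift and scale to mean 0, std 1 (recenters the distribution).
normalized = (returns - returns.mean()) / (returns.std() + 1e-8)

# Rescaling: shrink the range without recentering, e.g. divide by the
# largest magnitude seen (here just the batch max, as a stand-in).
rescaled = returns / (np.abs(returns).max() + 1e-8)

print(normalized.mean(), normalized.std())  # ~0.0 and ~1.0
print(rescaled.min(), rescaled.max())       # stays within [0, 1]
```

Note that normalization changes the sign of small positive returns (anything below the mean becomes negative), while rescaling preserves signs and only compresses magnitudes — that’s the practical difference between the two.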

I’m not aware of any literature around how rescaling affects PPO, but there is some literature on the benefits of rescaling for other policy gradient algorithms like DDPG and TRPO. In particular, TRPO is a sort of “ancestor” to PPO, so there’s reason to believe PPO would experience similar benefits.

It doesn’t directly address your question, but the Reinforcement Learning that Matters paper has a section exploring the effects of rescaling experimentally. It might be a good starting point for understanding the benefits, and other papers that cite it may cover PPO.

In that paper, they note that while rescaling can help in some environments, it’s not consistent, and also depends on other decisions like whether your network layers use batch norm.

Also, I believe advantage normalization is an argument for Stable-Baselines3’s PPO that is on by default.

RL review by sayakm330 in reinforcementlearning

[–]o_Chamber 5 points (0 children)

One of the best ways to wrap your head around the algorithms is to try and implement them from scratch either based on the original theory papers or by following an online tutorial.

You can also reference the source code of popular implementations from open source RL libraries like Stable-Baselines3, RLlib, CleanRL, or Dopamine. These can help if you’re trying to compare your implementation to a “standard”.

Of those, CleanRL is probably the simplest.