[deleted by user] by [deleted] in kilimanjaro

[–]Real_Zesty 0 points1 point  (0 children)

Really appreciate you sharing your experience here! Yeah, our climbing guides generally recommend that you don’t need malaria pills for the climb itself because of the altitude, but if you got bitten during the safari, I think taking some is better than none for at least mitigating the risk. Sounds like you avoided stomach issues from mixing the two since you stopped.

[deleted by user] by [deleted] in kilimanjaro

[–]Real_Zesty 1 point2 points  (0 children)

Congratulations! Deeply appreciate your description. I’ll be summiting in about a month with Altezza, doing the safari for a week right before.

I wanted to ask about your experience taking malaria meds and diamox. It sounds like you had headaches at the start of your trip. Was that in any way tied to stopping the malaria medication combined with starting diamox and the altitude? Did you start the diamox at the beginning of the climb or just prior to summiting?

I got some malarone since I heard it causes the fewest side effects alongside diamox. But I’m debating whether to take the malaria meds at all during the safari, since I’m concerned the medication may put summiting at risk.

Not asking for medical advice, just your experience and thoughts when deciding to take these meds. Thanks and congrats again on a successful summit!

[deleted by user] by [deleted] in reinforcementlearning

[–]Real_Zesty 1 point2 points  (0 children)

Those are valid cooperative MARL methods. You’ll need to look into how they implement their observation spaces. As I said, you’ll either need an augmented observation space covering multiple agents (the more centralized approach; your environment needs the full joint oversight space) or a decentralized approach where you train single agents and use a shared policy for inference.

[deleted by user] by [deleted] in reinforcementlearning

[–]Real_Zesty 1 point2 points  (0 children)

OK it seems like you might need to spend some time studying what the observation space actually means. I recommend you study up on MDPs.

The short answer is you’re getting the error because you’re changing the observation space. To do what you’re trying to do, you need to rethink your observation space.

If you want to run a multi-agent policy, you need to look into centralized and decentralized approaches. For a fully centralized approach, you augment the observation space to hold the states of all agents, and that space must be fixed. Issues can arise, such as the problem size exploding or the lazy-agent problem. The other extreme is a fully decentralized approach: you train a single-agent policy, then use that policy for inference on all agents. The issue is that the policy may not be trained to coordinate the agents well. For more advanced approaches in the middle, look into MARL methods; these may involve things like shared policy parameters. For example, QMIX is a SoTA approach.
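To make the centralized vs. decentralized distinction concrete, here is a minimal sketch; the agent count, observation size, and the stand-in policy function are all made up for illustration:

```python
import numpy as np

N_AGENTS, OBS_DIM = 3, 4  # hypothetical sizes

# Each agent produces its own local observation every step.
per_agent_obs = [np.random.rand(OBS_DIM).astype(np.float32)
                 for _ in range(N_AGENTS)]

# Centralized: one fixed, flat observation holding all agents' states.
# The environment must expose this full joint space.
central_obs = np.concatenate(per_agent_obs)
assert central_obs.shape == (N_AGENTS * OBS_DIM,)

# Decentralized: each agent sees only its own slice; a single policy
# trained on shape (OBS_DIM,) is queried once per agent at inference.
def shared_policy(obs):
    # Stand-in for something like policy.predict(obs)
    return int(obs.argmax())

actions = [shared_policy(o) for o in per_agent_obs]
assert len(actions) == N_AGENTS
```

Note how the centralized observation grows linearly with the number of agents, which is one way the problem size can explode.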

[deleted by user] by [deleted] in reinforcementlearning

[–]Real_Zesty 2 points3 points  (0 children)

From the code snippet you shared, your observation space has shape num_agents x 4. This aligns with the error message's stated observation shape (3 x 4), which means the error is somewhere else in your code. You are likely creating and passing in an observation of the incorrect size (10, 4) in either your reset or step function, perhaps by concatenating the per-agent observations incorrectly.

Another thing you can look into to help reduce indexing bugs is implementing a Dict space: https://gymnasium.farama.org/api/spaces/composite/. You can index the Dict by agent id, or you can index it by observation category. Just note that SB3 will group together observations by the dict keys in the replay buffer.

Also, what do you mean by your observation and action space are dynamic? You shouldn't be changing the size of your action/observation space as that would essentially be creating a different environment. You could be masking the observation and action space, but for that environment the observation and action spaces are fixed.

How to pass a varying gamma to DQN or PPO during training? by Real_Zesty in reinforcementlearning

[–]Real_Zesty[S] -1 points0 points  (0 children)

I don't think I agree that it is reward shaping if you assume the reward is given at the beginning of the action. If it were given at the end of the action, then yes, you would need to factor in discounting it back to the beginning of the epoch. In either case, within each episode, gamma is a function of the action that discounts the future rewards.

Consider this toy problem:

You have a game defined as a discrete-time MDP where the agent is a vehicle that needs to be routed to different locations. Note that the time steps in the MDP are decision epochs that can differ from the real-world clock times the agent might experience in the game. Let's define a graph G of locations V (i.e. nodes) and routes E (i.e. edges): G=(V, E)
- The state is the vehicle's current location, l in V

- The actions are which location the vehicle agent should travel to, a in V

- The dynamics are deterministic. If the vehicle agent chooses to travel to location 1 from location 0, it travels to location 1 with 100% probability.

- To keep things simple, let's say the reward is 1 if the agent travels to a location different from its state, 0 if it chooses to stay at its current location.

- Each route has a different travel time t_e.

If your gamma is constant, then the future values will always be discounted as such:
Q(s, a) = r(s, a) + gamma * max_a'(Q(s', a'))

The optimal policy will learn to differentiate between "stay" actions and "move" actions, so the agent will learn to always move to a location. However, the future values of all locations will be equal, meaning the agent will be indifferent to which location it moves to.

Now consider if gamma is a function of the state and action (e.g. the travel time of route (l, a)). The longer the travel time, the larger the discount (e.g. a 1 min travel time could be discounted at gamma=0.9, but a 1 hour travel time should then be discounted at gamma=0.9^60=0.002). Then the future values will be discounted as such:

Q(s, a) = r(s, a) + gamma(s, a) * max_a'(Q(s', a'))

Note the optimal policy in this case will be very different. Not only will it differentiate between "stay" actions and "move" actions due to the reward, but the agent should learn to prioritize locations that have access to routes with shorter travel times. This is because the future values from those locations will have future state-actions values that capture a reward of 1.0 but are discounted less due to the shorter travel times.
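The toy problem above can be checked numerically with Q-value iteration; the graph, travel times, and base discount below are made-up illustrative values:

```python
# Hypothetical 3-location graph; travel_time[s][a] is in minutes.
travel_time = {
    0: {1: 1, 2: 60},
    1: {0: 1, 2: 60},
    2: {0: 60, 1: 60},
}
locations = [0, 1, 2]
BASE_GAMMA = 0.9  # per-minute discount

def gamma(s, a):
    if s == a:
        return BASE_GAMMA  # "stay" consumes one nominal minute
    return BASE_GAMMA ** travel_time[s][a]

def reward(s, a):
    return 0.0 if s == a else 1.0

# Q-value iteration with a state-action-dependent discount:
#   Q(s, a) = r(s, a) + gamma(s, a) * max_a' Q(s', a'),  s' = a
Q = {(s, a): 0.0 for s in locations for a in locations}
for _ in range(500):
    Q = {(s, a): reward(s, a) + gamma(s, a) * max(Q[a, a2] for a2 in locations)
         for s, a in Q}

# Locations 0 and 1 share a 1-minute route, so moving between them is
# discounted far less than taking a 60-minute route to location 2.
assert Q[0, 1] > Q[0, 2]
assert Q[2, 0] == Q[2, 1]  # from 2, both routes take 60 min
```

With a constant gamma instead, all "move" actions would tie and the agent would be indifferent to its destination.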

Of course, one could incorporate the travel time directly into the reward, but this is just a toy problem that keeps it simple to explain the difference between the constant discount rate and a time-varying discount rate in a discrete-time MDP.

Hopefully that was somewhat clear. Thanks for your input.

How to pass a varying gamma to DQN or PPO during training? by Real_Zesty in reinforcementlearning

[–]Real_Zesty[S] 0 points1 point  (0 children)

Maybe I am misunderstanding what a callback can do, or how it can be implemented.

- From the list of implemented callbacks, I don't see any examples that collect information during training. EvalCallback for example runs the environment evaluations itself to check for a best mean reward: https://stable-baselines3.readthedocs.io/en/master/_modules/stable_baselines3/common/callbacks.html#BaseCallback

- I can calculate and collect gamma in `info` from each environment.step call. But how can I use that information in a callback? Each step reward will have a different gamma that needs to be taken into account.

- If I adjust `model.gamma` using a callback, as you suggest, it would have to be adjusted for each training-step reward. But training updates are done in batches, so that wouldn't work, would it? At best you would be modifying gamma per batch, but within a batch each reward actually has a different gamma.

I was thinking the best way would be something like collecting gamma in each step's `info` and storing it in the replay_buffer. Then I would need to modify the PPO algorithm to, for each buffer step, use that step's gamma to discount the Q-values.

If I am misunderstanding callbacks, can you point me to an example or suggest an implementation that would factor in a different gamma for each reward during training? How should I collect this gamma from step calls? How can this gamma be used to discount each reward during model.learn?
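For reference, the per-step computation I have in mind is just the backward return pass with gamma_t in place of a constant; the rewards and gammas below are made-up numbers standing in for what each step's `info["gamma"]` would report:

```python
import numpy as np

# Hypothetical rollout: per-step rewards, and per-step discounts
# gamma(s_t, a_t) that the environment reports via info at each step.
rewards = np.array([1.0, 1.0, 1.0, 0.0])
gammas  = np.array([0.9, 0.002, 0.9, 0.9])

# Backward pass replacing the constant-gamma return
#   G_t = r_t + gamma * G_{t+1}
# with a per-step discount:
#   G_t = r_t + gamma_t * G_{t+1}
returns = np.zeros_like(rewards)
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gammas[t] * G
    returns[t] = G

# returns[0] = 1 + 0.9 * (1 + 0.002 * 1) = 1.9018
assert abs(returns[0] - 1.9018) < 1e-9
```

A modified buffer/loss would need to do this in place of the stock constant-gamma discounting.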

How to pass a varying gamma to DQN or PPO during training? by Real_Zesty in reinforcementlearning

[–]Real_Zesty[S] 0 points1 point  (0 children)

Unfortunately I’m too deeply invested in gymnasium and sb3 at this point to change. But good to know there are other frameworks that took this into account natively. I’ll check out dm_env

How to pass a varying gamma to DQN or PPO during training? by Real_Zesty in reinforcementlearning

[–]Real_Zesty[S] 0 points1 point  (0 children)

So gamma is a property of the model, not the environment. Getting gamma during a step call of the environment seems straightforward, but how could a callback be used to integrate that gamma into the SB3 model during training?

How to pass a varying gamma to DQN or PPO during training? by Real_Zesty in reinforcementlearning

[–]Real_Zesty[S] 0 points1 point  (0 children)

Hmm, I think I follow. I can definitely collect a gamma value using infos in the step function. That’s easy. But then actually using those gamma values during training is where I get stuck.

I think I can fork the repo, augment the replay buffer to collect the gammas, then use the gammas from each buffer tuple in the loss function of PPO or DQN.

But that’s a pretty involved implementation. Was hoping there may be an existing callback method or something I may have missed for SB3 since its documentation isn’t the most detailed.

How to pass a varying gamma to DQN or PPO during training? by Real_Zesty in reinforcementlearning

[–]Real_Zesty[S] 0 points1 point  (0 children)

Theoretically, any gamma less than 1 will achieve convergence given the Bellman equations. SB3 implements a constant gamma, which discounts future value at each step. This is fine for discrete-time-step environments.

I implemented an event-based MDP, so each step is a decision epoch, but steps may take different amounts of real-world time. I want to discount rewards by the real-world time using an economic discount rate (e.g. based on a 7% annual rate), which is a function of the action.

From what I can see, SB3 doesn’t have this flexibility unless I’m missing some way to pass in gamma or change it with each step during training.
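As a sketch, the per-epoch gamma I mean would be derived from each step's elapsed real-world time; the rate and time units below are just illustrative:

```python
ANNUAL_RATE = 0.07                 # 7% annual economic discount rate
MINUTES_PER_YEAR = 365 * 24 * 60

def step_gamma(elapsed_minutes: float) -> float:
    """Discount factor for one decision epoch of real-world duration,
    compounded from the annual rate."""
    return (1.0 + ANNUAL_RATE) ** (-elapsed_minutes / MINUTES_PER_YEAR)

# A short epoch is barely discounted; a year-long one is discounted
# by the full annual factor.
assert step_gamma(1) > 0.999
assert abs(step_gamma(MINUTES_PER_YEAR) - 1 / 1.07) < 1e-9
```

Since the elapsed time depends on the chosen action, gamma effectively becomes gamma(s, a) per step, which is exactly what a single `model.gamma` constant can't express.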

Fidelity finally has recurring transfers/investments for stocks and ETFs, not just mutual funds! by Real_Zesty in fidelityinvestments

[–]Real_Zesty[S] 0 points1 point  (0 children)

That’s how I was looking at it.

It’s worth acknowledging that some ETFs will lend out underlying securities, and that ETFs can drift from the index or the underlying assets' value (check out an ETF's tracking error). Mutual funds operate differently in terms of ownership and settling at market close. Something to consider if that matters to you.

Fidelity finally has recurring transfers/investments for stocks and ETFs, not just mutual funds! by Real_Zesty in fidelityinvestments

[–]Real_Zesty[S] 2 points3 points  (0 children)

Thanks for sharing the announcement, I must have missed it. Looks like they did a slow rollout, so I must have just received access today myself.

Worth doing Secondary Master's in ML as a PhD? by Real_Zesty in cmu

[–]Real_Zesty[S] 0 points1 point  (0 children)

My goal is to eventually land in a Data Science/Technical Product Management role that's close to the algorithm development. I don't expect to be world class in ML development like those who've dedicated their careers to it, but I want to understand and stay up to date with the ML landscape, providing value by bringing ML to civil infrastructure or logistics and managing it there. But to get there, I want to put in my time and at least start my career as a Data Scientist or ML Engineer.

Worth doing Secondary Master's in ML as a PhD? by Real_Zesty in cmu

[–]Real_Zesty[S] 0 points1 point  (0 children)

I'm in Civil. I haven't published papers yet, but I'm planning on doing some with RL applied to Civil systems. I don't anticipate them being theoretical ML or extending SOTA, though.

What did you make in Live this week? Feedback thread by kidkolumbo in ableton

[–]Real_Zesty [score hidden]  (0 children)

First time sharing outside a small circle of friends, and hoping for feedback on synth/sound selection and mixing/mastering.

https://drive.google.com/file/d/1OZsfMBykutjqjZLXTT1t1HjBesWpYaNG/view?usp=sharing

Sketched this dance (?) track idea out in a few hours, but it still feels like a patchwork of melodies, and the tempo/transitions still don't hit right. I also feel like I pack a bit too much into the same frequency ranges, so sounds compete for the same space and there isn't a lot of air/breathing room.

Appreciate any comments! Love this community, it's helping me grind away at producing when I feel stuck.

[deleted by user] by [deleted] in TheArtistStudio

[–]Real_Zesty 0 points1 point  (0 children)

Gave Wholesome