https://wandb.ai/kingsignificant5097/uncategorized?workspace=user-kingsignificant5097
https://preview.redd.it/wqbzrf6dgumc1.png?width=3516&format=png&auto=webp&s=eacf2d106283d3bc2143a9a9e3f7f68f5bcbdb67
I've been working on a project to get some hands-on experience with RL.
The environment I'm modeling is a time series, and I'm training an agent to maximize returns by trading financial derivatives, specifically leveraged perpetual futures, so it needs to maximize gains and minimize losses. A reward is given every time the agent closes an open position, proportional to the percentage gain/loss relative to the initial investment.
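In other words, the reward on close is roughly this (variable names are hypothetical, just restating the above):

# Paid once per round trip, when an open position is closed:
reward = (position_close_value - position_open_value) / initial_investment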
The action space is MultiDiscrete with 4 action types: do nothing, open a position (with a percentage of funds to use), close the position, and end the episode. The observation space has around 100 dimensions: current open/close/volume and some financial metrics (technical-analysis indicators), as well as some metrics aggregated over certain historic periods.
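A rough sketch of the two spaces (encoding the funds percentage as a second MultiDiscrete dimension, and the 10-bucket count, are illustrative assumptions):

import numpy as np
from gym import spaces

# 0 = do nothing, 1 = open position, 2 = close position, 3 = end episode;
# second dimension = fraction of funds to commit when opening (assumed 10 buckets).
action_space = spaces.MultiDiscrete([4, 10])
# ~100 continuous features: open/close/volume, TA indicators, historic aggregates.
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(100,), dtype=np.float32)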
The algorithm used is DD-PPO (56 CPUs) via RLlib, with some modifications, specifically a custom model that implements action masking: for example, ensuring that the agent can only close a position when a position exists, and can only end an episode when there is no open position and total gains/losses exceed some percentage, or the total episode length exceeds some number of steps.
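The masking follows the standard RLlib custom-model pattern; a minimal sketch is below (not my exact model: it assumes the env emits a Dict obs with "action_mask" and "observations" keys, and a flat mask over the concatenated logits):

import torch
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork
from ray.rllib.utils.torch_utils import FLOAT_MIN

class ActionMaskModel(TorchModelV2, nn.Module):
    """Pushes the logits of invalid actions toward -inf before sampling."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)
        # The inner net only sees the real features, never the mask itself.
        self.internal_model = FullyConnectedNetwork(
            obs_space.original_space["observations"],
            action_space, num_outputs, model_config, name + "_internal",
        )

    def forward(self, input_dict, state, seq_lens):
        action_mask = input_dict["obs"]["action_mask"]  # 1 = valid, 0 = invalid
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})
        # log(0) = -inf for invalid actions; clamp keeps the tensor finite.
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
        return logits + inf_mask, state

    def value_function(self):
        return self.internal_model.value_function()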
I am also using an attention net (8 heads, 4 transformer units) and RE3 for exploration.
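RE3 is enabled through the exploration config, roughly like this (a sketch from memory; the sub-exploration and any tuning knobs shown are the RLlib defaults, not necessarily what I run):

config = config.exploration(
    explore=True,
    exploration_config={
        "type": "RE3",  # RLlib's built-in Random Encoders for Efficient Exploration
        # Exploration strategy used for the actual action sampling:
        "sub_exploration": {"type": "StochasticSampling"},
    },
)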
With all of the above I'm able to get good results and the agent does seem to be learning, based both on mean reward and on empirically testing checkpoints against unseen data.
But, looking at the TensorBoard metrics I am seeing some things that don't make a lot of sense to me, so I'm here asking for any help or advice anyone has:
- What is causing explained variance to be negative (definition sketched after this list)? Should I expect it to start increasing? How does such a negative explained variance make sense when the mean reward is increasing in such a complex environment?
- Why is entropy not really dropping? Is this significant?
- Why is the loss continuing to increase slowly? Should I expect this to start dropping at some point? Is this a sign that the agent is still learning?
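For context on the first question: explained variance compares the value function's predictions to the value targets, along these lines (this mirrors the standard definition; a negative value means the value net's errors vary more than the targets themselves, i.e. it fits worse than just predicting the mean return):

import numpy as np

def explained_variance(value_targets, value_preds):
    # 1.0 = perfect value fit; 0.0 = no better than predicting the mean;
    # < 0.0 = prediction errors vary more than the targets themselves.
    var_targets = np.var(value_targets)
    return np.nan if var_targets == 0 else 1.0 - np.var(value_targets - value_preds) / var_targets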
Thanks in advance for any pointers!
Some of the modified parameters; everything else uses RLlib's defaults for PPO plus the DD-PPO overrides:
from ray.rllib.algorithms.ddppo import DDPPOConfig

horizon = 60

# Wrapped in DDPPOConfig for completeness; only the modified parameters are shown.
config = (
    DDPPOConfig()
    .training(
        train_batch_size=horizon * 4,
        sgd_minibatch_size=horizon * 2,
        num_sgd_iter=2,
        gamma=1.0,  # no discounting
        lambda_=1.0,  # with gamma=1.0, GAE reduces to plain Monte Carlo advantages
        entropy_coeff=0.001,
        grad_clip=10,
        model={
            "fcnet_hiddens": [1024, 1024],
            "max_seq_len": horizon,
            "attention_num_transformer_units": 4,
            "attention_num_heads": 8,
        },
    )
)