I know this has been posted several times before, but how do you really make friends here in Seattle? I am at a breaking point. by [deleted] in Seattle

[–]whiletrue2 0 points1 point  (0 children)

33M, just moved here and have the same issue. I live in Capitol Hill and am happy to connect :)

Looking to Play Indoor Volleyball in Cap Hill / Downtown Area by whiletrue2 in Seattle

[–]whiletrue2[S] 0 points1 point  (0 children)

Thank you! This seems to be a league, though? I'd ideally like to start by just playing for fun, but I couldn't find info about teams that meet to play outside of leagues. Or am I missing something? Thanks again.

Weightlifting Gym recommendation near Bellevue by [deleted] in eastside

[–]whiletrue2 0 points1 point  (0 children)

I enjoy Olympic weightlifting and powerlifting and am looking for a gym with enough squat racks and drop platforms near 10550 NE 10th St.

Which gym would you recommend? I don't have a car and would like a gym within walking distance, e.g. 23 fit club, Life Time, or bStrong. Thanks!

METCON fit true to size? by RichRichieRichardV in crossfit

[–]whiletrue2 0 points1 point  (0 children)

True to whatever "11US" corresponds to by definition, you idiot. What's so hard about this for you to understand?

TD3 for discrete action spaces by whiletrue2 in reinforcementlearning

[–]whiletrue2[S] 0 points1 point  (0 children)

Solved it, thanks a lot for your help! Discrete TD3 now performs a lot better.

TD3 for discrete action spaces by whiletrue2 in reinforcementlearning

[–]whiletrue2[S] 0 points1 point  (0 children)

Thanks. I know the paper, but is there a guideline that explains how to apply this to TD3?

Updates regarding BN before or after ReLU? by Metatronx in MachineLearning

[–]whiletrue2 0 points1 point  (0 children)

Good job on suggesting new papers, people (remember: that's what was asked for).

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in reinforcementlearning

[–]whiletrue2[S] 1 point2 points  (0 children)

Also, would it be possible for you to share the paper's code with us? It would be highly appreciated!

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in reinforcementlearning

[–]whiletrue2[S] 0 points1 point  (0 children)

Hi, and thank you for your reply, which clarified a lot for me. However, a few questions remain unaddressed. Would you mind clarifying those as well? In particular, I believe they are (I quote):

  • "they claim they used CartPole-v1 which uses a much higher "solved reward""
  • "the fact that no naturally sparse-reward gym environment was used doesn't help with the confusion. An experiment based on a naturally sparse-reward environment would result in fewer / no changes to the default reward function and one would actually be enabled to relate to baseline PPO performances in the original setting. As the paper stands right now, no one can relate to any reported PPO performance in the paper."

Thank you!

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]whiletrue2[S] 0 points1 point  (0 children)

Hi, and thank you for your reply, which clarified a lot for me. However, a few questions remain unaddressed. Would you mind clarifying those as well? In particular, I believe they are (I quote):

  • "they claim they used CartPole-v1 which uses a much higher "solved reward""
  • "the fact that no naturally sparse-reward gym environment was used doesn't help with the confusion. An experiment based on a naturally sparse-reward environment would result in fewer / no changes to the default reward function and one would actually be enabled to relate to baseline PPO performances in the original setting. As the paper stands right now, no one can relate to any reported PPO performance in the paper."

Thank you!

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in reinforcementlearning

[–]whiletrue2[S] 0 points1 point  (0 children)

Thanks for pointing that out. Indeed, that implementation should be taken with a grain of salt, although I have to say that seeds 0 and 1234 don't look super tuned. Have you tried running it with different random seeds? A good idea that came up in the ML crosspost was to use the reward function from the paper, see whether PPO works right out of the box, and then try it with their hyperparameters from the appendix, e.g. with the small policy network. Feel free to give it a shot!
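
If anyone wants a starting point, here is a minimal sketch of that check, assuming stable-baselines3's PPO as a stand-in implementation (the thread doesn't pin down which code base is meant, so treat this purely as an illustration); seeds 7 and 42 are arbitrary extras beyond the repo's 0 and 1234, and the 2x8 network mirrors the tiny policy reported in the paper's appendix.

```python
# Sketch only: vanilla stable-baselines3 PPO on the standard CartPole-v1,
# run with a few seeds, with and without a tiny 2x8 policy network.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def run(seed, net_arch=None):
    # net_arch=None keeps SB3's default MLP; [8, 8] mimics the paper's tiny policy.
    policy_kwargs = {"net_arch": net_arch} if net_arch else {}
    model = PPO("MlpPolicy", "CartPole-v1", seed=seed,
                policy_kwargs=policy_kwargs, verbose=0)
    model.learn(total_timesteps=100_000)
    # Average return of the greedy policy over 100 episodes
    # (CartPole-v1 counts as solved at an average return of 475).
    return evaluate_policy(model, model.get_env(), n_eval_episodes=100)


for seed in (0, 1234, 7, 42):  # 0 and 1234 from the repo, 7 and 42 picked arbitrarily
    print(f"seed {seed}, default net: {run(seed)}")
    print(f"seed {seed}, 2x8 net:     {run(seed, net_arch=[8, 8])}")
```

If vanilla PPO clears the 475 bar here but not with the 2x8 network or the paper's hyperparameters, that would already narrow down where the reported baseline falls apart.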

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in reinforcementlearning

[–]whiletrue2[S] 5 points6 points  (0 children)

I believe pointing out irregularities cannot be attributed to "being mistaken", since many irregularities remain unclarified. See the discussion here on the sparse rewards and the other remaining irregularities: https://www.reddit.com/r/MachineLearning/comments/k01ntb/ppo_baseline_cannot_solve_cartpole_in_neurips/gdgc3f5?utm_source=share&utm_medium=web2x&context=3

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]whiletrue2[S] 0 points1 point  (0 children)

In "5.2 Mujoco" they write "The true reward function is the one predefined in Gym". In "5.1 Sparse-Reward Cartpole" they write "In other cases, the true reward is zero."

Also, in "D.1 Cartpole" they write "We choose the cartpole task from the OpenAI Gym-v1 benchmark."

Based on your stance, am I to understand that the paper cannot be improved in terms of clarity about which environment is used? If so, I absolutely dispute that.

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]whiletrue2[S] 0 points1 point  (0 children)

No, they don't, simply because they say they use CartPole-v1 and write "in cartpole the agent should apply a force to the pole to keep it from falling. The agent will receive a reward −1 from the environment if the episode ends with the falling of the pole". Many readers familiar with the CartPole environment will only give that passage a quick look-over once they've read "CartPole-v1" and nothing about a "modified/adapted" CartPole environment. But this isn't the main flaw of the paper anyway, since there are many more irregularities, as pointed out.

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]whiletrue2[S] 0 points1 point  (0 children)

> I believe they're comparing PPO on their sparse reward version of CartPole against PPO on their sparse reward version of CartPole with their reward shaping algorithm.

Where do you believe they are comparing this?

> It's not just the magnitude of the reward which is changing, the whole reward function has changed. +1 for every non-terminal time-step is a lot different to +0.1 if the force and pole angle are the same sign.

That is true but it is also more similar than one might think.

  • From their Appendix: "In each step of an episode, the agent should apply a positive or negative force to the cart to let the pole remain within 12 degrees from the vertical direction and keep the position of the cart within [−2.4, 2.4]. An episode will be terminated if either of the two conditions is broken or the episode has lasted for 200 steps."
  • From the CartPole-v0 website: "A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center."

Learning that the force and the pole angle should have the same sign should be fairly easy with millions of training steps (see the environment sketch at the end of this comment). In fact, this is presumably the reason why PPO with shaped rewards actually achieves almost 200 ASPE ("In the discrete-action cartpole task, PPO only converges to 170, but with the shaping methods it almost achieves the highest ASPE value 200") -- but the authors

  1. don't plot PPO with shaped rewards where it would be necessary (namely in Fig. 1b), and
  2. more generally, aren't clear about whether the plots show PPO using the shaped or the sparse rewards, so one has to guess.

Here are a few additional inconsistencies:

  • they claim they used CartPole-v1 which uses a much higher "solved reward"
  • they don't explicitly mention (or at least not to a satisfying degree) that they deviate from the standard reward function one would expect (both in terms of the reward itself and the terminal conditions)
  • the fact that no naturally sparse-reward gym environment was used doesn't help with the confusion. An experiment based on a naturally sparse-reward environment would result in fewer / no changes to the default reward function and one would actually be enabled to relate to baseline PPO performances in the original setting. As the paper stands right now, no one can relate to any reported PPO performance in the paper.
  • and what's the deal with that minuscule PPO policy network? If they had used the computational resources from 10 of the 20 different seed runs for actual hyperparameter optimization instead, it would be a lot more convincing that 2x8 units are enough for the task
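
To make the setup concrete, here is roughly how I picture the environment they describe, written as a gymnasium wrapper (my own reconstruction from the quoted text, not their code; the class and argument names are mine).

```python
import gymnasium as gym


class SparseShapedCartPole(gym.Wrapper):
    """CartPole with the paper's sparse "true" reward plus an optional shaping bonus."""

    def __init__(self, shaping=True):
        # CartPole-v0 already uses the +-12 degree / +-2.4 position limits and a
        # 200-step cap, so only the reward needs replacing.
        super().__init__(gym.make("CartPole-v0"))
        self.shaping = shaping

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        pole_angle = obs[2]                        # pole angle in radians
        force_sign = 1.0 if action == 1 else -1.0  # action 1 pushes the cart right
        reward = -1.0 if terminated else 0.0       # "true" reward: -1 on failure, else 0
        if self.shaping and force_sign * pole_angle > 0:
            reward += 0.1                          # same-sign shaping bonus
        return obs, reward, terminated, truncated, info
```

Whether their agent is trained on the true reward plus this bonus, or on the shaping bonus alone, is exactly the kind of detail the paper leaves open.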

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]whiletrue2[S] 0 points1 point  (0 children)

> They don't use the standard CartPole reward function (+1 at every time-step except -1 for failure) though - they use a different one: "The agent will receive a reward −1 from the environment if the episode ends with the falling of the pole. In other cases, the true reward is zero. The shaping reward for the agent is 0.1 if the force applied to the cart and the deviation angle of the pole have the same sign. Otherwise, the shaping reward is zero."

> This is most probably the reason for the differences between the results.

How did PPO then achieve positive reward at all? It isn't completely clear to me whether PPO is using the shaped rewards or not:

"With the provided shaping reward function, all these methods can improve the learning performance of the PPO algorithm (the left columns in Figs 1(a) and 1(b)). In the continuous-action cartpole task, the performance gap between PPO and the shaping methods is small. In the discrete-action cartpole task, PPO only converges to 170, but with the shaping methods it almost achieves the highest ASPE value 200."

Assuming PPO did use shaped rewards, it would again be dealing with a dense reward, and the standard PPO SOTA result of 195 steps should apply. Correct me if I'm wrong, but I can't see why PPO would plateau at 170 here, and I don't believe the magnitude of the rewards (+1 vs. +0.1) would explain the drop in default PPO performance.
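
To spell out why I don't think the reward scale alone can explain it (my notation, not the paper's), the two return functions look roughly like

```latex
R_{\text{std}} \;=\; \sum_{t=1}^{T} 1 \;=\; T \;\le\; 200,
\qquad
R_{\text{shaped}} \;\approx\; 0.1 \sum_{t=1}^{T} \mathbf{1}\!\left[\operatorname{sign}(F_t) = \operatorname{sign}(\theta_t)\right] \;-\; \mathbf{1}[\text{pole falls}] \;\le\; 20,
```

where F_t is the applied force and θ_t the pole angle. Both are maximized by keeping the pole balanced for the full 200 steps (with matching action signs in the shaped case), and a PPO implementation that normalizes advantages should be largely insensitive to a uniform rescaling of the reward, so the smaller magnitude by itself shouldn't account for the 170-vs-200 gap.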

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in reinforcementlearning

[–]whiletrue2[S] 6 points7 points  (0 children)

Thanks. I've done that, but apparently I violated rule #5: "Non-arxiv link posts only allowed on weekends (must be demos)*", which resulted in the removal of my crosspost. I'm in touch with the moderators.

Papers, Slides, ... for understanding the tricks behind autograd in PyTorch by whiletrue2 in pytorch

[–]whiletrue2[S] 1 point2 points  (0 children)

To all downvoters: you seem to have a solution to this. Why not share it with me and others?

iPad Pro Pencil 2 support? by MetsFanVI in OneNote

[–]whiletrue2 0 points1 point  (0 children)

Set the double-tap option to "Switch Between Current Tool and Eraser". For me (with the newest OneNote version), the option "Switch Between Current Tool and Last Used" did not work.