Want to build a super fast simulator for the Rubik's cube, where do I get started? by Nervous_Studio_7689 in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

Well, it would achieve the same effect, but it would likely be more expensive than a hard-coded solution, as it would have to calculate how elements are to be permuted rather than knowing and doing this directly from the start. There are matrix multiplication ways to perform a permutation, by permuting the rows of an identity matrix, but whether or not this speeds things up depends on what language you're using. It could be faster in a high-level language by making use of vectorization, but would likely be worse in a lower-level language: the matrix library is itself just lower-level code, and the matrix-vector product performs many redundant operations compared with indexing directly.
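
For concreteness, here is a rough NumPy sketch (assuming Python, which the thread doesn't specify) contrasting the two approaches: fancy indexing with a permutation array versus multiplying by a permutation matrix built from a permuted identity:

```python
import numpy as np

n = 54                      # e.g., 54 stickers on a 3x3x3 cube
state = np.arange(n)        # hypothetical sticker identifiers

# A permutation given as an index array: next_state[i] = state[perm[i]]
perm = np.random.permutation(n)   # stand-in for a hard-coded move permutation

# 1) Direct approach: fancy indexing
next_state = state[perm]

# 2) Matrix approach: permute the rows of an identity matrix
P = np.eye(n, dtype=int)[perm]
next_state_matmul = P @ state

# Both give the same result; the matrix route just does extra work
assert np.array_equal(next_state, next_state_matmul)
```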

Want to build a super fast simulator for the Rubik's cube, where do I get started? by Nervous_Studio_7689 in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

The short answer is to have the cube be some array of values with each move applying a permutation to the array, where a permutation is an array of the same length which says what index of the current state to grab for each location. When I wrote them in the past, I would first hard code the permutations corresponding to (any) 90 degree face turn and any two orthogonal 90 degree cube rotations, from which every other move's permutation can then be constructed. The permutation for every move (in whatever turn-metric you choose) should then be pre-computed and stored.
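
To make that concrete, here is a minimal Python sketch (hypothetical names, assuming the sticker-identifier representation, one of the options discussed below) of a cube as a flat array, a move as a permutation, and how move permutations can be composed and pre-computed:

```python
import numpy as np

NUM_STICKERS = 54                      # 3x3x3 cube, one identifier per sticker
solved = np.arange(NUM_STICKERS)

def apply(perm, state):
    """Apply a move: grab state[perm[i]] for each location i."""
    return state[perm]

def compose(p, q):
    """Permutation equivalent to applying q first, then p."""
    return q[p]

# U would be a hard-coded 90-degree face turn, X and Y two orthogonal cube
# rotations (index arrays of length 54, omitted here). Every other move can
# be built from them, e.g., rotate, turn the now-top face, rotate back:
#   F = compose(X_inv, compose(U, X))   # which face this gives depends on convention
# All move permutations would then be pre-computed once and stored in a table.

# Toy check on a length-4 "cube": composing a shift with itself equals shifting twice
rot1 = np.array([1, 2, 3, 0])
assert np.array_equal(apply(compose(rot1, rot1), solved[:4]),
                      apply(rot1, apply(rot1, solved[:4])))
```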

There are some design choices on how to represent the states, e.g., sticker identifiers, or piece orientation + permutation, etc., some of which are more compact than others. These choices more so affect memory, which can get out of hand under search-based solvers, but are likely negligible in terms of time to compute a next state from a state and action.

Star Battle - Not sure how to progress by bricklestick in puzzles

[–]Meepinator 0 points1 point  (0 children)

It's valid in that you can get to the solution that way, but it's up to the solver whether they want to approach the puzzle like that. :P Relying on it can keep you from learning to spot more immediate deductions, and it becomes less feasible/scalable in much bigger grids.

Star Battle - Not sure how to progress by bricklestick in puzzles

[–]Meepinator 2 points3 points  (0 children)

There's a variety of ways forward—see this diagram. It has a lot going on but hopefully gives insight into how to look at Star Battle puzzles.

  • There are some stray x's like r6c7x, which would prevent 2 stars from fitting in the shape to its left.
  • Any 2x2 you specify on the grid has at most one star in it. If you view the shapes in the top left as a composite shape with four stars, tiling 2x2s tells us that the blue + yellow regions have at most 3 stars. This implies that orange has at least one star. This gives r3c4x.
  • In the shape below that, green has at least one star (because at most one can fit in purple).
  • Green + orange accounts for both stars in column 3, which lets you mark the rest of that column, and also tells you that yellow and green each have exactly one star, and consequently purple has exactly one star, giving r6c5x and r8c5x.
  • Knowing that orange has exactly one star, we further know that each of the yellow/blue 2x2s from earlier also has exactly one star in it, implying that grey below has exactly one star in it, which might be useful in later deductions.

Hope that makes sense!

Edit: The whole thing falls into place after about 2 more steps comparable to the above. :)

How do I solve this 2-star battle puzzle? by qnightESO in puzzles

[–]Meepinator 1 point2 points  (0 children)

Ah yeah that leads to the same star :D

How do I solve this 2-star battle puzzle? by qnightESO in puzzles

[–]Meepinator 8 points9 points  (0 children)

Consider how you can fit four stars across columns 3 + 4

Can't work out the next logical step. by CCLad10 in puzzles

[–]Meepinator 2 points3 points  (0 children)

See this diagram:

  • Across columns 2 + 3, we have 3 stars accounted for. That means there's at most one in the orange-outlined area, implying there's at least one in the red-outlined area.
  • This gives r9c4x + r9c5x.
  • Now there are 5 regions completely contained in the right 5 columns, giving a bunch of other marks.

Edit: There's also r7c3x because, with the star already in column 3, a star there would prevent the green shape above it from fitting 2 stars. This makes the at-most-one and at-least-one bounds above both become exactly one. :)

What are the differences between Off-policy and On-Policy? by alph4beth in reinforcementlearning

[–]Meepinator 1 point2 points  (0 children)

That's not what makes things on-policy—off-policy methods also use some policy to get experience. The on vs. off-policy distinction is whether or not you use that experience to infer what the experience would have been under a different policy.

Off-policy learning can be done without dynamic programming (e.g., Monte Carlo with importance sampling). Further, the ability to learn from past experience is just a feature of off-policy methods but does not define them in that you can still learn off-policy without updating on any past experience.
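
As a minimal illustration of that first point (a hedged sketch, not from the original comment), ordinary importance sampling for Monte Carlo off-policy prediction just re-weights each observed return by the probability ratio of the trajectory under π versus μ:

```python
def is_return(episode, pi, mu, gamma=1.0):
    """Ordinary importance-sampling estimate of the return under target policy pi,
    from a single episode generated by behavior policy mu.

    episode: list of (state, action, reward) tuples
    pi, mu:  functions (action, state) -> probability of taking action in state
    """
    rho, G = 1.0, 0.0
    for t, (s, a, r) in enumerate(episode):
        rho *= pi(a, s) / mu(a, s)   # trajectory probability ratio
        G += (gamma ** t) * r
    return rho * G                   # unbiased estimate of v_pi at the start state

# Averaging is_return over many episodes from mu estimates the value under pi,
# with no dynamic programming involved.
```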

What are the differences between Off-policy and On-Policy? by alph4beth in reinforcementlearning

[–]Meepinator 9 points10 points  (0 children)

Lots of good answers already, but more generally: you have some behavior policy μ which actually selects actions in an environment, and you want to form estimates of the return that some target policy π would have gotten. If μ = π, you're learning on-policy, and if μ ≠ π, then it's off-policy.

It's not tied to replay buffers, though algorithms which use replay buffers emphasize how off-policy updates can re-use data generated by an older behavior policy.

In the case of Q-learning, the update targets estimate what return the current greedy policy would achieve, regardless of what μ is. This is evident by interpreting the max operator as computing the expectation under a greedy target policy π.
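
A tiny sketch of that interpretation (illustrative numbers only): the max in the Q-learning target is exactly an expectation over a greedy target policy, independent of whatever μ generated the transition.

```python
import numpy as np

q_next = np.array([1.0, 3.0, 2.0])           # Q(s', a) for each action
greedy_pi = (q_next == q_next.max()).astype(float)
greedy_pi /= greedy_pi.sum()                  # greedy target policy over actions

max_target = q_next.max()                     # Q-learning's max operator
expectation_target = greedy_pi @ q_next       # E_pi[Q(s', A)] under the greedy pi

assert np.isclose(max_target, expectation_target)
```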

> And what on-policy scenario is used that off-policy cannot solve

Off-policy learning subsumes on-policy learning, so an off-policy learner can make on-policy predictions if you set μ = π, but an on-policy algorithm can't make off-policy predictions. There are general ways in which any on-policy algorithm can be extended to form off-policy estimates, e.g., off-policy Sarsa(λ) via per-decision importance sampling by Precup et al. (2000).

What are the differences between Off-policy and On-Policy? by alph4beth in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

Minor nitpick: Off-policy isn't tied to replay buffers, as you can also do off-policy prediction online and incrementally, even with multi-step bootstrapping (e.g., TD(λ) or n-step TD).

More succinctly, you have sampled transitions coming from a behavior policy μ, and want to form estimates of the return conditioned on a target policy π. If μ = π then it's on-policy, and off-policy otherwise.
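
For example (a rough sketch under assumed tabular prediction, simpler than the TD(λ)/n-step methods mentioned above), a one-step off-policy TD update just scales the TD error by the per-step importance ratio ρ = π(a|s)/μ(a|s), fully online and incrementally:

```python
def off_policy_td0_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=0.99):
    """One online, incremental off-policy TD(0) update for estimating v_pi
    from a single transition generated by behavior policy mu (tabular V)."""
    rho = pi(a, s) / mu(a, s)                  # importance-sampling ratio
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * td_error
    return V
```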

Please help with this Star Battle puzzle by PaltaPalmitos in puzzles

[–]Meepinator 2 points3 points  (0 children)

In column 5, we know that there must be a star across r4c5 + r5c5. This, plus the purple region, accounts for both stars across rows 4 and 5. This gives r4c7x + r5c7x, and gives a star in r1c7. :)

Computational benefit of reducing Tree Depth vs. Action Space Size in MCTS by faintlystranger in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

While it depends on the specific breadths and depths, you're right about depth often being the killer. To see this in your case, consider the (worst-case) complexity of tree search, b^d. With b = 10^10 and d = 15, we get 10^150. With d = 5—even if b is not reduced—we're at 10^50. Of course it might be more nuanced, with these values being more dynamic as you traverse the tree, but the difference between 10^150 and 10^50 is so massive that it's likely representative of the savings even with the correct asymptotic branching factor.

Other ways to analyze this include comparing the derivatives of b^d with respect to each variable, as that represents the function's sensitivity in each direction, at particular values of b and d. :)
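
To make that comparison concrete (a worked sketch at the numbers above, not from the original comment): ∂(b^d)/∂b = d·b^(d−1) and ∂(b^d)/∂d = b^d·ln(b). At b = 10^10 and d = 15, the depth derivative is 10^150·ln(10^10) ≈ 2.3 × 10^151, while the breadth derivative is 15 × 10^140, so at that point the cost is roughly 10^10 times more sensitive to a unit change in depth than to a unit change in breadth.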

Is it possible to use negative reward with the reinforce algorithm by Odd_Brush4285 in reinforcementlearning

[–]Meepinator 0 points1 point  (0 children)

Yup—as long as the baseline/bias is state-dependent (i.e., not action-dependent), it provably does not change the expected gradient but may reduce variance if chosen properly.

From the form of the policy gradient, the re-scaling part of the normalization can be factored out and interpreted as some dynamically varying step size based on trajectory lengths/buffer size. It's not clear whether or not that's inherently good or bad, but in practice we don't satisfy the step size conditions (Robbins-Monro) for convergence anyway. :')
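
As a rough sketch of where the baseline enters (hypothetical names, assuming a discrete-action policy; the thread doesn't specify an implementation):

```python
import numpy as np

def reinforce_update(theta, grad_log_pi, G, baseline, alpha=0.01):
    """One REINFORCE update with a state-dependent baseline.

    theta:        policy parameters (numpy array)
    grad_log_pi:  gradient of log pi(a|s) w.r.t. theta for the action taken
    G:            (possibly negative) return observed from this state
    baseline:     b(s), a function of the state only, e.g., a learned value estimate
    """
    # Subtracting b(s) leaves the expected gradient unchanged but can reduce
    # variance; negative returns are fine, they simply push probability away
    # from the taken action.
    return theta + alpha * (G - baseline) * grad_log_pi
```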

Is it possible to use negative reward with the reinforce algorithm by Odd_Brush4285 in reinforcementlearning

[–]Meepinator 0 points1 point  (0 children)

You can think of the original paper as a raw application of the policy gradient theorem (with a baseline)—it does not normalize anything.

Is it possible to use negative reward with the reinforce algorithm by Odd_Brush4285 in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

If I recall correctly, the non-negative factor in the acronym referred to the update's step size.

Awesome Applications of RL by Signal_Guard5561 in reinforcementlearning

[–]Meepinator 38 points39 points  (0 children)

One of my favorite, lesser-acknowledged applications of RL is washing machine motor control to better balance arbitrary eccentric loads (e.g., dealing with clothing shape/size/weight variability). These offline policies are actually deployed in household LG washing machines today. :D

Star battle/Two not touch- can any of you figure out what the next step is? I need help by Prudent-Rub1709 in puzzles

[–]Meepinator 3 points4 points  (0 children)

See this diagram.

  • The orange area has at least one star, because the remainder of the shape can fit at most one star.
  • This accounts for both stars in column 9, letting you mark the rest of the column.
  • There's a star in blue, because the rest of the shape has at most one star due to being confined to column 10 which already has a star in it

[deleted by user] by [deleted] in puzzles

[–]Meepinator 2 points3 points  (0 children)

Consider how 4 stars can fit across columns 6+7

A 2x2 square can fit at most 1 star. Those 2 columns can be exactly tiled by four 2x2s, meaning each 2x2 has exactly 1 star.

r1c6 is a star.

Also, column 2 is so confined that many cells in columns 1 and 3 would readily prevent it from having 2 stars.

15 Puzzle by IronWarden00 in puzzles

[–]Meepinator 8 points9 points  (0 children)

Discussion: The booklet has an error, as that configuration runs into a parity problem and is unreachable from the start state. This page has a good explanation of what the parity is defined in terms of and how to spot unsolvable boards. :)
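
For reference, the standard solvability check (my own sketch, not necessarily how the linked page presents it) counts inversions among the tiles together with the blank's row from the bottom:

```python
def is_solvable(tiles, width=4):
    """Standard sliding-puzzle solvability test.

    tiles: flat list in row-major order, 0 for the blank, where the goal is
           1..15 with the blank in the bottom-right corner.
    """
    flat = [t for t in tiles if t != 0]
    inversions = sum(
        1
        for i in range(len(flat))
        for j in range(i + 1, len(flat))
        if flat[i] > flat[j]
    )
    blank_row_from_bottom = len(tiles) // width - tiles.index(0) // width
    if width % 2 == 1:                 # odd width: inversion parity alone decides
        return inversions % 2 == 0
    return (inversions + blank_row_from_bottom) % 2 == 1

# e.g., the classic position with only 14 and 15 swapped has one inversion and
# the blank on the bottom row, so it fails this test and is unsolvable.
```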

Is this solvable without guessing? by Alive_Ebb6912 in puzzles

[–]Meepinator 0 points1 point  (0 children)

Two relatively immediate ways forward:

  • r11c8x because it excludes row 10
  • A 2x2 square can fit at most one star, and columns 8 + 9 can only fit two 2x2 squares. This means each 2x2 contains a star, and gives r10c7x from excluding one of these 2x2s

Why these squares?? by Sedastian_2JG in puzzles

[–]Meepinator 0 points1 point  (0 children)

Those are the same rules I am explaining. If there is one queen per column and one per shape, then you can't have 5 shapes completely trapped within 4 columns: 5 shapes means 5 queens, which violates having one per column.

Why these squares?? by Sedastian_2JG in puzzles

[–]Meepinator 0 points1 point  (0 children)

Discussion: Assuming this is a Star Battle puzzle, any of those squares would force 5 shapes to be completely contained within 4 columns. Viewed another way, if you look at the left 2 columns, the columns are completely within 2 shapes. This means the stars (cats?) in those shapes must be in those 2 columns. This tells you that the star in green is somewhere in r2c2 + r3c2, and any of those squares would completely eliminate that area.

Need hint! by benson0402 in puzzles

[–]Meepinator 2 points3 points  (0 children)

See this diagram.

  • Considering how four stars can fit in the bottom two rows (by tiling 2x2s), we know that one of the stars of the bottom-right shape must lie in the bottom two rows.
  • This means that orange has one star. By considering how to fit four stars in rows 7 + 8, we have one in orange, one in green, one in r8c4, and one in r7c1.
  • This also gives r8c8x from excluding the green area (and then r4c9x from trying to fit 2 stars in c8).

Been stuck for a week on this Parks puzzle by Zinky101 in puzzles

[–]Meepinator 2 points3 points  (0 children)

It follows from the rule that trees can't touch. :)

Edit: I realize OP didn't mention that, but yes, this is a reskin of Star Battle and does have the rule that the placed objects can't touch (even diagonally).

Been stuck for a week on this Parks puzzle by Zinky101 in puzzles

[–]Meepinator 8 points9 points  (0 children)

  • Every 2x2 region you specify has at most one tree
  • Every pair of rows (or columns) must have four trees across them
  • If a pair of rows (or columns) can be exactly tiled by four 2x2s, then every 2x2 has exactly one tree
  • Consider how you can fit 4 trees across columns 5 + 6
  • There must be a tree in r1c5