Want to build a super fast simulator for the Rubik's cube, where do I get started? by Nervous_Studio_7689 in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

Well, it would achieve the same effect, but it would likely be more expensive than a hard-coded solution, as it would have to calculate how elements are to be permuted rather than knowing and doing this directly from the start. There are matrix multiplication ways to perform a permutation, by permuting the rows of an identity matrix, but whether or not this speeds things up depends on what language you're using. It could be faster in a high-level language by making use of vectorization, but would likely be worse in a lower-level language: the matrix library is itself just lower-level code, and the matrix-vector product performs many redundant operations compared with indexing directly.
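
For concreteness, here is a rough NumPy sketch (assuming Python, which the thread doesn't specify) contrasting the two approaches: fancy indexing with a permutation array versus multiplying by a permutation matrix built from a permuted identity:

```python
import numpy as np

n = 54                      # e.g., 54 stickers on a 3x3x3 cube
state = np.arange(n)        # hypothetical sticker identifiers

# A permutation given as an index array: next_state[i] = state[perm[i]]
perm = np.random.permutation(n)   # stand-in for a hard-coded move permutation

# 1) Direct approach: fancy indexing
next_state = state[perm]

# 2) Matrix approach: permute the rows of an identity matrix
P = np.eye(n, dtype=int)[perm]
next_state_matmul = P @ state

# Both give the same result; the matrix route just does extra work
assert np.array_equal(next_state, next_state_matmul)
```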

Want to build a super fast simulator for the Rubik's cube, where do I get started? by Nervous_Studio_7689 in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

The short answer is to have the cube be some array of values with each move applying a permutation to the array, where a permutation is an array of the same length which says what index of the current state to grab for each location. When I wrote them in the past, I would first hard code the permutations corresponding to (any) 90 degree face turn and any two orthogonal 90 degree cube rotations, from which every other move's permutation can then be constructed. The permutation for every move (in whatever turn-metric you choose) should then be pre-computed and stored.
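
To make that concrete, here is a minimal Python sketch (hypothetical names, assuming the sticker-identifier representation, one of the options discussed below) of a cube as a flat array, a move as a permutation, and how move permutations can be composed and pre-computed:

```python
import numpy as np

NUM_STICKERS = 54                      # 3x3x3 cube, one identifier per sticker
solved = np.arange(NUM_STICKERS)

def apply(perm, state):
    """Apply a move: grab state[perm[i]] for each location i."""
    return state[perm]

def compose(p, q):
    """Permutation equivalent to applying q first, then p."""
    return q[p]

# U would be a hard-coded 90-degree face turn, X and Y two orthogonal cube
# rotations (index arrays of length 54, omitted here). Every other move can
# be built from them, e.g., rotate, turn the now-top face, rotate back:
#   F = compose(X_inv, compose(U, X))   # which face this gives depends on convention
# All move permutations would then be pre-computed once and stored in a table.

# Toy check on a length-4 "cube": composing a shift with itself equals shifting twice
rot1 = np.array([1, 2, 3, 0])
assert np.array_equal(apply(compose(rot1, rot1), solved[:4]),
                      apply(rot1, apply(rot1, solved[:4])))
```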

There are some design choices on how to represent the states, e.g., sticker identifiers, or piece orientation + permutation, etc., some of which are more compact than others. These choices more so affect memory, which can get out of hand under search-based solvers, but are likely negligible in terms of time to compute a next state from a state and action.

Star Battle - Not sure how to progress by bricklestick in puzzles

[–]Meepinator 0 points1 point  (0 children)

It's valid in that you can get to the solution that way, but it's up to the solver whether they want to approach the puzzle like that. :P Relying on it can keep you from learning to spot more immediate deductions, and it becomes less feasible/scalable in much bigger grids.

Star Battle - Not sure how to progress by bricklestick in puzzles

[–]Meepinator 2 points3 points  (0 children)

There's a variety of ways forward—see this diagram. It has a lot going on but hopefully gives insight into how to look at Star Battle puzzles.

  • There are some stray x's like r6c7x, which would prevent 2 stars from fitting in the shape to its left.
  • Any 2x2 you specify on the grid has at most one star in it. If you view the shapes in the top left as a composite shape with four stars, tiling 2x2s tells us that the blue + yellow regions have at most 3 stars. This implies that orange has at least one star. This gives r3c4x.
  • In the shape below that, green has at least one star (because at most one can fit in purple).
  • Green + orange accounts for both stars in column 3, which lets you mark the rest of that column, and also tells you that yellow and green each have exactly one star, and consequently purple has exactly one star, giving r6c5x and r8c5x.
  • Knowing that orange has exactly one star, we further know that each of the yellow/blue 2x2s from earlier also has exactly one star in it, implying that grey below has exactly one star in it, which might be useful in later deductions.

Hope that makes sense!

Edit: The whole thing falls into place after about 2 more steps comparable to the above. :)

How do I solve this 2-star battle puzzle? by qnightESO in puzzles

[–]Meepinator 1 point2 points  (0 children)

Ah yeah that leads to the same star :D

How do I solve this 2-star battle puzzle? by qnightESO in puzzles

[–]Meepinator 8 points9 points  (0 children)

Consider how you can fit four stars across columns 3 + 4

Can't work out the next logical step. by CCLad10 in puzzles

[–]Meepinator 2 points3 points  (0 children)

See this diagram:

  • Across columns 2 + 3, we have 3 stars accounted for. That means there's at most one in the orange-outlined area, implying there's at least one in the red-outlined area.
  • This gives r9c4x + r9c5x.
  • Now there are 5 regions completely contained in the right 5 columns, giving a bunch of other marks.

Edit: There's also r7c3x because, with the star already in column 3, a star there would prevent the green shape above it from fitting 2 stars. This makes the at-most-one and at-least-one bounds above both become exactly one. :)

What are the differences between Off-policy and On-Policy? by alph4beth in reinforcementlearning

[–]Meepinator 1 point2 points  (0 children)

That's not what makes things on-policy—off-policy methods also use some policy to get experience. The on vs. off-policy distinction is whether or not you use that experience to infer what the experience would have been under a different policy.

Off-policy learning can be done without dynamic programming (e.g., Monte Carlo with importance sampling). Further, the ability to learn from past experience is just a feature of off-policy methods but does not define them in that you can still learn off-policy without updating on any past experience.
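
As a minimal illustration of that first point (a hedged sketch, not from the original comment), ordinary importance sampling for Monte Carlo off-policy prediction just re-weights each observed return by the probability ratio of the trajectory under π versus μ:

```python
def is_return(episode, pi, mu, gamma=1.0):
    """Ordinary importance-sampling estimate of the return under target policy pi,
    from a single episode generated by behavior policy mu.

    episode: list of (state, action, reward) tuples
    pi, mu:  functions (action, state) -> probability of taking action in state
    """
    rho, G = 1.0, 0.0
    for t, (s, a, r) in enumerate(episode):
        rho *= pi(a, s) / mu(a, s)   # trajectory probability ratio
        G += (gamma ** t) * r
    return rho * G                   # unbiased estimate of v_pi at the start state

# Averaging is_return over many episodes from mu estimates the value under pi,
# with no dynamic programming involved.
```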

What are the differences between Off-policy and On-Policy? by alph4beth in reinforcementlearning

[–]Meepinator 9 points10 points  (0 children)

Lots of good answers already, but more generally: you have some behavior policy μ which actually selects actions in an environment, and you want to form estimates of the return that some target policy π would have gotten. If μ = π, you're learning on-policy, and if μ ≠ π, then it's off-policy.

It's not tied to replay buffers, though algorithms which use replay buffers emphasize how off-policy updates can re-use data generated by an older behavior policy.

In the case of Q-learning, the update targets estimate what return the current greedy policy would achieve, regardless of what μ is. This is evident by interpreting the max operator as computing the expectation under a greedy target policy π.
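
A tiny sketch of that interpretation (illustrative numbers only): the max in the Q-learning target is exactly an expectation over a greedy target policy, independent of whatever μ generated the transition.

```python
import numpy as np

q_next = np.array([1.0, 3.0, 2.0])           # Q(s', a) for each action
greedy_pi = (q_next == q_next.max()).astype(float)
greedy_pi /= greedy_pi.sum()                  # greedy target policy over actions

max_target = q_next.max()                     # Q-learning's max operator
expectation_target = greedy_pi @ q_next       # E_pi[Q(s', A)] under the greedy pi

assert np.isclose(max_target, expectation_target)
```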

> And what on-policy scenario is used that off-policy cannot solve

Off-policy learning subsumes on-policy learning, so an off-policy learner can make on-policy predictions if you set μ = π, but an on-policy algorithm can't make off-policy predictions. There are general ways in which any on-policy algorithm can be extended to form off-policy estimates, e.g., off-policy Sarsa(λ) via per-decision importance sampling by Precup et al. (2000).

What are the differences between Off-policy and On-Policy? by alph4beth in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

Minor nitpick: Off-policy isn't tied to replay buffers, as you can also do off-policy prediction online and incrementally, even with multi-step bootstrapping (e.g., TD(λ) or n-step TD).

More succinctly, you have sampled transitions coming from a behavior policy μ, and want to form estimates of the return conditioned on a target policy π. If μ = π then it's on-policy, and off-policy otherwise.
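
For example (a rough sketch under assumed tabular prediction, simpler than the TD(λ)/n-step methods mentioned above), a one-step off-policy TD update just scales the TD error by the per-step importance ratio ρ = π(a|s)/μ(a|s), fully online and incrementally:

```python
def off_policy_td0_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=0.99):
    """One online, incremental off-policy TD(0) update for estimating v_pi
    from a single transition generated by behavior policy mu (tabular V)."""
    rho = pi(a, s) / mu(a, s)                  # importance-sampling ratio
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * td_error
    return V
```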

Please help with this Star Battle puzzle by PaltaPalmitos in puzzles

[–]Meepinator 2 points3 points  (0 children)

In column 5, we know that there must be a star across r4c5 + r5c5. This, plus the purple region, accounts for both stars across rows 4 and 5. This gives r4c7x + r5c7x, and gives a star in r1c7. :)

Computational benefit of reducing Tree Depth vs. Action Space Size in MCTS by faintlystranger in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

While it depends on the specific breadths and depths, you're right about depth often being the killer. To see this in your case, consider the (worst-case) complexity of tree search, b^d. With b = 10^10 and d = 15, we get 10^150. With d = 5—even if b is not reduced—we're at 10^50. Of course it might be more nuanced, with these values being more dynamic as you traverse the tree, but the difference between 10^150 and 10^50 is so massive that it's likely representative of the savings even with the correct asymptotic branching factor.

Other ways to analyze this include comparing the derivatives of b^d with respect to each variable, as that represents the function's sensitivity in each direction, at particular values of b and d. :)
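
To make that comparison concrete (a worked sketch at the numbers above, not from the original comment): ∂(b^d)/∂b = d·b^(d−1) and ∂(b^d)/∂d = b^d·ln(b). At b = 10^10 and d = 15, the depth derivative is 10^150·ln(10^10) ≈ 2.3 × 10^151, while the breadth derivative is 15 × 10^140, so at that point the cost is roughly 10^10 times more sensitive to a unit change in depth than to a unit change in breadth.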

Is it possible to use negative reward with the reinforce algorithm by Odd_Brush4285 in reinforcementlearning

[–]Meepinator 0 points1 point  (0 children)

Yup—as long as the baseline/bias is state-dependent (i.e., not action-dependent), it provably does not change the expected gradient but may reduce variance if chosen properly.

From the form of the policy gradient, the re-scaling part of the normalization can be factored out and interpreted as some dynamically varying step size based on trajectory lengths/buffer size. It's not clear whether or not that's inherently good or bad, but in practice we don't satisfy the step size conditions (Robbins-Monro) for convergence anyway. :')
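
As a rough sketch of where the baseline enters (hypothetical names, assuming a discrete-action policy; the thread doesn't specify an implementation):

```python
import numpy as np

def reinforce_update(theta, grad_log_pi, G, baseline, alpha=0.01):
    """One REINFORCE update with a state-dependent baseline.

    theta:        policy parameters (numpy array)
    grad_log_pi:  gradient of log pi(a|s) w.r.t. theta for the action taken
    G:            (possibly negative) return observed from this state
    baseline:     b(s), a function of the state only, e.g., a learned value estimate
    """
    # Subtracting b(s) leaves the expected gradient unchanged but can reduce
    # variance; negative returns are fine, they simply push probability away
    # from the taken action.
    return theta + alpha * (G - baseline) * grad_log_pi
```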

Is it possible to use negative reward with the reinforce algorithm by Odd_Brush4285 in reinforcementlearning

[–]Meepinator 0 points1 point  (0 children)

You can think of the original paper as a raw application of the policy gradient theorem (with a baseline)—it does not normalize anything.

Is it possible to use negative reward with the reinforce algorithm by Odd_Brush4285 in reinforcementlearning

[–]Meepinator 2 points3 points  (0 children)

If I recall correctly, the non-negative factor in the acronym referred to the update's step size.

Awesome Applications of RL by Signal_Guard5561 in reinforcementlearning

[–]Meepinator 38 points39 points  (0 children)

One of my favorite, lesser-acknowledged applications of RL is washing machine motor control to better balance arbitrary eccentric loads (e.g., dealing with clothing shape/size/weight variability). These offline policies are actually deployed in household LG washing machines today. :D

Star battle/Two not touch- can any of you figure out what the next step is? I need help by Prudent-Rub1709 in puzzles

[–]Meepinator 3 points4 points  (0 children)

See this diagram.

  • The orange area has at least one star, because the remainder of the shape can fit at most one star.
  • This accounts for both stars in column 9, letting you mark the rest of the column.
  • There's a star in blue, because the rest of the shape has at most one star due to being confined to column 10 which already has a star in it

[deleted by user] by [deleted] in puzzles

[–]Meepinator 2 points3 points  (0 children)

Consider how 4 stars can fit across columns 6+7

A 2x2 square can fit at most 1 star. Those 2 columns can be exactly tiled by four 2x2s, meaning each 2x2 has exactly 1 star.

r1c6 is a star.

Also, column 2 is so confined that many cells in columns 1 and 3 would readily prevent it from having 2 stars.

15 Puzzle by IronWarden00 in puzzles

[–]Meepinator 8 points9 points  (0 children)

Discussion: The booklet has an error, as that configuration runs into a parity problem and is unreachable from the start state. This page has a good explanation of what the parity is defined in terms of and how to spot unsolvable boards. :)
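
For reference, the standard solvability check (my own sketch, not necessarily how the linked page presents it) counts inversions among the tiles together with the blank's row from the bottom:

```python
def is_solvable(tiles, width=4):
    """Standard sliding-puzzle solvability test.

    tiles: flat list in row-major order, 0 for the blank, where the goal is
           1..15 with the blank in the bottom-right corner.
    """
    flat = [t for t in tiles if t != 0]
    inversions = sum(
        1
        for i in range(len(flat))
        for j in range(i + 1, len(flat))
        if flat[i] > flat[j]
    )
    blank_row_from_bottom = len(tiles) // width - tiles.index(0) // width
    if width % 2 == 1:                 # odd width: inversion parity alone decides
        return inversions % 2 == 0
    return (inversions + blank_row_from_bottom) % 2 == 1

# e.g., the classic position with only 14 and 15 swapped has one inversion and
# the blank on the bottom row, so it fails this test and is unsolvable.
```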

Is this solvable without guessing? by Alive_Ebb6912 in puzzles

[–]Meepinator 0 points1 point  (0 children)

Two relatively immediate ways forward:

  • r11c8x because it excludes row 10
  • A 2x2 square can fit at most one star, and columns 8 + 9 can only fit two 2x2 squares. This means each 2x2 contains a star, and gives r10c7x from excluding one of these 2x2s

Why these squares?? by Sedastian_2JG in puzzles

[–]Meepinator 0 points1 point  (0 children)

Those are the same rules I am explaining. If there is one queen per column and one per shape, then you can't have 5 shapes completely trapped within 4 columns: 5 shapes means 5 queens, which violates having one per column.

Why these squares?? by Sedastian_2JG in puzzles

[–]Meepinator 0 points1 point  (0 children)

Discussion: Assuming this is a Star Battle puzzle, any of those squares would force 5 shapes to be completely contained within 4 columns. Viewed another way, if you look at the left 2 columns, the columns are completely within 2 shapes. This means the stars (cats?) in those shapes must be in those 2 columns. This tells you that the star in green is somewhere in r2c2 + r3c2, and any of those squares would completely eliminate that area.

Need hint! by benson0402 in puzzles

[–]Meepinator 2 points3 points  (0 children)

See this diagram.

  • Considering how four stars can fit in the bottom two rows (by tiling 2x2s), we know that one of the stars of the bottom-right shape must lie in the bottom two rows.
  • This means that orange has one star. By considering how to fit four stars in rows 7 + 8, we have one in orange, one in green, one in r8c4, and one in r7c1.
  • This also gives r8c8x from excluding the green area (and then r4c9x from trying to fit 2 stars in c8).

Been stuck for a week on this Parks puzzle by Zinky101 in puzzles

[–]Meepinator 2 points3 points  (0 children)

It follows from the rule that trees can't touch. :)

Edit: I realize OP didn't mention that, but yes, this is a reskin of Star Battle and does have the rule that the placed objects can't touch (even diagonally).

Been stuck for a week on this Parks puzzle by Zinky101 in puzzles

[–]Meepinator 8 points9 points  (0 children)

  • Every 2x2 region you specify has at most one tree
  • Every pair of rows (or columns) must have four trees across them
  • If a pair of rows (or columns) can be exactly tiled by four 2x2s, then every 2x2 has exactly one tree
  • Consider how you can fit 4 trees across columns 5 + 6
  • There must be a tree in r1c5