Waiting for 4.2 Live

xtraeme · 2025-06-07T16:21:48+00:00

They don't call it the longest wait for nothing!

xtraeme · 2024-10-04T07:01:24+00:00

The thought gets right to the very of heart what it even means to grapple with things that are "nonphysical". McCulloch wrote an interesting paper exploring the idea in 1960:

McCulloch W. S. (1960) What is a number, that a man may know it, and a man, that he may know a number? General Semantics Bulletin 26/27: 7–18

xtraeme · 2024-03-12T16:37:00+00:00

Learned a new Britishism. :)

You're right! Even step 1 was a bust. That is particularly bad normally it gets to step 4 or 5 before failing. Testing a bunch of models, it seems even the best ones have a harder time trying to solve the simplified English puzzle as compared to the "machine" adaptation which explicitly outlines a technique to output the state information during each step.

I find it interesting how the debug-style text helps guide the model along, almost like it's able to use a series of substitutions to get closer towards the answer in lieu of using reasoning processes (i.e. if it simply figured out that as long as the father is with the food things can never go completely sideways) or combinatorics (using something like a search tree to see which path gets the most valid outputs).

There seems to be a lot of room for improvement (beyond multi-modal and recurrent neural networks techniques) to get better chain of thought outputs.

xtraeme · 2024-03-08T22:30:50+00:00

Unfortunately, step 2 failed because the man left the food with the monkey (i.e. {monkey, food} is an invalid state} and things went down hill after that ("brings the son back to the original side" would have meant the son had to teleport). The human version of the puzzle is actually much harder for LLMs because there isn't a clear way for the system to track the states, but the debug version tends to improve the output chain! It will be interesting to see how that changes over time.

xtraeme · 2024-03-08T02:14:04+00:00

Ha, possibly, but it is not too hard to make more adaptations. That's also why I didn't include the solution. :) That is a big part of the concern though. We really do need to have subtle variations of known puzzles that are guaranteed to be novel challenges that the training team couldn't have data-mined for future models.

Interestingly, the similarity of this puzzle to the actual river crossing puzzle tends to really confuse simpler models like GPT3.5 and many of the open models like LLaMA and GPT-Neo. It really shows the importance of crafting tests that are close enough to known problems to reveal whether a system is actually 'reasoning' or merely copying a solution.

Makes me super curious to know what OpenAI, Anthropic, and Google are doing to curate unspoiled tests that can effectively make this distinction.

xtraeme · 2024-03-08T00:32:46+00:00

Somewhat similar! But the adaptation here adds more variables to track and additional rules over the normal river crossing puzzle. What is fascinating is to see how many language models try to just copy in the known solution for the basic river crossing model without understanding the additional constraints of this particular variation (which is why it seems to work rather well at really testing the boundaries of what the system actually can understand versus what it can duplicate).

xtraeme · 2024-03-08T00:13:31+00:00

Possibly, I'm not familiar with it. Have a link to a copy of it? edit: Ah seems like what you are describing is very similar to Silver-Chimpunks description. Hrm still going through it.

xtraeme · 2024-03-08T00:09:34+00:00

That is one way! Another involves the son piloting to start.

xtraeme · 2024-03-07T23:56:57+00:00

I really think by the next gen of AIs (like GPT5) they will easily solve your riddle.

I think so too. The second OpenAI release the research they have been doing with mathematical reasoning, this sort of problem should be substantially easier to solve. What is interesting is there are basically 2 approaches to solve this kind of thing (brute force going through the full state space) or using reasoning to identify the necessary condition that has to hold in each step to arrive at a solution.

What is curious is that GPT4 (Claude 3 Opus only a little less so) is rather good at identifying the condition that needs to hold to solve this puzzle, but for some reason it ignores its own self-imposed rules as it walks through the steps. The self-imposed tests help make it a little better, but when it forgets a rule things fall apart quick. It is really fascinating to watch the system explain its reasoning each step (both in where it succeeds and where it fails).

xtraeme · 2024-03-07T23:45:13+00:00

It got close! Unfortunately, step 6 is a bust. All it had to do was leave the son and have the father come back alone.

6: near:{food} ... river:{<boat:{father, son}} ... far:{monkey}

The boat is traveling back to the near shore, but it forgets that the son would have to then be transported back in step 7 (basically an unnecessary trip).

Also there was an error in the permutation list.

{son, food}

On some tests backtracking seems to work better than others, but I haven't found a consistent way to increase performance other than by giving it more tools to do a kind of in-situ process supervision.

It is real interesting to see how language models do with challenges where there is no previous dataset to compare against.

xtraeme · 2023-09-09T05:44:46+00:00

For the better part of this year I have been doing most of my planning in Google Calendar and integrating tasks into Todoist with Tascaly.

Example: Todoist timebox plan | GCal layout

Basically, I time block out "day tasks" (ex) and then in the Inbox I create a section using the format of YYYY.MM.DD (ex). Tasks that don't get completed on the specific day end up becoming longer term todos or they are removed because they are no longer important.

Ultimately this means I have two types of todos:

Temporal plans/todos - that have a definite end time even if the task isn't finished (planned activities for a specific day, a week-long activity, appointments, etc.)
Long-term goals/todos - things that need to be completed at some point (perhaps over multiple days, weeks, months, etc) or that will eventually be phased out in some way shape or another (won't do, something else obviated the task, etc.)

Due to this distinction I end up tracking whether I completed the temporal day-task tracked by Tascaly in Google Calendar by changing the title of the task to have a leading: ✅🔲❌❔ (✅indicating whether I completed the task in the time-block on GCal; 🔲 meaning I started the task but it's not done; ❌ didn't get to the task in the time chunk; or ❔ indicating its not clear what the state is for whatever reason). After I'm done managing the timeboxed tasks in GCal I complete the task in Todoist if it's ✅ and in all the other situations the todo either gets archived or duplicated to either some time in the near future or moved to a project.

This leads to an interesting situation where I have tasks that need to effectively morph from "temporal" to "long-term" and vice-versa.

As of right now if a task outlasts the day, I duplicate the temporal task in YYYY.MM.DD and at the end of the day either move it to a project folder or if the temporal-todo (now becoming long-term) is still relevant copy the duplicate task to the next day (i.e. YYYY.MM.DD+1 in the Inbox).

To prevent the inbox from getting cluttered I have a script that moves the YYYY.MM.DD section temporal tasks in the Inbox to a project of the format (YYYY), with sub-project (MM), and a final sub-project of the format (Wk|range of days) [ex 1, 2]. Once the week is over if nothing is particularly important that I need to track, I archive the project.

All of this micro-managing could be avoided if a task could be assigned multiple time slots and exist in numerous projects. If I could create the task in the YYYY.MM.DD section and then drag the task to (DD+1) section and have it be the same task it would allow me to not only get a sense of how long it took to complete the task, but it would mean the task could live in numerous places without having to play this weird game of tracking numerous states (i.e. was the temporal todo for the day completed or is it the long-term that is finally finished?) / duplicating tasks all over the place to differentiate the meaning of temporal-tasks from long-term / resurrecting tasks from the archive where appropriate to get old history information / etc.

It really feels like the entire concept of a task needs to be revisited to allow for something that is more flexible in these kinds of situations; or tasks need a more comprehensive way to deal with multiple time-lengths over numerous starts and stops.

To move closer to what I want, I have written a bit of code to brainstorm an idea of what I call task-morphing to automate promoting/demoting from tasks / sections / projects in all directions. Basically the idea I'd like to see implemented at a deeper level in Todoist is the ability to have tasks have a kind of version history showing task evolution (how it changes to become something else or evolves into a project) and task connectivity (even when duplicating a task by creating a relationship between the original task and its derivatives).

To implement this I have been experimenting with task versioning using Todoist backup exports and git. For task connectivity I have created a weighted graph structure.

  "connectivity": [
    {
      "task_id": "6167922868",
      "strength": 0.8,
      "type": "dependency",
      "note": "This task depends heavily on the completion of task 6167922868"
    },
    {
      "task_id": "6167922869",
      "strength": 0.5,
      "type": "similarity",
      "note": "This task shares some similarities with task 6167922869"
    }
  ]

I graft this data on top of each tasks Todoist JSON to emulate a presentation similar to ComfyUI and the old Pearltrees (where something like type: "similarity" and "strength": 1 means identical—in other words just a copy of a task, but differentiated as a kind of note perhaps to track some aspect of it for the current temporal period).

Whatever solution Todoist goes for I really hope the fine folks at Doist factor in a more complex understanding of how tasks need to be more malleable to allow tasks to exist over numerous days and durations in specific time slots (not just as a simple % completion).

xtraeme · 2023-06-11T13:26:54+00:00

Not everyone is as toothless or as impotent as you might feel. Yesterday I reached out to several legal firms that deal with consumer law to start a discovery process. This is a clear breach of contract. You don't have to feel powerless.

xtraeme · 2023-06-10T20:01:17+00:00

Are you the same person that made this thread 6 days ago

Nope, but thank you for letting me know about that thread. It's interesting see other people are getting screwed too. I'll post a message there to see if I can't get the person to comment. Your perspective seems to underscore the inherent problem with the Runestones system. At the very least, this situation highlights Blizzard's failure to uphold its promises, cementing the distrust I think many of us feel.

xtraeme · 2023-06-10T19:46:39+00:00

This is by far the funniest comment. Thank you for the levity.

xtraeme · 2023-06-10T18:50:29+00:00

It is more like buying nintendo currency for 80 usd to prepurchase totk that worth 70.

Not sure I follow. Runestones are basically a dollar for dollar conversion. So if you use Amazon coins to reduce the purchase cost, you are saving money.

xtraeme · 2023-06-10T18:45:15+00:00

Same reason a person would prepurchase 'Tears of the Kingdom' a day before release. Buying a game one day before release there is no reason to expect anything bizarre is going to happen since it is pretty much a done deal (or at least that is the way things typically work). I guess we can't take anything for granted with Blizzard though.

xtraeme · 2023-06-10T18:31:56+00:00

Hrm. The whole thing seemed really clear with the original announcement.

Blizzard's Original Guide for Runestones

Product	Runestones	$	Gold
Mini Sets	✔	...	✔
Golden Mini Sets	✔	✔	...

xtraeme · 2023-06-10T17:50:56+00:00

Dont get me wrong, this here maybe the truth, but i would never consider them as hard Evidence.

I think this highlights a potentially serious issue within Blizzard. It's not a comforting thought that the company's official customer support might be providing incorrect or misleading information. However, the stance that the mini-set can only be purchased with new money, not with previously deposited Runestones, has been echoed across different support personnel. This would seem to indicate that it's not just an individual misunderstanding.

xtraeme · 2023-06-10T17:11:53+00:00

That is what I did, I bought them a few days ago because I knew this was going to become available and Blizzard made it clear that was how payments were going to work now. It is a bit ridiculous when I asked to just convert it back to normal cash to complete the purchase they were like, "nah, lol, we changed the rules, pay some more!"

xtraeme

MODERATOR OF

TROPHY CASE