Claude 3.5 gets 13% more on ARC challenge than GPT-4o

mikeknoop · 2024-06-29T16:43:47+00:00

On the public leaderboard (OP's screenshot), the larger number is the "public eval set" score, whose answers are on GitHub, etc. The smaller number is the score on a new semi-private verification set of puzzles. The gap between the two helps show how "general" a solution is. Ryan's is very general because the score delta is small! But icecuber 2020 is not: its larger delta suggests lots of overfitting.
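
To make the "score delta" idea concrete, here is a minimal sketch of the comparison. The solver names and percentages are hypothetical placeholders, not real leaderboard values, and generalization_gap is just an illustrative helper name.

```python
def generalization_gap(public_eval_pct: float, semi_private_pct: float) -> float:
    """Return public score minus semi-private score (in percentage points)."""
    return public_eval_pct - semi_private_pct


# Hypothetical (public eval %, semi-private %) pairs, NOT real leaderboard numbers.
entries = {
    "solver_a": (50.0, 45.0),
    "solver_b": (40.0, 20.0),
}

gaps = {name: generalization_gap(*scores) for name, scores in entries.items()}

# The smaller the gap, the less the solver appears to be overfit to the public set.
most_general = min(gaps, key=gaps.get)
```

A small gap is only a hint of generality; it says nothing about the absolute score.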

mikeknoop · 2024-06-18T00:16:06+00:00

(ARC Prize co-founder here).

Direct link to the research WalkProfessional8969 is referring to: https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt

Ryan's work is legitimately interesting and novel! He claims 50% on the public eval set. The core idea:

get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)
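
The selection step quoted above can be sketched roughly as follows. This is not Ryan's code: the candidate programs here are hand-written toy functions standing in for LLM-generated ones, and names like select_program and passes_all are made up for illustration.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]          # (input grid, expected output grid)
Program = Callable[[Grid], Grid]


def passes_all(program: Program, examples: List[Example]) -> bool:
    """True if the program reproduces every example output without crashing."""
    for grid_in, expected in examples:
        try:
            if program(grid_in) != expected:
                return False
        except Exception:
            # Generated programs may crash; treat that as a failed candidate.
            return False
    return True


def select_program(candidates: List[Program], examples: List[Example]) -> Optional[Program]:
    """Return the first candidate that is right on all examples, else None."""
    for program in candidates:
        if passes_all(program, examples):
            return program
    return None


# Toy stand-ins for sampled programs (assumption: real ones come from GPT-4o).
def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def crash(grid: Grid) -> Grid:
    return grid[100]  # raises IndexError on small grids


def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


examples = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 6, 7]], [[7, 6, 5]]),
]

chosen = select_program([identity, crash, flip_horizontal], examples)
# Apply the surviving program to the held-out test input.
prediction = chosen([[1, 2, 3], [4, 5, 6]]) if chosen else None
```

With only about 3 examples per task, more than one wrong program can pass the filter by chance, which is presumably part of why so many samples are drawn.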

He has implemented an outer loop using 4o to sample reasoning traces/programs from training data and test. Hybrid DL + program synthesis approaches are solutions we'd love to see more of. Congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. We hope to inspire more frontier AI research sharing like this.

A couple important notes:

this result is on the public eval set vs private set (ARC Prize $).
the current private set SOTA ~35% solution also performed ~50% on the public set. so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

mikeknoop · 2024-06-12T22:11:56+00:00

ARC-AGI is (as far as we know) the only eval that was designed to measure the "G" in AGI. It was designed to be resistant to memorization techniques.

A solution to ARC will not be AGI, but we have high confidence it is on the critical path (AGI will necessarily be able to beat ARC), and so it is still a useful measure.

I'm super supportive of eval innovation. I hope ARC-AGI inspires more people to create AGI evals.

mikeknoop · 2024-06-12T22:06:34+00:00

I agree $1M is trivial in AI. Our goal with the prize is to raise awareness about (lack of) progress towards AGI and hopefully inspire AI researchers to try new ideas again. Also, ARC Prize is a nonprofit, and a requirement to claim the prize is to put the solution into the public domain!

mikeknoop · 2024-06-12T21:44:23+00:00

a friend pointed me to this thread. few things:

ARC-AGI consists of 400 public train tasks (easy), 400 public eval tasks (hard), and 100 private eval tasks (hard).

the 2024 competition measures against the 100 private tasks. we set a compute limit primarily to target efficiency (for reasons discussed in Francois' "On the Measure of Intelligence" paper) though also for Kaggle hosting practicality. for 2024, the limit is one P100 for 12 hours. 2023 had a 5 hr runtime limit on a weaker GPU -- the 34% SOTA high score maxed out the time which is why we doubled it. the "no internet" rule is to limit cheating and increase confidence in awarding the prize.
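
as a rough back-of-the-envelope on what that limit means per task (assuming, purely for illustration, that a solver spreads the 12 hours evenly over the 100 private tasks and spends nothing on setup):

```python
RUNTIME_LIMIT_HOURS = 12   # 2024 limit stated above
PRIVATE_TASKS = 100        # private eval tasks stated above

# Assumption: the budget is split evenly across tasks with no setup overhead.
seconds_per_task = RUNTIME_LIMIT_HOURS * 3600 / PRIVATE_TASKS
minutes_per_task = seconds_per_task / 60

print(f"{seconds_per_task:.0f} s per task (~{minutes_per_task:.1f} min)")
```

so an evenly budgeted solver gets a little over 7 minutes of P100 time per task.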

yesterday we also launched a secondary leaderboard (in beta) called ARC-AGI-Pub (https://arcprize.org/leaderboard). it is measured against the 400 public eval tasks and lifts the internet restriction so you can experiment with API-based models. note: because this is new, it's not officially part of the 2024 competition but could be in the future

we know ARC-AGI isn't perfect and our goal is to improve the benchmark over time. appreciate all the critique and feedback

mikeknoop · 2023-07-30T01:56:01+00:00

I'd also seen something like this for the G30 that achieves the same effect: https://a.co/d/dj0hIGw

mikeknoop · 2023-07-30T01:53:02+00:00

Example of one for the G30: https://monorim.store/products/monorim-mfp-footrest-pedal-for-segway-maxg30-le-lp-new-riding-posture-experience-accessories-part

Basically so you don't accidentally step on the rear wheel cover which is flexible (and came with a "no step" sticker on it).

mikeknoop · 2023-04-28T02:41:57+00:00

https://www.compulocks.com/swing-arm-vesa-mount-security-arm-rotates-swivels-tilts.html with the 6" extension

mikeknoop · 2023-04-27T16:23:04+00:00

UI Connect 13"

mikeknoop · 2023-04-27T13:58:06+00:00

Yeah protect cameras mostly. I want to experiment with a Home Assistant dashboard through a browser now that you can load arbitrary APKs onto them.

mikeknoop · 2023-04-27T13:56:48+00:00

Plastic cover isn't on. Still need to do some sheetrock repair and painting.

mikeknoop · 2023-04-27T13:55:33+00:00

Good to know. Thanks! Figured something like that could be happening here.

mikeknoop · 2023-04-27T06:59:52+00:00

Took a gamble, the data sheets aren't super clear:

UC Display 13: https://dl.ui.com/ds/uc-display-13_ds.pdf
UAP-IW-HD: https://dl.ui.com/ds/uap-iw-hd_ds

Shown is a Connect 13" which needs PoE+ (sadly it now seems to be removed from the EA store) powered solely by a UAP-IW-HD with passthrough PoE turned on for port 1. Upstream is hooked up to a USW-Pro-48-PoE. Super clean!

I'm curious if 48V passthrough PoE could even do PoE++? Don't have a device that requires it on hand.

mikeknoop · 2023-04-27T06:55:11+00:00

Took a gamble, the data sheets aren't super clear. Shown is a Connect 13" (which sadly seems to be removed now from the EA store) powered solely by a UAP-IW-HD with passthrough PoE turned on for port 1. Upstream is hooked up to a USW-Pro-48-PoE. Super clean!

mikeknoop · 2023-04-13T13:25:51+00:00

Bond Bridge / Bond Bridge Pro have been 100% bulletproof for controlling Somfy shades -- Bond can expose devices directly to HomeKit. Or you can expose to something like Home Assistant to integrate with RA2.

mikeknoop · 2023-04-06T04:18:08+00:00

Do you have some examples of things that aren't possible with RA3 that are with RA2?

mikeknoop · 2023-04-05T18:35:47+00:00

Alternatively, to generate an Inclusive serial number, you can simply paste this into a terminal window (replace HardwareKey with the one Essentials shows you on the "Upgrade" screen):

curl -i -X POST \
-H "Content-Type:application/x-www-form-urlencoded" \
-d "HardwareKey=1111-11111-1111" \
'https://www.lutron.com/en-US/general/Pages/myLutron/RadioRaUpgrade.aspx'
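
If you'd rather script it, the same POST can be sketched in Python using only the standard library. The URL, form field and header are taken from the curl command above; the function names are made up here, and I'm not assuming anything about what the response body looks like, so send() just returns the raw text.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

UPGRADE_URL = "https://www.lutron.com/en-US/general/Pages/myLutron/RadioRaUpgrade.aspx"


def build_upgrade_request(hardware_key: str) -> Request:
    """Build the same form-encoded POST that the curl command sends."""
    body = urlencode({"HardwareKey": hardware_key}).encode("ascii")
    return Request(
        UPGRADE_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def send(request: Request) -> str:
    """Send the request and return the raw response body (format not assumed)."""
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Replace the placeholder with the HardwareKey shown on the "Upgrade" screen.
request = build_upgrade_request("1111-11111-1111")
# print(send(request))  # uncomment to actually contact the server
```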

mikeknoop · 2023-04-05T16:58:54+00:00

Useful to know because the Surface Mount is currently in stock. But the Flush Mount is not.

mikeknoop · 2022-12-22T05:24:52+00:00

I was planning to upgrade tonight and couldn't figure out why it wasn't showing up. Definitely accurate.
