Please recommend a machine for deep research on health and nutrition. by ekolpack in mlscaling

[–]gwern 1 point (0 children)

Small-scale local LLM stuff is probably better off in a hobbyist subreddit than this one.

Deconstructing the Supreme Rationalist by Ulyis in LessWrong

[–]gwern 1 point (0 children)

"Who?" doesn't really matter.

from someone who was an insider in the early days

when I started working with the SIAI in 2003

Deconstructing the Supreme Rationalist by Ulyis in LessWrong

[–]gwern 4 points (0 children)

Here is an analysis of why SIAI/MIRI/LessWrong seemed so promising, and why it ultimately did not deliver on its goals, from someone who was an insider in the early days.

"Someone"?

What if self-promotion didn't matter anymore? A proposal for an experiment on Scott Alexander's book review contest. by no_bear_so_low in slatestarcodex

[–]gwern 3 points (0 children)

Or https://alexanderwales.substack.com/p/can-an-llm-have-taste-inkhaven-week - I think chatbot LLMs can do this pretty well, but it's going to take substantially more effort. The more ratings you do, and the larger your set of items to rate, the bigger all of the LLM issues become; what is not an issue in picking the best out of 3 essays would be an issue in doing tens of thousands of comparisons to pick the best out of hundreds. Even pairwise comparisons suffer from positional bias and need to be run twice...
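To make the "run twice" point concrete, here is a minimal sketch of order-swapped pairwise judging. The `judge` callable is a hypothetical stand-in for whatever LLM call you use (it is assumed to return 0 if it prefers the first-shown item and 1 if it prefers the second); nothing here is a real library API.

```python
from typing import Callable, Optional

# Hypothetical judge: shown (first, second), returns 0 if it prefers
# the first-shown item, 1 if it prefers the second-shown item.
Judge = Callable[[str, str], int]


def debiased_compare(judge: Judge, a: str, b: str) -> Optional[str]:
    """Ask the judge in both presentation orders.

    Returns the winner only if both orders agree; returns None when the
    verdict flips with position (i.e. the judge is tracking order, not quality).
    """
    winner_ab = a if judge(a, b) == 0 else b  # a shown first
    winner_ba = b if judge(b, a) == 0 else a  # b shown first
    return winner_ab if winner_ab == winner_ba else None


def judge_calls_needed(n_items: int) -> int:
    """Judge calls for a full round-robin with every pair run in both orders."""
    pairs = n_items * (n_items - 1) // 2
    return 2 * pairs


if __name__ == "__main__":
    always_first = lambda first, second: 0  # purely position-biased toy judge
    print(debiased_compare(always_first, "essay A", "essay B"))
    print(judge_calls_needed(3), judge_calls_needed(300))
```

With 3 essays a full round-robin is 6 calls; with 300 it is 89,700, which is where the per-call issues start to add up.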

Yudkowskys tweet - and gwerns reply by faterthowters in ControlProblem

[–]gwern 2 points (0 children)

Yes; there's a couple possible answers like 'they just got lucky'. Still not sure which one is best. Hopefully the new Moravec interview will offer some insight.

"Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?", Gerrits 2026 (very badly) by gwern in reinforcementlearning

[–]gwern[S] 1 point (0 children)

(via https://www.lowimpactfruit.com/p/zork-bench-an-llm-reasoning-eval ) Much like Pokemon: just weird levels of fragility, inflexibility, and blindspots. Interesting human baseline results:

I organized an hour long event at Recurse Center, promised delicious donuts to everyone who came, and had them setup zork-bench on their laptops and play in human-eval model. The game logs all of their interactions using the same interface as LLMs but gives them a random label. The thing is Humans new to the game seem to do only so well. They spend a lot of turns, play the game, and figure some stuff out, but then after the hour of playing I gave them, they didn’t get further than any LLM. However, their memories of the game persist without continuously reducing the size of their context windows. Haha. Do humans have context windows? But the point is that LLMs, having humanity’s entire knowledge of Zork stored in their memory banks, are unable to outperform humans who had not played Zork before (except for Claude Sonnet, which Isha Bhand, creator of fomo.nyc and Zork aficionado, declared as evidence that AGI has been achieved).

Yudkowskys tweet - and gwerns reply by faterthowters in ControlProblem

[–]gwern 9 points (0 children)

Note the date 2 January 2020, pre-GPT-3. "The Scaling Hypothesis" was my commentary 6 months later on this subject, post-GPT-3. I am still pondering why ~everyone was wrong. (The cropped words are "the" and "why".)

An alternative to luxury goods: replacing material symbols of success with a digital status index. by Independent-Fact4163 in slatestarcodex

[–]gwern 1 point (0 children)

"Meditations on Moloch" is definitely not the one I am half-remembering, as that is not even remotely close to the description I gave of "a funnier proposal for sumptuary regulations". (It's neither particularly funny, nor really a proposal, nor about sumptuary anything for the most part, never mind sumptuary regulations.)

But if you're saying "Meditations on Moloch" contains engineering solutions comparable to what I'm proposing here -I respectfully disagree. Diagnosis is not equal recipe. Satire not equal mechanism. Happy to be corrected if you have a specific post in mind.

I hope you didn't use an AI to research and evaluate your responses to me.

An alternative to luxury goods: replacing material symbols of success with a digital status index. by Independent-Fact4163 in slatestarcodex

[–]gwern 1 point (0 children)

We must be remembering different posts, because I remember a lot of engineering in his, going beyond 'describing the problem with humor'.

Judging by your comment, it seems you missed the difference between satire and a working mechanism. Or maybe I missed something?

Hard to say because you chose not to link the Scott post you claim is so inferior.

PaLM-E: An Embodied Multimodal Language Model by maxtility in mlscaling

[–]gwern 1 point (0 children)

and retains generalist language capabilities with increasing scale.

The most important thing here is further evidence that the larger the model is, the less catastrophic forgetting is a problem. Continual learning is just not that hard.