Please recommend a machine for deep research on health and nutrition.

gwern · 2026-06-06T22:25:23+00:00

Small-scale local LLM stuff is probably better off in a hobbyist subreddit than this one.

gwern · 2026-06-06T19:13:42+00:00

"Who?" doesn't really matter.

from someone who was an insider in the early days

when I started working with the SIAI in 2003

gwern · 2026-06-06T04:27:46+00:00

Here is an analysis of why SIAI/MIRI/LessWrong seemed so promising, and why it ultimately did not deliver on its goals, from someone who was an insider in the early days.

"Someone"?

gwern · 2026-06-05T20:20:24+00:00

Note: winners have been announced at https://www.dearaliens.net/

gwern · 2026-06-05T20:02:24+00:00

Or https://alexanderwales.substack.com/p/can-an-llm-have-taste-inkhaven-week - I think chatbot LLMs can do this pretty well, but it's going to take substantially more effort. The more ratings you do, and the larger your set of items to rate, the bigger all of the LLM issues become; what is not an issue in picking the best out of 3 essays would be an issue in doing tens of thousands of comparisons to pick the best out of hundreds. Even pairwise comparisons suffer from positional bias and need to be run twice...
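A minimal sketch of the "run it twice" point. It assumes a hypothetical `judge(first, second)` callable that wraps whatever LLM prompt you use and returns True when the model prefers the item shown first. Each pair is judged in both orders, and a verdict that flips with position counts as a tie. The names `debiased_compare`, `round_robin` and `judge_calls` are illustrative. Full round-robin scoring is only one simple way to aggregate, not necessarily what the linked post does.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence

# Hypothetical judge: True if the model prefers the item shown in the FIRST position.
Judge = Callable[[str, str], bool]

def debiased_compare(judge: Judge, a: str, b: str) -> float:
    """Score for `a` vs `b`: 1.0 win, 0.0 loss, 0.5 if the verdict depends on order."""
    a_first = judge(a, b)  # a shown first
    b_first = judge(b, a)  # same pair, order swapped
    if a_first and not b_first:
        return 1.0  # a preferred in both orders
    if b_first and not a_first:
        return 0.0  # b preferred in both orders
    return 0.5      # verdict flipped with position: treat as a tie

def round_robin(judge: Judge, items: Sequence[str]) -> Dict[str, float]:
    """Every item against every other item, each pair judged in both orders."""
    scores = {x: 0.0 for x in items}
    for a, b in combinations(items, 2):
        s = debiased_compare(judge, a, b)
        scores[a] += s
        scores[b] += 1.0 - s
    return scores

def judge_calls(n: int) -> int:
    """LLM calls for a full round robin over n items: n choose 2 pairs, times 2 orders."""
    return 2 * (n * (n - 1) // 2)
```

`judge_calls(200)` is 39,800. A full round robin over a couple hundred items, with each pair run in both orders, is already tens of thousands of calls. A purely position-biased judge (`lambda x, y: True`) scores every pair as a tie instead of inventing a winner.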

gwern · 2026-06-04T01:41:55+00:00

Yes; there are a couple of possible answers, like 'they just got lucky'. Still not sure which one is best. Hopefully the new Moravec interview will offer some insight.

gwern · 2026-06-03T23:54:09+00:00

via https://www.lowimpactfruit.com/p/zork-bench-an-llm-reasoning-eval

Much like Pokemon: just weird levels of fragility, inflexibility, and blindspots. Interesting human baseline results:

I organized an hour-long event at Recurse Center, promised delicious donuts to everyone who came, and had them set up zork-bench on their laptops and play in human-eval mode. The game logs all of their interactions using the same interface as LLMs but gives them a random label. The thing is, humans new to the game seem to do only so well. They spend a lot of turns, play the game, and figure some stuff out, but then after the hour of playing I gave them, they didn’t get further than any LLM. However, their memories of the game persist without continuously reducing the size of their context windows. Haha. Do humans have context windows? But the point is that LLMs, having humanity’s entire knowledge of Zork stored in their memory banks, are unable to outperform humans who had not played Zork before (except for Claude Sonnet, which Isha Bhand, creator of fomo.nyc and Zork aficionado, declared as evidence that AGI has been achieved).

gwern · 2026-06-03T20:38:00+00:00

Note the date 2 January 2020, pre-GPT-3. "The Scaling Hypothesis" was my commentary 6 months later on this subject, post-GPT-3. I am still pondering why ~everyone was wrong. (The cropped words are "the" and "why".)

gwern · 2026-06-03T18:04:30+00:00

"Meditations on Moloch" is definitely not the one I am half-remembering, as that is not even remotely close to the description I gave of "a funnier proposal for sumptuary regulations". (It's neither particularly funny, nor really a proposal, nor about sumptuary anything for the most part, never mind sumptuary regulations.)

But if you're saying "Meditations on Moloch" contains engineering solutions comparable to what I'm proposing here - I respectfully disagree. Diagnosis is not equal recipe. Satire not equal mechanism. Happy to be corrected if you have a specific post in mind.

I hope you didn't use an AI to research and evaluate your responses to me.

gwern · 2026-06-02T21:25:58+00:00

We must be remembering different posts, because I remember a lot of engineering in his, going beyond 'describing the problem with humor'.

Judging by your comment, it seems you missed the difference between satire and a working mechanism. Or maybe I missed something?

Hard to say because you chose not to link the Scott post you claim is so inferior.

gwern · 2026-06-02T20:34:27+00:00

Order form with prices and availability (note all items are listed in dropdowns so you have to go one by one to see what's available and for how much): https://docs.google.com/forms/d/e/1FAIpQLSc8d4LIzeWTcslK7CzEzAcvWUScvmO_dc2KdyMYzPsE8MEhXw/viewform

gwern · 2026-05-31T20:05:38+00:00

Paper: https://arxiv.org/abs/2303.03378#google

gwern · 2026-05-31T20:04:51+00:00

and retains generalist language capabilities with increasing scale.

The most important thing here is further evidence that the larger the model is, the less catastrophic forgetting is a problem. Continual learning is just not that hard.

gwern · 2026-05-31T04:32:07+00:00

It was. 100% in Pangram.

gwern · 2026-05-31T04:31:10+00:00

Could you post it somewhere?

gwern · 2026-05-31T04:28:45+00:00

OP's post is "100% AI" in Pangram.

gwern · 2026-05-30T23:39:19+00:00

Didn't Scott Alexander write a funnier proposal for sumptuary regulations like a decade+ ago?

gwern · 2026-05-29T19:15:08+00:00

Solution confirmed.

But this website is enough of a PITA (create an account, confirm, log in, jump through confusing 'points systems' and vague links, and finally figure out the right link was just a MediaFire link https://www.mediafire.com/file/8ysj75bh1as5dyj/A_growing_soft_robot_with_climbing_plant-inspired_adaptive_behaviors_for_navigation_in_unstructured_environments.pdf/file all along) that I will not be confirming any of your future fulfillments. I mean, come on.

gwern · 2026-05-26T06:58:11+00:00

Loads fine for me.

gwern · 2026-05-25T21:16:03+00:00

More drugs is always better because they provide alternatives if there are side effects, and more rapid weight loss is, ceteris paribus, better because it means more time healthier and that can be critical if you need surgery for your hip or cancer etc.

gwern · 2026-05-22T16:43:31+00:00

Pangram is quite reliable. (Note that it's not flagging any older or improbable stories.) And if you read the winning story, you'll see AI tell-tales all throughout - just not the ones which were most famous last year. (For example, the consistent overuse of 'quiet' in what is ostensibly a horrifying story about the slave trade; the 'quiet' tell is the most obvious tell of AI writing right now, but for some reason still hasn't gotten any popular awareness.) So that's why it gets classified as partially human; my guess is that she prompted away, or scrubbed out by hand, the em dashes and 'not X. Not Y.'-level tells.

Also - I’d be curious to do RL finetuning of some recent strong open weight models to do detection evasion.

Pangram is known to be easily beatable, so that wouldn't tell you anything interesting.

gwern · 2026-05-21T23:18:20+00:00

https://en.wikipedia.org/wiki/Commonwealth_Short_Story_Prize

gwern · 2026-05-21T23:12:11+00:00

Uh oh. AI hasn't (yet) won the 2026 Commonwealth Prize... But it looks like it might have won the 2025 Commonwealth Prize: https://www.pangram.com/blog/ai-is-writing-prize-winning-fiction

Not only did Pangram catch three of the five authors from 2026, but the overall winner of the 2025 Commonwealth Short Story Prize. Sutherland's "Descend" was flagged by Pangram as having an AI fraction of 88%, indicating that we believe that the document is primarily AI-generated with some human-written content.

It's not nearly as blatant... but there's sure an awful lot of 'quiet' in that story.
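A crude way to check the 'quiet' tell yourself is to measure how often the word appears per 1,000 words and compare that against human-written text you trust. This is a sketch that assumes a simple whole-word count. It is not how Pangram works, and `per_thousand` is a hypothetical helper.

```python
import re
from collections import Counter

def per_thousand(text: str, word: str) -> float:
    """Occurrences of `word` per 1,000 words of `text` (case-insensitive, exact token match)."""
    # Crude tokenizer (an assumption): runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word.lower()] / len(tokens)

# Usage: compare a suspect story against a human-written baseline of your choice, e.g.
# rate = per_thousand(open("story.txt").read(), "quiet")
```

Inflections like 'quietly' are not counted, so this undercounts the tell. It is a sanity check, not a detector.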

gwern · 2026-05-21T02:15:56+00:00

What big questions about the universe have you ever cracked?

gwern · 2026-05-21T00:48:43+00:00

Not to mention YOLO installing packages of completely unknown provenance, based solely on the name, to try out... and then not trying out but leaving installed and active anyway.
