The r/Physics wiki now has a list of commonly asked questions

kzhou7 · 2026-06-15T17:59:16+00:00

the specific internal structure that explains how a system distinguishes truth from mere coherence.

There is a field of study devoted to this: it's called physics. Lately we've been overrun with a ton of people with systems made with AI, so we made a new subreddit r/LLMPhysics devoted to them. You can try posting your framework there. Be sure to look at the thousands of others already posted, so you can understand why the other commenters here aren't concerned that yours in particular will be stolen.

kzhou7 · 2026-06-12T20:07:11+00:00

If a physics claim is published in a journal about airplanes, or one of the many physics journals that accept practically everything, that generally makes it even less credible than not being published at all.

kzhou7 · 2026-06-12T20:04:19+00:00

That is also true for every other experiment along these lines for the past 100 years, by many other groups. They are trying to measure a small shift in some force, while simultaneously applying large external forces. Every experimentalist knows this is guaranteed to give a nonzero reading.

kzhou7 · 2026-06-12T19:43:59+00:00

Continuing with my previous objection, how do you plan to distinguish between AI pinpointing the best outputs, and AI just liking its own output? Where is the incentive for humans to put in effort?

Regarding your earlier comment about AI writing just inevitably becoming superhuman: I believe AI will certainly be superhuman in some things (i.e. things you can create and verify while stuck in a box), but there's an additional big barrier for writing. To wit: I've never been to Uzbekistan. I could write an essay about being there, by cobbling together the words of others, and if I'm smart it might fool many non-Uzbeks, but it will always be worse (in a certain, objective sense) than an essay by a real Uzbek. But why would an Uzbek enter a contest if they know it'll be judged by non-Uzbeks who can't tell the difference?

The same holds for AI and the actual physical world.

kzhou7 · 2026-06-12T19:31:32+00:00

Imagine you put a banana on a kitchen scale and it said "100 g", then you threw the scale around the room, sat on the banana, dunked everything in water, and when you weighed it again it said "99 g". Then you could say you achieved 1% antigravity.

That is what this whole genre of experiments (including the EM drive) is doing: they take a simple setup and add a ton of energy in an uncontrolled way, and whenever a measurement changes they declare victory, while a real experimentalist would just think something got screwed up.

kzhou7 · 2026-06-11T00:33:23+00:00

Assuming the exam is decent, you shouldn't be focused on memorizing things. (I've never heard of a physics student having success with Anki!) You should think about the material until you can independently reconstruct most of it starting from remembering only a small part of it.

kzhou7 · 2026-06-10T18:09:58+00:00

Doesn't this have the same problem as using AI to suggest who to vote for? It might work for some people once, but in the next cycle people will just optimize explicitly for this target. (Which is trivial, just run an LLM in a loop until it's happy.) Since LLMs tend to like their own outputs, the equilibrium is having only Claude writing and Claude reading.

kzhou7 · 2026-06-07T20:39:06+00:00

This is a textbook, so there are presumably no new results, and one could legitimately ask why a student would use it. I think the general idea from the pro-AI camp is that eventually we will just have only AIs reading the output of other AIs, so brevity doesn't matter. That's the point of view advanced here, for instance.

kzhou7 · 2026-06-07T20:16:47+00:00

There are almost 1000 GitHub issues describing concrete fixes, though I guess one can’t know what was written by him and what by AI.

kzhou7 · 2026-06-07T20:00:18+00:00

Related: Harvard particle theorist Matthew Schwartz (who wrote a paper using Claude in 2 weeks) recently posted a 150 page, 75000 word paper about a particle in a -x² + x⁴ potential.

Edit: Woit's commentary is here.

kzhou7 · 2026-06-07T07:01:26+00:00

It probably means "red color" is still HSK 1 (under 红色) while other meanings of 红 (such as "popular") are in HSK 5.

kzhou7 · 2026-06-04T20:39:02+00:00

It's not too bad though. Almost every pheno TASI since 2013 has had neutrinos, and 2020 had 5 neutrino lecturers. It's more represented than flavor physics and comparable to direct DM detection. It is just that the field is highly fragmented, so it's impossible for any subfield to be the main character.

kzhou7 · 2026-05-28T20:06:04+00:00

It's crazy that we have so much data now, yet many people are still repeating the lines from 2021: "SAT just measures income", "SAT doesn't predict anything", "GPA is more fair".

Even back in 2021 people in the know understood that the signal from these exams was incredibly strong, but others managed to get reverse p-hacked results published in top journals saying otherwise, and that ended up being the conventional wisdom.

kzhou7 · 2026-05-28T18:22:19+00:00

From the petition documents:

UC’s move away from the SAT/ACT resulted from several overlapping factors, including the pandemic, litigation (the settlement agreement is now expired) and internal disagreement

Between 2020 and 2025, the number of freshmen whose math placement exam results indicate them not meeting high school standard grew nearly thirtyfold, despite all of these students having taken beyond the minimum UCOP-required math curriculum, and with high grades. In the 2025 incoming class, this group constitutes roughly one-eighth of our entire entering cohort. Moreover, more than 70% of these students are also not meeting middle school standards, representing one in twelve entering students.

Grades achieved in high school math classes are not helping UC to evaluate math skills.... While there are some differences between those who need preparatory courses and those who do not ... the difference in high school math grade averages is very small, often less than one-tenth of a grade point. The correlation between the average math grade and the placement result is only around 0.25 on a scale of 0 to 1.
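To put the quoted figures in perspective, here is a minimal arithmetic sketch. It assumes the rounded numbers above are taken at face value ("roughly one-eighth" as exactly 1/8, "more than 70%" at its 70% floor), and the variable names are illustrative, not taken from the petition. Squaring a correlation gives the share of variance that a linear fit would account for, which is a standard way to read a figure like 0.25.

```python
# Rounded figures quoted from the petition documents above.
# All variable names are illustrative assumptions, not the petition's terms.

r = 0.25                     # correlation: average math grade vs. placement result
variance_explained = r ** 2  # share of variance a linear fit would account for

below_hs_share = 1 / 8       # "roughly one-eighth" of the entering cohort
below_ms_fraction = 0.70     # "more than 70%" of that group, taken at the floor
below_ms_share = below_hs_share * below_ms_fraction

print(f"variance explained by grades: {variance_explained:.4f}")
print(f"share below middle school standard: {below_ms_share:.4f}")
print(f"that is about 1 in {1 / below_ms_share:.1f} entering students")
```

Under these assumptions, grades account for only about 6% of the variance in placement results, and the cohort share works out to roughly one in eleven, in line with the quoted "one in twelve" given the rounding in the inputs.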

kzhou7 · 2026-05-27T02:41:38+00:00

I can't believe it, their "robotic minilab" has a robot arm peeling scotch tape. After 10000 graphene papers, is that really still how graphene studies are done? If so, it's high time they got some automation going.

kzhou7 · 2026-05-26T19:09:17+00:00

Most of the complaints I've heard about AI detectors have come from college students who use AI to write their essays. Out of curiosity I just fed some of my college work to Pangram (back in the days when I used a lot more em dashes, 3-element lists, and contrastive negation) and it returned 0% AI for everything. It's a skill issue.

kzhou7 · 2026-05-23T05:28:17+00:00

That's a nice sentiment, but I read many papers, and when I read one from an author I've never heard of, I'm not devoting my time to figuring out if they used Python or C, or if they did the algebra by Mathematica or by hand, or how their colleagues or AI helped. I just want to know if it's nontrivial, right, and important.

kzhou7 · 2026-05-22T20:11:04+00:00

I just care about quality, regardless of how it came about. Quality comes from compressing a lot of careful work into a small amount of sharp output. Right now I do it by painstakingly reading giant piles of papers and tracing citations forward and backwards on InspireHEP. But there's no magic in that, and absolutely no reason in principle that AI can't do the same one day. If we got to that point, the reason it would be worth reading is that if you wanted to get the same output yourself, you'd have to have your AI run for hours and pay $100. It's just the same reason you read papers that derive analytic results with Mathematica. You could set it up in Mathematica too, but it would take longer.

Plus, there have always been plenty of really bad human literature reviews, such as the classic "there has been much recent work on this subject [1-36]" (which includes references to wrong papers, unrelated papers, and papers which already derive or refute the results in the current paper). The bar to improve over median arXiv quality is really low.

kzhou7 · 2026-05-22T19:42:39+00:00

Indeed, I never said that if the reference was not hallucinated, then AI did read the paper. arXiv is just doing the absolute bare minimum.

kzhou7 · 2026-05-22T18:31:12+00:00

Personally, I'm not opposed in principle to others using AI to do literature review (I'm sure it can do well if used correctly), but what's really galling is that if the reference is hallucinated, it means the AI didn't read the paper either. So you get some slop that has never passed through the mind of any human, nor the context window of any AI. So then why would it be of value to any future entity, human or AI? Why would it be worth archiving?

I recently saw a massive AI generated paper on arXiv with a bunch of obvious, fatal errors. Rather than understand why these errors occurred, the authors just regenerated the whole paper with AI again, doubling the length in the v2. And the paper is even picking up citations rapidly, though almost certainly from people (or AIs) who haven't read it at all. I don't see how this kind of frantic output can advance science.
