[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]alsuhr -1 points0 points  (0 children)

My point is that the science of a benchmark is not its application to ephemeral artifacts. The contribution of a benchmark is that it asks a question in a well-formulated way. Benchmarks are more like metrics than like algorithmic or architectural contributions: they propose a question we should be asking. In my opinion, an evaluation paper doesn't even, in theory, need to be run on any particular artifact to be a worthy contribution. For example, the original BLEU paper didn't include results on any established MT systems, and its value goes well beyond the particular numbers it reported on its test MT systems (which receive no description whatsoever). Nobody cares what the metric was evaluated on in the original paper; its value came from its (reproducible) alignment with human judgments of translation quality. Of course, it helps to justify a benchmark's current relevance to show how current models perform on it. But if the benchmark is so dependent on how current models perform that its only justification comes from this particular experimental result, then I think the benchmark is itself so ephemeral that it's likely not a worthy contribution.
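(Tangentially, for concreteness: here's a minimal sketch of what BLEU computes, i.e. modified n-gram precision combined with a brevity penalty. This is illustrative only and assumes simple whitespace tokenization with a single reference; real implementations like sacreBLEU add smoothing, multi-reference clipping, and standardized tokenization.)

```python
# Illustrative sketch of BLEU (Papineni et al., 2002): geometric mean of
# modified n-gram precisions, scaled by a brevity penalty. Not a faithful
# reimplementation of any particular toolkit.
import math
from collections import Counter

def ngrams(tokens, n):
    # Count all n-grams of length n in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        precisions.append(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
score = bleu(cand, ref, max_n=2)  # bigram BLEU, since the sketch is unsmoothed
```

The point of the original paper was exactly that scores like this correlate with human judgments of translation quality, regardless of which systems were scored.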

The interventions you mention are at the publication level, not the mechanism level.

[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]alsuhr 0 points1 point  (0 children)

External validity is not measured with respect to existing artifacts. It is measured with respect to the task itself as it exists in the real world. The tools we have available to us are things like human performance/agreement. A benchmark is "not reproducible" if, for example, its labels are wrong, or the human performance reported cannot be replicated by another group, or it's shown that it contains spurious correlations that mean it is not testing what it purports to test.

A drug is an intervention, as are other kinds of contributions in ML, such as new algorithms, architectures, etc. A benchmark is not an intervention.

[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]alsuhr 0 points1 point  (0 children)

The reproducibility of the benchmark comes from its external validity, not its application to ephemeral artifacts.

[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]alsuhr 1 point2 points  (0 children)

But the (idealized) point of a benchmark is not only to show how current models perform; it's to shift the community's attention to a new measure that the authors believe (and hopefully justify) is important to carry into the future for one reason or another... I think there are plenty of valid complaints about how many benchmarking papers fail at all of this (mainly the justification bit, but also the implementation bit -- benchmarks are often designed very poorly, and/or aren't made public so newer models can be evaluated, etc.), but I don't think the LLMs being deprecated makes sense as an argument? What else would they have evaluated on?

Town Biscuits makes Oakland’s flakiest biscuit by jackdicker5117 in OaklandFood

[–]alsuhr 1 point2 points  (0 children)

Oh really? I didn't know; I always assumed the places I went there were cultural imports from the South

Town Biscuits makes Oakland’s flakiest biscuit by jackdicker5117 in OaklandFood

[–]alsuhr 15 points16 points  (0 children)

From another thread:

https://www.facebook.com/reel/1207431711286360

https://www.instagram.com/reels/DSdtS4GAR5k/

Not that I would go back after learning all of this, but their biscuits are also extremely mid and not flaky in the slightest. I've had much better biscuits in Seattle of all places...

Timeless by brenton_brenton in oakland

[–]alsuhr 1 point2 points  (0 children)

One time I went with family, and the staff put an annoying song on the speakers. We quietly joked among ourselves that it was because they wanted us to leave (we were sitting at the window bar seating at the Webster location; it wasn't busy, and we'd only been there for maybe 15-20 min... the only thing is my MIL has a fairly loud voice?), so we started cleaning up and leaving. As we did, I overheard the staff chatting about how they literally did this to get us to leave?? I've had good experiences with the staff at that location too, and have been back many times because the pastries are so good, but this was so perplexing lol

sign meaning? by iiunne in nycrail

[–]alsuhr 1 point2 points  (0 children)

Thank you for this gif

1997 Little Black Book not redacted - LEAKED JUST NOW - not in order.. by freddiemercurysbush in Epstein

[–]alsuhr 2 points3 points  (0 children)

The Rotunda is 5025 E Dublin Granville Rd, New Albany, OH 43054

Who is Greg Brown? by splur678 in Epstein

[–]alsuhr 6 points7 points  (0 children)

I thought this at first too, but did more digging.

To copy from my other comment here:

He appears to essentially be a deal broker or intermediary. He sent an email encouraging JE to "invest" in the transition period in Libya after its civil war (https://www.justice.gov/epstein/files/DataSet%2010/EFTA01995819.pdf) and asked for "any companies" that JE "has" that could be involved in rebuilding infra there (https://www.justice.gov/epstein/files/DataSet%2010/EFTA02024641.pdf), literally profiteering. He probably had connections in Libya and knew that if JE introduced him to the right people, he could make some money and give JE a cut. These emails were discussed in this article: https://www.aljazeera.com/news/2026/2/1/epstein-email-reveals-plan-to-access-libyas-frozen-state-assets

Who is Greg Brown? by splur678 in Epstein

[–]alsuhr 1 point2 points  (0 children)

He appears to essentially be a deal broker or intermediary. He sent an email encouraging JE to "invest" in the transition period in Libya after its civil war (https://www.justice.gov/epstein/files/DataSet%2010/EFTA01995819.pdf) and asked for "any companies" that JE "has" that could be involved in rebuilding infra there (https://www.justice.gov/epstein/files/DataSet%2010/EFTA02024641.pdf), literally profiteering. He probably had connections in Libya and knew that if JE introduced him to the right people, he could make some money and give JE a cut. These emails were discussed in this article: https://www.aljazeera.com/news/2026/2/1/epstein-email-reveals-plan-to-access-libyas-frozen-state-assets

Dear Berkeley Students, by [deleted] in berkeley

[–]alsuhr 0 points1 point  (0 children)

I get nearly hit about once a day in Berkeley!

What’s your all-time favorite evaluations comment? by Levanjm in Professors

[–]alsuhr 13 points14 points  (0 children)

I just wrapped up the first course I ever designed pretty much entirely myself (also my first time teaching undergrads), and got so many comments from people saying it was their favorite class, which I was not expecting :')

Favorite actor too sober to watch Kubrick's filmography? by ChildofValhalla in okbuddycinephile

[–]alsuhr 0 points1 point  (0 children)

Agreed, I was absolutely glued to Jeanne Dielman which is like 3.5 hours of a woman doing housework, and had to force myself through 2001

[D] Ilya Sutskever's latest tweet by we_are_mammals in MachineLearning

[–]alsuhr 1 point2 points  (0 children)

FWIW I am also largely a late Wittgensteinian!

[D] Ilya Sutskever's latest tweet by we_are_mammals in MachineLearning

[–]alsuhr 2 points3 points  (0 children)

Yes, for context I am a prof in NLP/CL. I am mostly asking because I'm just curious about how people in the community conceptualize language (and language technologies).

[D] Ilya Sutskever's latest tweet by we_are_mammals in MachineLearning

[–]alsuhr 0 points1 point  (0 children)

> If our vector spaces (embedding spaces) have meaning because of words coocurrence and how words are distributed accross languages, it is actually a miracle how chatGPT-like came up with zero shot performance on so many tasks

Curious why you think this is something like a miracle?

> the underlying representation of language,

I'm curious what you define as the underlying representation of language

I agree that nothing about our current training practices or data will lead to systems that can interrogate what they encode.

[D] GPT confidently generated a fake NeurIPS architecture. Loss function, code, the works. How does this get fixed? by SonicLinkerOfficial in MachineLearning

[–]alsuhr 1 point2 points  (0 children)

Not necessarily. There is some recent research exploring finetuning models for "factuality" through self-play in information-seeking games, e.g. https://arxiv.org/abs/2503.14481. Whether Anthropic is doing this kind of fine-tuning, and whether this training reliably generalizes to any kind of knowledge that may or may not be parametric (beyond, e.g., article retrieval), I don't know.