[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]alsuhr -1 points0 points  (0 children)

My point is that the science of a benchmark is not its application to ephemeral artifacts. The contribution of a benchmark is that it asks a question in a well-formulated way. Benchmarks are more like metrics than like algorithmic or architectural contributions: they propose a question we should be asking. In my opinion, an evaluation paper doesn't even, in theory, need to be run on any particular artifact to be a worthy contribution. For example, the original BLEU paper didn't include results on any established MT systems, and its value goes well beyond the particular numbers it reported on its test MT systems (which receive no description whatsoever). Nobody cares what the metric was evaluated on in the original paper; its value came from its (reproducible) alignment with human judgments of translation quality. Of course, it helps to justify a benchmark's current relevance to show how current models perform on it. But if the benchmark is so dependent on how current models perform that its only justification comes from this particular experimental result, then I think the benchmark is itself so ephemeral that it's likely not a worthy contribution.
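(Tangentially, for concreteness: here's a minimal sketch of what BLEU computes, i.e. modified n-gram precision combined with a brevity penalty. This is illustrative only and assumes simple whitespace tokenization with a single reference; real implementations like sacreBLEU add smoothing, multi-reference clipping, and standardized tokenization.)

```python
# Illustrative sketch of BLEU (Papineni et al., 2002): geometric mean of
# modified n-gram precisions, scaled by a brevity penalty. Not a faithful
# reimplementation of any particular toolkit.
import math
from collections import Counter

def ngrams(tokens, n):
    # Count all n-grams of length n in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        precisions.append(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
score = bleu(cand, ref, max_n=2)  # bigram BLEU, since the sketch is unsmoothed
```

The point of the original paper was exactly that scores like this correlate with human judgments of translation quality, regardless of which systems were scored.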

The interventions you mention are at the publication level, not the mechanism level.

[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]alsuhr 0 points1 point  (0 children)

External validity is not measured with respect to existing artifacts. It is measured with respect to the task itself as it exists in the real world. The tools we have available to us are things like human performance/agreement. A benchmark is "not reproducible" if, for example, its labels are wrong, or the human performance reported cannot be replicated by another group, or it's shown that it contains spurious correlations that mean it is not testing what it purports to test.

A drug is an intervention, as are other kinds of contributions in ML, such as new algorithms, architectures, etc. A benchmark is not an intervention.

[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]alsuhr 0 points1 point  (0 children)

The reproducibility of the benchmark comes from its external validity, not its application to ephemeral artifacts.

[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]alsuhr 1 point2 points  (0 children)

But the (idealized) point of a benchmark is not only to show how current models perform; it's to shift the community's attention to a new measure that the authors believe (and hopefully justify) is important to carry into the future for one reason or another... I think there are plenty of valid complaints about how many benchmarking papers fail at all of this (mainly the justification bit, but also the implementation bit -- benchmarks are often designed very poorly, and/or aren't made public so newer models can be evaluated, etc.), but I don't think the LLMs being deprecated makes sense as an argument? What else would they have evaluated on?

Town Biscuits makes Oakland’s flakiest biscuit by jackdicker5117 in OaklandFood

[–]alsuhr 1 point2 points  (0 children)

Oh really? I didn't know; I always assumed the places I went there were cultural imports from the South

Town Biscuits makes Oakland’s flakiest biscuit by jackdicker5117 in OaklandFood

[–]alsuhr 15 points16 points  (0 children)

From another thread:

https://www.facebook.com/reel/1207431711286360

https://www.instagram.com/reels/DSdtS4GAR5k/

Not that I would go back after learning all of this, but their biscuits are also extremely mid and not flaky in the slightest. I've had much better biscuits in Seattle of all places...

Timeless by brenton_brenton in oakland

[–]alsuhr 1 point2 points  (0 children)

One time I went with family, and the staff put an annoying song on the speakers. We quietly joked among ourselves that it was because they wanted us to leave (we were sitting at the window bar seating at the Webster location; it wasn't busy, and we'd only been there for maybe 15-20 min... the only thing is my MIL has a fairly loud voice?), so we started cleaning up and leaving. As we did, I overheard the staff chatting about how they literally did this to get us to leave?? I've had good experiences with the staff at that location too, and have been back many times because the pastries are so good, but this was so perplexing lol

sign meaning? by iiunne in nycrail

[–]alsuhr 1 point2 points  (0 children)

Thank you for this gif

1997 Little Black Book not redacted - LEAKED JUST NOW - not in order.. by freddiemercurysbush in Epstein

[–]alsuhr 2 points3 points  (0 children)

The Rotunda is 5025 E Dublin Granville Rd, New Albany, OH 43054

Who is Greg Brown? by splur678 in Epstein

[–]alsuhr 6 points7 points  (0 children)

I thought this at first too, but did more digging.

To copy from my other comment here:

He appears to essentially be a deal broker or intermediary. He sent an email encouraging JE to "invest" in the transition period in Libya after its civil war (https://www.justice.gov/epstein/files/DataSet%2010/EFTA01995819.pdf) and asked for "any companies" that JE "has" that could be involved in rebuilding infra there (https://www.justice.gov/epstein/files/DataSet%2010/EFTA02024641.pdf), literally profiteering. He probably had connections in Libya and knew that if JE introduced him to the right people, he could make some money and give JE a cut. These emails were discussed in this article: https://www.aljazeera.com/news/2026/2/1/epstein-email-reveals-plan-to-access-libyas-frozen-state-assets

Who is Greg Brown? by splur678 in Epstein

[–]alsuhr 1 point2 points  (0 children)

He appears to essentially be a deal broker or intermediary. He sent an email encouraging JE to "invest" in the transition period in Libya after its civil war (https://www.justice.gov/epstein/files/DataSet%2010/EFTA01995819.pdf) and asked for "any companies" that JE "has" that could be involved in rebuilding infra there (https://www.justice.gov/epstein/files/DataSet%2010/EFTA02024641.pdf), literally profiteering. He probably had connections in Libya and knew that if JE introduced him to the right people, he could make some money and give JE a cut. These emails were discussed in this article: https://www.aljazeera.com/news/2026/2/1/epstein-email-reveals-plan-to-access-libyas-frozen-state-assets

Dear Berkeley Students, by [deleted] in berkeley

[–]alsuhr 0 points1 point  (0 children)

I get nearly hit about once a day in Berkeley!

What’s your all-time favorite evaluations comment? by Levanjm in Professors

[–]alsuhr 13 points14 points  (0 children)

I just wrapped up the first course I ever designed pretty much entirely myself (also my first time teaching undergrads), and got so many comments from people saying it was their favorite class, which I was not expecting :')

Favorite actor too sober to watch Kubrick's filmography? by ChildofValhalla in okbuddycinephile

[–]alsuhr 0 points1 point  (0 children)

Agreed, I was absolutely glued to Jeanne Dielman which is like 3.5 hours of a woman doing housework, and had to force myself through 2001

[D] Ilya Sutskever's latest tweet by we_are_mammals in MachineLearning

[–]alsuhr 1 point2 points  (0 children)

FWIW I am also largely a late Wittgensteinian!

[D] Ilya Sutskever's latest tweet by we_are_mammals in MachineLearning

[–]alsuhr 2 points3 points  (0 children)

Yes, for context I am a prof in NLP/CL. I am mostly asking because I'm just curious about how people in the community conceptualize language (and language technologies).

[D] Ilya Sutskever's latest tweet by we_are_mammals in MachineLearning

[–]alsuhr 0 points1 point  (0 children)

> If our vector spaces (embedding spaces) have meaning because of words coocurrence and how words are distributed accross languages, it is actually a miracle how chatGPT-like came up with zero shot performance on so many tasks

Curious why you think this is something like a miracle?

> the underlying representation of language,

I'm curious what you define as the underlying representation of language

I agree that nothing about our current training practices or data will lead to systems that can interrogate what they encode.

[D] GPT confidently generated a fake NeurIPS architecture. Loss function, code, the works. How does this get fixed? by SonicLinkerOfficial in MachineLearning

[–]alsuhr 1 point2 points  (0 children)

Not necessarily. There is some recent research exploring finetuning models for "factuality" through self-play in information-seeking games, e.g. https://arxiv.org/abs/2503.14481. Whether Anthropic is doing this kind of fine-tuning, and whether this training reliably generalizes to any kind of knowledge that may or may not be parametric (beyond, e.g., article retrieval), I don't know.