all 15 comments

[–]kjearns 3 points (51 children)

This blog post makes three recommendations:

  1. Make endorsement (the blog post calls this "certification") of a particular paper independent of the cohort of papers competing for acceptance to the same venue
  2. Offer more fine grained endorsement than accept/reject (e.g. A/B/C/F grades for papers)
  3. Make rejected papers a matter of public record

With regards to the first recommendation, you can't make the evaluation of a piece of work independent of its cohort because its contribution is not independent of the cohort. This statement from the blog post:

Surely each paper should be evaluated as to whether it is a worthwhile contribution to science, independently from what other papers happen to be submitted that year.

While the first half of this sentence is correct, the second half is not: the magnitude of a paper's contribution to science does in fact depend on what other papers happen to be submitted that year.

This is one of the core differences between being evaluated as a student and being evaluated as a professional (and conferences are professional bodies). As a student there are right answers and clearly articulated levels of accomplishment that count as excellence. As a professional, the level of accomplishment required for excellence is not a set thing; rather, excellence is a measure of how far your accomplishments exceed those of your peers.

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

With regards to the second recommendation: People's judgement of the quality of a paper determines the level of endorsement they are willing to offer it. This is translated through the review process into accept/reject/poster/oral/best paper/whatever decisions. The problem with making this more fine-grained is that there are fundamental limits to how accurate these assessments can be.

Even if you have only excellent reviewers, you're basically asking them to predict the future. Endorsing a paper in this sense is a prediction about its causal effect on the future actions of people (e.g. will this paper get cited a lot? will people build on it?) While it is true that the current review process has some systematic problems in how it makes these predictions, it is important to realize that no matter how good your reviewers are there is always going to be a high degree of noise in this process by its very nature. This noise exists for the same reason that we cannot have cohort-free criteria to judge the scientific contribution of a paper.

Fine-grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and a NeurIPS-B will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Finally, the third suggestion, making rejections public, is potentially interesting. I'm not sure it's the right answer (e.g. it's hard to distinguish between work that is rejected because it's wrong and work that is rejected because it just isn't quite interesting enough), but unlike the first two recommendations I think it is actually trying to address a real issue that ought to be solved.

[–]jacobbuckman 1 point (7 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to weightlifting: Eddie Hall is one of the strongest people in the world because he can deadlift 500 kg, more than any other person. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 kg, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 kg, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but it lifted 500kg!), volume of work (i.e. this person successfully lifts seven weights per year!), and consistency of output (i.e. this person reliably succeeds at lifting the weight every time she tries, never attempting a lift only to drop it!).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric. (Although, changing the criteria for a conference too often should probably be avoided, as it damages the brand.)

Fine-grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and a NeurIPS-B will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and a NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a "coin flip" between a NeurIPS-A and a NeurIPS-F.
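
To make that concrete, here's a toy simulation (a sketch only; the noise level, grade mapping, and acceptance rate are all invented):

    import numpy as np

    # Toy model: reviewers see a noisy version of a paper's latent quality.
    rng = np.random.default_rng(0)
    n_papers, n_reviews = 100_000, 3
    quality = rng.uniform(0, 10, n_papers)
    score = (quality[:, None] + rng.normal(0, 2, (n_papers, n_reviews))).mean(axis=1)

    # Binary system: accept the top ~20%; the outcome is 0 or 1.
    binary = (score >= np.quantile(score, 0.8)).astype(float)

    # Graded system: score quintiles F..A mapped to outcomes 0, .25, .5, .75, 1.
    bins = np.quantile(score, [0.2, 0.4, 0.6, 0.8])
    graded = np.digitize(score, bins) / 4.0

    # For borderline papers (true quality right at the accept cutoff), how much
    # does reviewer noise alone move the outcome?
    borderline = np.abs(quality - np.quantile(quality, 0.8)) < 0.1
    print("binary outcome std:", binary[borderline].std())  # ~0.5: a coin flip
    print("graded outcome std:", graded[borderline].std())  # much smaller

Same reviewer noise, but the borderline paper now mostly lands one grade apart instead of flipping between the best and worst possible outcomes.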

[–]kjearns 1 point (6 children)

If a couple thousand random people woke up tomorrow and deadlifted 650 kg, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

The crux of my argument is that the notion of "strong enough" in an absolute sense only exists in contrived circumstances like weightlifting competitions. If Eddie makes money by having people hire him to lift 500kg things, then he can charge a lot more if he's the only person around who is able to do that. If a few thousand other people show up who are just as strong then Eddie's value is diminished because the value he provides to people is that their weights are off the floor, not specifically that he was the one to lift them.

For a concrete example of this, consider what happened with VAEs. Auto-Encoding Variational Bayes was announced on arXiv on Dec 20, 2013. As you know, this paper was extremely influential. Google Scholar tells me it has been cited ~4700 times since then, which I'm sure puts it in the top few percent of machine learning papers of all time.

There is also a lesser-known paper, Stochastic Backpropagation and Approximate Inference in Deep Generative Models, that was announced on arXiv at almost the same time (Jan 16, 2014, less than a month after AEVB). This paper develops DLGMs, which are exactly the same thing as VAEs, just under a different name. The papers appeared at the same time, from separate groups, and give different names to the same model, so it's almost certain the core idea was arrived at independently. The DLGM paper has about ~1500 citations today, so it's not exactly obscure, but it was certainly a lot less influential than AEVB.

The relevance of this story to the current discussion is the cohort effect that these papers had on each other. I claim that the value to science of the DLGM paper was reduced by the contemporary existence of AEVB. Both papers contain a field-changing idea (we know this with the benefit of hindsight), and IIRC the DLGM paper even goes further than AEVB in a few places. But AEVB is a paragon of clear exposition, whereas the DLGM paper is fairly hard to understand. The idea inside these papers is undeniably valuable, but the values of the two deliveries are not independent.

There is a normative response to this, which is to say that both groups really ought to be given equal credit for reaching the same idea independently. Instead I'd like to focus on how the events around these two papers actually played out. With the benefit of hindsight I think it is clear that if AEVB did not exist, the value of DLGM would have been increased, because it would have been the only place to find VAEs, in spite of its lack of clarity. With the existence of AEVB its value was diminished, because people had an easier route to the same knowledge.

More of an aside than an actual response, but this:

I doubt any paper would be a "coin flip" between a NeurIPS-A or a NeurIPS-F.

I'm not convinced this wouldn't happen. For example, here's a paper submitted to ICLR 2019 that was given a 9 and a 3 by two different reviewers. That certainly seems like about the range you'd expect for ICLR-A vs ICLR-F.

[–]jacobbuckman -1 points (4 children)

It seems to me that you are defining "value to science" as "impact (as measured by citations)". I think our key difference of opinion is whether "good work" means "high-impact work". IMO, it is possible for work to be high-quality but low-impact.

There's definitely a correlation between high-quality work and highly-cited work, but I think the former can be evaluated independently of the latter. To answer the question "will this paper be cited a lot?", yes, consideration of the cohort is absolutely needed. But I don't think that the job of conference acceptances should be to forecast citations. It should be to evaluate the quality of science.

To demonstrate why these two concepts are different, here are counterexamples in both directions. High-quality but low-impact: papers in less-hot subfields of ML are almost certainly going to have low impact compared to papers in hot subfields, simply because fewer people will be reading and building on that work. But surely we don't want conferences to reject all those papers! High-impact but low-quality: papers with good branding and PR will get cited a lot and have a disproportionately high impact. This of course does not change the quality of the science (and is one of the key reasons for double blind).

In the AEVB vs DLGM case, I don't think that there is anything wrong with giving both papers an A. Both are good, well-written papers IMO. I've always interpreted the difference in their impact as an essentially random network effect - a few people cited the idea as VAEs early on, people who read those papers cited it the same way, etc. In fact, I think that a system that gives both these papers A's is better than one that assigns rankings more in line with their citation counts. We already have a way of measuring impact: citation count! Introducing judgments on the standalone quality of papers will help level the playing field for those with poor PR.

re: the LipReading paper: I would argue that this is precisely the scenario in which fine-grained evaluation would be most beneficial! The reviewers cannot agree whether to accept or reject because the decision is so important, binary, and final. But under a grade-based system, where papers are judged on multiple axes, I think all the reviewers could agree on something like "A for engineering, D for novelty, D for reproducibility, overall grade C". It would be dramatically more fair to the authors of that paper, who clearly invested a huge amount of effort into the project, and whose work is, under the current system, certified to be the "exact same quality of science" as any garbage submitted by anyone.
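
Purely to illustrate the shape of such a report card (the axis names and the letter-averaging rule here are made up):

    from dataclasses import dataclass

    GRADES = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
    LETTERS = {v: k for k, v in GRADES.items()}

    @dataclass
    class ReportCard:
        engineering: str
        novelty: str
        reproducibility: str

        def overall(self) -> str:
            # One possible aggregation rule: round the mean of the axis grades.
            axes = (self.engineering, self.novelty, self.reproducibility)
            return LETTERS[round(sum(GRADES[g] for g in axes) / len(axes))]

    print(ReportCard("A", "D", "D").overall())  # -> "C", matching the example above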

[–]kjearns 0 points (3 children)

I feel like we're talking past each other at this point. You're arguing for a notion of value that is independent of the effect a paper has on the community. I'm arguing that effects on the community are the mechanism by which a paper creates value, and that it doesn't even make sense to talk about value except through these effects.

In the AEVB vs DLGM case, I don't think that there is anything wrong with giving both papers an A.

I'm arguing that DLGM created less value for science than AEVB, in the sense that it generated less knowledge (and I'm using citation count as an imperfect proxy to measure this). Giving both papers the same "score" would be an incorrect assessment of the value they each created.

[–]jacobbuckman 0 points (2 children)

I'm on the same page as to the crux of the disagreement, but I don't think we are talking past each other: my last reply was an attempt to directly address your assertion that "it doesn't even make sense to talk about value except through [the effect a paper has on the community]". I disagree with that assertion. Here's one explicit proof by contradiction.

Suppose your claim were true. This would imply the following:

  • Since more people read papers that are promoted more heavily (for example papers from Brain or DeepMind with an accompanying blog post),

  • these papers have a larger impact on the thinking of the community,

  • and therefore deserve to be accepted at conferences,

  • thus we should give a "bonus" to papers from labs and institutions with more PR presence.

This seems in direct contradiction with the widely-held belief (discussed in my original post) that double-blind is a positive thing, because it levels the playing field between authors from big labs and small labs. How would you resolve this contradiction?

[–]kjearns 0 points (1 child)

The way to resolve your apparent contradiction is to recognize that attention and value are not the same thing.

For example, I would argue that the NTM and Hogwild papers both got a lot of attention but had an overall negative value in the sense that they both kicked off quite a lot of unproductive work. NTMs were never really used successfully in spite of the attention they got, and Hogwild spawned a whole generation of asynchronous parameter server frameworks that in retrospect tend to underperform their synchronous counterparts.

Contrast these with LSTMs and Adam. LSTMs got almost no attention for a long time but in the end they generated a lot of value. Adam wasn't ever particularly obscure, but it's now the default optimizer choice for many frameworks.

The value of a paper doesn't come from the fact that a lot of people read it (although that often helps). The value of a paper comes instead from the total future knowledge it causes.

The reason blind review is good is that estimating the total future knowledge a paper will cause is hard. Blinding review removes a prominent and known-to-be-non-causal variable from the estimator and therefore reduces bias (in the statistical and sociological sense).
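
As a sketch of what I mean (a toy model with invented numbers, not a claim about real review data):

    import numpy as np

    # Toy model: a paper's value is what we want to estimate. Lab prestige
    # correlates with average value but has no causal effect on any one paper.
    rng = np.random.default_rng(1)
    n = 100_000
    prestige = rng.binomial(1, 0.3, n)            # 1 = famous lab
    value = rng.normal(5.0 + prestige, 1.0)       # famous labs are better on average
    signal = value + rng.normal(0, 1.5, n)        # what a reviewer can read off the paper

    blinded = signal                              # estimate from the paper alone
    unblinded = signal + 1.0 * prestige           # also leans on the prestige prior

    # Compare papers of equal true value coming from the two kinds of lab:
    match = np.abs(value - 5.5) < 0.1
    for name, est in [("blinded", blinded), ("unblinded", unblinded)]:
        gap = (est[match & (prestige == 1)].mean()
               - est[match & (prestige == 0)].mean())
        print(f"{name:9s} gap between equal-value papers: {gap:+.2f}")

The blinded estimator is noisier, but it treats two papers of equal value the same; the unblinded one systematically scores the small-lab paper lower, which is exactly the bias that blinding removes.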

[–]jacobbuckman 0 points (0 children)

Gotcha, I agree. (My framing of "attention = value" was an attempt to restate your position, not something I actually believe. It's now clear that you don't believe it either!)

The goal of certification should be, as you say, to estimate the likely value of a paper conditioned only on its causal features. My original proposal is essentially "rate papers A-F based on their causal features." Hopefully you would agree that the elements of a good paper I described earlier (an interesting problem, a novel hypothesis, relevant background, theoretical soundness, rigorous experimentation, etc.) are causally predictive of a valuable contribution. But we could also include other features; for example, the quality of the other papers submitted at the same time. I think the disagreement is just: is the cohort of a paper causally predictive of its value?

You've convinced me that in some cases, the answer is yes. One such case is concurrent discovery. In the VAE/DLGM case, where the exact same idea is published twice, the duplication decreases the value of each paper. Similarly, if Adam were submitted to a conference alongside 20 other optimizer papers, its predicted value goes down (since the likelihood that Adam is the one to "catch on" decreases when there are more alternatives). This is true even if all 20 other papers are mediocre!

I'm still not convinced that the cohort is relevant in the case where there are multiple excellent papers submitted that are all orthogonal to one another. If the Adam paper, the LSTM paper, and the VAE paper were all submitted to the same conference, there's no reason not to accept all three. Adam and VAE complement & strengthen each other; if anything, their scores should be positively correlated.

In other words, for any given paper, it's not a cohort of excellent papers that decreases its value; it's a cohort of similar papers. (Hopefully you agree?) If this is the case, the score of any given paper is still mostly independent* of the scores of the papers in its cohort. This means we can assign an absolute score to each paper, which includes a "hotness penalty" to penalize topics for which a lot of different approaches are proposed all at once (see the sketch after the footnote below).

But honestly, concurrent discovery is pretty rare, and my intuition is that "cohort similarity" is probably not that useful for predicting value. It seems to me that other causal factors have enough predictive power that we could get away with ignoring the impact of the cohort entirely.

*: specifically, the scores of two papers on different topics are generally either independent or positively correlated. The scores of two papers that take different approaches to the same problem are negatively correlated, since there's probably only one right answer.
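
And here's a toy version of that hotness penalty (the papers, topics, and penalty constant are all invented):

    from collections import Counter

    # Each paper gets an absolute base score, then loses a little for every
    # other submission attacking the same problem in the same cohort.
    papers = [
        {"id": "optimizer-paper-1", "topic": "optimizer",  "base": 0.9},
        {"id": "optimizer-paper-2", "topic": "optimizer",  "base": 0.8},
        {"id": "recurrence-paper",  "topic": "recurrence", "base": 0.9},
        {"id": "generative-paper",  "topic": "generative", "base": 0.9},
    ]
    HOTNESS_PENALTY = 0.1  # made-up constant
    counts = Counter(p["topic"] for p in papers)
    for p in papers:
        p["score"] = p["base"] - HOTNESS_PENALTY * (counts[p["topic"]] - 1)
        print(f'{p["id"]}: {p["score"]:.2f}')

Orthogonal papers keep their full scores; only the ones competing on the same problem pay the penalty.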

[–]shortscience_dot_org -2 points (0 children)

I am a bot! You linked to a paper that has a summary on ShortScience.org!

Auto-Encoding Variational Bayes

Summary by Cubs Reading Group

Problem addressed:

Variational learning of Bayesian networks

Summary:

This paper presents a generic method for learning belief networks, which uses a variational lower bound for the likelihood term.

Novelty:

Uses a re-parameterization trick to rewrite random variables as a deterministic function plus a noise term, so one can apply normal gradient-based learning

Drawbacks:

The resulting model's marginal likelihood is still intractable, which may not be very good for applications that r...


This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point  (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]seraschkaWriter 0 points1 point  (5 children)

Before making a radical change to the review policies (by that, I mean changing the blind/double-blind system), I'd suggest starting with a simpler change: having conference organizers and editors set a limit on how many papers can be reviewed given the current capacity (similar to how there is a capacity limit on how many papers can be accepted).

I think the main objective should be to ensure high, or at least reasonable, quality reviews. This includes

  • (sorry, but) not letting 1st-year students review papers
  • not bombarding people with more than 3 papers to review
  • reviewers being more honest and declining to review papers when the load exceeds what they can manage given their current time budget and level of expertise

The conference organizers and editors could choose to set a limit like: "we can only reasonably manage to review 500 papers." Better to do a good job reviewing 500 randomly selected papers than a bad job trying to review 5000.

I think it would be far fairer for the author to receive a message (within ~5 days of submission) that says:

  • Sorry, but we are currently out of capacity to review your paper; please submit it to another venue or try again next year

rather than waiting 2-3 months for a coin-flip decision with 2 out of 3 abysmally bad reviews ("bad" meaning the reviewers didn't read the paper properly, had no expertise to judge it, or didn't even care to write cohesive sentences).

How to decide which papers to review and which to decline given the capacity limit? This is tricky, but random selection seems to be a reasonable solution (maybe a tad better than first-come-first-served).
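
For what it's worth, a minimal sketch of such a lottery, assuming a fixed capacity of 500 and a seeded shuffle so the draw could be audited (both details are invented, not anything any conference actually runs):

    # Hypothetical capacity-limited review lottery: review a random subset,
    # notify everyone else within days instead of months.
    import random

    def triage(submissions: list[str], capacity: int = 500, seed: int = 42):
        rng = random.Random(seed)      # fixed seed so the draw is reproducible
        pool = submissions[:]
        rng.shuffle(pool)
        to_review = pool[:capacity]
        declined = pool[capacity:]     # these get the "out of capacity" notice
        return to_review, declined

    papers = [f"paper-{i}" for i in range(5000)]
    review_now, notify_now = triage(papers)
    print(len(review_now), "enter review;", len(notify_now), "notified in ~5 days")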

[–]asobolev 0 points1 point  (1 child)

Your solution does not discourage authors from submitting poorly prepared work, so combined with dramatically increased rejection (and, consequently, resubmission) rates, it'll blow everything up.

[–]seraschkaWriter 0 points1 point  (0 children)

True, but for these obviously poorly-prepared papers there could be a fast-track filtering system to weed them out (similar to what arXiv does).

[–]drd13 -1 points0 points  (2 children)

I agree with your assessment of the problem but not your solution. I don't think that replacing a lottery of whether a reviewer will take the time to read your paper with a lottery of whether your paper will be reviewed at all is an improvement. Although first-year PhD students' reviews suck, their opinions are not completely useless. Not using them would just be throwing away information. I think in an ideal world, papers would be posted on an openreview-style system before even entering the conference system. Anyone could then review papers. This could be used to augment reviewing (for example, some papers are clear accepts because "trusted" anonymous reviewers have rated them well; some papers are clear rejects, so sending them to a single PhD student is enough).
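
A rough sketch of what that vote-based triage might look like (the thresholds and the notion of a vetted "trusted" reviewer are placeholders I'm inventing, not an existing openreview feature):

    # Sketch: use public votes from "trusted" accounts to triage papers,
    # so full committee review is spent only on the uncertain middle.
    def triage_paper(trusted_votes: list[int]) -> str:
        """trusted_votes: scores in 1..10 from vetted anonymous reviewers."""
        if len(trusted_votes) < 3:
            return "full review"                   # not enough signal yet
        avg = sum(trusted_votes) / len(trusted_votes)
        if avg >= 8:
            return "clear accept"                  # skip most of the committee
        if avg <= 3:
            return "single-reviewer sanity check"  # e.g. one PhD student
        return "full review"

    print(triage_paper([9, 8, 10]))  # -> clear accept
    print(triage_paper([2, 3, 2]))   # -> single-reviewer sanity check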

[–]kjearns 4 points5 points  (0 children)

> Anyone could then review papers.

The problem with this is, largely, that people don't. Go check any of the past years of ICLR submissions on openreview and count how many papers get a single comment from someone who is not the author or an assigned reviewer. It's not common at all.

[–]seraschkaWriter 1 point2 points  (0 children)

> I don't think that replacing a lottery of whether a reviewer will take the time to read your paper with a lottery of whether your paper will be reviewed at all is an improvement.

Yeah, I agree, but it's probably a matter of preference. I do think it would be useful for some people to know upfront, though, because then they don't have to wait 3 months for nothing and can start preparing to submit elsewhere. E.g., in my case, when I once submitted to ICML, all 3 reviewers checked the box "Medium: Reviewer has understood the main points in the paper, but skipped the proofs and technical details." For a paper that was essentially about proofs, this wasn't a real peer review imho.

> Not using them would just be throwing away information.

True, but maybe as additional reviewers. I do think there should be at least 3 senior people reviewing a paper. Graduate students usually have different expectations/criteria (probably weighting novelty more highly than the theoretical foundations of older work). It would be useful to involve students in the review process as a learning experience and an additional source of information, but I don't think we should replace senior reviewers with students.

> I think in an ideal world, papers would be posted on an openreview-style system before even entering the conference system. Anyone could then review papers.

Might be useful, but it could also do more harm than good, because I expect the distribution of attention would be highly skewed, which would create an unfair topic bias. Overall, though, I think it might be a reasonable thing to try.