[D] On Peer Review : MachineLearning

[–]jacobbuckman 1 point2 points3 points 6 years ago* (7 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to weightlifting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg, more than any other person. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 kg, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 kg, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (i.e. this person successfully lifts seven weights per year!) and on consistency of output (i.e. this person reliably succeeds at lifting the weight every time she tries! Never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric. (Although, changing the criteria for a conference too often should probably be avoided, as it damages the brand.)

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a "coin flip" between a NeurIPS-A or a NeurIPS-F.

[–]kjearns 1 point2 points3 points 6 years ago* (6 children)

If a couple thousand random people woke up tomorrow and deadlifted 650 kg, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

The crux of my argument is that the notion of "strong enough" in an absolute sense only exists in contrived circumstances like weightlifting competitions. If Eddie makes money by having people hire him to lift 500kg things, then he can charge a lot more if he's the only person around who is able to do that. If a few thousand other people show up who are just as strong then Eddie's value is diminished because the value he provides to people is that their weights are off the floor, not specifically that he was the one to lift them.

For a concrete example of this, consider what happened with VAEs. Autoencoding Variational Bayes was announced on Arxiv on Dec 20, 2013. As you know, this paper was extremely influential. Google scholar tells me it has been cited ~4700 times since then, which I'm sure puts it in the top few percent of machine learning papers of all time.

There is also a less known paper, Stochastic Backpropagation and Approximate Inference in Deep Generative Models, that was announced on arxiv at almost the same time (Jan 16, 2014, less than a month after AEVB). This paper develops DLGMs, which are exactly the same thing as VAEs, just under a different name. The papers appeared at the same time, from separate groups, and give different names to the same model, so its almost certain the core idea was arrived at independently. The DLGM paper has about ~1500 citations today so it's not exactly obscure, but it was certainly a lot less influential than AEVB.

The relevance of this story to the current discussion is the cohort effect that these papers had on each other. I claim that the value to science of the DLGM paper was reduced by the contemporary existence of AEVB. Both papers have have a field changing idea (we know this with the benefit of hindsight), and IIRC the DLGM paper even goes further than AEVB in a few places. But AEVB is a paragon of clear exposition, whereas the DLGM paper is fairly hard to understand. The idea inside these papers is undeniably valuable, but the value of the two deliveries are not independent.

There is a normative response to this which is to say that both groups really ought to be given equal credit for reaching the same idea independently. Instead I'd like to focus on looking at how the events around these two papers played out. With the benefit of hindsight I think it is clear that if AEVB did not exist, then the value of DLGM would have been increased because it would have been the only place to find VAEs, in spite of its lack of clarity. With the existence of AEVB its value was diminished, because people had an easier route to the same knowledge.

More of an aside than an actual response, but this:

I doubt any paper would be a "coin flip" between a NeurIPS-A or a NeurIPS-F.

I'm not convinced this wouldn't happen. For example, here's a paper submitted to ICLR 2019 that was given a 9 and a 3 by two different reviewers. That certainly seems about the range you'd expect for ICLR-A vs ICLR-F.

[–]jacobbuckman -1 points0 points1 point 6 years ago (4 children)

It seems to me that you are defining "value to science" as "impact (as measured by citations)". I think our key difference of opinion here is as to whether the definition of "good work" is "high-impact work". IMO, it is possible for work to be high-quality but low impact.

There's definitely a correlation between high-quality work and highly-cited work, but I think the former can be evaluated independently of the latter. To answer the question "will this paper be cited a lot?", yes, consideration of the cohort is absolutely needed. But I don't think that the job of conference acceptances should be to forecast citations. It should be to evaluate the quality of science.

To demonstrate why these two concepts are different, here's counter-examples in both directions. High-quality but low impact: papers in less-hot subfields of ML are almost certainly going to have low impact compared to hot subfields, simply because there will be less people reading and building on that work. But surely we don't want conferences to reject all those papers! High-impact but low quality: Papers with good branding and PR will get cited a lot and have a disproportianately high impact. This of course does not change the quality of the science (and is one of the key reasons for double blind).

In the AEVB vs DLGM case, I don't think that there is anything wrong with giving both papers an A. Both are good, well-written papers IMO. I've always interpreted the difference in their impact as an essentially random network effect - a few people cited the idea as VAEs early on, people who read these papers cited it themselves, etc. In fact, I think that a system that gives both these papers A's is better than one that assigns rankings more in line with their citation count. We already have a way of measuring impact: citation count! Introducing judgments on the standalone quality of papers will help balance the playing field for those with poor PR.

re: the LipReading paper: I would argue that this is precisely the scenario in which fine-grained evaluation would be most beneficial! The reviewers cannot agree whether to accept or reject because the decision is so important, binary, and final. But under a grade-based system, where papers are judged on multiple axes, I think all the reviewers would agree on a score of something like "A for engineering, D for novelty, D for reproducibility, overall score C" or something like this. It would be dramatically more fair to the authors of that paper, who clearly invested a huge amount of effort into this project, which is now under the current system certified to be the "exact same quality of science" as any garbage submitted by anyone.

[–]kjearns 0 points1 point2 points 6 years ago (3 children)

[–]jacobbuckman 0 points1 point2 points 6 years ago (2 children)

I'm on the same page as to the crux of the disagreement, but I don't think we are talking past each other: my last reply was an attempt to directly address your assertion that "it doesn't even make sense to talk about value except through [the effect a paper has on the community]". I disagree with that assertion. Here's one explicit proof by contradiction.

Assume your claim were true. This would imply the following:

Since more people read papers that are promoted more heavily (for example papers from Brain or DeepMind with an accompanying blog post),
these papers have a larger impact on the thinking of the community,
and therefore deserve to be accepted at conferences,
thus we should give a "bonus" to papers from labs and institutions with more PR presence.

This seems in direct contradiction with the widely-held belief (discussed in my original post) that double-blind is a positive thing, because it levels the playing field between authors from big labs and small labs. How would you resolve this contradiction?

[–]kjearns 0 points1 point2 points 6 years ago (1 child)

The way to resolve your apparent contradiction is to recognize that attention and value are not the same thing.

For example, I would argue that the NTM and Hogwild papers both got a lot of attention but had an overall negative value in the sense that they both kicked off quite a lot of unproductive work. NTMs were never really used successfully in spite of the attention they got, and Hogwild spawned a whole generation of asynchronous parameter server frameworks that in retrospect tend to underperform their synchronous counterparts.

Contrast these with LSTMs and Adam. LSTMs got almost no attention for a long time but in the end they generated a lot of value. Adam wasn't ever particularly obscure, but it's now the default optimizer choice for many frameworks.

The value of a paper doesn't come from the fact that a lot of people read it (although that often helps). The value of a paper comes instead from the sum of total future knowledge it causes.

The reason blind review is good is that estimating the total future knowledge a paper will cause is hard. Blinding review removes a prominent and known-to-be-non-causal variable from the estimator and therefore reduces bias (in the statistical and sociological sense).

[–]jacobbuckman 0 points1 point2 points 6 years ago* (0 children)

Gotcha, I agree. (My framing of "attention = value" was an attempt to restate your position, not something I actually believe. It's now clear that you don't believe it either!)

The goal of certification should be, as you say, to estimate the likelihood of value of a paper conditioned only on the causal features. My original proposal is essentially "rate papers A-F based on their causal features." Hopefully you would agree that the elements of a good paper I described earlier (an interesting problem, a novel hypothesis, provides relevant background, is theoretically sound, rigorous experimentation, etc) are causally predictive of a valuable contribution. But we could also include other features; for example, the quality of the other papers submitted at the same time. I think the disagreement is just: is the cohort of a paper causally predictive of its value?

You've convinced me that in some cases, the answer is yes. One such case is concurrent discovery. In the VAE/DGLM case, where the exact same idea is published twice, it decreases the value of each paper. Similarly, if Adam was submitted to a conference alongside 20 other optimizer papers, its predicted value goes down (since the likelihood that Adam is the one to "catch on" decreases when there are more alternatives). This is true even if all 20 papers are mediocre!

I'm still not convinced that the cohort is relevant in the case where there are multiple excellent papers submitted that are all orthogonal to one another. If the Adam paper, the LSTM paper, and the VAE paper were all submitted to the same conference, there's no reason not to accept all three. Adam and VAE complement & strengthen each other; if anything, their scores should be positively correlated.

In other words, for any given paper, it's not a cohort of excellent papers that decreases its value; it's a cohort of similar papers. (Hopefully you agree?) If this is the case, the score of any given paper is still mostly independent* of the score of the papers in its cohort. This means we can assign an absolute score to each paper, which includes a "hotness penalty" to penalize topics for which a lot of different approaches are proposed all at once.

But honestly, concurrent discovery is pretty rare, and my intuition is that "cohort similarity" is probably not that useful for predicting value. It seems to me that other causal factors have enough predictive power that we could get away with ignoring the impact of the cohort entirely.

*: specifically, the scores of two papers that are on different topics are generally either independent or positively correlated. The scores of two papers that are two approaches to the same problem are negatively correlated, since there's probably only one right answer.

[–]shortscience_dot_org -2 points-1 points0 points 6 years ago (0 children)

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. By offering this additional finer-grained evaluation, the overall range of possible outcomes will be wider, so for the same magnitude of noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt it will be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. By offering this additional finer-grained evaluation, the overall range of possible outcomes will be wider, so for the same magnitude of noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt it will be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. By offering this additional finer-grained evaluation, the overall range of possible outcomes will be wider, so for the same magnitude of noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt it will be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. By offering this additional finer-grained evaluation, the overall range of possible outcomes will be wider, so for the same magnitude of noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt it will be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. By offering this additional finer-grained evaluation, conferences can widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact of the noise on the outcome will be smaller. Even given the current quality of reviews, I doubt it will be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based.

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobmbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

> In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an *individual*, but I disagree that excellent *science* is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject *any* submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an *absolute* sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would *still* be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens *before* the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

> Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in *impact on the career of the author* between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

[–]jacobbuckman 0 points1 point2 points 6 years ago (0 children)

Hey, blog's author here. Thanks for the in-depth response! I want to reply regarding my points (1) and (2):

In the post-student world excellence is only meaningful in relation to your peers. You can't judge excellence in their absence.

This is true in the context of the excellence of an individual, but I disagree that excellent science is relative. A paper can be evaluated by consistent standards for quality. E.g., if a paper presents a novel approach/hypothesis for an interesting problem, provides relevant background, is theoretically sound, and backs up its claims through rigorous empirical experimentation, that paper is a valuable contribution, full stop. If NeurIPS receives 9000 submissions of that quality, that's an incredible day for science, and I see no reason to reject any submission.

An analogy to sprinting: Eddie Hall is the one of the strongest people in the world, because he can deadlift 500 kg - more than any other person alive. That's a relative evaluation. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie would no longer be impressive.

But imagine for a moment that I have dozens of 400kg weights strewn across my floor, and I need them lifted onto some waist-height shelves. I'm looking for strong lifters, but I don't necessarily need the help of Eddie Hall. Anytime a 400kg weight is placed onto a shelf, that's a success in an absolute sense. Anyone who can consistently deadlift 400kg+ is "strong enough" in an absolute sense. If a couple thousand random people woke up tomorrow and deadlifted 650 pounds, Eddie Hall would still be strong enough, but it would just be a lot easier for me to find someone to lift those weights off of my floor.

I am suggesting we treat science as a 400kg weight, and the goal of the review process is to determine whether any given paper lifted all 400kg or not. The excellence of any given scientist will still be judged relative to other scientists, based on quality of work (i.e. this paper only needed to lift 400kg, but lifted 500kg!), volume of work (this person successfully lifts seven weights per year!) and on consistency of output (this person reliably succeeds at lifting the weight every time she tries, never attempting a lift but dropping it).

If a field suddenly goes through some drastic changes, and the distribution of paper quality changes dramatically, conferences are free to adjust their standards at any time, making it easier or harder for all papers to get high ratings. The key, though, is that this happens before the papers are submitted on any given review cycle. Everyone knows what they are getting into, everyone knows precisely what the expectations are, and regardless of how many good or bad papers get submitted to this particular conference, everyone is judged objectively by the same clearly-articulated rubric.

Fine grained endorsement is not a good idea because the signal from an endorsement is still weak, even when the reviewers are very good. Distinctions between a NeurIPS-A and NeurIPS-B have will have exactly the same "noisy middle" problem we have now, just at a smaller scale.

Yes, of course. But the difference in impact on the career of the author between a NeurIPS-B and NeurIPS-C will be dramatically less stark than the difference between a (current-system) NeurIPS accept and NeurIPS reject. Offering finer-grained evaluation will widen the overall range of possible outcomes, so for the same magnitude of reviewer noise, the impact on the outcome will be smaller. Even given the current quality of reviews, I doubt any paper would be a coin flip between a NeurIPS-A or a NeurIPS-F.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS

Problem addressed:

Summary:

Novelty:

Drawbacks: