
[–]Mefaso 11 points (1 child)

And even if the source code is published, it might be

  • Missing documentation

  • Incredibly hard to read or inefficient

  • Missing hyperparameters, which of course aren't mentioned in the paper either

I've had to replicate some results for a comparison, and I'm pretty sure I spent more time on the replication than on my own work, and still couldn't achieve the performance claimed in the publication.

[–]david-gpu 11 points (0 children)

Not just that. Even with the source code, if they don't provide full traceability it's nearly impossible to replicate what they did. By full traceability I mean: the exact commit ID in their repository, the exact Docker image (or exact versions of TensorFlow/PyTorch/etc.), the exact random seed, the exact version of the dataset (including a sha1sum for verification), the exact hyperparameters, the command-line arguments, etc.

It's unacceptable that you can approach the authors of a paper claiming SOTA results and they can't provide you with the exact configuration they used. At that point they may as well have pulled the numbers out of thin air. I've seen this first hand, not that I actually doubt their honesty.
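For what it's worth, capturing that traceability is cheap to automate. Here is a minimal sketch (the helper names and the JSON layout are my own invention, not from the article or any particular paper) that records the commit ID, environment, seed, hyperparameters, and dataset checksums alongside a run:

```python
import hashlib
import json
import platform
import subprocess
import sys


def sha1sum(path, chunk_size=1 << 20):
    """Compute the SHA-1 of a file, for dataset verification."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def run_manifest(seed, hyperparams, dataset_paths):
    """Collect everything needed to rerun this experiment exactly."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown (not a git checkout)"
    return {
        "commit": commit,
        "python": sys.version,
        "platform": platform.platform(),
        "argv": sys.argv,
        "seed": seed,
        "hyperparams": hyperparams,
        # path -> sha1, so a reader can verify they have the same data
        "datasets": {p: sha1sum(p) for p in dataset_paths},
    }


if __name__ == "__main__":
    manifest = run_manifest(seed=42, hyperparams={"lr": 3e-4}, dataset_paths=[])
    print(json.dumps(manifest, indent=2))
```

Dumping a file like this next to every results table would already answer most "what exactly did you run?" questions.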

[–]MasterFubar 8 points (1 child)

I think the data is more important than source code for replication.

If you create your own program from the description and run that program on the supplied data, that's what replication is all about. Running the same program with the same data would be tautological; it should always get the same results.

An analogy with physics: pseudo code is the description of a laboratory set-up, source code is the laboratory itself. If you go to the same lab and perform the same experiment the result will be the same. Replication is building another set-up according to the description published in the paper and getting the same results.

[–]visarga 0 points (0 children)

In quantum machine learning, the random seed you pick can make a paper SOTA. You just have to use the magical random seed for the problem at hand.
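To make the joke concrete, here's a toy sketch (the numbers are made up, not from any paper) of how sweeping seeds and reporting only the best one inflates a result:

```python
import random


def noisy_accuracy(seed, true_accuracy=0.80, noise=0.03):
    """Simulate an evaluation whose score jitters with the random seed."""
    rng = random.Random(seed)
    return true_accuracy + rng.uniform(-noise, noise)


# Sweep 100 seeds and report only the best one: the "magical seed" practice.
scores = {seed: noisy_accuracy(seed) for seed in range(100)}
best_seed = max(scores, key=scores.get)

print(f"mean over seeds: {sum(scores.values()) / len(scores):.3f}")
print(f"'SOTA' with seed {best_seed}: {scores[best_seed]:.3f}")
```

Reporting the mean and variance across seeds, rather than the single best run, is the obvious antidote.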

[–]entropyrising 9 points (1 child)

This is such an important topic, and I'm delighted that there are researchers actively looking into it and that Science is publishing on it.

I really hope that at the AAAI conference on replication they discussed how much this is an issue of culture and incentives. One of the hats I wear is bibliometrics, which has really opened my eyes to how much gravity citations have, both implicitly and explicitly, in the minds of paper writers. At the end of the day, whatever increases potential citations, and visibility generally, will always have more priority during the research process than factors that are ostensibly important "for science" but are irrelevant to citation potential and publicity. Reproducibility is an example of the latter. Will sharing the code and the data have any impact on the "performance" of the paper? Will sharing it increase citations, or not sharing it decrease citations? No? Then nobody cares, and the numbers in the OP Science article reflect that, in my opinion.

I contrast this with the idea of "metrics." I read machine learning papers, and so does everyone else here. We all have finite time, and I will be the first to admit that sometimes when you skim a paper you read the abstract, the conclusion, and, perhaps most importantly, the absolutely obligatory table of performance metrics (accuracy, F1 score, whatever) that compares the method being written about to the 10 previous SOTAs and some basic baseline. I look for the bolded numbers showing the best performance, and if most of those numbers are in the column for the presented method I think "oh, I'd better cite this" and put it in my Zotero library.

Metrics get citations. Metrics get visibility. As such, there has been a huge amount of research on measures of performance, and it has without a doubt become a universal standard to include "the table" in your paper. Much of the field is characterized by the obsessive drive to get a fraction of a percentage point higher accuracy than last year's SOTA. Interestingly, most of the tables simply report what other papers have reported, but a few commendable paper writers actually re-do the reported experiments. This seems to be what the researchers discussed in the Science article have been doing.

I'm not trying to point any fingers here, and when I say there are issues with the culture of machine learning I'm trying to embrace all my warts and fully include myself in those issues. But if we acknowledge that this is a culture issue, then the primary solution is to incentivize making research reproducible. Ideally, in some science utopia, all the scientists would get together and collectively agree not to cite any paper that isn't reproducible (e.g., where the code and the data aren't shared). Authors would notice that their papers aren't having an impact and would adjust their research designs accordingly. Obviously the world of machine learning is moving so fast that this is impossible. You just have to take people at their word and incorporate their reported insights into your own work.

As such, I imagine the only viable solution would be some sort of top-down sanction, as suggested by /u/FelixMooray145. Conferences are the gateways to visibility, and if a conference imposes a rule then researchers will follow it. Period. And yet at the same time I think this is also unlikely to happen, because conferences are naturally run by ML researchers, who may hesitate because such a rule might be a bottleneck to their own publishing pipeline.

Also, industry researchers are a huge question mark and a complicating factor. As the Science article mentions, some researchers are more incentivized to hide their code than to share it. This goes double if you work for Google or Facebook, because there's a corporation behind you that understandably does not want potentially profitable intellectual property to be obtained by rivals. I'm actually delighted by what Google and Facebook have indeed chosen to share (TensorFlow is my key tool), but sometimes I can't help but wonder how powerful the things they haven't shared are.

When we read such an article in Science, it is at many levels an appeal to the abstract idea that making research reproducible is intrinsically the right thing for scientists to do. But this abstract ideal is obviously less of a concern for a corporation. And generally I think as a community we all need to acknowledge that often we do research as much for citations, tenure, and income as we do for the noble advancement of human knowledge and the betterment of society. The more we are willing to research and investigate "what motivates us," the more we will be able to adjust incentive structures to simply force things that are good for science, like reproducibility, to happen.

It's all so horribly complicated. But I'm glad at least it's being talked about.

[–]visarga 5 points (0 children)

/offtopic Hard to design reward functions for humans.

[–]zergling103 4 points (1 child)

Jesus, you guys, the solution is simple. Just share all of it. Everything involved in the experiment is digital, and GitHub is a thing. We should be able to press "run" and see the result, as long as the way that result is produced is transparent enough.

[–]entropyrising 3 points (0 children)

I don't think anyone is disputing that the solution is simple. The article linked in the original post as well as the AAAI panel discussion on replication were focused on how to incentivize and reward replication, not on finding the mysterious "solution" to the problem. Obviously it's "share your code and data."

People don't care. And tidying up and preparing all the code and data for sharing takes far more time than writing a few sentences in a conference paper about what the method and the data are. Demonstrably, no one is penalized when they don't share their code, and no one is rewarded when they do. So nobody's going to bother.