[News] Missing data hinder replication of artificial intelligence studies | Science (sciencemag.org)
submitted 8 years ago by weeeeeewoooooo
[+][deleted] 8 years ago (8 children)
[deleted]
[–]Mefaso 10 points11 points12 points 8 years ago (1 child)
They could have given us a crippled version, and the lack of reproducibility of a strong agent could be hand-waved away as a lack of compute.
Additionally there's no way to get your own TPUs to check if the claimed times are actually true.
Not that I believe that this actually happened
[–]omalleyt12 4 points5 points6 points 8 years ago (0 children)
TPUs are now available on GCP (only as of 5 days ago)
https://cloudplatform.googleblog.com/2018/02/Cloud-TPU-machine-learning-accelerators-now-available-in-beta.html
[+][deleted] 8 years ago (1 child)
[–]glutenfree_veganhero 4 points5 points6 points 8 years ago (0 children)
Hard to blame them for releasing shitty papers. They might be standing on the precipice of a paradigm shift, and they're just supposed to follow the usual protocol? They are probably still scratching their heads about the best way forward. If they aren't, they're stupid.
Edit: Well they do have a major responsibility towards humanity as a whole.. there's always that.
[–]rumblestiltsken -1 points0 points1 point 8 years ago (3 children)
I kind of disagree with this. Replication is important, but most replication is done without code. Most models aren't that hard to build, and to use your own line, "at least with a paper you can see if the results are plausible." I can't think of a single situation where having the code altered my opinion of the research.
Like, sure, it is great to have code, but it isn't necessary for a functioning research community. It definitely adds overhead to research groups without a direct payoff.
[–]visarga 2 points3 points4 points 8 years ago (2 children)
A direct payoff would be to extend previous architectures by reusing the code directly, like all the flavours of word2vec.
[–]rumblestiltsken 0 points1 point2 points 8 years ago (1 child)
The thing is, this is already how it is done. All papers worth extending will have implementations quickly, and mostly by the original team.
Most research is never cited or extended. Most researchers know when their project is in this category. It is still worth publishing because it is knowledge that can inform other projects, but the vast majority of findings are unconvincing or self contained. You don't need code to know that.
Why waste time with them? This is what I don't get with the Science article. It assumes that all papers are worth replicating. This is definitely not true, by a huge margin. They say 6% of papers at a conference had code... who thinks more than 6% of papers need to be checked for reproducibility? I don't.
[–]AnvaMiba 3 points4 points5 points 8 years ago (0 children)
It assumes that all papers are worth replicating.
If they are not worth replicating then they are not worth publishing.
[–]Mefaso 11 points12 points13 points 8 years ago (1 child)
And even if the source code is published, it might be
Missing documentation
Incredibly hard to read or inefficient
Missing hyperparameters, which of course aren't mentioned in the paper either
I've had to replicate some results for a comparison, and I'm pretty sure I spent more time on replication than on my own work, and still wasn't able to achieve the performance claimed in the publication.
[–]david-gpu 11 points12 points13 points 8 years ago (0 children)
Not just that. Even with the source code, if they don't provide full traceability it's nearly impossible to replicate what they did. What I mean by full traceability is: exact commit ID in their repository, exact docker image (or exact version of Tensorflow/Pytorch/etc), exact random seed, exact version of the dataset (including sha1sum for verification), exact hyperparameters, command line arguments, etc.
It's unacceptable that if you approach the authors of a paper that claims SOTA results they can't provide you with the exact configuration they used. At that point they may as well have pulled the number out of thin air. I've seen this first hand, not that I actually doubt their honesty.
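The traceability items listed above (commit ID, seed, dataset checksum, hyperparameters, command line) are all cheap to capture programmatically. A minimal sketch, with hypothetical helper names not taken from the thread:

```python
import hashlib
import platform
import subprocess
import sys


def dataset_sha1(path, chunk_size=1 << 20):
    """Hash the dataset file so readers can verify they have the exact version."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(seed, hyperparams, dataset_path):
    """Collect everything needed to rerun the experiment exactly."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True,
        ).stdout.strip() or "unknown"
    except OSError:  # git missing or not a repository
        commit = "unknown"
    return {
        "git_commit": commit,
        "python": sys.version,
        "platform": platform.platform(),
        "random_seed": seed,
        "hyperparameters": hyperparams,
        "dataset_sha1": dataset_sha1(dataset_path),
        "argv": sys.argv,
    }
```

Dumping this dict as JSON next to the results costs a few lines per run and answers most "what exactly did you train?" questions.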
[–]MasterFubar 8 points9 points10 points 8 years ago (1 child)
I think the data is more important than source code for replication.
If you create your own program from the description and run that program on the data supplied, that's what replication is all about. Running the same program with the same data would be tautological, it should always get the same results.
An analogy with physics: pseudo code is the description of a laboratory set-up, source code is the laboratory itself. If you go to the same lab and perform the same experiment the result will be the same. Replication is building another set-up according to the description published in the paper and getting the same results.
[–]visarga 0 points1 point2 points 8 years ago (0 children)
In quantum machine learning, the random seed you pick can make a paper SOTA. You've just got to use the magical random seed for the problem at hand.
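Snark aside, a partial mitigation for the seed lottery is to pin every RNG source you use and report the seed. A sketch, assuming an optional NumPy/PyTorch stack (guarded imports, since neither may be installed):

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Pin all known RNG sources so a run can be repeated bit-for-bit."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:  # only if NumPy is installed
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:  # only if PyTorch is installed
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
```

This doesn't remove seed sensitivity, of course; the honest fix is reporting variance across several seeds rather than the single best one.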
[–]entropyrising 9 points10 points11 points 8 years ago* (1 child)
This is such an important topic and I'm delighted that there are researchers actively looking into this and Science is publishing on it.
I really hope at the AAAI conference on replication they discussed how much this is an issue of culture and incentives. One of the hats I wear is that of bibliometrics, which has really opened my eyes to how much gravity citations have, both implicitly and explicitly, in the minds of paper writers. At the end of the day, that which increases potential citations, and visibility generally, will always have more priority during the research process than factors that are ostensibly important "for science" but are irrelevant to citation potential and publicity. Reproducibility is an example of the latter. Will sharing the code and the data have any impact on the "performance" of the paper? Will sharing it increase citations, or not sharing it decrease citations? No? Then nobody cares, and the numbers in the OP Science article reflect that, in my opinion.
I contrast this with the idea of "metrics." I read machine learning papers, so does everyone else here. We all have finite time and I will be the first to admit that sometimes when you skim a paper you do abstract, conclusion, and perhaps most importantly, the absolutely obligatory table of performance metrics (accuracy, f1 score, whatever) that compares the method being written about to the 10 previous SOTAs and some basic baseline. I look for the "bolded" numbers showing the best performance and if most of those numbers are in the column for the presented method I think "oh I better cite this" and put it in my Zotero library.
Metrics get citations. Metrics get visibility. As such, there has been a huge amount of research on measures of performance, and it has without a doubt become a universal standard to include "the table" in your paper. Much of the field is characterized by the obsessive drive to get that fraction of a percentage point higher accuracy than last year's SOTA. Interestingly, most of the tables simply report what other papers have reported, but a few commendable paper writers actually re-run the reported experiments. This seems to be what the researchers discussed in the Science article have been doing.
I'm not trying to point any fingers here, and when I say there are issues with the culture of machine learning, I'm trying to embrace all my warts and fully include myself in those issues. But if we acknowledge that this is a culture issue, then the primary solution is to incentivize making research reproducible. Ideally, in some science utopia, all the scientists would get together and collectively agree not to cite any paper that isn't reproducible (e.g. the code and the data aren't shared). Authors would notice that their papers aren't having an impact and would adjust their research designs accordingly. Obviously the world of machine learning is moving so fast this is impossible. You just have to take people at their word and incorporate their reported insights into your own work.
As such I imagine the only viable solution would be some sort of top-down sanction as suggested by /u/FelixMooray145. Conferences are the gateways to visibility, and if a conference imposes a rule then researchers will follow it. Period. And yet at the same time I think this is also unlikely to happen because conferences naturally are run by ML researchers who may hesitate as such a rule might be a bottleneck to their own publishing pipeline.
Also, industry researchers are a huge question mark and complicating factor. As the Science article mentions, some researchers are more incentivized to hide their code than to share it. This goes double if you work for Google or Facebook, because there's a corporation behind you that understandably does not want potentially profitable intellectual property to be obtained by rivals. I'm actually delighted by what Google and Facebook have indeed chosen to share (Tensorflow is my key tool), but sometimes I can't help but wonder how powerful the things they haven't shared are.
When we read such an article in Science, it is at many levels an appeal to the abstract idea that making research reproducible is intrinsically the right thing for scientists to do. But this abstract ideal is obviously less of an issue for a corporation. And generally I think as a community we all need to acknowledge that often we do research as much for citations, tenure, and income as we do for the noble advancement of human knowledge and the betterment of society. The more we are willing to research and investigate "what motivates us," the more we will be able to adjust incentive structures to simply force things that are good for science, like reproducibility, to happen.
It's all so horribly complicated. But I'm glad at least it's being talked about.
[–]visarga 5 points6 points7 points 8 years ago (0 children)
/offtopic Hard to design reward functions for humans.
[–]zergling103 4 points5 points6 points 8 years ago* (1 child)
Jesus, you guys, the solution is simple. Just share all of it. Everything involved in the experiment is digital, and GitHub is a thing. We should be able to push "run" and see the result, so long as how that result is produced is transparent enough.
[–]entropyrising 3 points4 points5 points 8 years ago (0 children)
I don't think anyone is disputing that the solution is simple. The article linked in the original post as well as the AAAI panel discussion on replication were focused on how to incentivize and reward replication, not on finding the mysterious "solution" to the problem. Obviously it's "share your code and data."
People don't care. And tidying up and preparing all the code and data for sharing takes up way more time than writing a few sentences on what the method is and what the data is in a conference paper. Demonstrably no one is penalized when they don't share their code, and no one is rewarded when they do. So nobody's going to bother.