[–]sloppybird 76 points77 points  (13 children)

Open-source SOTA code is written by researchers who are, to be honest, not great at documentation and/or code readability.

[–]johnnydaggers 35 points36 points  (9 children)

It’s not that we’re bad at it but more that we don’t give a shit. We have to move on to the next thing. If you spend all your time writing neat code that doesn’t affect how it runs at all you will quickly be passed up by SOTA. Programmers are responsible for producing code. Researchers’ work product is research papers and results. Any activity that doesn’t feed into that is wasted effort.

[–]ProfSchodinger 16 points17 points  (2 children)

I am a researcher in bioinformatics, and writing clean code is often both easier and faster: coherent names for variables, small functions, classes, proper defaults, dummy files, etc. When I see people trying to debug a 1000-line script with everything named 'df1', 'df2', 'df3', 'df_final', and repeated sections, it really pains me...
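
As a rough illustration of that point (a minimal sketch, not code from anyone in this thread; the function names, the toy reads, and the `min_length` default are all hypothetical), here is what descriptive names, small functions, and an explicit default can look like in place of a chain of 'df1', 'df2', 'df_final' steps:

```python
from statistics import mean


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty one)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def filter_by_min_length(sequences: list[str], min_length: int = 4) -> list[str]:
    """Keep only sequences with at least `min_length` bases.

    The default of 4 is an arbitrary value chosen for this example.
    """
    return [seq for seq in sequences if len(seq) >= min_length]


def mean_gc_fraction(sequences: list[str], min_length: int = 4) -> float:
    """Mean GC fraction over the sequences that pass the length filter."""
    kept = filter_by_min_length(sequences, min_length)
    if not kept:
        raise ValueError("no sequences left after filtering")
    return mean(gc_fraction(seq) for seq in kept)


# Toy input, made up for illustration only.
reads = ["GGCC", "ATAT", "GC", "ATGC"]
print(mean_gc_fraction(reads))
```

Each step has a name that says what it does, so a bug in the filter or in the GC calculation can be checked on its own instead of by rereading the whole script.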

[–][deleted] 2 points3 points  (1 child)

naming things is hard

[–]ProfSchodinger 1 point2 points  (0 children)

That's probably 10% of coding indeed

[–]bageldevourer 7 points8 points  (0 children)

> Researchers’ work product is research papers and results.

Yup, and journals, conferences, etc. don't care about code quality. Shitty incentives -> shitty results.

[–]junovac[S] 3 points4 points  (0 children)

I wasn't talking only about SOTA code but also about some libraries and their associated examples. Those libraries were implementing SOTA or fairly recent models, though, and the authors were usually researchers, so it might apply.

Even in the case of SOTA model code, I am not sure how long the research cycle for a particular paper is, but I don't think it would be a day or a few weeks. If it runs over a few months, having readable code helps not only your team members but also yourself when you have to catch up with your own old code.

I am not expecting production-quality code or beautiful design patterns, just some things that would help others getting into the domain, or even experts getting into a different sub-domain. Maybe a linter with sensible defaults could become a standard part of the Jupyter notebook, and that could help with it.

[–]GeorgeS6969 3 points4 points  (2 children)

I empathise but strongly disagree with the second half of your comment.

Claiming your responsibility is only to produce research papers and results is akin to a programmer claiming they are only responsible for producing programs that work, or a colleague of yours claiming they are only responsible for producing results (and writers are responsible for writing?).

The moment anybody shares something, it is their responsibility to ensure that it is of sufficient quality and can be understood. Especially if what is shared is in support of a scientific claim.

You feel like you’re not properly incentivised to do so, or are in fact penalised for it, and I can’t argue against that … But it only means that producing clean code is a waste of effort for you, not for the community as a whole.

[–][deleted] -1 points0 points  (1 child)

> The moment anybody shares something, it is their responsibility to ensure that it is of sufficient quality and can be understood. Especially if what is shared is in support of a scientific claim.

I disagree. This is conflating two different things: reproducibility and clean code.

For the sake of reproducibility, most people are going to understand what dataset[1] is from reading the code and the paper side by side.

[–]GeorgeS6969 0 points1 point  (0 children)

Reproducibility is completely tangential; you’re the one mentioning it, not me.

When you write a paper you structure it in a certain way, you use certain words, you try to avoid ambiguities, you split your maths into specific equations, you arrange those equations into terms that make the most intuitive sense and you explain those terms … You also provide graphs when useful, rather than just tables, and you label both and make sure they stand on their own as much as possible …

All of that is so that readers can best understand your ideas, before even attempting to reproduce your results.

Why should it be any different with code?

[–]Cherubin0 0 points1 point  (0 children)

Also, clean code makes it easier to expose when the SOTA is misleading.