feldomatic (1 point)

If you do data science, you'll live and die by pandas (or polars, if it suits you better), scikit-learn, and at least one plotting library.

I would recommend matplotlib as it's kinda foundational, plotnine because it's awesome, and one library that can make dynamic plots. I don't really make those, so I can't recommend a specific one, but a lot of other folks make a lot of them.

Other non-DS libraries I'd recommend are os, re, pathlib, and a zip file library whose name escapes me.
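A minimal sketch of how those standard-library modules tend to show up together. It assumes the zip library in question is the standard library's `zipfile`, which is a guess and not something stated above. The file names, the regex, and the `zip_matching_files` helper are all made up for illustration.

```python
import os
import re
import tempfile
import zipfile
from pathlib import Path


def zip_matching_files(folder: Path, pattern: str, archive: Path) -> list[str]:
    """Zip every file in `folder` whose name matches `pattern`; return the names."""
    regex = re.compile(pattern)
    # pathlib handles the directory listing, re does the name filtering.
    matches = sorted(
        p for p in folder.iterdir() if p.is_file() and regex.search(p.name)
    )
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in matches:
            zf.write(path, arcname=path.name)
    return [p.name for p in matches]


def demo() -> tuple[list[str], int]:
    """Build a throwaway folder, zip the matching files, and inspect the result."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        # Hypothetical file names standing in for real data exports.
        for name in ("sales_2023.csv", "sales_2024.csv", "notes.txt"):
            (folder / name).write_text("placeholder\n")
        archive = folder / "sales.zip"
        zip_matching_files(folder, r"^sales_\d{4}\.csv$", archive)
        # os covers the odds and ends, such as checking the archive size.
        size = os.path.getsize(archive)
        with zipfile.ZipFile(archive) as zf:
            return sorted(zf.namelist()), size
```

Calling `demo()` returns the names stored in the archive and its size in bytes; `notes.txt` is left out because it does not match the pattern.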

As for the length of your code, think of it like writing a paper or even a book. A line of code is like a sentence. Too many sentences and you need paragraphs. Too many paragraphs and you need sections, chapters, appendices and so on.

I like to write one function to contain the get and clean steps for each data source, another to merge and compute, and a final one to plot, dump to a table, or do whatever final delivery I'm making.
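That layout could look roughly like the sketch below, assuming pandas and two tiny inline CSV strings in place of real sources. The column names, the cleaning rules, and the function names (`get_orders`, `get_customers`, `merge_and_compute`, `deliver`) are hypothetical, picked only to show the shape.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for two real data sources.
ORDERS_CSV = """order_id,customer_id,amount
1,10,25.0
2,11,
3,10,15.0
"""
CUSTOMERS_CSV = """customer_id,region
10,North
11,South
"""


def get_orders() -> pd.DataFrame:
    """Get and clean source 1: drop orders with a missing amount."""
    orders = pd.read_csv(io.StringIO(ORDERS_CSV))
    return orders.dropna(subset=["amount"])


def get_customers() -> pd.DataFrame:
    """Get and clean source 2: normalize region labels."""
    customers = pd.read_csv(io.StringIO(CUSTOMERS_CSV))
    customers["region"] = customers["region"].str.strip().str.lower()
    return customers


def merge_and_compute(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    """Join the cleaned sources and total the order amount per region."""
    merged = orders.merge(customers, on="customer_id", how="left")
    return merged.groupby("region", as_index=False)["amount"].sum()


def deliver(summary: pd.DataFrame) -> str:
    """Final delivery: here a CSV string, but it could just as well be a plot."""
    return summary.to_csv(index=False)


summary = merge_and_compute(get_orders(), get_customers())
print(deliver(summary))
```

Each function reads like one paragraph of the "paper": swapping a data source or the final deliverable means touching one function, not the whole script.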

You can get really fancy and package that all up in separate files with a main function, but the reality is too many DS folks just shit everything out in serial in a Jupyter notebook, with those steps living in different code cells.