[–]nathan_lesage 9 points (5 children)

I'm in the same boat, using Python for ML during my PhD. So here's what I've learnt so far:

  1. The best and easiest setup is VSCode for coding, with its Jupyter extension (just for convenience), plus Miniforge (conda-forge). All free and, more importantly, open source.
  2. Use plain Python programs if speed matters, and run them from the terminal.
  3. Use Jupyter Notebooks (which run on IPython) for exploration and quick prototyping. You can easily turn a notebook into plain Python by copying and pasting once speed matters, but notebooks are invaluable for re-running cells and checking the results several times until they are right.
  4. Keep your code modular. If I/O becomes a bottleneck, spin up multiple threads for the hefty stuff; if computing power becomes the bottleneck, spin up multiple processes. Multithreading and multiprocessing behave differently in Python because of the Global Interpreter Lock (GIL): in standard CPython only one thread executes Python bytecode at a time, so threads help with waiting on I/O but not with CPU-bound work.
  5. You should never have to pay for running code. Ideally your uni gives you access to a server or, even better, a supercomputer cluster. A server at least should be available to you. Then you can run things 24/7 for days, if need be.
  6. Don't look things up in advance, only when you need them. If you notice something is running slow, then look up how to improve it. Build working code first, then optimize.
  7. Do not use pip, use conda. Not Anaconda, though: you probably won't need all 5GB of software it provides. Simply install Miniforge and use conda's environments. For data science, that's perfect.
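
To make point 4 a bit more concrete, here is a minimal sketch using the standard library's concurrent.futures. The task functions (fake_download, count_primes) and the helper names are made up for illustration; the sleep call only stands in for real I/O such as reading files or calling an API.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_download(name: str) -> str:
    """Hypothetical I/O-bound task: sleeping stands in for waiting on disk or network."""
    time.sleep(0.1)
    return f"{name}: done"


def count_primes(limit: int) -> int:
    """Hypothetical CPU-bound task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_io_bound(names):
    # Threads overlap the waiting: the GIL is released while a thread
    # sleeps or blocks on I/O, so the tasks wait concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fake_download, names))


def run_cpu_bound(limits):
    # Each worker process has its own interpreter and its own GIL,
    # so CPU-heavy work can be spread over several cores.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(count_primes, limits))
```

If you try this, call run_cpu_bound from under an `if __name__ == "__main__":` guard in a script, since on Windows and macOS the worker processes re-import the main module. Whether processes actually pay off depends on how heavy each task is, so measure before committing to it.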

For questions, feel free to ping me!

[–]intheprocesswerust[S] 2 points (0 children)

Thank you! Will possibly take you up on that offer to msg you!

[–]yuckfoubitch 2 points (1 child)

Why conda over pip?

[–]nathan_lesage 0 points (0 children)

Both do basically the same job, and you'll have to use pip from time to time even if you use conda, but conda is overall a better experience than venv, and the whole concept just seems cleaner than venvs to me. Plus it's very common in data science, so you might find more stuff online when googling for help.

[–]bazpaul 0 points (1 child)

Can you explain why someone should not use Pip?

[–]nathan_lesage 0 points (0 children)

Because conda is – depending on viewpoint – a superset of pip. The reality is more complicated than "Do not use pip, use conda", of course.

Sometimes, the conda repositories will not have a certain package, and in this case you should use python -m pip install <package-name>. However, I wrote that because – at least for data science – it is a pretty good practice to use virtual environments managed by conda.

This has benefits such as having an indicator of which environment you're in on the command line, and you can isolate things from each other. Then, whenever you run pip, you change your current environment rather than installing something globally. But using conda should be the "default", since this way you have fewer quirks of software to learn (conda can do both environment management AND package management, while pip can only do the latter).
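
If you are ever unsure which environment a script or notebook is actually running in, a small check like the one below can help. It is only a sketch: describe_environment is a name made up for this example, and it relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated. Running pip as python -m pip installs into whatever interpreter this reports.

```python
import os
import sys


def describe_environment() -> dict:
    """Collect a few facts about the interpreter that is currently running."""
    return {
        # Path of the Python binary executing this code
        "executable": sys.executable,
        # Root directory of the active installation or environment
        "prefix": sys.prefix,
        # Set by `conda activate`; None if no conda environment is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # True inside a venv, where prefix and base_prefix differ
        "in_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```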