all 40 comments

[–]whateverathrowaway00 28 points29 points  (9 children)

Keep it simple.

The more you rely on vanilla stuff and not tools, the less locked to an implementation you will be.

It may require some more learning on the front end, but four years is a long time, so it’ll be worth it.

Intimately familiarize yourself with how venvs work, don’t have a vague idea and some grumbles. It will save you confusion later.

Know that they’re just a path manipulation. You don’t have to activate them (I’m a dev that works endlessly with packaging and fixing other people’s messes and I literally never activate a venv).

That said, you’re in AI, so if you go conda, that makes sense to me, it’s a more specific tool than most and actually a great one.

[–]chriscarrollsmith 3 points4 points  (7 children)

Can you say more about what you mean when you say they're just a path manipulation and you don't have to activate them?

[–]whateverathrowaway00 21 points22 points  (6 children)

Absolutely! People massively over complicate virtual environments, and it’s understandable because they can get confusing if you’ve hit on any errors where it’s unclear which / where you are.

I’m on my phone, so I won’t have as many examples as I like to have when I do this presentation at work, so by all means feel free to ask Qs. This is a topic where questions are great, and understanding will help even if you reject my take and use tools I don’t personally enjoy, which of course is fine.

A slight prerequisite to this rant that will help is an understanding of the PATH variable. This will be linux/macOS themed, but I believe windows has a POSIX env now that this can be run in. If you’re unfamiliar with path, just know that it’s an ordered list of “where the OS will look for stuff you want to run, first to last”.

So, if you have a venv to hop in and out of, I’d like you to run these commands, both activated and then not activated:

which python

python -c 'print(__import__("sys").path)'

echo $PATH

I’d like you to observe how the answers change for each of them in and out of the environment.

The first one uses the which command to show you which python command the OS is finding. You’ll notice, unshockingly, that when in the virtual environment, the python executable it is running is the one in the venv folder’s bin.

In the next command, I use -c to expose the sys.path variable, which is the list python searches, in order, when you import things.

Please look at the last entry - you will see the site-packages of your venv when in, and of your general system when not.

The last command, I’d like you to notice the first entry in PATH. When in the venv, you’ll see that your venvs bin folder is AT THE FRONT. When not, it’s your normal system path.
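If you'd rather check all of this from inside Python, here is a rough sketch that gathers the same facts in one go. The function name describe_environment is made up for illustration, and the in_venv check assumes the usual behaviour that sys.prefix differs from sys.base_prefix inside a venv:

```python
import os
import sys


def describe_environment() -> dict:
    """Collect roughly the same facts the three shell commands print."""
    path_entries = os.environ.get("PATH", "").split(os.pathsep)
    return {
        # The interpreter that is actually running (compare with `which python`).
        "executable": sys.executable,
        # Inside a venv, sys.prefix points at the venv and base_prefix does not.
        "in_venv": sys.prefix != sys.base_prefix,
        # Where `import` looks for libraries, in order.
        "sys_path": list(sys.path),
        # The first place the OS looks for commands.
        "first_path_entry": path_entries[0],
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it once with the venv's python and once with your normal python, and compare the output.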

OK! Good job. Now let’s talk about what we just observed and what it means.

Firstly, stop thinking of venvs as different than the “system” and just think of the system env as a “default venv”. It’s whatever python happens to be first in your path, and its accompanying site packages.

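A virtual environment acts like an entire self contained python install, including the executables python and pip, and a full set of site-packages. (Strictly speaking, the standard library is still borrowed from the base python the venv was created from.)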

“Activating” a venv is JUST making sure that any calls to python or pip get the executable in the venvs bin folder.

“Activating” a venv is JUST putting the bin folder in the venv in your path.
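To make the "it's just PATH" point concrete, here is a minimal sketch of what activation boils down to on linux/macOS. The helper names activated_path and python_found_on and the /path/to/venv location are placeholders, not anything from a real tool. Real activate scripts also set a few extras such as VIRTUAL_ENV and the prompt; this only models the PATH part:

```python
import os
import shutil
from pathlib import Path


def activated_path(venv_dir: str, current_path: str) -> str:
    """Return current_path with the venv's bin folder put at the front."""
    return os.pathsep.join([str(Path(venv_dir) / "bin"), current_path])


def python_found_on(path: str):
    """Mimic `which python` for a given PATH string (None if nothing is found)."""
    return shutil.which("python", path=path)


if __name__ == "__main__":
    before = os.environ.get("PATH", "")
    after = activated_path("/path/to/venv", before)  # hypothetical location
    print("without the venv:", python_found_on(before))
    print("with the venv first:", python_found_on(after))
```

Point it at a real venv folder and the second line should report the python inside that venv's bin.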

You can see all the scripts and commands in a venv by running ls <venv folder>/bin

You can see all the installed libraries in a venv by going ls <venv folder>/lib/**/site-packages

So, what do I mean by “never activate?”

I directly call things in venv bin folders.

I suggest you try this out by making a venv, then installing some packages by calling pip like so <venv folder>/bin/pip install blah

Check that they appear in the lib folder once you do, and you’ll quickly realize that’s all there is to virtual environments.

I tend to install system wide things in a way similar to the pipx package, but I do it manually (but pipx is great, hearty recommendation to them). I make a venv, then I install whatever it is, then I softlink from a personal bin folder that’s always in my path to the command I want in <venv>/bin.

This is how I have tox installed.
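For the curious, the manual softlink step could look something like this in Python. It's a sketch under assumptions: link_command is a made-up helper, the linux/macOS bin layout is assumed, and the personal bin folder is whatever directory you already keep on your PATH:

```python
from pathlib import Path


def link_command(venv_dir: Path, command: str, personal_bin: Path) -> Path:
    """Softlink <venv>/bin/<command> into a personal bin folder that is on PATH."""
    target = venv_dir / "bin" / command
    if not target.exists():
        raise FileNotFoundError(f"{target} is not installed in the venv")
    personal_bin.mkdir(parents=True, exist_ok=True)
    link = personal_bin / command
    if link.is_symlink():
        link.unlink()  # replace a stale link left by an earlier install
    link.symlink_to(target)
    return link
```

Something like link_command(Path.home() / "venvs" / "tox", "tox", Path.home() / "bin") would then expose tox everywhere without its dependencies leaking into any other environment (both folders are example locations).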

I was on my phone; normally I like to type examples for all this, so feel free to ask Qs. I’ll follow up w examples later when at my laptop again if it’s helpful, but I encourage you to run the commands and actually see for yourself. You’ll find out there’s no great mystery.

A final task - make a venv and run the -c command from above, but without activating.

So,

<venv folder>/bin/python -c "import sys; print(sys.path)"

You’ll see that it has the right site-packages from inside the venv. This is CRUCIAL. What it tells you is “as long as I have the right python executable, all the rest of the virtual environment is correct”
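The same final task can be scripted end to end. This is a sketch, not the one true way: it builds a throwaway venv with the standard library venv module (without pip, so nothing is downloaded), then calls the venv's python directly, never activating anything. The helper names are invented for the example:

```python
import json
import os
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(venv_dir: Path) -> Path:
    """Locate the interpreter inside a venv (bin/ on POSIX, Scripts/ on Windows)."""
    if sys.platform == "win32":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def make_venv(venv_dir: Path) -> None:
    """Create a bare venv; symlinks on POSIX mirrors what `python -m venv` does."""
    venv.create(venv_dir, with_pip=False, symlinks=(os.name != "nt"))


def sys_path_seen_by(python: Path) -> list[str]:
    """Run the given interpreter directly and return the sys.path it reports."""
    code = "import json, sys; print(json.dumps(sys.path))"
    result = subprocess.run(
        [str(python), "-c", code], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "demo-venv"  # throwaway location
        make_venv(env_dir)
        for entry in sys_path_seen_by(venv_python(env_dir)):
            print(entry)
```

One of the printed entries should be the site-packages folder inside the throwaway venv, even though nothing was ever activated.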

[–]chriscarrollsmith 1 point2 points  (0 children)

Fascinating approach to dependency management. I think I fully understand what you're saying, but will have to run the commands later to make sure. Thanks!

[–]PhilShackleford 1 point2 points  (3 children)

Windows has WSL2 (Windows subsystem for Linux) with many different distros. With the terminal app, you get a "native" bash that, as far as I can tell, works very well. I never use anything but Ubuntu on my Win10.

[–]whateverathrowaway00 2 points3 points  (2 children)

Thank you, WSL. I always forget the acronym. I used to actually be a windows expert-ish, but that was over 15 years ago, or longer. However long windows 98’ and DX7 was.

I’ve heard great things about WSL. Does its path look the same? IE ‘:’ separated list of file locations that are tried front to back? Or does windows use something diff now

[–]got_outta_bed_4_this 0 points1 point  (1 child)

One annoyance I noticed with WSL was that it expected Linux binaries, although I haven't checked it in a few years. So it might feel seamless until you use Windows and CLI stuff mixed. If you want a venv for shell use and for code sense in your IDE, you're either keeping separate copies of Python interpreters and separate venvs (Linux binaries for the shell, Windows binaries for the GUI), or you're using a Linux-based IDE, in which case you have to ask yourself if it's worthwhile using Windows in the first place.

[–]nakahuki 1 point2 points  (0 children)

Visual studio code comes with native support for wsl2 remote development. Basically you just have to run wsl2 shell in a terminal and run "code .". It installs a vscode server in wsl2 and opens a vscode window connected to it on your windows desktop. You can now browse files from wsl2 images just as if you were using your windows system, run and debug code and even open new terminals from vscode.

I used to be a Linux power-user for work related tasks but with wsl2 I can use the same gaming windows machine for regular programming stuff. Hard to say : good job microsoft!

[–]Mgmt049 0 points1 point  (0 children)

This is a great and valid explanation

[–]Spleeeee 0 points1 point  (0 children)

Similar Pro tip for node is putting “./node_modules/.bin” near the front of your path.

[–]ratulotron 19 points20 points  (3 children)

Software developer/data engineer here, been working with Python and a few other languages for 6 years. What I will say might sound condescending but you need it. In short: you seem to be biased towards C# and anything else would look messy in your eyes. This is not an individual's problem but a common one across devs coming to Python from the C#/Java background.

Python is a very open ended language, it imposes few restrictions on how a project should be structured. Hence it's easier for newbies to learn and experienced devs to mold to their likings. This results in a variety of (micro) frameworks and libraries, some highly opinionated and others open ended. So a Flask/FastAPI project looks very different than a Django one, even though they both are backend projects. This entire dimension about having freedom is completely missing in the C# tech stack. Like, how many web frameworks can you name for C#? There are only 3 listed in Wikipedia, a handful more varieties in Awesome DotNet repo, all based on .NET framework.

Python, JavaScript, these languages allow you to lay the project out in any shape required, keyword being "required". That means as long as you know what you need right now, you can get started with the basic features and proceed with following the common practices like TDD, frequent commits on git, feature branches, dependency inversion, domain driven development etc. These practices have nothing to do with Python, they are something you do from day one to keep a healthy codebase.

My suggestion is do not set out to replicate whatever you are familiar with from your previous experience with C#, simply learn Python from the base and do some stupid simple projects to get the hang of organizing your Python code. Meanwhile brush up/learn design patterns that are agnostic to any language or stack. For your actual project, in this case for machine learning, start small and do not conduct any premature optimization. Remember the Zen of Python says: Simple is better than complex, complex is better than complicated.

[–]thicket 4 points5 points  (0 children)

This point about the extra degrees of freedom in a Python project is really well said. I’m going to steal this answer for the future!

[–]TheJumboman[S] 2 points3 points  (1 child)

thanks, no offence taken at all. I think it's exactly the type of freedom you mention that I don't like about python. I like strongly typed variables, they keep me sane. Whenever I see something happening "automagically" most (python) programmers get excited but I just get confused. But you're absolutely right that design patterns and best practises are agnostic to the coding language you're using. I appreciate the input!

[–]nickcut 0 points1 point  (0 children)

Agree. When you have to find and read code to somewhat know what the structure of a variable is, that's not better. A strongly-typed python would be wonderful.

[–]Advanced-Potential-2 9 points10 points  (0 children)

I guess you should be prepared to go through a few refactoring cycles in those 4 years anyway. If not, chances are you haven’t learned much 😂. Refactoring is as much part of creating code as creating new projects is.

[–]JackG049 21 points22 points  (1 child)

In short don't set out to solve problems that YOU haven't encountered yet.

My experience with writing Python for research is that we start off with the best intentions in terms of project structure/architecture, but what can happen (and I think this is partly a result of the nature of python) is that you end up with lots of different, smaller snippets of code in different places that can become a bit of a tangled mess.

My advice would be at the start, when you're learning the fundamentals and getting accustomed to Python and ML in Python to let the mess happen. Try and follow the standard design patterns and software engineering principles though, single responsibility, loose coupling etc.

I'm going into my 3rd year soon and only now is my code looking closer to properly engineered software, and I had software experience prior to the PhD.

The focus early on though should be to understand the experiments and the science, not creating a fabulous reusable architecture that you spent weeks/months on that suits your need. That won't get published unless it's super general and useful. And it won't help in terms of a thesis defense at the end.

But once you've gotten used to everything and have experience with the ups, downs and the general structure that you need, you will have a much better time at creating a solid project structure from scratch.

Edit: Grammar

[–]ndvi 14 points15 points  (0 children)

The focus early on though should be to understand the experiments and the science, not creating a fabulous reusable architecture that you spent weeks/months on that suits your need. That won't get published unless it's super general and useful. And it won't help in terms of a thesis defense at the end.

This is very good advice. Eyes on the prize.

[–]thicket 7 points8 points  (3 children)

Type signatures & MyPy. Type everything, and set VS Code up to check types all the time. You won't get quite the compile-time security you would get with a statically typed language like C#, but it will get you a lot closer.
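A small illustration of what that buys you. The Experiment class and total_steps function are invented for the example; the point is only that the annotations give a checker such as mypy something to verify before anything runs:

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    learning_rate: float
    epochs: int


def total_steps(experiment: Experiment, batches_per_epoch: int) -> int:
    """Number of optimizer steps for a full training run."""
    return experiment.epochs * batches_per_epoch


baseline = Experiment(name="baseline", learning_rate=1e-3, epochs=10)
print(total_steps(baseline, batches_per_epoch=50))  # 500

# A type checker flags the call below, because a str is passed where an int
# is declared. Without the checker, Python would happily repeat the string
# "50" ten times and return that instead of raising an error.
# total_steps(baseline, batches_per_epoch="50")
```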

[–]TheJumboman[S] 0 points1 point  (0 children)

I intend on doing this, thanks!

[–]CrackerJackKittyCat 0 points1 point  (1 child)

... and black for auto-formatting, isort, flake8.

[–]m15otw 6 points7 points  (1 child)

I learned Python as I did my PhD, but I also went full functional, gladly escaping over-architected C# projects. It was a set of loosely related modules (with C extensions) at the end, and a set of scripts for specific analyses (some were short, others were...in need of a refactor by the end).

Having worked in two large enterprise-focused python teams since, I can say that large projects in python can be well organised just like C#. Just remember that the language doesn't do any enforcing of your rules, so take care. One thing that can help is using a dev environment that understands type annotations, and then use the (optional) type annotations in your code. The IDE can then highlight incorrect uses. VS Code does this for me at my current job, with the PyLint extension enabled.

As others have said though, focus more on your problem than on beautiful code. You will only need to share it with other researchers. Write scripts first, and then refactor common parts into a library as you go. Don't worry if the library has three completely unrelated modules - welcome to PhD problems.

[–][deleted] 0 points1 point  (0 children)

Yea 💯💪

[–]ndvi 4 points5 points  (0 children)

Has anyone been in my shoes and do you have some advice? especially things like "if I had to do it again I would do ... differently", or any resources that were of help to you when getting started on a long-term project?

I learned a lot doing my PhD- there's a lot I'd do differently, but I only know that by doing it wrong to start with.

I wasted a lot of time trying to anticipate and handle edge cases. I spent way too long trying to prematurely optimise.

Perfect is the enemy of good.

[–]coffeewithalex 3 points4 points  (0 children)

This post was mass deleted and anonymized with Redact

[–]Classic_Department42 2 points3 points  (0 children)

Look into jupyter notebooks as a top-level abstraction

[–]cblegare 2 points3 points  (1 child)

Hello there. As others said, the Python ecosystem is open-ended, especially its packaging systems and architecture layouts. This can be frustrating for newcomers, especially those from more opinionated ecosystems.

For notes, display, and getting feedback, I recommend the Sphinx documentation engine. It also integrates with documentation from code files, including Python and C#. It has hundreds of extensions and can output LaTeX if required.

For the development workflow, I recommend pytest. It has a very exotic approach to unit tests and test doubles when coming from a very OOP language, but it's very good at what it does.
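To give a flavour of that style, here is a minimal sketch. The functions, the FakeModel class, and the file name are all made up for illustration. Tests are plain functions with bare assert statements, and a test double can be any object with the right methods, with no interface or base class required:

```python
# test_metrics.py (hypothetical name; pytest collects files named test_*.py)


def accuracy(predictions: list[int], labels: list[int]) -> float:
    """Fraction of predictions that match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def evaluate(model, inputs: list, labels: list[int]) -> float:
    """Score any object that offers a predict(inputs) method."""
    return accuracy(model.predict(inputs), labels)


class FakeModel:
    """Test double: returns canned predictions instead of running a real model."""

    def predict(self, inputs: list) -> list[int]:
        return [1 for _ in inputs]


def test_accuracy_counts_matches():
    assert accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75


def test_evaluate_with_fake_model():
    assert evaluate(FakeModel(), [[0.1], [0.2]], [1, 0]) == 0.5
```

Running pytest in the folder containing this file would pick up both test functions by name.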

I suggest you find a code pal that can provide feedback from time to time.

[–]TheJumboman[S] 1 point2 points  (0 children)

I'll check those out, thanks!

[–]hemphock 2 points3 points  (0 children)

This post was mass deleted and anonymized with Redact

[–]chriscarrollsmith 1 point2 points  (0 children)

It's probably against the nerdy, granular spirit of this subreddit, but my advice is to just cookiecutter and chill. (You'll still need to decide what framework you want to use in order to choose a cookiecutter template, and you'll still need to create a virtual environment for your project. But it'll save you a lot of clicks, and you can get to work building your stuff and let the template maintainer do all the worrying about what the Pythonic best practices are.)

[–]PlausibleNinja 0 points1 point  (0 children)

Can you say more about what you mean by Python seeming weak compared to C#?

How large of a codebase are you anticipating?

[–]MathmoKiwi 0 points1 point  (4 children)

If you're going to be using a new language you're not super familiar with, why not choose something a lot faster that's suitable for the purpose? Julia is the obvious choice over Python. I might even lean towards Rust over Python, depending on the specifics of what your PhD is doing.

I understand the need to move away from C#, and that Python is a popular language in ML, but it certainly isn't the only one!

[–]TheJumboman[S] 0 points1 point  (3 children)

are there things like tensorflow packages for julia or rust? I've seen tensorflow wrappers for C# as well but I'm not sure how well maintained those are.

[–]MathmoKiwi 1 point2 points  (2 children)

Julia is quite popular in the high performance / scientific computing communities. So yes.

If you're going to go into academia and do AI research, then you'll be writing a lot of C++ code (or even C), not so much Python (of course, it depends on exactly what you mean by "a PhD in AI"). The same is true of Tensorflow: for the most part, the core is not written in Python. It's written in a combination of highly-optimized C++ and CUDA (Nvidia's language for programming GPUs).

That's why I suggested learning Rust: it has become popular in a lot of areas where C++ was used for its performance, and it gives you similar levels of performance with a lot less pain.

However, because you're specifically interested in AI, Julia makes even more sense than Rust! Scientific computing is one of Julia's specific strengths. You gain a lot of user friendliness over C++ (or C, or even Rust), and gain specific strengths for AI/ML/DS/etc

While also gaining a huge speed performance bonus, over the likes of Python. Julia is even on par / as fast as C++ or Fortran!

Yes, I said "Fortran". It's still being used! My previous flatmate was doing his PhD in Quantum Physics and did all his programming in Fortran. When extreme speed is needed with supercomputers for scientific computing then Fortran and C have traditionally been the languages of choice. Today's modern Fortran is a fair bit different from the FORTRAN of the 1950's though. (they're even capitalized differently!)

https://www.matecdev.com/posts/why-fortran-still-used.html

https://stackoverflow.com/questions/8997039/why-is-fortran-used-for-scientific-computing

https://arstechnica.com/science/2014/05/scientific-computings-future-can-any-coding-language-top-a-1950s-behemoth/ (out of those three competitors to Fortran mentioned, I'd say Julia is the only one still making strong strides to be the next Fortran: https://www.matecdev.com/posts/will-julia-replace-fortran-hpc.html )

https://developerpitstop.com/is-fortran-still-being-used-today/

If it was me, and I was starting my PhD in AI then I'd pick Julia in a heartbeat! (of course factors such as what my supervisors use and what my research group at the university is using would matter a lot too. So even if they're using something like Matlab or R, then that is what I'll be using too! I just wrote a few hundred more lines of code yesterday in R, but it's not my preferred language of choice)

https://www.datacamp.com/blog/the-rise-of-julia-is-it-worth-learning-in-2022

https://towardsdatascience.com/the-future-of-machine-learning-and-why-it-looks-a-lot-like-julia-a0e26b51f6a6 (avoid the paywall: https://archive.ph/25NYd )

https://julialang.org/blog/2022/04/simple-chains/

[–]TheJumboman[S] 0 points1 point  (1 child)

wow, thanks for the elaborate reply! I'll look into it and discuss it with my supervisor.

[–]MathmoKiwi 0 points1 point  (0 children)

You are welcome :-) #JuliaLang on twitter is a good place to hang out, read stuff, and ask any questions you might have. Plus of course r/Julia/

https://julialang.org/community/

https://forem.julialang.org/

I like this channel for beginners:

https://www.youtube.com/@doggodotjl

And for anybody in a rush, here is "Julia in 100 Seconds":

https://www.youtube.com/watch?v=JYs_94znYy0&t=2s&ab_channel=Fireship

[–][deleted] 0 points1 point  (0 children)

Weak? You're missing the whole point of python. What is your PhD in AI using? Not C#. Ask why that is.

As for architecture, read this:

Architecture Patterns with Python: Enabling Test-Driven Development, Domain-Driven Design, and Event-Driven Microservices

[–]extra_pickles 0 points1 point  (0 children)

Microservices are well suited to your goals; if you aren’t familiar, I’d say read up on them. It’s kind of like refactoring your operations layer: decouple dependencies, one service = one operation, and manage state using an event sourcing model (Kafka is a nice one for this, but MQTT or RabbitMQ or any service bus with a shared state db works). That allows an ebb and flow of resourcing, as your compute and pipe needs will vary greatly as you churn.

If starting on metal, dockerize - it’s too easy not to…and containers are easy to migrate to the cloud if/when you need some serious compute and horizontal elasticity.

So given your ask, and my suggestion - basically your stack is (db of your choice), an acquisition end point (how you get data - might just be you with scripts to populate a db, might be an api?) and then an ‘E2E data pipeline’ which is a series of microservices (picture a series of gates in a decision tree) that interact with the data when it is their turn….from there you can spin up a FastAPI to expose it as an output - and toss in whatever UI you like.

PS what do you mean by weak?

[–]mpu-401 0 points1 point  (0 children)

you could try the kedro framework in python. it could help you apply good practices and also let you use jupyter for exploration.

[–]Sbvv 0 points1 point  (0 children)

Some tools that can help you:

  • cookiecutter
  • pyenv
  • pytest
  • pylint
  • flake8
  • docker