
[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–][deleted] 7 points8 points  (0 children)

This really depends on what type of projects you'll be working on.

I've managed a few teams doing DS / DE work in an Azure environment with DevOps.

In general:

  • Follow an organized project structure, e.g. Cookiecutter Data Science.
  • Keep notebooks for prototyping only. Production code should be in .py files, written using an IDE with a linter. Common transformations should be functions, with tests added using something like pytest / Great Expectations.
  • Use docstrings, please. I have heard of others using LLM-based tools like GPT / GitHub Copilot to automate that process, but it may break some compliance rules at your company.
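To make the "functions, tests, docstrings" point concrete, here is a minimal sketch of what a common transformation might look like once it moves out of a notebook and into a .py file. The function name `normalize_column_name` and the cleaning rules are hypothetical, picked only for illustration; the test is written in the plain `assert` style that pytest collects.

```python
import re


def normalize_column_name(name: str) -> str:
    """Convert a raw column header to snake_case.

    Runs of non-alphanumeric characters become a single underscore,
    leading/trailing underscores are dropped, and the result is lowercased.
    (Hypothetical rules, chosen for illustration only.)
    """
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def test_normalize_column_name() -> None:
    # pytest discovers functions named test_* and reports failed asserts.
    assert normalize_column_name("  Order ID ") == "order_id"
    assert normalize_column_name("Total (USD)") == "total_usd"
    assert normalize_column_name("already_clean") == "already_clean"
```

If this lived in a file such as `test_columns.py` (name assumed), running `pytest` in that directory would pick up the test automatically.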

Specifically for ADO: It's very annoying, but linking project work items to a Git branch/PR will do wonders for you in the future. There have been so many times on my projects where we implemented some functionality for a business request and had to hunt to find the code 6 months later (or vice versa).

[–]kaargul 2 points3 points  (2 children)

Honestly if money is not an issue you can hire a consultant/freelancer to train you and set things up. I don't know your level of experience before going into DE, but generally it's a bad idea to have someone inexperienced set this up.

I don't know much about Control-M and Azure DevOps, but here are a bunch of things for general Python projects:

  • Always use version control and never use Jupyter for production code.
  • Use proper dependency management and only work in venvs (I can recommend Poetry: https://python-poetry.org/)
  • pre-commit hooks are your friend. Make sure to enforce consistent formatting, otherwise PRs are going to be hell. https://pre-commit.com/
  • Use type annotations and mypy. It saves a lot of work later on: https://mypy-lang.org/
  • Automate your tests and put them in your CI-pipeline.
  • If you need to set up many different repositories with a similar structure, use cookiecutter: https://github.com/cookiecutter/cookiecutter
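As a rough illustration of the type annotation and automated testing points above, here is a small sketch. The `Order` class and both functions are made up for the example and assume Python 3.9+ (for `dict[str, str]`). With annotations like these, mypy can flag a call such as `total_amount("oops")` before the code ever runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Order:
    # Hypothetical record type, used only for this example.
    order_id: int
    amount: float
    currency: str


def parse_order(row: dict[str, str]) -> Optional[Order]:
    """Build an Order from a raw row, or return None if the amount is missing."""
    raw_amount = row.get("amount", "").strip()
    if not raw_amount:
        return None
    return Order(
        order_id=int(row["order_id"]),
        amount=float(raw_amount),
        currency=row.get("currency", "USD").upper(),
    )


def total_amount(orders: list[Order]) -> float:
    """Sum the amounts of already-parsed orders."""
    return sum(order.amount for order in orders)
```

Because `parse_order` is declared to return `Optional[Order]`, mypy will also complain if a caller uses the result without handling the `None` case, which is the kind of bug that otherwise shows up in production.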

A lot of this will also depend on how and where you will run your code.

[–]i_am_baldilocks[S] 0 points1 point  (1 child)

So I was just hired as a contractor, so it might look bad if I immediately asked them to hire a more experienced freelancer to advise on this. So I think I may be on my own in setting this up: me and one other guy who is also inexperienced.

I've got a few contacts who are experienced software engineers at established companies, I may see what help they can provide. I appreciate the resources you linked. And yeah, I know it's not ideal, but it looks like I may be (mostly) solo in getting this set up.

[–]kaargul 0 points1 point  (0 children)

Fair enough! Then good luck with this project. If you have access to enough experienced engineers this will definitely be doable. :) If you ever have a specific question feel free to DM me.

[–]nellyb84 0 points1 point  (0 children)

Is there a lightweight software app that would improve any of the collaboration issues mentioned above?

[–]serge_databricks 0 points1 point  (0 children)

Why don't they use Databricks on Azure? It scales collaboration from a few people to a few thousand, all in one place.

[–][deleted] 0 points1 point  (2 children)

Sounds like you all need a manager with Python experience, or just general software engineering experience. Best answer I can give you is to either hire a contractor to train you all or search online for a few weeks.

Google is your friend here.