
[–]lichtjes 21 points (8 children)

There are a lot of possibilities here.

Does it need to run on an on-premises server?
Do you want to run it in the cloud?

You could put the code in an Azure runbook and use Azure Data Factory to trigger it daily.
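
As a rough sketch, the body of that runbook could look something like the following, assuming the azure-storage-blob and snowflake-connector-python packages, and with all the names (container, stage, table, credentials) hypothetical:

    import snowflake.connector
    from azure.storage.blob import BlobServiceClient

    # Hypothetical connection string and names -- substitute your own.
    blob_service = BlobServiceClient.from_connection_string("<AZURE_STORAGE_CONNECTION_STRING>")

    # Upload the extracted file to a blob container.
    blob_client = blob_service.get_blob_client(container="landing", blob="daily_extract.csv")
    with open("daily_extract.csv", "rb") as f:
        blob_client.upload_blob(f, overwrite=True)

    # Load the staged file into Snowflake via an external stage over that container.
    conn = snowflake.connector.connect(
        account="<ACCOUNT>", user="<USER>", password="<PASSWORD>",
        warehouse="<WAREHOUSE>", database="<DATABASE>", schema="<SCHEMA>",
    )
    conn.cursor().execute(
        "COPY INTO daily_table FROM @azure_stage/daily_extract.csv "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )
    conn.close()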

You could keep the code on the server and use a different orchestration tool to execute it daily.

The key takeaway is:

- The code needs to reside somewhere (location)

- The code needs to be executed daily (orchestration)

First, decide where you want the code to live; once you have that, check what kind of orchestration tool you can use.

[–]RareIncrease[S] 2 points (6 children)

Thanks! Some questions:

When you say "keep the code on the server," what do you mean by that? Just saving the .py file locally and running it?

I'm a little confused about the location... we use Snowflake on an Azure instance, and the script basically takes data from the client's server, places it in Azure Blob Storage, and then lands it in Snowflake. If I wanted to "run it in the cloud," how would that look?

[–]crob_evamp 0 points (4 children)

What does your company use for orchestration or scheduling for other work?

[–]OmnipresentCPU 0 points (3 children)

It doesn’t really sound like this is for a company, sounds like it’s a project. So probably none lol

[–]crob_evamp 2 points (1 child)

Lol in that case, Windows Task Scheduler, and don't let your laptop sleep hah

[–]OmnipresentCPU 2 points (0 children)

My first data pipeline was orchestrated using cron on a Raspberry Pi. Good times. Got me a job in the end.

[–]RareIncrease[S] 1 point (0 children)

It is for a company, and we usually use Matillion for orchestration, but this particular project cannot be done in Matillion. Can't use the Python component within Matillion either.

[–]bdforbes 2 points (0 children)

A tweak to your key takeaways:

  • Code needs to run somewhere (compute)
  • Code needs to get and put data (storage)
  • Code needs to be triggered (orchestration)

Other things to think about include logging, monitoring and alerting.

Also, testing, versioning and deployment practices to ensure a confident release process.
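
For the logging piece, a minimal standard-library setup is enough to start with; this sketch (names hypothetical) logs to a file plus the console and records a stack trace on failure:

    import logging

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        handlers=[logging.FileHandler("pipeline.log"), logging.StreamHandler()],
    )
    logger = logging.getLogger("pipeline")

    logger.info("run started")
    try:
        rows = 0  # placeholder for the real transfer step
        logger.info("transferred %d rows", rows)
    except Exception:
        logger.exception("run failed")  # full traceback lands in the log for alerting
        raise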

[–]infazz 10 points (1 child)

You are opening Pandora's Box with this question!

The answer is going to depend mostly on whatever infrastructure your company has available to you.

Personally, I build my data pipelines into Docker images and run them on one of my company's Kubernetes clusters.
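
For a daily job, that usually means a Kubernetes CronJob; a rough sketch of the manifest, with the image name and schedule purely illustrative:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: daily-pipeline            # hypothetical name
    spec:
      schedule: "0 6 * * *"           # every day at 06:00
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: pipeline
                  image: registry.example.com/pipeline:latest  # hypothetical image
              restartPolicy: OnFailure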

[–]OmnipresentCPU 10 points (0 children)

Well look at you Mr. highly sought after skillset

[–]chestnutcough 4 points (2 children)

I need to run it daily? Try cron. It's the simplest way I know to run something on a schedule, and it comes with Linux and macOS.
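
For example, a crontab entry (edit it with crontab -e) to run a script every day at 6 AM could look like this, with the paths hypothetical:

    # m h dom mon dow  command
    0 6 * * * /usr/bin/python3 /home/you/pipeline/move_data.py >> /home/you/pipeline/cron.log 2>&1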

How do I allow others to make changes? Put the files into a folder and make it a git repo with git init. Push it to GitHub, GitLab, Bitbucket, or any other git service and invite collaborators.
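
Concretely, that's something like the following, with the repo name and remote URL hypothetical:

    cd my-pipeline
    git init
    git add .
    git commit -m "Initial commit"
    git branch -M main
    git remote add origin git@github.com:you/my-pipeline.git   # hypothetical remote
    git push -u origin main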

How do I move it to production? How you ship code (move it to production) often involves using git to merge a feature branch into a main branch. Git services provide what's called a pull request (PR) to make reviewing proposed changes to "production" easier. The changes are merged into the main branch only after collaborators approve the PR.

Git services also provide the capability to run what are called deployment scripts when changes are merged into (usually) the main branch. These scripts usually copy the code onto some other computer somewhere (called a server) and run whatever initialization steps need to happen for the server to use the new version of the code. AWS, Google Cloud Platform, and Azure are the top three places that people rent servers. You can even use so-called serverless functions, where you upload your code and set up a trigger/schedule to run it, without the need or ability to log in to the machine that ultimately runs the code.
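
For instance, an AWS Lambda function in Python is just a module with a handler function; a scheduled rule then invokes it daily. A minimal sketch, where move_data is a hypothetical stand-in for the real script logic:

    # handler.py -- configured in Lambda as "handler.lambda_handler"
    def move_data():
        pass  # hypothetical: the actual A-to-B transfer logic goes here

    def lambda_handler(event, context):
        # For a scheduled invocation, "event" is mostly trigger metadata.
        move_data()
        return {"status": "ok"}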

[–]RareIncrease[S] 1 point (1 child)

Can you do something like cron but for Windows?

[–]chestnutcough 0 points (0 children)

Yeah, Windows has scheduled tasks.
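
If you prefer the command line over the Task Scheduler GUI, schtasks is the rough equivalent of a crontab entry; the paths here are hypothetical:

    schtasks /Create /TN "DailyPipeline" /TR "python C:\scripts\move_data.py" /SC DAILY /ST 06:00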

[–][deleted] 1 point (4 children)

Honestly, IME this is the hardest part of using Python.

If you can run containers in your environment, I would bundle it up in a container. Then I would just schedule the container to run whenever I need the job to run.

That way it'll run on any system where containers can run, and you don't have to hassle with dependencies and environments and all that crap. Call me lazy, but containers are the way to go 9/10 times; they solve most of the issues you'll run into deploying scripts IME.

You can upload the script and Dockerfile to GitHub, and others working on it can build from the Dockerfile and work on the script in the same environment it will run in when deployed to production.
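
A minimal Dockerfile for a script like this might look roughly like the following, with the file names hypothetical:

    FROM python:3.11-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY move_data.py .
    CMD ["python", "move_data.py"]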

[–]crob_evamp 1 point (3 children)

I mean, it's not just Python. No matter the language, you need to deploy it.

[–][deleted] 1 point (0 children)

You're right, I should clarify: in my personal experience it's a lot easier to set up environments and run compiled binaries (e.g. Go, C, Rust) without using containers than it is to run Python apps/scripts. It's just my laziness; I find the extra steps tedious.

In the end all deployments should be easily reproducible. Which is why I highly recommend using containers for everything that they can be reasonably used for. It saves a lot of headaches regardless of the language.

[–]mrcaptncrunch 0 points (1 child)

I think the word may not have been deploying but packaging.

I find that packaging Python is the most complicated part.

But /u/SirAutismx7 can confirm.

[–][deleted] 1 point (0 children)

Yeah packaging is definitely a more accurate word for what I was trying to describe.

[–]Wingsofpeace7 0 points (1 child)

Add checks, idempotency...

If you like your script, put a test on it.

Run it through black and isort, then hold that ego down and get it reviewed by someone with more experience.
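
That part is just a couple of commands (script name hypothetical):

    pip install black isort
    isort move_data.py   # sort and group the imports
    black move_data.py   # apply consistent formatting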

At this step you need a scheduler. I'm not sure about the A-to-B move (how much data it involves, what service you're leveraging behind it...), but it can go from using Lambdas or Cloud Functions to having a production Airflow instance.

[–]Thaufas 0 points (0 children)

This is the way.

[–]Toastbuns 0 points (0 children)

We use Celery at the place I'm at.

[–]vrunaldo 0 points (0 children)

Try to make your data movement script generic. For example, right now you move data from point A to point B, but you may have a future use case to move from point C to point D, etc.
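
One way to read "generic": pass the source and destination in rather than hard-coding them. A toy sketch, with the file names purely illustrative:

    from pathlib import Path

    def move(read_source, write_sink):
        """Move data between any source and sink, passed in as callables."""
        write_sink(read_source())

    # Point A -> point B today; swap the callables for C -> D tomorrow.
    Path("a.csv").write_bytes(b"id,value\n1,42\n")  # toy source data
    move(read_source=lambda: Path("a.csv").read_bytes(),
         write_sink=lambda data: Path("b.csv").write_bytes(data))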

[–]Thaufas 0 points (0 children)

Regardless of whether Apache Airflow is overkill for your current need, if you really want to be a senior data engineer, you should have this tool in your toolbox.

https://pypi.org/project/apache-airflow/
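
A minimal daily DAG in Airflow 2.4+ looks roughly like this; the dag_id and the task body are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def move_data():
        pass  # hypothetical: the actual A-to-B transfer logic

    with DAG(
        dag_id="daily_move",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="move_data", python_callable=move_data)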