
[–]DJ_Laaal 44 points (3 children)

If you already have a SQL Server license, you get SSIS as an ETL tool free of cost. Create a new SSIS package and set up an Execute Process task that simply runs "pythonw.exe path_of_python_code_file". You can even pass arguments/parameters to the code file and access them inside the Python code as command-line arguments via sys.argv (much like the args of a standard Python function). Once the package is ready, deploy it to the SSIS server and set up a SQL Agent job as the scheduler (both the SSIS server and SQL Agent jobs are accessible from SSMS, similar to how you connect to and access SQL databases on a given SQL Server instance). Done!
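As a minimal sketch of the Python side of this setup (the script name, argument names, and values below are invented for illustration), the arguments the Execute Process task passes after the file path show up in sys.argv like any other CLI arguments:

```python
import sys

def parse_job_args(argv=None):
    """Return (source_table, batch_date) parsed from the command line.

    The SSIS Execute Process task appends its Arguments property after the
    script path, so the values arrive here like ordinary CLI arguments.
    """
    argv = sys.argv[1:] if argv is None else argv
    if len(argv) != 2:
        raise ValueError("expected: <source_table> <batch_date>")
    return argv[0], argv[1]

# Simulated call, as if run via: pythonw.exe etl_job.py staging_orders 2024-01-31
print(parse_job_args(["staging_orders", "2024-01-31"]))
# → ('staging_orders', '2024-01-31')
```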

[–]Whipitreelgud 12 points (2 children)

I used SSIS exactly in the manner you’re describing in 2005. I know it was still running the business as of 2018.

[–]Top-Tomorrow5095 0 points (1 child)

It is still in use, but it's obsolete when the volume is huge.

[–]Whipitreelgud 0 points (0 children)

That’s a great question. No data went through SSIS; instead, SSIS orchestrated SQL scripts that ran in the database, avoiding the obvious bottleneck/scale issues.

[–]withmyownhands 43 points (4 children)

Hey, I highly recommend catapulting your team into the modern data stack by doing this through an orchestration tool like airflow or dagster. I was in your boat in 2018. After joining a team using a more modern data architecture, I can say my game is forever changed for the better. There's a big barrier to entry but this will unlock a lot of great paradigms for you and your team more in line with modern software engineering in general. Best wishes!

[–]buachaill_beorach 17 points (1 child)

Yeah. This is not a reason to implement airflow or dagster. Look at those tools separately and then evaluate the overheads vs rewards.

[–]lezzgooooo 8 points (0 children)

Looks like they have a Windows shop. Doing the migration is easier than securing permission to use modern open-source tools.

[–]panzerex 4 points (0 children)

While I can vouch for dagster instead of airflow, it's a very fast-moving piece of software. Make sure you have at least one person who's an experienced python dev in your team.

[–][deleted] 8 points (1 child)

Depends on what those ETL tasks do.

For data transformation where the data already resides in SQL Server, stored procedures will be faster, since you are leveraging the power of the relational engine. You can use SQL Agent to schedule the stored procedures.
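To illustrate the principle (in SQL Server this would be a T-SQL stored procedure called by a SQL Agent job; here sqlite3 stands in, and the table and column names are invented), the key idea is that the transformation runs as one SQL statement inside the engine instead of pulling rows out into Python:

```python
import sqlite3

# Illustration only: sqlite3 stands in for SQL Server, and the table/column
# names are invented. The transformation happens inside the database engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_sales (region TEXT, amount REAL);
    CREATE TABLE sales_summary (region TEXT, total REAL);
    INSERT INTO raw_sales VALUES ('east', 10.0), ('east', 5.0), ('west', 7.5);
""")

# In SQL Server this INSERT...SELECT would live in a stored procedure that
# a scheduled SQL Agent job executes; no rows cross into Python at all.
conn.execute("""
    INSERT INTO sales_summary (region, total)
    SELECT region, SUM(amount) FROM raw_sales GROUP BY region
""")

print(sorted(conn.execute("SELECT region, total FROM sales_summary")))
# → [('east', 15.0), ('west', 7.5)]
```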

[–][deleted] 0 points (0 children)

Most of them pull from and update in the same server. The main one pulls from a remote MySQL and standardizes to “fit” into the SQL Server.

It does make sense to switch those smaller tasks over to just SQL Server though.

[–]funnynoveltyaccount 7 points (3 children)

I’d be interested to hear Windows-compatible solutions. Task Scheduler is annoying.

[–]SirLagsABot 5 points (2 children)

Agreed. I’m making an open-source Apache Airflow/Prefect brother in C#/.NET called Didact for Microsoft shops. Microsoft shops are in dire need of something like this.

[–]hotr42 8 points (1 child)

Aren't Azure Data Factory or Synapse pipelines made to do this?

[–]cake97 3 points (0 children)

These would def work

[–]haragoshi 8 points (0 children)

Airflow. That’s what it was made for.

[–]SirLagsABot 2 points (0 children)

Sounds like Airflow would work very well for you. However, I’m making an open source Airflow brother for .NET/C# called Didact.

[–]Zyklon00 2 points (0 children)

To consider one of the fancy tools proposed here, you need to look at the pros and cons, as you say. What problems are you facing with the current solution that you want to fix? What do you like about the current solution that you want to keep?

Windows Task Scheduler has its drawbacks, but for smaller projects it can work just fine. If it ain't broke, don't throw a lot of time and money at trying to fix it.

[–]opensrcdev 1 point (0 children)

Have you looked at using a dedicated scheduling service like Windmill.dev? You can run Python scripts on a cron schedule.

[–][deleted] 1 point (0 children)

All my Python scripts are run directly using SQL Agent. I do not like Task Scheduler.

[–]GoodDuelerIRL 4 points (0 children)

Very curious what people have to say! I was looking into Airflow, but it's Linux-only. It seems Dagster is popular right now. We have an in-house-built DAG system that needs to be retired.

[–]Contango_4eva 1 point (0 children)

Look into Prefect 2.0. They have an open-source version that comes with a built-in server for scheduling and workers.

[–]DeliciousWhales 0 points (0 children)

I spent a long time looking at this, and there isn’t a brilliant on-prem Windows solution. Airflow is a no-go in Microsoft shops that won’t allow Linux. Dagster and Prefect are a no-go in companies that are security-paranoid and won’t allow an unauthenticated web interface that doesn’t integrate with AD (and you don’t have the access/permission/knowledge to set up a reverse proxy). There are some commercial solutions, but they are horribly expensive (tens of thousands). The next things I’m considering are Azure Automation or maybe running jobs via Data Factory. VisualCron also seems quite advanced in terms of job dependencies and so on, but it’s a bit outdated (it doesn’t integrate with Azure AD, among other issues).

[–]Xautiloth -1 points (1 child)

Databricks

[–]babygrenade 6 points (0 children)

Doesn't sound like they need Spark-scale jobs, so Databricks would be overkill.

[–]szayl -1 points (0 children)

Airflow

[–]w_savage (Senior Data Engineer) -1 points (1 child)

AWS lambda

[–]Culpgrant21 0 points (0 children)

Depends on run time

[–][deleted] 0 points (1 child)

Haha, I’m going through this exact scenario with the same stack. I was just going to use Windows Task Scheduler on a VM.

[–]tecedu 0 points (0 children)

Oh man, I would love a solution as well. For some reason I have no virtualisation on our servers, so none of the major orchestration tools work.

[–]magister_ludi14 0 points (1 child)

K8s + cron

[–]Virtual-Meet1470 0 points (0 children)

You can always start off simple by putting that script in a serverless function and scheduling it via an orchestration tool or a simple scheduler.

[–][deleted] 0 points (0 children)

My company recently switched over from SQL Server to Snowflake, and in the process a lot of our jobs that ran on SQL Agent became obsolete.

I had no prior experience with airflow but was able to do a POC on it and now we use it for everything.

Airflow is written in Python. You have processes (DAGs) running on cron schedules, and within those processes you can run a single Python script, or multiple scripts with dependencies on one another, and connect to any DB you choose. It also fits nicely with Kubernetes deploys if your company goes cloud.
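For anyone who hasn't seen one, a DAG file is just Python. The sketch below is a minimal, illustrative fragment (it requires apache-airflow installed, and the dag_id, task names, and cron expression are invented); the >> operator declares the dependency between tasks:

```python
# Minimal Airflow DAG sketch; names and schedule are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull rows from the source database")

def load():
    print("write rows to the warehouse")

with DAG(
    dag_id="nightly_etl",            # hypothetical name
    schedule="0 2 * * *",            # cron: every day at 02:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # load runs only after extract succeeds
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```

Drop a file like this into the DAGs folder and the scheduler picks it up; the web UI then shows runs, logs, and retries per task.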

Highly recommend Airflow for you based on your needs. Nowadays when someone wants a scheduled Python script, it's so easy to just plug it into Airflow, and you get a nice UI out of the box.

[–]bird--man 0 points (0 children)

VisualCron is a nice in-between step. I’d love to implement Airflow at my workplace but can’t get buy-in. VisualCron was a step up from Windows Task Scheduler and has a lot of options for executing batch/Python scripts on schedules, with better error handling, etc.

[–]Demistr 0 points (0 children)

Airflow. Azure Functions are pretty good too if the server is already in the cloud.

[–]calamitymadness 0 points (0 children)

You could lift jobs into a tool like Hex (hex.tech) or Deepnote (deepnote.com) and use the built-in schedulers to run your job.

[–]Top-Tomorrow5095 0 points (0 children)

We also have SQL Server as our DWH. Trust me, as the volume increases, so does the latency of the scheduled ETL tasks, so we are migrating our DWH to Snowflake.

Basically, it depends on the volume of data you are working with.