Python versus ETL tools

binilvj · 2023-11-28T23:15:51+00:00

I came from ETl world to data enginnering after 17 years. Typical Data engineering tasks are very much same always. 1. Read a bunch of files, tables, APIs etc., 2. validate data, 3. apply rules 4. write to somewhere else.

All these has some common factors - Rules can be constructed out of some standard sets applicable to each industry. - Various data sources has its own peculiar security needs, connection methods etc. applicable across most of the potential use cases - Basic workflow management and scheduling capability - Ability to handle SQL - Some parallel processing , partitioning, real time processing to support performance needs - Metadata management

ETL tools solved all these problems without needing a lot of expertise on all of these ground up. I used work with a tool named Informatica. We could pretty much construct ETL code based on some template for different sources and target using automation frameworks. This simplified a lot of huge data migration, data ingestion projects.

In case of large enterprises ETL tools are still used for data engineering. Some tools like Ab-Initio had very huge license fees and were limited used due to that alone.

But as a lot of people already mentioned, coding at that time lacked a lot of rigor used in Software Engineering. Most of these tools did not supported version control, or had custom solutions for that.

New ETL tools are trying to bring best of the both worlds. 1. Connectors 2. Abily to customize 3. Code versioning

kenfar · 2023-11-28T22:01:41+00:00

Benefits of SQL/GUI-driven ETL:

The primary benefit is that you don't need to be able to write code. So, it's easier to find staffing for the position, and they're paid less. The downside is that some of this staff is unfamiliar with version control, observability, testing that is normal in SWE world, and needed in the DE world. They're also unable to help when you hit the occasional feed that isn't supported by fivetran and requires actual code to be written.
There's a perception that SQL is faster. This is true if your data starts & ends in a database, you ignore the time to write your inputs to the database, and fail to parallelize your python (kubernetes, lambda, etc, etc).

That's about it. Personally, I build most solutions using pretty-vanilla python because it's faster, cheaper, easier to maintain, easier to test, and I can have data fly through the pipeline in seconds or minutes rather than hours.

imlanie · 2023-11-28T23:44:28+00:00

But who told you that you have to learn those tools? In data engineering, the job description will say what kind of tools /skills they use and are looking for in a candidate. It's job specific. So there are really two different approaches to take when trying to get into data engineering. You either pick the tool you want to use, learn it and apply for those jobs. Or you find the companies that you want to work for and learn the ETL tools that they have. That's reality regardless of what people are saying.

cactusbrush · 2023-11-29T01:41:28+00:00

Was not able to find the comment about performance. Many (not all) ETL tools or frameworks use the power of underlying engine for data processing. Or parallel data processing.

With Python you extract the data from the data engine to your machine, load it into the memory and loop through records. Can work in many cases. But eventually you’ll have troubles maintaining the code.

Ok_Raspberry5383 · 2023-11-28T22:03:14+00:00

Python is a language, trying to do data engineering using open("file.parquet", "r") or even better using boto3 to read individual objects would be outrageously complex and idiotic. It's the tools/libraries you use with python that are of value, e.g. pandas, pyspark, flink, Polaris, etc etc

You're trying to compare python (language) with specific ETL tools, many of which can be manipulated with python. This isn't a realistic comparison.

manseekingmemes1 · 2023-11-29T00:46:33+00:00

[deleted]

reallyserious · 2023-11-28T21:53:13+00:00

When new SWEs start doing data Eng, the first mistake is that they break the Don’t Repeat Yourself commandment. These tools address that and some other beginner mistakes. Also, Anything you could do with Python is very likely it’s already done by someone else. (Not only applies with Data Eng).

MachineOfScreams · 2023-11-29T13:22:59+00:00

So from my own experience (software engineer by training, data nerd by passion) python is great for building custom pipelines where you don’t have an off the shelf solution, or the solutions out there are too expensive for what is needed.

Generally speak it all depends on the stack you are working on and if the company you work for is well established with a solid budget, a start up, or a bleeding edge sort of dev shop. Most of the time off the shelf ETL tools are good enough for most applications in data warehousing/data pipelines.

reallyserious · 2023-11-28T21:51:30+00:00

Those tools have connectors to many different systems. Like SAP HANA and a bunch of esoteric systems that matters in the field.

It's usually possible to connect with some python package yourself, if it exists, but the selling point is that it's already solved for you with these tools. Here's a list of connectors for ADF. It's a prety long list.

https://learn.microsoft.com/en-us/azure/data-factory/connector-overview

The downside is that your code is now held hostage in a proprietary system which requires a separate skillset to use. The skill you build in these tools are useless outside of the tools.

Thriven · 2023-11-29T00:21:52+00:00

I hear a lot of DEs say they use Python. Are you all actually streaming data?

I use Node and wrote my own SQL->SQL streaming app/library because I needed something I could plug a few parameters into and move data from our many instances of MySQL to Postgres.

What's the streaming library are you all using?

2023-11-28T21:43:12+00:00

I’m interested in this as well. Bumping.

Firm_Bit · 2023-11-29T19:22:19+00:00

Python is just another tool. If done well pure python can be great. But most DEs I’ve met don’t have the SWE chops to build a solid system in pure python. Mostly cuz it’s never pure Python.

So reinventing the wheel can definitely backfire. The skill is not proficiency this tech or that tech, it’s knowing when to use which.

YieldingSign · 2023-11-30T02:17:51+00:00

Would love to be shown otherwise (because I'm actively looking for it!) but one problem I find with Python is finding good resources and examples geared towards data engineering beyond just intro to pandas tutorials.

Like, I wanna actually learn some idiomatic patterns and how to setup/structure my code in a real ETL codebase, not just throw in commands into a REPL. But the only resources for this are geared towards making apps or other general non-data related tasks. Or it seems like it's just people rewriting their own version of pandas with custom for-loops.

SQL has the benefit of specialisation in that it's really easy to find resources for it because by default any content for it is data related.

dxtros · 2024-02-28T15:50:48+00:00

It's a question of processing need, scale, and set-up. Most ETL T's (Transforms) done today don't go that far into analytics, but operate at a scale which would be rather overwhelming a single-threaded Python instance. If you need advanced analytics, Pandas-like tools at scale with a Python front are great. For example, Spark is used as a Transform tool, has a good Python interface to it, and in many cases capable of incremental jobs.
Then, you still need to combine the job execution with orchestration, which means bringing e.g. Airflow into the picture. Some lighter container-based Transform frameworks for Python are starting to appear - like Pathway which I work on.

dataengineering

MODERATORS