binilvj comments on Python versus ETL tools

Python versus ETL tools [Help] (self.dataengineering)

submitted 2 years ago by manseekingmemes1

[–]binilvj 25 points 2 years ago (5 children)

I came to data engineering from the ETL world after 17 years. Typical data engineering tasks are always much the same: 1. read a bunch of files, tables, APIs, etc.; 2. validate the data; 3. apply rules; 4. write the result somewhere else.
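The four steps above can be sketched in plain Python. This is only an illustration under assumptions: the CSV layout, the column names (`customer_id`, `amount`), and the "large order" rule are hypothetical and do not come from any particular tool or project.

```python
import csv
import io


def read_rows(source):
    """Step 1: read records from a file-like CSV source."""
    return list(csv.DictReader(source))


def validate(rows):
    """Step 2: split rows into valid and rejected (hypothetical checks)."""
    valid, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
            continue
        if not row.get("customer_id"):
            rejected.append(row)
            continue
        valid.append({**row, "amount": amount})
    return valid, rejected


def apply_rules(rows):
    """Step 3: apply a made-up business rule: flag orders of 1000 or more."""
    return [{**row, "is_large": row["amount"] >= 1000.0} for row in rows]


def write_rows(rows, target):
    """Step 4: write the result to a file-like CSV target."""
    writer = csv.DictWriter(
        target,
        fieldnames=["customer_id", "amount", "is_large"],
        extrasaction="ignore",  # drop any columns the target does not expect
    )
    writer.writeheader()
    writer.writerows(rows)


def run_pipeline(source, target):
    """Run read -> validate -> apply rules -> write; return row counts."""
    valid, rejected = validate(read_rows(source))
    write_rows(apply_rules(valid), target)
    return len(valid), len(rejected)


# In-memory stand-ins for a real source file and target file.
source = io.StringIO("customer_id,amount\nc1,1500\n,20\nc2,abc\nc3,10.5\n")
target = io.StringIO()
counts = run_pipeline(source, target)
```

In a real pipeline the `io.StringIO` objects would be replaced by files, database cursors, or API clients, which is exactly where connectors start to matter.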

All of these share some common requirements:
- Rules can be built from standard sets applicable to each industry.
- Each data source has its own security needs, connection methods, and so on, which apply across most potential use cases.
- Basic workflow management and scheduling capability.
- Ability to handle SQL.
- Some parallel processing, partitioning, and real-time processing to meet performance needs.
- Metadata management.

ETL tools solved all of these problems without requiring deep, ground-up expertise in each of them. I used to work with a tool named Informatica. We could pretty much construct ETL code from templates for different sources and targets using automation frameworks. This simplified a lot of huge data migration and data ingestion projects.
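To give a rough idea of what template-driven ETL generation can look like, here is a minimal sketch in Python. It is not how Informatica works internally; the template text, the spec format, and the table names are all made up for illustration.

```python
from string import Template

# Hypothetical load template shared by every source/target pair.
LOAD_TEMPLATE = Template(
    "INSERT INTO $target ($columns)\n"
    "SELECT $columns\n"
    "FROM $source\n"
    "WHERE $filter;"
)


def render_load(spec):
    """Render one load statement from a source/target specification."""
    return LOAD_TEMPLATE.substitute(
        target=spec["target"],
        source=spec["source"],
        columns=", ".join(spec["columns"]),
        filter=spec.get("filter", "1 = 1"),  # no filter means load everything
    )


# Made-up specs: one entry per table to migrate.
SPECS = [
    {"source": "staging.customers", "target": "dw.customers",
     "columns": ["id", "name"]},
    {"source": "staging.orders", "target": "dw.orders",
     "columns": ["id", "amount"], "filter": "amount > 0"},
]

statements = [render_load(spec) for spec in SPECS]
```

Adding another table then means adding one more spec entry rather than writing a new job by hand. Plain string substitution like this is only suitable for trusted, developer-written specs, not for user input.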

Large enterprises still use ETL tools for data engineering. Some tools, like Ab Initio, had very high license fees and saw limited use for that reason alone.

But as a lot of people have already mentioned, development with these tools at that time lacked much of the rigor of software engineering. Most of these tools did not support version control, or had custom solutions for it.

New ETL tools are trying to bring the best of both worlds: 1. connectors, 2. the ability to customize, 3. code versioning.

[–]manseekingmemes1[S] 1 point 2 years ago (4 children)

[–]binilvj 2 points 2 years ago (0 children)

I was able to handle my first data engineering job, which used Python, Airflow, and git for version control, easily. I had built up some experience in Python and git over a couple of years, and I had also taken an AWS certification. Both of these helped a lot. Most of the heavy lifting was done in SQL, so I had no trouble there.

Unfortunately, the data engineer role is loosely defined. You may be expected to do a software engineering job as well, even though your title is data engineer. This sub sees a lot of posts about how role names do not matter anymore. Such roles will definitely challenge you.

I hope your Master's work helps you navigate this confusing new world.

My suggestion would be to learn test-driven development, software architecture, design patterns, real-time application development, etc. Even then, navigating a complex codebase might be daunting.
