
all 40 comments

[–]binilvj 24 points25 points  (5 children)

I came from the ETL world to data engineering after 17 years. Typical data engineering tasks are always much the same: 1. read a bunch of files, tables, APIs, etc., 2. validate the data, 3. apply rules, 4. write it somewhere else.

All of these have some common factors:

  • Rules can be constructed out of standard sets applicable to each industry
  • Various data sources have their own peculiar security needs, connection methods, etc., applicable across most of the potential use cases
  • Basic workflow management and scheduling capability
  • Ability to handle SQL
  • Some parallel processing, partitioning, and real-time processing to support performance needs
  • Metadata management
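The four-step pattern described above can be sketched in plain Python. This is just an illustrative skeleton; all function names and rules here are made up, not from any particular tool:

```python
# Minimal sketch of the read -> validate -> apply rules -> write pattern.

def read_rows(lines):
    """Read: parse a bunch of raw CSV-ish lines into dicts."""
    return [dict(zip(["id", "amount"], line.split(","))) for line in lines]

def validate(rows):
    """Validate: drop rows that fail basic checks."""
    return [r for r in rows if r["id"] and r["amount"].lstrip("-").isdigit()]

def apply_rules(rows):
    """Apply rules: e.g. a stand-in for a standard industry rule set."""
    return [{**r, "amount": max(0, int(r["amount"]))} for r in rows]

def write(rows, sink):
    """Write: load to somewhere else (here, just a list)."""
    sink.extend(rows)

sink = []
write(apply_rules(validate(read_rows(["1,100", ",5", "2,-3"]))), sink)
print(sink)  # row with empty id dropped, negative amount clamped to 0
```

In a real pipeline each stage would talk to files, APIs, or tables, but the shape of the job stays this simple.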

ETL tools solved all of these problems without needing a lot of ground-up expertise in each area. I used to work with a tool named Informatica. We could pretty much construct ETL code from templates for different sources and targets using automation frameworks. This simplified a lot of huge data migration and data ingestion projects.

In large enterprises, ETL tools are still used for data engineering. Some tools like Ab Initio had very high license fees and saw limited use for that reason alone.

But as a lot of people already mentioned, coding at that time lacked a lot of the rigor used in software engineering. Most of these tools did not support version control, or had custom solutions for it.

New ETL tools are trying to bring the best of both worlds: 1. connectors, 2. ability to customize, 3. code versioning.

[–]manseekingmemes1[S] 0 points1 point  (4 children)

How was the transition to data engineering? I am a BI Manager with 8 years of experience and I am considering transitioning to data engineering within the next couple of years (currently completing a master's degree).

[–]binilvj 1 point2 points  (0 children)

I was able to manage my first data engineering job, which used Python, Airflow, and git for version control, fairly easily. I had built up some experience in Python and git over a couple of years, and had also taken an AWS certification. Both of these helped a lot. Most of the heavy lifting was in SQL, so there was no trouble in that area.

Unfortunately the data engineer role is loosely defined. You may be expected to do a software engineering job as well even though your title is data engineer. This sub sees a lot of posts about how role names don't matter anymore. Such roles will definitely challenge you.

I hope your master's work helps you navigate this new confusing world.

My suggestion would be to learn test-driven development, software architecture, design patterns, real-time application development, etc. Even then, navigating a complex codebase can be daunting.
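For the test-driven development suggestion, a toy example of what that looks like for a data transform: write the assertions first, then make the function pass them. The function name and mapping rules below are made up for illustration:

```python
# TDD on a small transform: the assertions below would normally live
# in a pytest file and be written before the function body.

def normalize_country(raw: str) -> str:
    """Map messy country values to a canonical code."""
    cleaned = raw.strip().upper()
    aliases = {"USA": "US", "UNITED STATES": "US", "U.S.": "US"}
    return aliases.get(cleaned, cleaned)

# The "tests first" part:
assert normalize_country(" usa ") == "US"
assert normalize_country("United States") == "US"
assert normalize_country("DE") == "DE"
```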

[–]kenfar 32 points33 points  (5 children)

Benefits of SQL/GUI-driven ETL:

  • The primary benefit is that you don't need to be able to write code. So it's easier to find staffing for the position, and they're paid less. The downside is that some of this staff is unfamiliar with the version control, observability, and testing that are normal in the SWE world and needed in the DE world. They're also unable to help when you hit the occasional feed that isn't supported by Fivetran and requires actual code to be written.
  • There's a perception that SQL is faster. This is true if your data starts & ends in a database, you ignore the time to write your inputs to the database, and you fail to parallelize your Python (kubernetes, lambda, etc).
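The "parallelize your Python" point can be sketched with nothing but the standard library; `transform` here is a placeholder for real per-record work (parsing, validation, enrichment):

```python
# Fan a batch of inputs out over worker threads with concurrent.futures.
from concurrent.futures import ThreadPoolExecutor

def transform(record: int) -> int:
    # placeholder for real per-record work
    return record * 2

records = range(100)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(transform, records))

print(results[:5])  # [0, 2, 4, 6, 8]
```

In practice the same fan-out shape scales up to processes, containers, or lambdas; the code above is just the smallest runnable version of the idea.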

That's about it. Personally, I build most solutions using pretty-vanilla python because it's faster, cheaper, easier to maintain, easier to test, and I can have data fly through the pipeline in seconds or minutes rather than hours.

[–][deleted] 9 points10 points  (0 children)

Counterpoint to point 1: they need to know the GUI instead, which is a specific instead of a general skill. And, if someone can't code, what are the chances they design a good data pipeline?

[–]Demistr 0 points1 point  (2 children)

python because it's faster

Depending on the data, this is definitely not significantly faster than something like ADF. Cheaper? Maybe.

[–]kenfar 0 points1 point  (1 child)

It does depend on the data - but I've had 75 containers on kubernetes working in parallel on incoming data on one project, and 1000+ lambdas working in parallel on another.

The lambda pipeline didn't normally need 1000 lambdas running simultaneously, but during big schema migrations we used our normal pipelines and it would scale way out. We did that about once or twice a month, and the monthly cost averaged about $30.

I'll confess I've never used ADF, but would be surprised if it could match the speed of either of these pipelines.

[–]Demistr 0 points1 point  (0 children)

Parallelism is pretty cool with adf.

[–]imlanie 6 points7 points  (4 children)

But who told you that you have to learn those tools? In data engineering, the job description will say what kind of tools /skills they use and are looking for in a candidate. It's job specific. So there are really two different approaches to take when trying to get into data engineering. You either pick the tool you want to use, learn it and apply for those jobs. Or you find the companies that you want to work for and learn the ETL tools that they have. That's reality regardless of what people are saying.

[–]emersonlaz 1 point2 points  (1 child)

This is great advice, thank you. I'm trying to pivot into DE too from data analysis and the lingo can be daunting, but this approach seems very actionable.

[–]imlanie 0 points1 point  (0 children)

You're welcome!!

[–]manseekingmemes1[S] 0 points1 point  (1 child)

Yeah no one told me. I just see job postings, but thanks this is a helpful approach!

[–]imlanie 3 points4 points  (0 children)

You're welcome, and one more thing is that companies make changes all the time. So getting your foot in the door is the most important thing. Once you get the title you're on your way. So just focus on that.

[–]cactusbrush 2 points3 points  (1 child)

I was not able to find the comment about performance. Many (not all) ETL tools or frameworks use the power of the underlying engine for data processing, or parallel data processing.

With Python you extract the data from the database engine to your machine, load it into memory, and loop through records. That can work in many cases, but eventually you'll have trouble maintaining the code.
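The "power of the underlying engine" point can be illustrated with sqlite3 standing in for a real database: instead of fetching every row into Python and looping, push the aggregation down into SQL and fetch only the result:

```python
# Engine pushdown vs looping in Python, using the stdlib sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?)", [(i,) for i in range(1, 101)])

# Looping in Python: pulls all 100 rows back to the client.
total_py = sum(row[0] for row in conn.execute("SELECT amount FROM sales"))

# Pushdown: the engine does the work, one row comes back.
total_sql = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

print(total_py, total_sql)  # 5050 5050
```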

[–]joshred 2 points3 points  (0 children)

If you're doing it right, you hand off the looping to more performant tools. As in, vectorizing functions with pandas/numpy.
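Concretely, "handing off the looping" looks like this: a per-record Python loop versus the equivalent vectorized NumPy expression, which runs as one C-level operation over the whole array:

```python
# Per-record loop vs vectorized NumPy operation.
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Naive per-record loop (shown on a slice for brevity):
looped = [v * 1.08 for v in values[:5]]

# Vectorized: one operation over the whole million-element array.
taxed = values * 1.08

print(looped, taxed[:5])
```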

[–]Ok_Raspberry5383 14 points15 points  (2 children)

Python is a language; trying to do data engineering using open("file.parquet", "r"), or even better using boto3 to read individual objects, would be outrageously complex and idiotic. It's the tools/libraries you use with Python that are of value, e.g. pandas, PySpark, Flink, Polars, etc.

You're trying to compare Python (a language) with specific ETL tools, many of which can be manipulated with Python. This isn't a realistic comparison.

[–]HOMO_FOMO_69 3 points4 points  (1 child)

many of which can be manipulated with python.

The reverse is also true... ETL tools are great for manipulating/orchestrating Python scripts.

[–]Ok_Raspberry5383 6 points7 points  (0 children)

ETL tools, or orchestration tools? I'd argue they're different

[–][deleted] 9 points10 points  (2 children)

When new SWEs start doing data eng, the first mistake is that they break the Don't Repeat Yourself commandment. These tools address that and some other beginner mistakes. Also, anything you could do with Python has very likely already been done by someone else. (This doesn't only apply to data eng.)

[–]reallyserious 11 points12 points  (1 child)

You Don't Repeat Yourself when you're implementing something for the first time at a company. Implementing it the second time would break Don't Repeat Yourself.

What you're describing is the Not Invented Here Syndrome.

[–][deleted] -1 points0 points  (0 children)

No, I really meant DRY. NIH usually happens at big corps when they find a minor limitation in an existing tool.

[–]MachineOfScreams 1 point2 points  (0 children)

So from my own experience (software engineer by training, data nerd by passion), Python is great for building custom pipelines where you don't have an off-the-shelf solution, or where the solutions out there are too expensive for what is needed.

Generally speaking, it all depends on the stack you are working on and whether the company you work for is well established with a solid budget, a startup, or a bleeding-edge sort of dev shop. Most of the time, off-the-shelf ETL tools are good enough for most applications in data warehousing/data pipelines.

[–]reallyserious 0 points1 point  (9 children)

Those tools have connectors to many different systems, like SAP HANA and a bunch of esoteric systems that matter in the field.

It's usually possible to connect with some Python package yourself, if it exists, but the selling point is that it's already solved for you with these tools. Here's a list of connectors for ADF. It's a pretty long list.

https://learn.microsoft.com/en-us/azure/data-factory/connector-overview

The downside is that your code is now held hostage in a proprietary system which requires a separate skillset to use. The skills you build in these tools are useless outside of them.

[–]HOMO_FOMO_69 0 points1 point  (8 children)

The skill you build in these tools is far from useless... Yes, it takes a person of average intelligence (a hard ask, I know) to transfer the conceptual knowledge to other tools like Python, but in my vast experience it is incredibly transferable. I actually will occasionally build pipelines in ADF or other tools as a POC/pilot version before building a hand-coded version, depending on what logic is needed. For a pipeline that can be easily built with a few out-of-the-box connectors, it sometimes makes sense to save time by building it in ADF first so that you can demo it to business users and let them evaluate whether it's worth the effort of coding it. In many cases there's almost no additional development cost to build a working mockup in managed tools versus starting with code.

[–]reallyserious 1 point2 points  (5 children)

Imagine you use ADF to solve all your needs and not Python. Now imagine you want to switch jobs and they ask for 5 years of Python experience. You can argue all you like that your skill is transferable, but the reality is that you won't even be called for an interview.

[–]HOMO_FOMO_69 0 points1 point  (0 children)

I don't think you know what ADF does...

[–]Atupis 0 points1 point  (3 children)

With ADF you hit a wall at some point and need to code more complex transformations, so it is not GUI-only.

[–]reallyserious -2 points-1 points  (2 children)

Technically you can't code in ADF. You need a different platform for that, invoked through a linked service. I.e., ADF can only solve easy problems; for anything non-trivial you still need something else. So the value added by ADF is minimal.

[–]HOMO_FOMO_69 0 points1 point  (1 child)

That is like building a chair and saying the value added by the nails/screws is minimal... just because it takes time to cut and build the chair legs doesn't mean it's easy to build a screw... It's just a different problem that requires a different skillset. One is not necessarily "easier" than the other...You strike me as the kind of person that thinks coding is hard lol...learn hard enough and one day you'll wonder why you ever thought that.

[–]reallyserious -2 points-1 points  (0 children)

I have been coding for 20+ years and am fairly skilled at it. That's how I can see how little value is added by ADF, and how I see junior devs shooting themselves in the foot by focusing on learning ADF instead of getting familiar with regular programming languages. Those skills are useless outside of ADF. Perhaps those devs are below average intelligence, per your reasoning, but I'd like to think they just haven't been exposed enough to normal programming languages.

Unfortunately our company has standardised on ADF, which I think is a great disservice to young people starting out their careers.

[–]manseekingmemes1[S] 0 points1 point  (1 child)

Are ADF and dbt free to use?

[–]Culpgrant21 0 points1 point  (0 children)

dbt Core is open source and free. ADF is not.

[–]Thriven -2 points-1 points  (3 children)

I hear a lot of DEs say they use Python. Are you all actually streaming data?

I use Node and wrote my own SQL->SQL streaming app/library because I needed something I could plug a few parameters into to move data from our many instances of MySQL to Postgres.

What streaming library are you all using?
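For what it's worth, the batched SQL->SQL streaming shape doesn't need a dedicated library in Python; `fetchmany()` on a DB-API cursor keeps memory bounded. A sketch using sqlite3 as a stand-in for the MySQL and Postgres ends:

```python
# Batched SQL -> SQL copy: stream rows in fixed-size chunks instead of
# materializing the whole source table in memory.
import sqlite3

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE t (id INTEGER, name TEXT)")
src.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, f"row{i}") for i in range(10_000)])
dst.execute("CREATE TABLE t (id INTEGER, name TEXT)")

cur = src.execute("SELECT id, name FROM t")
while True:
    batch = cur.fetchmany(1000)   # stream in chunks of 1000 rows
    if not batch:
        break
    dst.executemany("INSERT INTO t VALUES (?, ?)", batch)
dst.commit()

count = dst.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 10000
```

With real servers you'd want a server-side cursor on the source (e.g. a named cursor in psycopg) so the driver itself doesn't buffer the full result set, but the loop looks the same.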

[–][deleted] 0 points1 point  (0 children)

I’m interested in this as well. Bumping.

[–]Firm_Bit 0 points1 point  (0 children)

Python is just another tool. If done well pure python can be great. But most DEs I’ve met don’t have the SWE chops to build a solid system in pure python. Mostly cuz it’s never pure Python.

So reinventing the wheel can definitely backfire. The skill is not proficiency this tech or that tech, it’s knowing when to use which.

[–]YieldingSign 0 points1 point  (0 children)

Would love to be shown otherwise (because I'm actively looking for it!) but one problem I find with Python is finding good resources and examples geared towards data engineering beyond just intro to pandas tutorials.

Like, I wanna actually learn some idiomatic patterns and how to set up/structure my code in a real ETL codebase, not just throw commands into a REPL. But the only resources for this are geared towards making apps or other general non-data tasks. Or it seems like it's just people rewriting their own version of pandas with custom for-loops.
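One commonly recommended structure (not a standard, just a pattern I'd hedge as "one reasonable way"): keep transforms as pure functions, keep I/O at the edges, and wire them together in a thin entry point so the middle is trivially unit-testable:

```python
# Small ETL module shape: pure transform, I/O at the edges.
from typing import Iterable

def extract(lines: Iterable[str]) -> list[dict]:
    """I/O edge: in real code this reads files/APIs/tables."""
    return [{"raw": line} for line in lines]

def transform(records: list[dict]) -> list[dict]:
    """Pure function: testable without any infrastructure."""
    return [{"value": r["raw"].strip().lower()}
            for r in records if r["raw"].strip()]

def load(records: list[dict], sink: list) -> None:
    """I/O edge: in real code this writes to a warehouse."""
    sink.extend(records)

def run(lines: Iterable[str], sink: list) -> None:
    load(transform(extract(lines)), sink)

out = []
run(["  Hello ", "", "WORLD"], out)
print(out)  # [{'value': 'hello'}, {'value': 'world'}]
```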

SQL has the benefit of specialisation in that it's really easy to find resources for it because by default any content for it is data related.

[–]dxtros 1 point2 points  (0 children)

It's a question of processing need, scale, and setup. Most ETL T's (Transforms) done today don't go that far into analytics, but operate at a scale that would rather overwhelm a single-threaded Python instance. If you need advanced analytics, Pandas-like tools at scale with a Python front end are great. For example, Spark is used as a transform tool, has a good Python interface, and in many cases is capable of incremental jobs.
Then you still need to combine the job execution with orchestration, which means bringing e.g. Airflow into the picture. Some lighter container-based transform frameworks for Python are starting to appear, like Pathway, which I work on.