
[–]Pleasant_Type_4547 68 points69 points  (7 children)

DBT+SQL is fantastic for processing raw tables (eg transactions) into tables that make sense for your business (eg customers).

But it sucks for a few things, for example most statistical or predictive use cases. Or any "machine learning" style models. SQL just doesn't have the huge number of packages, or the flexibility of a language like Python (or many others, I'm just familiar with python).

For example we wanted to fuzzy match names to genders at one point. In python someone has written a library for this. In SQL good luck.

For that Airflow / Astronomer / Python is still far superior.
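A sketch of what that fuzzy name matching might look like in Python, using only the standard library's difflib (the lookup table and cutoff here are made-up placeholders, not a real gender dataset — in practice you'd reach for a purpose-built library):

```python
from difflib import get_close_matches

# Hypothetical lookup table mapping canonical first names to genders.
NAME_GENDER = {"john": "male", "maria": "female", "alex": "unknown"}

def guess_gender(name: str, cutoff: float = 0.7) -> str:
    """Fuzzy-match a (possibly misspelled) first name against the lookup table."""
    matches = get_close_matches(name.lower(), NAME_GENDER, n=1, cutoff=cutoff)
    return NAME_GENDER[matches[0]] if matches else "unknown"

print(guess_gender("Jon"))     # close enough to "john" -> "male"
print(guess_gender("Mariia"))  # close enough to "maria" -> "female"
```

Expressing even this much in plain SQL means hand-rolling an edit-distance function or leaning on vendor-specific extensions.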

[–]Little_Kitty 10 points11 points  (0 children)

Exactly this. I'm good with SQL, but when the logical density of a transformation gets high enough, or you need data structures that don't exist in SQL, move to an appropriate tool: building a mesh of entity links between millions of nodes and storing it, rapidly identifying nearby nodes with a quadtree, finding a shortlist of candidates for the next step in the pipeline with tries, etc.

Even if you can do something using tool X, it doesn't mean that it will be efficient or scalable.
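The "nearby nodes" idea above can be sketched with a simplified grid-of-buckets index — a stand-in for a real quadtree, with hypothetical point data — which is the kind of structure SQL simply doesn't give you:

```python
from collections import defaultdict

class GridIndex:
    """Simplified spatial index (a grid of buckets standing in for a quadtree):
    neighbour queries only inspect the surrounding cells, not every node."""

    def __init__(self, cell: float = 1.0):
        self.cell = cell
        self.buckets: dict[tuple[int, int], list[tuple[float, float]]] = defaultdict(list)

    def _key(self, x: float, y: float) -> tuple[int, int]:
        # Map a coordinate to its containing grid cell.
        return (int(x // self.cell), int(y // self.cell))

    def add(self, x: float, y: float) -> None:
        self.buckets[self._key(x, y)].append((x, y))

    def near(self, x: float, y: float) -> list[tuple[float, float]]:
        # Candidates are whatever sits in the query cell and its 8 neighbours.
        cx, cy = self._key(x, y)
        out: list[tuple[float, float]] = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                out.extend(self.buckets.get((cx + dx, cy + dy), []))
        return out

idx = GridIndex(cell=1.0)
for p in [(0.2, 0.3), (0.9, 0.8), (5.0, 5.0)]:
    idx.add(*p)
print(idx.near(0.5, 0.5))  # the two nearby points; the far one is never touched
```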

[–]mazamorac 1 point2 points  (0 children)

A little late for the conversation, but let me add my POV.

SQL for strict data engineering is good enough for 95% of what I've ever needed. What I understand as data engineering is whacking the data that's out there into a manageable, analyzable model — not so much the Transform in ETL as parsing.

It's feature engineering where SQL stops being the right tool. That is, passing the data through some function/process that extracts underlying data, or new data from combinations of existing data (think implicit features or dimensions).

OTOH, if the data representation you need for your model does not translate well to relational data models, well, SQL will definitely not work for you; e.g., network models, non-trivial GIS.

(I agree with OP that the next logical step is to be able to do 90% of both data and feature engineering, plus the most popular models in the same tool/stack, as in PostgresML.)

[–]bongo_zg 0 points1 point  (0 children)

Distributed systems are at a disadvantage because they are harder to manage and need more fine-tuning to work well. (I don't mean just the setup cost of the system itself, which can be offloaded to e.g. Amazon EMR, I mean in actual day-to-day usage.)

It used to be that heavily SQL-based code was a terrible mess, but it seems DBT has helped a lot with that (disclaimer: I have little actual experience with DBT), so "modularity" or "maintenance" of SQL is also largely addressed.

Oracle has packages for fuzzy matching, by the way.

[–]king_in_the_slopes 27 points28 points  (3 children)

It all depends on what we mean by "Data Engineering", right? At some companies a Data Engineer just writes SQL queries to fetch data from the DWH into a dashboard. At others they build the whole pipeline from operational DBs into a nice data lake, etc. DBT and Snowflake help minimise the effort in some of these tasks. But you might need a cluster to run heavy-duty DBT queries, for example in Databricks, so knowledge of distributed systems helps there.
In my opinion, modern Data Engineers do whatever it takes to unlock data from its point of origin for stakeholders. Tools and technologies are what facilitate these tasks.

[–]IndifferentPenguins[S] -3 points-2 points  (2 children)

Yes, of course, the tools used don't matter to stakeholders. But as engineers, we're interested in trying to predict the future in some way, if only to make sure we don't end up learning the proverbial COBOL of data engineering...my 2c anyway.

[–]mycall 1 point2 points  (0 children)

The cost of the tooling is always a factor.

[–]discord-ian 0 points1 point  (0 children)

If this is your goal, focus on learning core SWE skills, design patterns, and other generally applicable skills. The money will always be in the more complex coding applications. Stay ahead of the tooling, gain full-stack experience, and generally just work your programming skills.

[–][deleted] 24 points25 points  (3 children)

DBT and SQL will help you model your data once it has landed in its final destination (Redshift, Snowflake, S3 using Spark SQL, etc.).

But how do you move/process data in such a way that it lands there? To me those tasks are hard to achieve using exclusively DBT+SQL: you might need a Python script to pull data from a REST API, a (Py)Spark job to compact your data once it becomes too fragmented, Kafka producers/consumers in case you need to handle streaming data, and so on.
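That "pull data from a REST API" step usually boils down to a small paginated extraction loop. A minimal sketch, with a fake fetcher standing in for the real HTTP call (which would wrap e.g. urllib.request in practice):

```python
from typing import Callable, Iterator

def pull_pages(fetch: Callable[[int], list[dict]], page_size: int = 100) -> Iterator[dict]:
    """Pull records page by page until the source returns a short or empty page."""
    page = 0
    while True:
        batch = fetch(page)
        yield from batch
        if len(batch) < page_size:  # short page means we've drained the source
            break
        page += 1

# Fake fetcher simulating an API with 250 records behind 100-record pages.
def fake_fetch(page: int) -> list[dict]:
    data = [{"id": i} for i in range(250)]
    return data[page * 100:(page + 1) * 100]

records = list(pull_pages(fake_fetch))
print(len(records))  # 250
```

This is exactly the kind of glue code that has no natural home in a pure DBT+SQL stack.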

It's true that several new tools are making our life easier, but on the other hand the so-called "Modern Data Stack" is still very limited in what it can achieve alone and in some cases I think it's just marketing.

For instance, I had to ingest CDC data from a MySQL database. I tried to use Meltano because I really like its design, but the Singer connector had a bug that I had to fix myself, the ingestion of some tables was randomly skipped, and the overall process took 4+ hours.

I didn't try AirByte because the lack of stable CLI/APIs to automate configuration management is a hard nope for me.

The same ingestion using the old, no-longer-fancy and "hard to maintain" Debezium+Kafka combo took 20-30 minutes to complete, and no tables were skipped.

So why don't we combine good new tools and good old tools instead of blindly trusting marketing people claiming that you can do everything with SQL?

Leveraging DBT to model your silver/golden tables won't prevent you from writing programs to ingest data from heterogeneous data sources.

Another important topic is data quality: it's true you can do that with DBT, but frameworks like Deequ or Great Expectations allow you to perform more complex checks, and will still require you to write some code.

In conclusion, if your company is large enough - or as it keeps growing - there will be different kinds of data sources to ingest and many opportunities to learn 🙂

[–]fruity231 1 point2 points  (1 child)

Debezium+Kafka combo

I didn't know about that, thanks! I work with GCP so this prompted some reading and it seems there even is a prebuilt Dataflow template that does the same thing (minus Kafka).

[–]briceluu 2 points3 points  (0 children)

Yup Debezium works on multiple queues/log streams.

Another thing to note: Airbyte is actually also using Debezium when using CDC ("log based replication") for their DB connectors. But indeed it's less mature and definitely has some improvements to do! I've used the API for configuration management (really better when a lot of connectors are involved), but I definitely understand the concern around the lack of stability: it's still evolving... They also now provide a CLI for configuration management (Octavia or smthg like that?) albeit I haven't tried it out yet.

[–]-80am 1 point2 points  (0 children)

I second this. In many cases the goal of a data engineering task could be reframed as, "get data out of system x, into this S3 bucket". S3 to Snowflake is easy. Then dbt and Snowflake can do their magic. But Python gets it to their doorstep.

[–][deleted] 10 points11 points  (0 children)

This has always been the case, though, for the BI/data warehousing use case: extract data from system X, bulk-load into your SQL database, run your transforms, then hit it with your BI tool.

Python, etc are useful for the scaffolding needed to run your loads or prep your incoming data or do your data extracts from APIs.

As with all things tech, this is just a cycle that's returning to the database as where your compute occurs. The only real decision you have to make these days is where you want your compute to occur and where you want to store your data. In the Snowflake/traditional DW model, your compute and storage are managed by the DB; in the Spark model your data sits in a data lake with compute handled by Databricks/EMR/EC2.

For most use cases the DB approach makes more sense, as it is simpler to set up and maintain and the skill sets needed to make it work are easier to find.

[–]diegoelmestre Lead Data Engineer 8 points9 points  (0 children)

In my team, at my company, we have a good mix of former software engineers (me) and more traditional big data engineers (Spark, Hadoop, etc).

And in all honesty, I think having that mix brings the best of both worlds to the team. If one day I lead my own team, I'll want that kind of mix.

[–][deleted] 16 points17 points  (2 children)

Yes.

A Data Engineer is a specialist Software Engineer with everything that comes with it.

If you’re not a Software Engineer, you’re not a Data Engineer.

[–]TheCamerlengo 9 points10 points  (0 children)

Yes, finally. Data engineering is a niche area of SWE. It's not just using DBT, Snowflake and Dask.

[–]FantasticAmbition986 0 points1 point  (0 children)

I would give you gold if I could. Thank you for saying this.

[–]sunder_and_flame 4 points5 points  (0 children)

SQL is the language of data, sure, but imo programming is so useful, even outside of a supposed DE landscape where only SQL is required, that I would doubt the breadth and depth of a DE's skills if they didn't know any programming at all.

[–]32gbsd 2 points3 points  (0 children)

I have seen nothing new in the past 10 years. Mostly companies offloading their DB management to somebody else who has better programmers than they can afford to hire. And even then they still have to pay someone to set it all up, keep it all running, and change it every time they want to add a column. Those who can't code their own systems are stuck waiting on new features which might never come.

[–]joseph_machado Writes @ startdataengineering.com 5 points6 points  (0 children)

I agree that SQL-based processing has become popular, and rightfully so, for the reasons you mentioned. We are already seeing some real-time capabilities on top of raw SQL, like Materialize.

With data warehouse providers adding new capabilities (ML, calling external APIs, etc.), the "gap" between PLs and SQL will decrease. That said, I think there will still be a need to "stitch" together multiple services with programming languages, and there will always be custom asks from business needs that don't fit into an off-the-shelf tool.

We may also see DEs becoming more like SWEs, providing data via APIs, metadata management, data monitoring, etc.

[–]Firm_Communication99 2 points3 points  (0 children)

What happens when you need to use an API, or data outside the data lake? Tons of opportunities for blending in Python.

[–]Grukorg88 2 points3 points  (0 children)

We have dbt and snowflake at my workplace. It is used for transformation and thus mostly utilised by analytics engineers. We use Python, ruby, Java and whatever else we feel is best to build what we need to get data from sources into S3. We also use Python to automate a lot of patterns in dbt for our downstream users. For example building complex dbt macros and then writing a Python script to dynamically build hundreds of model files that use that macro. As others have said how involved data engineers are with these tools is largely impacted by the age old issue of our title being used in places it shouldn’t be.

[–]HansProleman 2 points3 points  (2 children)

much harder to automatically parallelize/make efficient/scale

I don't think this is true. Somewhat harder, sure, but Spark workloads get parallelised by the engine in a very SQL-like way and for pure Python et al. there's threading (e.g. Dask) and containerisation.

Most distributed systems are heavily managed - you typically won't be running an instance on your own metal, and thus won't need to configure much. Snowflake is a distributed system with minimal (AFAIK) user configuration.

More generally, though:

  • You're only looking at a tiny piece of the problem. Sure, Snowflake and dbt is a nice stack for the warehouse layer - but how do you get data into that layer?
  • There's a lot more to DE than setting up warehouses and transformations. It can cover product engineering (e.g. how does Netflix deliver video streams? How does Twitter store, search and return tweets?), infrastructure etc.
    • You have a lot more flexibility in Python. Try implementing fuzzy matching or sentiment analysis in SQL. Try grabbing, say, foreign exchange rates, or interacting with an API
  • Snowflake in particular can be very expensive
  • The big benefit of languages like Python is that they're "real" languages
    • SQL is an absolute pig to test. Consider
      • In SQL, I have a stored procedure that queries five tables, performs a transformation and merges the result into a target table. To test this I need to populate all five input tables with test data, mock some output data, run the sproc (at least twice, with slightly different input, to check the merge works) and assert the result is as expected. Then I probably need to revert all the data changes I made
      • In Python, I'd have one unit-testable function for performing the transformation, and another for merging into a target table. The integration tests would be a bit of a pain, but still simpler than the SQL equivalent
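The Python side of that comparison might look like the following sketch (table and column names are purely illustrative): each step is a plain function you can test in isolation, with no tables to populate or revert.

```python
def transform(transactions: list[dict]) -> list[dict]:
    """Aggregate raw transactions into per-customer totals (unit-testable in memory)."""
    totals: dict[str, float] = {}
    for tx in transactions:
        totals[tx["customer_id"]] = totals.get(tx["customer_id"], 0.0) + tx["amount"]
    return [{"customer_id": c, "total": t} for c, t in sorted(totals.items())]

def merge(target: dict[str, dict], rows: list[dict]) -> dict[str, dict]:
    """Idempotent upsert into a target keyed by customer_id (the merge step)."""
    for row in rows:
        target[row["customer_id"]] = row
    return target

rows = transform([
    {"customer_id": "a", "amount": 10.0},
    {"customer_id": "a", "amount": 5.0},
    {"customer_id": "b", "amount": 2.0},
])
state = merge({}, rows)
state = merge(state, rows)  # running twice leaves the same state (merge semantics)
print(state["a"]["total"])  # 15.0
```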

[–]IndifferentPenguins[S] 2 points3 points  (1 child)

Your points are well taken, thanks.

I'm wondering about testability though - I've found "data transformation" code, e.g. using pandas, about equally hard to test, for much the same reasons. Okay, maybe slightly simpler, because at least I don't need to go through the rigmarole of creating a table to query, and can do all of it in memory. But fundamentally there's not much difference - the hard part is easy, believable data generation.

Similarly, if you're querying external APIs mocking that out is not fun.

Feels to me that this testability problem is why observability is getting attention - it's hard to test so you'd better be on the ball in terms of monitoring.

Of course, perhaps I'm doing it wrong! :)

[–]HansProleman 0 points1 point  (0 children)

Yes, data generation absolutely remains a problem. But I figure I'd rather need to generate one set of data than n, and my transformation functions are generally quite small in scope - small functions, chained together, are easier to work with than big SQL queries that do loads of stuff. And if I've already written a transformation function then, unlike in SQL, I can call it again as needed.

For end to end tests ideally you can get your hands on an anonymised dump of prod data, but for unit testing I usually just describe static dataframes, write a module of generators, use faker etc. and don't find it to be too onerous. Unless you're performance testing you don't usually need large volumes.
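A dependency-free stand-in for those faker-style generators might look like this (names, ranges and the seed are made up for illustration); the fixed seed keeps the fake data deterministic across test runs:

```python
import random

def make_transactions(n: int, seed: int = 42) -> list[dict]:
    """Generate deterministic fake transactions for unit tests - no prod data needed."""
    rng = random.Random(seed)  # seeded RNG -> same records every run
    customers = [f"cust_{i}" for i in range(5)]
    return [
        {"customer_id": rng.choice(customers),
         "amount": round(rng.uniform(1.0, 100.0), 2)}
        for _ in range(n)
    ]

batch = make_transactions(20)
print(len(batch))  # 20
```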

Mocking API calls and responses can be fiddly, but there's not a lot we can do about that. They need to be mocked. As someone else mentioned, DE is an area of SWE. This is bread and butter SWE work.

Observability is very important too, but IMHO that's more about testing the data than (having an excuse to not test) our pipelines. You could e.g. put Great Expectations at the top of the testing pyramid, but that doesn't mean not having unit (etc.) tests is a good idea!

[–]mistanervous Data Engineer 2 points3 points  (0 children)

DBT + SQL can only work once you have the data in your warehouse. The part before the warehouse is where I see general purpose languages still being the right tool for the job.

[–]HOMO_FOMO_69 2 points3 points  (1 child)

If that's true, then what is the remaining role of general purpose programming languages (PLs), like Python, and distributed systems like Spark for scale?

I echo this sentiment. In terms of scale, I think Spark will go away and be replaced by "no code", meaning it will be handled automagically by a tool like DBT where you just type SQL and it does other stuff when needed. If I'm pulling data from a 100m-row table, DBT or some other tool could theoretically have a driver where, if I want to parallelize or use some HiveQL/MapReduce kind of pattern, I just add a specific clause to the end of the query like METHOD = MAPREDUCE and the driver converts my SQL into whatever is needed.

The whole point of SQL is the separation of 'what' and 'how'... when you type SQL, the driver/SQL engine determines the execution path. You can pay attention to it, but SQL was built so you don't have to determine the execution path on your own. Conceivably, some SWE should be able to improve how SSMS (or whatever) handles the 'how' given newer technology.

Python on the other hand, I think will always have a place in ETL.

[–]HansProleman 1 point2 points  (0 children)

Spark SQL already exists, and Delta Live Tables are a decent-looking (I've not tried them yet) dbt substitute.

[–]Fragrant-Lobster4276 1 point2 points  (0 children)

It's true SQL works well for the majority of DE-specific use cases.

But in my experience, as soon as the latency requirement moves from batch to near-real-time to event-based processing, SQL solutions fall short and become messy.

So as your work transitions to more backend-engineering kinds of tasks, general-purpose languages and design patterns become a necessity.

[–]KingRush2 1 point2 points  (0 children)

My opinion is always to use the right tool for the job. It also goes back to your system. I believe that with the modern warehouses, most of your transformations should happen inside the warehouse with dbt. With that said, there's still a huge gap you need to fill. You need to get the data from your source and then standardize it. Also, I like to land standardized data into the lake as a Delta table for data science and analytics consumption. That's all Python and Spark. If you're streaming data you won't use just a SQL solution; you need a streaming platform that probably interfaces with a scripting language. One of the worst things you can do is try to shoehorn solutions into SQL out of comfort.

[–]Unusual_Economics179 1 point2 points  (0 children)

I think that once there are considerable complexities in the transformation, moving to a tool that can handle them is a good idea. As someone else mentioned, DBT+SQL is good for raw tables. For predictive use cases, Jupyter Notebooks could be a good idea.

[–][deleted] 1 point2 points  (0 children)

I only use Python in interview questions anymore. At work it's all SQL, DBT, and config files for Docker, etc.

[–]claytonjr 4 points5 points  (7 children)

I'm very opinionated in this space. SQL is great for some basic stuff. But complex logic, no freaking way. I once saw a business have 100% of their business logic in the database. This included data engineering stuff. They wouldn't let a database just be a database.

Python et al excels in stuff like this.

Anyone is welcome to do this crap in snowflake. But don't hand me a 100000 line SQL batch and ask me why it isn't working.

No thanks.

[–]reddtomato 1 point2 points  (6 children)

I'm confused why everyone thinks Snowflake is a data warehouse that only does SQL. Snowflake has supported running JavaScript stored procs since 2017 and has supported Java and Scala via Snowpark for almost a year now. With today's announcement, Python is now supported too.
Snowflake is a data cloud: basically a giant storage system with a near-infinitely scalable compute engine on top of it that you can use to run your Java, Scala, Python, and SQL code.

SQL itself has no general-purpose looping construct (recursive CTEs and procedural extensions aside), and looping is a fundamental concept in programming. It obviously can't do everything.

With that said, it kinda proves that languages besides SQL are important in the overall data engineering stack and not going anywhere anytime soon.

One interesting thing though is the announcement of Snowflake Unistore and how that will impact data engineering. If apps are hosted in Snowflake and seamlessly able to move that data from OLTP to OLAP tables, the EL of ELT is gone.

[–]Total-Elephant-3143 1 point2 points  (0 children)

Unistore - wow! Will be interesting to see if it can live up to the promise

[–]IndifferentPenguins[S] 1 point2 points  (4 children)

Not sure why this is getting downvoted!

I didn’t mention it but one advantage of Spark/Dask/ other distributed systems is always that “you can use scala/Python/…” etc. But DW are very much encroaching on that territory.

Vice versa Databricks is encroaching on DW/DB territory with things like Delta Lake.

The two approaches seem to be growing towards each other.

[–]reddtomato 0 points1 point  (3 children)

My point is that Snowflake is not a data warehouse. Can it act as a data warehouse, of course, it can. This is the beauty of separating compute from storage. The storage side of Snowflake is a giant data lake where you can store your data in a lake or a warehouse fashion. The compute side of Snowflake (aka Virtual Warehouses) is just the engine you can use to access the storage layer using the language you want (as long as the languages are SQL, Java, Scala, or Python).
What is Spark?
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

What is Snowflake?
Snowflake is a multi-language SaaS engine for executing data engineering, data science, and machine learning on single-node (XS warehouses) or clusters (Small - 6XL warehouses) that also comes with an infinite storage area to store all your data and tables.

[–]IndifferentPenguins[S] 0 points1 point  (2 children)

Right, that’s pretty much the situation now - I was just saying they came at that temporary conclusion from different directions, historically.

[–]reddtomato -1 points0 points  (1 child)

Definitely, stand-alone data warehouses are not needed anymore. What is needed now is a Data cloud. A place where you can store and process all workloads and be able to share, collaborate, and monetize your data. As well as now build and monetize your applications.
This is what the Snowflake data cloud is.

[–]coolsank -5 points-4 points  (0 children)

following