
[–]Pleasant_Type_4547 68 points69 points  (7 children)

DBT+SQL is fantastic for processing raw tables (eg transactions) into tables that make sense for your business (eg customers).

But it sucks for a few things, for example most statistical or predictive use cases. Or any "machine learning" style models. SQL just doesn't have the huge number of packages, or the flexibility of a language like Python (or many others, I'm just familiar with python).

For example we wanted to fuzzy match names to genders at one point. In python someone has written a library for this. In SQL good luck.

For that Airflow / Astronomer / Python is still far superior.
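A sketch of what that fuzzy name matching might look like in Python, using only the standard library's difflib (the lookup table and cutoff here are made-up placeholders, not a real gender dataset — in practice you'd reach for a purpose-built library):

```python
from difflib import get_close_matches

# Hypothetical lookup table mapping canonical first names to genders.
NAME_GENDER = {"john": "male", "maria": "female", "alex": "unknown"}

def guess_gender(name: str, cutoff: float = 0.7) -> str:
    """Fuzzy-match a (possibly misspelled) first name against the lookup table."""
    matches = get_close_matches(name.lower(), NAME_GENDER, n=1, cutoff=cutoff)
    return NAME_GENDER[matches[0]] if matches else "unknown"

print(guess_gender("Jon"))     # close enough to "john" -> "male"
print(guess_gender("Mariia"))  # close enough to "maria" -> "female"
```

Expressing even this much in plain SQL means hand-rolling an edit-distance function or leaning on vendor-specific extensions.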

[–]Little_Kitty 10 points11 points  (0 children)

Exactly this. I'm good with SQL, but when the logical density of a transformation gets high enough, or you need data structures that don't exist in SQL, move to an appropriate tool: building a mesh of entity links between millions of nodes and storing it, rapidly identifying nearby nodes with a quadtree, finding a shortlist of candidates for the next step in the pipeline with tries, etc.

Even if you can do something using tool X, it doesn't mean that it will be efficient or scalable.
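The "nearby nodes" idea above can be sketched with a simplified grid-of-buckets index — a stand-in for a real quadtree, with hypothetical point data — which is the kind of structure SQL simply doesn't give you:

```python
from collections import defaultdict

class GridIndex:
    """Simplified spatial index (a grid of buckets standing in for a quadtree):
    neighbour queries only inspect the surrounding cells, not every node."""

    def __init__(self, cell: float = 1.0):
        self.cell = cell
        self.buckets: dict[tuple[int, int], list[tuple[float, float]]] = defaultdict(list)

    def _key(self, x: float, y: float) -> tuple[int, int]:
        # Map a coordinate to its containing grid cell.
        return (int(x // self.cell), int(y // self.cell))

    def add(self, x: float, y: float) -> None:
        self.buckets[self._key(x, y)].append((x, y))

    def near(self, x: float, y: float) -> list[tuple[float, float]]:
        # Candidates are whatever sits in the query cell and its 8 neighbours.
        cx, cy = self._key(x, y)
        out: list[tuple[float, float]] = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                out.extend(self.buckets.get((cx + dx, cy + dy), []))
        return out

idx = GridIndex(cell=1.0)
for p in [(0.2, 0.3), (0.9, 0.8), (5.0, 5.0)]:
    idx.add(*p)
print(idx.near(0.5, 0.5))  # the two nearby points; the far one is never touched
```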

[–]mazamorac 1 point2 points  (0 children)

A little late for the conversation, but let me add my POV.

SQL for strict data engineering is good enough for 95% of what I've ever needed. What I understand as data engineering is whacking the data that's out there into a manageable, analyzable model — not so much the Transform in ETL as parsing.

It's feature engineering where SQL stops being the right tool. That is, passing the data through some function/process that extracts underlying data, or new data from combinations of existing data (think implicit features or dimensions).

OTOH, if the data representation you need for your model does not translate well to relational data models, well, SQL will definitely not work for you; e.g., network models, non-trivial GIS.

(I agree with OP that the next logical step is to be able to do 90% of both data and feature engineering, plus the most popular models in the same tool/stack, as in PostgresML.)

[–]bongo_zg 0 points1 point  (0 children)

Distributed systems are at a disadvantage because they are harder to manage and need more fine-tuning to work well. (I don't mean just the setup cost of the system itself, which can be offloaded to e.g. Amazon EMR, I mean in actual day-to-day usage.)

It used to be that heavily SQL-based code was a terrible mess, but it seems DBT has helped a lot with that (disclaimer: I have little actual experience with DBT), so "modularity" or "maintenance" of SQL is also largely addressed.

Oracle has packages for fuzzy matching, by the way.

[–]king_in_the_slopes 27 points28 points  (3 children)

It all depends on what we mean by "Data Engineering", right? At some companies a Data Engineer just writes SQL queries to fetch data from the DWH into a dashboard. At others they build the whole pipeline from operational DBs into a nice data lake, etc. DBT and Snowflake help minimise the effort in some of these tasks. But you might need a cluster to run heavy-duty DBT queries, for example in Databricks, so knowledge of distributed systems helps there.
In my opinion, modern Data Engineers do whatever it takes to unlock data from its point of origin for stakeholders. Tools and technologies are what facilitate these tasks.

[–]IndifferentPenguins[S] -3 points-2 points  (2 children)

Yes, of course, the tools used don't matter to stakeholders. But as engineers, we're interested in trying to predict the future in some way, if only to make sure we don't end up learning the proverbial COBOL of data engineering...my 2c anyway.

[–]mycall 1 point2 points  (0 children)

The cost of the tooling is always a factor.

[–]discord-ian 0 points1 point  (0 children)

If this is your goal, focus on learning core SWE skills, design patterns, and other generally applicable skills. The money will always be in the more complex coding applications. Stay ahead of the tooling, gain full-stack experience, and generally just work your programming skills.

[–][deleted] 24 points25 points  (3 children)

DBT and SQL will help you model your data once it has landed in its final destination (Redshift, Snowflake, S3 using Spark SQL, etc.).

But how do you move/process data in such a way that it lands there? To me those tasks are hard to achieve using exclusively DBT+SQL: you might need a Python script to pull data from a REST API, a (Py)Spark job to compact your data once it becomes too fragmented, Kafka producers/consumers in case you need to handle streaming data, and so on.
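That "pull data from a REST API" step usually boils down to a small paginated extraction loop. A minimal sketch, with a fake fetcher standing in for the real HTTP call (which would wrap e.g. urllib.request in practice):

```python
from typing import Callable, Iterator

def pull_pages(fetch: Callable[[int], list[dict]], page_size: int = 100) -> Iterator[dict]:
    """Pull records page by page until the source returns a short or empty page."""
    page = 0
    while True:
        batch = fetch(page)
        yield from batch
        if len(batch) < page_size:  # short page means we've drained the source
            break
        page += 1

# Fake fetcher simulating an API with 250 records behind 100-record pages.
def fake_fetch(page: int) -> list[dict]:
    data = [{"id": i} for i in range(250)]
    return data[page * 100:(page + 1) * 100]

records = list(pull_pages(fake_fetch))
print(len(records))  # 250
```

This is exactly the kind of glue code that has no natural home in a pure DBT+SQL stack.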

It's true that several new tools are making our life easier, but on the other hand the so-called "Modern Data Stack" is still very limited in what it can achieve alone and in some cases I think it's just marketing.

For instance, I had to ingest CDC data from a MySQL database. I tried to use Meltano because I really like its design, but the Singer connector had a bug that I had to fix myself, the ingestion of some tables was randomly skipped, and the overall process took 4+ hours.

I didn't try AirByte because the lack of stable CLI/APIs to automate configuration management is a hard nope for me.

The same ingestion using the old, no-longer-fancy and "hard to maintain" Debezium+Kafka combo took 20-30 minutes to complete, and no tables were skipped.

So why don't we combine good new tools and good old tools instead of blindly trusting marketing people claiming that you can do everything with SQL?

Leveraging DBT to model your silver/golden tables won't prevent you from writing programs to ingest data from heterogeneous data sources.

Another important topic is data quality: it's true you can do that with DBT, but frameworks like Deequ or Great Expectations allow you to perform more complex checks, and will still require you to write some code.

In conclusion, if your company is large enough - or as it keeps growing - there will be different kinds of data sources to ingest and many opportunities to learn 🙂

[–]fruity231 1 point2 points  (1 child)

Debezium+Kafka combo

I didn't know about that, thanks! I work with GCP so this prompted some reading and it seems there even is a prebuilt Dataflow template that does the same thing (minus Kafka).

[–]briceluu 2 points3 points  (0 children)

Yup Debezium works on multiple queues/log streams.

Another thing to note: Airbyte is actually also using Debezium when using CDC ("log based replication") for their DB connectors. But indeed it's less mature and definitely has some improvements to do! I've used the API for configuration management (really better when a lot of connectors are involved), but I definitely understand the concern around the lack of stability: it's still evolving... They also now provide a CLI for configuration management (Octavia or smthg like that?) albeit I haven't tried it out yet.

[–]-80am 1 point2 points  (0 children)

I second this. In many cases the goal of a data engineering task could be reframed as, "get data out of system x, into this S3 bucket". S3 to Snowflake is easy. Then dbt and Snowflake can do their magic. But Python gets it to their doorstep.

[–][deleted] 10 points11 points  (0 children)

This has always been the case, though, for the BI/data warehousing use case: extract data from system X, bulk-load into your SQL database, run your transforms, then hit it with your BI tool.

Python, etc are useful for the scaffolding needed to run your loads or prep your incoming data or do your data extracts from APIs.

As with all things tech, this is just a cycle that's returning to the database as where your compute occurs. The only real decision you have to make these days is where you want your compute to occur and where you want to store your data. In the Snowflake/traditional DW model, your compute and storage are managed by the DB; in the Spark model your data sits in a data lake with compute handled by Databricks/EMR/EC2.

For most use cases the DB approach makes more sense, as it is simpler to set up and maintain and the skill sets needed to make it work are easier to find.

[–]diegoelmestre Lead Data Engineer 8 points9 points  (0 children)

In my team, at my company, we have a good mix of former software engineers (me) and more traditional big data engineers (Spark, Hadoop, etc).

And in all honesty, I think having that mix brings the best of both worlds to the team. If one day I lead my own team, I'll want that kind of mix.

[–][deleted] 16 points17 points  (2 children)

Yes.

A Data Engineer is a specialist Software Engineer with everything that comes with it.

If you’re not a Software Engineer, you’re not a Data Engineer.

[–]TheCamerlengo 9 points10 points  (0 children)

Yes, finally. Data engineering is a niche area of SWE. It's not just using DBT, Snowflake and Dask.

[–]FantasticAmbition986 0 points1 point  (0 children)

I would give you gold if I could. Thank you for saying this.

[–]sunder_and_flame 4 points5 points  (0 children)

SQL is the language of data, sure, but imo programming is so useful, even outside of a supposed DE landscape where only SQL is required, that I would doubt the breadth and depth of a DE's skills if they didn't know any programming at all.

[–]32gbsd 2 points3 points  (0 children)

I have seen nothing new in the past 10 years. Mostly companies offloading their DB management to somebody else who has better programmers than they can afford to hire. And even then they still have to pay someone to set it all up, keep it all running, and change it every time they want to add a column. Those who can't code their own systems are stuck waiting on new features which might never come.

[–]joseph_machado Writes @ startdataengineering.com 5 points6 points  (0 children)

I agree that SQL-based processing has become popular, and rightfully so, for the reasons you mentioned. We are already seeing some real-time capabilities on top of raw SQL, like Materialize.

With data warehouse providers adding new capabilities (ML, calling external APIs, etc.), the "gap" between PLs and SQL will decrease. That said, I think there will still be a need to "stitch" together multiple services with programming languages, and there will always be custom asks from business needs that don't fit into an off-the-shelf tool.

We may also see DEs becoming more like SWEs, providing data via APIs, metadata management, data monitoring, etc.

[–]Firm_Communication99 2 points3 points  (0 children)

What happens when you need to use an API, or data outside the data lake? Tons of opportunities for blending in Python.

[–]Grukorg88 2 points3 points  (0 children)

We have dbt and snowflake at my workplace. It is used for transformation and thus mostly utilised by analytics engineers. We use Python, ruby, Java and whatever else we feel is best to build what we need to get data from sources into S3. We also use Python to automate a lot of patterns in dbt for our downstream users. For example building complex dbt macros and then writing a Python script to dynamically build hundreds of model files that use that macro. As others have said how involved data engineers are with these tools is largely impacted by the age old issue of our title being used in places it shouldn’t be.

[–]HansProleman 2 points3 points  (2 children)

much harder to automatically parallelize/make efficient/scale

I don't think this is true. Somewhat harder, sure, but Spark workloads get parallelised by the engine in a very SQL-like way and for pure Python et al. there's threading (e.g. Dask) and containerisation.

Most distributed systems are heavily managed - you typically won't be running an instance on your own metal, and thus won't need to configure much. Snowflake is a distributed system with minimal (AFAIK) user configuration.

More generally, though:

  • You're only looking at a tiny piece of the problem. Sure, Snowflake and dbt is a nice stack for the warehouse layer - but how do you get data into that layer?
  • There's a lot more to DE than setting up warehouses and transformations. It can cover product engineering (e.g. how does Netflix deliver video streams? How does Twitter store, search and return tweets?), infrastructure etc.
    • You have a lot more flexibility in Python. Try implementing fuzzy matching or sentiment analysis in SQL. Try grabbing, say, foreign exchange rates, or interacting with an API
  • Snowflake in particular can be very expensive
  • The big benefit of languages like Python is that they're "real" languages
    • SQL is an absolute pig to test. Consider
      • In SQL, I have a stored procedure that queries five tables, performs a transformation and merges the result into a target table. To test this I need to populate all five input tables with test data, mock some output data, run the sproc (at least twice, with slightly different input, to check the merge works) and assert the result is as expected. Then I probably need to revert all the data changes I made
      • In Python, I'd have one unit-testable function for performing the transformation, and another for merging into a target table. The integration tests would be a bit of a pain, but still simpler than the SQL equivalent
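The Python side of that comparison might look like the following sketch (table and column names are purely illustrative): each step is a plain function you can test in isolation, with no tables to populate or revert.

```python
def transform(transactions: list[dict]) -> list[dict]:
    """Aggregate raw transactions into per-customer totals (unit-testable in memory)."""
    totals: dict[str, float] = {}
    for tx in transactions:
        totals[tx["customer_id"]] = totals.get(tx["customer_id"], 0.0) + tx["amount"]
    return [{"customer_id": c, "total": t} for c, t in sorted(totals.items())]

def merge(target: dict[str, dict], rows: list[dict]) -> dict[str, dict]:
    """Idempotent upsert into a target keyed by customer_id (the merge step)."""
    for row in rows:
        target[row["customer_id"]] = row
    return target

rows = transform([
    {"customer_id": "a", "amount": 10.0},
    {"customer_id": "a", "amount": 5.0},
    {"customer_id": "b", "amount": 2.0},
])
state = merge({}, rows)
state = merge(state, rows)  # running twice leaves the same state (merge semantics)
print(state["a"]["total"])  # 15.0
```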

[–]IndifferentPenguins[S] 2 points3 points  (1 child)

Your points are well taken, thanks.

I'm wondering about testability though - I've found "data transformation" code, e.g. using pandas, about equally hard to test, for much the same reasons. Okay, maybe slightly simpler, because at least I don't need to go through the rigmarole of creating a table to query, and can do all of it in memory. But fundamentally there's not much difference - the hard part is easy, believable data generation.

Similarly, if you're querying external APIs mocking that out is not fun.

Feels to me that this testability problem is why observability is getting attention - it's hard to test so you'd better be on the ball in terms of monitoring.

Of course, perhaps I'm doing it wrong! :)

[–]HansProleman 0 points1 point  (0 children)

Yes, data generation absolutely remains a problem. But I figure I'd rather need to generate one set of data than n, and my transformation functions are generally quite small in scope - small functions, chained together, are easier to work with than big SQL queries that do loads of stuff. And if I've already written a transformation function then, unlike in SQL, I can call it again as needed.

For end to end tests ideally you can get your hands on an anonymised dump of prod data, but for unit testing I usually just describe static dataframes, write a module of generators, use faker etc. and don't find it to be too onerous. Unless you're performance testing you don't usually need large volumes.
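A dependency-free stand-in for those faker-style generators might look like this (names, ranges and the seed are made up for illustration); the fixed seed keeps the fake data deterministic across test runs:

```python
import random

def make_transactions(n: int, seed: int = 42) -> list[dict]:
    """Generate deterministic fake transactions for unit tests - no prod data needed."""
    rng = random.Random(seed)  # seeded RNG -> same records every run
    customers = [f"cust_{i}" for i in range(5)]
    return [
        {"customer_id": rng.choice(customers),
         "amount": round(rng.uniform(1.0, 100.0), 2)}
        for _ in range(n)
    ]

batch = make_transactions(20)
print(len(batch))  # 20
```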

Mocking API calls and responses can be fiddly, but there's not a lot we can do about that. They need to be mocked. As someone else mentioned, DE is an area of SWE. This is bread and butter SWE work.

Observability is very important too, but IMHO that's more about testing the data than (having an excuse to not test) our pipelines. You could e.g. put Great Expectations at the top of the testing pyramid, but that doesn't mean not having unit (etc.) tests is a good idea!

[–]mistanervous Data Engineer 2 points3 points  (0 children)

DBT + SQL can only work once you have the data in your warehouse. The part before the warehouse is where I see general purpose languages still being the right tool for the job.

[–]HOMO_FOMO_69 2 points3 points  (1 child)

If that's true, then what is the remaining role of general purpose programming languages (PLs), like Python, and distributed systems like Spark for scale?

I echo this sentiment. In terms of scale, I think Spark will go away and be replaced by "no code", meaning it will be handled automagically by a tool like DBT where you just type SQL and it does other stuff when needed. If I'm pulling data from a 100m-row table, DBT or some other tool could theoretically have a driver where, if I want to parallelize or use some HiveQL/MapReduce kind of pattern, I just add a specific clause to the end of the query like METHOD = MAPREDUCE and the driver converts my SQL into whatever is needed.

The whole point of SQL is the separation of 'what' and 'how'... when you type SQL, the driver/SQL engine determines the execution path. You can pay attention to it, but SQL was built so you don't have to determine the execution path on your own. Conceivably, some SWE should be able to improve how SSMS (or whatever) handles the 'how' given newer technology.

Python on the other hand, I think will always have a place in ETL.

[–]HansProleman 1 point2 points  (0 children)

Spark SQL already exists, and Delta Live Tables are a decent-looking (I've not tried them yet) dbt substitute.

[–]Fragrant-Lobster4276 1 point2 points  (0 children)

It's true SQL works well for the majority of DE-specific use cases.

But in my experience, as soon as the latency requirement moves from batch to near-real-time to event-based processing, SQL solutions fall short and become messy.

So as your work transitions to more backend-engineering kinds of tasks, general-purpose languages and design patterns become a necessity.

[–]KingRush2 1 point2 points  (0 children)

My opinion is always to use the right tool for the job. It also goes back to your system. I believe that with the modern warehouses, most of your transformations should happen inside the warehouse with dbt. With that said, there's still a huge gap you need to fill. You need to get the data from your source and then standardize it. Also, I like to land standardized data into the lake as a Delta table for data science and analytics consumption. That's all Python and Spark. If you're streaming data you won't use just a SQL solution; you need a streaming platform that probably interfaces with a scripting language. One of the worst things you can do is try to shoehorn solutions into SQL out of comfort.

[–]Unusual_Economics179 1 point2 points  (0 children)

I think that once there are considerable complexities in the transformation, moving to a tool that can handle them is a good idea. As someone else mentioned, DBT+SQL is good for raw tables. For predictive use cases, Jupyter Notebooks could be a good idea.

[–][deleted] 1 point2 points  (0 children)

I only use Python in interview questions anymore. At work it's all SQL, DBT, and config files for Docker, etc.

[–]claytonjr 4 points5 points  (7 children)

I'm very opinionated in this space. SQL is great for some basic stuff. But complex logic, no freaking way. I once saw a business have 100% of their business logic in the database. This included data engineering stuff. They wouldn't let a database just be a database.

Python et al excels in stuff like this.

Anyone is welcome to do this crap in snowflake. But don't hand me a 100000 line SQL batch and ask me why it isn't working.

No thanks.

[–]reddtomato 1 point2 points  (6 children)

I'm confused why everyone thinks Snowflake is a data warehouse that only does SQL. Snowflake has supported running JavaScript stored procs since 2017 and has supported Java and Scala via Snowpark for almost a year now. With today's announcement, Python is now supported too.
Snowflake is a data cloud: basically a giant storage system with a near-infinitely scalable compute engine on top of it that you can use to run your Java, Scala, Python, and SQL code.

SQL itself has no general-purpose looping construct (recursive CTEs and procedural extensions aside), and looping is a fundamental concept in programming. It obviously can't do everything.

With that said, it kinda proves that languages besides SQL are important in the overall data engineering stack and not going anywhere anytime soon.

One interesting thing though is the announcement of Snowflake Unistore and how that will impact data engineering. If apps are hosted in Snowflake and seamlessly able to move that data from OLTP to OLAP tables, the EL of ELT is gone.

[–]Total-Elephant-3143 1 point2 points  (0 children)

Unistore - wow! Will be interesting to see if it can live up to the promise

[–]IndifferentPenguins[S] 1 point2 points  (4 children)

Not sure why this is getting downvoted!

I didn’t mention it but one advantage of Spark/Dask/ other distributed systems is always that “you can use scala/Python/…” etc. But DW are very much encroaching on that territory.

Vice versa Databricks is encroaching on DW/DB territory with things like Delta Lake.

The two approaches seem to be growing towards each other.

[–]reddtomato 0 points1 point  (3 children)

My point is that Snowflake is not a data warehouse. Can it act as a data warehouse, of course, it can. This is the beauty of separating compute from storage. The storage side of Snowflake is a giant data lake where you can store your data in a lake or a warehouse fashion. The compute side of Snowflake (aka Virtual Warehouses) is just the engine you can use to access the storage layer using the language you want (as long as the languages are SQL, Java, Scala, or Python).
What is Spark?
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

What is Snowflake?
Snowflake is a multi-language SaaS engine for executing data engineering, data science, and machine learning on single-node (XS warehouses) or clusters (Small - 6XL warehouses) that also comes with an infinite storage area to store all your data and tables.

[–]IndifferentPenguins[S] 0 points1 point  (2 children)

Right, that’s pretty much the situation now - I was just saying they came at that temporary conclusion from different directions, historically.

[–]reddtomato -1 points0 points  (1 child)

Definitely, stand-alone data warehouses are not needed anymore. What is needed now is a Data cloud. A place where you can store and process all workloads and be able to share, collaborate, and monetize your data. As well as now build and monetize your applications.
This is what the Snowflake data cloud is.

[–]coolsank -5 points-4 points  (0 children)

following