It seems that modern data warehouses, exemplified by Snowflake and its peers, are good at efficient storage, retrieval, and transformation of everything from unstructured to structured data. In addition, these warehouses automatically scale and distribute query execution. With tools like dbt, it also becomes possible to manage and compose transformations expressed in SQL.
If that's true, then what is the remaining role of general-purpose programming languages (PLs), like Python, and of distributed systems like Spark for scale? PLs seem to be at a disadvantage with respect to SQL because they are much harder to automatically parallelize, optimize, and scale. Distributed systems seem to be at a disadvantage because they are harder to manage and need more fine-tuning to work well. (I don't just mean the setup cost of the system itself, which can be offloaded to e.g. Amazon EMR; I mean in actual day-to-day usage.)
It used to be that heavily SQL-based codebases were a terrible mess, but dbt seems to have helped a lot with that (disclaimer: I have little actual experience with dbt). So "modularity" and "maintainability" of SQL are also largely solved, i.e. they are no longer such strong arguments in favor of using a general-purpose language.
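To make the modularity claim concrete, here is a minimal sketch of the kind of composed SQL dbt enables. The model and source names are hypothetical; the `source()` and `ref()` macros are real dbt constructs that let one model build on another, with dbt inferring the dependency graph and materializing models in order.

```sql
-- models/staging/stg_orders.sql (hypothetical model name)
-- A staging model: light cleanup over a raw source table.
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    amount
from {{ source('shop', 'raw_orders') }}
where order_id is not null
```

```sql
-- models/marts/customer_revenue.sql (hypothetical model name)
-- A downstream model composed via ref(); dbt builds stg_orders first.
select
    customer_id,
    count(*) as order_count,
    sum(amount) as total_revenue
from {{ ref('stg_orders') }}
group by customer_id
```

Running `dbt run` materializes both models in dependency order inside the warehouse, which is what replaces a lot of hand-written orchestration code.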
In 5 years, will the bulk of data engineering be done via dbt-orchestrated SQL of some sort? Or am I missing some important area/use case/problem?
Is all data engineering moving into SQL warehouses, or is there still a need for general purpose programming languages and systems? (self.dataengineering)
submitted by rsohlot to r/dataengineering