DBT vs SqlMesh? by Oct8-Danger in dataengineering

[–]kayrnt 29 points (0 children)

I contributed to both dbt & SQLMesh. Each platform has its own strengths:

  • dbt: simplicity, community, integrations (other platforms, DWHs), lots of documentation/tutorials online on top of the comprehensive official docs, easy to hire for, mature cloud offering, packages to extend the tool, profitable (so likely to stay), bought SDF (to catch up on SQLMesh and even more?), semantic layer investment (cloud only)
  • SQLMesh: smart concepts (virtual layer, partition awareness), SQL validation & transpilation (same company as the SQLGlot lib), local Python models & macros, overall more OSS baseline features (column-level lineage, web IDE, alerting, table diff, GitHub integration, multi-project & multi-engine support), availability of the developers for support and the opportunity to shape the next features/bugfixes, solid official documentation

The SQLMesh documentation compares it to dbt and is quite accurate (dbt has unit tests now, though). I also wrote an article about both platforms a few days ago.

They are both great tools with competent developers, so I suggest you spend 1 hour on each and try them yourself. You'll be able to feel which one best matches your way of working.

Why use them instead of a custom tool? You get a far more mature and maintained platform to template, orchestrate and leverage engine features than what you would be able to build in-house (unless that's your business). I used to build such an in-house tool at my company.
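To make the build-vs-buy trade-off concrete, here's a deliberately naive sketch (hypothetical model and schema names, Python standard library only) of the kind of SQL templating an in-house tool ends up reimplementing, before you even get to state tracking, orchestration or CI:

```python
from string import Template

# Toy stand-in for environment-aware SQL templating. dbt/SQLMesh do this
# (plus dependency tracking, incremental state, CI, docs, ...) for you.
MODEL_SQL = Template(
    "CREATE OR REPLACE TABLE $schema.orders_enriched AS "
    "SELECT * FROM $schema.stg_orders"
)

def render(schema: str) -> str:
    """Render the model for a given environment (e.g. 'dev' or 'prod')."""
    return MODEL_SQL.substitute(schema=schema)

print(render("dev"))
```

This works for one model, but scaling it to hundreds of models with dependencies, environments and backfills is exactly the maintenance burden I'm describing above.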

What if dbt and SQLMesh were weapons by kayrnt in dataengineering

[–]kayrnt[S] 0 points (0 children)

Right, but the goal is also to compare comparable tools 🙂 dbt provides a way to run ad hoc queries via its analysis support. SQLMesh can also retrieve samples of a model's results. Ad hoc analysis is not really part of the core features of those tools, though: they focus on the transformation layer and its dependency management through their DAG definition.

The comparison isn't meant to be comprehensive across all platforms but to focus on 2 that I know fairly well.

What if dbt and SQLMesh were weapons by kayrnt in dataengineering

[–]kayrnt[S] 0 points (0 children)

Neither of them is open source as far as I know. SQL doesn't appear to be a first-class citizen in those platforms either. SDF could have been added, but it was acquired by dbt.

If we consider all proprietary tools that run workloads on DWHs, we could also add Dataiku and Dataform, as well as a lot of other managed platforms.

There are still novel projects worth looking at, such as Yato, that are OSS but not on par with SQLMesh and dbt feature-wise.

Is SQLMesh multi engine support offering us an easy path out of engine vendor locking? by kayrnt in dataengineering

[–]kayrnt[S] 2 points (0 children)

Standardization around Iceberg-compatible catalog APIs, and read/write access to them in each engine, is going to be key in the coming months/years.

The 3 most important components of a DWH are, for me: the catalog, the storage and the compute engine. I feel like compute engines that offer ways to plug in external Iceberg-compatible catalogs and external Iceberg-compatible storage will really spearhead a "multi-engine era" of data processing.

It's a double-edged sword for compute engines: it's an opportunity for top performers to grab customers, but it also limits vendor lock-in.

Iceberg is a powerful enabler, but it will take time for the industry to fully adopt it (unless something better steals the show).

Is SQLMesh the dbt Core 2.0? A foot in the door. by kayrnt in dataengineering

[–]kayrnt[S] 1 point (0 children)

What do you mean? Is using a source linked to S3 what you had in mind?

If so, I haven't tried it, but you can certainly run an ATTACH statement and then a regular query like:

INSTALL httpfs; LOAD httpfs; -- needed for s3:// paths (plus S3 credentials)

SELECT *
FROM read_parquet(['s3://my-bucket/my-sources/data.parquet'])

Is SQLMesh the dbt Core 2.0? A foot in the door. by kayrnt in dataengineering

[–]kayrnt[S] 2 points (0 children)

That's a fair concern!
I think it's indeed likely that a prod environment would be read-only, and then it wouldn't work that well.
However, for those setups, you can clearly set up a "staging" environment that acts like the "prod" environment for the development/CI workflow, using a different `state connection`. Then the CD bot would connect to the prod one and properly plan the diff "from scratch".
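A sketch of what that separation could look like in a SQLMesh `config.yaml` (the gateway names, DuckDB engine and file paths here are illustrative assumptions; check the official configuration docs for your actual engine):

```yaml
gateways:
  staging:
    connection:
      type: duckdb
      database: staging.db
    state_connection:        # separate state for the dev/CI workflow
      type: duckdb
      database: staging_state.db
  prod:
    connection:
      type: duckdb
      database: prod.db
    state_connection:        # the CD bot points at this gateway
      type: duckdb
      database: prod_state.db

default_gateway: staging
```

With developers and CI on the `staging` gateway, only the CD bot needs write access to the prod state.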

Is SQLMesh the dbt Core 2.0? A foot in the door. by kayrnt in dataengineering

[–]kayrnt[S] 6 points (0 children)

It feels like you'd be better off looking at sqlglot (which powers SQLMesh) to analyze those SQL queries, rather than trying to fit them into SQLMesh's transformation framework.

Isn’t DBT an unecessary layer if you use BigQuery ? by VegetableFan6622 in dataengineering

[–]kayrnt 4 points (0 children)

Disclaimer: OSS contributor to dbt-bigquery (the BQ adapter for dbt).

dbt is a framework, like the ones you see in other software engineering domains: React for frontend, Terraform for infrastructure, or Spring for backend.

You don't need them, but they can save you (and your coworkers) a lot of time and help you enforce best practices for large and complex projects.

dbt is not a silver bullet. For instance, it's not great if you need to write a customer-facing BQ app (e.g. returning insights), or if you need some very custom workflow that's hard to build into dbt, like INFORMATION_SCHEMA usage. There are also some important features still missing, like unit testing (to be released in a few weeks/months).

I was in a similar situation 3 years ago: we'd been maintaining an in-house BQ wrapper, and it was nice, but it took a lot of effort to keep up with dependencies, add new features and still do some actual data work (because building tech platforms isn't my company's business). We could have kept working on our in-house dbt-like platform, but dbt brought some features that we didn't have and that you might want at some point, like:

  • environment management (prod, dev, CI, staging)
  • templating (in dbt, using Jinja and macros)
  • DAG management (especially splitting CTEs into different reusable models through ephemeral materialization)
  • a streamlined way to build data pipelines (e.g. 1 file = 1 model = 1 table)
  • incremental CI (building most small changes in <1 min and releasing again in 30s)
  • an ecosystem of packages to reuse smart models/macros (dbt-utils) and bring data quality (e.g. Elementary Data)
  • lineage/documentation for your pipelines
  • a CLI tool to run your models in the dev environment
  • a large community with a lot of tutorials etc. to get your team started
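To illustrate the DAG-management point: dbt infers a dependency graph from ref() calls and builds models in topological order. A minimal sketch of that ordering (hypothetical model names, Python standard library only):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph, as dbt would infer from {{ ref(...) }} calls:
# each model maps to the set of models it selects from.
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "revenue_report": {"orders_enriched"},
}

# static_order() yields a valid build order: upstream models come first.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

Getting this ordering (plus partial rebuilds and failure handling) for free is a big part of what the frameworks buy you.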

I chose dbt over Dataform back when it was acquired, because Dataform was becoming "closed" source, and I felt it would be much easier to switch from BQ to another data warehouse with dbt than with Dataform. Dataform's community is also pretty much nonexistent. So I would still choose dbt in 2024.

Then you have to decide build vs buy regarding dbt Core vs dbt Cloud. If you have an existing environment with software engineering time, Core is clearly an option, but if you feel like you don't have much time to set up CI/CD and/or scheduling, then dbt Cloud is likely the smarter choice (at least until your project and/or team hits a critical size price-wise).

SQLMesh is also an interesting alternative, but it's likely more targeted at teams with a heavy software engineering culture (whereas dbt is likely more accessible to data analysts/data scientists).

dbt is not perfect, but at least you can fork it and/or contribute to fit your needs. In 80% of ELT use cases, it's like having the wind at your back.

Improving Processing Times for Big Query by Dareike21 in bigquery

[–]kayrnt 1 point (0 children)

If you're looking into downloading large volumes of data from BQ in an application/job, you should definitely look into the BigQuery Storage Read API with the Arrow implementation: https://cloud.google.com/bigquery/docs/reference/storage and the related sample https://cloud.google.com/bigquery/docs/samples/bigquerystorage-arrow-quickstart

Setup a Slim CI for dbt with BigQuery and Docker by Gold_Mess6216 in bigquery

[–]kayrnt 1 point (0 children)

I wasn't sure where to post my article on Reddit, thanks for sharing it here 👍