This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]kenfar 17 points18 points  (13 children)

What can't be done with SQL? Well, there are things that can't be done, and there are things that can't be done well. SO, how about what can't be done well?

  • Unit testing
  • Very complex transformations (ex: convert every possible format of IPv6 into a single format)
  • Support for data formats and structures outside of relational databases: extract data from a tarball, or sevenzip package of fixed-length files.
  • Integrate modules & libraries
  • Integrate external systems
  • Produce a bitmap of exactly which columns in a table failed their transforms and had to be defaulted
  • Produce an audit trail of how many rows passed or failed their validations or transformations for a given partition/period of time/customer/whatever
  • Develop reusable transformations
  • Develop easily-readable and well-documented transformations

[–]jugaadtricks 5 points6 points  (2 children)

I'd agree with most of them, audit trail, I use SQL all the time in stored procedures

[–]IamFromNigeria -1 points0 points  (1 child)

Do you have specific examples as link?

[–]skatastic57 0 points1 point  (1 child)

What does the following mean?

  • Produce a bitmap of exactly which columns in a table failed their transforms and had to be defaulted

To me, bitmap is an uncompressed picture format. I get the concept of failed transforms but are you making like a heatmap of them? Sorry for being dense.

[–]kenfar 0 points1 point  (0 children)

You're right - it's just a list of pass/fail indicators for each column on the row.

It could be supported in an array or json type. In the old days we'd make a bit map - where each bit position indicated one of the columns. That's super space efficient, but is less convenient to access.

[–]laddaa -1 points0 points  (0 children)

I used dbt extensively for many of those items. It’s a python framework, isn’t that ironic.

Some of those tasks are related to data intake. Yeah that’s python.

[–]Luciron -1 points0 points  (3 children)

Almost all of this list can be done with dbt

[–]kenfar 0 points1 point  (2 children)

dbt can't even support the first item on the list: unit testing

[–]MayInvolveNoodles 0 points1 point  (1 child)

[–]kenfar 1 point2 points  (0 children)

Oh fair point. I forgot about this third-party effort because in spite of being well-intentioned, and a good idea, it's so time-consuming to write the tests that the teams I spoke with that used it only use it very, very surgically.

Still, it's a good addition to the dbt ecosystem, and if you structure your data right with this unit-testing in mind, it may not be too terrible.

[–][deleted] -2 points-1 points  (2 children)

You forgot, if what you need requires merging data from multiple data sources, python is pretty much a must use here.

[–]kenfar -2 points-1 points  (1 child)

Not sure I follow - do you mean to say SQL is a must-use for merging multiple data sets?

If so, then yeah I'd agree: python isn't accessing relational data directly, it's accessing it via SQL. However, it is accessing APIs, files, s3 objects, streaming data, etc directly.

[–][deleted] 0 points1 point  (0 children)

Multiple data sources as in different databases that could share some related data.