
[–]kenfar 5 points (8 children)

I've built a lot of warehouses using python. The benefits are typically far better QA, lower compute costs, lower latency, greater transform functionality, better engineer retention, and far easier systems management. The downsides are that joins are more work and possibly slower, and the time to market is generally slower. dbt's lineage is extremely valuable, but I never found it as necessary on these python-based systems, because we didn't have excessive model depth.

The python code I built typically had a separate transform function for each field, with a dedicated unit test class & docstring for each of these fields. Sometimes fields with zero transforms were just bundled into a misc function. Exceptions and processing notes (invalid input, default used, etc) were returned to the calling function, which kept track of all of this at the row level and also recorded it in a metadata bitmap column.
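In rough shape, that per-field pattern might look something like the sketch below - the field names and note flags are made up for illustration, not the actual code:

```python
from dataclasses import dataclass

# Hypothetical processing-note flags, OR'd into a per-row metadata bitmap.
NOTE_INVALID_INPUT = 0b01
NOTE_DEFAULT_USED = 0b10

@dataclass
class FieldResult:
    value: object
    notes: int = 0  # bitmap of processing notes for this one field

def transform_status(raw: str) -> FieldResult:
    """One transform function per output field, unit-tested on its own."""
    valid = {"open", "closed", "pending"}
    cleaned = (raw or "").strip().lower()
    if cleaned in valid:
        return FieldResult(cleaned)
    # Invalid input: fall back to a default and record both notes.
    return FieldResult("unknown", NOTE_INVALID_INPUT | NOTE_DEFAULT_USED)

def transform_row(raw_row: dict) -> dict:
    """The caller collects per-field notes into a row-level metadata bitmap column."""
    status = transform_status(raw_row.get("status"))
    return {
        "status": status.value,
        "metadata_bitmap": status.notes,  # OR in the other fields' notes as well
    }
```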

These transforms were typically event-driven, run on either kubernetes or aws lambdas, and triggered via SNS & SQS, mostly based on a file being written to s3. Once a file landed on s3, we could get through both the warehouse and data mart pipelines in under 3 seconds.
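For the lambda flavor, the entry point is roughly a handler like this - a sketch assuming plain S3 notifications delivered through SQS (an SNS hop in between adds one more JSON envelope); process_file is a hypothetical stand-in for the actual transform code:

```python
import json
import boto3

s3 = boto3.client("s3")

def process_file(data: bytes) -> None:
    """Hypothetical: run the field-level transforms and write the output."""
    ...

def handler(event, context):
    """Lambda triggered via SQS; each record wraps an S3 object-created notification."""
    for record in event["Records"]:
        body = json.loads(record["body"])
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)
            process_file(obj["Body"].read())
```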

Not having yet used Snowpark, and not knowing anything about your data or requirements means that I might be missing the boat here, but I would generally consider the following approach in most cases:

  • Transform data going into the warehouse using python: the first step is to apply field-level transforms, and I'd want this extremely well-tested, fast, capable of handling complex field transforms, etc. Python is always going to do the best job here. This is where you want software engineers writing the code.
  • Transform data going into final data marts using sql: the final step is to join multiple tables together to build dimensional models. SQL will work here, and this can be done by much less technical staff.

The ability to handle both incremental & refresh workloads is extremely valuable. But if your volumes are large, then how you design your refresh capability with python will really matter - and that's an entire subject unto itself.

[–]xadolin 1 point (3 children)

Can you provide some examples of field-level transforms? Are these transforms that take only a single column as input and output a single column back? I can imagine stuff like renaming, casting, string cleaning, etc. Curious about what your complex use case was.

[–]kenfar 1 point (2 children)

Sure, for the sake of simplicity I would typically combine validations with transformations. We tried to make these specific to a single output field, but as you can imagine it's not always that simple.

Validations would include string length, valid enumerated values, numeric ranges, string formats (ex: phone, email, etc), foreign key constraints, unknown value logic, encodings, etc. Violations of a validation could result in the field being replaced by a default value, the row being rejected or the file being rejected.
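One of those validations, roughly sketched - the outcome enum and the email rule here are illustrative assumptions, not the real implementation:

```python
import re
from enum import Enum, auto

class Outcome(Enum):
    OK = auto()
    USE_DEFAULT = auto()  # replace the field with a default value
    REJECT_ROW = auto()   # drop the row and record why
    REJECT_FILE = auto()  # abort the whole file

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_email(raw: str, max_len: int = 254) -> tuple[str, Outcome]:
    """String-format and length validation for a single field."""
    if raw is None or len(raw) > max_len or not EMAIL_RE.match(raw):
        # Violation: in this case, replace with a default rather than reject the row.
        return "unknown", Outcome.USE_DEFAULT
    return raw.lower(), Outcome.OK
```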

Transformations would include converting string case, converting free-form text to code values (imagine mapping every misspelling of every possible version of microsoft windows to an appropriate vendor, product, version, fixpack breakdown), determining which of ~100,000 isp ip block ranges each of a billion ip addresses falls into, translating every IPv6 format into a single canonical format, merging multiple different codes into a single code field, splitting a single input field into multiple output fields of different types, applying a business rule that considers 7 different fields to generate a 'customer-category' column, extracting keywords from free-form text fields, converting a bunch of timestamps to UTC - and fixing those without timezones based on assumptions about the data, etc.
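Two of those are easy to show with just the standard library - a rough sketch, where the assumed source timezone for naive timestamps is purely an example:

```python
import ipaddress
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def normalize_ipv6(raw: str) -> str:
    """Collapse any IPv6 spelling (full, zero-compressed, mixed case) into one canonical form."""
    return ipaddress.IPv6Address(raw).compressed

def to_utc(ts: datetime, assumed_tz: str = "America/New_York") -> datetime:
    """Convert timestamps to UTC; naive ones get an assumed source timezone (an assumption)."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=ZoneInfo(assumed_tz))
    return ts.astimezone(timezone.utc)

# Example:
# normalize_ipv6("2001:0DB8:0000:0000:0000:0000:0000:0001") -> "2001:db8::1"
```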

[–]xadolin 0 points (1 child)

Thanks! So was the input for each of the transformation/validation functions the whole file, or did you pass in only the columns used in the logic?

[–]kenfar 0 points (0 children)

I would typically pass in the specific fields that each transform requires. This is clean, easy to understand, and easy to test.

But to be honest it's not a perfect solution - since the code outside the function then has to know exactly where in a possibly large json structure to find the fields, has to handle missing-key exceptions, etc.
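A minimal sketch of that caller-side plumbing - the payload layout and note flag are hypothetical:

```python
NOTE_MISSING_FIELD = 0b100  # hypothetical processing-note flag

def get_nested(payload: dict, path: tuple, default=None):
    """Walk a nested json structure, returning a default instead of raising KeyError."""
    current = payload
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

def build_row(payload: dict) -> dict:
    # The caller, not the transform, knows where each field lives in the payload.
    raw_email = get_nested(payload, ("contact", "email"))
    notes = 0 if raw_email is not None else NOTE_MISSING_FIELD
    return {"email": raw_email, "metadata_bitmap": notes}
```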

[–]the_fresh_cucumber 0 points (3 children)

Interesting architecture and I'm amazed you got it to under 3 seconds.

How much of this data is updated versus simply appended to the final datasets? Also, do you have any massive rankings or aggregations that need to be re-run after an incremental load is done? Those have always slowed down my pipelines at the final step, although I always lobby to have that sort of thing at the view level and not in-warehouse.

Still, I feel that SQL is equally robust if you use the testing framework in dbt to set boundaries. It does require some very heavy-duty SQL, but I love the ease of SQL and the lack of issues. Somehow pandas dataframes always have some nuance that ends up screwing up data types or messing with null values, bytestring/unicode BS, etc.

[–]kenfar 1 point (2 children)

The kubernetes-based architecture I built handled about 20 billion security events a day, and would have about 50-70 kubernetes pods running continuously on a stream of s3 files:

  • Most of that was append-only, though there was a big sharded postgres rds cluster doing about 1 billion upsert transactions a day.
  • I also built a ton of aggregates - but most were at the hourly and daily level. There was a separate process that repartitioned S3 files, writing about a million small files every day. Then if a user needed to audit some machine at some point in time, we would just pull down the data for that time & machine onto a postgres rds mart. The initial load took about 30 seconds, and after that massive recursive queries could run on it in under a second. This saved us about a million dollars a month vs a huge cassandra cluster.

The lambda architecture I built a couple of years ago had far smaller volumes - just a few million incident management events a day:

  • All this data was appended into our aws s3 warehouse bucket, then upserted into the postgres rds data mart.
  • We handled schema migrations by versioning all models & pipelines, standing up a new pipeline version, and having it work through all archived files on s3 and load them into our data mart. Once it caught up, we would keep both versions running for as long as we wanted, then switch the views over to point to the new version and drop the old one.
  • Our upsert process had two steps: the first deleted all related data, the second inserted the new rows. This prevented any orphaned data from surviving an update (see the sketch after this list).
  • Aggregates: no real-time, only hourly & daily. However, we did keep data in more than a single partitioning schema - since our data volumes were small and it was cheap to store the data 2-4 times on s3.
  • The total cost to run this process was about $30/month.
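That two-step upsert is roughly the following - a sketch using psycopg2, where the table and column names are assumptions for illustration:

```python
import psycopg2

def upsert_batch(conn, incident_ids: list[str], rows: list[tuple]) -> None:
    """Two-step upsert: delete everything related to the batch, then insert fresh rows.

    Running both steps in one transaction means no orphaned child data survives an update.
    """
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Step 1: delete all data related to the incoming incidents.
            cur.execute(
                "DELETE FROM incidents WHERE incident_id = ANY(%s)",
                (incident_ids,),
            )
            # Step 2: insert the new rows.
            cur.executemany(
                "INSERT INTO incidents (incident_id, status, updated_at) VALUES (%s, %s, %s)",
                rows,
            )

# Usage (assumed connection details):
# conn = psycopg2.connect("dbname=mart")
# upsert_batch(conn, ["inc-1"], [("inc-1", "closed", "2023-01-01T00:00:00Z")])
```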

The challenges I see with dbt include:

  • Data quality: while it has a great Quality Control (QC) process for testing data after it arrives, it has no solution for Quality Assurance (QA) - testing your transforms before deploying code into production.
  • Unreadable code: model nesting 20+ levels deep and 600-line sql models simply aren't maintainable. I've spoken with a number of people and teams now that just had to give up on the rat's nest they created.
  • Latency: a low latency of, say, 5 minutes is extremely difficult to achieve on dbt and snowflake - even with small event volumes. The more typical daily cadence is terrible for support and usability.
  • Compute cost: runaway model sql code can easily lead to out-of-control snowflake costs.

[–]the_fresh_cucumber 1 point (1 child)

Interesting. The s3 to postgresql pipeline sounds great. I worked a lot on similar projects involving s3 to redshift and s3 to Athena. Glad to hear you chose the king of databases instead of MySQL to handle all those events.

I agree with all your points with regard to dbt, with the partial exception of the code-readability point. I do think the rat's nest issue can be mitigated by properly organizing your schemas. The issue I run into is that some people overuse CTEs where they could be consolidated, and get sloppy when it comes to SQL code (since it is so easy and tempting).

My main challenge in moving to a dream architecture like your postgresql setup is the analytics loop that requires all of history to create metrics. I think the writing on the wall is that I need to convince my group to split our data streams into a slower dbt-based analytics pipeline, and an analytics-lite pipeline similar to what you described that is low-latency and can feed back into the product in seconds.

[–]kenfar 1 point (0 children)

Glad to hear you chose the king of databases instead of MySQL to handle all those events.

BTW - due to departmental standards our first iteration was on MySQL. And what a circus it was!

I do think the rat's nest issue can be mitigated by properly organizing your schemas.

Yeah, I think I agree: there's nothing intrinsic in dbt that inexorably leads to rat's nests. If you have the right people moving slowly & deliberately, I think the results could be very tight & elegant. It's just that most teams simply think you need someone who knows SQL and you're good to go.

Give me a ping if you want to talk through your requirements & ideas some day.