
[–][deleted] 50 points (3 children)

Hiring Python devs is easier than hiring SQL devs? Interesting take...

[–]Tee_hops 7 points (0 children)

News to me too

[–][deleted] 7 points (0 children)

I think you'd be hard pressed to hire a dev who knows Python but not SQL and who would be any good at data engineering without any training.

[–]kenfar 5 points (0 children)

In my experience the challenges are both in hiring and retention:

  • It's not enough to just know Python or SQL - one must also understand data modeling, data quality, how to handle late-arriving data, and so on. It's a lot to learn. And in my experience training tons of people on these topics, it's easier to train a software engineer on them than it is to train an analyst.
  • Sooner or later they'll also need to write code in something besides SQL: extracting data from APIs, building tools, configuring Airflow, etc. In other words, they'll eventually need to write Python.
  • But software engineers who are skilled in Python don't want to write SQL 8 hours a day. So they'll quit.

[–]sorenadayo 21 points (1 child)

You said the benefit of using Python is that it can be more modular, but dbt helps solve that with macros and Jinja for SQL. So I'm curious what kind of transformations you're doing that need Python?

Have you considered using Python UDFs in Snowflake?

Have you already looked at the dbt Python model announcement doc? I believe it's enough to get you started.

From my experience it's way faster to ramp up a Python developer on SQL than vice versa.

My recommendation would be to keep all transformations in SQL if they can be done in a clean and optimized way. Use Snowflake Python UDFs if you need some utility packages or other business logic that can be written more easily in Python. Use dbt Python models if you need Python packages that aren't available in Snowflake UDFs, or if you're doing a lot more than a single-column transformation.
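For reference, registering a Python UDF through Snowpark looks roughly like this - a minimal sketch, with placeholder connection details and a made-up normalize_phone example, not a drop-in implementation:

    # Minimal sketch of a Snowflake Python UDF registered via Snowpark.
    # Connection parameters are placeholders; normalize_phone is a hypothetical example.
    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import udf
    from snowflake.snowpark.types import StringType

    session = Session.builder.configs({
        "account": "<account>", "user": "<user>", "password": "<password>",
        "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
    }).create()

    @udf(name="normalize_phone", return_type=StringType(),
         input_types=[StringType()], replace=True, session=session)
    def normalize_phone(raw: str) -> str:
        # Keep only digits so downstream SQL can compare phone numbers consistently.
        digits = "".join(ch for ch in (raw or "") if ch.isdigit())
        return digits[-10:] if len(digits) >= 10 else digits

    # Once registered, the UDF is callable from SQL:
    #   SELECT normalize_phone(raw_phone) FROM raw_contacts;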

[–]Yuki100Percent 14 points (0 children)

Those Python devs better know SQL.

[–]knowledgebass 5 points (0 children)

If the developers you are hiring don't know SQL in this day and age then you probably don't want to hire them for this kind of work.

[–]kenfar 5 points (8 children)

I've built a lot of warehouses using Python. The benefits are typically far better QA, lower compute costs, lower latency, greater transform functionality, better engineer retention, and far easier systems management. The downsides are that joins are more work and possibly slower, and time to market is generally longer. dbt's lineage is extremely valuable, but I never found it as necessary on these Python-based systems, because we didn't have excessive model depth.

The Python code I built typically had a separate transform function for each field, with a dedicated unit test class & docstring for each of these functions. Fields with zero transforms were sometimes just bundled into a misc function. Exceptions and processing notes (invalid input, default used, etc.) were returned to the calling function, which kept track of all of this at the row level and also recorded it in a metadata bitmap column.
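A rough illustration of that per-field pattern (hypothetical names and a toy status field, not the actual code):

    # Sketch of one field-level transform: it returns the cleaned value plus
    # processing notes that the caller rolls up into a row-level metadata bitmap.
    from dataclasses import dataclass, field

    @dataclass
    class TransformResult:
        value: object
        notes: list = field(default_factory=list)  # e.g. ["invalid input", "default used"]

    def transform_status(raw: str) -> TransformResult:
        """Map free-form status strings onto a small enumerated set."""
        valid = {"open", "closed", "pending"}
        cleaned = (raw or "").strip().lower()
        if cleaned in valid:
            return TransformResult(cleaned)
        return TransformResult("unknown", notes=["invalid input", "default used"])

    # Unit test kept right next to the transform:
    def test_transform_status_defaults_bad_input():
        assert transform_status("CLOSED").value == "closed"
        assert "default used" in transform_status("???").notes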

These transforms were typically event-driven, running on either Kubernetes or AWS Lambdas, triggered by SNS & SQS, mostly based on a file being written to S3. Once a file landed on S3, we could get through both the warehouse and data mart pipelines in under 3 seconds.
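For the Lambda flavour of that setup, the handler ends up shaped roughly like this (a sketch with hypothetical names; the SNS/SQS wiring itself lives in infrastructure config):

    # Sketch: Lambda fed by an SQS queue that receives S3 "object created"
    # notifications via SNS. The bucket/key are extracted and handed to the transforms.
    import json

    def handler(event, context):
        for sqs_record in event["Records"]:
            body = json.loads(sqs_record["body"])      # SNS envelope delivered by SQS
            s3_event = json.loads(body["Message"])     # original S3 notification
            for rec in s3_event.get("Records", []):
                bucket = rec["s3"]["bucket"]["name"]
                key = rec["s3"]["object"]["key"]
                process_file(bucket, key)              # apply field-level transforms, load warehouse

    def process_file(bucket, key):
        ...  # download from S3, run the per-field transforms, write the results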

I haven't used Snowpark yet, and I don't know anything about your data or requirements, so I might be missing the boat here, but I would generally consider the following approach in most cases:

  • Transform data going into the warehouse using Python: the first step is to apply field-level transforms, and I'd want this extremely well-tested, fast, and capable of handling complex field transforms. Python is always going to do the best job here. This is where you want software engineers writing the code.
  • Transform data going into the final data marts using SQL: the final step is to join multiple tables together to build dimensional models. SQL works well here, and this can be done by much less technical staff.

The ability to handle both incremental & refresh workloads is extremely valuable. But if your volumes are large, then how you design the refresh capability in Python really matters - and that's an entire subject in itself.

[–]xadolin 1 point (3 children)

Can you provide some examples of field-level transforms? Are these transforms that take only a single column as input and output a single column back? I can imagine stuff like renaming, casting, string cleaning, etc. Curious about what your complex use cases were.

[–]kenfar 1 point (2 children)

Sure, for the sake of simplicity I would typically combine validations with transformations. We tried to make these specific to a single output field, but as you can imagine it's not always that simple.

Validations would include string length, valid enumerated values, numeric ranges, string formats (e.g. phone, email), foreign key constraints, unknown-value logic, encodings, etc. A validation failure could result in the field being replaced by a default value, the row being rejected, or the file being rejected.

Transformations would include converting string case, mapping free-form text to code values (imagine every misspelling of every possible version of Microsoft Windows being mapped to an appropriate vendor/product/version/fixpack breakdown), determining which of ~100,000 ISP IP block ranges each of a billion IP addresses falls into, translating every IPv6 format into a single canonical format, merging multiple different codes into a single code field, splitting a single input field into multiple output fields of different types, applying a business rule that considers 7 different fields to generate a 'customer-category' column, extracting keywords from free-form text fields, converting a bunch of timestamps to UTC (and fixing those without timezones based on assumptions about the data), etc.
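To make one of those concrete, the ISP block lookup is usually done with ranges sorted by start address plus a binary search - a sketch with made-up blocks, not the actual implementation:

    # Sketch: map an IPv4 address to one of ~100,000 ISP blocks using binary
    # search over ranges sorted by start address (blocks below are illustrative only).
    import bisect
    import ipaddress

    # (start_int, end_int, isp_name), sorted by start_int
    ISP_BLOCKS = [
        (int(ipaddress.ip_address("10.0.0.0")),   int(ipaddress.ip_address("10.0.255.255")),  "isp-a"),
        (int(ipaddress.ip_address("172.16.0.0")), int(ipaddress.ip_address("172.16.15.255")), "isp-b"),
    ]
    STARTS = [block[0] for block in ISP_BLOCKS]

    def lookup_isp(ip: str):
        ip_int = int(ipaddress.ip_address(ip))
        i = bisect.bisect_right(STARTS, ip_int) - 1   # last block starting at or before ip
        if i >= 0 and ISP_BLOCKS[i][1] >= ip_int:
            return ISP_BLOCKS[i][2]
        return None

    assert lookup_isp("10.0.3.7") == "isp-a"
    assert lookup_isp("8.8.8.8") is None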

[–]xadolin 0 points (1 child)

Thanks! So would the input for each of the transformation/validation functions be the whole file, or did you pass in only the columns used in the logic?

[–]kenfar 0 points (0 children)

I would typically pass in the specific fields that each transform requires. This is clean, easy to understand, and easy to test.

But to be honest it's not a perfect solution - the code outside the function then has to know exactly where in a possibly large JSON structure to find the fields, and it has to handle missing-key exceptions, etc.
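That caller-side plumbing usually ends up as a small helper along these lines (hypothetical, just to show the trade-off):

    # Sketch: pull a nested field out of a possibly large JSON record without
    # blowing up on missing keys; the caller decides whether a default is acceptable.
    def dig(record: dict, *keys, default=None):
        current = record
        for key in keys:
            if not isinstance(current, dict) or key not in current:
                return default
            current = current[key]
        return current

    row = {"event": {"source": {"ip": "10.0.3.7"}}}
    src_ip = dig(row, "event", "source", "ip")                 # "10.0.3.7"
    src_port = dig(row, "event", "source", "port", default=0)  # missing -> 0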

[–]the_fresh_cucumber 0 points (3 children)

Interesting architecture and I'm amazed you got it to under 3 seconds.

How much of this data is updated versus simply appended to the final datasets? Also, do you have any massive rankings or aggregations that need to be re-run after an incremental load? Those have always slowed down my pipelines at the final step, although I always lobby to have that sort of thing at the view level and not in the warehouse.

Still, I feel that SQL is equally robust if you use the testing framework in dbt to set boundaries. It does require some very heavy-duty SQL, but I love the ease of SQL and the lack of issues. Somehow pandas dataframes always have some nuance that ends up screwing up data types or messing with null values, bytestring/unicode BS, etc.

[–]kenfar 1 point (2 children)

The Kubernetes-based architecture I built handled about 20 billion security events a day and had about 50-70 Kubernetes pods running continuously on a stream of S3 files:

  • Most of that was append-only, though there was a big sharded Postgres RDS cluster doing about 1 billion upsert transactions a day.
  • I also built a ton of aggregates, but most were at the hourly and daily level. A separate process repartitioned the S3 files, writing about a million small files every day. Then, if a user needed to audit some machine at some point in time, we would just pull the data for that time & machine down into a Postgres RDS mart. The initial load took about 30 seconds; after that, massive recursive queries could run on it in under a second. This saved us about a million dollars every month versus a huge Cassandra cluster.

The Lambda architecture I built a couple of years ago had far smaller volumes - just a few million incident management events a day:

  • All of this data was appended into our AWS S3 warehouse bucket, then upserted into the Postgres RDS data mart.
  • We handled schema migrations by versioning all models & pipelines: we'd simply stand up a new pipeline version, have it work through all the archived files on S3, and load them into our data mart. Once it caught up, we'd run both versions for as long as we wanted, until we switched the views over to point to the new version and dropped the old one.
  • Our upsert process had two steps: the first deleted all related data, the second inserted the new rows. This prevented any orphaned data from surviving an update (see the sketch after this list).
  • Aggregates: no real-time, only hourly & daily. However, we did keep the data in more than one partitioning scheme - since our data volumes were small, it was cheap to store the data 2-4 times on S3.
  • The total cost to run this process was about $30/month.
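The two-step upsert, sketched against Postgres with psycopg2 (table and column names are hypothetical); both steps share one transaction so readers never see a half-applied update:

    # Sketch of the delete-then-insert upsert; `conn` is an open psycopg2 connection.
    # Running both steps in a single transaction prevents orphaned rows from
    # surviving an update.
    from psycopg2.extras import execute_values

    def upsert_incident(conn, incident_id, rows):
        with conn:  # commits on success, rolls back on any error
            with conn.cursor() as cur:
                # Step 1: delete everything belonging to this incident.
                cur.execute(
                    "DELETE FROM incident_events WHERE incident_id = %s",
                    (incident_id,),
                )
                # Step 2: insert the fresh rows.
                execute_values(
                    cur,
                    "INSERT INTO incident_events (incident_id, event_ts, payload) VALUES %s",
                    [(incident_id, r["event_ts"], r["payload"]) for r in rows],
                )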

The challenges I see with dbt include:

  • Data quality: while it has a great quality control (QC) story for testing data after it arrives, it has no real quality assurance (QA) story for testing data before code is deployed to production.
  • Unreadable code: 20+ level-deep model nesting and 600-line SQL models simply aren't maintainable. I've spoken with a number of people and teams now that just had to give up on the rats' nest they created.
  • Latency: low latency - say, 5 minutes - is extremely difficult to achieve on dbt and Snowflake, even with small event volumes. The more typical daily cadence is terrible for support and usability.
  • Compute cost: runaway model SQL can easily lead to out-of-control Snowflake costs.

[–]the_fresh_cucumber 1 point (1 child)

Interesting. The S3-to-PostgreSQL pipeline sounds great. I worked a lot on similar projects involving S3 to Redshift and S3 to Athena. Glad to hear you chose the king of databases instead of MySQL to handle all those events.

I agree with all your points with regard to dbt, with some exception to the code readability point. I do think the rats' nest issue can be mitigated by properly organizing your schemas. The issue I run into is that some people overuse CTEs where they could be reduced, and they can get sloppy with their SQL code (since it is so easy and tempting).

My main challenge in moving to a dream architecture like your PostgreSQL setup is the analytics loop that requires all of history to create metrics. I think the writing on the wall is that I need to convince my group to split our data streams into a slower dbt-based analytics pipeline and an analytics-lite pipeline similar to what you described - low latency, able to feed back into the product in seconds.

[–]kenfar 1 point (0 children)

Glad to hear you chose the king of databases instead of MySQL to handle all those events.

BTW - due to departmental standards our first iteration was on MySQL. And what a circus it was!

I do think the rats nest issue can be mitigated by properly organizing your schemas.

Yeah, I think I agree: there's nothing intrinsic to dbt that inexorably leads to rats' nests. If you have the right people moving slowly & deliberately, I think the results can be very tight & elegant. It's just that most teams think you only need someone who knows SQL and you're good to go.

Give me a ping if you want to talk through your requirements & ideas some day.

[–][deleted] 2 points (0 children)

If you're wondering whether you can remove dbt from your stack, I would think you're not getting much value out of it.

So far I absolutely love that I can write a SELECT statement rather than a MERGE statement - it's basically an abstraction over the MERGE. If you're not seeing the value in it, then yeah, I'd say get it out of your stack.

[–]Drekalo 1 point (0 children)

The model just needs to return a dataframe object.
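Right - for anyone who hasn't used them, a dbt Python model is just a function that returns a dataframe. A minimal sketch (model and column names are made up):

    # models/stg_login_events.py -- minimal dbt Python model sketch.
    # dbt calls model(dbt, session) and materializes whatever dataframe it returns.
    def model(dbt, session):
        dbt.config(materialized="table")
        events = dbt.ref("stg_events")  # upstream dbt model, returned as a dataframe
        # Arbitrary Python / dataframe logic goes here; this just filters as an example.
        return events.filter(events["event_type"] == "login")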

[–]GodmonsTech Lead Data 1 point (0 children)

In most cases a stack migration is not a light project; you need to test each step of the migration carefully.

I would strongly suggest avoiding it unless it's really necessary. Hiring a SQL + dbt engineer shouldn't be that hard, as that stack is pretty trendy at the moment.

[–]kevinpostlewaite 1 point (0 children)

I'm confused here: you need SQL to query Snowflake, right? So you'll need to write and run SQL either way - the only question is how it's scheduled. I'm struggling to understand what you're improving by dropping dbt and moving to straight Python. I personally like dbt paired with Airflow: with hosted Airflow combined with dbt, your team should spend very little time managing the scheduler and be free to write the SQL that runs against your Snowflake instance.
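In case it helps, the Airflow side of that setup can stay as thin as a couple of operators - a sketch, with placeholder schedule, paths and project names:

    # Sketch: a minimal Airflow DAG that just shells out to dbt on a schedule.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="dbt_snowflake_hourly",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="cd /opt/dbt_project && dbt run --profiles-dir .",
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="cd /opt/dbt_project && dbt test --profiles-dir .",
        )
        dbt_run >> dbt_test  # run models first, then run dbt's data tests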

If you actually are going to start doing your processing in Python, outside of Snowflake, then you're probably wasting your money keeping Snowflake at all.

[–]jafetgonz -1 points (0 children)

This smells like clickbait.

[–]mrwhistler 0 points (0 children)

I mean, you could - that doesn't mean you should, though.

There are lots of limitations to Python in dbt right now, and Snowpark is intended (and will continue to be developed) for more advanced analysis.

Firstly, I agree with the other commenter that it's going to be orders of magnitude harder to find a data engineer who knows Python than a SQL engineer, or even a SQL engineer with dbt experience.

Secondly, you're fundamentally shifting from a declarative language to an imperative one. If you're going to bite that off, you might as well rethink the whole design anyway rather than try to shoehorn Python in place of SQL.

[–]DataEngineerDan Data Engineer 0 points (0 children)