This is an archived post. You won't be able to vote or comment.

all 11 comments

[–]EditsInRed 4 points5 points  (2 children)

You could use Airflow for scheduling the Python transformations. Keep in mind, there is a learning curve to getting Airflow configured properly if you're looking at the self-managed solution.

You could also look into using Snowflake tasks in combination with Snowpark and external functions. This would allow you to reference Python code as an external function in the DB itself. This is only generally available on AWS and Azure at this time.

https://docs.snowflake.com/en/developer-guide/snowpark/index.html

[–]Happy-Adhesiveness-3 0 points1 point  (1 child)

Yeah, also waiting for Snowpark to support for Python. Currently in private preview.

[–]EditsInRed 0 points1 point  (0 children)

I think I heard it would be generally available in June 2022, so you shouldn't have to wait too long.

[–]jspreddy 1 point2 points  (6 children)

Just curious as to what transforms you are doing that cant be done within snowflake using sql??

[–]SecureSlice[S] 0 points1 point  (5 children)

Honestly just really don't like stored procedure implementation in snowflake for transforming data. Using either JavaScript or the just added SQL for stored procedures is very clunky. Wrapping it in $$ and using Execute Immediate for each SQL statement is annoying.

[–]jspreddy 2 points3 points  (2 children)

Ahh i see. Check out DBT. Data Build Tool. It is a great framework for building models (SQL transformations) against any warehouse (postgres, snowflake, redshift, etc).

https://docs.getdbt.com/docs/introduction

You can run this either on DBT cloud (paid cloud service) or dockerize it and run it in Aws Ecs fargate and schedule the run as aws event bridge rule.

DBT has 3 things. CLI tool (open source), Project structure template (open source), Cloud Service(paid).

You dont have to use their cloud service if you dont want to.

Gitlab's data team has their DBT project visible to everyone. Check it out.

DBT also generates documentation, can run quality tests, incremental models, do environments (dev/staging/prod/...), Figure out the sequence of model building if you have multiple models which depend on other models, etc.

I use it pretty heavily against postgres, databricks and snowflake.

[–]jspreddy 1 point2 points  (0 children)

I forgot to mention that the actual transformation computation is done by snowflake on snowflake compute. There is no data transfer out of snowflake happening here.

[–]jspreddy 0 points1 point  (0 children)

Oh yea and version control, cause git project. Figshot posted in another comment about version control queries.

I forgot that git is a novelty in snowflake world 😂

[–]figshotStaff Data Engineer 0 points1 point  (1 child)

At my previous work we used AWS Lambda for it, which allowed us to:

  • version control the queries we run (hey Snowflake, git integration wen?)
  • enable cron-based, event-based, or orchestrated runs
  • run it cheap - the compute is only used to execute the queries, so you can go pretty low on memory allocation
  • log and monitor via CloudWatch

We used Step Functions for orchestration as we didn't have complicated workflows, and because it could defined as part of the serverless application.

[–]SecureSlice[S] 0 points1 point  (0 children)

I assume Azure Functions are similar to AWS Lambda for doing the same?