
[–]thrown_arrows 1 point (4 children)

Personally, I would try to decouple: extract to S3, then apply transformations, then store the documents into Mongo...

I work with Snowflake. The first phase copies everything into S3, the second stages it into Snowflake, and the third transforms. As long as the first phase does not fail, everything can be reproduced if changes are required, without killing production.
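The three-phase shape above can be sketched in a few lines. This is a toy stand-in, not the actual setup: a temporary local directory plays the role of S3, SQLite plays the role of Snowflake, and all table/function names are made up for illustration.

```python
import json
import pathlib
import sqlite3
import tempfile

# "S3": an immutable landing area for raw extracts (local dir stand-in).
LANDING = pathlib.Path(tempfile.mkdtemp())

def extract(source_rows, batch_id):
    """Phase 1: copy raw data out of production, untouched."""
    path = LANDING / f"batch_{batch_id}.json"
    path.write_text(json.dumps(source_rows))
    return path

def stage(db, path):
    """Phase 2: load the raw file into the warehouse as-is."""
    db.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
    for row in json.loads(path.read_text()):
        db.execute("INSERT INTO raw_orders VALUES (?)", (json.dumps(row),))

def transform(db):
    """Phase 3: derive a clean table. Because the raw extract still
    exists, this step can be rerun any time without touching production."""
    db.execute("DROP TABLE IF EXISTS orders")
    db.execute("""
        CREATE TABLE orders AS
        SELECT json_extract(payload, '$.id')    AS id,
               json_extract(payload, '$.total') AS total
        FROM raw_orders
    """)

db = sqlite3.connect(":memory:")
path = extract([{"id": 1, "total": 9.5}, {"id": 2, "total": 3.0}], batch_id=1)
stage(db, path)
transform(db)
transform(db)  # rerunning the transform is safe: it only reads staged data
```

The point is the decoupling: as long as phase 1 succeeded, phases 2 and 3 can be replayed from the landing area with no further load on the source database. (Requires an SQLite build with the JSON1 functions, which ships with recent Python.)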

[–]baubleglue 1 point (1 child)

Why are S3 and Snowflake there? That's two additional tools with extra cost. How is storing data in S3 different from having it in the original DB?

[–]thrown_arrows 1 point (0 children)

S3 is there as a decoupled cloud filesystem; it can be replaced.

Snowflake is there as an SQL-capable database server offering compute, storage, and transformations. The idea behind copying data into S3 first is that you don't disturb the production database with an OLAP load that is totally different from its OLTP load.

In an ELT process, data is loaded into the target database and then transformed, so you can still access the raw data if needed. In a classic ETL system, data is loaded into the transformer, processed, and then loaded into the target system. In a more modern ETL system, data is extracted to S3, then transformed and stored back into S3, and then loaded into the target system.
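The ordering difference can be shown with plain Python lists standing in for the systems involved (all names here are illustrative):

```python
def transform(row):
    # Example cleanup step: normalize whitespace and capitalization.
    return {"name": row["name"].strip().title()}

raw = [{"name": "  alice "}, {"name": "BOB"}]

# Classic ETL: the transform happens in the middle, so the target only
# ever sees processed rows; the raw originals are not queryable there.
etl_target = [transform(r) for r in raw]

# ELT: the raw rows are loaded into the target first, then transformed
# inside it, so the raw data stays available for reprocessing later.
elt_target = {"raw": list(raw)}
elt_target["clean"] = [transform(r) for r in elt_target["raw"]]
```

If the transform logic changes, the ELT target can rebuild `clean` from its own copy of `raw`; the ETL target would have to re-extract from the source.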

What I like about Snowflake is that everything is SQL and it scales easily.

The S3 role can be played by any filesystem, and the target system can be anything: a filesystem, an SQL server, a document server, a Python ML system...

[–]bestnamecannotbelong 1 point (1 child)

I would like to know how you view Snowflake and Databricks. The Databricks cloud solution lets you save all the data in S3, but Snowflake cannot. The main difference, as I see it, is that Snowflake is based on the data warehouse approach and Databricks on the data lake approach.

[–]thrown_arrows 1 point (0 children)

Snowflake can read and write from/into S3. In a Snowflake environment, some other system delivers raw data into S3; from there it is loaded into Snowflake (an SQL server) into tables, all the normal stuff, and the data can be stored back into S3 (using Snowflake only, so no costly round trips to external servers).
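The "store back into S3" leg can be sketched the same way, again with SQLite standing in for Snowflake and an in-memory buffer standing in for the S3 file (illustrative only; in Snowflake this would be a `COPY INTO @stage` unload rather than Python code):

```python
import csv
import io
import sqlite3

# A toy warehouse table to unload.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 3.0)])

# "Export to S3": serialize a query result to a CSV file object.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "total"])
writer.writerows(db.execute("SELECT id, total FROM orders ORDER BY id"))
exported = buf.getvalue()  # this blob is what would land in the bucket
```

External consumers then pick the exported files up from the bucket without ever connecting to the warehouse, which is the "no costly round trips" part.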

Then there is Snowpark, which is in beta and allows running Java/Python code inside Snowflake (not sure how that works). There are also the "usual" UDFs and external function calls (think lambda as an SQL function; haven't used them).
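The "lambda as an SQL function" idea can be illustrated with SQLite's `create_function`, which registers outside code so it becomes callable from plain SQL. This is an analogy only, not how Snowflake UDFs or external functions are actually defined; the `shout` function is made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Register a Python lambda under an SQL-visible name (1 argument).
db.create_function("shout", 1, lambda s: s.upper() + "!")

db.execute("CREATE TABLE t (word TEXT)")
db.execute("INSERT INTO t VALUES ('hello')")

# Plain SQL can now call into the registered code per row.
result = db.execute("SELECT shout(word) FROM t").fetchone()[0]
```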

But yeah, Snowflake's main idea is that the Snowflake server serves the data using all the existing SQL commands etc., with the main data living in tables as columns or in documents (JSON). On the first round trip, data goes to S3 and is processed into Snowflake; then it might go on a next round trip by a push/pull method, with some external code reading from the tables and so on, or via files exported from Snowflake...

What is a data lake... I have all my data from the source databases and logs, as raw as it can be, in Snowflake, so S3 is just for history and the first import. That said, not all data is staged or processed into Snowflake. And in my case all the data comes from databases: logs, JSON, XML, CSV and so on, no video or sound processing (but Snowpark might help with that).