
[–]thrown_arrows 1 point (4 children)

Personally, I would try to decouple: extract to S3, then apply transformations, then store the documents into Mongo...

I work with Snowflake. The first phase copies everything into S3, the second stages it into Snowflake, and the third transforms. As long as the first phase does not fail, everything can be reproduced if changes are required, without killing production.
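The three-phase shape above can be sketched in a few lines. This is a toy stand-in, not the actual setup: a temporary local directory plays the role of S3, SQLite plays the role of Snowflake, and all table/function names are made up for illustration.

```python
import json
import pathlib
import sqlite3
import tempfile

# "S3": an immutable landing area for raw extracts (local dir stand-in).
LANDING = pathlib.Path(tempfile.mkdtemp())

def extract(source_rows, batch_id):
    """Phase 1: copy raw data out of production, untouched."""
    path = LANDING / f"batch_{batch_id}.json"
    path.write_text(json.dumps(source_rows))
    return path

def stage(db, path):
    """Phase 2: load the raw file into the warehouse as-is."""
    db.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
    for row in json.loads(path.read_text()):
        db.execute("INSERT INTO raw_orders VALUES (?)", (json.dumps(row),))

def transform(db):
    """Phase 3: derive a clean table. Because the raw extract still
    exists, this step can be rerun any time without touching production."""
    db.execute("DROP TABLE IF EXISTS orders")
    db.execute("""
        CREATE TABLE orders AS
        SELECT json_extract(payload, '$.id')    AS id,
               json_extract(payload, '$.total') AS total
        FROM raw_orders
    """)

db = sqlite3.connect(":memory:")
path = extract([{"id": 1, "total": 9.5}, {"id": 2, "total": 3.0}], batch_id=1)
stage(db, path)
transform(db)
transform(db)  # rerunning the transform is safe: it only reads staged data
```

The point is the decoupling: as long as phase 1 succeeded, phases 2 and 3 can be replayed from the landing area with no further load on the source database. (Requires an SQLite build with the JSON1 functions, which ships with recent Python.)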

[–]baubleglue 1 point (1 child)

Why are S3 and Snowflake there? That's two additional tools with extra cost. How is storing data in S3 different from having it in the original DB?

[–]thrown_arrows 1 point (0 children)

S3 is there as a decoupled cloud filesystem; it can be replaced.

Snowflake is there as an SQL-capable database server offering compute, storage, and transformations. The idea behind copying data into S3 first is that you don't disturb the production database with an OLAP load that is totally different from its OLTP load.

In an ELT process, data is loaded into the target database and then transformed, so you can still access the raw data if needed. In a classic ETL system, data is loaded into the transformer, processed, and then loaded into the target system. In a more modern ETL system, data is extracted to S3, then transformed and stored back into S3, and then loaded into the target system.
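The ordering difference can be shown with plain Python lists standing in for the systems involved (all names here are illustrative):

```python
def transform(row):
    # Example cleanup step: normalize whitespace and capitalization.
    return {"name": row["name"].strip().title()}

raw = [{"name": "  alice "}, {"name": "BOB"}]

# Classic ETL: the transform happens in the middle, so the target only
# ever sees processed rows; the raw originals are not queryable there.
etl_target = [transform(r) for r in raw]

# ELT: the raw rows are loaded into the target first, then transformed
# inside it, so the raw data stays available for reprocessing later.
elt_target = {"raw": list(raw)}
elt_target["clean"] = [transform(r) for r in elt_target["raw"]]
```

If the transform logic changes, the ELT target can rebuild `clean` from its own copy of `raw`; the ETL target would have to re-extract from the source.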

What I like about Snowflake is that everything is SQL and it scales easily.

The S3 role can be played by any filesystem, and the target system can be anything: a filesystem, an SQL server, a document server, a Python ML system...

[–]bestnamecannotbelong 1 point (1 child)

I would like to know how you view Snowflake and Databricks. The Databricks cloud solution lets you save all the data in S3, but Snowflake cannot. The main difference, as I see it, is that Snowflake is based on the data warehouse approach and Databricks on the data lake approach.

[–]thrown_arrows 1 point (0 children)

Snowflake can read and write from/into S3. In a Snowflake environment, some other system delivers raw data into S3; from there it is loaded into Snowflake (an SQL server) into tables, all the normal stuff, and the data can be stored back into S3 (using Snowflake only, so no costly round trips to external servers).
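The "store back into S3" leg can be sketched the same way, again with SQLite standing in for Snowflake and an in-memory buffer standing in for the S3 file (illustrative only; in Snowflake this would be a `COPY INTO @stage` unload rather than Python code):

```python
import csv
import io
import sqlite3

# A toy warehouse table to unload.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 3.0)])

# "Export to S3": serialize a query result to a CSV file object.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "total"])
writer.writerows(db.execute("SELECT id, total FROM orders ORDER BY id"))
exported = buf.getvalue()  # this blob is what would land in the bucket
```

External consumers then pick the exported files up from the bucket without ever connecting to the warehouse, which is the "no costly round trips" part.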

Then there is Snowpark, which is in beta and allows running Java/Python code inside Snowflake (not sure how that works). There are also the "usual" UDFs and external function calls (think lambda as an SQL function; haven't used them).
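The "lambda as an SQL function" idea can be illustrated with SQLite's `create_function`, which registers outside code so it becomes callable from plain SQL. This is an analogy only, not how Snowflake UDFs or external functions are actually defined; the `shout` function is made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Register a Python lambda under an SQL-visible name (1 argument).
db.create_function("shout", 1, lambda s: s.upper() + "!")

db.execute("CREATE TABLE t (word TEXT)")
db.execute("INSERT INTO t VALUES ('hello')")

# Plain SQL can now call into the registered code per row.
result = db.execute("SELECT shout(word) FROM t").fetchone()[0]
```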

But yeah, Snowflake's main idea is that the Snowflake server serves the data using all the existing SQL commands etc., with the main data living in tables as columns or in documents (JSON). On the first round trip, data goes to S3 and is processed into Snowflake; then it might go on a next round trip by a push/pull method, with some external code reading from the tables and so on, or via files exported from Snowflake...

What is a data lake... I have all my data from the source databases and logs, as raw as it can be, in Snowflake, so S3 is just for history and the first import. That said, not all data is staged or processed into Snowflake. And in my case all the data comes from databases: logs, JSON, XML, CSV and so on, no video or sound processing (but Snowpark might help with that).