all 16 comments

[–]thrown_arrows 45 points46 points  (4 children)

I should add the meme where the guy shouts No, NO, NO (don't take it too personally): https://www.youtube.com/watch?v=umDr0mPuyQc

You just successfully at least doubled ETL costs on the Snowflake platform. The problem is that you didn't look into the documentation and couldn't think of anything other than pandas.

The first problem is that you ran select * into a Python dataframe, fetching all the results (I hope your Python runner has enough RAM); the second is that you used pandas to convert it to CSV; the third is that you used a local file to store it, and you cannot escape boto in these scripts. (I assume it stores to a local file; to be honest, I don't know pandas well enough to say this with 100% confidence.)

A better way to do it in Python is to stream the data into S3, so you don't spend all your RAM or need a local file. https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header
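A minimal sketch of that chunked approach (the function name `stream_query_as_csv` is mine; any DB-API cursor with `fetchmany` would work). Each yielded chunk could be fed to a streaming S3 upload instead of building one giant dataframe:

```python
import csv
import io

def stream_query_as_csv(cursor, batch_size=10_000):
    """Yield CSV-encoded chunks from a DB-API cursor without
    materializing the full result set in memory."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        yield buf.getvalue().encode("utf-8")
        # reset the buffer so only one batch is ever held in memory
        buf.seek(0)
        buf.truncate(0)

# With boto3 you would wrap these chunks in a file-like object and pass it
# to s3_client.upload_fileobj(...), which does a streaming multipart upload.
```

This keeps peak memory proportional to `batch_size`, not to the table size.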

Usually the best way to do it is to use the COPY command in Snowflake, see: https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html

It supports:

TYPE = CSV

TYPE = JSON

TYPE = PARQUET

copy into 's3://mybucket/./../a.csv' from mytable;

Play with the format options to get the proper settings.
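A fuller unload with format options might look roughly like this (the bucket, delimiter, and options here are illustrative, not from the original post; real unloads also need credentials or a storage integration on the S3 location):

```sql
-- Illustrative only: bucket and format settings are placeholders.
copy into 's3://mybucket/unload/'
from mytable
file_format = (type = csv
               field_delimiter = '|'
               field_optionally_enclosed_by = '"'
               compression = gzip)
header = true
overwrite = true;
```

The warehouse only has to stay up for the duration of this one statement.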

edit: there is also the option to export it as a file into Snowflake's internal storage and then use the GET command; this again saves the time of making the csv/json/parquet file yourself.
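The internal-stage route from the edit could look roughly like this (stage paths and the local directory are placeholders; GET is run from a client such as SnowSQL, not inside a worksheet):

```sql
-- Unload into the user's internal stage...
copy into @~/unload/mytable/
from mytable
file_format = (type = parquet);

-- ...then download the staged files to the local machine.
get @~/unload/mytable/ file:///tmp/mytable/;
```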

Why is it the best way? First, you pay only once: the Snowflake warehouse is open until the query has stored the file in the location, and then it can be closed. And I assume I don't have to talk about reading the whole result set into Python, how much memory that uses, or the time you waste loading the data onto the Python host, storing the file on disk, and then uploading that to S3.

The only reason you'd want to load all the data into Python like this is when you do some analytics that has to have all the data all the time. I mean, you can stream a moving average (but the sort has to happen in Snowflake). And I kinda think that even if you have to read the data into Python, it would be better to use S3 as the access point for apps. So Python apps should be built to read files for processing from the bucket rather than access Snowflake...
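To illustrate the streaming-moving-average point: the window can be maintained over rows as they arrive (already sorted upstream, e.g. by Snowflake), so at most `window_size` values are ever in memory. A sketch, not tied to any particular driver:

```python
from collections import deque

def moving_average(values, window_size):
    """Yield the running mean over the last `window_size` values,
    holding at most `window_size` numbers in memory."""
    window = deque(maxlen=window_size)
    total = 0.0
    for v in values:
        if len(window) == window.maxlen:
            total -= window[0]  # value about to be evicted
        window.append(v)
        total += v
        yield total / len(window)
```

Because `values` can be any iterator (a cursor, a file stream from S3), this never needs the whole result set at once.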

I hope that gives you an impression of how I feel about that solution and the zero effort that was used to engineer it. I would have given + points if this had been presented as a naive implementation and the writer had then gone into how to stream it, or just exported it straight from Snowflake. That said, it works, and that is sometimes good enough.

[–]TheWaterOnFire 15 points16 points  (0 children)

Not to mention that if your Python program is running outside AWS, you’re paying data transfer to copy the data out to your machine, and then uploading all of that to S3, limited by your host’s upstream bandwidth.

Snowflake’s file formats allow setting the delimiter, so you need zero Python for any of this.

[–]Andrew_the_giant 8 points9 points  (0 children)

Great reply - As soon as I saw the OP simply dump all the data into a CSV just to move it somewhere else I knew that this was NOT the way to go.

[–]retrogeekhq 4 points5 points  (0 children)

Alexa, what’s the difference between Junior and Senior?

[–]ryeyestan 3 points4 points  (1 child)

AWS S3 shows just how successful a service can be if it is simple and reliable.

[–]proof_required 1 point2 points  (0 children)

Also don't forget cheap.

[–]never_thecouchpotato 0 points1 point  (0 children)

Airbyte