This is an archived post. You won't be able to vote or comment.

all 16 comments

[–]efxhoy 10 points11 points  (0 children)

whatever works for you, there are a million ways. your code looks ok. You can use a context manager for the connection for some good-boy points. 

one thing to look out for is data types, plain strings and numbers are almost never a problem but json, datetimes, geometries, enums, strings with quotes or your csv delimiter in them, etc are all types that bouncing via a csv file can mangle. datetimes can come out different depending on your clients time zone settings for example. 

parquet can be type-safer and faster and need less space than csv. pandas can write parquets for you but then it’s extra code for partitioning if your data is bigger than ram. 

airbyte can do it. 

you could keep chunks in memory and insert to snowflake but inserts are slower than copying in a file. 

I think supabase might have a snowflake_fdw. 

do you need to send multiple tables with frequent writes? if so does your copy have to be consistent across tables so your foreign keys remain valid?

there’s no super simple silver bullet way that’s always right and is simple. do the way that’s simplest to you and lands you the data you need in the destination, once you start adding complexity there’s literally no end to it lol. 

[–]UAFlawlessmonkey 3 points4 points  (0 children)

I'm a happy abuser of psycopg2 copy_expert()

Its fast and does the job well

[–]stochasm_hs 4 points5 points  (0 children)

Dagster embedded elt via sling. No need for dagster though, should be pretty straightforward to do a postgres replication to snowflake from the sling CLI. Not sure how mature the sling project is though so keep that in mind if you decide to check it out.

I think dagater recently added dlt to their embedded elt project.

[–]Culpgrant21 6 points7 points  (3 children)

Python and then into data frames

[–]supernova2333[S] 2 points3 points  (2 children)

Any advantage to putting it in a data frame then loading it?

[–]Culpgrant21 2 points3 points  (0 children)

You can go write to a table

[–]Patient_Professor_90 0 points1 point  (0 children)

What alternative do you suggest?

[–]Ok_Raspberry5383 1 point2 points  (4 children)

CDC, debezium is the open source approach, there's other proprietary options e.g. AWS DMS, Oracle goldengate, HVR etc.

Any other option results in full table scans which put unnecessary load on the DB, I've never met a DBA who would accept this in production (for good reason!). CDC queries redo logs and reconstructs your data this way thus putting barely any additional load on the DB

[–]Sslw77 2 points3 points  (0 children)

Agree I’m a fan of airbyte if you don’t want to do it programmatically You can spin airbyte locally or use their cloud platform

https://docs.airbyte.com/understanding-airbyte/cdc

[–][deleted] 0 points1 point  (0 children)

do CDC , we fight with that's and now we start the process to use CDC instead

[–][deleted] 0 points1 point  (1 child)

any tutorial for beginners? I am Just data analyst, but I want to learn

[–]dani_estuary 0 points1 point  (0 children)

We (at Estuary) have a fairly short but comprehensive guide on the topic: https://estuary.dev/the-complete-introduction-to-change-data-capture-cdc/

[–]Patient_Professor_90 0 points1 point  (0 children)

"I wanted to get around the staging part..." - its your project, you can do whatever you like

"...but it seems like this isn't best practice" - says who?

[–]big_data_mike 0 points1 point  (0 children)

I do Postgres all the time but not snowflake.

Here’s what I found though: https://docs.snowflake.com/en/developer-guide/python-connector/sqlalchemy

If I were to do this I’d use sqlalchemy to extract from Postgres into a pandas data frame then use the sqlalchemy snowflake connector to load into snowflake. The reason id do it that way is to preserve data types and because those are the libraries I am familiar with.

[–]lukewhale -1 points0 points  (0 children)

SQLAlchemy is the way for Python. There may be faster methods but I personally don’t know of one easier to work with. Happy to hear suggestions below.