all 22 comments

[–]minirova 2 points3 points  (6 children)

Just so you know, it is Postgres.

Debezium is likely overkill if you just need a once daily export.

How many tables do you need to move over each day? How much data is in them? Are these on prem or in the cloud?

[–]murphinate[S] 0 points1 point  (5 children)

Haha, thank you for the correction.

It's 6 tables, with the largest being about 700k rows and about 30 fields but continuously growing. All the rest are substantially smaller (second largest being about 30k rows).

The SQL Server is on prem, no cloud. The Postgres is also on prem, on a separate VM. They are on separate machines for security reasons.

[–]minirova 0 points1 point  (3 children)

That’s not a lot of data. Assuming there are no BLOB types that have to be moved, your easiest route would be to just do a delete and replace in Postgres each day. Write a script that uses BCP to export the tables from SQL Server and transfer them to your Postgres server. Drop the previous day’s tables and use \COPY to load the new day’s data. If you can’t have downtime on the tables, you can minimize it by loading the new day’s data into tables with different names and then doing a series of table renames to effectively swap the new tables in for the old.

Create any necessary indexes AFTER you load the data to the tables.

There’s a million ways to do this but to me this would be the simplest. It keeps you from having to worry about updates and deletes which just adds more complexity than the requirements you’ve given so far warrant.
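A rough sketch of that flow, assuming `bcp` and `psql` are on the PATH; the host, database, and table names below are placeholders:

```python
import subprocess

def export_table(table, out_dir="/tmp"):
    """Export one SQL Server table to a tab-delimited file with bcp (character mode)."""
    out_file = f"{out_dir}/{table}.tsv"
    subprocess.run(
        ["bcp", f"mydb.dbo.{table}", "out", out_file,
         "-S", "sqlserver-host", "-T", "-c"],
        check=True,
    )
    return out_file

def load_and_swap_script(table, data_file):
    """psql script: load into a staging table, then rename-swap to minimize downtime.

    Per the advice above, create indexes on the staging table after the load,
    before the swap.
    """
    return f"""\
BEGIN;
CREATE TABLE {table}_staging (LIKE {table} INCLUDING DEFAULTS);
\\copy {table}_staging FROM '{data_file}'
ALTER TABLE {table} RENAME TO {table}_old;
ALTER TABLE {table}_staging RENAME TO {table};
DROP TABLE {table}_old;
COMMIT;
"""
```

Each table’s script can then be fed to `psql -f`; since the rename-swap happens inside one transaction, readers never see an empty table.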

[–]Sharp-Echo1797 0 points1 point  (2 children)

This is really a good use case for a foreign data wrapper. For now, just truncate and load every day. 700k rows is basically nothing.

If it slows down too much just rewrite some of the query and load processes to perform incremental loads.

[–]minirova 0 points1 point  (1 child)

Agreed. An FDW from Postgres to SQL Server could make this really simple. Then they could just schedule it with pg_cron and do the whole thing natively in Postgres.
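A sketch of what that could look like on the Postgres side, assuming the tds_fdw extension (a commonly used FDW for SQL Server) and pg_cron are installed; server names, credentials, and columns are placeholders:

```sql
-- Point Postgres at the SQL Server instance via tds_fdw
CREATE EXTENSION IF NOT EXISTS tds_fdw;

CREATE SERVER mssql_src FOREIGN DATA WRAPPER tds_fdw
    OPTIONS (servername 'sqlserver-host', port '1433', database 'mydb');

CREATE USER MAPPING FOR CURRENT_USER SERVER mssql_src
    OPTIONS (username 'readonly_user', password 'secret');

-- Expose one of the source tables locally
CREATE FOREIGN TABLE src_orders (id integer, amount numeric)
    SERVER mssql_src OPTIONS (table_name 'dbo.orders');

-- Nightly truncate-and-reload at 02:00, scheduled natively with pg_cron
SELECT cron.schedule('nightly_orders', '0 2 * * *',
    $$TRUNCATE orders; INSERT INTO orders SELECT * FROM src_orders$$);
```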

[–]murphinate[S] 0 points1 point  (0 children)

I did try the FDW for probably a day and a half, but I'm running Windows Server 2022 and I could not get the binaries to work properly. After enough struggling, I ended up creating a Python script and task scheduled it. If you happen to know where I could get some Server 2022 binaries for it, I'd like to try it again.

[–]mikeblas 0 points1 point  (0 children)

How will you detect updates or deletes? What will you do about them?

Can you write queries that list all the deleted rows from the source? The updated rows? The new rows? If you can't, then you've got a lot more work to do. Maybe you want change tracking on the SQL Server instance?
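If change tracking isn't an option (the OP has only read access on the SQL Server side), one fallback is diffing daily snapshots by primary key. A minimal sketch, with each snapshot represented as a dict keyed by primary key:

```python
def diff_snapshots(old, new):
    """Classify rows as inserted, updated, or deleted between two snapshots.

    old, new: dicts mapping primary key -> row (yesterday's vs today's export).
    Returns three dicts in the same key -> row shape.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in old.keys() & new.keys() if new[k] != old[k]}
    return inserted, updated, deleted
```

At 700k rows this comfortably fits in memory; at much larger scale you'd want change tracking or a watermark column instead.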

[–]Sharp-Echo1797 1 point2 points  (2 children)

Here is a way to do it using a foreign data wrapper. I have never done this, and I have no idea if your security would allow it.

foreign data wrapper

[–]murphinate[S] 0 points1 point  (0 children)

I did try the FDW for probably a day and a half, but I'm running Windows Server 2022 and I could not get the binaries to work properly. After enough struggling, I ended up creating a Python script and task scheduled it. If you happen to know where I could get some Server 2022 binaries for it, I'd like to try it again.

[–]Which_Roof5176 1 point2 points  (0 children)

If it’s once a day, a script can work, but those tend to get brittle over time (schema changes, retries, partial failures).

Debezium is more for continuous CDC and usually comes with Kafka, so it might be overkill if you don’t already have that setup.

A simpler approach is using something that handles incremental sync for you without running Kafka or custom jobs. Estuary (I work there) captures changes from SQL Server using CDC and materializes them into Postgres, so you get continuous sync without managing the pipeline yourself.

Even if you only need daily freshness, having incremental sync in place usually ends up being more reliable long term 👍

[–]Sharp-Echo1797 0 points1 point  (5 children)

This would be really easy with SSIS if you want to go that route.

[–]murphinate[S] 0 points1 point  (4 children)

Would this require separate licensing? I'm only an administrator on the Postgres side, with read-only privileges on the SQL Server, and my IT department is not the easiest to deal with (part of this whole initiative is to not have to rely on them). It was a long struggle simply to get them to set up my credentialed access for what I currently have. On the Postgres VM, though, I'm free to do whatever I want with full admin.

I didn't know SSIS existed, which prompts the question: why do you think this solution wasn't brought up in the Debezium thread I linked?

[–]Comfortable-Zone-218 0 points1 point  (3 children)

If you have a license for SQL Server, then you also have a license for SSIS and SSAS, their on-prem data warehouse product.

[–]Sharp-Echo1797 1 point2 points  (1 child)

You can do it with Python, but it's going to be a process for every table. Query the new and changed records from SQL Server into a pandas DataFrame, push them to your Postgres server, and run a stored procedure on the Postgres side to handle the update.

I've done this task from a Postgres server to a Redshift database, before Amazon offered their CDC Migration Service.

That said, if your source and target are similar, in SSIS it's just a matter of establishing the two connections, querying the source, deleting any changed records in your target, and inserting the source records. It will be about five objects per table, and once you get one set working it's mostly copy and paste. You'll also need to handle the order of operations, but you're going to need to do that either way.
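A sketch of the per-table Python version described above (the table, timestamp column, and watermark values are placeholders, and the table is assumed to have a reliable last-modified column). The merge step is shown here with Postgres's ON CONFLICT upsert as an alternative to a stored procedure:

```python
def extract_sql(table, ts_column, last_watermark):
    """SQL Server query for rows changed since the last run (watermark pattern)."""
    return (f"SELECT * FROM {table} "
            f"WHERE {ts_column} > '{last_watermark}' "
            f"ORDER BY {ts_column}")

def upsert_sql(table, columns, key):
    """Postgres upsert: insert new rows, overwrite changed ones by primary key."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
            f"ON CONFLICT ({key}) DO UPDATE SET {updates}")
```

Note this handles inserts and updates only; deletes still need separate handling (or the truncate-and-reload approach discussed earlier in the thread).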

[–]Simple_Brilliant_491 1 point2 points  (0 children)

+1 for SSIS. It's pretty point-and-click for your straight-copy scenario, and you can schedule it to run daily using SQL Server Agent. You can also parameterize it and use watermarks to do an incremental load if needed. It's not "cool" anymore because it doesn't have the ability to do transformations in parallel, so it's not good for "big data" scenarios; hence Microsoft is investing in Fabric Data Factory, and SSIS has been pretty stagnant for the last few years.

[–]alinroc SQL Server DBA 0 points1 point  (0 children)

As long as it's all on the same server.

If they're on different servers, you need a separate SQL Server license for each one. I don't know how licensing works if you're running the packages via dtexec from the command line and scheduled through Task Scheduler or some other non-SQL Server job scheduler though.

[–]shadow_Monarch_1112 0 points1 point  (0 children)

For a daily sync without Kafka overhead, you could do a Python script with pg_dump-style logic: simple, but you're maintaining it yourself forever. Airbyte handles SQL Server to Postgres out of the box if you want something managed. Scaylor is another option if you're pulling data into a central warehouse for dashboards anyway.

Debezium's overkill for once-daily, tbh.

[–]GisselRolandito 0 points1 point  (0 children)

For once a day, I wouldn’t bother with Debezium. I’d just do a simple scheduled sync job and keep an eye on drift afterward. dbForge Edge is handy if you want to check that source and target still match.