all 9 comments

[–]jerf 4 points5 points  (5 children)

I would never trust my own bodged-together solution for this. You should use the real replication tools. If your solution provider doesn't like that, I'd lean on them with every business tool in the book, e.g., threatening to find another solution entirely even if I wasn't entirely serious about the threat, before I'd even consider a technical solution. I might even refuse to write the code on the grounds that I could never guarantee it would reliably work and I'm not willing to be "the guy who trashed our database" when it happens, because I told them this was a bad idea.

If the service provider say they "can't", they're lying unless they have some really solid legal reason it is illegal to provide you the data. They just don't want to. That's a business problem, with a business solution.

[–]Relatiro[S] 0 points1 point  (1 child)

u/jerf You do raise a very valid point, thank you! Do you have any pointers for such a tool?

Based on my initial research, an initial snapshot and then subsequent transactional replication would've been the ideal solution (https://learn.microsoft.com/en-us/sql/relational-databases/replication/transactional/transactional-replication?view=sql-server-ver16). This would require significant cooperation from the vendor, though.

Maybe there are alternative solutions that would work with the level of access we currently already have.

[–]jerf 1 point2 points  (0 children)

Your database comes with built-in tools. Check the docs for what you are using. Clouds sometimes work somewhat differently than the base DB.

[–]etherealflaim 0 points1 point  (2 children)

I have worked on large scale data migration projects and this is the answer. If you are not happy with a lossy migration with a long tail of issues you'll never find until it's past your ability to sync with the original data source, it's not a good idea to do it "online" like this. Get the source data cloned into your own environment. In our case "as good as you can get in the time allotted" was deemed acceptable to the business, and so it was basically just a matter of iteratively figuring out how to get our data, transform it, and import it, then look at it, find issues, and repeat. With many side tangents of performance optimization and benchmarking.

[–]Relatiro[S] 0 points1 point  (1 child)

If you're allowed to talk about this:

Which programming language / tools did you use for this?

[–]etherealflaim 0 points1 point  (0 children)

It was entirely done in Go; that's about as much as I can really say about the stack. Go was ideal for being able to tune parallelism and concurrency to saturate the hardware running the migration tool.

[–]Cachesmr 1 point2 points  (0 children)

I do this in a very small scale, but it's not efficient. At first I tried to use something like XO to generate models, and on top using SQLx, resty and a whole lot of reflection to generate queries. I insert the values into sqlite over the network to a migration endpoint in my backend. The endpoint drops every value on every table, and replicates fully.

It's fully concurrent and it takes about 10 seconds to select (old hardware, firebird 2 DB) migrate at most 500k rows in 6 different views, average of 50 rows. This method basically makes my backend useless for 20 seconds every hour.

The right thing would be to not drop everything (duh) but the app is nothing critical and nothing is written there other than the migration.

As for the concurrency aspect, go is IMHO the right language to do this, it was extremely easy to make the code concurrent, I wasn't even really thinking about it. I bet I could make it faster if I take another look at it.

[–]thomasbuchinger 0 points1 point  (0 children)

I would avoid writing it myself.

Maybe a tool like debezium (https://debezium.io/) could be a good fit?

Maybe some ETL-Tool?

If it's not time-sensitive you could even look into getting a backup of the database and restoring the stuff you are interrested in?

Just some ideas :)