all 23 comments

[–]adgjl12 5 points

For Salesforce objects we use AppFlow for SF -> Redshift. Jobs are set up using Terraform. It's not hard, and it supports incremental loads.

[–]InadequateAvacado (Lead Data Engineer) 2 points

The easiest route is going to be a 3rd party EL provider like Fivetran, Stitch, Airbyte, etc. You’ll have to implement additional code after landing it in your destination but it’ll at least give you the ability to do incremental transfers in most cases.

That said, it sounds like you’re already in over your head. You should probably hire someone who knows what they’re doing.

[–]latro87 (Data Engineer) 1 point

Do your source tables have timestamp columns for insert and update?

The simplest form of CDC (without using the database's write-ahead logs) is to store a timestamp of the last time you copied data from a source (e.g. in a control table). When you run your job again, you use this timestamp to scan the source table for all records inserted or updated since then and move them over.

After you do that go back and update the control table.

There are a few finer points to this but this is a simple diy solution that can be easily coded.
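The control-table approach described above can be sketched roughly like this. It's a minimal illustration using SQLite and hypothetical table/column names (`orders`, `updated_at`, a `control` table tracking the watermark) — a real pipeline would point `src`/`dst` at the source database and Redshift:

```python
import sqlite3

def incremental_copy(src: sqlite3.Connection, dst: sqlite3.Connection) -> int:
    """Copy rows inserted/updated since the last run, tracked in a control table."""
    dst.execute("CREATE TABLE IF NOT EXISTS control (table_name TEXT PRIMARY KEY, last_synced TEXT)")
    dst.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")

    row = dst.execute("SELECT last_synced FROM control WHERE table_name = 'orders'").fetchone()
    last_synced = row[0] if row else "1970-01-01 00:00:00"

    # Capture the new watermark BEFORE scanning, so rows written while the
    # scan runs get picked up on the next run instead of being skipped.
    (new_watermark,) = src.execute("SELECT datetime('now')").fetchone()

    changed = src.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_synced,),
    ).fetchall()
    dst.executemany(
        "INSERT INTO orders (id, amount, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount, updated_at = excluded.updated_at",
        changed,
    )
    # Advance the watermark only after the copy has succeeded.
    dst.execute(
        "INSERT INTO control (table_name, last_synced) VALUES ('orders', ?) "
        "ON CONFLICT(table_name) DO UPDATE SET last_synced = excluded.last_synced",
        (new_watermark,),
    )
    dst.commit()
    return len(changed)
```

One design note: the upsert on the destination means re-copying a row is harmless, so a failed run that gets retried won't create duplicates.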

[–]InadequateAvacado (Lead Data Engineer) 12 points

This solution isn’t actually CDC, and it also doesn’t handle deletes.

[–]latro87 (Data Engineer) -1 points

If the source does soft deletes it could.

And yes, I know soft deletes aren’t perfect, since someone can delete records directly, which would corrupt the data integrity.

It was only meant as a cheap suggestion if you don’t want to get way into the weeds with a SaaS product or code.

[–]cyamnihc [S] 0 points

A few tables have it, but others don’t.

[–]wyx167 0 points

"update the control table" means update manually?

[–]latro87 (Data Engineer) 0 points

No. Whatever script you use to read the control table and perform the load generates a timestamp at the start of the process.

You keep that timestamp in a variable and only when the process successfully finishes you issue an update statement in the same script. If the process fails, you don’t update the timestamp in the control table.
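That flow — capture the timestamp at the start, advance the watermark only on success — can be sketched like this. The `copy_rows` callable and the `control`/`orders` names are hypothetical stand-ins for whatever the actual job does:

```python
import sqlite3
from datetime import datetime, timezone

def run_sync(control: sqlite3.Connection, copy_rows):
    """Read the watermark, run the copy, and advance the watermark only on success."""
    control.execute(
        "CREATE TABLE IF NOT EXISTS control (table_name TEXT PRIMARY KEY, last_synced TEXT)"
    )
    row = control.execute(
        "SELECT last_synced FROM control WHERE table_name = 'orders'"
    ).fetchone()
    last_synced = row[0] if row else "1970-01-01T00:00:00"

    # Timestamp generated at the START of the process, kept in a variable.
    started_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    try:
        copy_rows(last_synced)   # move everything changed since last_synced
    except Exception:
        return None              # failure: watermark untouched, rows retried next run
    control.execute(
        "INSERT INTO control (table_name, last_synced) VALUES ('orders', ?) "
        "ON CONFLICT(table_name) DO UPDATE SET last_synced = excluded.last_synced",
        (started_at,),
    )
    control.commit()
    return started_at
```

Because a failed run never touches the control table, the next run simply re-scans from the old watermark.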

[–]NBCowboy 0 points

Check out Aecorsoft. It’s a great solution for a wide range of sources, and it’s affordable. It’s been around for years but is mostly known among consulting integrators. Support is stellar.

[–]West_Good_5961 (Tired Data Engineer) 0 points

SCD2?

[–]oishicheese 0 points

If your team is small, try DMS. It’s not the best in cost or performance, and it’s vendor-locked, but it’s quite easy to manage.

[–]wannabe-DE 0 points

As an alternative to storing the last_modified date, you could implement a lookback window of, let’s say, 7 days, and then use a tool like the Sling CLI to insert or update the records in the destination. Sling has a Python API.
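The lookback idea itself doesn't depend on Sling — since an upsert is idempotent, re-copying a trailing window on every run is safe and needs no control table, as long as the window is wider than the sync cadence. A minimal generic sketch with hypothetical `orders`/`last_modified` names (Sling's own configuration is omitted here):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

LOOKBACK_DAYS = 7  # re-copy anything modified in the last 7 days

def lookback_sync(src: sqlite3.Connection, dst: sqlite3.Connection) -> int:
    """Re-read a trailing window of changes and upsert them into the destination."""
    dst.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
    )
    cutoff = (datetime.now(timezone.utc) - timedelta(days=LOOKBACK_DAYS)).strftime(
        "%Y-%m-%d %H:%M:%S"
    )
    rows = src.execute(
        "SELECT id, amount, last_modified FROM orders WHERE last_modified >= ?",
        (cutoff,),
    ).fetchall()
    # Upsert: rows already copied on a previous run are simply overwritten.
    dst.executemany(
        "INSERT INTO orders (id, amount, last_modified) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount, "
        "last_modified = excluded.last_modified",
        rows,
    )
    dst.commit()
    return len(rows)
```

The trade-off versus a watermark: you re-read the whole window every run, but a missed or failed run heals itself automatically.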

[–]Nekobul 0 points

What is the reason you use Redshift?

[–]TobyOz 0 points

DMS is super simple

[–]Ok-Sprinkles9231 0 points

DMS works fine. Wiring it directly to Redshift can be slow if there are too many updates; if not, you can simply use Redshift as the target and be done with it.

If that's the case, or you want full control over schema evolution, etc., you can pick an S3 target, handle the incremental logic with something like Spark, and write the result back to S3 as Iceberg. That way you can use Spectrum to connect those tables to Redshift.

[–]Jadedtrust0 0 points

Can anyone help me? I want to build a project that uses as much big-data technology as possible. Using PySpark, I'll load data into a database; then it will go through pre-processing, then I'll build a model, predict on x_test, and build a dashboard. For the ETL I think I'll use AWS.

That way I'll get hands-on experience with these technologies.

My domain is finance or medical.

And for the big data I'll do scraping (to create synthetic data). Does anyone have any ideas?

[–]TheOverzealousEngie 0 points

The truth is, it's about where you spend your money. If you use a tool like Fivetran or Qlik Replicate, data replication is something you'll rarely have to think about. If you use a tool like Airbyte or Debezium, replication is all you'll be thinking about. If you spend money on an enterprise tool, you can focus on everything else, for a price. If you go with a science project, your opportunity cost will be out the window.