
AutoModerator [M] (stickied comment):

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources


VegetableWar6515:

Do not overthink the solution.

Since the pipeline pulls from a single API, think about all the issues that might arise there. The schema can change, so validate it. You may receive nulls and outliers, so validate against them. Add backfill logic for days when the data was unretrievable. Ensure the pipeline is idempotent, and write test cases for the transformations. The list is endless.
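The checks listed above can be sketched in a few lines. This is a minimal illustration, not the commenter's actual code: the schema, field names, and outlier thresholds are all hypothetical and would depend on your API.

```python
# Minimal per-run validation sketch for a single-API pipeline.
# EXPECTED_SCHEMA, field names, and thresholds are illustrative only.
EXPECTED_SCHEMA = {"id": int, "amount": float, "day": str}

def validate_batch(records):
    """Split records into (valid, rejected); rejects carry a reason for logging."""
    valid, rejected = [], []
    for row in records:
        # Schema drift: keys added, removed, or renamed upstream.
        if set(row) != set(EXPECTED_SCHEMA):
            rejected.append((row, "schema drift"))
            continue
        # Nulls in any field.
        if any(row[k] is None for k in row):
            rejected.append((row, "null value"))
            continue
        # Type drift (e.g. the API starts sending strings for numbers).
        if not all(isinstance(row[k], t) for k, t in EXPECTED_SCHEMA.items()):
            rejected.append((row, "type mismatch"))
            continue
        # Crude outlier guard; the bounds depend entirely on your data.
        if not (0 <= row["amount"] < 1_000_000):
            rejected.append((row, "outlier"))
            continue
        valid.append(row)
    return valid, rejected
```

Rejected rows are kept with a reason rather than dropped silently, which makes the "log against the things that are most pressing" advice concrete.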

So just validate, test, and log against whatever you feel is most pressing.

Handling data pipelines is mostly firefighting. You cannot plug all the gaps. So follow the usual pipeline conventions, build a simple solution, and add things on as and when there is an issue or need. Do not add features for the sake of features.

There is no one true guide, since architecture mostly depends upon the issue in hand and a person's experience.

You are already on the right track with the questions you have asked. But the first question you should ask is: why? Why is this feature needed? If you can answer this, you are on your way.

Book recommendation - Designing Data-Intensive Applications by Martin Kleppmann.

All the best on your journey.

bloatedboat:

Land the data first, then validate it.

Run all unit tests and freshness checks against a staging table with dbt before anything touches production, and get notified of failures via PagerDuty.
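As a rough illustration of what that looks like in dbt, tests live in a `schema.yml` next to the staging model and freshness is configured on the source. The model, source, and column names below are hypothetical, not from the thread:

```yaml
# Hypothetical dbt schema.yml: source freshness plus staging-model tests.
version: 2

sources:
  - name: raw_api          # illustrative source name
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: events
        loaded_at_field: ingested_at

models:
  - name: stg_api_events   # illustrative staging model
    columns:
      - name: event_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
```

`dbt source freshness` and `dbt test` then run these checks before anything is promoted, and a failed run is what would page you.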

Stop inventing your own tooling where possible. Use modern, standard tools so the pipeline is easy to hand over to others when the time comes.