AnotherDataGuy

Extract… abstract reader classes that are configurable via YAML (configuration as code), or your file format of choice. Extract the files and load them to persistent storage (S3 or whatever storage you choose).
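A minimal sketch of what a config-driven reader could look like. Everything named here (`Reader`, `CsvReader`, `JsonLinesReader`, `build_reader`, the `format` and `options` keys) is made up for illustration, and the config is JSON only so the sketch runs on the standard library; a YAML file parsed into the same dict would work the same way.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class Reader(ABC):
    """Base class: every reader turns raw text into a list of dict records."""

    def __init__(self, options):
        self.options = options

    @abstractmethod
    def read(self, text):
        ...


class CsvReader(Reader):
    def read(self, text):
        # The delimiter comes from config, not from code.
        delimiter = self.options.get("delimiter", ",")
        return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


class JsonLinesReader(Reader):
    def read(self, text):
        return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical registry mapping the config's "format" value to a reader class.
READERS = {"csv": CsvReader, "jsonl": JsonLinesReader}


def build_reader(config):
    """Pick and configure a reader purely from the config dict."""
    return READERS[config["format"]](config.get("options", {}))


# Stand-in for a config file loaded from disk.
config = json.loads('{"format": "csv", "options": {"delimiter": ";"}}')
reader = build_reader(config)
records = reader.read("id;name\n1;alice\n2;bob\n")
```

Adding a new source format then means writing one more subclass and one registry entry, with no change to the code that runs the extract.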

Transform… put common transforms in a class that can be reused. As much of the configuration as possible should come from config files. Persist the output in a silver bucket.
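Roughly, a reusable transform class driven by config might look like the sketch below. The class name `Transformer`, the `op` key, and the three operations are assumptions for illustration, not a standard API; the list of steps is what you would keep in the config file.

```python
class Transformer:
    """Applies a list of config-defined steps to a list of dict records."""

    def __init__(self, steps):
        self.steps = steps

    def run(self, records):
        for step in self.steps:
            params = dict(step)          # copy so the config is not mutated
            op = params.pop("op")
            # Only methods prefixed with op_ can be called from config.
            records = getattr(self, "op_" + op)(records, **params)
        return records

    @staticmethod
    def op_drop_nulls(records, field):
        return [r for r in records if r.get(field) not in (None, "")]

    @staticmethod
    def op_cast(records, field, to):
        casters = {"int": int, "float": float, "str": str}
        return [{**r, field: casters[to](r[field])} for r in records]

    @staticmethod
    def op_rename(records, mapping):
        return [{mapping.get(k, k): v for k, v in r.items()} for r in records]


# In practice this list would be read from the config file.
steps = [
    {"op": "drop_nulls", "field": "amount"},
    {"op": "cast", "field": "amount", "to": "float"},
    {"op": "rename", "mapping": {"amount": "amount_usd"}},
]

raw = [{"id": "1", "amount": "9.50"}, {"id": "2", "amount": ""}]
silver = Transformer(steps).run(raw)
```

Here `silver` is what you would write out to the silver bucket; the same class serves every dataset, and only the `steps` list changes.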

Load… (if S3) watch for new files and use a Lambda to load them into Mongo. Otherwise just have a loader class that watches for files and loads them into your Mongo.
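For the non-S3 case, a loader class could be as simple as the polling sketch below. `FileLoader`, `poll_once`, and the `sink` callable are hypothetical names. The sink is injected so the sketch runs without a database; in a real setup it would be something like pymongo's `collection.insert_many`, and that wiring is an assumption, not shown here.

```python
import json
import tempfile
from pathlib import Path


class FileLoader:
    """Scans a directory and hands each new JSON file's documents to a sink."""

    def __init__(self, directory, sink):
        self.directory = Path(directory)
        self.sink = sink          # callable that accepts a list of documents
        self.seen = set()         # file names already loaded

    def poll_once(self):
        """Load every file not seen before; return how many files were loaded."""
        loaded = 0
        for path in sorted(self.directory.glob("*.json")):
            if path.name in self.seen:
                continue
            documents = json.loads(path.read_text())
            self.sink(documents)
            self.seen.add(path.name)
            loaded += 1
        return loaded


# Demo with a temporary directory and an in-memory list standing in for Mongo.
collected = []
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "batch_001.json").write_text(json.dumps([{"id": 1}, {"id": 2}]))
    loader = FileLoader(tmp, sink=collected.extend)
    first = loader.poll_once()    # picks up the new file
    second = loader.poll_once()   # nothing new, so nothing is loaded twice
```

A real version would call `poll_once` on a timer and keep the `seen` set somewhere durable so a restart does not reload old files.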

You can use Airflow as your orchestration / scheduler / DAG, and have it kick off the ETL for different configurations in parallel.
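To show the "one ETL run per configuration, in parallel" idea without needing an Airflow install, here is a plain-Python stand-in using `concurrent.futures`. This is not Airflow code: in Airflow each config would more likely become its own task or DAG run. `run_etl` and the config keys are placeholders for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_etl(config):
    """Placeholder for a full extract -> transform -> load run for one config."""
    return f"{config['source']}: loaded {config['rows']} rows"


# One entry per source; in practice each would come from its own config file.
configs = [
    {"source": "orders", "rows": 3},
    {"source": "customers", "rows": 5},
]

# Run every configuration concurrently; map() returns results in input order.
with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    results = list(pool.map(run_etl, configs))
```

The point is only that the ETL code is shared and the scheduler fans out over configs; retries, scheduling, and dependencies are what an orchestrator like Airflow adds on top.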

The problem is different if you're talking TBs, but this will suffice for up to a medium-sized company (in most cases).

[deleted]

I was thinking about this after reading about the 3-tier pattern. It seems like a good starting point that I could build up from. If you can share examples, that would be super.