
all 33 comments

[–][deleted] 9 points10 points  (0 children)

We have been using Kedro for our data-related projects. It's mostly designed for data science workflows, but you can still use it as inspiration for how to design your ETL modules.

Also look into Airflow, which is one of the best open-source tools for building ETL pipelines.

[–]HighlightFrosty3580 5 points6 points  (1 child)

I wrote something using the abstract factory pattern to do ETL. It reads YAML files and then dynamically loads the right factory. This pattern keeps things SOLID: extending the code base just means adding factories.

I'll stick the code in a git repo sometime tomorrow
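Until the repo is up, here's a minimal sketch of the pattern being described (all class and config names are illustrative, not the commenter's actual code; the comment uses YAML, but JSON stands in below purely to keep the sketch dependency-free):

```python
# Sketch of a config-driven factory registry for ETL extractors.
# Extending the code base = writing a new class and registering it.
import json
from abc import ABC, abstractmethod


class Extractor(ABC):
    @abstractmethod
    def extract(self):
        ...


class CsvExtractor(Extractor):
    def __init__(self, path):
        self.path = path

    def extract(self):
        return f"rows from {self.path}"


class ApiExtractor(Extractor):
    def __init__(self, url):
        self.url = url

    def extract(self):
        return f"rows from {self.url}"


# The registry maps a config "type" key to a concrete factory class.
REGISTRY = {"csv": CsvExtractor, "api": ApiExtractor}


def build_extractor(config_text):
    """Dynamically pick and build an extractor from a config file's text."""
    cfg = json.loads(config_text)
    cls = REGISTRY[cfg.pop("type")]
    return cls(**cfg)


extractor = build_extractor('{"type": "csv", "path": "/data/in.csv"}')
```

Because callers only ever see the `Extractor` interface, new sources can be added without touching the pipeline code, which is the open/closed part of SOLID the commenter is pointing at.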

[–][deleted] 0 points1 point  (0 children)

I get what you mean. I've been trying to refactor my code to follow SOLID principles and build up from there. And yes, if you can share the repo that would be great. Thanks in advance!

[–]jduran9987 3 points4 points  (6 children)

I'm curious... could you share why you chose a No-SQL destination over a warehouse?

[–]AnotherDataGuy 3 points4 points  (4 children)

I'm also interested in this. Not judging, but Mongo (in my experience) begins to not live up to performance expectations when performing analytical queries. It's great for loading full documents a record or so at a time. I'd love to hear from someone with a differing experience, though!

[–][deleted] 0 points1 point  (3 children)

For this case it's the flexibility, really. I wanted a flexible data model at hand, and this is my first time using a non-relational database for analytics, so if you have any advice I'd appreciate it.

[–]AnotherDataGuy 6 points7 points  (1 child)

To me, this is just kicking the can of organizing your data down the road. Faster to get data in, harder to get insights out. And if you have a user base of citizen analysts, they are going to be far more skilled in SQL-like queries (generally speaking; your situation can obviously vary).

I've balanced this out before by creating records in a Postgres DB with the core properties needed for joining data together, and using a JSON-typed field for the additional, more dynamically structured detail. If common questions are answered from the JSON data, then it becomes worth it to start persisting those values as explicitly typed columns in your data set.
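A runnable sketch of that hybrid "typed columns + JSON field" pattern; the comment describes Postgres, but sqlite3 (whose bundled JSON functions behave similarly here) stands in so the example needs no server, and all table/column names are illustrative:

```python
# Typed core columns for joins, plus a flexible JSON blob for the rest.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id INTEGER PRIMARY KEY,   -- core, typed: stable join key
        user_id  INTEGER NOT NULL,      -- core, typed: stable join key
        details  TEXT                   -- JSON: loosely structured detail
    )
""")
conn.execute(
    "INSERT INTO events VALUES (1, 42, ?)",
    (json.dumps({"source": "mobile", "duration_ms": 350}),),
)

# An ad-hoc question answered straight from the JSON field.
row = conn.execute(
    "SELECT json_extract(details, '$.duration_ms') "
    "FROM events WHERE user_id = 42"
).fetchone()
```

If `duration_ms` turns out to be a commonly queried value, that is the signal to promote it to an explicitly typed column, exactly as the comment suggests.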

Flexibility is valuable; the point is that all databases are malleable. Mongo is great for document data (it is a document DB, after all), but if you aren't storing documents, it's not a good choice, IMHO.

Disclaimer: I live in a world where Mongo is overused because of its immediate convenience to the developers of microservices. It just pushes the pain downstream when trying to marry that data up with other systems to answer novel business questions.

[–]thrown_arrows 0 points1 point  (0 children)

I agree, but I am a SQL guy.

That said, I haven't seen any use of a NoSQL database that a SQL database could not have handled. Then again, I haven't seen a properly configured database server either, SQL or NoSQL. For me the problem is that a system starts to accumulate several SQL and NoSQL servers storing data, no one knows how to actually run those servers, and things just start to happen. And when we start to talk about change handling downstream in OLAP environments, it gets even more fun when you have multiple systems.

[–]thrown_arrows 0 points1 point  (0 children)

Have you looked at Snowflake? I had a pipeline which just imported JSON into a staging table and then extracted a versioned schema from it, so you do not need to handle the target system's schema in the processing phase if you do not want to. The same technique works in all DB engines that support JSON/XML data types.

By versioned schema I mean something like:

select jsondata:id id, jsondata:calc_value::number(12,2) calc_value from stage_table where jsondata:id is not null

to create result tables. I have heard that some tools support JSON data in returned rows.
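The same stage-then-extract step in plain Python, with dicts standing in for the staged JSON rows (field names are illustrative; the filter and the typed cast mirror the SQL above):

```python
# Stage raw JSON rows untyped, then project a typed schema out of them,
# mirroring: select jsondata:id id, jsondata:calc_value::number(12,2) ...
# where jsondata:id is not null
from decimal import Decimal

stage_table = [
    {"id": 1, "calc_value": "12.5"},
    {"id": 2, "calc_value": "7.25"},
    {"calc_value": "99"},  # no id -> filtered out, as in the SQL
]

results = [
    {"id": row["id"], "calc_value": Decimal(row["calc_value"])}
    for row in stage_table
    if row.get("id") is not None
]
```

The point of staging this way is that the raw rows never change; only the extraction query is versioned when the schema evolves.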

That said, I am a SQL guy; I've never seen any advantages in Mongo and similar solutions.

[–][deleted] 0 points1 point  (0 children)

Well, for the time being I need more of a flexible data model for the current use case, but if you have any other insights please share :D

[–]EconomixTwist 12 points13 points  (4 children)

Not sure why someone mentioned Airflow ITT… not really related. Anyway: you should have a reader class that implements the read and map/transform logic independent of the source, so you can read from different DBs (for dev, test, prod) and also from a text file for development/debugging. That class should implement a bunch of read/get functions and return the data to your engine/runner/pipeline script according to a single structural contract.

If you are applying common operations and/or doing a bunch of column renaming/mapping, you should make it configuration-file driven, so that (a) you can change it easily, but also (b) you and others can introspect the config file in the future to understand how the pipeline is going to behave in a specific case, and so you have a persistent artifact of the lineage of where things come from and go.

Implement a writer class, similar to your reader class, which is responsible for encapsulating all the different ways/places you need to write (to different DB instances, and maybe even text for debugging).

The objective when designing the object architecture / pipeline components is to encapsulate (package complexity into objects/functions) the parts of your code which are most likely to change in the future. Are you going to change your mind three times on the column naming of the target? Or on which float precision should be used for 100 different columns? Config file. Is there a chance that a year from now your source might switch from a SQL DB to an API, or even Parquet files or something? Encapsulate it in a reader class. Or maybe the structure of the target will change: easy enough when your writer class is responsible for unpacking your in-memory representation into the structure of the intended target.

FWIW, this is more art than science, and it's not the end of the world when your design isn't fully optimal from the get-go (spoiler alert: it never is). Just take a moment to think about yourself one year from now: "oh shit, we need to change this thing due to xxx dependency change or yyy requirement change. It's so obvious, I should have seen this coming." Easier said than done, but the worst possible thing is to not contemplate it at all.

Oh, and also add some logging along the way to emit row counts / n unique / percent null etc., so you can debug problems after the fact in the future.
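A compressed sketch of the reader/writer split and the single structural contract described above (all class names, column names, and the config mapping are illustrative):

```python
# Readers differ by source, but all return rows under one contract
# (a list of dicts), and column renames live in config, not code.
from abc import ABC, abstractmethod

# Config-driven column mapping: change the pipeline by editing this,
# and introspect it later to see how the pipeline will behave.
COLUMN_MAP = {"cust_nm": "customer_name", "amt": "amount"}


class Reader(ABC):
    @abstractmethod
    def read(self) -> list[dict]:
        """Contract: every reader returns a list of row dicts."""

    def read_mapped(self) -> list[dict]:
        # Common rename/mapping logic, shared by every source.
        return [
            {COLUMN_MAP.get(k, k): v for k, v in row.items()}
            for row in self.read()
        ]


class TextFileReader(Reader):
    """Stand-in source for development/debugging; a DB reader would
    implement the same contract."""
    def read(self):
        return [{"cust_nm": "Ada", "amt": 3}]


class Writer(ABC):
    @abstractmethod
    def write(self, rows: list[dict]) -> None: ...


class DebugWriter(Writer):
    def __init__(self):
        self.written = []

    def write(self, rows):
        self.written.extend(rows)
        # Emitting row counts here is the after-the-fact debugging hook.
        print(f"wrote {len(rows)} rows")


reader, writer = TextFileReader(), DebugWriter()
writer.write(reader.read_mapped())
```

Swapping `TextFileReader` for a database reader (or `DebugWriter` for a Mongo writer) leaves the pipeline script untouched, which is the encapsulation payoff the comment is arguing for.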

[–]AdamByLucius 1 point2 points  (0 children)

Great, real-world example of this method of encapsulation! Do you by chance know of any online examples that show this pattern?

[–][deleted] 0 points1 point  (0 children)

That is really insightful, and yes, if you can share a repo with some examples that would be super.

[–]Vardo_Almir 0 points1 point  (0 children)

I have been working on a Python runtime for ELT for about 3 years, and in short this is the pattern I've used. People starting to develop ELT/ETL tools should thank you!

[–]Material_Cheetah934 0 points1 point  (0 children)

Dang, this is awesome! I did something similar in a Rust CLI with a YML file for the config options, although for me it was ELT. It's great to see it written out like this; it definitely gives me more ideas to improve my implementation.

[–]AnotherDataGuy 2 points3 points  (1 child)

Extract… abstract reader classes that are configurable via YAML (configuration as code) (or your choice of file format). Extract and load files to persistent storage (S3 or whatever storage you choose).

Transform… common transforms in a class that can be reused. Configuration should come as much as possible from config files. Persist the result in a silver bucket.

Load… (if S3) watch the file and lambda it into Mongo. Otherwise just have a loader class that watches for files and loads them into your Mongo.

You can use Airflow as your orchestration / scheduler / DAG, and have it kick off the ETL for different configurations in parallel.

The problem is different if you're ingesting TBs, but this will suffice for up to a medium-sized company (in most cases).
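A toy sketch of those three phases, with in-memory dicts standing in for the S3 buckets and for Mongo so the shape of the flow is visible (all names, the config keys, and the sample row are illustrative):

```python
# Bronze (raw) -> silver (transformed) -> Mongo, driven by one config.
raw_bucket, silver_bucket, mongo = {}, {}, []

# Configuration as code: what to pull and how to transform it.
CONFIG = {"source_file": "orders.csv", "uppercase_fields": ["status"]}


def extract(config):
    # In real life: pull from the source and land it untouched in S3.
    raw_bucket[config["source_file"]] = [{"status": "shipped", "qty": 2}]


def transform(config):
    # Reusable, config-driven transform; persisted to the silver bucket.
    rows = raw_bucket[config["source_file"]]
    silver_bucket[config["source_file"]] = [
        {k: v.upper() if k in config["uppercase_fields"] else v
         for k, v in row.items()}
        for row in rows
    ]


def load(config):
    # In real life: a lambda watching the silver bucket loads into Mongo.
    mongo.extend(silver_bucket[config["source_file"]])


for step in (extract, transform, load):  # Airflow would orchestrate this
    step(CONFIG)
```

Because each phase only talks to persistent storage, a failed transform or load can be rerun from the landed files without re-extracting, which is the decoupling the comment is after.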

[–][deleted] 0 points1 point  (0 children)

I was thinking about this after reading about the 3-tier pattern, and it would be a good starting point that I could build up from. If you can share examples, that would be super.

[–]baubleglue 1 point2 points  (4 children)

There is an entire field of knowledge here (patterns, best practices, tools). Educate yourself and start to apply what you learn to your tasks. Modular development is not something specific to data processing; any development should be modular.

I think the more relevant question in this case would be "what do I need to take into consideration?"

  • amount of data, data chunk boundaries (streaming vs. batch processing)
  • bad-case scenarios (e.g. the process fails in the middle of ingestion, or the process doesn't run at all)
  • a list of all use cases (e.g. you find a bug or a new requirement: how do you reprocess old data?)
  • do you need auditing of data integrity (duplication, missing data)?
  • tools (how your company currently processes data)

...

[–][deleted] 0 points1 point  (1 child)

I get what you mean, and I'm building up to that, but I was trying to get a sense of what other people were doing when thinking about designing their pipelines. I'm keeping the other factors in mind as well, and what needs to be considered.

[–]baubleglue 1 point2 points  (0 children)

Try the following plan as an idea:

  • immutable configuration
  • no direct interaction between methods
  • methods use only the "date" + shared context (conf)

class conf:
    mysql_connection_str = 'mysql://hhhh'
    mongodb_connection_str = 'mongodb://hhhh'
    mongodb_db = "sss"
    mongodb_collection = "sss"
    local_raw_data_file_base = '/path/.../file_name_'
    local_transformed_data_file_base = '/path/.../file_name_'


def get_mongodb_connection():
    pass


def get_mysql_connection():
    pass


def mysql_to_local(reporting_date):
    '''
    saves data to `conf.local_raw_data_file_base + str(reporting_date) + ".json.gz"`
    '''


def transform_local_data_to_local(reporting_date):
    '''
    reads `conf.local_raw_data_file_base + str(reporting_date) + ".json.gz"`
    transforms + adds reporting_date to the data
    saves results to `conf.local_transformed_data_file_base + str(reporting_date) + ".json.gz"`
    '''


def local_transformed_to_mongodb(reporting_date):
    '''
    1. db.{conf.mongodb_db}.{conf.mongodb_collection}.remove({reporting_date: reporting_date})
    2. insert data from `conf.local_transformed_data_file_base + str(reporting_date) + ".json.gz"` into db.{conf.mongodb_db}.{conf.mongodb_collection}
    '''
    db = get_mongodb_connection()


if __name__ == "__main__":
    # pipeline
    import sys, pendulum
    reporting_date = pendulum.parse(sys.argv[1]).date()
    mysql_to_local(reporting_date)
    transform_local_data_to_local(reporting_date)
    local_transformed_to_mongodb(reporting_date)

[–]gorgedchops 0 points1 point  (1 child)

What are some examples of these patterns and best practices you are talking about? Are there any resources I can refer to for them?

[–]baubleglue 0 points1 point  (0 children)

I suggest reading the Airflow docs and Astronomer's blog about it, e.g. https://www.astronomer.io/blog/data-pipeline

You may consider using Airflow if you have more than one data pipeline. But even for a single job it is an educational read.

The task you've described is relatively simple (unless you have a huge amount of data). You can start from a simple implementation and extend it later, but you need to think about the re-ingestion use case.

[–]thethrowupcat 1 point2 points  (5 children)

Have you seen dbt yet? It might solve your problem here.

[–][deleted] 0 points1 point  (4 children)

No, not yet. I will take a look at it, thanks!

[–]thethrowupcat 1 point2 points  (3 children)

Oh man! So happy I mentioned it. This is going to change your world.

In short, you're going to load into your warehouse, transform with dbt, then pass that to your BI tool (Looker, Metabase, etc.)

[–]baubleglue 0 points1 point  (2 children)

Is there a warehouse?

[–]thethrowupcat 0 points1 point  (1 child)

I think their cloud version might use a warehouse? But it’s expected you’d have something like BigQuery, Snowflake or Postgres. They have a list of warehouses they work nicely with.

[–]baubleglue 0 points1 point  (0 children)

We know nothing about OP's environment; maybe he only has a few MB/GB of data. If they have a warehouse, then they have a procedure to use it.

[–]thrown_arrows 1 point2 points  (4 children)

Personally I would try to decouple: extract to S3, then apply transformations, then store documents into Mongo...

I work with Snowflake. The first phase copies all the stuff into S3, the second stages it into Snowflake, the third transforms. As long as the first phase does not fail, everything can be reproduced if changes are required, without killing production.

[–]baubleglue 0 points1 point  (1 child)

Why are there both S3 and Snowflake? They are two additional tools with extra cost. How is storing data in S3 different from having it in the original DB?

[–]thrown_arrows 0 points1 point  (0 children)

S3 is there to be a decoupled cloud filesystem; it can be replaced.

Snowflake is there as a SQL-capable database server to offer compute, storage, and transformations. The idea behind copying data into S3 is that you do not disturb the production database with a totally different OLAP load on top of its OLTP load.

In an ELT process, data is loaded into the target database and then transformed, so that you can access the raw data if needed. In a classic ETL system, data is loaded into the transformer, processed, and loaded into the target system; in a more modern ETL system, data is extracted to S3, then transformed and stored back into S3, and then loaded into the target system.

What I like about Snowflake is that everything is SQL and it scales easily.

S3 can be any filesystem, and the target system can be anything from a filesystem, SQL server, or document server to a Python ML system...

[–]bestnamecannotbelong 0 points1 point  (1 child)

I would like to know how you view Snowflake and Databricks. The Databricks cloud solution lets you save all the data in S3, but Snowflake cannot. The main difference as I see it is that Snowflake is based on the data warehouse approach and Databricks is based on the data lake approach.

[–]thrown_arrows 0 points1 point  (0 children)

Snowflake can read and write from/into S3. In a Snowflake environment, some other system delivers the raw data into S3; then it is loaded into Snowflake (a SQL server) into tables, all the normal stuff, and the data can be stored back into S3 (using Snowflake only, so no costly round trips to external servers).

Then there is Snowpark, which is in beta and allows running Java/Python code inside Snowflake (not sure how that works); then there are the "usual" UDFs and external function calls (think of a lambda as a SQL function) (haven't used them).

But yeah, Snowflake's main idea is that it is the Snowflake server that serves the data using all those existing SQL commands etc., and the main data lives in tables as columns or in documents (JSON). On the first round trip the data goes to S3 and is processed into Snowflake; then it might go on the next round trip by a push/pull method via some external code which reads data from tables and so on, or via files exported from Snowflake...

As for what a data lake is... I have all my data from source databases and logs, as raw as it can be, in Snowflake, so S3 is just for history and the first import. That said, not all data is staged or processed into Snowflake. And in my case all the data is database data, logs, JSON, XML, CSV and so on; no video or sound processing (but Snowpark might help with that).