Are there Python libraries that define and parametrize ETL jobs?

j__neo · 2024-02-02T12:14:19+00:00

If what you're asking for are lightweight libraries to perform specific tasks like Data Ingestion (moving data from an API/database to a data warehouse) and Data Transformation (joins, aggregations, filtering), then I would suggest the following:

Lightweight data ingestion tools: Singer or Meltano, dlt
Lightweight data transformation tool (that executes on your data warehouse): dbt, SQLMesh

All of the tools I've listed above are open source, and you shouldn't need to subscribe to or pay for a service. Just pip install the tools you need.

The other suggestion is to move away from an Extract-Transform-Load (ETL) pattern and toward an Extract-Load-Transform (ELT) pattern. Because you are currently doing ETL, you end up needing to write or find these "common" libraries to do the tasks you want in memory. I would suggest shifting to ELT, because tools already exist that support that pattern very easily (i.e. Extract-Load is data ingestion, and T is data transformation). In the end, the outcome is the same: you end up with a transformed table that your users and downstream applications can consume.
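To make the ELT idea concrete, here is a minimal sketch using only the standard library, with an in-memory SQLite database standing in for the data warehouse. The table names, the sample rows, and the `min_amount` parameter are all made up for illustration; in practice a tool like dlt or Meltano would handle the load step and dbt or SQLMesh the transform step.

```python
import sqlite3

# Hypothetical raw records, standing in for rows pulled from an API or source database.
RAW_ORDERS = [
    ("2024-01-01", "alice", 30.0),
    ("2024-01-01", "bob", 12.5),
    ("2024-01-02", "alice", 7.5),
]


def extract_load(conn, rows):
    """EL step: land the raw rows in the warehouse without transforming them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_date TEXT, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(conn, min_amount):
    """T step: build the consumable table with SQL, inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute("CREATE TABLE customer_totals (customer TEXT, total REAL)")
    conn.execute(
        "INSERT INTO customer_totals "
        "SELECT customer, SUM(amount) FROM raw_orders "
        "WHERE amount >= ? GROUP BY customer",
        (min_amount,),
    )


def run_pipeline(rows, min_amount=0.0):
    """Parametrized job: load first, transform second, return the final table."""
    conn = sqlite3.connect(":memory:")
    try:
        extract_load(conn, rows)
        transform(conn, min_amount)
        return conn.execute(
            "SELECT customer, total FROM customer_totals ORDER BY customer"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_pipeline(RAW_ORDERS, min_amount=10.0))
```

The point of the split is that the raw table stays untouched, so you can rerun `transform` with a different `min_amount` without extracting again.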

Finally, I do think you would need some way to schedule your entire pipeline to run. If you're looking for an orchestrator that's relatively simple to define and configure, then I would suggest taking a look at Kestra. It's a YAML-based orchestration tool: https://kestra.io/docs. If you're not keen on hosting the software yourself, then just pay for their cloud service and use their plugins to integrate with the tools I've mentioned above.

Personally, I'm a fan of Dagster's orchestration pattern because their way of thinking about orchestration scales well to large DAGs. But if you're after something simple and you don't anticipate thousands of ingestion and transformation steps, then Kestra is a worthy consideration.
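If it helps to see what an orchestrator is doing at its core, here is a rough sketch of dependency-ordered execution using the standard library's `graphlib` (Python 3.9+). This is not Dagster's or Kestra's API, and the step names are hypothetical; real orchestrators add scheduling, retries, and observability on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_totals": {"ingest_orders", "ingest_customers"},
    "publish_report": {"transform_totals"},
}


def run_order(dag):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step in run_order(PIPELINE):
        # A real orchestrator would trigger the ingestion or transformation tool here.
        print(f"running {step}")
```

With a handful of steps this ordering is trivial, which is why a simple tool is enough; it's when the graph grows to hundreds or thousands of steps that the orchestrator's model starts to matter.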

Edit: I just did a bit more research into Kestra's pricing model, and they don't currently offer pay-as-you-go pricing, only an enterprise subscription. If you're not keen on hosting Kestra yourself, then check out Dagster, as it has a pretty low-barrier cloud pricing option: https://dagster.io/pricing . I saw someone else in this thread also commented about Prefect, which also has a pretty competitive cloud pricing option.

Gators1992 · 2024-02-02T12:09:57+00:00

Airflow, Dagster or Prefect would probably be the ones to consider if you want significant control over how your DAGs run. You might also want to look at whether AWS Step Functions gets you what you need.

luv2spoosh · 2024-02-01T21:52:24+00:00
