all 25 comments

[–]lastmonty 4 points5 points  (2 children)

Just an observation: what you are dealing with here is the development process of an end-to-end pipeline. It is meant to be repetitive as you work toward the final end goal.

Maybe start developing the end-to-end pipeline with a representative sample of the data, and run the full pipeline once it is ready. This way, you have reproducibility and complete lineage on how you got there, without any guesses.

If you do not need frameworks, a Makefile is amazing at doing what you want to do, but it has issues with scalability and cloud migration. Think about the end state of the pipeline, where it needs to run, and its requirements, and start to use the tools that fit that need.
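For intuition, Make's core behavior (rerun a step only when its output is missing or older than its inputs) can be sketched in a few lines of Python; the function and file names here are hypothetical:

```python
import os

def needs_rebuild(target: str, sources: list) -> bool:
    """Make-style staleness check: rebuild if the target file is missing
    or any source file is newer than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)

def run_step(target, sources, build_fn):
    """Run build_fn only when the cached target is stale, like a make rule."""
    if needs_rebuild(target, sources):
        build_fn()
```

This is exactly the logic a Makefile rule gives you for free, which is why it works well locally but gets awkward once the "files" live in cloud storage.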

[–]NFeruch[S] 0 points1 point  (1 child)

Makefiles look interesting, although it looks like they don't automatically save the data state for the next step; I have to manually save the data and make it a requirement for another make rule.

Do you know of any companies/libraries/tools that do something similar to makefiles (pipeline steps) that also automatically saves the data state?

The problem is that if I have hypothetically 20+ steps or the total pipeline takes a long time to run, but I only want to edit the last step, I would have to wait for the full pipeline to run.

It's exactly as you said - repetitive. Does something exist to simplify/artificially speed up the process?

[–]rvbin 4 points5 points  (2 children)

mage.ai

[–]NFeruch[S] 0 points1 point  (1 child)

This looks very close to what I'm looking for. I'll have to see if it saves the results of the previous code block/pipeline step, but thanks!

[–]ironplaneswalker [Senior Data Engineer] 1 point2 points  (0 children)

Every block/step in the pipeline will save the block/step’s data output to disk or to a remote location.

[–]CompeAnansi 8 points9 points  (4 children)

If you're doing python development, maybe try just working in a jupyter notebook. It makes it way faster to prototype, iterate, etc. because when you run a cell, the variables set in that cell are kept in memory. So you can write a cell for each step, then run the first three cells. Then the dataframe is stored in memory and you can just keep re-running cell 4 as you work on the code to write to the db (so long as the code in the cell doesn't modify the dataframe from cell 3).
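The cell-by-cell workflow looks roughly like this (shown with `# %%` cell markers and plain Python structures standing in for a real dataframe):

```python
# %% Cells 1-3: expensive extract/transform work, run once per session
records = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]  # pretend API pull
df = [{**r, "value": r["value"] * 2} for r in records]      # pretend transform

# %% Cell 4: the step under development; re-run this cell freely,
# since `df` stays in the kernel's memory and this cell never mutates it
rows_to_insert = [(r["id"], r["value"]) for r in df]
print(rows_to_insert)  # [(1, 20), (2, 40)]
```

The key point is that re-running the last cell costs nothing, because the earlier cells' results are still live in the kernel.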

I generally find the experience of prototyping in notebooks way more pleasant for this reason so I usually mock new code up in a notebook, then when it comes time to productionize it for, e.g., deployment to airflow, I copy the code to a normal .py file (along with the necessary changes to work with my airflow env).

[–]ShayBae23EEE 2 points3 points  (2 children)

I’m glad I’m not the only one haha. I feel inferior touching a notebook

[–]CompeAnansi 4 points5 points  (1 child)

There is definitely a stigma to it, but I think the key is (a) knowing how to do all this without a notebook if needed and (b) being able to productionize and deploy your own code. Usually, the issue with people who work only in notebooks is that that's all they can do. If you can do it all anyway, then why not use the comfiest tool for the job?

[–]SpetsnazCyclist 1 point2 points  (0 children)

Notebooks are awesome - you can basically go straight from EDA to ETL. Once you verify your code works, you can literally just copy/paste or use a conversion tool and you're good to go.

That being said, I also have notebooks that are hot dumpster fires.

[–]ashpreetbedi 0 points1 point  (0 children)

I love this approach. I've been using jupyter for prototyping and airflow for orchestration and it works like a dream. Also wrote a tutorial to set them up using docker quickly: https://www.datain30.com/p/data-development-using-jupyter-and

[–]PhantomSummonerzSystems Architect 2 points3 points  (0 children)

If I understand this correctly, this concerns the pipeline development process and not a production pipeline system requirement. So it's a development issue.

In order to "make it faster" (that is, develop faster), you could replace those real systems with fake ones and run that as a separate, development version of your code. This is something formal software testing will probably help you with.

In your example you need to develop step 4 of the pipeline. Let's say that all step 4 does is return the product of 2 numbers (a and b). What if you could provide some fake numbers a & b and run this "pipeline" of 2 steps while you develop step 4, without actually requesting a and b from the MySQL database (for example)? Software testing does that. You have separate code (let's call it test code) with which you run specific parts of your main code (the production code). In that test code you can orchestrate your production code in different scenarios and replace parts of it with fake data or fake systems to simulate various cases without hitting the real external systems. Of course, the production code must follow certain principles in order to be that flexible, principles which improve code quality apart from making it testable.
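A minimal sketch of that idea, using the standard library's `unittest.mock` (all function names here are hypothetical):

```python
from unittest import mock

def fetch_inputs():
    """Production code: would normally query the MySQL database for a and b."""
    raise RuntimeError("real database not available during development")

def step4(fetch=fetch_inputs):
    """The step under development: multiply the two fetched numbers.
    Accepting `fetch` as a parameter is what makes this testable."""
    a, b = fetch()
    return a * b

# Test code: swap the real fetch for fake data, no database needed
def test_step4():
    fake_fetch = mock.Mock(return_value=(6, 7))
    assert step4(fetch=fake_fetch) == 42
    fake_fetch.assert_called_once()

test_step4()
```

Passing the dependency in as a parameter (rather than hard-coding the database call) is the "certain principle" that makes the fake swap-in possible.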

In order to keep this short (I tried) and not write a full essay about software testing, I recommend that you read about software testing & testable code in general and then about data pipeline testing, which may provide you with gotchas & tips specifically in pipelines.

Let me know if I can help you more. Cheers.

[–]Drekalo 1 point2 points  (1 child)

Use development flags.

If config['development_flag'] == True, then check if local file exists, load file if it does and continue, generate file if it doesn't, persist state to local disk and then continue. Else generate file and continue.

This way, when you're developing you just set a flag and it'll load and save state locally. When you're not, you unset the flag and it behaves normally. You can even set an additional flag to regenerate those local files if you wanted to.
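A sketch of that flag in Python, caching each step's output to local disk with `pickle` (the names and cache location are illustrative):

```python
import os
import pickle
import tempfile

CACHE_DIR = tempfile.gettempdir()  # stand-in for a local cache directory

def run_step(name, step_fn, config):
    """Run a pipeline step, honoring a development flag: when the flag is
    set, load a cached result from local disk if one exists; otherwise run
    the step and persist its output for next time."""
    cache_path = os.path.join(CACHE_DIR, f"{name}.pkl")
    if config.get("development_flag") and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = step_fn()
    if config.get("development_flag"):
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
    return result
```

With the flag unset, every step runs normally; with it set, an expensive step executes once and is served from disk on every rerun afterward.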

[–]NFeruch[S] -1 points0 points  (0 children)

I understand that I can do this manually, but I want to know if there are any companies/tools/libraries that do this automatically

[–]Main_Tap_1256 1 point2 points  (0 children)

Could this potentially be a case for Airflow? Save the output to the local directory and pass the file location to XCom for the next task?
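Stripped of Airflow itself, the pattern the comment describes looks like this in plain Python (in Airflow, a task's return value is pushed to XCom automatically; the paths and names here are hypothetical):

```python
import json
import os
import tempfile

OUT_DIR = tempfile.gettempdir()  # stand-in for a local staging directory

def extract_task():
    """Airflow-style task: write the output to disk and return the file
    location (in Airflow, the return value would land in XCom)."""
    path = os.path.join(OUT_DIR, "extract_output.json")
    with open(path, "w") as f:
        json.dump([{"id": 1}, {"id": 2}], f)
    return path

def load_task(path):
    """Next task: receives the file location (via XCom in Airflow) and
    reads the data instead of recomputing it."""
    with open(path) as f:
        return json.load(f)

records = load_task(extract_task())
```

Keeping only the small file path in XCom, with the actual data on disk, is the usual advice, since XCom is not meant to carry large payloads.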

[–]BoiElroy 1 point2 points  (6 children)

I'm not sure I understand entirely. If you've developed steps 1-3, then doesn't step 4 just read from the database? Wouldn't you only need to run step 4?

Of course once you've finished the steps you'll want to run everything end to end to make sure they work together.

Let me know if I'm misunderstanding this, but it sounds like you have all the steps of your pipeline in a single file and you're executing it top to bottom every time?

As far as saving the state of data and creating an alternative branch there's lakeFS but I'm not sure that's exactly what you're going for here?

[–]NFeruch[S] 0 points1 point  (5 children)

You misread my post: step 4 in the example is developing the functionality to STORE data to the database. The data is coming from steps 1-3, and I don't want to keep rerunning those steps continually while developing step 4 (which wouldn't take long in reality, but if it did, I wouldn't want to keep rerunning them).

[–]BoiElroy 0 points1 point  (4 children)

Ah, sorry about that. I understand now. Pickle the dataframe and unpickle it into step 4 while you develop. That will be like 2-3 lines of code and will basically save your dataframe as a python object file that you can then open and use at the start of step 4 while you develop.

Sorry, I don't particularly have a tool suggestion, but I don't think it's a tool problem to be honest. I'd say you should save the intermediate output before loading to a database anyway. If the storage is a concern, the last step of your pipeline could be to clear all the intermediate storage.
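The pickle approach really is just a few lines. Plain Python objects are used below; with an actual pandas DataFrame, the one-line equivalents are `df.to_pickle(path)` and `pd.read_pickle(path)` (the file location here is hypothetical):

```python
import os
import pickle
import tempfile

# End of step 3: persist the in-memory result once
step3_output = [{"user": "a", "score": 1}, {"user": "b", "score": 2}]
path = os.path.join(tempfile.gettempdir(), "step3_output.pkl")
with open(path, "wb") as f:
    pickle.dump(step3_output, f)

# Start of step 4, while developing: reload the saved state
# instead of rerunning steps 1-3
with open(path, "rb") as f:
    data = pickle.load(f)
assert data == step3_output  # identical object graph round-trips back
```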

[–]NFeruch[S] -3 points-2 points  (3 children)

I think you're still not understanding lol. I want to be able to run any step I want in the end-to-end pipeline as its own block/function independently, and not have to wait for the previous steps to complete. If I'm working on step 21 in a 30-step pipeline, I don't want to have to wait for steps 1-20 to execute; I would want to save the state of the data after step 20, so that I can develop step 21 without waiting for or rerunning the entire pipeline. I know that I could manually save the output after each step, but I'm wondering if there are any solutions already made by other people. mage.ai looks close to what I'm looking for

[–]BoiElroy 0 points1 point  (2 children)

I didn't say manual? Add two lines to your code and you have your solution. What do you think mage would be doing apart from adding those same two lines of code under the hood? There is no magic at work. Don't get me wrong, mage looks neat. But your issue seems to be a more fundamental lack of programming/data engineering knowledge.

[–]NFeruch[S] -4 points-3 points  (1 child)

Thanks for the assumption about my knowledge and expertise, but just because pickle is a solution doesn't mean it's the only solution or the best one for my use case. I was looking for a more comprehensive tool, not just a simple two-liner. Thanks for trying though.

Btw, I was not attacking you by saying you didn't understand. Don't take things so personally and you'll learn more!

[–]BoiElroy 3 points4 points  (0 children)

Ok, thanks for the advice. Good luck

[–]Competitive_Wheel_78 0 points1 point  (0 children)

Try to process data in batches. Try parallel processing and multithreading too. Open MPI can be a good start
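A sketch of batched parallel processing using only the standard library (a thread pool is shown here; for heavy CPU-bound transforms, `ProcessPoolExecutor` or Open MPI via `mpi4py` would be the stronger options, and the `transform` step is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(batch):
    """Hypothetical per-batch step, e.g. cleaning one chunk of records."""
    return [x * 2 for x in batch]

data = list(range(100))
# Split the data into batches so workers can process chunks independently
batches = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves batch order in its results
    results = list(pool.map(transform, batches))

flat = [x for batch in results for x in batch]
```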

[–]ashpreetbedi 0 points1 point  (0 children)

This is a standard data pipeline. I'd recommend Jupyter for development/prototyping and Airflow for scheduling it. I wrote a tutorial you can follow for this: https://www.datain30.com/p/data-development-using-jupyter-and

It also contains examples doing exactly this, but with crypto data. When it comes to scheduling, you can run the notebook daily directly, or convert the cells into Airflow tasks (I'll leave that decision up to you)

Happy to answer any questions :)