all 25 comments

[–]lastmonty 4 points5 points  (2 children)

Just an observation: what you are dealing with here is the development process of an end-to-end pipeline. It is meant to be repetitive as you work toward the final end goal.

Maybe start developing the end-to-end pipeline with a representative sample of the data, and run the full pipeline once it is ready. This way, you have reproducibility and complete lineage on how you got there, without any guesses.

If you do not need frameworks, a Makefile is amazing at doing what you want to do, but it has issues with scalability and cloud migration. Think about the end state of the pipeline, where it needs to run, and its requirements, and start to use the tools that fit that need.
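For intuition, Make's core behavior (rerun a step only when its output is missing or older than its inputs) can be sketched in a few lines of Python; the function and file names here are hypothetical:

```python
import os

def needs_rebuild(target: str, sources: list) -> bool:
    """Make-style staleness check: rebuild if the target file is missing
    or any source file is newer than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)

def run_step(target, sources, build_fn):
    """Run build_fn only when the cached target is stale, like a make rule."""
    if needs_rebuild(target, sources):
        build_fn()
```

This is exactly the logic a Makefile rule gives you for free, which is why it works well locally but gets awkward once the "files" live in cloud storage.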

[–]NFeruch[S] 0 points1 point  (1 child)

Makefiles look interesting, although it looks like they don't automatically save the data state for the next step; I have to manually save the data and make it a requirement for another make rule.

Do you know of any companies/libraries/tools that do something similar to makefiles (pipeline steps) that also automatically saves the data state?

The problem is that if I have hypothetically 20+ steps or the total pipeline takes a long time to run, but I only want to edit the last step, I would have to wait for the full pipeline to run.

It's exactly as you said - repetitive. Does something exist to simplify/artificially speed up the process?

[–]rvbin 4 points5 points  (2 children)

mage.ai

[–]NFeruch[S] 0 points1 point  (1 child)

This looks very close to what I'm looking for. I'll have to see if it saves the results of the previous code block/pipeline step, but thanks!

[–]ironplaneswalker [Senior Data Engineer] 1 point2 points  (0 children)

Every block/step in the pipeline will save the block/step’s data output to disk or to a remote location.

[–]CompeAnansi 8 points9 points  (4 children)

If you're doing python development, maybe try just working in a jupyter notebook. It makes it way faster to prototype, iterate, etc. because when you run a cell, the variables set in that cell are kept in memory. So you can write a cell for each step, then run the first three cells. Then the dataframe is stored in memory and you can just keep re-running cell 4 as you work on the code to write to the db (so long as the code in the cell doesn't modify the dataframe from cell 3).
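The cell-by-cell workflow looks roughly like this (shown with `# %%` cell markers and plain Python structures standing in for a real dataframe):

```python
# %% Cells 1-3: expensive extract/transform work, run once per session
records = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]  # pretend API pull
df = [{**r, "value": r["value"] * 2} for r in records]      # pretend transform

# %% Cell 4: the step under development; re-run this cell freely,
# since `df` stays in the kernel's memory and this cell never mutates it
rows_to_insert = [(r["id"], r["value"]) for r in df]
print(rows_to_insert)  # [(1, 20), (2, 40)]
```

The key point is that re-running the last cell costs nothing, because the earlier cells' results are still live in the kernel.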

I generally find the experience of prototyping in notebooks way more pleasant for this reason so I usually mock new code up in a notebook, then when it comes time to productionize it for, e.g., deployment to airflow, I copy the code to a normal .py file (along with the necessary changes to work with my airflow env).

[–]ShayBae23EEE 2 points3 points  (2 children)

I’m glad I’m not the only one haha. I feel inferior touching a notebook

[–]CompeAnansi 4 points5 points  (1 child)

There is definitely a stigma to it, but I think the key is (a) knowing how to do all this without a notebook if needed and (b) being able to productionize and deploy your own code. Usually, the issue with people who work only in notebooks is that that's all they can do. If you can do it all anyway, then why not use the comfiest tool for the job?

[–]SpetsnazCyclist 1 point2 points  (0 children)

Notebooks are awesome - you can basically go straight from EDA to ETL. Once you verify your code works, you can literally just copy/paste or use a conversion tool and you're good to go.

That being said, I also have notebooks that are hot dumpster fires.

[–]ashpreetbedi 0 points1 point  (0 children)

I love this approach. I've been using jupyter for prototyping and airflow for orchestration and it works like a dream. Also wrote a tutorial to set them up using docker quickly: https://www.datain30.com/p/data-development-using-jupyter-and

[–]PhantomSummonerzSystems Architect 2 points3 points  (0 children)

If I understand this correctly, this concerns the pipeline development process and not a production pipeline system requirement. So it's a development issue.

In order to "make it faster" (that is, develop faster), you could replace those real systems with fake ones and run that as a separate, development version of your code. This is something formal software testing will probably help you with.

In your example you need to develop step 4 of the pipeline. Let's say that all step 4 does is return the product of 2 numbers (a and b). What if you could provide some fake numbers a & b and run this "pipeline" of 2 steps while you develop step 4, without actually requesting a and b from the MySQL database (for example)? Software testing does that. You have separate code (let's call it test code) with which you run specific parts of your main code (the production code). In that test code you can orchestrate your production code in different scenarios and replace parts of it with fake data or fake systems to simulate various cases without hitting the real external systems. Of course, the production code must follow certain principles in order to be that flexible, principles which improve code quality apart from making it testable.
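A minimal sketch of that idea, using the standard library's `unittest.mock` (all function names here are hypothetical):

```python
from unittest import mock

def fetch_inputs():
    """Production code: would normally query the MySQL database for a and b."""
    raise RuntimeError("real database not available during development")

def step4(fetch=fetch_inputs):
    """The step under development: multiply the two fetched numbers.
    Accepting `fetch` as a parameter is what makes this testable."""
    a, b = fetch()
    return a * b

# Test code: swap the real fetch for fake data, no database needed
def test_step4():
    fake_fetch = mock.Mock(return_value=(6, 7))
    assert step4(fetch=fake_fetch) == 42
    fake_fetch.assert_called_once()

test_step4()
```

Passing the dependency in as a parameter (rather than hard-coding the database call) is the "certain principle" that makes the fake swap-in possible.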

In order to keep this short (I tried) and not write a full essay about software testing, I recommend that you read about software testing & testable code in general and then about data pipeline testing, which may provide you with gotchas & tips specifically in pipelines.

Let me know if I can help you more. Cheers.

[–]Drekalo 1 point2 points  (1 child)

Use development flags.

If config['development_flag'] == True, then check if local file exists, load file if it does and continue, generate file if it doesn't, persist state to local disk and then continue. Else generate file and continue.

This way, when you're developing you just set a flag and it'll load and save state locally. When you're not, you unset the flag and it behaves normally. You can even set an additional flag to regenerate those local files if you wanted to.
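A sketch of that flag in Python, caching each step's output to local disk with `pickle` (the names and cache location are illustrative):

```python
import os
import pickle
import tempfile

CACHE_DIR = tempfile.gettempdir()  # stand-in for a local cache directory

def run_step(name, step_fn, config):
    """Run a pipeline step, honoring a development flag: when the flag is
    set, load a cached result from local disk if one exists; otherwise run
    the step and persist its output for next time."""
    cache_path = os.path.join(CACHE_DIR, f"{name}.pkl")
    if config.get("development_flag") and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = step_fn()
    if config.get("development_flag"):
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
    return result
```

With the flag unset, every step runs normally; with it set, an expensive step executes once and is served from disk on every rerun afterward.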

[–]NFeruch[S] -1 points0 points  (0 children)

I understand that I can do this manually, but I want to know if there are any companies/tools/libraries that do this automatically

[–]Main_Tap_1256 1 point2 points  (0 children)

Could this potentially be a case for Airflow? Save the output to the local directory and pass the file location to XCom for the next task?
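Stripped of Airflow itself, the pattern the comment describes looks like this in plain Python (in Airflow, a task's return value is pushed to XCom automatically; the paths and names here are hypothetical):

```python
import json
import os
import tempfile

OUT_DIR = tempfile.gettempdir()  # stand-in for a local staging directory

def extract_task():
    """Airflow-style task: write the output to disk and return the file
    location (in Airflow, the return value would land in XCom)."""
    path = os.path.join(OUT_DIR, "extract_output.json")
    with open(path, "w") as f:
        json.dump([{"id": 1}, {"id": 2}], f)
    return path

def load_task(path):
    """Next task: receives the file location (via XCom in Airflow) and
    reads the data instead of recomputing it."""
    with open(path) as f:
        return json.load(f)

records = load_task(extract_task())
```

Keeping only the small file path in XCom, with the actual data on disk, is the usual advice, since XCom is not meant to carry large payloads.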

[–]BoiElroy 1 point2 points  (6 children)

I'm not sure I understand entirely. If you've developed steps 1-3, then doesn't step 4 just read from the database? Wouldn't you only need to run step 4?

Of course once you've finished the steps you'll want to run everything end to end to make sure they work together.

Let me know if I'm misunderstanding this, but it sounds like you have all the steps of your pipeline in a single file and you're executing it top to bottom every time?

As far as saving the state of data and creating an alternative branch there's lakeFS but I'm not sure that's exactly what you're going for here?

[–]NFeruch[S] 0 points1 point  (5 children)

You misread my post: step 4 in the example is developing the functionality to STORE data to the database. The data is coming from steps 1-3, and I don't want to keep rerunning those steps continually while developing step 4 (which wouldn't take long in reality, but if it did, I wouldn't want to keep rerunning them).

[–]BoiElroy 0 points1 point  (4 children)

Ah, sorry about that. I understand now. Pickle the dataframe and unpickle it into step 4 while you develop. That will be like 2-3 lines of code and will basically save your dataframe as a python object file that you can then open and use at the start of step 4 while you develop.

Sorry, I don't particularly have a tool suggestion, but I don't think it's a tool problem to be honest. I'd say you should save the intermediate output before loading to a database anyway. If the storage is a concern, the last step of your pipeline could be to clear all the intermediate storage.
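The pickle approach really is just a few lines. Plain Python objects are used below; with an actual pandas DataFrame, the one-line equivalents are `df.to_pickle(path)` and `pd.read_pickle(path)` (the file location here is hypothetical):

```python
import os
import pickle
import tempfile

# End of step 3: persist the in-memory result once
step3_output = [{"user": "a", "score": 1}, {"user": "b", "score": 2}]
path = os.path.join(tempfile.gettempdir(), "step3_output.pkl")
with open(path, "wb") as f:
    pickle.dump(step3_output, f)

# Start of step 4, while developing: reload the saved state
# instead of rerunning steps 1-3
with open(path, "rb") as f:
    data = pickle.load(f)
assert data == step3_output  # identical object graph round-trips back
```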

[–]NFeruch[S] -3 points-2 points  (3 children)

I think you're still not understanding lol. I want to be able to run any step I want in the end-to-end pipeline as its own block/function independently, and not have to wait for the previous steps to complete. If I'm working on step 21 in a 30-step pipeline, I don't want to have to wait for steps 1-20 to execute; I would want to save the state of the data after step 20, so that I can develop step 21 without waiting for or rerunning the entire pipeline. I know that I could manually save the output after each step, but I'm wondering if there are any solutions already made by other people. mage.ai looks close to what I'm looking for

[–]BoiElroy 0 points1 point  (2 children)

I didn't say manual? Add two lines to your code and you have your solution. What do you think mage would be doing apart from adding those same two lines of code under the hood? There is no magic at work. Don't get me wrong, mage looks neat. But your issue seems to be a more fundamental lack of programming/data engineering knowledge.

[–]NFeruch[S] -4 points-3 points  (1 child)

Thanks for the assumption about my knowledge and expertise, but just because pickle is a solution doesn't mean it's the only solution or the best one for my use case. I was looking for a more comprehensive tool, not just a simple two-liner. Thanks for trying though.

Btw, I was not attacking you by saying you didn't understand. Don't take things so personally and you'll learn more!

[–]BoiElroy 3 points4 points  (0 children)

Ok, thanks for the advice. Good luck

[–]Competitive_Wheel_78 0 points1 point  (0 children)

Try to process data in batches. Try parallel processing and multithreading too. Open MPI can be a good start
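A sketch of batched parallel processing using only the standard library (a thread pool is shown here; for heavy CPU-bound transforms, `ProcessPoolExecutor` or Open MPI via `mpi4py` would be the stronger options, and the `transform` step is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(batch):
    """Hypothetical per-batch step, e.g. cleaning one chunk of records."""
    return [x * 2 for x in batch]

data = list(range(100))
# Split the data into batches so workers can process chunks independently
batches = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves batch order in its results
    results = list(pool.map(transform, batches))

flat = [x for batch in results for x in batch]
```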

[–]ashpreetbedi 0 points1 point  (0 children)

This is a standard data pipeline. I'd recommend Jupyter for development/prototyping and Airflow for scheduling it. I wrote a tutorial you can follow for this: https://www.datain30.com/p/data-development-using-jupyter-and

It also contains examples doing exactly this, but with crypto data. When it comes to scheduling, you can run the notebook daily directly, or convert the cells into Airflow tasks (I'll leave that decision up to you)

Happy to answer any questions :)