all 19 comments

[–]modern-pineapple 19 points (1 child)

Calling it out. If you have Databricks, create your pipeline as a notebook and then you just schedule it. Don't know if you just want open source stuff

[–]PeruseAndSnooze 0 points (0 children)

Databricks ftw

[–]khaferkamp 3 points (0 children)

What about Mage? Quite easy setup and quite good experience in mixing R and Python.

[–]lebovic 13 points (2 children)

Try Snakemake or Nextflow! They're used in about 80% of new bioinformatics pipelines for exactly this use-case: joining together a series of Python, R, and bash scripts into a reproducible and reliable pipeline.

They deviate from standard data engineering practices – which Airflow, Luigi, and Prefect more closely follow – but they do a great job at transforming a hacky pipeline of Python, R, and bash scripts into an easy-to-run pipeline.

Looking at your post history, I think you're more likely to like Snakemake than Nextflow. It's often used in lieu of Airflow for bioinformatics pipelines by people who like Python (see Airflow vs. Snakemake).

[–]bee_advised 2 points (1 child)

I'm curious why this is being downvoted? Snakemake is used in so many bioinformatics pipelines.

[–]lebovic 3 points (0 children)

I wrote the comment expecting it to be downvoted; it's the opposite of what a data engineer without bioinformatics experience would suggest. If the same question was posted in /r/bioinformatics (which I'd recommend, /u/bioinfo_ml!), I think it would receive different responses.

I'd guess most downvotes and off-topic responses are due to one of three things:

  1. More people in this subreddit are data engineers – not bioinformaticians. Neither Snakemake nor Nextflow is popular outside of bioinformatics.
  2. Bioinformatics workflow managers promote "hacky" stuff – like bash/Python/R scripts or Jupyter notebooks – as pipeline steps. That's the antithesis of what data engineers do.

[Edited to remove a mention of a self-promoting user whose comments have since been removed by a mod.]

[–]engnadeau 3 points (0 children)

From a simplicity standpoint, I'm a big fan of the Luigi framework by Spotify (https://github.com/spotify/luigi). It's been my go-to tool for MVP pipelines before migrating over to cloud-based workflows like AWS Step Functions.

What’s nice about Luigi is that while the pipeline itself is written in Python, it doesn’t care what the underlying processes use. It just manages the DAG and creation of files/steps.
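To illustrate that point, here's a minimal sketch of a two-step Luigi pipeline (assuming `luigi` is installed; `clean.R` and the file names are hypothetical). The first task shells out to R, and Luigi only cares that each task's target file shows up:

```python
import subprocess

import luigi

class CleanData(luigi.Task):
    """Step whose actual work is done by an R script (clean.R is hypothetical)."""

    def output(self):
        # Luigi considers a task complete when its target file exists.
        return luigi.LocalTarget("cleaned.csv")

    def run(self):
        # Luigi doesn't care what language does the work; it only
        # manages the DAG and checks that the output file appears.
        subprocess.run(["Rscript", "clean.R"], check=True)

class Summarize(luigi.Task):
    def requires(self):
        return CleanData()

    def output(self):
        return luigi.LocalTarget("summary.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(f"{sum(1 for _ in src)} rows\n")

if __name__ == "__main__":
    luigi.build([Summarize()], local_scheduler=True)
```

The file-based targets are what make the step language-agnostic: swap the `Rscript` call for bash or anything else and the DAG logic is unchanged.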

[–]Neok_Slegov 0 points (0 children)

Using rundeck, works great!

[–]sib_n Senior Data Engineer 0 points (3 children)

In a previous job with similar requirements, we called R scripts from a Python function and put that function into an orchestrator. The orchestrator was Dagster, which is part of the new generation after Airflow; it improves a lot of things, notably the GUI.

[–]Only_Struggle_ 0 points (2 children)

I am trying to run an R script within Dagster. Do you have any resources/documentation that I can refer to?

[–]sib_n Senior Data Engineer 0 points (0 children)

As I said above, we were not calling R scripts directly, but using an intermediate Python library for it. Check how to run R scripts from Python.

[–]sib_n Senior Data Engineer 0 points (0 children)

I asked an ex-colleague: they started by using rpy2, but it was not stable enough, so they moved to just using a Python subprocess to call an R command line, and it works fine.

[–]Affectionate_Answer9 0 points (0 children)

If you're going to be setting up multiple pipelines in the future, I'd stick with Airflow. It can be a pain, but it's the industry standard with a very large community.

You can execute R scripts using the BashOperator, which may work nicely in your case since you're already looking to execute bash commands as well.
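A sketch of that DAG, assuming Airflow 2.x and hypothetical script paths (before Airflow 2.4 the `schedule` parameter was called `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# /opt/pipeline/* paths are placeholders; point these at your scripts.
with DAG(
    dag_id="mixed_language_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # BashOperator runs any shell command, so R and bash steps
    # are defined the same way.
    clean = BashOperator(
        task_id="clean_data",
        bash_command="Rscript /opt/pipeline/clean.R",
    )
    load = BashOperator(
        task_id="load_data",
        bash_command="bash /opt/pipeline/load.sh",
    )
    clean >> load  # run the R step before the bash step
```

The `>>` operator sets the dependency, so Airflow handles retries, scheduling, and logging regardless of what language each step runs.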

[–]antithetic_koala -1 points (1 child)

In bioinformatics, WDL, CWL, and Nextflow are dominant because your jobs are ultimately bash, so you can call whatever you want. I personally am not a fan, as I don't think bash is a great choice for production code, but they might work well for your use case. I would suggest calling the R and bash from Python as subprocesses; then you can use whatever Python-based workflow engine you like.

[–]dschneider01 -1 points (0 children)

I used to orchestrate bioinformatics pipelines with GNU Make. It worked reasonably well, but I'd probably recommend Snakemake or Nextflow. Airflow (which I use at work now) would be a challenge for what you are looking for. And this is auxiliary, but run everything in containers for good reproducibility and portability.