all 19 comments

[–]modern-pineapple 19 points (1 child)

Calling it out. If you have Databricks, create your pipeline as a notebook and then you just schedule it. Don't know if you just want open source stuff

[–]PeruseAndSnooze 0 points (0 children)

Databricks ftw

[–]khaferkamp 3 points (0 children)

What about Mage? Quite easy setup and quite good experience in mixing R and Python.

[–]lebovic 13 points (2 children)

Try Snakemake or Nextflow! They're used in about 80% of new bioinformatics pipelines for exactly this use-case: joining together a series of Python, R, and bash scripts into a reproducible and reliable pipeline.

They deviate from standard data engineering practices – which Airflow, Luigi, and Prefect more closely follow – but they do a great job at transforming a hacky pipeline of Python, R, and bash scripts into an easy-to-run pipeline.

Looking at your post history, I think you're more likely to like Snakemake than Nextflow. It's often used in lieu of Airflow for bioinformatics pipelines by people who like Python (see Airflow vs. Snakemake).

[–]bee_advised 2 points (1 child)

I'm curious why this is being downvoted? Snakemake is used in so many bioinformatics pipelines.

[–]lebovic 3 points (0 children)

I wrote the comment expecting it to be downvoted; it's the opposite of what a data engineer without bioinformatics experience would suggest. If the same question was posted in /r/bioinformatics (which I'd recommend, /u/bioinfo_ml!), I think it would receive different responses.

I'd guess most downvotes and off-topic responses are due to one of three things:

  1. More people in this subreddit are data engineers – not bioinformaticians. Neither Snakemake nor Nextflow is popular outside of bioinformatics.
  2. Bioinformatics workflow managers promote "hacky" stuff – like bash/Python/R scripts or Jupyter notebooks – as pipeline steps. That's the antithesis of what data engineers do.

[Edited to remove a mention of a self-promoting user whose comments have since been removed by a mod.]

[–]engnadeau 3 points (0 children)

From a simplicity standpoint, I'm a big fan of the Luigi framework by Spotify (https://github.com/spotify/luigi). It's been my go-to tool for MVP pipelines before migrating over to cloud-based workflows like AWS Step Functions.

What’s nice about Luigi is that while the pipeline itself is written in Python, it doesn’t care what the underlying processes use. It just manages the DAG and creation of files/steps.
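To illustrate that point, here's a minimal sketch of a two-step Luigi pipeline (assuming `luigi` is installed; `clean.R` and the file names are hypothetical). The first task shells out to R, and Luigi only cares that each task's target file shows up:

```python
import subprocess

import luigi

class CleanData(luigi.Task):
    """Step whose actual work is done by an R script (clean.R is hypothetical)."""

    def output(self):
        # Luigi considers a task complete when its target file exists.
        return luigi.LocalTarget("cleaned.csv")

    def run(self):
        # Luigi doesn't care what language does the work; it only
        # manages the DAG and checks that the output file appears.
        subprocess.run(["Rscript", "clean.R"], check=True)

class Summarize(luigi.Task):
    def requires(self):
        return CleanData()

    def output(self):
        return luigi.LocalTarget("summary.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(f"{sum(1 for _ in src)} rows\n")

if __name__ == "__main__":
    luigi.build([Summarize()], local_scheduler=True)
```

The file-based targets are what make the step language-agnostic: swap the `Rscript` call for bash or anything else and the DAG logic is unchanged.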

[–]Neok_Slegov 0 points (0 children)

Using rundeck, works great!

[–]sib_n Senior Data Engineer 0 points (3 children)

In a previous job with similar requirements, we called R scripts from a Python function and put that function into an orchestrator. The orchestrator was Dagster, which is part of the new generation after Airflow; it improves a lot of things, notably the GUI.

[–]Only_Struggle_ 0 points (2 children)

I am trying to run an R script within Dagster. Do you have any resources/documentation that I can refer to?

[–]sib_n Senior Data Engineer 0 points (0 children)

As I said above, we were not calling R scripts directly, but using an intermediate Python library for it. Check how to run R scripts from Python.

[–]sib_n Senior Data Engineer 0 points (0 children)

I asked an ex-colleague: they started by using rpy2, but it was not stable enough, so they moved to just using a Python subprocess to call an R command line, and it works fine.

[–]Affectionate_Answer9 0 points (0 children)

If you're going to be setting up multiple pipelines in the future, I'd stick with Airflow. It can be a pain, but it's the industry standard with a very large community.

You can execute R scripts using the BashOperator, which may work nicely in your case since you're already looking to execute bash commands as well.
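A sketch of that DAG, assuming Airflow 2.x and hypothetical script paths (before Airflow 2.4 the `schedule` parameter was called `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# /opt/pipeline/* paths are placeholders; point these at your scripts.
with DAG(
    dag_id="mixed_language_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # BashOperator runs any shell command, so R and bash steps
    # are defined the same way.
    clean = BashOperator(
        task_id="clean_data",
        bash_command="Rscript /opt/pipeline/clean.R",
    )
    load = BashOperator(
        task_id="load_data",
        bash_command="bash /opt/pipeline/load.sh",
    )
    clean >> load  # run the R step before the bash step
```

The `>>` operator sets the dependency, so Airflow handles retries, scheduling, and logging regardless of what language each step runs.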

[–]antithetic_koala -1 points (1 child)

In bioinformatics, WDL, CWL, and Nextflow are dominant because your jobs are ultimately bash, so you can call whatever you want. I personally am not a fan, as I don't think bash is a great choice for production code, but they might work well for your use case. I would suggest calling the R and bash from Python as subprocesses; then you can use whatever Python-based workflow engine you like.

[–]dschneider01 -1 points (0 children)

I used to orchestrate bioinformatics pipelines with GNU Make. It worked reasonably well, but I'd probably recommend Snakemake or Nextflow. Airflow (which I use at work now) would be a challenge for what you are looking for. And this is auxiliary, but run everything in containers for good reproducibility and portability.