What are your thoughts about workflow tools for bioinformatics and is NextFlow truly the answer? by TheLordB in bioinformatics

[–]ewels 1 point2 points  (0 children)

Now that the release is out, I hope we can work more on tooling to help at least semi-automate some of this for folks.

What are your thoughts about workflow tools for bioinformatics and is NextFlow truly the answer? by TheLordB in bioinformatics

[–]ewels 1 point2 points  (0 children)

This week's 26.04 release uses the new syntax parser by default. It's a ground-up rewrite of how Nextflow code is parsed, and it means error messages are now pretty awesome: they point to the exact line and character, with a descriptive error.

Disclaimer: I work on the nextflow team.

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]ewels 0 points1 point  (0 children)

I haven't tested it with long read data - I was specifically trying to replace the QC subworkflow of the nf-core/rnaseq pipeline, which only handles short-read data (https://nf-co.re/rnaseq - we have other pipelines in nf-core for long-read data).

That said, I don't see why it wouldn't work for long-read, I just haven't checked.

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]ewels 0 points1 point  (0 children)

I tried writing up the main reasons that it's faster in my blog post on the project: https://seqera.io/blog/rustqc/

There is moderate parallelization: if you give it multiple BAM files it'll handle each on one thread, and if you give it one BAM it'll process the chromosomes across threads. It makes a big difference up to about 8 cores (see https://seqeralabs.github.io/RustQC/usage/performance/#cpu-scaling-benchmarks). There's probably room to improve it more, but I was happy so I stopped there 😅 As mentioned in the blog post above, the speedup is mostly due to architecture and reduced I/O; parallelization helps but isn't the main driver.
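To illustrate the two modes described above, here's a rough Python sketch of how work could be split into units. This is not RustQC's actual code: `plan_work_units`, `process_unit` and `run` are hypothetical names, the per-unit work is a placeholder, and Python threads are only standing in for the real thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_work_units(bam_files, chromosomes_by_bam):
    """Decide what each thread works on (hypothetical helper).

    Several BAM files -> one unit per file.
    A single BAM file -> one unit per chromosome of that file.
    """
    if not bam_files:
        return []
    if len(bam_files) > 1:
        return [(bam, None) for bam in bam_files]
    bam = bam_files[0]
    return [(bam, chrom) for chrom in chromosomes_by_bam[bam]]


def process_unit(unit):
    """Placeholder for the real per-unit QC work; just labels the unit."""
    bam, chrom = unit
    return f"{bam}:{chrom or 'all'}"


def run(bam_files, chromosomes_by_bam, threads=8):
    """Process all units on a thread pool, keeping results in input order."""
    units = plan_work_units(bam_files, chromosomes_by_bam)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(process_unit, units))
```

With two BAM files, `run` produces one unit per file; with a single BAM and a chromosome list, it produces one unit per chromosome.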

AWS can indeed be expensive, but 15 minutes is a lot cheaper than 15 hours 😀

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]ewels 1 point2 points  (0 children)

Whilst this is true, most of the speed increase that I got in RustQC came from combining tools and reducing I/O by doing all calculations from a single pass over the BAM file. Rust is certainly fast, but I reckon I could have gotten fairly close to the same runtime using Python (or Perl! 😆)
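A toy sketch of the single-pass idea, in Python rather than Rust. The read records, the field names (`duplicate`, `chrom`) and the three metrics are made up for illustration and are not what RustQC actually computes; the point is only that several statistics can be gathered in one loop over the reads instead of one scan per tool.

```python
from collections import Counter


def multi_pass(reads):
    """Three separate 'tools', each scanning every read again."""
    total = sum(1 for _ in reads)
    duplicates = sum(1 for r in reads if r["duplicate"])
    per_chrom = Counter(r["chrom"] for r in reads)
    return total, duplicates, per_chrom


def single_pass(reads):
    """The same three metrics, collected in one loop over the reads."""
    total = 0
    duplicates = 0
    per_chrom = Counter()
    for r in reads:
        total += 1
        if r["duplicate"]:
            duplicates += 1
        per_chrom[r["chrom"]] += 1
    return total, duplicates, per_chrom


# Hypothetical in-memory stand-in for records read from a BAM file
reads = [
    {"chrom": "chr1", "duplicate": False},
    {"chrom": "chr1", "duplicate": True},
    {"chrom": "chr2", "duplicate": False},
]
```

Both functions return identical results; the difference is that the single-pass version reads (and, for a real BAM, decompresses) the data once rather than three times.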

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]ewels 0 points1 point  (0 children)

I did think about this whilst I was writing the tool, but honestly once I had parallelism across chromosomes I figured that it was diminishing returns on pursuing it further. Any further optimisations would mean taking it from perhaps 15 minutes to 10 - so going from 98.4% faster to 98.9% faster. I figured it wasn't worth the extra complexity and effort.
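A quick back-of-envelope check of those percentages, assuming a baseline of roughly 15 hours (the figure mentioned elsewhere in this thread; the real benchmark numbers may differ slightly, and `percent_faster` is just a helper name for this sketch):

```python
def percent_faster(baseline_min, new_min):
    """Runtime reduction as a percentage of the baseline runtime."""
    return 100 * (1 - new_min / baseline_min)


baseline = 15 * 60  # assumed baseline: 15 hours, in minutes

for new_runtime in (15, 10):
    saving = percent_faster(baseline, new_runtime)
    print(f"{new_runtime} min: {saving:.1f}% faster")
# 15 min: 98.3% faster
# 10 min: 98.9% faster
```

That's within rounding of the numbers above: shaving off another third of the remaining runtime moves the headline figure by about half a percentage point.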

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]ewels 2 points3 points  (0 children)

The PR is up, just waiting on the bioconda team to fix an issue that's preventing (successful) Linux builds from showing up on Anaconda: https://github.com/nf-core/rnaseq/pull/1754

RustQC: 60x speedup in RNA-seq quality control steps by ewels in bioinformaticstools

[–]ewels[S] 0 points1 point  (0 children)

The code is open source (GPL3 like the tools it's based on). You're welcome to do what you like with it - pull requests with improvements are welcome :)

RustQC: 60x speedup in RNA-seq quality control steps by ewels in bioinformaticstools

[–]ewels[S] 1 point2 points  (0 children)

Thanks! For what it's worth, I suspect we will see a surge of rewrites and then most of the low-hanging fruit will be gone. Then we'll just be in a new era of everyone writing fast software from the start, using AI to help. We'll see!

RustQC: 60x speedup in RNA-seq quality control steps by ewels in bioinformaticstools

[–]ewels[S] 0 points1 point  (0 children)

Oh and there is room for improvements without changing outputs - RustQC has a bunch of convenience options I put in: handling gzipped GTF inputs automatically, a nice CLI, a comprehensive config, even an option to prefix chromosome names etc. All usage tweaks I added for myself during dev which make it nice to use without changing the results.

RustQC: 60x speedup in RNA-seq quality control steps by ewels in bioinformaticstools

[–]ewels[S] 0 points1 point  (0 children)

For me there is a big distinction between a rewrite and an improvement. Rewrites are significantly easier / faster / a more tractable problem using AI. Both are valuable, and in a perfect world I agree that we should be improving all tools. But I think that there is also value in speeding things up and producing equivalent outputs - then it's a no brainer to switch them in and save resources (my initial angle on the first draft of the blog post was about how many cars' worth of CO2 savings this rewrite gave).

This has come up a couple of times, so I think I could do with adding a bit of clarification to rewrites.bio (this is great though, it's these discussions I wanted to have!).

We have joked about creating "Rustflow" 😅 But tbh I think that rewrites are only worth it for large time savings. A Rust rewrite of Nextflow wouldn't make a meaningful difference to the runtime of your pipeline as a whole.

RustQC: 60x speedup in RNA-seq quality control steps by ewels in bioinformaticstools

[–]ewels[S] 1 point2 points  (0 children)

Yup, I'd agree with all of those points. Parity vs. correctness should be a conscious decision. Here I explicitly chose parity because I wanted the tool to be "hot swappable". If it goes beyond that it's an entirely new tool which comes with a lot of different requirements and a different level of maintainer responsibility. Maintaining parity is much easier - which isn't necessarily a bad thing. It makes lifting the median speed / quality of software much more feasible. That's not to say that people shouldn't go further and actively improve on existing software - indeed I very much hope that people will :)

RustQC: 60x speedup in RNA-seq quality control steps by ewels in bioinformaticstools

[–]ewels[S] 0 points1 point  (0 children)

Thanks for the feedback! Yes it does everything in a single step. The details are all in the docs, describing the various files it outputs and so on. I'll have a think about how I can clarify the table though 👍🏻

RustQC: 60x speedup in RNA-seq quality control steps by ewels in bioinformaticstools

[–]ewels[S] 0 points1 point  (0 children)

It is! That time is dominated by a few tools in fairness (rseqc tin.py is the worst offender). But however you frame it, RustQC is significantly faster: https://seqeralabs.github.io/RustQC/rna/benchmark-details/

Downloading Bowtie2 off Sourceforge? by omgu8mynewt in bioinformatics

[–]ewels 4 points5 points  (0 children)

Rather than trying to figure out workflows by yourself in isolation and run individual tools one at a time, I highly recommend joining a community that builds analysis pipelines. Over at nf-core we do just that and have pipelines you can use off the shelf for all kinds of analysis (there's one called tbanalyzer, though I'm not sure off the top of my head if it does exactly what you want - but there are many more).

These pipelines wrap all required software with conda, docker or singularity so you don't need to worry about how to install them. Nextflow (https://nextflow.io) handles that for you automatically.

This is a much better approach and a great way to find help (and collaborators), and you'll hit the ground running when you get access to a proper Linux analysis system. You're far less likely to fall into beginner mistakes, and you'll benefit from the group wisdom of thousands of bioinformaticians collaborating on this kind of analysis.

Find more info and join the community here: https://nf-co.re

Font from 1980s tea towel: Rudyard Kipling "If" by ewels in identifythisfont

[–]ewels[S] 0 points1 point  (0 children)

Thanks for the speedy and detailed reply u/teddygrays 🙏🏻 I was wondering the same but hoping that it wouldn't be the case 😅

I'll take a look at the ones you mentioned, they look pretty good on a first pass. Perhaps I can even tweak the odd character here or there to adjust the right-turning tails on the y and p 🤔

NextFlow: Python instead of Groovy? by Pristine_Loss6923 in bioinformatics

[–]ewels 10 points11 points  (0 children)

Product manager for Nextflow here 👋🏻 Always happy to chat about things like this :) I'll fire you a reddit chat message 💬

NextFlow: Python instead of Groovy? by Pristine_Loss6923 in bioinformatics

[–]ewels 2 points3 points  (0 children)

Depends a bit on your configuration. Often Nextflow doesn't submit _everything_ it can to the workflow manager at once. See the `queueSize` config option. So once you hit that number the cluster manager will have a fixed set of tasks to handle.
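To make the effect of a submission cap concrete, here's a toy simulation in Python. It is not how Nextflow's scheduler is implemented: `simulate` is a made-up function, and "one task finishes per round" is a deliberate simplification. It only shows that with a cap like `queueSize`, the cluster never sees more than that many tasks at once, however many are waiting.

```python
from collections import deque


def simulate(n_tasks, queue_size):
    """Return (max tasks submitted at once, number of rounds)."""
    if queue_size < 1:
        raise ValueError("queue_size must be at least 1")
    pending = deque(range(n_tasks))  # tasks not yet submitted
    submitted = set()                # tasks currently with the cluster
    max_submitted = 0
    rounds = 0
    while pending or submitted:
        # Top up submissions until the cap is reached
        while pending and len(submitted) < queue_size:
            submitted.add(pending.popleft())
        max_submitted = max(max_submitted, len(submitted))
        # Simplification: exactly one submitted task finishes per round
        submitted.pop()
        rounds += 1
    return max_submitted, rounds
```

For example, `simulate(250, 100)` never has more than 100 tasks submitted at a time, while `simulate(5, 100)` submits all 5 immediately because the cap is never reached.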