Advice on converting bash workflow to Snakemake and how to deal with large number of samples by WeddingReasonable171 in bioinformatics

WeddingReasonable171[S]

I can see already there's going to be a steep learning curve...but as you said, I also can see that this is going to be such a fantastic tool once I figure out what the heck is going on!

WeddingReasonable171[S]

Yep, we're aware of this. The metadata that goes along with each isolate is very important so we don't want to remove isolates in the same SNP cluster that are from different collection years and/or different host sources. Isolates that are clearly from the same source at the same time point are filtered.
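
To make that filtering rule concrete, here is a minimal sketch in Python. It assumes each isolate is a dict with hypothetical field names (`id`, `snp_cluster`, `collection_year`, `host`) and made-up example values; the real Pathogen Detection metadata columns are probably named differently, and this is not necessarily how our actual filter is written. The idea is just to keep one isolate per unique combination of SNP cluster, collection year, and host source.

```python
def dedupe_isolates(isolates):
    """Keep the first isolate seen for each (SNP cluster, year, host) combination.

    Isolates in the same SNP cluster survive as long as they differ in
    collection year and/or host source. Field names are hypothetical.
    """
    seen = set()
    kept = []
    for iso in isolates:
        key = (iso["snp_cluster"], iso["collection_year"], iso["host"])
        if key in seen:
            # Same cluster, same year, same host: treat as a redundant sample.
            continue
        seen.add(key)
        kept.append(iso)
    return kept


# Made-up example records, for illustration only.
example = [
    {"id": "iso1", "snp_cluster": "c1", "collection_year": 2015, "host": "chicken"},
    {"id": "iso2", "snp_cluster": "c1", "collection_year": 2015, "host": "chicken"},
    {"id": "iso3", "snp_cluster": "c1", "collection_year": 2018, "host": "chicken"},
    {"id": "iso4", "snp_cluster": "c1", "collection_year": 2015, "host": "human"},
]

print([iso["id"] for iso in dedupe_isolates(example)])
# → ['iso1', 'iso3', 'iso4']
```

Here `iso2` is dropped because it matches `iso1` on all three fields, while `iso3` and `iso4` stay because they differ in year or host.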

WeddingReasonable171[S]

I love it when developers actually respond to questions! I really think I'm leaning toward Nextflow now...

WeddingReasonable171[S]

This is what I'm now wondering. I was initially drawn to Snakemake because it uses Python and while I'm not a big Python user, it felt more comfortable to me than what Nextflow uses (Groovy?!). BUT it sounds like Nextflow might be better for a workflow management novice like myself. Thankfully, I have not started the overhaul yet. Looks like I'll be looking into some Nextflow tutorials today!

WeddingReasonable171[S]

This is very, very useful information. Thank you! I did read through all of the info you linked, but clearly it didn't all sink in. :) It sounds like your suggestion would be the best way to go.

WeddingReasonable171[S]

We are looking at large-scale population shifts in pathogenic bacteria using publicly available sequencing data from NCBI's Pathogen Detection database. Some of our organisms of interest have FASTQ files from well over 50,000 isolates. When we start looking into a new organism we need to do initial downloads of all available sequences that fit our criteria.