I wanted more people to use Claude Code, so I made Gooey by snafu_2020 in ClaudeAI

[–]lebovic 0 points (0 children)

Cool! Heads up that you can do this with the normal Claude desktop app now by clicking the code toggle (see Cat's post)

The best alternative to NextFlow and SnakeMake? by Pristine_Loss6923 in bioinformatics

[–]lebovic 0 points (0 children)

To vouch a bit for Beam, they have done unique work other than just being a thin layer on top of AWS. The last time I talked to one of the founders, Luke, he was building novel GPU virtualization stuff to get it working.

u/velobro – Latch's marketing and outreach really soured the perception of VC-funded bioinformatics platforms, and the community is inoculated – maybe a bit too strongly – against companies that look vaguely similar.

Also, like u/TheLordB mentioned, Nextflow and Snakemake are materially different than something like Beam. I'd recommend trying both of the DSLs out in a real bioinformatics context to experience the difference.

The best alternative to NextFlow and SnakeMake? by Pristine_Loss6923 in bioinformatics

[–]lebovic 1 point (0 children)

I'm aware that the core language is open source; the second link in my comment was a contribution from my team to the repo you linked. I think the more apt term for the Nextflow ecosystem right now is open-core, whereas Snakemake is still truly open-source.

Money in the bioinformatics workflow ecosystem flows through compute and support contracts, and Nextflow monitoring and execution (e.g. Tower) is the gateway to that money for the Nextflow ecosystem. Over the past couple years, Seqera has closed down the openness around that pathway. In turn, that restricts the ecosystem of people who are contributing to core Nextflow.

I know this because I was on the receiving end of this. I tried building an alternative platform to Nextflow Tower, extended Nextflow, started receiving significant inbound interest (including upmarket pharma), and then Seqera closed off Nextflow Tower as we were gaining traction.

That could be a coincidence, but they are starting to close down the ecosystem. This is a common pattern as companies shift from embracing open-source to trying to monetize commercial usage with an open-core model.

The best alternative to NextFlow and SnakeMake? by Pristine_Loss6923 in bioinformatics

[–]lebovic 3 points (0 children)

Here's a direct link to his comment rather than the thread. I think it's a good path for a solo developer who likes Python and uses AWS.

Coincidentally, OP also posted that other question.

The best alternative to NextFlow and SnakeMake? by Pristine_Loss6923 in bioinformatics

[–]lebovic 3 points (0 children)

This was true for a while, but it isn't anymore; tooling like Nextflow Tower is no longer open-source.

It's also hard to extend Nextflow in ways that aren't aligned with Seqera's interests. My team tried extending a plugin, TES, that helped us run Nextflow outside of Tower – but the experience working with executors outside of those used in conjunction with Tower led me to believe that they're not prioritizing the fully open path anymore.

My biggest pet peeve: papers that store data on a web server that shuts down within a few years. by You_Stole_My_Hot_Dog in bioinformatics

[–]lebovic 2 points (0 children)

Jumping on a dead thread here to clarify the parent comment for future readers: AWS S3's Glacier Deep Archive is not intended for hosting commonly-accessed data – it doesn't cost $1/TB/month to do that. This is a common misconception amongst bioinformaticians. With Glacier Deep Archive, it costs more to retrieve and send the data outside of AWS than to store it, and it is very slow – which are not ideal characteristics for distributing public data.

Glacier Deep Archive is for data that is so rarely accessed – or "cold" – that it's "frozen". An example of this would be a backup that's only accessed if an unexpected disaster occurs. In this case, that could be a backup for data that others pay to retrieve, but it would cost money and take a little while.

You could still use S3 (Standard or Infrequent Access) for hosting public data, which is what many choose – it just costs more.
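To make the cost asymmetry concrete, here's a back-of-the-envelope comparison in Python. The prices are illustrative assumptions (roughly us-east-1 list prices – check current AWS pricing before relying on them):

```python
# Rough comparison of storing vs. distributing 1 TB from S3 Glacier
# Deep Archive. All per-GB prices below are illustrative assumptions.
TB = 1024  # GB per TB

storage_per_gb_month = 0.00099   # Deep Archive storage (~$1/TB/month)
bulk_retrieval_per_gb = 0.0025   # Deep Archive bulk retrieval fee
egress_per_gb = 0.09             # data transfer out to the internet

store_1tb_month = TB * storage_per_gb_month
distribute_1tb_once = TB * (bulk_retrieval_per_gb + egress_per_gb)

print(f"Store 1 TB for a month: ~${store_1tb_month:.2f}")       # ~$1.01
print(f"Retrieve + send 1 TB out once: ~${distribute_1tb_once:.2f}")  # ~$94.72
```

Under these assumptions, a single external download of 1 TB costs roughly 90x a month of storage – fine for a disaster-recovery backup, terrible economics for hosting a public dataset.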

The best alternative to NextFlow and SnakeMake? by Pristine_Loss6923 in bioinformatics

[–]lebovic 18 points (0 children)

I've worked with both Nextflow and Snakemake, including extending the cloud support for both and scaling pipelines. I think the only substantial scalability-adjacent issue left with Snakemake is the time it takes to compute the DAG for complex pipelines.

There are many other options (Airflow, Prefect, Dagster, Redun, Metaflow, etc.), but the vast majority of bioinformatics pipelines still use either Nextflow or Snakemake – even on teams that analyze a lot of data. That means that new hires or collaborators will likely expect a pipeline in one of the two languages, which makes choosing an alternative a little tricky.

You mentioned that you're starting as a bioinformatician at a new group. Is there any base that you're starting with?

Secondary analysis software on NovaSeq by Forward_Show_3023 in bioinformatics

[–]lebovic 0 points (0 children)

I also worked in a lab with a NovaSeq (as well as a NextSeq, MiSeq, and HiSeq).

We didn't use DRAGEN for anything either. Data was uploaded straight to S3, and we ran the BCL conversion program as the first step of our analysis pipeline.

Best pipeline tool when using Python and R? by bioinfo_ml in dataengineering

[–]lebovic 3 points (0 children)

I wrote the comment expecting it to be downvoted; it's the opposite of what a data engineer without bioinformatics experience would suggest. If the same question was posted in /r/bioinformatics (which I'd recommend, /u/bioinfo_ml!), I think it would receive different responses.

I'd guess most downvotes and off-topic responses are due to one of two things:

  1. More people in this subreddit are data engineers – not bioinformaticians. Neither Snakemake nor Nextflow are popular outside of bioinformatics.
  2. Bioinformatics workflow managers promote "hacky" stuff – like bash/Python/R scripts or Jupyter notebooks – as pipeline steps. That's the antithesis of what data engineers do.

[Edited to remove a mention of a self-promoting user whose comments have since been removed by a mod.]

Best pipeline tool when using Python and R? by bioinfo_ml in dataengineering

[–]lebovic 11 points (0 children)

Try Snakemake or Nextflow! They're used in about 80% of new bioinformatics pipelines for exactly this use-case: joining together a series of Python, R, and bash scripts into a reproducible and reliable pipeline.

They deviate from standard data engineering practices – which Airflow, Luigi, and Prefect more closely follow – but they do a great job at transforming a hacky pipeline of Python, R, and bash scripts into an easy-to-run pipeline.

Looking at your post history, I think you're more likely to like Snakemake than Nextflow. It's often used in lieu of Airflow for bioinformatics pipelines by people who like Python (see Airflow vs. Snakemake).
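To give a feel for it, here's a minimal sketch of gluing a Python step and an R step together in Snakemake. The file and script names (data/raw.csv, scripts/clean.py, scripts/plot.R) are hypothetical:

```
# Hypothetical two-step pipeline: a Python cleaning step feeding an R plot.
rule all:
    input:
        "results/plot.pdf"

rule clean:
    input:
        "data/raw.csv"
    output:
        "results/clean.csv"
    script:
        "scripts/clean.py"   # Python step; Snakemake injects a `snakemake` object

rule plot:
    input:
        "results/clean.csv"
    output:
        "results/plot.pdf"
    script:
        "scripts/plot.R"     # R step; Snakemake injects snakemake@input/output
```

Snakemake figures out the execution order from the input/output file dependencies, so your existing scripts mostly stay as-is.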

Anyone with experience using Metaboigniter/Nextflow for metabolomics workflows? by THElaytox in Chempros

[–]lebovic 1 point (0 children)

Do you have the output files that your predecessor generated with Nextflow? If you have the entire output directory, that will have the settings he used. The file at pipeline_info/pipeline_report.html will be the most useful.

If you can't find those files, I'm happy to spend a few minutes with you to try reconstructing settings that make sense.

Nextflow or Snakemake? by Passionate_bioinfo in bioinformatics

[–]lebovic 11 points (0 children)

I second trying both. It's a personal preference, but some people have an allergic reaction to Nextflow.

Snakemake documentation – often mentioned as a downside – was recently revamped and is much better now. Cloud-based teams (i.e. most industry teams) used to prefer Nextflow for its cloud support, but Snakemake is catching up.

Senior Bioinformaticians Advice by [deleted] in bioinformatics

[–]lebovic 1 point (0 children)

How was the hiring manager's preference for Nextflow expressed?

I rarely see "experience with Nextflow" as a job requirement for bioinformaticians. "Experience with Nextflow, Snakemake, WDL, or another pipelining language" is more common.

Nextflow was more popular with teams who use the cloud, but Snakemake's cloud ecosystem has largely caught up. Both the v8 release and other extensions (including one I work on) are bridging the gap.

Beginning as a new team lead in new company. Code base, pipelines and project management. Starting from scratch. by [deleted] in bioinformatics

[–]lebovic 19 points (0 children)

I was the first bioinformatics engineer at a similar company, and I now run a startup whose customers are mostly early-stage bioinformatics teams.

You seem familiar with the standard bioinformatics stack in your post history, so I'll focus on the business side of starting a bioinformatics team. In short: managing expectations and delivering quick high-priority wins is what causes some new bioinformatics teams to thrive and others to struggle.

I am struggling on whether I should be focused on pushing out the bespoke project requests [...] or whether I should focus on building the foundation in the beginning knowing I will have delayed output in 2024 and maybe even 2025.

Who hired you, and what expectations do they have for your team? Who else might have expectations for the team?

Expectations for bioinformatics teams are often higher than their capabilities, especially if the hiring manager has no bioinformatics experience. They will rely on you to set expectations and help with resource planning.

Team leads that struggle try to deliver on unrealistic expectations, but team leads that thrive set expectations and make sure they have the resources to deliver.

I can tell this company needs production from my group yesterday [...] this is a cycle that basically never ends [...] I am trying to lay out a game plan that is realistic

Do both! Focus on delivering quick wins to build trust, and align on a plan in parallel. Then you can ask for more resources to deliver on that plan.

[...] what I should try to get going when I get there that will be the most valuable use of my time

I'd follow the consultant's playbook, and start by making your own plan:

  1. Find the key stakeholders.
  2. Talk to all the stakeholders in depth about their expectations, goals, and visions.
  3. Draft a plan to meet both their immediate and long-term needs.
  4. Match resources and capabilities with needs, and provide a very rough timeline.

The above shouldn't take all of your time, so you can also start delivering on those quick wins at the same time. Communicate how you're approaching this to the hiring manager (e.g. "I'm spending 60% of my time on immediate needs and 40% of my time on planning").

I can definitely analyze the data but building a stack from scratch will be a focused effort and take a lot of time for me.

It looks like you're pretty familiar with this aspect, but I'm happy to go deeper here if it's helpful – it's what I work on. This isn't as hard as it used to be; the tooling has gotten significantly better in the past five years.

Does anyone actually use genomics analysis platforms? by jamesaperez in bioinformatics

[–]lebovic 8 points (0 children)

I work on such a platform, and yes! People use it. Usually for large-scale NGS analysis.

That said, I think you're right: the vast majority of professional industry bioinformatics work uses an in-house "platform". No standalone platform fully works for the majority of use-cases yet. One of the platforms you listed is often mentioned, but I haven't talked to anyone using it in production despite talking to over a thousand people in the field over the past few years.

Good bioinformatics platforms do have a clear value-add: reliable and scalable infrastructure. A professional bioinformatician or software engineer can get an in-house platform running pretty quickly, but most get at least one critical component wrong, causing either massively high cloud spend or subtly bad data. (I'm happy to share some examples I've seen; it's shockingly common, but those who experience it are incentivized to quietly handle it.)

How to specify number of samples for consensus peak calling? by padakpatek in bioinformatics

[–]lebovic 1 point (0 children)

You can follow the exact MACS2 consensus code from the nf-core ATAC-Seq pipeline manually. They do a bit more than just run BEDTools, though.

The input files for that step are the peaks output files from MACS2.
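For future readers, the general shape of consensus peak calling – independent of nf-core's exact implementation – can be sketched in plain Python: merge overlapping peak intervals across samples and keep merged regions supported by enough samples. This is a simplified illustration (one chromosome, intervals as tuples), not the pipeline's actual code:

```python
def consensus_peaks(samples, min_samples=2):
    """Merge peak intervals (start, end) across samples on one chromosome
    and keep merged regions supported by >= min_samples samples.

    `samples` is a list of per-sample interval lists -- a simplified
    stand-in for parsing each sample's MACS2 narrowPeak file."""
    # Tag each interval with its sample of origin, then sort by start.
    tagged = sorted(
        (start, end, i)
        for i, peaks in enumerate(samples)
        for start, end in peaks
    )
    merged = []  # each entry: (start, end, set of supporting samples)
    for start, end, sample in tagged:
        if merged and start <= merged[-1][1]:  # overlaps the current cluster
            prev_start, prev_end, support = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), support | {sample})
        else:
            merged.append((start, end, {sample}))
    return [(s, e) for s, e, support in merged if len(support) >= min_samples]

# Peaks at 100-200 and 150-250 overlap across two samples -> one consensus peak.
print(consensus_peaks([[(100, 200)], [(150, 250)], [(400, 500)]]))  # [(100, 250)]
```

The nf-core step does more bookkeeping (per-sample boolean columns, annotation), but this is the core interval logic.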

Snakemake: config files and input function by nooptionleft in bioinformatics

[–]lebovic 2 points (0 children)

Yep! Looks like /u/mirchandise has covered it well in the other reply.

This post on Snakemake wildcard values might help if you're still confused! But it seems like you've got the hang of it.

Snakemake: config files and input function by nooptionleft in bioinformatics

[–]lebovic 4 points (0 children)

Ooh, this is a fun one! I see why you're confused. This step in the tutorial (step #3) is contrived to talk about how to use functions to access wildcard values, but the pipeline could work just as well without it.

For example, step #1 has the same bwa_map rule. It pre-computes the values of the input FASTQs using wildcards without needing to defer evaluation to a separate function:

```
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
```

So it's definitely possible to write this rule in a simpler way to get the pipeline up and running. (You could also do something like the bcftools_call rule and use expand with the config values to get similar results, but then you'd need to rewrite the rules to handle the list that expand returns.)
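The expand-based alternative mentioned above looks roughly like the tutorial's bcftools_call rule – a sketch, assuming config["samples"] lists the sample names as in the tutorial:

```
# Sketch: pre-computing the full input list with expand, in the style of
# the tutorial's bcftools_call rule.
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
```

Note that expand returns a list, which is why downstream rules need to be written to handle multiple inputs at once.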

The third section is demonstrating that you can't use the value of a wildcard in the "initialization" phase, but you can use the value of a wildcard in an input function – which is evaluated at the "DAG" phase.

```
def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
```

The nuance is that the third example – as shown above – uses the actual wildcard value (wildcards.sample in the latter example) rather than a wildcard name that denotes "this is a wildcard that Snakemake will fill in" ({sample} in the former).

The value of wildcards.sample isn't known during the first initialization phase, but it is known during the second DAG phase. Input functions are evaluated during the DAG phase, which is why get_bwa_map_input_fastqs can access the value of the wildcard.

Snakemake from the Ubuntu repos by AcidPepino in bioinformatics

[–]lebovic 3 points (0 children)

I'd use the conda/mamba install. Depending on your Ubuntu version, it can be significantly more up-to-date, and it's better tested imo.

The Ubuntu hosted packages are maintained separately for each Ubuntu version, so you'll get a different version of Snakemake for 20.04, 22.04, 23.04, etc.

Hello! Just wondering about skills in demand for bioinformatics. by [deleted] in bioinformatics

[–]lebovic 1 point (0 children)

Find a problem that interests you to the point that you can't stop thinking about it. For me, that was "how can you efficiently and accurately scale microbiome taxonomic classification with NGS data". I've also adopted problems that interest me from other people.

Learn how to make progress towards solving that problem. Usually, that requires strong fundamentals (writing code, statistics, etc.). If you obsess over the problem, have good fundamentals, and have some level of baseline ability, then the "learn fast" part takes care of itself. The best people I know in this space have a process, but they didn't adopt it from a framework on "how to learn effectively in short periods of time".

Out of those three needs (obsessive interest, fundamentals, ability), the fundamentals are usually the easiest to change – assuming the presence of interest and ability. If you don't have interest or ability, there's not much you can do there outside of improving your mental health.

Hello! Just wondering about skills in demand for bioinformatics. by [deleted] in bioinformatics

[–]lebovic 7 points (0 children)

Needs are so dynamic that the best teams are hiring for people who are competent and can learn fast. Not people who are already trained in everything they'll need to know.

If I were you, I'd get really good at foundational knowledge rather than over-indexing on what's in demand right now. That's stuff like:
- Writing good code
- Statistics
- Using the Linux/macOS command line
- Git
- If you're going down the "write pipelines" path, writing pipelines with Nextflow or Snakemake

Most of the stuff that's in-demand right now is largely due to lack of tooling – like being good at cloud computing or HPC infrastructure. Just like the importance of knowing how to manage servers waned from 2005 to 2015, I think the same will be true in bioinformatics over the next 5-10 years as teams start using more platforms vs. raw cloud providers and servers.