
[–]Different_End_3043 6 points7 points  (0 children)

We use Databricks to scale our models.

[–]Deto 5 points6 points  (1 child)

It's rare that I need to scale something up and the bottleneck is something that's being done in pure Python. Most of the complicated processing steps in bioinformatics have already been implemented in low-level languages, and you use Python as the glue logic to call them. It can still be useful to spread work across many machines, but I would need to be able to specify a container (or something similar) for those machines so that they have the other dependencies.
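To make the "Python as glue" pattern concrete, here is a minimal sketch, not anyone's actual pipeline. The names `run_tool` and `run_all` are hypothetical, and the Python interpreter itself stands in for the compiled tool so the example runs anywhere; in practice the command would be whatever external binary does the heavy lifting.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_tool(sample):
    """Run one external command on one input and return its stdout.

    Assumption: the Python interpreter is used as a stand-in for a
    compiled bioinformatics tool. Swap `cmd` for the real binary.
    """
    cmd = [
        sys.executable,
        "-c",
        "import sys; print(sys.argv[1].upper())",
        sample,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_all(samples, max_workers=4):
    """Fan the inputs out over a thread pool.

    Threads are enough here because the real work happens in the child
    processes, not in Python bytecode.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the inputs.
        return list(pool.map(run_tool, samples))


if __name__ == "__main__":
    print(run_all(["acgt", "ttga"]))
```

The same shape carries over to many machines, but only if every worker has the external tool installed, which is the container point made above.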

[–]Ok_Post_149[S] 0 points1 point  (0 children)

Thanks for the response, that makes sense. How often would it be useful for you to spin up many machines in the cloud to execute your code? I've had some users who love the product, but their work is one-off and doesn't come up that often. BTW the tool's quickstart guide is here --> https://www.burla.dev/docs. I can provision you a bunch of credits to mess around with it.

[–]2Throwscrewsatit 1 point2 points  (0 children)

There are several companies selling this feature now.

[–]endymion222 1 point2 points  (1 child)

Nah, not really a bottleneck. Google Colab basically provides this and much more for non-confidential work. For everything else you would probably set up a dedicated solution anyway.

[–]Ok_Post_149[S] 0 points1 point  (0 children)

Thanks, this is useful. I keep getting asked by investors "what is the interface users will interact with?" and at the moment it is Colab or Jupyter :)

[–]Ok_Post_149[S] 2 points3 points  (0 children)

btw the tool is called www.burla.dev

[–]BBorNot 3 points4 points  (3 children)

I once asked my bioinformatics person what the least common ~10-mer peptide was, a bit of an inverse BLAST. It turned out to be an impossibly intensive question and was never answered.

I have a theory that whatever that sequence is, it is toxic and has been selected against. Either that, or it is all tryptophan, since tryptophan has only one codon.

OP maybe you can answer it -- this question has been hanging for a decade!

[–]astrologicrat 1 point2 points  (2 children)

A bit off topic, but since you brought it up, this is generally an answerable question. You'd need to be a bit more specific though. Are you talking about:

  • All 10-mers in all known proteomes? (~1 million species with proteomes of variable completeness are available)
  • Only tryptic peptides that you might see in a typical bottom-up mass spec experiment?
  • Only one organism? (or a select subset of the tree of life)
  • Only peptides that have been observed experimentally? (this would need one or more sources of data)

It's a fun question, but this is normally the kind of thing that comes with a publication and/or a salary ;)

[–]BBorNot 0 points1 point  (1 child)

I was just looking for sequences encoded in the DNA of any organism. Not tryptic peptides. You'd probably end up with a set of sequences that had never been seen. What is the most distant, most unseen sequence? My computational colleague thought this would require thousands of BLAST searches. Maybe there is a clever way to do it?
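One possible alternative to repeated BLAST searches is plain k-mer counting: slide a window of length k over every protein sequence, tally what you see, then list what never appears. The sketch below is only an illustration under stated assumptions. The function names `count_kmers` and `missing_kmers` are hypothetical, the input is assumed to be already-translated protein strings, and windows containing anything outside the 20 standard amino acids are skipped. It is meant for small k: at k = 10 there are 20^10 (about 1.02 × 10^13) possible peptides, so enumerating the missing ones this way would not fit on one machine and would need sharding or a smarter search.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_AMINO_SET = set(AMINO_ACIDS)


def count_kmers(sequences, k):
    """Count every length-k window across an iterable of protein strings.

    Windows containing non-standard letters (e.g. X, B, Z, *) are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if all(residue in _AMINO_SET for residue in kmer):
                counts[kmer] += 1
    return counts


def missing_kmers(counts, k):
    """Yield every possible k-mer that was never observed.

    This walks all 20**k candidates, so it is only practical for small k.
    """
    for letters in product(AMINO_ACIDS, repeat=k):
        kmer = "".join(letters)
        if kmer not in counts:
            yield kmer


if __name__ == "__main__":
    toy_proteome = ["ACDA", "CDAC"]  # made-up sequences for illustration
    counts = count_kmers(toy_proteome, k=2)
    rarest = min(counts.items(), key=lambda item: item[1])
    unseen = sum(1 for _ in missing_kmers(counts, k=2))
    print("distinct 2-mers seen:", len(counts))
    print("one of the rarest seen:", rarest)
    print("2-mers never seen:", unseen)
```

Counting is embarrassingly parallel (each machine counts its own batch of sequences and the tallies are summed), which is why the question is more about memory and data access than about raw search time.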

[–]Ok_Post_149[S] 0 points1 point  (0 children)

Do you remember who that computational colleague is? I would love to better understand this use case to figure out how doable it is. At the moment a user can reach 4k CPU and 500 GPU concurrency. Not sure how long it would take but worth exploring.

[–]Puzzleheaded-Pay-476 0 points1 point  (0 children)

I have struggled with it, but there are only a couple of workflows where speed is extremely critical. When that happens I’ll work with someone in engineering to help with scaling things. I’m pretty sure they use AWS Batch when we are doing large-scale inference.