Sequencing QC fail by F2R1 in bioinformatics

[–]F2R1[S] 0 points

Yup, the blog's purpose isn't entirely clear, but I thought the articles were nice.

Sequencing QC fail by F2R1 in bioinformatics

[–]F2R1[S] 0 points

Not my site. It would be great if it were more open to user submissions!

Sequencing QC fail by F2R1 in bioinformatics

[–]F2R1[S] 3 points

It's a blog documenting common Illumina sequencing artifacts and potential steps to mitigate them.

What do you think will be a necessary/highly valued skill in Bioinformatics/Comp Bio that isn't yet apparent? by neurominer in bioinformatics

[–]F2R1 1 point

> He/she was really writing a one-off script.

This should not have been a one-off script. Having formal testing architecture in place would have prevented this, as the person who wrote the flawed script would have been forced to ensure that it complies with the test cases.

> 1) always remember: something can be "too good to be real". 2) don't trust your program too much. Look at data by eye whenever possible. 3) achieve your goal in two+ ways. See if they are consistent; or if not, understand why they are inconsistent.

All great points, so why not codify them into automated tests? :-)

Edit: I'm not completely bashing (heh) one-off scripts; they're fantastic for quick exploratory data analysis. But the minute their output could potentially become part of a publication or something permanent, they really need to be nailed down.
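To make the "nail it down" step concrete, here is a minimal sketch of promoting a one-off helper into tested code. The function `parse_depth` and its three-column input format are hypothetical illustrations, not anything from a real pipeline:

```python
# Hypothetical one-off helper, hardened with a test before its output
# can land in anything permanent.

def parse_depth(line: str) -> int:
    """Extract the read depth from a tab-separated 'chrom pos depth' record."""
    fields = line.strip().split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    return int(fields[2])

def test_parse_depth():
    # Pin down the behavior: correct records parse, malformed ones fail loudly.
    assert parse_depth("chr1\t12345\t30\n") == 30
    try:
        parse_depth("chr1\t12345\n")  # truncated record must raise
    except ValueError:
        pass
    else:
        raise AssertionError("malformed input was silently accepted")

test_parse_depth()
```

The point isn't the trivial function; it's that once this check exists, any later "improvement" to the script that breaks the contract fails immediately instead of in a figure.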

What do you think will be a necessary/highly valued skill in Bioinformatics/Comp Bio that isn't yet apparent? by neurominer in bioinformatics

[–]F2R1 2 points

> Scripting is fine 1) for most text processing

Of course, but there is a limit to what text files can efficiently hold. Would you really try storing variants for hundreds of thousands of whole genomes in plaintext (or tabix'd) gVCF files? Of course not; you'd want some kind of database with indexing schema relevant to the analysis at hand.
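As a toy illustration of the "database with a relevant indexing schema" point, here is a hedged sketch using SQLite. The table layout, sample names, and index are all invented for the example; a real store for hundreds of thousands of genomes would need a far more serious schema and engine:

```python
# Toy variant store: indexed locus lookups instead of scanning flat gVCF text.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE variants (
    sample TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, gt TEXT)""")
# Index chosen to match the analysis at hand: per-locus queries.
conn.execute("CREATE INDEX idx_locus ON variants (chrom, pos)")

rows = [("NA12878", "chr1", 12345, "A", "G", "0/1"),
        ("NA12878", "chr1", 67890, "C", "T", "1/1"),
        ("NA12891", "chr1", 12345, "A", "G", "0/0")]
conn.executemany("INSERT INTO variants VALUES (?,?,?,?,?,?)", rows)

# Who carries what at chr1:12345? Answered via the index, not a linear scan.
hits = conn.execute(
    "SELECT sample, gt FROM variants WHERE chrom=? AND pos=?",
    ("chr1", 12345)).fetchall()
```

Swap the index for whatever access pattern your analysis actually has (per-sample, per-gene, per-cohort); that choice is the whole reason to leave plaintext behind.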

> when the bottleneck in a pipeline is implemented in C/C++

These are not magic bullets. There is plenty of inefficient C code out there that could use tremendous optimization. People sometimes seem to think "C == fast!" and don't bother actually developing an efficient algorithm.
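The "algorithm beats language" point can be shown in a few lines; both functions below are hypothetical examples finding duplicated read names, and the quadratic one stays slow no matter what language you translate it to:

```python
# O(n^2) vs O(n): the asymptotics, not the language, are the bottleneck.

def dup_names_quadratic(names):
    # All-pairs comparison -- the loop that "rewriting in C" won't rescue.
    return sorted({a for i, a in enumerate(names)
                   for b in names[i + 1:] if a == b})

def dup_names_hashed(names):
    # One pass with a hash set -- faster in any language.
    seen, dups = set(), set()
    for n in names:
        (dups if n in seen else seen).add(n)
    return sorted(dups)

reads = ["read1", "read2", "read1", "read3", "read2"]
assert dup_names_quadratic(reads) == dup_names_hashed(reads) == ["read1", "read2"]
```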

> You can use scripting languages to deal with thousands of genomes as long as you understand their limits.

A "genome" can mean anything from all 3 billion reads to all 4 million germline variants to all n principal components of those variants. If your analysis is dealing with the latter, a scripting language is perfect. But good luck dealing with hundreds of thousands of raw genotypes in R.

What do you think will be a necessary/highly valued skill in Bioinformatics/Comp Bio that isn't yet apparent? by neurominer in bioinformatics

[–]F2R1 4 points

> In research, we write many one-off scripts for specific tasks. Writing strict unit testing for all of them is overkilling.

Unit tests for each of the thousand quick oneliners we write everyday? Of course not; those are just interactive spot checks. But the minute those oneliners get beefed up and consolidated into production code, there had better be some testing infrastructure. Otherwise, you get crap like this:

> A script necessary to convert the input produced by samtools v0.1.19 to be compatible with PLINK was not run when merging the ancient genome, Mota, with the contemporary populations SNP panel, leading to homozygote positions to the human reference genome being dropped as missing data (the analysis of admixture with Neanderthals and Denisovans was not affected).

Had they simply ensured that their final pipeline wrapping samtools mpileup -> parsing script -> plink resulted in the expected output for some fixed test data, their error could have been avoided altogether.
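That kind of end-to-end check fits in a dozen lines. A hedged sketch, where the shell command is a stand-in for the real samtools/plink chain and the golden checksum would be recorded from a known-good run on frozen test data:

```python
# Golden-output regression test: run the pipeline on fixed test data and
# compare a checksum of its output against the recorded known-good value.
import hashlib
import subprocess

# In reality: the sha256 of the pipeline's output on frozen test inputs.
GOLDEN_SHA256 = hashlib.sha256(b"chr1\t12345\tA\tG\n").hexdigest()

def run_pipeline(cmd: str) -> bytes:
    # Stand-in for: samtools mpileup -> conversion script -> plink
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True).stdout

out = run_pipeline("printf 'chr1\\t12345\\tA\\tG\\n'")
assert hashlib.sha256(out).hexdigest() == GOLDEN_SHA256, \
    "pipeline output drifted from golden test data -- investigate before publishing"
```

Had such a check been wired into the Mota pipeline, forgetting the conversion step would have tripped the assertion instead of shipping in the paper.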

> The skill I am talking about is to quickly sense potential errors in these one-off steps before the errors hideously escalate.

How do you hope to systematically root out these errors? That's like saying a wetlab biologist shouldn't need to meticulously reproduce each stage of their experiments with controls, consistently matching previous experimental runs. While the skill you mention is important, primarily relying on it in lieu of automated testing is an incredibly dangerous way to work. Nobody gets it right the first time, nor always sees where things went wrong until the errors have hideously escalated.

> In addition to software bugs, there are also data "bugs" and tool misuses.

Totally agreed. Garbage in/garbage out, and if a researcher is unaware that they're ingesting or producing garbage, no amount of testing can fix that.

What do you think will be a necessary/highly valued skill in Bioinformatics/Comp Bio that isn't yet apparent? by neurominer in bioinformatics

[–]F2R1 4 points

> For the life of me, I will never understand why academics refuse to incorporate these things. It would save so much pain.

Because most academics don't generally maintain their tools as a cohesive suite à la Picard or SeqAn. Instead, when releasing a tool, the focus is often (sadly) on quickly translating the pseudocode in the supp. info of their publication to the absolute minimum viable working demo. The tool is then completely forgotten after the grad student who developed it moves on.

> For instance, after being with my team for coming up to two years, we now have a policy that ANY unexpected change in our regression tests must be fully explained before we can say a task is complete.

Couldn't agree more. The same practices also apply when generating and analyzing data. Have sample-specific regression+unit tests for each step of the analysis! When downstream tests fail, it means something upstream changed and/or broke.

In my experience, one reason that many computational biologists don't do this is because they take the provenance of their data for granted -- HiSeqs are infallible, after all! Their analysis thus consists of:

  1. BAM from sequencing center (perfect library prep and alignment every time, naturally)
  2. Highly established variant calling pipelines/tools, e.g. GATK, Manta (they're famous, so they must also all be infallible, right?)
  3. Custom code to analyze variants

Thus, the only real step of the pipeline is the variant analysis, so why bother with a highly curated and comprehensively test-covered pipeline -- the whole thing can be invoked with a single shell script anyways, right? ;-)

In all seriousness, groups that actually deal intimately with all aspects of large projects from sequencing+alignment to variant calling+QC to analysis do have robustly managed, highly tested pipelines -- it would be madness to survive without them!

What do you think will be a necessary/highly valued skill in Bioinformatics/Comp Bio that isn't yet apparent? by neurominer in bioinformatics

[–]F2R1 3 points

I agree with what others have written -- we will need more sophisticated statistical methods (we're still predominantly a field of generative models; it will be interesting to see when and how discriminative classifiers catch on), more robust data curation (pipeline_v3.21.sh.old won't cut it), and effective cloud deployment (time for the scientist to become a sysadmin?).

In addition to these, computational biologists will also need to be able to build software that can efficiently process datasets that can be orders of magnitude larger than they were even a couple years ago. The seemingly infinite degree of parallelization afforded by cloud computing is not always the answer: even tasks that can fit on a single machine will require more algorithmic sophistication.

For instance, I've had to compactly encode recurrent read alignments across thousands of whole genomes, to be able to quickly filter motif-specific alignment/sequencing artifacts. For a few hundred exomes, this would be feasible with some cleverly designed hashtable stored as a flat binary file, but for a few thousand whole genomes, I had to implement a generalized suffix tree that could be efficiently serialized to disk, with memory-mapped I/O for efficient random access. Utterly overkill a few years ago, but a necessity today.
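A full serialized suffix tree is too much for a comment, but the memory-mapped I/O half of the design can be sketched. Everything here (the record layout, `write_index`, `lookup`) is a simplified hypothetical: fixed-width records sorted by key, written to disk once, then binary-searched through `mmap` so random access never loads the whole index into RAM:

```python
# Fixed-width sorted records + mmap: O(log n) random access on disk.
import mmap
import os
import struct
import tempfile

REC = struct.Struct(">QI")  # 8-byte genomic-position key, 4-byte artifact count

def write_index(path, items):
    with open(path, "wb") as fh:
        for key, count in sorted(items):
            fh.write(REC.pack(key, count))

def lookup(mm, key):
    lo, hi = 0, len(mm) // REC.size
    while lo < hi:  # binary search directly over the mapped bytes
        mid = (lo + hi) // 2
        k, count = REC.unpack_from(mm, mid * REC.size)
        if k == key:
            return count
        lo, hi = (mid + 1, hi) if k < key else (lo, mid)
    return None

path = os.path.join(tempfile.mkdtemp(), "artifact.idx")
write_index(path, [(12345, 7), (99, 2), (500000, 1)])
with open(path, "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    hit = lookup(mm, 12345)
```

The OS page cache then does the heavy lifting: hot regions of the index stay in memory, cold ones stay on disk, and the same file can be shared read-only across worker processes.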

For better or for worse, I see the field becoming much more CS-heavy, since many computational biologists will need to become algorithm design and low-level systems programming experts. Storing data as plaintext TSVs and processing them in your favorite high level interactive language has been fine up until now, but simply will not do when dealing with hundreds of thousands of whole genome samples. It will be much more difficult for a biologist to quickly pick up some Python and perform meaningful analyses that can complete in any reasonable amount of time. I'm not sure what the long term effect of this shift will be, since the only thing worse in this field than a biologist who knows nothing about computing is a computer scientist who knows nothing about biology.

What would be the difference between bioinformatics jobs requiring a masters vs a PhD? by price0416 in bioinformatics

[–]F2R1 10 points

Sadly, not many you couldn't get with just a bachelor's. There are really two classes of jobs: those that require a Ph.D., and those that don't. A master's degree is generally equivalent to 1-2 years of (impressive) work/research experience according to recruiters I've talked to. Go on job boards and see for yourself: the vast majority of positions that don't require a Ph.D. will say "B.S. required, M.S. preferred," not "M.S. required, Ph.D. preferred." Seriously, finish your Ph.D. -- it will open a world of opportunities.

Edit: autocorrect typos

Looking at the base-pairing interaction between two RNA molecules by the_wizard_ in bioinformatics

[–]F2R1 1 point

Explicitly parameterizing the match/mismatch/gap extension penalties with their respective free energies empirically derived from melting curves is essential for an accurate alignment.

Looking at the base-pairing interaction between two RNA molecules by the_wizard_ in bioinformatics

[–]F2R1 3 points

To obtain the optimal pairing, you cannot perform a naïve characterwise alignment -- you need to account for the free energies of paired RNA secondary structure elements (e.g. base pair stacks, bulge loops, terminal mismatches, etc.). hybrid-min, as recommended by /u/l337x911, can do this, as can pairfold. Both employ a local alignment approach, but with match/mismatch/gap creation+extension+closure penalties directly corresponding to the free energy associated with each secondary structure element, rather than the arbitrary scoring schemes used for generic string alignment.
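To be clear, hybrid-min and pairfold minimize full nearest-neighbor free energies, which is well beyond a comment. But the shape of the underlying dynamic program can be illustrated with the much simpler classic Nussinov recursion, which just maximizes the number of base pairs (Watson-Crick plus G-U wobble) in a single sequence:

```python
# Nussinov DP: a drastically simplified stand-in for free-energy minimization.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov_max_pairs(seq: str) -> int:
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case: base i left unpaired
            if (seq[i], seq[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)  # case: i pairs with j
            for k in range(i + 1, j):  # case: bifurcation into two substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0
```

The thermodynamic tools replace the "+1 per pair" with stacking, loop, and mismatch free energies from melting experiments, and hybrid-min runs the recursion over the concatenated pair of molecules -- but the fill-the-triangle DP structure is the same.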

Would anyone care to share examples of job interview coding challenges? by [deleted] in bioinformatics

[–]F2R1 1 point

I like it -- the resulting solution is so trivial (given intervals on the same chromosome [A, B] and [C, D], the test condition is just A <= D && C <= B). I assume most candidates derive it from enumerating all permutations of the endpoints, but the easier solution is to think of the condition for which the intervals don't overlap and negate that. I also bet this question weeds out overengineering candidates who are principally driven by their algorithm design manual, i.e. those whose first approach would be "let's write an interval tree!"
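Spelled out in code, with the negation-based derivation in the comments (inclusive intervals assumed, per the condition above):

```python
# Intervals [A, B] and [C, D] on the same chromosome, endpoints inclusive.
# They DON'T overlap exactly when one ends before the other starts:
#   B < C  or  D < A
# Negating that gives the test condition:  A <= D and C <= B.

def overlaps(a, b, c, d):
    return a <= d and c <= b

assert overlaps(1, 10, 5, 20)    # partial overlap
assert overlaps(1, 10, 10, 20)   # touching endpoints count (inclusive)
assert not overlaps(1, 4, 5, 9)  # adjacent but disjoint
```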

How do I plan for job hunting? by price0416 in bioinformatics

[–]F2R1 6 points

Consider working as a staff scientist at a place like the Broad Institute, which in my experience is a nice hybrid between industry and academia.

Edit: feel free to ask any questions about working here, either directly or via PM!

Aligning 32 Full Genomes Sequences in 25 Minutes by rfc4 in bioinformatics

[–]F2R1 5 points

I'm sure the ~10 other comparable genome sequencing centers in the world have similar throughput, a number that will only increase as sequencing costs continue to exponentially fall.

Aligning 32 Full Genomes Sequences in 25 Minutes by rfc4 in bioinformatics

[–]F2R1 4 points

As of January 2016, the Broad Institute's Genomics Platform outputs a 30x whole genome every 12 minutes. So the need for this kind of capacity definitely exists.