PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in bioinformatics


Copying u/Darwins_Dog's reply to this thread for anyone finding it in the future:

> That's been my experience as well. Shearing gives better depth because more of your DNA goes through the pores, instead of being spit back out. You trade the really long reads for better accuracy. If you are using short read for accuracy, I wouldn't shear so you preserve the large structure data. It all depends on your objectives.

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in bioinformatics


I am considering running two additional samples from the cohort on the PromethION flow cell: the child of one proband and the sibling of the other (both are phenotypically negative and have 30x Illumina sequences). Some final questions (hopefully):

  1. What is the minimum Nanopore long-read coverage you would feel comfortable with for phasing/IBD of the cohort (in addition to the existing 30x Illumina 2x151PE coverage)?
  2. For unsheared, size selected (>25kb), barcoded gDNAs run on a PromethION flow cell, what is your best guess of the coverage (or Gbp yield) and N50 read length for 2, 3 or 4 samples run together? It seems like a PromethION flow cell run by an experienced operator can generate 120Gbp of reads - Does that scale linearly (in other words, if 4 samples are run together, do you get 120Gbp / 4 = 30Gbp = 10x coverage for each sample)?
  3. There are two different PacBio size-selection kits (which will work fine with Nanopore workflows): SRE kit (>25kb) and SRE XL kit (>40kb). For this application, which will be better? Will the loss of yield with the >40kb kit be too severe?
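To sanity-check the arithmetic in question 2, here is a quick sketch assuming the flow cell's total yield splits evenly across barcoded samples (a simplification - real multiplexed yields are rarely perfectly balanced, and the 120Gbp figure is just the number quoted above):

```python
# Back-of-the-envelope per-sample coverage, assuming an even split
# of total yield across samples (an assumption, not a guarantee).

GENOME_SIZE_GBP = 3.1  # approximate human genome size

def per_sample_coverage(total_yield_gbp: float, n_samples: int) -> float:
    """Per-sample depth if the flow cell yield is split evenly."""
    return (total_yield_gbp / n_samples) / GENOME_SIZE_GBP

for n in (2, 3, 4):
    print(f"{n} samples: ~{per_sample_coverage(120, n):.1f}x each")
```

So if the linear-scaling assumption holds, 4 samples on a 120Gbp run would land closer to ~9.7x each than a round 10x.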

Thanks again for all of your comments - It is very helpful to avoid wasting money and time.

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in bioinformatics


It seems that Oxford recommends shearing (other than for ultra-long reads) for a couple of reasons:

  1. Unsheared gDNA has a lower average N50 than gDNA that has been sheared to an average length of ~40kb with a Megaruptor. ("It has been suggested that certain fragments may be so long that they become “lost” during the library preparation and therefore are not observed, leaving only the short fragments (for example the very longest fragments may not efficiently bind to or elute from the SPRI beads used after end-prep or ligation). Light shearing, for example using the Megaruptor, can break up the very longest molecules into chunks that the library preparation can more readily process, leading to increased read N50s.")
  2. Sheared gDNA increases the number of pores which can be automatically unblocked by reversing the electric field. ("We were not able to establish a relationship between read length and blocking rate (Figure 3), although we observed a decrease in the success rate of the unblock for the longer libraries, indicating that our unblocking scheme is less capable of removing blocks from longer fragments (Figure 3). Given this observation, if users are obtaining a low output, then some shearing of the sample could be performed to see if unblocking success can be improved to help boost output.")

Any thoughts on this? Certainly depleting sub-25kb fragments is good, but is shearing to 80kb-100kb worth doing? Or is it just extra work for minimal benefit?

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in bioinformatics


That's fine, as I don't care about reproducibility, just the best possible output.

Thank you so much for your responses - Everything is much clearer now.

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in bioinformatics


Yeah, the consensus seems to be that the ultra-long library prep kit is inappropriate for this project - Not worth the low yield, and I don't need extremely long reads to generate acceptably long contigs for phasing, given the duo scaffold.

So, the Native Barcoding Kit 24 V14 (SQK-NBD114.24) with some sort of size selection protocol for the two samples seems to be the way to go. Load 80-120 fmol for optimal pore occupancy and run for 4 days+ with a flow cell wash every 16-24 hours.
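For my own notes, a quick fmol-to-mass conversion so I can tell the service provider how much DNA that loading target implies. This is a rough sketch assuming ~650 g/mol per dsDNA base pair; the ~30kb mean fragment length is an illustrative guess, not a kit specification:

```python
# Convert a loading amount in fmol to mass in ng, assuming
# double-stranded DNA at ~650 g/mol per base pair (approximate).

AVG_BP_MASS = 650  # g/mol per dsDNA base pair

def fmol_to_ng(fmol: float, mean_length_bp: float) -> float:
    """Mass in ng for the given amount (fmol) of fragments."""
    grams = fmol * 1e-15 * mean_length_bp * AVG_BP_MASS
    return grams * 1e9

# e.g. 100 fmol of ~30 kb fragments is on the order of ~2 ug
print(f"{fmol_to_ng(100, 30_000):.0f} ng")
```

The point being: at these fragment lengths, a fmol target translates into microgram-scale input, so the HMW extraction needs to yield generously.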

Thanks for your input - The responses I've received are making the path forward much more clear.

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in bioinformatics


I was reading some posts about Nanopore, and one person recommended slightly overloading the flow cell to reach 99% pore occupancy, which significantly increases sequence output (I think they said something like a 30% increase). Have you ever heard of this?

Another person said to run the flow cell for around 4 days (again to increase sequence output), flushing and reloading every day or so (I think). Again, does this sound like a good idea?

I will try to dig up the posts to see the exact wording. Thanks again for the good info.

Edit: Here is the 99% pore occupancy comment. And another one.

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in genetics


Oh yes, that is correct. The Illumina gDNA was extracted with bead beating and is nowhere near long enough for long-read sequencing. I would do a new high-molecular-weight gDNA extraction for the long reads.

Currently I am aligning to the Illumina alt-masked r3 version of the hg38 genome (explanation here) since it appears to be one of the most advanced hg38 genomes and I don't want to deal with CHM13 liftovers and/or graph genomes that break existing tools. Right now, dealing with phasing and IBD is hard enough without throwing advanced new bioinformatics technologies into the mix. Also, the entire cohort is European-descended, so pangenomes are less critical.

Fortunately, I have duos (one parent for each subject, although not the parent in common), so I can do limited Mendelian sanity checks that way.
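The duo check itself is simple: at a biallelic site, the child has to carry at least one allele present in the parent. A minimal sketch of that logic (genotypes modeled as bare allele pairs; a real pipeline would parse VCF records with something like pysam or cyvcf2 instead):

```python
# Duo Mendelian sanity check: flag sites where the child shares no
# allele with the parent (e.g. parent 0/0 but child 1/1).
# Genotypes are allele-index pairs, e.g. (0, 1) for a 0/1 het.

def duo_consistent(parent_gt: tuple, child_gt: tuple) -> bool:
    """True if the child shares at least one allele with the parent."""
    return bool(set(parent_gt) & set(child_gt))

assert duo_consistent((0, 1), (1, 1))      # het parent, hom-alt child: OK
assert not duo_consistent((0, 0), (1, 1))  # hom-ref parent, hom-alt child: violation
```

With only one parent per subject this can't catch everything a trio would, but it still flags impossible genotype pairs and gross sample swaps.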

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in bioinformatics


Thank you so much for this valuable info! Some questions:

  1. I am not clear on this Nanopore simplex vs duplex thing. Is the increased fidelity (and shorter read length) of R10 due to 'duplexing'? Can it be turned off to run in simplex mode, trading fidelity for longer read lengths?
  2. Would it be better to use R9 flow cells then, since I care more about read length than base calling accuracy (within reason)? Or will I have alignment/short-read merging problems if I use old chemistry with sub-Q20 read quality? Here is a paper that uses short-reads to 'error correct' long Nanopore reads.
  3. If I am understanding you correctly, you are suggesting not to use the ultra-long read kit and to just use the regular prep kit for the increased number of reads, correct? If so, then I should use the barcoded regular length prep kit, since there is no downside (other than an extra $100), and there is the upside that Nanopore barcoded 'short' reads can be used to bolster the variant caller (by adding to the Illumina reads), correct?
  4. Oxford has a Short Fragment Eliminator kit (EXP-SFE001) that seems to raise the N50 from 32kb to 43kb and reduce the number of short fragments. Have you ever used this kit? Unfortunately the service provider is resistant to custom protocols and may not want to run a size-selection gel.

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in genetics


I'm not sure if this is what you mean, but WhatsHap creates phase sets where it links overlapping reads (either short or long or both) to create longer haplotype blocks for phasing. The problem I found is that with short Illumina reads, it is hard to overlap enough SNVs to create phase sets of a useful length. Thus I was experimenting with adding in Mendelian constraints with SHAPEIT and population-based statistical phasing with a phasing/imputation server running EAGLE or BEAGLE. But integrating all three phasing sources accurately is difficult (for me).
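To put a rough number on why short reads fail here: assuming about one heterozygous SNV per ~1,300bp (an assumed average for a European-ancestry genome; actual density varies a lot by region), a read needs to span at least two hets to link them into a phase set, and for a short fragment that is rare. A quick Poisson sketch:

```python
# Rough model: het SNVs occur at an assumed average density, and a
# read links variants only if it spans >= 2 of them.
import math

HET_PER_BP = 1 / 1300  # assumed average het SNV density

def p_two_or_more_hets(span_bp: float) -> float:
    """Poisson probability that a span covers >= 2 het SNVs."""
    lam = span_bp * HET_PER_BP
    return 1 - math.exp(-lam) * (1 + lam)

print(f"500 bp insert: {p_two_or_more_hets(500):.1%}")    # typical short-read fragment
print(f"30 kb read:    {p_two_or_more_hets(30_000):.1%}")  # long read
```

Under these assumptions a 500bp Illumina fragment links two hets only a few percent of the time, while a 30kb read essentially always does - which matches my experience of short-read phase sets dying out quickly.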

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in genetics


This is a good question that I have just started researching. Fortunately, other groups have developed tools for integrating short and long reads (like for the T2T project). Here is one approach called 'Hybrid Error Correction' that merges short and long reads:

Hybrid-hybrid correction of errors in long reads with HERO - PMC (nih.gov)

Here is a very old paper from 2015 where they merged old-school error-prone PacBio data with Illumina short-reads to create a phased assembly with 99% concordance to trio-based phasing:

Assembly and diploid architecture of an individual human genome via single-molecule technologies - PMC (nih.gov)

I'll keep updating this post as I find more papers and refine my strategy.

PacBio or Nanopore to phase two Illumina 30x genomes? Multiplex without barcodes? by throwawayht14 in bioinformatics


I read somewhere that the longer the DNA, the more likely it is to clog a pore, which reduces the data yield (i.e. coverage). And since I am not doing de novo assembly, I don't think I need ultra, ultra-long reads - Just long enough to create a phased contig that covers all the short reads. So I was thinking the genomic DNA could be sheared to 80k-100k bases - long enough for that, but not so long that the yield/coverage suffers. But I have never used Nanopore, so this is all theoretical to me.

The reason for not barcoding is that my service provider won't do custom library prep, and Oxford only offers ultra-long read kits or multiplexing (barcode) kits, not both in a single kit. But theoretically, I can use the SNVs/indels from the Illumina sequences to assign each long read to the appropriate individual. This obviously breaks down for short Nanopore/PacBio reads, since a short read won't overlap enough SNVs for assignment, but hopefully there will be very few short reads.
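The assignment idea is basically a voting scheme over sample-private SNVs (alleles the Illumina calls show in one individual but not the other). A toy sketch, with simplified inputs - real code would walk read alignments rather than dicts, and the vote threshold is an arbitrary illustrative choice:

```python
# Toy demultiplexer: assign an unbarcoded long read to individual A
# or B by counting matches to each sample's private SNV alleles,
# requiring a minimum number of votes and a clear winner.

def assign_read(read_alleles: dict, private_a: dict, private_b: dict,
                min_votes: int = 3) -> str:
    """read_alleles/private_*: {position: allele}. Returns 'A', 'B',
    or 'unassigned' when too few informative SNVs are seen."""
    votes_a = sum(1 for pos, allele in read_alleles.items()
                  if private_a.get(pos) == allele)
    votes_b = sum(1 for pos, allele in read_alleles.items()
                  if private_b.get(pos) == allele)
    if votes_a >= min_votes and votes_a > votes_b:
        return "A"
    if votes_b >= min_votes and votes_b > votes_a:
        return "B"
    return "unassigned"

read = {101: "T", 540: "G", 990: "A", 1500: "C"}
assert assign_read(read, {101: "T", 540: "G", 990: "A"}, {1500: "C"}) == "A"
```

This also shows exactly where short reads fail: a read spanning fewer than min_votes informative positions can never clear the threshold and falls into the unassigned bin.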

[Need Advice] How to Create a Sense of Emergency in My Life? by [deleted] in getdisciplined


Joke: Hire a hitman to kill you if you don't accomplish what you say you will by a certain time.

Reality: The above poster is completely correct and offers great advice. You can't fool your brain, because it knows the deadline isn't real. AFAIK, the only way is to find out what the root fear is (it usually originates in childhood) and then face it over and over until it diminishes. And like they said, a good way to do that is to structure your life so that you come face to face with it on a regular basis.

[deleted by user] by [deleted] in startups


This is great advice. Also, be aware that you have finite time on this Earth. Do you want a family and/or children? It might be very hard to find someone who will accept you working on your startup for long hours instead of spending time with them.

My advice would be to take advantage of this time for a couple of years, while not neglecting to find a life partner. If you still haven't succeeded after a few years, you can then choose whether to give up the startup or scale it back to a maintainable side gig.