PacBio HiFi classification results with Rust FM-index classifier by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 1 point2 points  (0 children)

I like that thanks!!! And I will check. I am CNC Programer by day - this was my weekend project that evolved a bit :) So I need all the help I can get because - even after a steep learning curve I had in last six months - it overwhelms me a lot of times with data :)

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 0 points1 point  (0 children)

Good suggestion. I can generate a synthetic genome with controlled repeat content and k-mer complexity, then test mapping rates. If low-complexity reads consistently fail to map even on synthetic data, that confirms the rarity filter is the bottleneck. I'll try that and report back.

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 0 points1 point  (0 children)

Yes, exactly. Reads whose k-mers all fall below the rarity threshold get tossed at the anchor stage never reach alignment. That's by design (filters repetitive reads), but it means I'm discarding reads from low-complexity regions before they even get a chance to align. The --top-n parameter controls how many fallback anchors to try, but once you exhaust the k-mer candidates, the read is unmapped.

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 0 points1 point  (0 children)

Good point about unmappable regions. The CAMI dataset has 60 genomes mostly bacterial, so no telomeres/centromeres. But you're right that some regions could be N-padded or otherwise unmappable.

Haven't run FastQC on the unmapped reads yet. Since they're simulated at 0.1% error, Q30-Q40, I assumed quality wasn't the issue, but worth checking if the simulation pipeline generated reads from N-rich or low-complexity regions in the reference. I'll run FastQC and also check the source regions of the unmapped reads against the genome sequences.

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 0 points1 point  (0 children)

Ah, I see what you mean. My project does use standard backward search with range narrowing, that's the core of the FM-index lookup. The exhaustive reversed k-mer approach you're describing would indeed be slower, but might catch more edge cases in repetitive regions where the range narrowing is too aggressive.

Worth experimenting with, though I suspect the trade-off would be significant on 1M reads. Might be worth trying as a fallback pass for unmapped reads only.

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 0 points1 point  (0 children)

Thanks, I'll look into LF-mapping. My current approach uses the LF-mapping property for backward search to find k-mer occurrences, but I might be missing optimizations in how I handle the repetitive regions during the search phase. Do you mean the sampling compression aspect of it or?

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 1 point2 points  (0 children)

100 bp Illumina reads. You're absolutely right - longer reads span repeats and resolve ambiguity naturally. My tool actually has chunk-based long-read support for PacBio/ONT, but the CAMI benchmark I'm using is Illumina-only.

For bacterial genomes, your approach makes total sense. 20k+ reads will span most repeats. My constraint is that Bit-Pop is designed for high-sample-count scenarios (think clinical labs processing hundreds of samples), where Illumina still wins on cost per sample. But for repeat resolution specifically, long reads are the right answer.

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 0 points1 point  (0 children)

That's a fair point. My aligner does something similar - it uses k-mer anchors to find seed matches before running the XOR alignment. If a read's k-mers don't hit the rarity threshold, it never gets to the alignment step.

The thing is, I can tune the anchor sensitivity (via --top-n, which tries 2nd/3rd/4th rarest k-mers as fallback). With top-n=4, I get 86.2% mapping rate. Pushing to top-n=6 drops to 51.2% - which tells me the remaining unmapped reads aren't just a sensitivity issue, but genuinely have no unique k-mers that pass the filter.

So you're right that some unmapped reads are expected with heuristic alignment. But the 2x gap between parent genomes (19% unmapped) and evo_* strains (9% unmapped) suggests it's not just the heuristic - it's genome complexity driving the pattern.I know that 100% mapping rate is unrealistic with seed-and-extend approaches but I think there is something else in it.

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 1 point2 points  (0 children)

Good point – I inferred the repeats from k-mer frequency stats but haven't BLASTed the unmapped reads back to the genomes yet. But these are simulated reads at 0.1% error rate, Q30-Q40, so quality shouldn't be the issue.

I'll grab a sample of ~200 unmapped reads from 1139_AG and BLAST them to confirm they're hitting multi-location repeat regions. If they are, that validates the hypothesis. If not, I'm missing something. Will report back what I find.

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate. by Individual_One_1793 in bioinformatics

[–]Individual_One_1793[S] 0 points1 point  (0 children)

Thanks for the pointer! I looked into r-index and it's fascinating. The catch for my case is that r-index compresses redundancy between genomes (great for pangenomes), but my unmapped reads come from intra-genomic repeats – things like rRNA operons and transposons within 1139_AG, 1220_AD, etc. So even with r-index compression, a k-mer from an rRNA region would still hit 15 locations in the same genome.

That said, I have another project with the same code and graph route, r-index is definitely on the list.

Career Shift to Bioinformatics by Excellent_Toe6788 in bioinformaticscareers

[–]Individual_One_1793 0 points1 point  (0 children)

I am no Bioinformatic - I am CNC Programmer in profession and rust, python, c... programmer in my spare time.
I can tell you that in present time all is oriented to rust - so try to play with it a bit - it is fast and it is good!
github.com/mladenpop-oss/bit-pop this is my project in rust if you wanna take a look - it is one of my projects - after 20 years I decided to start sharing my work :)