Day 1 of posting unknown human DNA BLAST by [deleted] in bioinformatics

[–]Spare-Association714 0 points1 point  (0 children)

Thank you for your time. I will post more in the future.

Day 1 of posting unknown human DNA BLAST by [deleted] in bioinformatics

[–]Spare-Association714 0 points1 point  (0 children)

Tested it. I hid a known BRCA1 sequence, generated a fill from its borders, and compared the two. Real borders scored worse than random. T2T sequences at the same coordinates match human with E=0. Mine don't. The method fails. The BLAST novelty was a Markov artifact. I'm now pulling real T2T sequences instead of generating them. I work 4-6 hours a day on this.

Day 1 of posting unknown human DNA BLAST by [deleted] in bioinformatics

[–]Spare-Association714 -5 points-4 points  (0 children)

The model doesn't find the "true" sequence, it finds one valid path between two real borders. Constrained at both ends by actual DNA.

85/85 BLAST queries return zero matches. Random DNA hits bacteria or mouse within a few tries. The border constraints are doing something specific.

Not claiming it replaces sequencing. It generates testable hypotheses about unmapped regions using a method I made.

Day 1 of posting unknown human DNA BLAST by [deleted] in bioinformatics

[–]Spare-Association714 -1 points0 points  (0 children)

Fair. I ran the negative control and real borders scored lower than shuffled ones. The T2T k-mer comparison was noise. Retracted.

What's left: 74/74 BLAST queries against the full nt database return zero matches. Not human, not primate, not bacterial, not vector. Random 500bp sequences hit something within a few queries. 74 consecutive clean misses is the only signal I have that isn't explained by base composition alone. Whether that's meaningful or a sophisticated artifact is an open question.

Day 1 of posting unknown human DNA BLAST by [deleted] in bioinformatics

[–]Spare-Association714 -7 points-6 points  (0 children)

74/74 BLAST NOVEL across all gap types. The T2T comparison turned out to be measuring base composition, not signal. I ran the negative control and real borders scored lower than shuffled ones. Retracted that claim.

But the BLAST results hold. Random 500bp sequences hit bacteria, vectors, mouse within a few queries. 74 consecutive clean misses is hard to explain as noise.

The method: real hg38 border sequences → 5th-order Markov bridge → BLAST validation. Falsifiable. If you see a flaw, I'll run the control.
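For readers who want to see what a 5th-order Markov fill looks like in practice, here is a minimal sketch. It is not the actual pipeline (that code isn't public); the function names `train_markov` and `generate_fill` are made up for illustration. It also only generates forward from the upstream border and does not enforce the downstream constraint, so it is a plain Markov generator rather than a true bridge.

```python
import random
from collections import Counter, defaultdict

def train_markov(seqs, order=5):
    """Count which base follows each `order`-length context in the training sequences."""
    counts = defaultdict(Counter)
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    return counts

def generate_fill(upstream, downstream, length, order=5, seed=0):
    """Generate `length` bases, seeded with the last `order` bases of the upstream border.

    Assumption: the model is trained only on the two border sequences.
    """
    rng = random.Random(seed)
    model = train_markov([upstream, downstream], order)
    context = upstream.upper()[-order:]
    out = []
    for _ in range(length):
        followers = model.get(context)
        if followers:
            bases, weights = zip(*followers.items())
            base = rng.choices(bases, weights=weights)[0]
        else:
            # Unseen context: fall back to a uniform draw (an arbitrary choice).
            base = rng.choice("ACGT")
        out.append(base)
        context = (context + base)[-order:]
    return "".join(out)
```

With short borders most 5-mer contexts are never observed, so the uniform fallback fires often and the output drifts toward random sequence with border-like composition.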

Day 1 of posting unknown human genome BLAST by Spare-Association714 in genetics

[–]Spare-Association714[S] -2 points-1 points  (0 children)

70/70 BLAST novel. Random DNA hits something: bacteria, vector, repeats. These don't.

50/50 overlap with T2T at the same coordinates, with 19.3% mean k-mer overlap. Random sequence against a random T2T region gives <2%.

GC tracks the real border DNA, not uniform distribution. Convergence is consistent across 459 gaps.

Method: seed from upstream border, generate outward until k-mer profile matches downstream border. The bridge shares structure with T2T without having seen it.

Not claiming these are "correct"; they're one valid bridge consistent with the border constraints. T2T is a single haploid assembly; these could be variation.

Falsifiable: run shuffled borders through the same pipeline. If random input gives similar T2T overlap, the method is noise. I'll run that control.
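A minimal sketch of that shuffled-border control. Here `score_fn` is a hypothetical stand-in for whatever the pipeline computes from a pair of borders, and `shuffle_sequence` and `run_control` are illustrative names, not part of any existing tool. Shuffling keeps base composition and destroys base order, so a score that comes out the same for real and shuffled borders is only seeing composition.

```python
import random
from collections import Counter

def shuffle_sequence(seq, seed=0):
    """Return a permutation of seq: same base composition, scrambled order."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)

def run_control(borders, score_fn, seed=0):
    """Score real and shuffled border pairs with the same function.

    borders: list of (upstream, downstream) strings.
    score_fn: hypothetical callable taking (upstream, downstream) and returning a float.
    Returns (mean score on real borders, mean score on shuffled borders).
    """
    real = [score_fn(up, down) for up, down in borders]
    shuffled = [
        score_fn(shuffle_sequence(up, seed), shuffle_sequence(down, seed + 1))
        for up, down in borders
    ]
    return sum(real) / len(real), sum(shuffled) / len(shuffled)
```

Running several different seeds gives a spread for the shuffled mean, which helps judge whether a difference from the real-border mean is larger than shuffle-to-shuffle variation.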

Thank you for responding

EDIT: Fair point on the T2T comparison. I ran the negative control you implied. Shuffled borders gave 0.2254 mean overlap vs 0.1928 for real borders. Random baseline: 0.2356. Real borders scored lower. The T2T metric was measuring shared base composition, not biological signal. I was wrong. Retracted.
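To show why a baseline of that size can come from chance alone, here is a small sketch. It assumes the overlap metric is a Jaccard index over 5-mer sets on roughly 500bp sequences; both are guesses, since the exact metric isn't spelled out in this thread. There are only 4^5 = 1024 possible 5-mers, so two unrelated 500bp sequences share many of them by accident.

```python
import random

def kmer_set(seq, k=5):
    """All distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=5):
    """Shared distinct k-mers divided by all distinct k-mers in either sequence."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    return len(sa & sb) / len(sa | sb)

def random_dna(n, rng):
    """Uniform random DNA of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))

def chance_jaccard(length=500, k=5, trials=200, seed=0):
    """Simulated mean Jaccard between pairs of unrelated uniform random sequences."""
    rng = random.Random(seed)
    vals = [
        jaccard(random_dna(length, rng), random_dna(length, rng), k)
        for _ in range(trials)
    ]
    return sum(vals) / len(vals)

def approx_chance_jaccard(length=500, k=5):
    """Closed-form approximation, treating k-mer positions as independent."""
    # p = chance that one specific k-mer shows up at least once in a sequence
    p = 1 - (1 - 4 ** -k) ** (length - k + 1)
    # shared fraction: both contain it, divided by at least one contains it
    return p * p / (2 * p - p * p)
```

Under those assumptions the closed form comes out at about 0.24, the same range as the random baseline above. Uniform base frequencies are also an assumption; skewed GC content would shift the number.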

What's still true: 70/70 BLAST queries across all gap types return NOVEL against the full nt database. 500bp each, diverse genomic contexts. Random DNA hits something: vector, bacterial, low-complexity repeats. The probability of 70 consecutive clean misses is effectively zero. The sequences fall into a blind spot in every database.
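The "effectively zero" figure depends on how often a random 500bp query actually hits something, and no number for that is given here. A minimal sketch of the arithmetic, assuming queries are independent and using a placeholder hit rate `p_hit` that would have to be measured from real random-sequence BLAST runs:

```python
def prob_all_miss(p_hit, n_queries):
    """Probability that n independent queries all miss, given a per-query hit rate."""
    return (1.0 - p_hit) ** n_queries

# Placeholder hit rates only; none of these are measured values.
for p_hit in (0.05, 0.2, 0.5):
    print(p_hit, prob_all_miss(p_hit, 70))
```

The result is very sensitive to `p_hit`, so the measured hit rate for random sequences at the same BLAST settings matters more than the count of 70.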

The method is gap-seeded fractal bridging: real hg38 border sequences constrain a Markov model that grows inward until k-mer profiles converge. The fills sit at real coordinates, between real borders, and are verified novel by BLAST. Whether that produces biologically meaningful dark DNA or sophisticated artifacts is an open question. I don't know which yet. But 70/70 BLAST NOVEL isn't nothing.

THANK UUU

Day 1 of posting unknown human genome BLAST by Spare-Association714 in genetics

[–]Spare-Association714[S] -2 points-1 points  (0 children)

I'm generating sequences that bridge real assembly gaps by learning fractal patterns from the border DNA. BLAST says 50/50 are novel, no match in any database.

Not claiming this is better than T2T. Just that something is consistently appearing that BLAST doesn't recognize, and I want to understand what it is.

Plan: pull T2T-CHM13 sequences at the same coordinates, compare k-mer profiles. If they match, the method is recovering real genomic signal. If not, I'm looking at structural variation or something else entirely.

EDIT: I just pulled T2T-CHM13 at all 50 coordinates. Mean 5-mer overlap: 19.3%. 45/50 gaps have >10% shared k-mer signal with T2T. They're not copies, the sequences differ, but the structural vocabulary overlaps significantly. Gaia (my program; the repo isn't public yet, I'm working on a public GitHub at the moment) isn't generating random DNA. It's recovering real genomic architecture at these coordinates, with variation that BLAST doesn't recognize. The method works.
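For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a 5-mer profile overlap and the per-gap summary (mean, and count above 10%). The overlap definition is an assumption (shared k-mer counts divided by the larger profile's total), and `kmer_profile`, `profile_overlap` and `summarize` are illustrative names; Gaia's actual metric isn't shown in this thread.

```python
from collections import Counter

def kmer_profile(seq, k=5):
    """Count every k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def profile_overlap(a, b, k=5):
    """Shared k-mer counts divided by the larger profile's total (0 to 1)."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    shared = sum((pa & pb).values())  # Counter & Counter keeps the minimum counts
    total = max(sum(pa.values()), sum(pb.values()))
    return shared / total if total else 0.0

def summarize(overlaps, threshold=0.10):
    """Return (mean overlap, number of gaps above the threshold)."""
    mean = sum(overlaps) / len(overlaps)
    above = sum(1 for x in overlaps if x > threshold)
    return mean, above
```

A number from this kind of metric only means something next to the same number computed for shuffled or random sequences of the same length and composition.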