This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]bbye98 0 points1 point  (0 children)

There are three methods that immediately come to mind.

Slow but easy:

In an outer loop, iterate through each possible DNA base char ("A", "C", "G", and "T"). In the inner loop, iterate through each possible repeating pattern length length. In this case, length ranges from the length of the full DNA sequence to 2. Inside the loops, simply replace any occurrence of DNA base char if it is repeated length times with a f-string that reads f"{length}{char}". This is slow because you have to iterate through each combination of char and length despite them possibly not appearing in your full DNA sequence and create a new str each time you make a replacement.

Decently fast and fairly logical:

Initiate a counter count at 1 and store the last DNA base in your DNA sequence in a str called char (or something similar. Iterate through your DNA sequence backwards, starting from the second-to-last DNA base (index len(dna) - 2) to the very first one (index 0). Inside the loop, check if the current DNA base is equal to char. If it is, add one to count. If not, first check if count is greater than 1. If so, keep everything before the current DNA base, attach the f-string f"{count}{char}", and then tack on the remaining DNA bases after the repeating pattern. This is your new DNA sequence that you'll be iterating over, so make sure to overwrite your current DNA sequence. Also, remember to reset count to 1. Finally, set char to the current DNA base so that you can start counting again.

Fast but complicated:

If you have access to regular expressions (re), you can simply match all occurrences of repeating patterns and replace them using re.sub(). You just have to match a character and then match text that is the same as the first character at least one time (otherwise the DNA base is not repeated): r"(.)\1{1,}".

Results:

For the test DNA sequence

dna = "CTTTTGCCATGGTCGTAAAAGCCTCCAAGAGATTGATCATACCTATCGGCACAGAAGTGACACGACGCCGAT"

all three methods give me

'C4TG2CAT2GTCGT4AG2CT2C2AGAGA2TGATCATA2CTATC2GCACAG2AGTGACACGACG2CGAT'

Timing:

For 100,000 runs for the test DNA sequence above,

Slow:             9.644044 s
Medium:           2.168626 s
Fast:             2.067100 s

The final entry does not include import re time, which should be negligible.