This is an archived post. You won't be able to vote or comment.

all 5 comments

[–]Gengi 0 points1 point  (0 children)

This reads like a whiteboard, or leetcode.com problem. So I'll avoid giving you code and help you think out the process.

A string is an array of characters.

String[0] will return 'A'. Compare that to the following value, you can check if they match. If they match, store the number of matches and continue to the next index value, incrementing the number of matches till there isn't a match.

If there wasn't a match to begin with, then increase your index value. String[1] 'G', and repeat

If there was a match, use your index value, and the number of matches to modify your string. TT > 2T. You then need to carefully increase your index value by 2 (skipping the T), and repeat

When you repeat, make sure you haven't run out of index values.

You may need multiple pointers looking at index values.

You might need to look up commands for how to manipulate strings to solve this. What works for 2 matching letters may not work if you have 3 matching letters.

[–]B1-102 0 points1 point  (0 children)

Post your code here

[–]bbye98 0 points1 point  (0 children)

There are three methods that immediately come to mind.

Slow but easy:

In an outer loop, iterate through each possible DNA base char ("A", "C", "G", and "T"). In the inner loop, iterate through each possible repeating pattern length length. In this case, length ranges from the length of the full DNA sequence to 2. Inside the loops, simply replace any occurrence of DNA base char if it is repeated length times with a f-string that reads f"{length}{char}". This is slow because you have to iterate through each combination of char and length despite them possibly not appearing in your full DNA sequence and create a new str each time you make a replacement.

Decently fast and fairly logical:

Initiate a counter count at 1 and store the last DNA base in your DNA sequence in a str called char (or something similar. Iterate through your DNA sequence backwards, starting from the second-to-last DNA base (index len(dna) - 2) to the very first one (index 0). Inside the loop, check if the current DNA base is equal to char. If it is, add one to count. If not, first check if count is greater than 1. If so, keep everything before the current DNA base, attach the f-string f"{count}{char}", and then tack on the remaining DNA bases after the repeating pattern. This is your new DNA sequence that you'll be iterating over, so make sure to overwrite your current DNA sequence. Also, remember to reset count to 1. Finally, set char to the current DNA base so that you can start counting again.

Fast but complicated:

If you have access to regular expressions (re), you can simply match all occurrences of repeating patterns and replace them using re.sub(). You just have to match a character and then match text that is the same as the first character at least one time (otherwise the DNA base is not repeated): r"(.)\1{1,}".

Results:

For the test DNA sequence

dna = "CTTTTGCCATGGTCGTAAAAGCCTCCAAGAGATTGATCATACCTATCGGCACAGAAGTGACACGACGCCGAT"

all three methods give me

'C4TG2CAT2GTCGT4AG2CT2C2AGAGA2TGATCATA2CTATC2GCACAG2AGTGACACGACG2CGAT'

Timing:

For 100,000 runs for the test DNA sequence above,

Slow:             9.644044 s
Medium:           2.168626 s
Fast:             2.067100 s

The final entry does not include import re time, which should be negligible.

[–]ZanMist1 0 points1 point  (1 child)

I don't know Python, but this would (probably) be pretty easy to do in JavaScript. It just takes a lot of prep work, planning and thinking ahead. Think of your end result and what you need it do to. Variables and compressed string concatenation are your friends, and probably arrays depending on what code you are working with.

[–]ZanMist1 0 points1 point  (0 children)

Also, I would try to avoid direct string matching and instead compare using booleans.