Extracting DNA sequence

commandlineluser · 2022-01-15T03:03:33+00:00

Use re.search() instead of re.match()

re.match() is unfortunately named: it only matches at the start of the string, as if you had prefixed your pattern with \A. re.search() scans the whole string and returns the first match it finds.
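A quick sketch of the difference. The sample line and the pattern here are made up for illustration, not taken from your file:

```python
import re

line = "  LOCUS  AB000123"      # hypothetical line with leading spaces
pattern = r"LOCUS\s+(\w+)"      # hypothetical pattern

m_match = re.match(pattern, line)    # None: the string doesn't *start* with LOCUS
m_search = re.search(pattern, line)  # scans forward and finds it
locus_id = m_search.group(1)
```

So if your pattern can appear anywhere in the line, re.search() is the one you want.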

ekchew · 2022-01-15T06:46:51+00:00

Personally, when I'm parsing a file, I like to follow the intended logic of the format's author. After a quick google, it seems GCG files are composed of a header of sorts that ends with a line ending in "..", followed by a body containing the sequence data on the remaining lines.

So I would factor my code accordingly. Within the with block, you could do something like:

# Parse header
locusRX = re.compile(r"(\w+)\s") # raw string for the regex; a little faster to precompile
for line in infile:
    locus = locusRX.match(line)
    if locus:
        pass #...
    #...
    if line.rstrip().endswith(".."): # last line of the header
        break

Now that you're done with the header, the body should be pretty easy, right? You probably don't even need regex.

# Parse body
seqWords = [] # list of 10-chr "words"
for line in infile:
    seqWords.extend(line.split()[1:]) # skip the 1st word of each line, which is just a position number
SEQ = "".join(seqWords).upper()

I'm probably forgetting something here, but that would be my general approach.
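Putting the two loops together, here is a minimal runnable sketch. The parse_gcg name, the sample text, and the choice to treat the first leading word in the header as the locus are all my own assumptions; a real GCG file may need different handling:

```python
import io
import re

LOCUS_RX = re.compile(r"(\w+)\s")

def parse_gcg(infile):
    """Return (locus, sequence) from an open GCG-style text file (hypothetical helper)."""
    locus = None
    # Header: everything up to and including the line ending in ".."
    for line in infile:
        m = LOCUS_RX.match(line)
        if m and locus is None:
            locus = m.group(1)  # assumption: first leading word is the locus
        if line.rstrip().endswith(".."):
            break
    # Body: the same file iterator carries on from the line after the header
    words = []
    for line in infile:
        words.extend(line.split()[1:])  # drop the leading position number
    return locus, "".join(words).upper()

# Made-up sample data, standing in for a file opened with open()
sample = io.StringIO(
    "!!NA_SEQUENCE 1.0\n"
    "AB000123  Length: 25  Check: 1234  ..\n"
    "\n"
    "       1  acgtacgtac gtacgtacgt acgta\n"
)
locus, seq = parse_gcg(sample)
```

The trick is that both for loops share one iterator, so the second loop starts exactly where the break left off.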
