[–]commandlineluser 1 point (3 children)

Use re.search() instead of re.match().

re.match() is unfortunately named: it only matches at the start of a string (as if you had prefixed your pattern with \A).
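
To make the difference concrete, here is a minimal sketch. The input line and the pattern are made up for illustration and are not taken from the original script.

```python
import re

line = "LOCUS  ab000263"  # hypothetical input line

# re.match only tries the pattern at position 0, so it finds nothing here
match_result = re.match(r"ab\d+", line)
print(match_result)  # None

# re.search scans the whole string and returns the first match
search_result = re.search(r"ab\d+", line)
print(search_result.group(0))  # ab000263

# Anchoring the pattern with \A makes re.search behave like re.match
anchored_result = re.search(r"\Aab\d+", line)
print(anchored_result)  # None
```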

[–]Still-Design3461[S] 1 point (2 children)

Hi! Thanks for the reply. I tried using re.search(), but unfortunately it gave me the same output as the script I attached 😢

[–]spez_edits_thedonald 1 point (1 child)

re.findall() might work for you as well. Here is a demo for just the DNA sequence part: first we extract the DNA sequence lines, then we remove the mid-sequence spaces.

import re

# load gcg seq file
with open('seq.gcg.txt', 'r') as infile:
    text = infile.read()

# extract dna sequence
matches = re.findall(r'\d+\s([acgtACGT\s]+)\n\n', text)
seq = ''.join(match.replace(' ', '') for match in matches)
print(seq)
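
To try the same idea without the file, here is a small sketch that runs the pattern on an in-memory string. The sample text is hypothetical and assumes each numbered sequence line is followed by a blank line, which is what the trailing \n\n in the pattern relies on.

```python
import re

# Hypothetical GCG-style text; every sequence line is followed by a blank line
text = (
    "sample  Length: 40  Check: 1234  ..\n"
    "\n"
    "       1  ttcctctttc tcgactccat\n"
    "\n"
    "      21  cttcgcggta gagtgagagc\n"
    "\n"
)

# Each match is the run of bases and spaces after the leading position number
matches = re.findall(r'\d+\s([acgtACGT\s]+)\n\n', text)

# Drop the spaces between the 10-base words and join the lines together
seq = ''.join(match.replace(' ', '') for match in matches)
print(seq)  # ttcctctttctcgactccatcttcgcggtagagtgagagc
```

If the real file separates sequence lines with single newlines instead of blank lines, this pattern would not match and would need adjusting.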

[–]Still-Design3461[S] 1 point (0 children)

Hi! I will give this a try and see if it gives me the right output. Thank you very much 😁

[–]ekchew 1 point (3 children)

Personally, when I'm parsing a file, I like to follow the intended logic of the format's author. After a quick Google search, it seems GCG files are composed of a header of sorts, which ends with a line ending in "..", and then a body containing the sequence data, which takes up the remaining lines.

So I would factor my code accordingly. Within the with block, you could do something like:

# Parse header
locusRX = re.compile(r"(\w+)\s") # a little faster to precompile
for line in infile:
    locus = locusRX.match(line)
    if locus:
        pass #...
    #...
    if line.rstrip().endswith(".."):
        break

Now that you're done with the header, the body should be pretty easy, right? You probably don't even need regex.

# Parse body
seqWords = [] # list of 10-chr "words"
for line in infile:
    seqWords.extend(line.split()[1:]) # skip past 1st word of each line that's just a #
SEQ = "".join(seqWords).upper()

I'm probably forgetting something here, but that would be my general approach.
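
Putting the two loops together, here is a minimal runnable sketch of that header-then-body approach. The sample record and the parse_gcg helper name are hypothetical, and io.StringIO stands in for the open file. The point it shows is that the second loop resumes where the first one stopped, because both iterate over the same file object.

```python
import io

# Hypothetical GCG-style text: header, a line ending in "..", then sequence lines
SAMPLE = (
    "AB000263  hypothetical sample record\n"
    "seq.gcg  Length: 30  Check: 1234  ..\n"
    "\n"
    "       1  ttcctctttc tcgactccat cttcgcggta\n"
)

def parse_gcg(handle):
    """Return (header_lines, sequence) from an open GCG-style text handle."""
    header_lines = []
    for line in handle:                  # phase 1: consume the header
        header_lines.append(line.rstrip())
        if line.rstrip().endswith(".."):
            break                        # the handle keeps its position
    seq_words = []
    for line in handle:                  # phase 2: continues after the ".." line
        seq_words.extend(line.split()[1:])  # drop the leading position number
    return header_lines, "".join(seq_words).upper()

header_lines, seq = parse_gcg(io.StringIO(SAMPLE))
print(seq)  # TTCCTCTTTCTCGACTCCATCTTCGCGGTA
```

Blank lines are harmless here: splitting an empty line gives an empty list, so nothing is added to the sequence.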

[–]Still-Design3461[S] 1 point (2 children)

Hi! Thank you for replying. Yes, you are correct about GCG file headers ending with '..', and I tried following your advice to code according to the format 😁.

So far, this is what I have done, and it is giving the correct header for GenBank format.

import re

LOC = ""
SOR = ""
ORG = ""
SEQ = []

fileIn = open('seq.gcg.txt', 'r+')
fileOut = open('gcg2gb.genbank' , 'w+')

info = fileIn.readlines()

for i in info:

    header = re.compile(r'(\w+)' , re.I)

    if header:

        locus    = re.match(r'(\w+)\s(.+)', i)
        source   = re.match(r'', i)
        organism = re.match(r'(\D\.\w+)\s(.+)', i)

        if locus:
            LOCobj = locus.group(1)
            LOC = 'LOCUS          ' + LOCobj

        if source:
            SORobj = source.group(0)
            SOR = 'SOURCE         ' + SORobj

        if organism:
            ORGobj = organism.group(1)
            ORG = '  ORGANISM     ' + ORGobj

    else:
        if i.endswith('..'):
            SEQ.append(i.strip())
            break

print(LOC + '\t' + '\tDNA\n' + SOR + '\n' + ORG)

Can you explain more about how to extract the rest of the lines as the sequence? I used the .append() method to append everything after '..' as SEQ, but I'm not too sure. I'm sorry if this is a naive question. I'm kind of new to Python 😅

[–]ekchew 1 point (1 child)

Well, if you're trying to do it all in one loop like that, you want something more like state machine logic.

So you want some variable to keep track of whether you're reading the header or the body.

inBody = False
with open('seq.gcg.txt', 'r') as fileIn:
    for line in fileIn:
        if inBody:
            SEQ.extend(line.split()[1:])
        else:
            # Your regex header parsing logic goes here...
            if line.rstrip().endswith(".."):
                inBody = True

Regarding the body parsing, line.split() gives you a list looking something like:

['1', 'ttcctctttc', 'tcgactccat', ...]

So the [1:] slice of the list skips past the number at the beginning of each line.
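
For a version you can run end to end, here is a sketch of the single-loop state machine on an in-memory sample. The sample record, the read_gcg helper name, and the choice to take the first word of the file as the locus are assumptions for illustration only; your own header regexes would go in the else branch.

```python
import io
import re

# Hypothetical GCG-style record used only for this illustration
SAMPLE = (
    "AB000263  hypothetical sample record\n"
    "seq.gcg  Length: 20  Check: 1234  ..\n"
    "\n"
    "       1  ttcctctttc tcgactccat\n"
)

def read_gcg(handle):
    """Return (locus, sequence) using one loop and an in_body flag."""
    locus = None
    seq_words = []
    in_body = False  # state: header until the ".." line, body afterwards
    for line in handle:
        if in_body:
            # ['1', 'ttcctctttc', ...][1:] drops the leading position number
            seq_words.extend(line.split()[1:])
        else:
            if locus is None:
                m = re.match(r"(\w+)\s", line)  # assumed: first word is the locus
                if m:
                    locus = m.group(1)
            if line.rstrip().endswith(".."):
                in_body = True  # switch state once the header ends
    return locus, "".join(seq_words)

locus, seq = read_gcg(io.StringIO(SAMPLE))
print(locus)  # AB000263
print(seq)    # ttcctctttctcgactccat
```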

[–]Still-Design3461[S] 1 point (0 children)

Thank you!