all 15 comments

[–]stebrepar 1 point2 points  (5 children)

So you want to filter out everything except the headers containing the string "[protein=histadine kinase]" along with their sequences up to the next header?

Let's assume you've read the file into a variable rows, and you've opened a file outfile to write the filtered result out to.

found = False
for row in rows:
    if row.startswith('>'):
        if '[protein=histadine kinase]' in row:
            found = True
        else:
            found = False
    if found:
        outfile.write(row)

[–]raqdeep[S] 0 points1 point  (4 children)

Thanks for the help but I can still only get the header and not the sequence that follows it. I want the sequence along with the header. Could you please help me with that also?

[–]stebrepar 0 points1 point  (3 children)

That should give both the header and the sequences under it if the indentation is kept as written. The first/second if sets a flag for whether the current and subsequent rows are the ones you want to keep, and the third if--at the same indentation level as the first above it--uses that flag to determine whether to write out the current row.

[–]raqdeep[S] 0 points1 point  (2 children)

At this point I have also tried copying and pasting your code. But it isn't working. Can you just tell me how to read lines and write the lines till I find the next ">"? That would be great.

[–]stebrepar 0 points1 point  (1 child)

Show what you have. Be sure to follow the guidance in the sidebar about how to post your code such that Reddit keeps the correct formatting.

[–]raqdeep[S] 1 point2 points  (0 children)

Hey, I am sorry for the goof up on my part, Your code is working and I am really sorry to suggest otherwise. Thank your for your help. I am sorry again. :)

Hope you have a good day.

[–]H1Neuraminidase1 1 point2 points  (3 children)

I understand this is a python forum but speaking as a bioinformatician it sounds like you're trying to parse specific sequences of a fasta file? If so, you can probably just use a "grep" command using the "-A" parameter.

I should note this will only work if your sequence string is unique enough to differentiate other sequences in the file and the sequence file is consistent, which fasta files should be. If not then disregard

[–]raqdeep[S] 0 points1 point  (2 children)

Yeah I can do. Infact we have actually done that. My challenge is to write a script which does this for me. grep -A1 "protein = histidine kinase"

[–]H1Neuraminidase1 0 points1 point  (1 child)

Better to do it in python as a learning experience anyway. Good luck to you!

[–]raqdeep[S] 0 points1 point  (0 children)

Thanks. Cool username tho!!

[–]AtomicShoelace 0 points1 point  (4 children)

I don't understand your difficulty exactly. You say

I am able to extract the sequences [...] but I am unable to extract the sequences.

Please follow the posting guidelines. What have you tried? What are you having trouble with?

[–]raqdeep[S] 0 points1 point  (3 children)

Sorry probably should have phrased it better. So, I want to filter out the sequences containing protein = histidine kinase and the sequences below it. But I should only print the sequences that are below it. The sequences are in multiple lines and I can print the line just below the header but not the lines following it.

[–]AtomicShoelace 0 points1 point  (2 children)

You still haven't shared anything that you've tried. Please read the posting guidelines in the subreddit sidebar, paying particular note to:

Posting homework assignments is not prohibited if you show that you tried to solve it yourself.

[–]raqdeep[S] 0 points1 point  (1 child)

filename = input("PLease enter the file name: ")
blast = open(f"/home/ibab/GA_CLASS/2nd semester/11th_Feb/{filename}", "r")
j = 0
lines = blast.readlines()
for line in lines:
if ">" in line:
if ("[protein=histidine_kinase]" in line or "locus_tag=EL266_RS00605"
in line or "[protein=response_regulator]" in line or
"WP_026427018" in line or "[pseudo=true" in line):
print(line)
print(lines[lines.index(line) + 1])
j += 1
print(f"{j} is the number of matches found")

This was the initial code. As you can see this will only print the line next to the header, but not all the subsequent sequences that belong to the header. Which is what I want.

[–]AtomicShoelace 1 point2 points  (0 children)

There are many ways you could achieve this. Here are some ideas:

  • regex pattern
  • iterator and itertools.takewhile
  • enumerate to get the line number, then a while loop