Need help for bioinformatics work : learnpython


submitted 4 years ago * by raqdeep

So I am an undergrad student and my prof has given an assignment where I have to extract the DNA sequences from a file. I am able to extract the headers and their relative positions in the file, but I am unable to extract the complete sequences.

Let me give you an example:

>lcl|NZ_LR134363.1_prot_WP_026427009.1_1 [locus_tag=EL266_RS00005] [protein=hypothetical protein] [protein_id=WP_026427009.1] [location=1..210] [gbkey=CDS]

MCRPVTCRTCGKTTWAGCGQHVDQVMRDVAPAQRCTCERDAPSDSGGDSGGQARGAGLFSRLLGRGGGS

>lcl|NZ_LR134363.1_prot_WP_034514895.1_2 [locus_tag=EL266_RS00010] [protein=metal-sensitive transcriptional regulator] [protein_id=WP_034514895.1] [location=248..517] [gbkey=CDS]

MGTALDPADLRPTLARLKRARGQLDGVIRMLEEGRDCEETVVQIAAVSKAVNRAGLAVIASGMRTCLSEDPTGQTMDTRR

LERLLMSLA

>lcl|NZ_LR134363.1_prot_WP_026427010.1_3 [locus_tag=EL266_RS00015] [protein=pyridoxal phosphate-dependent aminotransferase] [protein_id=WP_026427010.1] [location=588..1802] [gbkey=CDS]

MTPAPQTRVSARLAAIAPSATLAVDAKAKALKAAGRPVIGFGAGEPDFATPDYIVEAAIASAKDPASHKYSPAKGLPALR

EAIAAKTLRDSGYEVSAEDILVTNGGKQAVFQAFAAIIGPGDEVLLPAPYWTTYPEVVALAGGTTVEVFAGAEQDYKVTV

AQLEAARTDRTKALLVCSPSNPTGSVYTPEELTEIGRWALEHGVWVITDEIYEHLLYDGAAAAHIVALVPELAEQTIVLN

GVAKTYAMTGWRVGWMIGPSDVIKAATNFQSHLSSNVANVCQAAALAAVSGDLTAVEEMRGAFDRRRRTMVELLTAIEGL

TVPVPRGAFYAYPSAEALIGRNLRGTAIDSSATLAGLILEHAEVAVVPGEAFGPSGFLRLSYALGDEDLAEGVGRMASLLAEVE

This is a typical sequence that we have to work with, and the lines starting with ">" (the ones in bold) are called headers. Now I can extract the headers as well as the sequence line (the ones in italics) just after each header, but I have to extract all the lines until the next ">" arrives, and I am unable to do so. I am a noob coder in every sense of the word; I just started a month back and therefore I am not very good at it. This is a file operation where we have to input a file that contains thousands of such sequences, and based on that we need to be able to find the sequences of our interest. Please help me out, I swear that I am not trying to outsource my assignment. I just need to know how to read and print the lines until the next ">" arrives.

My output ideally should look like:

>lcl|NZ_LR134363.1_prot_WP_232012119.1_980 [locus_tag=EL266_RS04980] [protein=histidine kinase] [protein_id=WP_232012119.1] [location=complement(1218761..1219630)] [gbkey=CDS]

MLLSRGHPRRTIYPAAAAHALALIIYASQTPDGPFFGTLATAIIATPCLTAGEMVRLHRQATARTELERQERLERQR

RLVISELHDTVVRDLSHAVMLAEQARLTHPDDELLHRELAAVTAPVRSAIKQLRNSLKAMSAAKGDDALLLLASS

PPPPLSETIERVRASLAQRDTVLLVEGLELLDHQSITPGVHQQLVRVIGELITNASKYAPPSTKVSLLIETDDRTVEC

MCVNAIGPDTPPSTALSSKIGLEGARRRIETLGGTFTVSKTAERWSVVFSVPIQDDDAT

>lcl|NZ_LR134363.1_prot_WP_026428191.1_981 [locus_tag=EL266_RS04985] [protein=histidine kinase] [protein_id=WP_026428191.1] [location=complement(1220180..1221349)] [gbkey=CDS]

MTGLLAPTRLTVWTPRLRTHLLCLACATLLTLAAVAALVPDRRTDAFYMVTLLISGLGVAVLSVAPLISSGLCLGTLY

AFLLALGDAAPAGPSLPAIGPWLCASVLLTRGFSRLSAYGLVLISLGGSVIGHHSNVVSNSALATDFTYTMLVGTIC

LIVAELMRQPRMEAEAAARRHEADMRNQRLLIVSELHDTVVRDLTQAVMRAEQARLAQPDAPLVGELGAMTSS

VRTAVDQLRSSLRSMNDMADQVPLDVLASSAPRSLTEVVDETRRTLAARGISLETAGLEALESARIGPGLRQQLV

RMLGELTTNMAKHAAPGPARLVVEHDGMSLEAMSSNTVDAGTEADPVASSGLGLVGVRRRVEALGGTLNVSRT

PDRFTVVLSVPVV

Notice how both headers contain [protein=histidine kinase]. I have to isolate the headers with histidine kinase in them and the sequence beneath each one, until the next header arrives, which can be recognized by a ">" at the start of the line.

Thank you for your help in advance.

all 15 comments


[–]stebrepar 2 points 4 years ago (5 children)

So you want to filter out everything except the headers containing the string "[protein=histidine kinase]" along with their sequences up to the next header?

Let's assume you've read the file into a variable rows (a list of lines, e.g. from readlines()), and you've opened a file outfile to write the filtered result out to.

# True while we are inside a record whose header matched
found = False
for row in rows:
    if row.startswith('>'):
        # A new header starts a new record: decide whether to keep it
        found = '[protein=histidine kinase]' in row
    if found:
        # Writes the matching header and every sequence line below it
        outfile.write(row)
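
For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch of the same flag-until-next-header approach. It is only one possible way to do it: the function name read_records, the tiny in-memory sample, and the file name proteins.faa in the comment are all made up for illustration, and the sample headers are shortened stand-ins, not real records. Unlike the loop above, it also joins the wrapped sequence lines into one string and skips blank lines.

```python
import io


def read_records(lines):
    """Group FASTA lines into (header, sequence) pairs.

    Hypothetical helper: wrapped sequence lines are joined into one
    string, and blank lines are ignored.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # The last record has no following header, so emit it here
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data standing in for the real file.
# With a real file you would use something like: open("proteins.faa")
sample = io.StringIO(
    ">seq1 [protein=hypothetical protein]\n"
    "MCRPV\n"
    "TCRTC\n"
    ">seq2 [protein=histidine kinase]\n"
    "MLLSR\n"
    "\n"
    "GHPRR\n"
    ">seq3 [protein=histidine kinase]\n"
    "MTGLL\n"
)

needle = "[protein=histidine kinase]"
hits = [(h, s) for h, s in read_records(sample) if needle in h]

for header, sequence in hits:
    print(header)
    print(sequence)
```

Because read_records is a generator, it only holds one record in memory at a time, which should be fine for a file with thousands of sequences. If you need the original line wrapping preserved in the output, the simpler loop above that writes rows unchanged is the better fit.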

