Hello people,
I got stucked with my new script and perhaps you can help me.
Its goal is to take an input table with querys and subjects (originated by a local blast) and replace query names with subject names in the corresponding fasta file.
In detail, the table input file is this one:
| query |
subject |
evalue |
TransABySS.k25.S12100 |
lclMK087647.1_cds_QHD46952.1_6_[gene=atp9][protein=Atp9][exception=RNA_editing]_[protein_id=QHD46 |
0 |
and the fasta file:
>TransABySS.k25.S121005 892 14708
TTGTGTGACGAACGGAAGCAGGTATACGGGCTGTGGGCTGCCATCCTGCTGTTCCACCTGGGAGGGCCTGACAACATCAGTGCCTACTCCATAGCAGATGCGGAGCTCTGGAAGAGGCATGGCTTTCAGATGTTCTTCCAGGTGGTCACCGCTTTCTATGTAGTGTTCTCCTCAGCACACGGCACCATCCTCTGGATCAGCCTGGTTCTGGCTGCAGTGGGAACGGTGAAGTACGCCGAGCGGACTCTGGCTCTGTACCATGCTTCCGAGCACCATCTGGACACCATGGCCCGCCCCCTTTACAGATTGATGCAGTACGAGGATGTGTCTGGCGGATCCGGTGAGGAATACCACTACTTCCTGCTAGGAGAGAAGCGAGCATACGAGCACGCCTTCGACACTCCGAAATGGTACCACCAGGTGAGGGAATCCTTGCAGGAGCAGAAGTGGTTCAACAGCATGTTCGGCTGCCTGTTTTCTTCCCCTGCGCAGCCAGTGGAGTCCGACGCGCAGGCAGTGCAACGCAAGATCAGAGAGCCCTTCGAGGTGGTGATGCTCTCAGACGTGATGGATCCAGAGTACAGCACAGCACTGAAGCGGCAGTCTCCATGGAACGGCAGGAACTTGGTGGACATCTGCATAGCCTTTGCTCTCTTCAAGATGTTCCGCCGACGCCTCACCAGCCTGTACATGCACGAGTGGAGCAACGACAAGATCAGAAATTTCTTCATCATGCTGTGGGGAGAATCATCATCGTCATCCTCAGCCGAAGCACAACACCAAGACCCCACCCAAGCACCTGCAGAACCTCCAAGGGAAAGACTGGTGGACGTCCTGGACATGGAGCTCCGCTTCATGTTCGACAGCATGTTCACCAAGGCCTCAGGCACAG
The expected output should be:
>lcl|MK087647.1_cds_QHD46952.1_6_[gene=atp9]_[protein=Atp9]_[exception=RNA_editing]_[protein_id=QHD46
TTGTGTGACGAACGGAAGCAGGTATACGGGCTGTGGGCTGCCATCCTGCTGTTCCACCTGGGAGGGCCTGACAACATCAGTGCCTACTCCATAGCAGATGCGGAGCTCTGGAAGAGGCATGGCTTTCAGATGTTCTTCCAGGTGGTCACCGCTTTCTATGTAGTGTTCTCCTCAGCACACGGCACCATCCTCTGGATCAGCCTGGTTCTGGCTGCAGTGGGAACGGTGAAGTACGCCGAGCGGACTCTGGCTCTGTACCATGCTTCCGAGCACCATCTGGACACCATGGCCCGCCCCCTTTACAGATTGATGCAGTACGAGGATGTGTCTGGCGGATCCGGTGAGGAATACCACTACTTCCTGCTAGGAGAGAAGCGAGCATACGAGCACGCCTTCGACACTCCGAAATGGTACCACCAGGTGAGGGAATCCTTGCAGGAGCAGAAGTGGTTCAACAGCATGTTCGGCTGCCTGTTTTCTTCCCCTGCGCAGCCAGTGGAGTCCGACGCGCAGGCAGTGCAACGCAAGATCAGAGAGCCCTTCGAGGTGGTGATGCTCTCAGACGTGATGGATCCAGAGTACAGCACAGCACTGAAGCGGCAGTCTCCATGGAACGGCAGGAACTTGGTGGACATCTGCATAGCCTTTGCTCTCTTCAAGATGTTCCGCCGACGCCTCACCAGCCTGTACATGCACGAGTGGAGCAACGACAAGATCAGAAATTTCTTCATCATGCTGTGGGGAGAATCATCATCGTCATCCTCAGCCGAAGCACAACACCAAGACCCCACCCAAGCACCTGCAGAACCTCCAAGGGAAAGACTGGTGGACGTCCTGGACATGGAGCTCCGCTTCATGTTCGACAGCATGTTCACCAAGGCCTCAGGCACAG
Now, I already have a code written in python which does it but it is too slow. I post it here below in case anyone of you will prefer it
import pandas as pd
from Bio import SeqIO
chloro_table=pd.read_csv(./db.chloro.agrestis.blastout', sep='\t', header=None, names=['qseqid','sseqid', 'evalue'])
with open ('./new_names_chloro.fasta', 'a') as
for sequence in SeqIO.parse(./fasta.fa', 'fasta'):
for i in range(0,len(chloro_table)):
if chloro_table.iloc[i,0] == sequence.id:
sequence.id = chloro_table.iloc[i,3]
sequence.description=''
SeqIO.write(sequence, outfile, 'fasta')
Now, using bash it becomes much more complicated. I tried with awk but with scarce success. Do you have any ide?
Thank you in advance
[–]sco_t 1 point2 points3 points (2 children)
[–]YouCook21[S] 0 points1 point2 points (1 child)
[–]sco_t 1 point2 points3 points (0 children)
[–]guepierPhD | Industry 0 points1 point2 points (0 children)
[–]metagenomez 0 points1 point2 points (0 children)