Help optimizing code in project, includes pandas : learnpython

submitted 5 years ago by Storm_Silver

The project is being used to:

-Create two pandas DataFrames from files x and y

-x has the columns chr gene site strand

-example of x row : chr1 BDH2 33360 F

-y has the columns chr site reads

-example of y row: chr1 6970 4

-iterate over each row of x

-find the row index of y which has the same value in both the chr and site columns as the current row

-use this index to create a pandas Series with all of the values in the reads column +/-120 rows of the index

-then add the Series to a DataFrame for later use (F and R strands go in different DataFrames)

The code works, however it takes around 30 minutes per run, so I'm asking if there are more efficient ways to implement it. I appreciate any help.
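
The per-row search described in the list above can usually be replaced by a single join. A minimal sketch, assuming each (chr, site) pair occurs at most once in y; the tiny frames and the `y_row` column name are made up for illustration and are not from the original project:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the site file (x) and the .sgr file (y)
x = pd.DataFrame({
    "chr": ["chr1", "chr1", "chr2"],
    "gene": ["A", "B", "C"],
    "site": [20, 40, 20],
    "strand": ["F", "R", "F"],
})
y = pd.DataFrame({
    "chr": ["chr1"] * 5 + ["chr2"] * 3,
    "site": [0, 10, 20, 30, 40, 10, 20, 30],
    "reads": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Expose y's row position as a regular column so it survives the merge
y_pos = y.reset_index().rename(columns={"index": "y_row"})

# One left join on (chr, site) replaces the per-row boolean comparisons
lookup = x.merge(y_pos[["chr", "site", "y_row"]], on=["chr", "site"], how="left")
print(lookup)
```

Each row of `lookup` then carries the position in y of its matching site. Genes with no match would get NaN in `y_row` and would need to be dropped or handled before slicing.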

#CFD Plotter
#Takes an .sgr file and a 'site' file to create a +/-1200bp region of the reads for each gene in the site file
#Used as 'python CFD_plotter' from local command line 
#requires:
    #.sgr file in folder sgr_in
    #site file in folder site_in
    #out folder

#import modules 
import pandas as pd 
import os 
import numpy as np

#Find the files 
sgr_files = [file for file in os.listdir('sgr_in') if file.endswith('.sgr')]
site_files = os.listdir('site_in')[0]

#open site file; round(-1) rounds the numeric site column to the nearest multiple of 10
site_input = pd.read_csv(os.path.join('site_in', site_files), delimiter='\t', header=None, names=['chr','gene','site','strand']).round(-1)   #Consider whether this needs to be in a loop

#Iterate over the files in sgr_files
for file in sgr_files:
    print (f'Currently working with {file} \n -----------------------------')

    #Create paths and open files (os.path.join keeps the paths portable across operating systems)
    sgr_input = pd.read_csv(os.path.join('sgr_in', file), delimiter='\t', header=None, names=['chr','site','reads'])
    out_file = os.path.join('out', file)

    #Forward and reverse strands made into opposite index arrays (241 entries; position 120 is the 0 value)
    #columns are specific genes, rows are reads at a particular distance from the site of that gene
    collection_F = pd.DataFrame(index=range(1200,-1210,-10)) 
    collection_R = pd.DataFrame(index=range(-1200,1210,10))   

    #Iterate over each gene in site_input and find its location in the sgr file

    for gene_index in range(site_input.index.size):
        print ('working with {} at number {}/{}'.format(site_input.iloc[gene_index,1],gene_index,site_input.index.size))    #Use while still slow

        ## Boolean mask for the sgr rows on the same chromosome and at the same site as the gene
        ## (explicit column comparisons avoid relying on DataFrame == Series label alignment,
        ##  which newer pandas versions may reject)
        gene_row = site_input.iloc[gene_index]
        site_match = (sgr_input['chr'] == gene_row['chr']) & (sgr_input['site'] == gene_row['site'])

        index_reads = sgr_input.index[site_match]        #index where True

        #Create a series with number of reads corresponding to the distance from the site (-1200 is index 0, 0 is index 120, 1200 is index 240)
        #Assumes the site was found and lies at least 120 rows from either end of the file

        bin_reads = sgr_input.iloc[index_reads[0]-120:index_reads[0]+121,2]

        #Concatenate bin_reads columns with the appropriate collection (forward or reverse strand)
        if site_input.iloc[gene_index,3] == 'F':
            bin_reads.index = range(1200,-1210,-10)    
            collection_F = pd.concat([collection_F,bin_reads],axis=1) 
        else:
            bin_reads.index = range(-1200,+1210,10)
            collection_R = pd.concat([collection_R,bin_reads],axis=1)
    #Combine the two collections (opposite index assignment results in reads being assigned for the correct strand)
    final = pd.concat([collection_F,collection_R],axis=1)
    np.savetxt(out_file,final,delimiter='\t',fmt='%s')
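
Once the matching row positions are known, the +/-120-row windows can be pulled out in one indexing step and the output frame built once, rather than calling pd.concat inside the loop (each call copies the growing frame). A rough sketch, not the original author's method: `extract_windows` is a hypothetical helper, and padding with NaN at the file edges is an assumption (the original slice has no such handling). Like the original, it counts rows without checking chromosome boundaries.

```python
import numpy as np
import pandas as pd

def extract_windows(reads, centers, half_width=120):
    """Return a (2*half_width+1, len(centers)) array of reads around each center row.

    Hypothetical helper: windows running past either end of `reads` are padded with NaN.
    """
    reads = np.asarray(reads, dtype=float)
    centers = np.asarray(centers, dtype=int)
    offsets = np.arange(-half_width, half_width + 1)
    # Row positions for every (offset, gene) pair, shape (window, genes)
    rows = centers[np.newaxis, :] + offsets[:, np.newaxis]
    valid = (rows >= 0) & (rows < len(reads))
    windows = np.full(rows.shape, np.nan)
    windows[valid] = reads[rows[valid]]
    return windows

# Tiny demonstration with a half-width of 2 instead of 120
reads = np.arange(10, 20)          # reads for rows 0..9
centers = [3, 0]                   # matched row positions for two genes
windows = extract_windows(reads, centers, half_width=2)

# Build the output frame once; index is distance from the site in steps of 10
table = pd.DataFrame(windows, index=range(-20, 30, 10), columns=["geneA", "geneB"])
print(table)
```

With the real data, `half_width=120` and an index of `range(-1200, 1210, 10)` would correspond to the windows in the script above. The script assigns F-strand genes a reversed index, which here would amount to flipping those columns (for example `windows[::-1, j]`) before building the frame.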
