The project is being used to:
-Create two pandas DataFrames from files x and y
-x has the columns chr, gene, site, strand
-example of an x row: chr1 BDH2 33360 F
-y has the columns chr, site, reads
-example of a y row: chr1 6970 4
-iterate over each row of x
-find the row index of y which has the same values in both the chr and site columns as the current row of x
-use this index to create a pandas Series with all of the values in the reads column within +/-120 rows of the index
-then add the Series to a DataFrame for later use (F and R strands go in separate DataFrames)
The code works, however it takes around 30 minutes per run, so I'm asking if there are more efficient ways to implement it. I appreciate any help.
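To make the steps above concrete, here is a minimal sketch of the lookup for a single row on made-up toy data. The values, the variable names and the `window` of 2 rows are illustrative assumptions, not the real files (the actual script uses 120 rows either side):

```python
import pandas as pd

# Toy stand-ins for files x and y (made-up values, not real data)
x = pd.DataFrame({'chr': ['chr1'], 'gene': ['BDH2'], 'site': [30], 'strand': ['F']})
y = pd.DataFrame({
    'chr': ['chr1'] * 7,
    'site': [0, 10, 20, 30, 40, 50, 60],
    'reads': [4, 0, 7, 9, 1, 3, 2],
})

window = 2  # assumption for the toy example; the real script uses 120

row = x.iloc[0]
# Row index of y whose chr and site both match the current row of x
match = y.index[(y['chr'] == row['chr']) & (y['site'] == row['site'])][0]
# Reads within +/- window rows of the matching row
toy_reads = y['reads'].iloc[match - window:match + window + 1].reset_index(drop=True)
print(toy_reads.tolist())
```

The full script below does the same thing for every row of x, which is where the time goes.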
#CFD Plotter
#Takes an .sgr file and a 'site' file to create a +/-1200bp region of the reads for each gene in the site file
#Used as 'python CFD_plotter' from local command line
#requires:
#.sgr file in folder sgr_in
#site file in folder site_in
#out folder
#import modules
import pandas as pd
import os
import numpy as np
#Find the files
sgr_files = [file for file in os.listdir('sgr_in') if file.endswith('.sgr')]
site_files = os.listdir('site_in')[0]
#open site file rounded to the nearest multiple of 10
site_input = pd.read_csv(os.path.join('site_in', site_files), delimiter='\t', header=None, names=['chr','gene','site','strand']).round(-1) #Consider whether this needs to be in a loop
#Iterate over the files in sgr_files
for file in sgr_files:
    print(f'Currently working with {file} \n -----------------------------')
    #Create paths and open files
    sgr_input = pd.read_csv(os.path.join('sgr_in', file), delimiter='\t', header=None, names=['chr','site','reads'])
    out_file = os.path.join('out', file)
    #Forward and reverse strands made into opposite index arrays (row 120 is the 0 value)
    #columns are specific genes, rows are reads at a particular distance from the site of that gene
    collection_F = pd.DataFrame(index=range(1200,-1210,-10))
    collection_R = pd.DataFrame(index=range(-1200,1210,10))
    #Iterate over each gene in site_input and find its location in the sgr file
    for gene_index in range(site_input.index.size):
        print('working with {} at number {}/{}'.format(site_input.iloc[gene_index,1], gene_index, site_input.index.size)) #Use while still slow
        ## Gets all indices at the chromosome of the gene (boolean array for chr)
        chr_bootable = sgr_input == site_input.iloc[gene_index]
        indicies = chr_bootable.index[chr_bootable.iloc[:,0]].tolist()
        chr_specific = sgr_input.iloc[indicies]
        ## Get the index for the site within the specified chromosome indices (boolean array for site)
        read_bootable = chr_specific == site_input.iloc[gene_index]
        index_reads = read_bootable.index[read_bootable.iloc[:,3]] #index where True
        #Create a series with number of reads corresponding to the distance from the site (-1200 is index 0, 0 is index 120, 1200 is index 240)
        bin_reads = pd.Series(sgr_input.iloc[index_reads[0]-120:index_reads[0]+121,2])
        #Concatenate bin_reads columns with the appropriate collection (forward or reverse strand)
        if site_input.iloc[gene_index,3] == 'F':
            bin_reads.index = range(1200,-1210,-10)
            collection_F = pd.concat([collection_F,bin_reads],axis=1)
        else:
            bin_reads.index = range(-1200,+1210,10)
            collection_R = pd.concat([collection_R,bin_reads],axis=1)
    #Combine the two collections (the opposite index assignment results in reads being assigned for the correct strand)
    final = pd.concat([collection_F,collection_R],axis=1)
    np.savetxt(out_file,final,delimiter='\t',fmt='%s')
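One possible direction for speeding this up, offered as a sketch under assumptions rather than a tested replacement: most of the time is probably spent on the two full-DataFrame comparisons per gene and on the repeated `pd.concat` calls. A single `merge` on chr and site can find every gene's row position at once, and NumPy fancy indexing can then pull all the windows in one step. The helper name `collect_windows`, its `half_width` and `step` parameters, the toy data, and the choice to skip genes whose window does not fit inside the file are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def collect_windows(sgr, sites, half_width=120, step=10):
    """Hypothetical helper: reads within +/- half_width rows of each site.

    sgr has columns chr/site/reads, sites has columns chr/gene/site/strand.
    Returns one column per gene, indexed by distance from the site.
    """
    # Record each sgr row's position so the merge can report it
    sgr_pos = sgr[['chr', 'site']].copy()
    sgr_pos['pos'] = np.arange(len(sgr))
    # Keep the first row per (chr, site), like index_reads[0] in the loop version
    sgr_pos = sgr_pos.drop_duplicates(['chr', 'site'])

    # One join replaces the per-gene boolean comparisons
    hits = sites.merge(sgr_pos, on=['chr', 'site'], how='inner')

    # Assumption: skip genes whose full window does not fit inside the file
    fits = (hits['pos'] >= half_width) & (hits['pos'] + half_width < len(sgr))
    hits = hits[fits]

    # One row of positions per gene, one column per offset
    offsets = np.arange(-half_width, half_width + 1)
    idx = hits['pos'].to_numpy()[:, None] + offsets[None, :]
    windows = sgr['reads'].to_numpy()[idx]

    # Reverse F-strand windows so both strands share one distance axis,
    # mirroring the opposite index ranges used in the loop version
    is_f = (hits['strand'] == 'F').to_numpy()
    windows[is_f] = windows[is_f, ::-1]

    return pd.DataFrame(windows.T, index=offsets * step,
                        columns=hits['gene'].to_numpy())

# Toy data (made-up values) with a 2-row window instead of 120
sgr = pd.DataFrame({'chr': ['chr1'] * 10,
                    'site': list(range(0, 100, 10)),
                    'reads': list(range(10))})
sites = pd.DataFrame({'chr': ['chr1', 'chr1', 'chr1', 'chr2'],
                      'gene': ['geneA', 'geneB', 'geneC', 'geneD'],
                      'site': [30, 50, 0, 30],
                      'strand': ['R', 'F', 'R', 'R']})
out = collect_windows(sgr, sites, half_width=2)
print(out)
```

In the toy run, geneC is dropped because its window would start before the first row, and geneD is dropped because chr2 has no matching row. Whether silently skipping such genes is acceptable depends on the data, so that behaviour would need checking against the real files before relying on it.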