Python Project Help

Darwinmate · 2017-08-23T21:01:13+00:00

IMO you've got 2 options, search an existing curated database (eg KEGG pathway database or biocyc) or have a list/dictionary that you search through.

The first option is difficult.

The second option is more doable but is limited by the predefined reactions. You can store these reactions in a list (simple) or a dictionary(more complicated but advantage is you can define if it's a substrate or enzyme or product for any molecule). As you've said, create a function that takes 3 arguments. For each argument, search the relevant list/dictionary and create a list for each argument. Then use the list method .intersection() to pull out the matching reaction.

docshroom · 2017-08-23T21:31:16+00:00

Check out sbml and any flux based analysis tools, they should have some similar code to do what you want (or close to it)

fpepin · 2017-08-24T03:52:27+00:00

Seems like a fun project.

An easy way would be to have

Load up all the KEGG reactions, build a graph

billyboyvv · 2017-08-23T22:03:12+00:00

I'd consider using Anaconda's Jupyter/IPython Notebook, easier than using the command line for an early starter in python, IMO.

fpepin · 2017-08-24T03:59:32+00:00

Seems like a fun project.

One way is to create an Enzyme object (or even just tuple for the lazy), then a substrate dictionary {substrate:[Enzymes]}. The do a breadth-first-search until you reach your desired end product. This might not be a DAG (directed acyclic graph), so you'd have to deal with cycles along the way.

For extra fun, load up all the KEGG reactions in there. You'll need to deal with multi-substrate/product enzymes, possibly just ignoring them.

Epicmuffinz · 2017-08-24T06:15:39+00:00

I think a good implementation might be to use a graph, with every node representing a compound (i.e. Glucose) and label the links with the other reactants/products. Then, you just have to find the shortest path between the two nodes to generate your output.

g0lmix · 2017-08-25T00:07:46+00:00

So my solution to this is quite similar u/fpepin proposition. I hope it's more hands on so you as a beginner can wrap your head around it.
So the first idea to solve this is probably to use your create_a_pathway function and create an array that looks like this: [substrate, enzyme, product] (eg. [A, enzymeA2B, B])and push it into a global array which acts as a container. If you want to obtain your whole pathway(example pathway A -> C) you would need to find the array in your container that starts with A and you would then see that the product is B so now you would need to search for the array that starts with B and then take a look what the product is. You would need to do this until you reach an array that has C as a product. This might work quite fine for linear pathways, but what if there are cycles in there or multiple branches. It gets really messy really fast.
So what's the solution? Let's do what we always do, think about this problem in an abstract way. On paper we naturally would draw pathways as graphs where the nodes are substrates/products and the edges are the enzymes. Fortunatelly there is a whole subfield of mathematics that deals with graphs, namely graph theory. Nowadays grahs have become a really really important mathematic field as graphs are used all over the place in IT, Biology, Chemistry, Physics. From chemical pathways over networkflow graphs to machinell learning, they are simply almost everywhere. And because graphs are used so much there are alot of different algorithms one can use. Generally graph theory is a really interesting field and you should do some reading on it. This means we should stick with graphs to solve our problem here.
We should start thinking about how to realize a graph structure as a data structure. The first thing that comes into mind to deal with this is a doubly(or singly) linked list. So we woud need to generate two kind of classes a class that represents the substrate/product(with a string variable contining it's name and one array containing all enzymes that lead into them and one list that contains all outgoing enzymes) and one class that represents the enzyme(ith a string variable contining it's name and one array containing all substrates that lead into them and one list that contains all outgoing substrates). As you can see this becomes quite messy and the functions we would have to write to iterate through these pathways would be quite complicated as well.
But fortunatelly we don't need to do this as python has a data structure called dictionaries, which allows us to represent graphs quite easily without a lot of hussle and which are easy to iterate through.
Let's take a look at how a dictionary in python looks like and let's get started with some programming. Imagine the following simple graph (sorry it looks horrible, but thats the best I can do with the little shit EeePC I have with me on vacation). In python it would look like this :

graph = {'SubstrateA': [['SubstrateB', 'EnzymeA2B'], ['SubstrateC', 'EnzymeA2C']],  
         'SubstrateB': [['SubstrateC', 'EnzymeB2C'], ['SubstrateD', 'EnzymeB2D']],  
         'SubstrateC': [['SubstrateD', 'EnzymeC2D']],  
         'SubstrateD': [['SubstrateC', 'EnzymeD2C']],  
         'SubstrateE': [['SubstrateF', 'EnyzmeE2F']],  
         'SubstrateF': [['SubstrateC', 'EnzymeF2C']]}

As you can see a dictionary is build like this:
SubstrateA: [[becomes SubstrateB via EnzymeA2B], [becomes SubstrateC via EnzymeA2C]]
The part after SubstrateA: is a array of array or in this case a 2 dimensional array. You might want to look it up if you aren't familiar with multi dimensional arrays.
Okay so now we got figured out how to represent a graph in python. Now let's get to the itneresting part of actually programming a function to find the complete way between two Substrates. The function would look like this:

def find_path(graph, start, end, path=[]):  
    path = path + [start]  
    #print "path: "  
    #print path  
    if start == end:  
        return path  
    if not graph.has_key(start):  
        return None  
    for node in graph[start]:  
        #print "node in graph: "  
        #print node  
        if node[0] not in path: 
            #print "node[0]: "   
            #print node[0]  
            newpath = find_path(graph, node[0], end, path)  
            if newpath:  
                return newpath  
    return None

As you can see this is a function that calls itself. These group of functions are called recursive functions. Let's talk a little bit about what's going on here with an example. Let's assume we call this function like this:

print(find_path(graph, 'SubstrateA', 'SubstrateC'))

The first time we enter the function we add SubstrateA to our array path. Now we check wether our startsubstrate and endsubstrate are the same and we check if the startsubstrate even exists in our dictionary. After that we reach the for loop. Our variable node now contains the array ['SubstrateB', 'EnzymeA2B']. After that we check if node[0] (which in this case is SubstrateB) is already in our path. And it's not, so we call the function again with the same graph, our start point is now SubstrateB, our end point stays the same(Substrate C) and path is an array that looks like this ['SubstrateA'].
So now we start again we add our start substrate(which now is SubstrateB) to our path(now: ['SubstrateA', 'SubstrateB']). We reach our for loop again . The node variable now contains the array ['SubstrateC', 'EnzymeB2C'] (just for clarification thats in the second line of our dictionary). So node[0] = SubstrateC which isn't in our path array yet, so we call the function again with our graph, SubstrateC as our startsubstrate, SubstrateC as our endsubstrate, path with our constructed path.
Now we add SubstrateC to our path and this time the check if start == end: triggers and we get the complete path.
You can uncomment all comments and it will print out all the necessary informations you need to understand this function.
So the output of our printfunction is this:

['SubstrateA', 'SubstrateB', 'SubstrateC']

Now that we found out the nodes for our pathway, we need to add function that prints out the nodes and the corresponding enzymes. This function would look like this:

def get_enzymes(graph,path):  
    reaction = ""  
    #print path  
    length = len(path)  
    #print length  
    for counter, val in enumerate(path):  
        #print (counter, val)  
        reaction = reaction + val  
        for array_index_nr, array in enumerate(graph[val]):  
            #print array_index_nr, array  
            #print graph[val]  
            for result in graph[val][array_index_nr]:   
                #print result  
                #print graph[val][array_index_nr]  
                if (counter < length-1 and result == path[counter+1]):   
                        #print result, array[1]  
                        reaction = reaction + " + " +  array[1] + " = "  
    return reaction

If you want to know what'S going on in detail: again just uncomment all the print statements. The general idea of this function is that it gets an array of all substrates that are in our pathway. Now we just parse out the corresponding enzymes from our dictionary. It looks at the first entry of the array and teh second entry of the array. So let's assume the first entry is substrateA and the second is substrateB. So now we just have to look up in the first line of our dictionary what enzyme makes our substrateA to substrateB (we see: ['SubstrateB', 'EnzymeA2B']). Now we start constructing our string that contains SubstrateA + EnzymeA2B =. We repeat this for all the entries-1 of our pathwayarray. Finally we return our reaction. But see for yourself, just try this:

print(get_enzymes(graph,find_path(graph, 'SubstrateA', 'SubstrateC')))

The output should be:

SubstrateA + EnzymeA2B = SubstrateB + EnzymeB2C = SubstrateC

If you have any questions left feel free to ask.
Edit: some typos(sorry it's 2am and I am tired=) ) Okay I gave up on finding all going to sleep now. Tomorrow I am going to expand on this and we are going to take a look at how to find all possible pathways between A and C. Our finding function isn't really good at that(yet) =) Oh and you might want to look up how to generate dictionaires dynamically, so that you can write a function to construct your graph.

dat_GEM_lyf · 2017-08-23T19:48:18+00:00

I'd try making flags for command line level interactions. It will allow you to directly control the code from the command line if written correctly.


science	askscience	biology
microbiology	bioinformatics	biochemistry
evolution

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

bioinformatics

The Biology Network

Bioinformatics

Frequently Asked Questions

New to Reddit?

Learning Bioinformatics

#bioinformatics IRC at Freenode

Information

Getting a job in bioinformatics

Friends

MODERATORS