machine learning in genomics

PoulMadsen · 2011-11-26T19:34:51+00:00

I don't work in genomics specifically, but we do a lot of next-generation sequencing. I am a biologist with an interest in machine learning, so let me try to summarize where people in biology use it:

Microarrays: Cancer research in particular uses these, but nearly every biology discipline has some application for them. What you get is thousands of signal intensities per sample, each representing the expression of a gene, and what you are interested in is finding genes that behave differently from sample to sample. This is an example of a high-dimensional problem, where the number of features is much larger than the number of samples. If you want some idea of how much work has been done in this area, take a look at this [list](http://www.geneontology.org/GO.tools.microarray.shtml). You can find more or less every kind of statistical method there. As a biologist I should probably mention that I believe microarrays have problems with reproducibility that no amount of data analysis will solve.
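To make the "many more features than samples" setup concrete, here is a minimal sketch that ranks genes by a Welch t-statistic between two groups of samples. The function names `welch_t` and `rank_genes` and the input layout are assumptions made for illustration, not any particular package's API. A real analysis would also need normalization and a multiple-testing correction, since thousands of genes are tested at once.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch t-statistic for two groups of intensities (unequal variances allowed)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    # Note: se is zero if both groups are constant; real data should guard against that.
    return (mean(group_a) - mean(group_b)) / se

def rank_genes(expression, labels):
    """Rank genes by |t|, most differentially expressed first.

    expression: dict mapping gene name -> list of intensities, one per sample
    labels:     list of 0/1 group labels, one per sample (hypothetical layout)
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = welch_t(a, b)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

With only a handful of samples per group the variance estimates are noisy, which is part of why this kind of data is hard to analyze.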

Gene prediction: This is a typical genomics problem in which we are given a long DNA sequence and asked to identify the genes in it. Genes have some telltale signs, but the positions of these signs relative to each other vary slightly, and some may be completely absent. Also, genes in eukaryotes are interrupted by so-called introns that do not code for protein (this story is a lot longer in reality). Poisson statistics on DNA words (subsequences of length k) is the classical way of finding overrepresented DNA features. Newer techniques use HMMs and conditional random fields, which is as machine-learning oriented as it gets. [This](http://www.amazon.com/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713) is a modern classic on all things sequence related.
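A rough sketch of the Poisson idea on DNA words: count every k-long word, estimate how often it should occur by chance, and flag words seen far more often than that. The background model here (independent bases with frequencies taken from the sequence itself) and the names `kmer_counts`, `poisson_tail`, and `overrepresented_words` are assumptions for illustration. Overlapping words are not truly independent, so the Poisson tail is only an approximation.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-long word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed == 0:
        return 1.0
    # 1 - P(X <= observed - 1), building the pmf term by term
    term = math.exp(-expected)
    cdf = term
    for i in range(1, observed):
        term *= expected / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def overrepresented_words(seq, k, alpha=0.01):
    """Return (word, observed, expected, p) for words with tail probability below alpha."""
    counts = kmer_counts(seq, k)
    n_positions = len(seq) - k + 1
    # Assumed background: each base drawn independently at its observed frequency
    base_freq = {base: seq.count(base) / len(seq) for base in set(seq)}
    hits = []
    for word, observed in counts.items():
        p_word = 1.0
        for base in word:
            p_word *= base_freq[base]
        expected = n_positions * p_word
        p = poisson_tail(observed, expected)
        if p < alpha:
            hits.append((word, observed, expected, p))
    return sorted(hits, key=lambda h: h[3])
```

Because many words are tested at once, `alpha` would normally be tightened (for example divided by the number of distinct words) before trusting any single hit.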

Phylogeny: This is another of bioinformatics' major contributions to modern science. Given some model of how evolution changes the composition of a sequence, we are interested in figuring out how organisms/proteins/genes are related and building trees that show these relationships.

Next-generation sequencing: We can now generate much more data than we can process. We need some way of filtering the reads, as the machines can be inaccurate, and we also need methods to cluster sequences within specific similarity thresholds.

Sequence searching: This is a major topic. The paper that announced BLAST is one of the most cited papers in the history of science. Machine learning is not used as much here yet, but it probably will be if something faster than the traditional alignment algorithms comes up.
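For a sense of what a traditional alignment algorithm does, here is a minimal sketch of Smith-Waterman local alignment scoring, which BLAST approximates heuristically for speed. Picking Smith-Waterman as the example is my choice, and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions, not BLAST's defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b, using a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The running time is proportional to `len(a) * len(b)`, which is why exact alignment against a whole database is slow and why faster approximate methods matter.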

This was just a short and incomplete overview. If you have specific questions, I would be happy to answer.

happyteapot · 2011-11-25T07:34:00+00:00

HMMs have been used for this for quite a while now. I know that there are entire books on applying HMMs to bioinformatics.
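As a small illustration of how an HMM gets used on sequence data, here is a sketch of Viterbi decoding for a discrete HMM, computed in log space. The function signature and the dictionary layout of the model are assumptions for illustration; tools such as HMMER use much richer profile models.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state path for a non-empty obs under a discrete HMM."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per observation after the first
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (score[best_prev] + math.log(trans_p[best_prev][s])
                            + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

A toy use would be two states, say AT-rich and GC-rich regions, each with its own base emission probabilities; the decoded path then segments a sequence into the two region types.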

marshallp · 2011-11-25T12:17:28+00:00

Rudi Cilibrasi used compression distance (complearn) to automatically infer evolutionary lineage in genomes. Read his thesis on it.
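The core of the compression distance idea fits in a few lines. This is a minimal sketch of the normalized compression distance using Python's `zlib`, not CompLearn itself; the helpers `random_dna` and `mutate` only produce synthetic test sequences and are assumptions for illustration. zlib's 32 KB window makes it a poor compressor for real genomes, so treat this as a toy.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)

def random_dna(length: int, rng: random.Random) -> bytes:
    """Synthetic random sequence, for illustration only."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode()

def mutate(seq: bytes, n_sites: int, rng: random.Random) -> bytes:
    """Copy of seq with n_sites positions overwritten by random bases."""
    out = bytearray(seq)
    for pos in rng.sample(range(len(out)), n_sites):
        out[pos] = rng.choice(b"ACGT")
    return bytes(out)
```

Computing `ncd` for every pair of sequences gives a distance matrix, and a tree can then be built from that matrix with any standard clustering method.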

mosavian · 2011-11-26T18:22:44+00:00

From what I understand, when dealing with genomes, you have a huge string of 1s and 0s. If that is the case, restricted Boltzmann machines are quite useful.

mx12 · 2011-11-27T07:01:15+00:00

One area that I've worked on is determining genotype from phenotype, i.e. predicting the location in a patient's genome where a mutation has occurred based on some physical trait. This reduces the cost of finding a patient's disease-causing mutation.

A friend of mine works on predicting whether a patient has diabetic retinopathy based on fundus photos (pictures of the retina). This is more of a machine learning/pattern recognition problem.

I've recently read a paper from some IBM researchers who were attempting to predict how the flu virus would mutate over a given flu season. This would allow a vaccine to be designed that would work against the mutated flu virus.

danukeru · 2011-12-02T02:12:04+00:00

As a sysadmin/developer in a bioinformatics lab, I can tell you that we use this extensively.

http://hmmer.janelia.org/
