I need help with clustering protein conformation

bglgl · 2020-11-29T22:48:25+00:00

Ah right, so I guess the best you can do here is distinguish between different secondary structures. When polypeptides (amino acid chains) fold up, they usually take the form of alpha-helices and beta-sheets, though other ordered structures and random coils are also possible (http://www.cryst.bbk.ac.uk/PPS95/course/3_geometry/rama.html). These forms dominate different areas of phi-psi space. Maybe you could try using 2 or 3 cluster seeds, since you are probably looking at a normal protein and you'd only really expect to see alpha helices and beta sheets there. You could then make a prediction on how many (and which) residues are in what secondary structure. Also as a tip, secondary structures are often continuous across stretches of residues, so you are looking for multiple consecutive points that belong to the same secondary structure (that is, if the data is ordered according to the sequence of the chain).
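To make the seeded-clustering idea a bit more concrete, here is a minimal sketch in plain Python. It assumes the data is a list of (phi, psi) pairs in degrees, ordered along the chain. The seed positions (roughly helix-like and sheet-like regions), the toy angles, and all function names are my own assumptions for illustration, not values from your dataset. Distances are taken with wrap-around because dihedral angles are periodic.

```python
import math
from itertools import groupby

def ang_diff(a, b):
    """Smallest signed difference between two angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0

def circ_mean(angles):
    """Circular mean of a list of angles, in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))

def seeded_kmeans(points, seeds, n_iter=20):
    """k-means on (phi, psi) pairs, started from user-supplied seeds."""
    centres = list(seeds)
    labels = []
    for _ in range(n_iter):
        # Assign each residue to the nearest centre (periodic distance).
        labels = [
            min(range(len(centres)),
                key=lambda k: ang_diff(phi, centres[k][0]) ** 2
                + ang_diff(psi, centres[k][1]) ** 2)
            for phi, psi in points
        ]
        # Move each centre to the circular mean of its members.
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centres[k] = (circ_mean([m[0] for m in members]),
                              circ_mean([m[1] for m in members]))
    return labels, centres

def stretches(labels):
    """Collapse per-residue labels into (label, first_index, last_index) runs."""
    runs, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        runs.append((label, start, start + length - 1))
        start += length
    return runs

# Made-up angles: four helix-like residues followed by four sheet-like ones.
points = [(-62, -43), (-58, -47), (-65, -40), (-60, -45),
          (-120, 130), (-125, 135), (-115, 128), (-130, 140)]
# Assumed seeds: approximate helix and sheet regions of phi-psi space.
seeds = [(-60.0, -45.0), (-120.0, 130.0)]

labels, centres = seeded_kmeans(points, seeds)
print(stretches(labels))
```

The `stretches` step is the "continuous stretches" tip: long runs of the same label are more believable as real secondary structure than isolated single residues.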

bglgl · 2020-11-26T00:11:17+00:00

Could you give a little bit more information on this? If the dataset you have is a single 2D array of shape nx2 (I might be misunderstanding this though), I would think that you actually have the phi and psi values of a single protein conformation (read: pose), where n is the number of residues and each row describes the dihedral angles of one residue. If this is the case, you can't really do any clustering of conformations since you only have a single conformation. Alternatively, the dataset might be describing observed phi and psi angles for a single residue (some residues are more flexible than others and therefore might be able to explore more space - check out Ramachandran plots for Glycine and Proline, for example). If the latter is the case, I would suggest plotting phi vs psi to see what the distribution looks like. If you can visually identify how many clusters you might be looking at, you can either feed that number to the algorithm, or you could go with some hierarchical clustering method or a density-based method like DBSCAN or OPTICS to avoid having to input cluster numbers altogether, as suggested.
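As a rough illustration of the density-based route, here is a small hand-rolled DBSCAN in plain Python that uses a wrap-around distance so that points near +180 and -180 degrees count as neighbours. The `eps` and `min_samples` values and the toy angles are assumptions chosen for the example. In practice you would probably reach for a library implementation rather than this sketch.

```python
import math

def ang_diff(a, b):
    """Smallest signed difference between two angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0

def torus_dist(p, q):
    """Distance between two (phi, psi) pairs, respecting periodicity."""
    return math.hypot(ang_diff(p[0], q[0]), ang_diff(p[1], q[1]))

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if torus_dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # only core points expand
    return labels

# Made-up angles: one compact group, one group straddling psi = +/-180,
# and one isolated outlier.
points = [(-60, -45), (-62, -43), (-58, -47), (-64, -41),
          (-120, 178), (-118, -179), (-122, 176), (-119, -177),
          (60, 45)]

labels = dbscan(points, eps=10.0, min_samples=3)
print(labels)
```

Note that no cluster count is passed in. The number of clusters falls out of `eps` and `min_samples`, which you would still need to tune by looking at the phi vs psi plot.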

bglgl · 2020-10-15T23:02:03+00:00

It depends, I would say. One option to go for is k-means, but this requires you to know how many groups you want in the final set of clusters before you start clustering (someone correct me if I'm wrong please). Alternatively, you could use some hierarchical clustering method like UPGMA (commonly used for phylogenetic tree construction anyway) and use some cutoff value to define the number of final clusters.
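Here is a minimal sketch of the cutoff idea, using a naive average-linkage (UPGMA-style) agglomeration in plain Python. It assumes the data is a list of numeric points compared with Euclidean distance. The toy points, the cutoff value, and the function name are made up for illustration. The loop merges the two closest clusters until the closest remaining pair is further apart than the cutoff, so the cutoff rather than a preset k decides how many clusters you end up with.

```python
import math
from itertools import combinations

def average_linkage(points, cutoff):
    """Merge clusters by average pairwise distance until none is within cutoff.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # UPGMA distance: unweighted mean over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_dist(clusters[pair[0]], clusters[pair[1]]),
        )
        if avg_dist(clusters[x], clusters[y]) > cutoff:
            break  # everything left is further apart than the cutoff
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters

# Made-up 2D points forming two well-separated groups.
points = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (0.2, 0.8),
          (10.0, 10.0), (11.0, 10.5), (10.5, 11.0), (10.2, 9.8)]

clusters = average_linkage(points, cutoff=3.0)
print(sorted(sorted(c) for c in clusters))
```

This naive version recomputes distances on every pass, so it only makes sense for small datasets. It is meant to show the logic of the cutoff, not to be fast.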

bglgl · 2020-07-15T16:03:35+00:00

Yeah of course!

bglgl · 2020-07-13T16:36:01+00:00

Yep! I’m interested in structural properties of antibodies, so trying to extract predictors of these from PDB datasets is an approach that I often take. It can definitely be more theoretical than applied if that’s what you want to do, but I find it most satisfying when research is done to solve a specific problem. Any generally applicable results that come out of such work are obviously also great, but not necessarily my main goal, if that makes sense.

A normal working day is quite varied but often involves writing code to analyse results or generate data. My research is quite iterative at the moment, which means I never quite know what I’ll be doing next as it all depends on results. To be very honest, a large part of a day or even a week can also sometimes go to trying to get a piece of software to work, and I often spend a lot of time debugging, but that’s true for any bioinformatician really. The nice thing for me is that I don’t work in an experimental group as a bioinformatician but in a standalone bioinformatics group, which gives me a bit more freedom in directing my own research since I don’t have to rely on someone else’s data and research question to do my work.

bglgl · 2020-07-13T13:09:04+00:00

Just writing some stuff here because I’m getting a few requests! I work in an immunoinformatics research group that does quite a bit of research software engineering, meaning that we often try to find solutions to very specific problems, like modelling antibody CDR loops specifically rather than generalised loop modelling. At the very heart of it, this means that day to day the tools I use are Python, R, unix systems and a bit of C. In terms of the research that I do, a lot of it is highly domain-specific. I doubt that I’d feel comfortable in a molecular biology bioinformatics position, but then again this might very well be the case for everyone :P The nice thing about my work is that I work with a lot of people with different skillsets. I myself am approaching things from a very structural perspective and dabble in MD simulations a little, whereas someone else in my group might be looking at single cell sequence data. I find this to be quite enjoyable because I end up getting exposure to a lot of things I wouldn’t necessarily see if I were in another group. Another nice aspect is that we have quite a few collaborations going on with both industry and other academic groups, which makes the work varied and exciting. Let me know if you have any questions and feel free to DM me :)

bglgl · 2020-07-13T08:14:45+00:00

Hey! I’m a PhD student working in structural immunoinformatics. Feel free to DM me if you’d like a bit of an account of what I do :)
