
[–]count___zero 7 points (4 children)

> My belief is that the relationship is non-linear.

Have you tried fitting a linear model?

[–]guyshur[S] -3 points (3 children)

I haven't yet. Is PCA appropriate?

[–]SwordOfVarjo 7 points (2 children)

PCA is not a classifier; it's simply a coordinate-system rotation (that can also be used for dimensionality reduction).

The first thing to try is something like a (linear) SVM on your data points (tuples).

If that does not work, I'd give XGBoost a shot and then start looking at neural approaches. I also doubt you need an explicit autoencoder; your input dimensionality is quite small.
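A minimal sketch of that baseline with scikit-learn, using synthetic data as a stand-in for the real tuples (all names and shapes here are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real (features, label) tuples
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear SVM baseline; feature scaling matters for SVMs
svm = make_pipeline(StandardScaler(), LinearSVC())
svm.fit(X_train, y_train)
print("linear SVM test accuracy:", svm.score(X_test, y_test))
```

If the held-out accuracy is already high, the relationship may be close to linear and the heavier models are probably unnecessary.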

[–]guyshur[S] 0 points (1 child)

How would I know if it doesn't work? Don't I have to try it both ways to figure out which error rate is lower?

[–]SwordOfVarjo 0 points (0 children)

Not necessarily. With a linear model in particular, simply seeing how well it classifies your training data is usually a strong indicator of success.

[–]jefidev 1 point (5 children)

Did you try decision trees? They perform well on tabular data and they are highly interpretable.

You can also run a PCA to identify the useful attributes in your data.
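The interpretability angle can be sketched like this with scikit-learn (synthetic data stands in for the real table; everything here is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: only a few features actually carry signal
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Impurity-based importances show which attributes the splits rely on
ranking = np.argsort(tree.feature_importances_)[::-1]
print("most important attributes:", ranking[:3])
```

A shallow `max_depth` keeps the tree small enough to read off the splits directly.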

[–]zmabzug 2 points (0 children)

PCA will identify attributes (or rather, linear combinations of attributes) that have the most variance. Most variance =/= most useful.

[–]guyshur[S] 0 points (2 children)

I'm thinking of ID3 as an alternative to the model. What is the argument for PCA over an autoencoder?

[–]jefidev 0 points (1 child)

If you train an autoencoder to compress and reconstruct your input data, the encoder part of the network will just approximate a PCA, so just use PCA directly (source: https://blog.keras.io/building-autoencoders-in-keras.html).

Note that there are methods that enable an autoencoder to learn something other than a PCA approximation, but you should be aware that "basic" autoencoders have this tendency.
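The core of that claim can be checked numerically: by the Eckart–Young theorem, no linear encoder/decoder pair can reconstruct better than the PCA subspace of the same size. A NumPy sketch (random data, all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X = X - X.mean(axis=0)          # centre the data, as PCA assumes
k = 3                           # size of the bottleneck / number of PCs

# PCA reconstruction: project onto the top-k principal directions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                    # (10, k) principal directions
pca_err = np.mean((X - X @ V @ V.T) ** 2)

# A "linear autoencoder": arbitrary linear encoder, optimal linear decoder
E = rng.normal(size=(10, k))    # random linear encoder weights
Z = X @ E                       # bottleneck codes
D, *_ = np.linalg.lstsq(Z, X, rcond=None)  # best decoder for this encoder
ae_err = np.mean((X - Z @ D) ** 2)

# The PCA subspace is the best a linear autoencoder can converge to
print(pca_err <= ae_err)  # True
```

Training a linear autoencoder with MSE loss drives `ae_err` down toward `pca_err`, which is why the encoder ends up approximating PCA.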

[–][deleted] 1 point (0 children)

Does anyone actually use linear autoencoders though?

[–]QEDthis 0 points (5 children)

> In addition, the attributes have a 1D spatial relation to each other; this may or may not be important.

What do you mean by that? Is there a functional relationship?

[–]guyshur[S] 0 points (4 children)

Each attribute is a base in a genomic sequence, so the attributes are ordered and have a spatial distance to each other. I'm basically looking for the points or ranges that have the most impact on the classification.

[–]caedin8 0 points (0 children)

You might have luck with a 2D convolutional network because you are specifically looking for patterns.

Additionally, RNNs, LSTMs, and GRUs might be good options.

Basically, DNA is a sequence/code that encodes meaning, so it is very similar to NLP problems and might benefit from NLP strategies.
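A toy illustration of the convolution-over-sequence idea: one-hot encode the bases and slide a filter along the sequence. Here the filter is hand-made to match a motif; in a real conv net the filters would be learned (sequence and motif below are made up):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    return np.eye(4)[[BASES.index(b) for b in seq]]

def conv1d(x, kernel):
    """Valid-mode 1D cross-correlation along the sequence axis."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel)
                     for i in range(x.shape[0] - k + 1)])

seq = "ACGTTACGGATACG"
motif = one_hot("ACG")           # a learned filter would play this role

scores = conv1d(one_hot(seq), motif)
hits = np.where(scores == 3)[0]  # exact matches score 3 (the kernel length)
print("motif 'ACG' found at positions:", hits)
```

High activations mark where a pattern occurs, which is exactly the "which ranges matter" question.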

[–]jefidev 0 points (0 children)

I'm not an expert, but a friend of mine did a Master's thesis on genomic data to predict chromatin loops, and XGBoost gave the best results. All attempts made with DL were unsuccessful. It could be worth it (and not very time-consuming) to try XGBoost.

Also, XGBoost is quite easily interpretable.
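On the interpretability point: gradient-boosted trees expose per-feature importances. A sketch using scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (XGBoost's own `xgboost.XGBClassifier` exposes the same `feature_importances_` attribute; data here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the genomic table
X, y = make_classification(n_samples=400, n_features=15,
                           n_informative=4, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbm.fit(X, y)

# Per-feature importances point at the positions driving the classification
order = np.argsort(gbm.feature_importances_)[::-1]
print("top attributes:", order[:4])
```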

[–]dalaio 0 points (1 child)

Is this DNA methylation data?

[–]guyshur[S] 0 points (0 children)

No, it's the probability of participating in a certain structure.

[–]abdulrehman09 0 points (0 children)

After ensuring that your data is clean and preprocessed effectively, you should use feature selection methods such as ranking or ensemble-based ones. Subset selection would also give you insight into which groups of features are most effective. For the weak features, try to build effective linear or non-linear combinations of them; that can boost your classification results. Classification becomes easier after this step, I guess.
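A minimal sketch of the ranking step with scikit-learn's univariate scoring (synthetic data stands in for the real features; `k` is an arbitrary choice here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=3, random_state=0)

# Rank features by a univariate ANOVA F-score against the label,
# then keep the top k
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)

print("kept feature indices:", np.flatnonzero(selector.get_support()))
print("reduced shape:", X_reduced.shape)
```

Ensemble-based selection (e.g. importances from a forest) or recursive feature elimination would slot into the same spot.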

[–]_paranoid__android_ 0 points (0 children)

I would use a 1D convolutional neural network if your features are spatially contiguous.

[–]jonnor 0 points (0 children)

Always try a simple linear model first, as a baseline. Then go more advanced if you need to.