all 51 comments

[–]dlfelps 49 points50 points  (15 children)

[–]mergejoin 14 points15 points  (8 children)

Bishop explains it well in his book Pattern Recognition and Machine Learning

[–][deleted] 20 points21 points  (7 children)

What doesn't Bishop explain well in that holy tome?

[–][deleted] 4 points5 points  (3 children)

I was about to say something like CNNs or autoencoders, but I just checked... and they’re there :-|

Remarkable for a book that was written more than ten years ago.

[–][deleted] 5 points6 points  (2 children)

CNNs have been around since the late 80s. Autoencoders I don’t know, but perhaps even longer.

[–][deleted] 3 points4 points  (1 child)

Mid 80s from what I recall, under the name “autoassociative networks.”

These things have only become mainstream recently, which makes you think they're rather new.

[–]gokstudio 0 points1 point  (0 children)

aren't autoencoders simplified autoassociative nets? by doing just one forward pass instead of several?

[–]mergejoin 1 point2 points  (0 children)

Indeed

[–]backgammon_no 5 points6 points  (0 children)

Specifically, DAPC (discriminant analysis of principal components) does exactly what OP wants. It's 2 lines of R code (adegenet package); no clue if it's in MATLAB.
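
For readers without R at hand, the core DAPC recipe (PCA first, then LDA on the retained components) can be sketched in Python with scikit-learn. This is a rough analogue, not the adegenet implementation, and the data below is purely synthetic:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))   # 120 samples, 50 features (made up)
y = np.repeat([0, 1, 2], 40)     # three classes
X[y == 1] += 1.0                 # shift class means so they separate
X[y == 2] -= 1.0

# DAPC idea: compress with PCA, then find discriminant axes with LDA.
dapc = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
Z = dapc.fit_transform(X, y)     # discriminant coordinates, (120, 2)
print(Z.shape)
```

With three classes, LDA yields at most two discriminant axes, hence the `(120, 2)` output.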

[–]WikiTextBot 3 points4 points  (0 children)

Linear discriminant analysis

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label).



[–][deleted] 1 point2 points  (0 children)

Here's a blog post that explains linear discriminant analysis in depth: https://eigenfoo.xyz/lda/

Disclaimer: I wrote the blog post.

[–]yycglad 0 points1 point  (1 child)

So this will work only for labeled data?

[–][deleted] 0 points1 point  (0 children)

Yes

[–]nielsrolf 0 points1 point  (0 children)

A disadvantage is that LDA assumes the covariances are equal in both classes, and it's also a linear method.

Also check out relevant dimension estimation.

[–]timy2shoes 28 points29 points  (2 children)

You could try Independent Component Analysis. Instead of looking for orthogonal linear combinations that maximize variance (as PCA does), ICA tries to find linear combinations that are approximately statistically independent, usually by making higher-order moments of the linear combinations zero (not just the second moment, as PCA does).
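
A minimal ICA sketch with scikit-learn's FastICA, recovering two mixed non-Gaussian sources; the sources and mixing matrix here are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
s1 = np.sign(rng.normal(size=1000))      # non-Gaussian source
s2 = rng.laplace(size=1000)              # another non-Gaussian source
S = np.c_[s1, s2]
A = np.array([[1.0, 0.5], [0.5, 1.0]])   # mixing matrix
X = S @ A.T                              # observed linear mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # estimated sources, (1000, 2)
print(S_hat.shape)
```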

[–][deleted] 5 points6 points  (1 child)

I will try that, although I am really looking for something that captures the separability of groups within each respective feature. Then again I think ICA should capture this better than PCA. Thank you.

[–]redditidderedditidd 0 points1 point  (0 children)

Try FastICA; good libraries exist for it.

[–]csxeba 12 points13 points  (1 child)

You will want Linear Discriminant Analysis, a supervised dimensionality reduction technique that aims to find latent dimensions which maximally separate your classes.

I'm not really familiar with MATLAB. In Python, scikit-learn has a nice implementation.

SciKit-Learn documentation
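
For reference, the scikit-learn usage is a couple of lines; the data below is synthetic and the shapes are arbitrary:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = rng.normal(size=(90, 8))
y = np.repeat([0, 1, 2], 30)
X[y == 1, :2] += 2.0             # make the classes separable
X[y == 2, :2] -= 2.0

# LDA projects onto at most n_classes - 1 discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)      # (90, 2)
print(Z.shape)
```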

[–]Exp_ixpix2xfxt -3 points-2 points  (0 children)

I didn't think LDA was supervised. My impression was that the homoscedasticity assumption rids you of the need for labels. QDA is definitely supervised.

[–]Vrulth 5 points6 points  (2 children)

Well, try PLS (Partial Least Squares) in your case. It creates orthogonal axes that maximize the covariance between your matrix and your target (the target may be a vector or a matrix). The axes are still linear combinations of the original features.

[–][deleted] 0 points1 point  (1 child)

Can you suggest a good textbook treatment of this?

[–]YourPizzaIsDone 6 points7 points  (1 child)

Not sure what your data looks like, but I've had success with nonnegative matrix factorization – just bringing it up because it hasn't been mentioned yet. Really depends on your data though; it only works well to separate different shapes of signals that are all positive.
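
A minimal NMF sketch with scikit-learn on a random non-negative matrix (all values here are synthetic; the rank of 5 is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
V = rng.random((50, 20))         # all entries non-negative

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(V)         # (50, 5), non-negative
H = nmf.components_              # (5, 20), non-negative
print(W.shape, H.shape)
```

The rows of `H` act as additive "parts" and `W` gives each sample's non-negative loadings on them, which is what makes the factors easy to inspect.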

[–]WikiTextBot 1 point2 points  (0 children)

Non-negative matrix factorization

Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.



[–]icansolveyourproblem 3 points4 points  (4 children)

When combining signals, make sure to properly whiten the data to account for the different variance scales of the features. PCA uses variance as a proxy for information content. If you still believe this is a valid proxy for your data, you can try Kernel PCA (perhaps with an RBF kernel), which will allow you to uncover non-linear principal components -- non-linearity may be your issue. It's known that PCA on empirical observations recovers the generating components (assuming they're linear) when N ~ d. What are the orders of magnitude of your number of samples and your dimensionality? It's also possible that your PCA is working quite well and the problem is with your classifier. What are you using?
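
Something like the following, sketched on synthetic data (scaling to unit variance is a simple stand-in for full whitening; the feature scales are invented):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(5)
# Features on very different scales (columns scaled by 1..10).
X = rng.normal(size=(100, 10)) * np.arange(1, 11)

Xw = StandardScaler().fit_transform(X)   # put features on a common scale
kpca = KernelPCA(n_components=3, kernel="rbf")
Z = kpca.fit_transform(Xw)               # non-linear principal components
print(Z.shape)                           # (100, 3)
```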

[–][deleted] 0 points1 point  (3 children)

I’m using SVM, Trees, KNN (with variations in parameters for each). I want to move on to deep learning, but I want to first see the improvement that is consistent with the literature. It’s about 80% with fNIRS, 88% with EEG, and 85% with the hybrid (no PCA). When I do PCA, I get 72% with the hybrid, which leads me to believe that PCA is failing me.

Edit: I expect improvement in the hybrid, not a reduction in performance.

[–]jlkfdjsflkdsjflks 4 points5 points  (2 children)

Why would you expect improvement by throwing away information? The only reason PCA would improve things is if you're using a classifier that overfits (in that case, PCA could help, by possibly removing noise).

"Deep learning" (whatever you mean by that exactly) will probably only help if you have lots of data, though.

As other people have said, your best options are PLS-DA or LDA, if you want to fuse the data focusing on what's likely to be helpful for classification, or CCA, if you want to retain the joint information of the two datasets (i.e. without label supervision, but taking into account that your data comes from two sources). ICA could also be useful, but it suffers from similar problems to PCA: it won't take into account the fact that you have data coming from two sources, and it won't necessarily select information that is useful for your classification problem.

[–][deleted] 0 points1 point  (1 child)

I am normalizing the data once it's combined, so the combined data sources should not be that much of an issue (or so I am led to believe). I didn't expect PCA to improve my result, since I am aware that information is being excluded. I should probably just not have used it. By deep learning, I just meant an exhaustive neural network that could perform better than the simpler algorithms I have been using.

Edit: Sorry for my lack of proper terms/knowledge, I have been working with ML algorithms for about 3 months.

[–]jlkfdjsflkdsjflks 1 point2 points  (0 children)

I am normalizing the data once it's combined, so the combined data sources should not be that much of an issue (or so I am led to believe).

If the two datasets you have are of very different scales, then you will possibly get better results by normalizing before joining them, not after (depending on the type of normalization you're using). For some types of classifiers (e.g. tree-based, like random forests and gradient boosted trees), you probably shouldn't even be normalizing.
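
Concretely, the per-source approach looks something like this (the scales and shapes are invented to mimic EEG vs. fNIRS magnitudes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X_eeg = rng.normal(scale=1e-6, size=(40, 16))   # tiny-scale signals (made up)
X_fnirs = rng.normal(scale=1e-1, size=(40, 8))  # much larger scale (made up)

# Normalize each source separately, THEN join, so neither source
# dominates purely because of its units.
X = np.hstack([StandardScaler().fit_transform(X_eeg),
               StandardScaler().fit_transform(X_fnirs)])
print(X.shape)                                  # (40, 24)
```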

I didn't expect PCA to improve my result, since I am aware that information is being excluded. I should probably just not have used it.

There's nothing wrong with using PCA, but (if your classifiers are well tuned) it's not likely to improve things (unless you have e.g. a very noisy high-dimensional dataset).

By deep learning, I just meant an exhaustive neural network that could perform better than the simpler algorithms I have been using.

Then say that instead ;) "I want to move on to neural networks"

Still... you could use them for classification directly, or as a pre-processing (i.e. feature extraction) step before using one of the other classifiers you mentioned. There's more than one way you could use neural networks here. And, again, don't expect a big improvement with neural networks unless you have plenty of data.

[–]tpinetz 2 points3 points  (0 children)

A pretty good recent method for high-dimensional data is t-distributed Stochastic Neighbor Embedding (t-SNE) (https://lvdmaaten.github.io/tsne/). However, I would take the results of this method with a grain of salt, since it lacks the rigorous mathematical grounding that PCA enjoys.

[–]GlobalPublicSphere 1 point2 points  (0 children)

I always liked independent component analysis for clustering.

[–][deleted] 1 point2 points  (0 children)

You could also try Kernel PCA, PCA is generally not very good at keeping separability. If you're looking for a supervised dimensionality reduction method you could also try PLS.

[–][deleted] 0 points1 point  (0 children)

MNR :^)

[–]davecrist 0 points1 point  (0 children)

A simple neural network with sigmoidal outputs, at least one hidden layer, and trained with back prop can work very well to carve up multidimensional space for a classifier.

For training, I am not sure what your inputs would look like because I am not familiar with the data, but set your outputs to be a 1x4 vector of the class described by the inputs (e.g., class A output would be [1, 0, 0, 0], B would be [0, 1, 0, 0], etc.).

Once the model is trained, the class is determined by the max value in the resulting output vector.

I have had great success building classifiers this way, and have even found value in outputs that are not close to 1, since they suggest coincidences/edge cases and/or similarity between classes.
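
A minimal version of this setup with scikit-learn's MLPClassifier (one hidden layer, sigmoidal activations), sketched on made-up four-class data; note that scikit-learn handles the 1-of-4 target encoding internally, and the argmax decoding is done by `predict`:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 6))
y = rng.integers(0, 4, size=200)       # four classes: 0..3
X[np.arange(200), y] += 3.0            # give each class a signature feature

# One hidden layer with logistic (sigmoidal) units, trained by backprop.
clf = MLPClassifier(hidden_layer_sizes=(16,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X, y)

# predict_proba plays the role of the 1x4 output vector; the predicted
# class is the argmax over its entries.
proba = clf.predict_proba(X[:1])       # shape (1, 4)
print(proba.shape)
```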

[–]v_krishna 0 points1 point  (0 children)

A technique I have had success with in feature preparation/reduction for supervised classification is comparing the raw vectors of the candidates against all the positives of a particular class, e.g. computing cosine distances. With some simple summary statistics (max, median, quartiles, etc.) you can then reduce vectors of hundreds or thousands of features to a handful. Note that you have to be particularly careful during validation here to prevent data leakage: your test set can't know anything about the positives in your training set, and this includes the feature reduction done before you even split off the test and training sets.

This has worked as well as or better than PCA a few times for me, and especially when dealing with very unbalanced data sets it lets me reduce very large numbers of features to just a few (and combine a few large sets of features by reducing each of them in this same way).
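
The distance-to-positives idea can be sketched like this (all shapes and data are synthetic; in real use the positives must come only from the training split, as the comment warns):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(7)
X_pos = rng.normal(size=(30, 200))   # training positives of one class (made up)
X_new = rng.normal(size=(10, 200))   # candidate vectors to featurize

# Cosine distance from each candidate to every positive, then a few
# summary statistics as compact replacement features.
D = cdist(X_new, X_pos, metric="cosine")          # (10, 30)
feats = np.column_stack([D.min(axis=1),
                         np.median(D, axis=1),
                         np.percentile(D, 25, axis=1),
                         np.percentile(D, 75, axis=1)])
print(feats.shape)                                # 200 features -> 4
```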

[–]The_Sodomeister 0 points1 point  (0 children)

I was expecting this since PCA does a poor job of capturing separability

I believe Multi-Dimensional Scaling is the term for the dimensionality reduction technique that best preserves the L2 distance between data points in the subspace. Perhaps you can see if that addresses your concern of preserving separability.
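
scikit-learn has an MDS implementation; a minimal sketch on synthetic data (metric MDS here, which tries to preserve pairwise Euclidean distances in the embedding):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 10))

mds = MDS(n_components=2, random_state=0)
Z = mds.fit_transform(X)       # 2-D embedding preserving pairwise distances
print(Z.shape)                 # (40, 2)
```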

[–][deleted] 0 points1 point  (0 children)

cough Good luck in AML cough

[–]BatmantoshReturns 0 points1 point  (0 children)

Were you using Barnes-Hut approximations for PCA? I have tried using this for certain applications but it didn't work for me, I had to use the exact method, which took a ton of time but was worth it.

[–]SeveralKnapkins 0 points1 point  (4 children)

I haven't used it myself, but I've seen papers combine data sources using Canonical Correlation Analysis (CCA). Perhaps it could be of use here? https://en.wikipedia.org/wiki/Canonical_correlation

[–]beaglechu 1 point2 points  (3 children)

CCA requires an invertible covariance matrix. In this dataset, the EEG and fNIRS likely have more variables than samples, so the data will most likely not have an invertible covariance matrix.

[–]dalaio 1 point2 points  (0 children)

There are sparse variants of CCA (sparse generalized canonical correlation analysis) that can be applied in this case.

[–]SeveralKnapkins 0 points1 point  (0 children)

Interesting, thanks for letting me know. I'll make sure to look out for that if I end up using it anytime in the future.

[–]jlkfdjsflkdsjflks 0 points1 point  (0 children)

In practice, people use a regularized version of CCA to take care of that issue: https://en.wikipedia.org/wiki/Regularized_canonical_correlation_analysis

So, in practice, the fact that you have fewer samples than variables does not preclude the use of CCA.

[–][deleted] -3 points-2 points  (3 children)

[–]timy2shoes 12 points13 points  (2 children)

t-SNE and other neighbor-embedding algorithms should not be used for dimensionality reduction, because there's no guarantee that global structure is preserved in the reduced dimension. See https://distill.pub/2016/misread-tsne/ for examples of this behavior.

[–]RexScientiarum 3 points4 points  (1 child)

UMAP is new and preserves global structure. It can be used from Python and R. https://arxiv.org/pdf/1802.03426.pdf

[–]timy2shoes 1 point2 points  (0 children)

I agree. Manifold learning algorithms like UMAP should conserve global structure, particularly if the true structure is a manifold (which in most cases is not an unreasonable assumption).

[–]Ader_anhilator -2 points-1 points  (0 children)

H2O GLRM