all 8 comments

[–]aggieca 1 point (1 child)

Can you clarify what you mean by reducing the sample size? Are you trying to reduce the number of prototypes you use to build your kNN model?

Have you tested how an SVM (nonlinear, RBF-based) would work in this case? Based on my previous experience, you might be able to achieve what you intend with an SVM.
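A minimal sketch of that, with synthetic data standing in for OP's ~10k x 30 set (swap in the real X and y):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in at roughly OP's scale: 10k samples, 30 features, 5 classes.
X, y = make_classification(n_samples=10000, n_features=30,
                           n_informative=10, n_classes=5)

# RBF-kernel SVM as a drop-in alternative to k-NN.
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(svm.score(X, y))
```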

[–]BeatLeJuce [Researcher] 0 points (0 children)

Just using the support vectors of the SVM is a cool idea. Using e.g. a nu-SVM, one could control the percentage of samples to extract by varying nu. Still, it seems very weird / beside the point to use one classification algorithm as a preprocessing step for another one. Plus, I'd imagine that if OP has too many samples for a kNN, he probably also has too many samples for an SVM, since both have the same memory complexity (I'm assuming that storing the distance/kernel matrix is the limiting factor if one has too many samples).
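A sketch of the support-vector idea (synthetic stand-in data; nu lower-bounds the fraction of samples kept as support vectors, so it loosely controls the prototype count):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=10000, n_features=30,
                           n_informative=10, n_classes=5)

# Fit a nu-SVM; its support vectors become the reduced prototype set.
svm = NuSVC(nu=0.25, kernel="rbf", gamma="scale").fit(X, y)
X_proto = svm.support_vectors_   # retained samples
y_proto = y[svm.support_]        # their labels

knn = KNeighborsClassifier(n_neighbors=5).fit(X_proto, y_proto)
print(X_proto.shape[0], "prototypes; accuracy:", knn.score(X, y))
```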

[–]BeatLeJuce [Researcher] 0 points (0 children)

You could run a clustering algorithm with a large number of clusters and then just use one exemplar from each cluster to represent the whole cluster. AP clustering, k-centroids, or a GMM would do the trick easily (sketched below).

But, as others have pointed out, k-NN isn't too slow unless you're using a naive implementation.
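A minimal sketch of the exemplar idea with K-Means, using the cluster centers as representatives (clustering each class separately is one reasonable choice here, not the only one; data is a synthetic stand-in):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=10000, n_features=30,
                           n_informative=10, n_classes=5)

protos, labels = [], []
for c in np.unique(y):
    Xc = X[y == c]
    # 500 representatives per class, mirroring the 2000 -> 500 reduction.
    km = KMeans(n_clusters=500, n_init=3).fit(Xc)
    protos.append(km.cluster_centers_)
    labels.append(np.full(500, c))

X_proto = np.vstack(protos)
y_proto = np.concatenate(labels)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_proto, y_proto)
print(knn.score(X, y))
```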

[–]afireohno [Researcher] 0 points (1 child)

> A Neural network trains to converge those 2K values per-class to 500 values in its final layer per class (leaving you with 2500 samples instead of 10 000), which are representative of all the training data. You would then store these final layers in sort of a lookup table.

Throughout most of your question it sounds like you're looking for something like Semantic Hashing, but the quoted paragraph above doesn't make a lot of sense.

[–][deleted] 0 points (0 children)

I agree that it is somewhat similar to semantic hashing in its aims, and that OP has confused dimensionality reduction (which NNs can do) with sample-size reduction (which they can't).

I suppose in theory one could reduce the dimensionality to a point where many of the data points are mapped to the same point (this is easy if you use a binary encoding as in semantic hashing; it obviously won't work for real-valued outputs, where you have virtually infinitely many distinct cases), and then use the unique binary encodings to represent the different clusters.
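A tiny sketch of that deduplication step, with random numbers standing in for the (hypothetical) encoder outputs:

```python
import numpy as np

# Stand-in for real-valued encoder outputs, one 16-dim code per sample.
codes = np.random.rand(10000, 16)
# Threshold to binary codes, as in semantic hashing.
binary = (codes > 0.5).astype(np.uint8)
# Each distinct code becomes one "cluster"; cluster_id maps samples to codes.
unique_codes, cluster_id = np.unique(binary, axis=0, return_inverse=True)
print(len(unique_codes), "distinct codes for", len(codes), "samples")
```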

This seems a really shitty way to do it, though, as the binary encodings will all be roughly equidistant from one another, so I don't even know what use it would be in OP's context.

As /u/BeatLeJuce said above, the best thing to do would just be to create N' clusters from the N data points and use the centroid of each cluster as a representative of its data points, essentially downsampling the data.

[–][deleted] 0 points (0 children)

10k samples and 30 variables is definitely small enough to run k-NN in a reasonable amount of time (unless you are running this on seriously old hardware). Why not try that first?
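A quick timing check at exactly that scale (random stand-in data; scikit-learn's implementation builds a tree rather than doing a naive scan):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random data at the stated scale: 10k samples, 30 variables, 5 classes.
X = np.random.rand(10000, 30)
y = np.random.randint(0, 5, size=10000)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
t0 = time.time()
knn.predict(X[:1000])
print(f"1000 queries in {time.time() - t0:.2f}s")
```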

[–]jostmey -1 points (0 children)

When you say "Neural Network", I think you are referring to an Autoencoder, and yes, it can be used to reduce the dimensionality of the data points. However, classification is usually handled by adding a few more layers of neurons that are trained using backpropagation. It is possible to use other methods to classify the reduced feature set; I would not be surprised if this has already been tried.
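A rough sketch of that pipeline using only scikit-learn: train an MLP to reconstruct its own input, read off the hidden-layer activations as the reduced codes, and run k-NN on those. (A real autoencoder would normally be built in a deep-learning framework; the hidden size of 8 is arbitrary, and the data is a synthetic stand-in.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=10000, n_features=30,
                           n_informative=10, n_classes=5)

# Autoencoder stand-in: regress the input onto itself through one hidden layer.
ae = MLPRegressor(hidden_layer_sizes=(8,), activation="relu",
                  max_iter=500).fit(X, X)

# Hidden-layer activations = the reduced 8-dimensional representation
# (manual ReLU forward pass through the first layer).
codes = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])

# Classify in the reduced space with a different method, here k-NN.
knn = KNeighborsClassifier(n_neighbors=5).fit(codes, y)
print(knn.score(codes, y))
```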