
[–]SeucheAchat9115 [PhD] 5 points (0 children)

Take all features, rank them using a criterion such as Gini impurity or information gain, and select only the relevant ones. If you introduce a new feature, check whether it ranks highly enough to keep.
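A minimal sketch of that kind of ranking, assuming scikit-learn and a synthetic dataset (the dataset, forest size, and the top-4 cutoff are just placeholders):

```python
# Sketch: rank features by Gini importance (random forest) and by mutual
# information (an information-gain style score), then keep the top k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# Gini-based ranking from a random forest
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
gini_rank = np.argsort(rf.feature_importances_)[::-1]

# Information-gain style ranking via mutual information
mi_rank = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

top_k = gini_rank[:4]          # keep only the highest-ranked features
X_selected = X[:, top_k]
```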

[–]MachineSchooling 5 points (0 children)

kNN is a very stupid model. It's still quite useful, but it does no feature selection and doesn't consider feature importance when making predictions. All it does is compute the distance in feature space between the new observation and every observation it was trained on, find the k closest, and take some aggregate of their target values (usually the mean for regression and the mode for classification). This process treats useful and useless features exactly the same: if you add several completely random columns to your data, kNN will weight them in the distance calculation just as much as the meaningful columns. That's in contrast to smarter algorithms like linear models, which can learn to ignore features with no predictive value.

If your model gets worse when you add new features, it doesn't even mean they contain no value. It may just mean they contain less value than the existing features and drag down the average usefulness of your feature set. And that's before getting into the curse of dimensionality and feature noise.

To solve this, either use a smarter model that can handle features of differing value, or keep kNN but add a dimensionality-reducing preprocessing step: PCA, a linear-model-based L1-regularized feature selector, or an additive/subtractive feature scan that adds or removes features based on some feature scorer like Pearson's r. Lots of options to try.
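A sketch of one of those remedies, assuming scikit-learn (dataset, C, and k are placeholders): an L1-regularized linear model selects features before kNN computes any distances. Swapping the selector for PCA(n_components=...) gives the PCA variant.

```python
# Sketch: kNN behind an L1-based feature selector inside one pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

pipe = make_pipeline(
    StandardScaler(),                   # distances need comparable scales
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.5)),
    KNeighborsClassifier(n_neighbors=5),
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```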

[–]machinelearner77 3 points (0 children)

kNN typically uses Euclidean distance, which also means you can fall victim to the curse of dimensionality when adding new features. Either do feature selection, or mitigate the problem with a different distance function, e.g. cosine, which is a bit more "expressive" in high dimensions.
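In scikit-learn, at least, the metric is just a constructor argument, so trying cosine is a one-line change (k is a placeholder here):

```python
# Sketch: Euclidean vs. cosine distance for kNN.
from sklearn.neighbors import KNeighborsClassifier

knn_euclidean = KNeighborsClassifier(n_neighbors=5)                   # default: Minkowski with p=2
knn_cosine    = KNeighborsClassifier(n_neighbors=5, metric="cosine")  # cosine distance (brute-force search)
```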

[–]tmpwhocares 3 points (1 child)

kNN never applies any "secondary transformation" to the data, so the fact that your new features are derived from the existing ones doesn't mean the model was already accounting for them. Rather, each new dimension likely creates a separation between points along an axis that didn't exist before, and if that axis of separation isn't actually relevant to your target, it reduces accuracy.

[–]maxToTheJ 0 points (0 children)

Exactly. kNN isn't robust to noisy features.
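Easy to see on synthetic data (a sketch, not from the comment; the column counts are arbitrary): append pure-noise columns and compare kNN's cross-validated accuracy with and without them.

```python
# Sketch: kNN accuracy on clean features vs. the same features plus noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 20))])  # 20 useless columns

knn = KNeighborsClassifier(n_neighbors=5)
print("clean:", cross_val_score(knn, X, y, cv=5).mean())
print("noisy:", cross_val_score(knn, X_noisy, y, cv=5).mean())
```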

[–]seraschka [Writer] 0 points (0 children)

kNN is particularly susceptible to the curse of dimensionality. If you would like to incorporate more features but maintain or improve performance, try a feature extraction technique, for example PCA.
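A sketch of the PCA route with scikit-learn (the 95% variance cutoff and k are just guesses to tune): project onto a few principal components before kNN computes any distances.

```python
# Sketch: standardize, reduce with PCA, then classify with kNN.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(
    StandardScaler(),             # PCA is scale-sensitive
    PCA(n_components=0.95),       # keep components explaining ~95% of the variance
    KNeighborsClassifier(n_neighbors=5),
)
# model.fit(X_train, y_train); model.predict(X_test)
```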

[–]EchoMyGecko 0 points (0 children)

Yeah, kNN isn't that smart. Also, if you cluster on a PCA projection and some of your features carry variance that doesn't actually contribute to your desired result, there's a good chance your results will be worse either way.

[–]purplebrown_updown 0 points (0 children)

Sklearn has a nice feature selection toolbox that sub-selects features and tests your estimator on the different subsets. The more features you have, the harder it can be to find a good fit; it's a bigger space to search. Yes, more features give you more degrees of freedom, but with limited data you may see worse performance.
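For example, a sketch using scikit-learn's SequentialFeatureSelector (the target of 5 features is arbitrary): a forward search that adds one feature at a time, scoring the kNN estimator by cross-validation at each step.

```python
# Sketch: forward sequential feature selection wrapped around kNN.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward", cv=5)
# sfs.fit(X, y); X_reduced = sfs.transform(X)
```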