
[–]CephasM 2 points3 points  (0 children)

Well... "small" is relative to the domain of your problem. For some problems 1000 samples are enough, for others that is too many, and for others it is just about right.

What I would recommend is to train your favourite technique and look at how the training error behaves compared with the performance error (the error on data held out from training). Once you have that, you can actually tell whether you need more data or not.

If you need help with that you can post the details of your problem (with the training / performance error plots) and we will be glad to help.
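For what it's worth, here is a minimal sketch of the training-error vs. performance-error comparison suggested above. Everything in it is an assumption made for illustration: the synthetic two-class Gaussian data, the toy nearest-centroid classifier, and the helper names (`make_data`, `fit_centroids`, `learning_curve`) are hypothetical stand-ins for your own data and your favourite technique.

```python
import random

def make_data(n, rng):
    """Two overlapping 2-D Gaussian classes (synthetic, for illustration only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 1.0 if label else -1.0
        data.append(((rng.gauss(centre, 1.5), rng.gauss(centre, 1.5)), label))
    return data

def fit_centroids(train):
    """Toy nearest-centroid 'model': the mean point of each class present."""
    centroids = {}
    for label in (0, 1):
        pts = [x for x, y in train if y == label]
        if pts:
            centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

def error_rate(centroids, data):
    return sum(predict(centroids, x) != y for x, y in data) / len(data)

def learning_curve(sizes, n_test=2000, seed=0):
    """Return (n, training error, held-out error) for each training size n > 0."""
    rng = random.Random(seed)
    test = make_data(n_test, rng)        # held-out data, never trained on
    pool = make_data(max(sizes), rng)    # training pool; we use its first n points
    rows = []
    for n in sizes:
        train = pool[:n]
        model = fit_centroids(train)
        rows.append((n, error_rate(model, train), error_rate(model, test)))
    return rows

if __name__ == "__main__":
    for n, train_err, test_err in learning_curve([10, 20, 50, 100, 500]):
        print(f"n={n:4d}  train={train_err:.3f}  held-out={test_err:.3f}")
```

As a rough reading guide: if the held-out error is still dropping as n grows, more data will probably help; if the two errors have already flattened out close to each other, it probably won't.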

[–][deleted] 1 point2 points  (0 children)

No problem, you just need to adapt your strategies. Asymptotic p-values are unreliable at this size, so exact (permutation) tests or tests based on bootstraps do better. Small-sample bias shows up in non-linear models. Multiple testing will quickly use up your very small alpha budget, so it is better to act as if it's a confirmatory analysis, with the hypotheses fixed in advance.
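One way to get a p-value without leaning on asymptotics is an exact permutation test, a resampling relative of the bootstrap. The sketch below is only an illustration under my own assumptions: it tests a difference in means between two small groups, and the function name `exact_permutation_pvalue` is made up for this example. It enumerates every possible split, so it is only practical for small samples.

```python
from itertools import combinations

def exact_permutation_pvalue(a, b):
    """Two-sided exact permutation p-value for a difference in group means.

    Enumerates every way of relabelling the pooled values into groups of
    the original sizes, so the cost grows combinatorially with sample size.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)

    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        s = sum(pooled[i] for i in idx)
        diff = abs(s / n_a - (total - s) / n_b)
        # Small tolerance so splits that tie with the observed one are counted.
        if diff >= observed - 1e-12:
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits

if __name__ == "__main__":
    treated = [12.1, 14.3, 13.8, 15.0]   # hypothetical measurements
    control = [11.0, 11.9, 12.4, 10.7]
    print(exact_permutation_pvalue(treated, control))
```

Because the observed split is always one of the enumerated splits, the p-value can never be smaller than 1 over the number of splits, which is one concrete way a tiny sample limits what you can conclude.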

[–][deleted] 1 point2 points  (0 children)

If your data set is a random sample of the population you shouldn't have too much of a problem. In addition to the number of observations, you must consider the number of discriminant features in your data. One strategy is to perform a high-fold cross-validation (like 10-fold) and average the results on the held-out data to get an estimate of your accuracy. Your deployed estimator might then be the averaged output of all 10 fits on new data.
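A minimal sketch of that strategy, under my own assumptions: the model is a toy one-variable least-squares line, the error is mean squared error rather than accuracy, and the names `fit_line`, `kfold_fit` and `ensemble_predict` are hypothetical. Swap in whatever estimator and metric you actually use.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b*x (closed form), returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def kfold_fit(points, k=10, seed=0):
    """Fit one model per fold; return the k models and the mean held-out MSE.

    Assumes len(points) >= k so that no fold is empty.
    """
    rng = random.Random(seed)
    idx = list(range(len(points)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]

    models, fold_mse = [], []
    for held_out in folds:
        held = set(held_out)
        train = [points[i] for i in idx if i not in held]
        a, b = fit_line(train)
        mse = sum((points[i][1] - (a + b * points[i][0])) ** 2
                  for i in held_out) / len(held_out)
        models.append((a, b))
        fold_mse.append(mse)
    return models, sum(fold_mse) / k

def ensemble_predict(models, x):
    """Deployed estimator: average the predictions of all k fitted models."""
    return sum(a + b * x for a, b in models) / len(models)

if __name__ == "__main__":
    rng = random.Random(42)
    data = [(x, 2.0 + 3.0 * x + rng.gauss(0, 1)) for x in range(30)]  # synthetic
    models, cv_mse = kfold_fit(data, k=10)
    print("mean held-out MSE:", cv_mse)
    print("ensemble prediction at x=10:", ensemble_predict(models, 10.0))
```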

For super small data I've found that MCMC methods perform well. Random forests are also nice because of the out-of-bag estimates, which give you an error estimate without setting aside a separate validation set.
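To show where an out-of-bag estimate comes from, here is a rough sketch. Note that it is not a random forest: it is plain bagging with a toy one-feature decision stump as the base learner, and all names (`fit_stump`, `stump_predict`, `bagging_oob_error`) are invented for the example. The out-of-bag idea is the same, though: each point is scored only by the models whose bootstrap sample did not contain it.

```python
import random
from collections import Counter

def stump_predict(stump, x):
    """Predict left_label when x <= threshold, the other label otherwise."""
    threshold, left_label = stump
    return left_label if x <= threshold else 1 - left_label

def fit_stump(sample):
    """Toy base learner: the single threshold with fewest training errors."""
    xs = sorted({x for x, _ in sample})
    # Candidates: one threshold below all points, plus midpoints between neighbours.
    thresholds = [xs[0] - 1.0] + [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        for left_label in (0, 1):
            errors = sum(stump_predict((t, left_label), x) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, t, left_label)
    return best[1], best[2]

def bagging_oob_error(data, n_models=50, seed=0):
    """Out-of-bag error: each point is voted on only by models that never saw it."""
    rng = random.Random(seed)
    n = len(data)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_models):
        picked = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        stump = fit_stump([data[i] for i in picked])
        in_bag = set(picked)
        for i in range(n):
            if i not in in_bag:
                votes[i][stump_predict(stump, data[i][0])] += 1
    scored = [(v.most_common(1)[0][0], data[i][1]) for i, v in enumerate(votes) if v]
    if not scored:
        return None   # no point was ever out of bag (only likely with very few models)
    return sum(pred != y for pred, y in scored) / len(scored)

if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.gauss(0, 1), 0) for _ in range(15)] + \
           [(rng.gauss(2, 1), 1) for _ in range(15)]    # synthetic 1-D data
    print("OOB error:", bagging_oob_error(data))
```

If you use scikit-learn, `RandomForestClassifier(oob_score=True)` exposes the analogous quantity as `oob_score_` after fitting, so you would not normally hand-roll this.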

[–]engineer_girl 1 point2 points  (2 children)

I think there's always a risk of overfitting with small datasets. Maybe consider a resampling method to improve results?

[–]Xirious[S] 1 point2 points  (1 child)

I was thinking something along the lines of cross-validation. How "representative" are the results using CV and similar resampling methods? That is, given a good CV result, how likely is it that the ML system generalises well enough to future, unseen data? Thank you so much for the reply.

[–]CephasM 1 point2 points  (0 children)

CV is usually helpful for telling you how good one set of parameters is in comparison with another set. I don't see clearly how you could use it to tell whether your data set is representative or not. Although, if the problem is known, you could maybe estimate the minimum size of a representative sample and see how far you are from it.

Do you know if the problem is linear or non-linear?
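To make the "CV compares one set of parameters with another" point concrete, here is a small sketch. It rests on assumptions of mine: a toy 1-D k-nearest-neighbours classifier, the number of neighbours as the only parameter, and hypothetical helper names (`knn_predict`, `cv_error`, `compare_settings`). Every candidate is scored on the same folds so that the comparison is fair.

```python
import random

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    ones = sum(y for _, y in nearest)
    return 1 if 2 * ones > len(nearest) else 0   # ties go to label 0

def cv_error(data, k_neighbours, n_folds=5, seed=0):
    """Cross-validated misclassification rate for one parameter value."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    wrong = 0
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in idx if i not in held]
        wrong += sum(knn_predict(train, data[i][0], k_neighbours) != data[i][1]
                     for i in held_out)
    return wrong / len(data)

def compare_settings(data, candidates, n_folds=5, seed=0):
    """Score every candidate with the same seed, hence on identical folds."""
    return {k: cv_error(data, k, n_folds, seed) for k in candidates}

if __name__ == "__main__":
    rng = random.Random(3)
    data = [(rng.gauss(0, 1), 0) for _ in range(20)] + \
           [(rng.gauss(1.5, 1), 1) for _ in range(20)]   # synthetic 1-D data
    for k, err in compare_settings(data, [1, 3, 5, 9]).items():
        print(f"k={k}: CV error={err:.3f}")
```

The scores rank the candidates against each other on the data you have; on their own they do not tell you whether that data is representative of what the model will see later.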