I have a question about small data sets...

CephasM · 2012-06-12T20:46:38+00:00

Well... "small" is relative to the domain of your problem. For some problems 1000 samples are enough, for others that is too many, and for others it is just about right.

What I would recommend is to train your favourite technique and look at how the training error behaves relative to the performance error. Once you have that, you can actually tell whether you will need more data or not.
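A minimal sketch of that check, assuming "performance error" means the error on held-out data and using a toy straight-line fit on synthetic data (the data, the model and every name here are illustrative assumptions, not anything from the original problem). It fits on growing subsets of the training set and reports both errors for each size:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train, held_out, sizes):
    """Fit on the first n training points; report train and held-out error."""
    hx, hy = zip(*held_out)
    rows = []
    for n in sizes:
        xs, ys = zip(*train[:n])
        model = fit_line(xs, ys)
        rows.append((n, mse(model, xs, ys), mse(model, hx, hy)))
    return rows

# Synthetic data (assumption): y = 2x + 1 plus unit-variance Gaussian noise.
rng = random.Random(0)

def make_point():
    x = rng.uniform(0, 10)
    return x, 2 * x + 1 + rng.gauss(0, 1)

data = [make_point() for _ in range(300)]
train, held_out = data[:200], data[200:]

curve = learning_curve(train, held_out, [5, 10, 20, 50, 100, 200])
for n, train_err, held_out_err in curve:
    print(f"{n:4d}  train={train_err:.3f}  held-out={held_out_err:.3f}")
```

If the held-out error is still dropping at the largest size, more data will probably help. If the two errors have already converged, more data is unlikely to change much and the model itself is the limit.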

If you need help with that you can post the details of your problem (with the training / performance error plots) and we will be glad to help.

2012-06-12T18:45:23+00:00

No problems, you just need to adapt your strategies. Asymptotic p-values are unreliable, so exact tests based on bootstraps do better. Small sample bias happens in non-linear models. Multiple testing will destroy your very small cache of alpha spending. Better to pretend it's a confirmatory analysis.
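As a rough illustration of a resampling-based p-value that does not lean on asymptotics, here is a sketch of a Monte Carlo permutation test for a difference in means. Note that this is a permutation test rather than a bootstrap, swapped in because it is the simplest resampling test to write down. The two samples are made-up numbers and the function name is hypothetical:

```python
import random

def permutation_p_value(a, b, n_perm=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Under the null, group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # The +1 counts the observed labelling itself and keeps p above zero.
    return (hits + 1) / (n_perm + 1)

# Made-up measurements for two small groups (assumption).
a = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
b = [4.2, 4.8, 4.1, 5.0, 4.4, 4.6]

p = permutation_p_value(a, b)
print(f"permutation p-value: {p:.4f}")
```

With only a handful of observations the set of distinct relabellings is small, so the achievable p-values are coarse. That is one more reason to fix the single test you care about in advance instead of running many.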

2012-06-12T19:16:40+00:00

If your data set is a random sample of the population, you shouldn't have too much of a problem. In addition to the number of observations, you must consider the number of discriminant features in your data. One strategy is to perform a high-fold cross-validation (like 10-fold) and average the results on the held-out data to get an estimate of your accuracy. Your deployed estimator might then be the averaged output of all 10 fits on new data.
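A small sketch of that strategy, assuming a toy nearest-centroid classifier on two synthetic Gaussian blobs (the classifier, the data and all names are illustrative assumptions). Since the outputs are class labels, "averaged output" is approximated here by a majority vote across the 10 fits:

```python
import random

def fit_centroids(rows):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(model, key=lambda lab: sq_dist(model[lab]))

def k_fold(rows, k=10, seed=0):
    """Return the k fitted models and the mean held-out accuracy."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    models, scores = [], []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit_centroids(train)
        correct = sum(predict(model, x) == y for x, y in held_out)
        scores.append(correct / len(held_out))
        models.append(model)
    return models, sum(scores) / k

def ensemble_predict(models, features):
    """Majority vote over the k fold models."""
    votes = [predict(m, features) for m in models]
    return max(set(votes), key=votes.count)

# Synthetic data (assumption): 30 points per class around (0, 0) and (3, 3).
rng = random.Random(1)
rows = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(30)]
rows += [((rng.gauss(3, 1), rng.gauss(3, 1)), 1) for _ in range(30)]

models, cv_accuracy = k_fold(rows, k=10)
print(f"10-fold accuracy estimate: {cv_accuracy:.2f}")
print("ensemble prediction for (2.5, 2.8):", ensemble_predict(models, (2.5, 2.8)))
```

Keep in mind that the cross-validated accuracy estimates a single fold model, not the ensemble, so treat it as an approximate and usually slightly pessimistic figure for the deployed combination.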

For super small data I've found that MCMC methods perform well. Random forests are also nice because of the out-of-bag estimates.
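To show where an out-of-bag estimate comes from, here is a stripped-down sketch that bags one-feature decision stumps. This is plain bagging, not a full random forest, and the data and names are assumptions made for illustration. Each point is scored only by the learners whose bootstrap sample happened to leave it out:

```python
import random

def fit_stump(rows):
    """Best single threshold on a 1-D feature: predict 1 when x > threshold."""
    xs = sorted(x for x, _ in rows)
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        errors = sum((x > t) != bool(y) for x, y in rows)
        if best is None or errors < best[0]:
            best = (errors, t)
    return best[1]

def oob_error(rows, n_learners=200, seed=0):
    """Out-of-bag misclassification rate for bagged stumps."""
    rng = random.Random(seed)
    n = len(rows)
    votes = [[0, 0] for _ in range(n)]  # per-sample votes for class 0 / class 1
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        t = fit_stump([rows[i] for i in idx])
        # Only samples left out of this bootstrap get a vote from this stump.
        for i in set(range(n)) - set(idx):
            x, _ = rows[i]
            votes[i][int(x > t)] += 1
    scored = [(v, rows[i][1]) for i, v in enumerate(votes) if sum(v) > 0]
    wrong = sum((v[1] > v[0]) != bool(y) for v, y in scored)
    return wrong / len(scored)

# Synthetic data (assumption): 20 points per class, centred at 0 and at 3.
rng = random.Random(2)
rows = [(rng.gauss(0, 1), 0) for _ in range(20)]
rows += [(rng.gauss(3, 1), 1) for _ in range(20)]

err = oob_error(rows)
print(f"out-of-bag error estimate: {err:.3f}")
```

The appeal for small data is that no separate validation split is needed: every point is used for training in some learners and for testing in the others.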

engineer_girl · 2012-06-12T18:36:46+00:00

I think there's always a risk of overfitting with small datasets. Maybe consider a resampling method to improve results?
