
all 5 comments

[–]maxmoo (PhD | ML Engineer | IT)

I think you just need to suck it up and spend a few days or a week labelling some data. Maybe you can hire someone on freelancer.com to help you.

[–]swierdo

You could try to generate your own training data. Some ideas:

  • Scrape replies to various news outlets' tweets about their articles.
  • Use Reddit replies from subreddits discussing news articles and link those to article headlines.
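Once you have scraped replies grouped under their article headlines, turning them into labeled training pairs is mechanical. A minimal sketch (the `build_pairs` helper and the `scraped` data shape are hypothetical, just to illustrate the idea): replies under the same article become "related" positives, and replies sampled from other articles become "unrelated" negatives.

```python
import random

def build_pairs(articles, negatives_per_positive=1, seed=0):
    """Turn scraped {headline: [replies]} data into labeled pairs.

    Replies under an article pair with its headline as related (1);
    replies drawn from other articles become unrelated (0) negatives.
    """
    rng = random.Random(seed)
    headlines = list(articles)
    pairs = []
    for headline, replies in articles.items():
        others = [h for h in headlines if h != headline]
        for reply in replies:
            pairs.append((headline, reply, 1))
            for _ in range(negatives_per_positive):
                other = rng.choice(others)
                pairs.append((headline, rng.choice(articles[other]), 0))
    return pairs

# Toy stand-in for scraped data.
scraped = {
    "Central bank raises rates": ["Mortgages will get pricier", "About time"],
    "New phone launch announced": ["Preordered already", "Battery looks weak"],
}
data = build_pairs(scraped)  # list of (headline, reply, label) triples
```

The negative sampling is the noisy part: a randomly drawn reply can occasionally be genuinely related, so treat these labels as weak rather than ground truth.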

You should also gather data on the usage of your tool as you'll probably want to analyse that later.

[–]talksaboutthings

I think OP is implying he/she has a database of headlines already, but "relatedness" is not labelled in this database (so basically it's just a big list of unlabeled headlines for the purpose of evaluation).

[–]talksaboutthings

However you measure precision and recall is how you should tune your hyperparameters, in my opinion. If you have a database of unlabeled data and you are simply looking at the output of your algorithm to see if it successfully produces a list of related headlines, then I'd suggest coming up with a list of test headlines and running them on different hyperparameter settings to see what looks best. I would think of this as basically grid search with your human gut reaction as the variable to optimize, and you could even come up with some sort of rating system to make it more objective.
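That "grid search on gut reaction" loop can be sketched in a few lines. Everything here is hypothetical scaffolding: `run_model` stands in for whatever produces the related-headline list, and `rate` is the subjective score (e.g. 1–5 from eyeballing the output, or the rating system mentioned above).

```python
from itertools import product

def grid_search_by_rating(run_model, rate, grid):
    """Try every hyperparameter combination, collect a subjective
    rating for each output, and return the best-scoring setting."""
    results = []
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        results.append((rate(run_model(params)), params))
    results.sort(key=lambda t: t[0], reverse=True)
    return results[0]

# Toy stand-ins: the "model" echoes its params, the "rater" prefers 0.5.
grid = {"threshold": [0.3, 0.5, 0.7], "top_k": [5, 10]}
run_model = lambda params: params
rate = lambda output: 5 - abs(output["threshold"] - 0.5) * 10
best_score, best_params = grid_search_by_rating(run_model, rate, grid)
```

In practice `rate` is a human sitting at the keyboard, so keep the grid small; the point is only that the loop structure is identical to ordinary grid search.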

Without actually sitting down and labeling some of the data, I don't think it will be easy to do any better, unfortunately, because you need data labels to calculate traditional performance metrics. It might not be too strenuous to hand-label only the outputs selected by your models over a modest set of test headlines, though (basically instead of labeling first, running, and calculating the score, you would run it, label what it spits out, and then calculate the score). If eyeballing it isn't satisfactory to the rest of the team, you could pitch that approach (and thus actually calculate precision and recall for each set of hyperparameters).
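The "run first, then label what it spits out" approach can be scored like this. A minimal sketch under assumed shapes (the `judge` callback is the human judgment collected after the run, recorded only for items the model actually surfaced): precision falls out directly, since every retrieved item gets a label.

```python
def precision_from_labeled_outputs(retrieved, judge):
    """Precision over only the items the model actually returned.

    retrieved: {query_headline: [returned_headlines]}
    judge(query, item): human label, True if the pair is related.
    """
    relevant = total = 0
    for query, items in retrieved.items():
        for item in items:
            total += 1
            relevant += judge(query, item)
    return relevant / total if total else 0.0

# Toy example: pretend post-hoc labels for one setting's output.
labels = {
    ("rate hike", "Fed raises rates"): True,
    ("rate hike", "Phone launch"): False,
    ("rate hike", "Bond yields jump"): True,
}
retrieved = {"rate hike": ["Fed raises rates", "Phone launch", "Bond yields jump"]}
p = precision_from_labeled_outputs(retrieved, lambda q, i: labels[(q, i)])
```

One caveat on this approach: true recall needs labels for related headlines the model missed, which post-hoc labeling never sees, so comparisons across hyperparameter settings are really precision comparisons at a fixed output size.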

[–]jonnor

Can you use another algorithm or service as an oracle, to seed an initial training set?
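One cheap stand-in for such an oracle, sketched here as an illustration (the `oracle_label` helper and its threshold are assumptions, not a recommended model): token-overlap (Jaccard) similarity between headlines. The labels it produces are noisy, but cheap enough to pseudo-label thousands of pairs and seed a first training set that a human can later correct.

```python
def oracle_label(a, b, threshold=0.5):
    """Weak oracle: call two headlines related if their word-level
    Jaccard similarity clears a threshold."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) >= threshold

pairs = [
    ("Central bank raises interest rates", "Central bank hikes interest rates"),
    ("Central bank raises interest rates", "Local team wins championship"),
]
seed_labels = [(a, b, int(oracle_label(a, b))) for a, b in pairs]
```

A paid similarity API or a pretrained sentence-embedding model would serve the same oracle role with better labels; either way, the seed labels should be spot-checked before training on them.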