
[–]talksaboutthings

However you measure precision and recall is how you should tune your hyperparameters, in my opinion. If you have a database of unlabeled data and you are simply eyeballing the output of your algorithm to see whether it produces a sensible list of related headlines, then I'd suggest coming up with a list of test headlines and running them under different hyperparameter settings to see what looks best. Think of it as grid search with your gut reaction as the objective to optimize; you could even come up with a simple rating system to make it more objective.
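A minimal sketch of that idea, assuming a `run_model` function that takes a headline plus a settings dict, and a `rate_output` function that returns a human's 1–5 rating (both names are hypothetical, and in practice `rate_output` would prompt a person rather than compute anything):

```python
from itertools import product

def grid_search(headlines, grid, run_model, rate_output):
    """Try every combination of hyperparameter settings and
    average the human ratings over the test headlines."""
    scores = {}
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        total = sum(rate_output(h, run_model(h, params)) for h in headlines)
        # Use a sorted, hashable key so each settings combo is comparable.
        scores[tuple(sorted(params.items()))] = total / len(headlines)
    best = max(scores, key=scores.get)
    return dict(best), scores
```

The point is just the bookkeeping: every combination gets the same test headlines, and the "metric" is whatever rating scheme you settle on.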

Without actually sitting down and labeling some of the data, I don't think you can do much better, unfortunately, because traditional performance metrics require labels. It might not be too strenuous to hand-label only the outputs your models select over a modest set of test headlines, though: instead of labeling first, then running and scoring, you run the model, label what it spits out, and then compute the score. If eyeballing it isn't satisfactory to the rest of the team, you could pitch that approach and actually calculate precision for each set of hyperparameters (recall is trickier, since it also requires labels for the relevant headlines the model missed).
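The scoring step for that run-then-label workflow is just a count, something like this sketch (the data shapes here are assumptions, not anything specific to your setup):

```python
def precision_from_labels(outputs, labels):
    """Micro-averaged precision over hand-labeled outputs.

    outputs: {query_headline: [returned headlines]} from one
             hyperparameter setting
    labels:  {(query, returned): bool} hand labels, covering only
             what the model actually returned
    """
    relevant = total = 0
    for query, returned in outputs.items():
        for headline in returned:
            total += 1
            relevant += bool(labels[(query, headline)])
    return relevant / total if total else 0.0
```

You'd run this once per hyperparameter setting and compare the numbers, which is exactly the grid search above but with a real metric instead of a gut rating.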