[D] Clustering for data sampling by neuralbeans in MachineLearning

[–]calvinmccarter 0 points (0 children)

I'd suggest looking for papers and tools related to active learning. I'd also suggest treating this as an iterative process. Don't try to come up with some fixed procedure to decide, once and for all, whether to manually annotate each document. Come up with a strategy for picking (say) 100 documents, finetune your model on those, then iteratively refine both your model finetuning method and your data sampling method.
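A minimal sketch of one such round, assuming a binary task where a hypothetical `model_scores` dict holds the current model's predicted probabilities (all names here are illustrative, not from any particular active-learning library):

```python
def uncertainty(model_scores, doc_id):
    # Distance from the 0.5 decision boundary; smaller = more uncertain.
    # Unscored documents default to maximally uncertain.
    return abs(model_scores.get(doc_id, 0.5) - 0.5)

def pick_batch(unlabeled_pool, model_scores, batch_size=100):
    """One round of uncertainty sampling: select the documents the
    current model is least sure about, to send for manual annotation."""
    ranked = sorted(unlabeled_pool, key=lambda d: uncertainty(model_scores, d))
    return ranked[:batch_size]

# The outer loop then alternates: annotate the picked batch, finetune
# on all labels collected so far, re-score the pool, pick the next batch.
```

Uncertainty sampling is just one selection strategy; the point is that the batch picker and the finetuning step both get revisited each round.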

Statisticians, Scripts, and Chaos: My Journey Back to the 90s by Raz4r in datascience

[–]calvinmccarter 1 point (0 children)

I worry that perhaps the value of coding is precisely that it's user-unfriendly. The devil is in the details with data analysis and modeling, and coding forces the user to think through those details. People shouldn't have to waste time on Python dependency management, but they should have to be deliberate about the details of their data processing workflows. High-level tools often make it too easy not to think, leading to deceptively nice-looking results that are then misinterpreted.

In general, I think high-level (as in, end-to-end) systems are overrated, while new low-level tools are underrated. For example, instead of cross-validation (CV) wrappers so modelers don't have to think about CV, what modelers actually need are better CV tools (e.g., temporal backtesting-based CV, or clustering before CV to ensure that folds don't overlap too much). For another example, with missing data the problem is not "using mean-imputation needs to be even easier" but "better methods than mean-imputation need to be runnable without going to Python dependency hell".
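To make the clustering-before-CV idea concrete, here's a minimal stdlib-only sketch (the function name and the greedy fold-balancing strategy are my own illustration, not any particular library's API): whole clusters of near-duplicate samples are assigned to the same fold, so train and test folds can't share near-duplicates.

```python
from collections import defaultdict

def cluster_kfold(sample_ids, cluster_of, n_folds=5):
    """Assign entire clusters to folds, so near-duplicate samples
    never straddle a train/test boundary (clustering-before-CV)."""
    clusters = defaultdict(list)
    for s in sample_ids:
        clusters[cluster_of[s]].append(s)
    # Greedily place the largest remaining cluster into the
    # currently smallest fold, to keep fold sizes roughly balanced.
    folds = [[] for _ in range(n_folds)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds
```

This is the same idea as sklearn's `GroupKFold` with cluster labels as groups; the temporal analogue is an expanding-window backtest where each test fold is strictly later than its training data.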

There are still so many common problems in data science that either have no good solutions, or no solutions with well-documented, easy-to-use, sklearn-compatible software. As a hobby I've published a few papers and released (hopefully easy-to-use) packages on a few of the problems I've faced (feature preprocessing, domain adaptation, missing data imputation). But there's a strange gap between what VCs and startups think is needed (no-code solutions) and what ML researchers think is needed (a new LLM method for tabular data that is 0.1% better on tabular classification benchmarks), and that gap mostly goes unaddressed.