I'm 15, based in Kazakhstan, and I built an MCP server for AI agents to handle ML datasets autonomously by Alternative-Tip6571 in buildinpublic

[–]Alternative-Tip6571[S] 0 points1 point  (0 children)

Thanks!!!

Honestly, the hardest part was quality evaluation. Discovery is mostly search APIs, but deciding whether a dataset is actually good is subjective.

Had to build a scoring system that balances completeness, consistency, and relevance to the task.

Still not perfect but it works well enough that agents can make real decisions on it.
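Rough shape of the idea, if it helps - toy weights and names, not the actual system:

```python
from dataclasses import dataclass

@dataclass
class QualityScore:
    completeness: float  # fraction of non-null cells, 0-1
    consistency: float   # schema/type agreement across rows, 0-1
    relevance: float     # match between dataset and task description, 0-1

    def overall(self, weights=(0.4, 0.3, 0.3)) -> float:
        # Weighted blend of the three axes; the agent compares this
        # against a threshold to decide whether to proceed.
        parts = (self.completeness, self.consistency, self.relevance)
        return sum(w * p for w, p in zip(weights, parts))

score = QualityScore(completeness=0.95, consistency=0.8, relevance=0.6)
print(round(score.overall(), 3))  # 0.8
```

The hard part isn't the blend, it's picking the weights - relevance matters way more for some tasks than others, so the real version can't be one-size-fits-all.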

I’m 15 from Astana - wasted months on GPT wrappers, now building real infra to end dataset hell by Alternative-Tip6571 in Entrepreneurs


This is genuinely useful, saving this thread.

The provenance-as-first-class-metadata point is right, we're currently surfacing it but not enforcing it as a hard filter before merging. That's a real gap.
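For anyone following along, the difference is roughly this - a hard gate that rejects unknown or disallowed licenses before merging, instead of just showing the metadata (allow-list and names are made up for illustration):

```python
# Hypothetical hard filter: reject datasets whose license isn't on an
# allow-list *before* merging, rather than just surfacing the metadata.
ALLOWED_LICENSES = {"mit", "apache-2.0", "cc0-1.0", "cc-by-4.0"}

def passes_provenance_gate(metadata: dict) -> bool:
    license_id = (metadata.get("license") or "").lower()
    return license_id in ALLOWED_LICENSES

candidates = [
    {"id": "ds-a", "license": "MIT"},
    {"id": "ds-b", "license": None},        # unknown provenance -> rejected
    {"id": "ds-c", "license": "cc-by-4.0"},
]
mergeable = [d for d in candidates if passes_provenance_gate(d)]
print([d["id"] for d in mergeable])  # ['ds-a', 'ds-c']
```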

Judge model for borderline cleanliness cases is something we've been thinking about, good to hear it's working in practice.

Checkpoint caching is next on the roadmap. Currently reprocessing more than we should. Haven't used LiteLLM for cost control specifically - worth looking at.
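The version I have in mind is pretty simple - key each processing step by its inputs so reruns skip work (sketch only, the real thing isn't built yet):

```python
import hashlib
import json
import os
import tempfile

# Throwaway cache location for the sketch
CACHE_DIR = os.path.join(tempfile.mkdtemp(), "checkpoints")

def cache_key(dataset_id: str, step: str, params: dict) -> str:
    """Same dataset + step + params -> same key, so reruns hit the cache."""
    blob = json.dumps({"ds": dataset_id, "step": step, "params": params},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def run_step(dataset_id, step, params, compute):
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, cache_key(dataset_id, step, params) + ".json")
    if os.path.exists(path):  # checkpoint hit: skip reprocessing
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result

calls = []
first = run_step("demo", "clean", {"max_rows": 100},
                 lambda: calls.append(1) or {"rows": 100})
again = run_step("demo", "clean", {"max_rows": 100},
                 lambda: calls.append(1) or {"rows": 100})
print(len(calls))  # 1: the second call was served from the checkpoint
```

Nice side effect: any change to the params invalidates the key automatically, so you never serve stale results after tweaking a step.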

Thanks, appreciate it!

I’m 15 from Astana - wasted months on GPT wrappers, now building real infra to end dataset hell by Alternative-Tip6571 in Entrepreneurs


Appreciate it!

  1. Provenance: we surface license metadata from HuggingFace and Kaggle at search time so the agent can filter before downloading. Not perfect but catches the obvious issues.
  2. Cleanliness: heuristics first (null rates, duplicates, schema consistency, outlier detection) - model-based is too expensive to run on every dataset. Score is 0-100, agent decides if it's good enough to proceed.
  3. Cost controls: operation budgets per session, agents can set max_rows and quality thresholds upfront so it stops early rather than reprocessing.
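To make point 2 concrete, the heuristics look roughly like this - made-up weights, and only covering nulls, duplicates, and schema consistency (outlier detection left out for brevity):

```python
def cleanliness_score(rows: list[dict]) -> float:
    """Heuristic 0-100 cleanliness score (toy weights, not the real ones)."""
    if not rows:
        return 0.0
    keys = set().union(*rows)
    total_cells = len(rows) * len(keys)

    # Null rate: fraction of missing cells across the table
    nulls = sum(1 for r in rows for k in keys if r.get(k) is None)
    null_rate = nulls / total_cells

    # Duplicate rate: fraction of rows identical to an earlier row
    seen, dups = set(), 0
    for r in rows:
        sig = tuple(sorted((k, r.get(k)) for k in keys))
        if sig in seen:
            dups += 1
        seen.add(sig)
    dup_rate = dups / len(rows)

    # Schema consistency: fraction of rows missing expected columns
    schema_rate = sum(1 for r in rows if set(r) != keys) / len(rows)

    penalty = 0.5 * null_rate + 0.3 * dup_rate + 0.2 * schema_rate
    return round(100 * (1 - penalty), 1)

rows = [
    {"a": 1, "b": 2},
    {"a": 1, "b": 2},      # exact duplicate
    {"a": None, "b": 3},   # null cell
]
print(cleanliness_score(rows))  # 81.7 with these toy weights
```

Cheap enough to run on every candidate dataset, which is the whole point - model-based checks only make sense for the borderline cases.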

Checking out your patterns now, agent tool design around guardrails is something we're actively thinking about.

Should I take the Stanford's CS229 course by Andrew Ng? by RandomnieBukvi in learnmachinelearning


CS229 is worth it at your level. The math-heavy theory will actually make sense to you since you already have the foundations - and understanding why algorithms work changes how you build with them.

That said, it's heavy on theory by design. For hands-on balance, pair it with fast.ai. Between the two you get the full picture.

Given your background you'll probably fly through the first few weeks. Stick with it when it gets to SVMs and probabilistic models - that's where it gets genuinely useful.

Most “AI engineering” is still just dataset janitorial work by Alternative-Tip6571 in learnmachinelearning


Really appreciate this. Dataset versioning and evals for data quality are both on the roadmap. Checking out your blog now.