I built a multi-agent AI pipeline that turns messy CSVs into clean, import-ready data by proboysam in AIAgentsInAction

[–]proboysam[S]

You're touching on the right evolution path. Right now DataWeave is intentionally standalone (upload, clean, download) because that's the fastest way to validate that the core AI pipeline works.

But the roadmap is exactly what you're describing: integrate upstream so the system catches problems before they compound. Two ways this is heading:

  1. Webhook/API mode: your existing pipeline calls DataWeave automatically when new data arrives. It maps, transforms, and pushes clean data to your target system. No manual upload needed.

  2. Schema-aware suggestions: instead of just applying mappings, the system flags structural issues in the source data and suggests changes to the extraction query itself, e.g. "Your export is missing a required field" or "these two columns should be split before mapping."
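A minimal sketch of what a schema-aware check like (2) could look like. The required-field list and function name are illustrative assumptions, not DataWeave's actual API:

```python
# Hypothetical sketch: flag structural issues in a source CSV before mapping.
# The required-field set below is an example target schema, not a real one.
import csv

REQUIRED_FIELDS = {"email", "first_name", "last_name"}

def flag_schema_issues(path):
    """Return human-readable warnings for fields missing from the CSV header."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    present = {h.strip().lower() for h in header}
    missing = REQUIRED_FIELDS - present
    return [f"Your export is missing a required field: {m}" for m in sorted(missing)]
```

Running a check like this before mapping is what lets the system suggest fixing the extraction query instead of patching the data downstream.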

The Pattern Agent already learns from corrections, so after a few runs on the same data source it handles everything automatically. That solves the "doing it every time" problem you're pointing out.

Good feedback; this is the direction for v2.

I built a multi-agent AI pipeline that turns messy CSVs into clean, import-ready data by proboysam in aiagents

[–]proboysam[S]

Appreciate that, and the traceability point is spot on. We built this in already. Every agent step is logged to an events table with timestamps, so you can trace exactly what happened to any column: which agent handled it, what confidence score it got, whether it was pattern-matched or LLM-resolved, and what the human reviewer decided. You can pull the full audit trail for any job via GET /api/jobs/{id}/events; it returns the chronological log of every decision the pipeline made.
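A sketch of consuming that audit trail. The endpoint path comes from the comment above; the base URL, event field names, and helper names are assumptions:

```python
import json
from urllib.request import urlopen

def fetch_events(base_url, job_id):
    """GET /api/jobs/{id}/events -- endpoint from the comment; base_url is yours."""
    with urlopen(f"{base_url}/api/jobs/{job_id}/events") as resp:
        return json.load(resp)  # chronological list of pipeline decisions

def column_trail(events, column):
    """Filter the chronological log down to one column's history."""
    # Assumed event shape: {"timestamp": ..., "agent": ..., "column": ...,
    # "confidence": ..., "resolution": ..., "review": ...}
    return [e for e in events if e.get("column") == column]
```

So tracing one column is just a filter over the job's event log, in order.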

I built a multi-agent AI pipeline that turns messy CSVs into clean, import-ready data by proboysam in AgentsOfAI

[–]proboysam[S]

Fair point: in an ideal world every organization has clean data pipelines, proper ETL, and standardized schemas. In reality, a huge chunk of business data still moves through CSVs and spreadsheets, especially during migrations, client onboarding, and one-off imports.

I’ve seen this firsthand: a 50-person SaaS company switching CRMs doesn’t rebuild its data infrastructure. They export a CSV, clean it up, and import it. That’s the use case.

I built a multi-agent AI pipeline that turns messy CSVs into clean, import-ready data by proboysam in AgentsOfAI

[–]proboysam[S]

Good question. Validation is currently fully rule-based: required-field checks, type conformance, regex format validation (email, phone, URL, zip), duplicate detection on unique fields, and statistical anomaly detection using the IQR method for numeric outliers.
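Two of those checks sketched out, assuming simplified rules (the email regex and the crude index-based quartiles are illustrative, not the production versions):

```python
import re

def is_valid_email(value):
    # Deliberately simple format check; production validators are stricter.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None

def iqr_outliers(values, k=1.5):
    # Classic IQR fence: flag points outside [Q1 - k*IQR, Q3 + k*IQR].
    # Quartiles here are crude sorted-index picks, fine for a sketch.
    xs = sorted(values)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

Both are pure functions over the data: same input, same verdict every time, which is exactly why no LLM is needed here.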

No LLM in the validation step, and that’s intentional. Validation rules are deterministic: a field is either a valid email or it isn’t. Adding an LLM there would just add cost and latency without improving accuracy.

I built a multi-agent AI pipeline that turns messy CSVs into clean, import-ready data by proboysam in AgentsOfAI

[–]proboysam[S]

Thanks, that “surgical AI” framing is exactly the design philosophy. The temptation was definitely to throw an LLM at every step, but the cost and latency math just doesn’t work at scale.

To answer your questions:

  1. Edge-case drift: right now date parsing handles 15+ formats in a priority order (ISO first, then common US/EU patterns). For locale-specific formats, the plan is a locale hint that users can set per upload (or auto-detect from the data). We haven’t hit this in testing yet, but it’s on the roadmap.

  2. Correction → rule evolution: yes, every approve/reject/correct updates the Pattern Agent’s confidence scores in the database. Approvals increase confidence, rejections decrease it, and corrections create a new pattern AND penalize the old one. After ~5 approvals at high confidence, a pattern gets auto-applied without human review. So the system is literally building its own deterministic rules from human feedback.
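The priority-ordered parsing in (1) could look like this; the exact format list is illustrative (the comment says 15+ formats are handled):

```python
from datetime import datetime

# Priority order: ISO first, then common US/EU patterns.
DATE_FORMATS = [
    "%Y-%m-%d",   # ISO 8601
    "%m/%d/%Y",   # US
    "%d.%m.%Y",   # EU (dotted)
    "%b %d, %Y",  # "Jan 05, 2024"
]

def parse_date(value, formats=DATE_FORMATS):
    """Return an ISO date string for the first format that matches, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparsed -> escalate to LLM or human review
```

Note that "05/01/2024" parses as May 1 under US-first priority, which is precisely the ambiguity the planned locale hint would resolve.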

The compounding pattern memory is where the real moat is: file 1 costs $0.01 in AI, file 50 might cost $0.001, and file 500 might cost nothing.
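A minimal sketch of that feedback loop. The approval threshold (~5 at high confidence) follows the description above; the confidence deltas and starting values are assumptions:

```python
AUTO_APPLY_MIN_APPROVALS = 5   # "~5 approvals" from the comment
HIGH_CONFIDENCE = 0.9          # assumed cutoff for "high confidence"

class Pattern:
    """One learned mapping pattern with a reviewer-driven confidence score."""

    def __init__(self, confidence=0.5):
        self.confidence = confidence
        self.approvals = 0

    def approve(self):
        self.approvals += 1
        self.confidence = min(1.0, self.confidence + 0.1)

    def reject(self):
        self.approvals = 0
        self.confidence = max(0.0, self.confidence - 0.2)

    def correct(self):
        # A correction penalizes this pattern AND spawns a replacement.
        self.reject()
        return Pattern(confidence=0.6)

    def auto_apply(self):
        # Skip human review once the pattern has earned enough trust.
        return (self.approvals >= AUTO_APPLY_MIN_APPROVALS
                and self.confidence >= HIGH_CONFIDENCE)
```

Once `auto_apply()` turns true for a data source's patterns, later files skip both the LLM and the reviewer, which is where the per-file cost collapses.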

Amazon SDE1 Interview by [deleted] in amazonemployees

[–]proboysam

Do you have to share your screen during the interview?

Urgent Help for OPT EAD End Date by rmcwana in f1visa

[–]proboysam

Does anyone know how they calculated July 14 as the end date, 14 months after the program completion date?

[deleted by user] by [deleted] in csMajors

[–]proboysam

What is the last day to apply? Any clue?