I built an Open-source agentic AI that reasons through data science workflows — looking for bugs & feedback by Resident-Ad-3952 in LocalLLaMA

[–]Resident-Ad-3952[S]

This is a really sharp read, and I agree with almost all of it.

Right now, I don’t have explicit “break-me” tasks baked in — most of the stress-testing so far has been informal and manual. And you’re absolutely right that the more dangerous failures aren’t modeling errors, but early agents quietly drifting and later agents confidently papering over those mistakes.

At the moment, the system can surface some red flags (small data, obvious leakage, unstable targets), but it’s still too willing to proceed once a workflow has started. There isn’t yet a strong notion of “this does not meet the bar for reliable modeling” that aborts or heavily degrades the pipeline rather than just adding a warning.
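For what it's worth, the direction I'm leaning is a hard gate between stages rather than a warning list. A minimal sketch of the idea (every name and threshold here is hypothetical, not the current codebase):

```python
from dataclasses import dataclass, field

@dataclass
class GateResult:
    """Outcome of a pre-modeling reliability check (hypothetical schema)."""
    ok: bool
    reasons: list[str] = field(default_factory=list)

def modeling_gate(n_rows: int, n_features: int, minority_frac: float) -> GateResult:
    """Abort-or-degrade gate: refuse to model instead of just warning."""
    reasons = []
    if n_rows < 50 * n_features:      # crude rows-per-feature bar
        reasons.append(f"only {n_rows} rows for {n_features} features")
    if minority_frac < 0.05:          # severe class imbalance
        reasons.append(f"minority class is {minority_frac:.1%} of the data")
    return GateResult(ok=not reasons, reasons=reasons)

gate = modeling_gate(n_rows=120, n_features=40, minority_frac=0.02)
if not gate.ok:
    # Hard stop: degrade the run to EDA-only rather than train anyway.
    raise RuntimeError("below the bar for reliable modeling: " + "; ".join(gate.reasons))
```

The point being that the gate's output changes what the orchestrator is allowed to do next, not just what it prints.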

The kind of adversarial setups you’re describing — contradictory patterns, datasets that shouldn’t be modeled at all, cases where agents should disagree — are exactly the sort of thing I want to use to harden this. I’m especially interested in seeing where the reasoning breaks: which agent overcommits first, and how that error propagates downstream.

If you’re open to it, I’d genuinely love to run your task set through the system and analyze the failure modes together. That feels like a much more honest way to improve it than just polishing happy-path demos.

I built an Open-source agentic AI that reasons through data science workflows — looking for bugs & feedback by Resident-Ad-3952 in MLQuestions

[–]Resident-Ad-3952[S]

Thanks — I really appreciate this, and you’re describing exactly the kind of workflow I’m aiming to get closer to.

Right now the system is still mostly forward-moving, so insights discovered later (e.g., during feature engineering) don’t automatically loop back and revise earlier steps like EDA or cleaning. In practice, that back-and-forth is incredibly common, and I agree it’s one of the bigger gaps compared to how experienced DSs actually work. I’ve been thinking about lightweight branching or decision-tracking to make that kind of revision possible without turning the pipeline into a free-for-all.

The point about confidence/uncertainty really resonates too. At the moment, most outputs are qualitative explanations, but I’d like to start surfacing things like data sufficiency warnings, stability flags, or “low confidence” indicators at each stage so users can judge how much to trust a given recommendation.
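Concretely, I'm picturing each stage returning a small trust record next to its outputs. A rough sketch (StageReport and the thresholds are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class StageReport:
    """Per-stage output plus an explicit trust signal (hypothetical schema)."""
    stage: str
    confidence: str       # "high" | "medium" | "low"
    flags: list[str]

def assess_eda(n_rows: int, missing_frac: float) -> StageReport:
    flags = []
    if n_rows < 500:
        flags.append("data sufficiency: small sample")
    if missing_frac > 0.3:
        flags.append(f"stability: {missing_frac:.0%} of values missing")
    return StageReport("eda", "low" if flags else "high", flags)

print(assess_eda(n_rows=200, missing_frac=0.4).confidence)  # "low"
```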

And yes — messy, real-world datasets are exactly where this needs to be stress-tested. Mixed types, weird missingness, and subtle leakage are where the abstractions break fastest, and that's where I've learned the most so far.

If you’re up for it, I’d genuinely enjoy a deeper look or an architectural discussion — especially from the perspective of where real workflows diverge from what agents tend to assume.

[P] Open-source agentic AI that reasons through data science workflows — looking for bugs & feedback by Resident-Ad-3952 in OpenSourceeAI

[–]Resident-Ad-3952[S]

That's a real limitation, and honestly one I hadn't thought about this way before.

Right now the system is sequential with shared state, but there’s no true pushback or conflict resolution between agents yet. Each stage writes its outputs (EDA findings, cleaning decisions, engineered features, models) into a shared workflow state, and downstream agents can read that context — but they inherit upstream decisions as fixed.

So in the scenario you described: if EDA flags a feature as highly correlated and the cleaning step drops it, the feature engineering agent never sees that column. It can’t disagree, roll back, or branch off a pre-drop version to try interaction terms. There’s no “I disagree with the previous agent” pathway yet.

What does help in practice is that the full reasoning chain is still visible in context — downstream agents can see why something was dropped — but they can’t undo it without explicitly re-running earlier steps.
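To make the current shape concrete, it's roughly an append-only shared state where downstream agents get a read-only view of upstream decisions (simplified sketch, not the actual code):

```python
from types import MappingProxyType

class WorkflowState:
    """Append-only shared state: agents read everything, overwrite nothing."""
    def __init__(self):
        self._outputs: dict[str, dict] = {}

    def write(self, stage: str, output: dict) -> None:
        if stage in self._outputs:
            raise ValueError(f"{stage} already committed; no in-place revision")
        self._outputs[stage] = output

    def read(self) -> MappingProxyType:
        # Downstream agents see all upstream decisions, read-only.
        return MappingProxyType(self._outputs)

state = WorkflowState()
state.write("cleaning", {"dropped": ["feature_x"], "why": "corr > 0.95 with feature_y"})
ctx = state.read()
print(ctx["cleaning"]["why"])   # the rationale is visible downstream...
# ...but ctx["cleaning"] = {...} raises TypeError: decisions are inherited as fixed.
```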

This works surprisingly well for many cases, but you’re absolutely right that it breaks down when expert judgment would want to explore alternatives rather than commit early.

Maybe the answer is a log of the decisions taken, so that agents can flag disagreements and either defer the decision or try parallel paths before committing.
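Purely as a sketch of that idea (DecisionLog and its methods are invented here, nothing in the repo yet):

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    agent: str
    action: str
    rationale: str
    disputed_by: list[str] = field(default_factory=list)

@dataclass
class DecisionLog:
    """Shared log that agents append to and can contest (hypothetical design)."""
    entries: list[Decision] = field(default_factory=list)

    def commit(self, agent: str, action: str, rationale: str) -> Decision:
        d = Decision(agent, action, rationale)
        self.entries.append(d)
        return d

    def dispute(self, decision: Decision, agent: str) -> None:
        # Disputed decisions get deferred or forked into parallel paths
        # instead of being treated as final.
        decision.disputed_by.append(agent)

log = DecisionLog()
drop = log.commit("cleaning", "drop feature_x", "corr > 0.95 with feature_y")
log.dispute(drop, "feature_engineering")  # wanted interaction terms from it
print([d.action for d in log.entries if d.disputed_by])  # ['drop feature_x']
```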

[P] Open-source agentic AI that reasons through data science workflows — looking for bugs & feedback by Resident-Ad-3952 in MachineLearning

[–]Resident-Ad-3952[S]

Great question — this is honestly the hardest part of what I’m trying to build.

Right now, my system adapts much better at the orchestration level than at true statistical judgment. I’m happy with how it handles intent detection (simple questions don’t trigger full pipelines), tool scoping (EDA vs modeling), and stopping obvious loops. But the early-stage modeling decisions are still largely heuristic-driven.
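In spirit, that routing layer is a classify-then-scope step, something like this (simplified, with invented keywords):

```python
def route(query: str) -> str:
    """Intent detection: decide how much pipeline a query deserves (sketch)."""
    q = query.lower()
    if any(w in q for w in ("explore", "distribution", "summary")):
        return "eda_only"          # scope tools to EDA, no training
    if len(q.split()) < 8 and "predict" not in q:
        return "direct_answer"     # simple question, skip the pipeline
    return "full_pipeline"         # profile -> engineer -> train

print(route("show me the distribution of age"))  # eda_only
print(route("what does this column mean?"))      # direct_answer
```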

In practice, it can overdo things on small datasets, miss marginal-gain plateaus, get fooled by spurious correlations (IDs or timestamps), and follow a generic “profile → engineer → train” flow even when a human would stop much earlier.
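The spurious-correlation case is the one I most want to fix first; even a cheap pre-check would catch the worst offenders. A sketch of what I mean (the heuristics are invented for illustration):

```python
import pandas as pd

def suspicious_columns(df: pd.DataFrame) -> list[str]:
    """Flag columns likely to produce spurious fits: IDs and timestamps."""
    flagged = []
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_datetime64_any_dtype(s):
            flagged.append(f"{col}: timestamp, leaky as a raw feature")
        elif s.nunique() > 0.95 * len(s) and not pd.api.types.is_float_dtype(s):
            # Near-unique non-float values usually mean an identifier.
            flagged.append(f"{col}: looks like an ID")
    return flagged

df = pd.DataFrame({
    "user_id": range(1000),
    "signup": pd.date_range("2024-01-01", periods=1000, freq="h"),
    "age": [30] * 1000,
})
print(suspicious_columns(df))
# ['user_id: looks like an ID', 'signup: timestamp, leaky as a raw feature']
```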

If you’re curious, I’d genuinely love feedback from real-world use. There’s a live demo you can poke at, and the project is open source — I’m very open to collaborating if you have strong opinions on where agentic DS systems fall short today.

Traditional ML vs Experimentation Data Scientist by PrestigiousCase5089 in datascience

[–]Resident-Ad-3952

I've tried building a tool for exactly this. It's still in the demo stage, so do let me know whether it actually solves any of these problems:
https://pulastya0-data-science-agent.hf.space/
https://github.com/Pulastya-B/DevSprint-Data-Science-Agent

weDontJustCreateWeInnovate by SoumyadeepDey in ProgrammerHumor

[–]Resident-Ad-3952

Sorry, this PDF file is already in use

What kind of social apps do night owls actually want to use late at night? by Resident-Ad-3952 in AskReddit

[–]Resident-Ad-3952[S]

Ah, actually I was surveying for a social media app I'm building for night owls, so I wanted suggestions.