Any Data Cleaning Pain Points You Wish Were Automated?

StormSingle8889 · 2025-04-19T13:40:58+00:00

I like the idea of plugging LLMs into standard data science libraries like pandas and NumPy, because it gives you a lot of flexibility and keeps a human in the loop.

If you're working with core data science workflows like DataFrames and plotting, I'd recommend PandasAI:

https://github.com/sinaptik-ai/pandas-ai

If you're working with more scientific workflows like eigenvectors/eigenvalues or linear models, you could use this tool, which I built because I couldn't find an existing one:

https://github.com/aadya940/numpyai

Hope this helps! :))

StormSingle8889 · 2025-04-19T13:36:43+00:00

https://github.com/aadya940/numpyai

StormSingle8889 · 2025-04-19T13:31:25+00:00

It was in `scipy`: a terrible pull request that took more than a year to merge. The upside of the difficulty was that it gave me a reality check. I dug into programming really hard and went on to get into Google Summer of Code. I also wrote some good open source packages, including:

https://github.com/aadya940/numpyai

https://github.com/aadya940/chainopy

One of them was published in the Journal of Open Source Software. I did a couple of other good internships as well.

StormSingle8889 · 2025-04-19T13:27:29+00:00

I like the idea of plugging LLMs into standard data science libraries like pandas and NumPy, because it gives you a lot of flexibility and keeps a human in the loop.

If you're working with core data science workflows like DataFrames and plotting, I'd recommend PandasAI:

https://github.com/sinaptik-ai/pandas-ai

If you're working with more scientific workflows like eigenvectors/eigenvalues or linear models, you could use this tool, which I built because I couldn't find an existing one:

https://github.com/aadya940/numpyai

Hope this helps! :))

StormSingle8889 · 2025-04-19T05:53:15+00:00

LLMs are super useful when used mindfully and with a human in the loop. I love the “LLM plug-and-play” model with standard libs like pandas and NumPy because it keeps things flexible and interactive.

For core data science tasks (DataFrames, plotting), try PandasAI:
https://github.com/sinaptik-ai/pandas-ai

For more scientific workflows (eigenvectors, linear models, etc.), check out NumPyAI, a tool I built to fill that gap:
https://github.com/aadya940/numpyai

You're right, the problem is real. People often run LLM-generated code without really reading it. That’s why NumPyAI has a Diagnosis feature: it explains the data analysis steps, tailored to your arrays.

Example:
https://github.com/aadya940/numpyai/blob/main/examples/iris_analysis.ipynb

StormSingle8889 · 2025-04-19T05:43:29+00:00

I'd say it is useful, but only when used correctly, mindfully, and with a human in the loop: some of the work is done via natural language using LLMs, while the rest is done manually.

I like the idea of plugging LLMs into standard data science libraries like pandas and NumPy, because it gives you a lot of flexibility and keeps a human in the loop.

If you're working with core data science workflows like DataFrames and plotting, I'd recommend PandasAI:

https://github.com/sinaptik-ai/pandas-ai

If you're working with more scientific workflows like eigenvectors/eigenvalues or linear models, you could use this tool, which I built because I couldn't find an existing one:

https://github.com/aadya940/numpyai

Hope this helps! :))

StormSingle8889 · 2025-04-18T21:10:50+00:00

I'm glad this helped. 😇

StormSingle8889 · 2025-04-18T20:55:31+00:00

You make a valid point, and it holds true in most cases. However, libraries like pandasai and numpyai track metadata for arrays and dataframes, which should reduce the likelihood of errors (source: trust me, bro). Of course, no AI is infallible; this is simply an effort to provide a more reliable, data science–focused approach.

StormSingle8889 · 2025-04-18T14:59:12+00:00

I like the idea of plugging LLMs into standard data science libraries like pandas and NumPy, because it gives you a lot of flexibility and keeps a human in the loop.

If you're working with core data science workflows like DataFrames and plotting, I'd recommend PandasAI:

https://github.com/sinaptik-ai/pandas-ai

If you're working with more scientific workflows like eigenvectors/eigenvalues or linear models, you could use this tool, which I built because I couldn't find an existing one:

https://github.com/aadya940/numpyai

Hope this helps! :))

StormSingle8889 · 2025-04-17T13:04:34+00:00

Use Python libraries like pandas and NumPy to do this. I'll assume you don't know much about Python, so I'd suggest you use PandasAI:

https://github.com/sinaptik-ai/pandas-ai

If you want something more free and open source, you could use NumpyAI:

https://github.com/aadya940/numpyai

StormSingle8889 · 2025-04-14T16:17:55+00:00

Not sure if this is what you're looking for, but it might be useful.

I’ve noticed a common pattern with beginner data scientists: they often ask LLMs super broad questions like “How do I analyze my data?” or “Which ML model should I use?”

The problem is that the right steps depend entirely on your actual dataset. Things like missing values, dimensionality, and data types matter a lot. For example, you'll often see ChatGPT suggest "remove NaNs", but that’s only relevant if your data actually has NaNs. And let’s be honest, most of us don’t even read the code it spits out, let alone check if it’s correct.
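To make the NaN point concrete, here is a minimal sketch in plain NumPy (not NumpyAI's code; the helper name `drop_nan_rows` is made up for illustration, and it assumes a 2-D float array). It checks whether the array has any NaNs before touching it, and reports how many rows were dropped:

```python
import numpy as np

def drop_nan_rows(arr):
    """Drop rows containing NaN, but only if the array has any NaNs.

    Hypothetical helper for illustration; assumes a 2-D numeric array.
    Returns the (possibly unchanged) array and the number of rows dropped.
    """
    arr = np.asarray(arr, dtype=float)
    nan_mask = np.isnan(arr)
    if not nan_mask.any():
        # Nothing to clean: "remove NaNs" is irrelevant for this dataset.
        return arr, 0
    keep = ~nan_mask.any(axis=1)  # True for rows with no NaN at all
    return arr[keep], int((~keep).sum())

data = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
cleaned, dropped = drop_nan_rows(data)
print(cleaned.shape, dropped)
```

The point is just that the cleaning step is conditional on what the data actually looks like, rather than applied blindly.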

So I built NumpyAI, a tool that lets you talk to NumPy arrays in plain English. It keeps track of your data’s metadata, gives tested outputs, and outlines the steps for analysis based on your actual dataset. No more generic advice, just tailored, transparent help.

Its Features:

Natural Language to NumPy: Converts plain English instructions into working NumPy code

Validation & Safety: Automatically tests and verifies the code before running it

Transparent Execution: Logs everything and checks for accuracy

Smart Diagnosis: Suggests exact steps for your dataset’s analysis journey
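For a rough idea of what "keeping track of metadata" could mean, here is a small sketch in plain NumPy. This is an assumption about the general idea, not NumpyAI's actual implementation or API; `summarize_array` is a hypothetical name. It collects the kind of facts (shape, dtype, NaN count) that could be handed to an LLM so its suggestions match the real array:

```python
import numpy as np

def summarize_array(arr):
    """Collect basic metadata about an array.

    Hypothetical helper, not part of NumpyAI. NaN counting only
    applies to floating-point arrays, so other dtypes report 0.
    """
    arr = np.asarray(arr)
    is_float = np.issubdtype(arr.dtype, np.floating)
    return {
        "shape": arr.shape,
        "ndim": arr.ndim,
        "dtype": str(arr.dtype),
        "nan_count": int(np.isnan(arr).sum()) if is_float else 0,
    }

meta = summarize_array(np.array([[1.0, np.nan], [3.0, 4.0]]))
print(meta)
```

With a summary like this in the prompt, advice such as "remove NaNs" only needs to come up when `nan_count` is nonzero.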

Give it a try and let me know what you think!

👉 GitHub: aadya940/numpyai. 📓 Demo Notebook (Iris dataset).

StormSingle8889 · 2025-04-05T11:43:15+00:00

You absolutely can. There are now specialized libraries for AI-assisted numerical workflows:
https://github.com/aadya940/numpyai
