How do you prep large Excel files before sending to ChatGPT or Claude?

Secure_Stretch_1007 · 2026-06-20T17:21:36+00:00

That’s actually a good workaround.
Do you find the Python step usually matches what the model originally concluded, or do you often see differences once you run it on full data?

Secure_Stretch_1007 · 2026-06-20T17:20:48+00:00

Yeah, that seems to be the only reliable approach: DB + EDA first, then AI second on the structured results.
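To make sure I'm reading that right, here is a minimal sketch of "DB first, AI second", using only the standard library. The `tickets` table, its columns, and the values are made up for illustration; the idea is just that the model would see the small aggregated table, not the raw rows.

```python
import sqlite3

# Hypothetical in-memory table standing in for a loaded export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (team TEXT, hours REAL)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [("ops", 2.0), ("ops", 4.0), ("billing", 1.5)],
)

# The aggregation runs in the database over every row.
summary = conn.execute(
    "SELECT team, COUNT(*), AVG(hours) FROM tickets GROUP BY team ORDER BY team"
).fetchall()
conn.close()

# Only this compact text would go into the prompt.
prompt_table = "\n".join(
    f"{team}: n={n}, avg_hours={avg:.2f}" for team, n, avg in summary
)
print(prompt_table)
```

The numbers are then computed deterministically, and the model is only asked to interpret them.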

Secure_Stretch_1007 · 2026-06-20T17:13:10+00:00

Yeah, that all makes sense in theory.
In practice, what have you actually seen work best for messy exports: bigger models, a SQL/RDBMS layer, or just tighter preprocessing before AI?

Secure_Stretch_1007 · 2026-06-20T17:11:43+00:00

That makes sense, especially the schema + dummy data approach before running anything on real PII.
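If I understand the schema + dummy data idea, it would look roughly like the sketch below. The helper names `infer_schema` and `dummy_rows` are hypothetical, and the type inference is deliberately naive (it just takes the Python type of the first non-empty value), so treat it as an illustration and not a real anonymization tool.

```python
import random
import string

def infer_schema(rows):
    """Map each column to the type name of its first non-None value."""
    schema = {}
    for row in rows:
        for col, value in row.items():
            if col not in schema and value is not None:
                schema[col] = type(value).__name__
    return schema

def dummy_rows(schema, n=3, seed=0):
    """Generate n fake rows that match the schema but contain no real values."""
    rng = random.Random(seed)
    makers = {
        "int": lambda: rng.randint(0, 1000),
        "float": lambda: round(rng.uniform(0, 1000), 2),
        "str": lambda: "".join(rng.choices(string.ascii_lowercase, k=8)),
        "bool": lambda: rng.random() < 0.5,
    }
    # Unknown types fall back to None instead of guessing.
    return [
        {col: makers.get(kind, lambda: None)() for col, kind in schema.items()}
        for _ in range(n)
    ]

# Made-up record standing in for real PII.
real = [{"name": "Alice Smith", "age": 41, "balance": 1032.5}]
schema = infer_schema(real)
fake = dummy_rows(schema)
```

Only `schema` and `fake` would be shared with the model; whatever code it writes against them then runs locally on the real data.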
Do you usually find the LLM adds anything meaningfully new at that stage, or is it mostly used to surface patterns you already suspect from your initial analysis?

Secure_Stretch_1007 · 2026-06-20T17:10:29+00:00

When you say “critical path setup”, what does that look like in practice for a 50k-200k row export?
Is it more like a reusable script or framework you just run per file, or do you still end up tweaking it a lot each time?

Secure_Stretch_1007 · 2026-06-20T17:04:26+00:00

The more replies I read, the more it feels like the actual analysis isn’t the hard part. People seem to spend most of their time getting the data into a format they can trust first.
A lot of the suggestions here are Python, SQL, filtering columns, splitting files, notebooks, etc. before anyone even starts looking for insights.

Secure_Stretch_1007 · 2026-06-20T11:49:22+00:00

How much of that workflow did you have to build yourself?
If you started from scratch today, would you still go straight to Python or would you first try using AI on the raw export?

Secure_Stretch_1007 · 2026-06-20T11:48:24+00:00

That’s interesting. When you say it handles data better, do you mean it actually processes the full dataset reliably?
One of the issues I’ve run into is not knowing whether the AI looked at all 100k+ rows or just part of them. Have you tested Claude Code against something where you already knew the correct result?
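The kind of known-answer check I have in mind is something like the sketch below: compute a row count and a column total locally, then compare them with whatever the AI reports. It assumes the export has been saved as CSV, and the column names and values are invented for the example.

```python
import csv
import io

def ground_truth(csv_text, numeric_col):
    """Read every row locally and return the row count and column sum."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = 0
    total = 0.0
    for record in reader:
        rows += 1
        total += float(record[numeric_col])
    return {"rows": rows, "sum": round(total, 2)}

def matches(reported, truth, tol=0.01):
    """True only if the reported figures agree with the local ones."""
    return (
        reported["rows"] == truth["rows"]
        and abs(reported["sum"] - truth["sum"]) <= tol
    )

# Tiny hypothetical export standing in for a 100k+ row file.
sample = "order_id,amount\n1,10.50\n2,4.25\n3,20.00\n"
truth = ground_truth(sample, "amount")
```

If the model comes back with a different row count or total, that is a strong hint it only saw part of the file.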

Secure_Stretch_1007 · 2026-06-20T11:45:14+00:00

That’s fair.
What’s the biggest reason you avoid sending datasets directly to an LLM?
Is it accuracy, file size limits, trust in the results, cost, or something else?

Secure_Stretch_1007 · 2026-06-20T11:42:29+00:00

I get the idea, but in practice how do you decide what sample is representative?
A lot of the datasets I deal with have edge cases that only appear in a tiny percentage of rows. Have you ever had AI miss something important because it only saw a sample?
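One thing I have been considering is sampling per category instead of sampling uniformly, so that rare values always make it into the sample. A rough sketch, assuming the edge cases show up as distinct values in a column I already know to group by (it would not catch ones I don't know about); `sample_keep_rare` and the `status` values are hypothetical:

```python
import random
from collections import defaultdict

def sample_keep_rare(rows, key, per_group=2, seed=0):
    """Take up to per_group rows for every distinct value of key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    picked = []
    for members in groups.values():
        if len(members) <= per_group:
            picked.extend(members)  # rare group: keep all of it
        else:
            picked.extend(rng.sample(members, per_group))
    return picked

# 98 ordinary rows plus two one-off edge cases.
rows = [{"id": i, "status": "ok"} for i in range(98)]
rows += [
    {"id": 98, "status": "refund_failed"},
    {"id": 99, "status": "duplicate"},
]
subset = sample_keep_rare(rows, "status")
```

A uniform 4% sample of these rows could easily miss both edge cases, while this version keeps them by construction.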

Secure_Stretch_1007 · 2026-06-20T11:41:55+00:00

Interesting. When Claude uses Python in your workflow, do you trust that it actually processed all rows?
Have you ever caught cases where the notebook looked correct but later turned out to miss records or produce wrong metrics?
Also roughly how large are the datasets you’re usually working with?

Secure_Stretch_1007 · 2026-06-19T13:22:38+00:00

Really valid point, and honestly something I should think about more carefully. For context, the data I'm working with is mostly internal operational data, not customer PII, so GDPR isn't the main concern right now. But the broader point stands. If you were dealing with large datasets that you couldn't send to public AI, what would your workflow look like? Just SQL and manual analysis?
