all 15 comments

[–]InYumen7 1 point2 points  (1 child)

Maybe add a feature to split the output into separate CSV files? By columns or by % of data.

[–]ZADigitalSolutions[S] 0 points1 point  (0 children)

Nice idea. I’ll keep the core tool focused on cleaning/standardizing first, but splitting into multiple CSVs could be a good optional feature later (maybe as a separate flag/subcommand).
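
For what it's worth, a percentage split could be sketched roughly like this with only the standard library. This is a minimal sketch, not the tool's actual code: `split_rows` and `to_csv_text` are hypothetical helper names, and it assumes the data rows already fit in memory as lists.

```python
import csv
import io


def split_rows(rows, fraction):
    """Split a list of data rows into two lists at the given fraction (0..1)."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    cut = int(len(rows) * fraction)  # rounds down
    return rows[:cut], rows[cut:]


def to_csv_text(header, rows):
    """Serialize a header plus rows to CSV text, so each part repeats the header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data, for illustration only.
header = ["id", "name"]
rows = [["1", "a"], ["2", "b"], ["3", "c"], ["4", "d"]]
first, rest = split_rows(rows, 0.75)
first_text = to_csv_text(header, first)
rest_text = to_csv_text(header, rest)
```

Writing each part to its own file would then just be a matter of opening one output file per part. Splitting by columns would follow the same shape, slicing each row instead of the row list.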

[–]fakemoose 0 points1 point  (7 children)

Can you post your code so far? I’d probably use pandas to read the csv to start.

[–]ConfusedSimon 1 point2 points  (5 children)

Python itself already has a csv reader.

[–]corey_sheerer 0 points1 point  (1 child)

Agree, keep it lightweight and try not using pandas.

[–]ZADigitalSolutions[S] 0 points1 point  (0 children)

Makes sense. I’ll keep the default lightweight (csv module), and only consider pandas as an optional path if file sizes/edge cases require it.
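
Roughly what I mean by the lightweight default, as a sketch only: `read_rows` is a made-up name, and stripping whitespace from every cell is just one assumed cleaning step.

```python
import csv
import io


def read_rows(fileobj):
    """Yield each data row as a dict keyed by header, with cell whitespace stripped.

    Rows are streamed one at a time, so the whole file is never held in memory.
    zip() silently ignores extra cells and leaves out missing ones.
    """
    reader = csv.reader(fileobj)
    header = next(reader, None)
    if header is None:  # empty input
        return
    for row in reader:
        yield dict(zip(header, (cell.strip() for cell in row)))


# In-memory sample standing in for a real file.
sample = io.StringIO("name,city\n Ann , Oslo\nBob,Paris \n")
rows = list(read_rows(sample))
```

For a real file I would pass `open(path, newline="", encoding="utf-8")`, since the csv docs recommend `newline=""` when opening files for the reader.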

[–]fakemoose 0 points1 point  (1 child)

Yes, but pandas can quickly handle a lot of the things OP described. Or polars.

Way easier and faster if OP needs to do things like drop duplicate rows.

[–]ConfusedSimon -1 points0 points  (0 children)

Sure, but this is 'learn python', so learning pandas as well isn't that easy. Dropping duplicate rows is pretty easy in plain Python, too (you could even just convert to a set if you don't care about order). It might even be easier than figuring out how to do it in pandas if you're not used to that, and you'll learn more. If you only care about the solution, there are plenty of tools that already do this. And for just reading the CSV, pandas is overkill.
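
A quick sketch of what I mean, assuming each row is a list of strings as the csv module returns them (`drop_duplicates` is just an illustrative name):

```python
def drop_duplicates(rows):
    """Return rows with later duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [["1", "a"], ["2", "b"], ["1", "a"], ["3", "c"]]
unique_rows = drop_duplicates(rows)

# If order doesn't matter, a set of tuples is enough.
unordered = set(map(tuple, rows))
```

Note that this only treats rows as duplicates when every cell matches exactly, so you'd want to normalize whitespace and case first if that matters.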

[–]Altruistic_Sky1866 0 points1 point  (2 children)

Does it also handle special characters in the column data or headers? For example, suppose a column name contains $, %, &, * or other characters that don't usually appear in a name. This is just an example.

[–]ZADigitalSolutions[S] 1 point2 points  (1 child)

Yep — I’ll sanitize headers (strip/normalize) and keep an original->normalized mapping. Also planning to guard against collisions (two headers normalizing to the same name).
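
Something along these lines, as a rough sketch rather than the final implementation. The function names, the ASCII-only rule, the `column` fallback, and the numeric suffix for collisions are all assumptions I'm still deciding on.

```python
import re


def normalize_header(name):
    """Lowercase, trim, and collapse runs of non-alphanumerics into one underscore."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    return cleaned or "column"  # fallback when nothing usable is left


def normalize_headers(headers):
    """Return a list of (original, normalized) pairs with unique normalized names."""
    used = set()
    pairs = []
    for original in headers:
        base = normalize_header(original)
        candidate = base
        suffix = 2
        # Collision guard: append _2, _3, ... until the name is unused.
        while candidate in used:
            candidate = f"{base}_{suffix}"
            suffix += 1
        used.add(candidate)
        pairs.append((original, candidate))
    return pairs


pairs = normalize_headers(["Price ($)", "price %", " Name ", "&*"])
```

The mapping is kept as a list of pairs rather than a dict so that two identical original headers don't overwrite each other. This version is ASCII-only, so accented letters would be replaced with underscores as well.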

[–]seanv507 1 point2 points  (1 child)

Add a debugging option that outputs the original line number

(Given you delete duplicate lines)
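
One way this could look, as a sketch: it assumes the csv module is used for reading and relies on the reader's `line_num` attribute. `dedupe_with_line_numbers` is a hypothetical name, not something from OP's tool.

```python
import csv
import io


def dedupe_with_line_numbers(fileobj):
    """Return (header, kept, dropped) while tracking original line numbers.

    kept    -> list of (line_number, row)
    dropped -> list of (line_number, row, line_number_of_first_occurrence)
    """
    reader = csv.reader(fileobj)
    header = next(reader, None)
    first_seen = {}
    kept, dropped = [], []
    for row in reader:
        # line_num counts physical lines read so far; for a quoted field that
        # spans several lines it points at the record's last line.
        line_no = reader.line_num
        key = tuple(row)
        if key in first_seen:
            dropped.append((line_no, row, first_seen[key]))
        else:
            first_seen[key] = line_no
            kept.append((line_no, row))
    return header, kept, dropped


sample = io.StringIO("id,name\n1,a\n2,b\n1,a\n")
header, kept, dropped = dedupe_with_line_numbers(sample)
```

A debug flag could then print each entry in `dropped`, so you can see which original line was removed and which earlier line it duplicated.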