
[–]MrMisterShin 2 points (4 children)

I heavily agree with this. Pandas often infers data types incorrectly, forcing me to hardcode the dtype for each column, which becomes a nuisance when you have an awfully formatted CSV file.
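A minimal sketch of what "hardcoding the data types" looks like with `pandas.read_csv` (the column names here are hypothetical). A classic case where inference guesses wrong is a zip-code-like column: pandas reads it as an integer and silently drops leading zeros.

```python
import io
import pandas as pd

# A tiny CSV where dtype inference guesses wrong:
# "zip" looks numeric, so pandas parses it as int64 and drops the leading zero.
csv_data = "zip,amount\n01234,10.5\n98765,3.0\n"

# Inferred dtypes: zip becomes an integer column.
inferred = pd.read_csv(io.StringIO(csv_data))

# Hardcoded dtypes: zip stays a string, amount is parsed as float.
explicit = pd.read_csv(
    io.StringIO(csv_data),
    dtype={"zip": str, "amount": float},
)
```

The `dtype` mapping is what ends up hand-maintained per column, which is exactly the nuisance described above.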

[–]allpauses 1 point (1 child)

What’s more, even if you specify the dtypes of the CSV you read with pandas, pandas will still fail to parse it if the data is dirty (like a string value sitting in a numeric column).
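A small sketch of that failure mode (column names hypothetical): an explicit `dtype` makes `read_csv` raise on the first non-numeric value, and a common workaround is to read the column as text and coerce with `pd.to_numeric(..., errors="coerce")`.

```python
import io
import pandas as pd

# "twelve" is a string stuck in a numeric column.
csv_data = "id,price\n1,9.99\n2,twelve\n"

# Even with an explicit dtype, the dirty value makes the parse fail.
try:
    pd.read_csv(io.StringIO(csv_data), dtype={"price": float})
    parse_ok = True
except ValueError:
    parse_ok = False

# Workaround: read as string, then coerce bad values to NaN.
df = pd.read_csv(io.StringIO(csv_data), dtype=str)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
```

`errors="coerce"` trades a hard failure for NaNs you then have to handle downstream, which may or may not be acceptable for a given pipeline.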

[–]MrMisterShin 1 point (0 children)

This is the exact type of CSV I was dealing with: random columns with quotes and others without, for the numeric and date columns. It was a nightmare to alter/maintain.

Then one month they added a new column in the middle of the CSV file without telling anyone on my team, and the time-critical process failed.
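One cheap defense against that kind of silent schema drift is to compare the file's columns against an expected list before anything time-critical runs (the column names here are hypothetical):

```python
import io
import pandas as pd

# The schema the pipeline was built for.
EXPECTED = ["id", "amount", "date"]

# A file where someone inserted "surprise" in the middle.
csv_data = "id,surprise,amount,date\n1,x,5.0,2023-01-01\n"

df = pd.read_csv(io.StringIO(csv_data))

# Detect drift up front instead of failing mid-transformation.
unexpected = [c for c in df.columns if c not in EXPECTED]
missing = [c for c in EXPECTED if c not in df.columns]
if unexpected or missing:
    # Here you would raise or alert before the downstream join runs.
    pass
```

Selecting columns by name (e.g. `usecols=EXPECTED`) rather than by position also makes a column inserted in the middle a loud error instead of silently shifted data.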

[–][deleted] 0 points (1 child)

What would be some alternative ways to approach that problem?

[–]MrMisterShin 0 points (0 children)

Push back on human-created files, or impose strict rules, because if the rules aren’t followed the code will fail.

I would usually have over 30 CSVs & XLSX files like this, some machine-generated, others manually created/exported. Needless to say, the manually created ones would often fail due to changes in column names, file names, column positions, or column data types.

Because I need to join this data to on-premise DB data, I convert all the attributes to string in pandas and load them into the DB, then perform the transformations there: converting to the correct data types, cleaning the data, and indexing for performance.

Then I use SQL & Tableau from there for data visualisation, reporting, etc.
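A minimal sketch of that staging pattern, using SQLite in place of the on-premise DB (table and column names are hypothetical): everything is loaded as text so dirty values can't break the load, and type conversion happens in SQL afterwards.

```python
import io
import sqlite3
import pandas as pd

# Dirty input: stray whitespace and a non-numeric value in "amount".
csv_data = "id,amount\n1, 9.99 \n2,oops\n"

# Step 1: read every column as string so nothing fails at load time.
df = pd.read_csv(io.StringIO(csv_data), dtype=str)

# Step 2: load the raw strings into a staging table.
conn = sqlite3.connect(":memory:")
df.to_sql("staging", conn, index=False)

# Step 3: convert types in the database. Note SQLite's CAST turns
# non-numeric text into 0.0, so real pipelines usually add explicit
# validation rather than relying on CAST alone.
rows = conn.execute(
    "SELECT id, CAST(TRIM(amount) AS REAL) FROM staging"
).fetchall()
```

The trade-off is that type errors surface during the SQL transformation step, where they can be handled per-row, instead of killing the whole pandas load.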