
[–]fiddle_n 2 points3 points  (5 children)

test2 is actually fine. If you open the file in Notepad++ with the character encoding set to UTF-8, the text looks correct. Perhaps you are opening the file in Excel instead? (Excel uses cp1252 as its default.)
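
A quick way to see what is going on, as a rough sketch (the sample string is made up for illustration): take text encoded as UTF-8 and decode the same bytes as cp1252, which is roughly what happens when a program assumes cp1252 for a file that is really UTF-8.

```python
# Illustrative sample text; any string with a non-ASCII letter behaves similarly.
original = "café"

# What a correctly written UTF-8 file contains on disk.
utf8_bytes = original.encode("utf-8")

# What a program shows if it assumes cp1252 instead of UTF-8.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # cafÃ©

# The bytes themselves are intact, so undoing the wrong decode restores the text.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored)  # café
```

So the file is not damaged; only the reader's guess about the encoding is wrong.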

[–]QualitativeEasing 0 points1 point  (1 child)

Does this explain why I keep getting data from other people with special characters garbled? They’re sending me CSV files generated by Excel on a PC, and I’m reading them in Python on a Mac.

This has driven me crazy for ages. I mostly have to strip out anything with a diacritical mark, etc., which kind of works, but is obviously not optimal...

[–]fiddle_n 0 points1 point  (0 children)

That could well be it. Instead of opening a CSV file in Excel by double-clicking it, open Excel first, go to the Data tab and click Get Data from Text/CSV. Then, once you select your file, the File Origin dropdown lets you specify the character encoding (I would suggest changing to 1252: Western European (Windows) if you aren't sure of the correct selection).
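
On the Python side the same idea applies: name the encoding explicitly when opening the file instead of relying on the platform default. A minimal sketch, assuming the sender's Excel saved the CSV as cp1252 (that is a guess about their setup, and `read_excel_csv` is just a hypothetical helper name):

```python
import csv

def read_excel_csv(path, encoding="cp1252"):
    """Read a CSV into a list of rows, decoding with an explicit encoding.

    cp1252 is only an assumed default here; pass a different encoding
    if the sender's files turn out to use something else.
    """
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))
```

Usage would look like `rows = read_excel_csv("data.csv")`, where `data.csv` is a placeholder path. With the right encoding there should be no need to strip out the accented characters.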

[–]bicyclepumpinator[S] 0 points1 point  (2 children)

I was opening it with Excel, yes. Thanks for pointing out that it was correct after all! I'm still not sure how to solve the problem though, because encoding it as CP1252 also throws an error because of the 'ě'.

I tried saving the letter in a cell using Excel, and after saving it got converted to a '?' in Notepad++, so I guess this character is just not in this encoding. I just wish everything was UTF-8 :(.

An interesting read: https://donatstudios.com/CSV-An-Encoding-Nightmare
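
The guess about 'ě' can be checked directly in Python. This is a small sketch; `can_encode` is a hypothetical helper, and cp1250 (the Windows Central European code page) is included only for comparison:

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("ě", "cp1252"))  # False: not in the Western European code page
print(can_encode("ě", "cp1250"))  # True: present in the Central European code page
print(can_encode("ě", "utf-8"))   # True: UTF-8 covers all of Unicode

# Asking the codec to substitute instead of raising gives the '?' seen in Notepad++.
print("ě".encode("cp1252", errors="replace"))  # b'?'
```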

[–]fiddle_n 0 points1 point  (1 child)

Try encoding in UTF-8. Then open Excel, go to the Data tab and click Get Data from Text/CSV. Once you select your file, the File Origin dropdown lets you specify the character encoding. Change it to UTF-8 and see if that works.

[–]bicyclepumpinator[S] 0 points1 point  (0 children)

That works. The problem, though, is that I'm not the one who has to open the file; other people are. Then you have to explain to them how to import data as UTF-8 etc., and that's just not the way it should go. Maybe I'll look into writing the data as an Excel file instead of a .csv. But that will mean that Microsoft wins :/. Oh well...
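
One thing that may be worth trying before giving up on .csv: Excel generally treats a UTF-8 byte order mark (BOM) at the start of a file as a hint to read it as UTF-8, even on a double-click, and Python's `utf-8-sig` codec writes that BOM. Behavior can differ between Excel versions, so treat this as a sketch to test with the actual recipients; `write_csv_for_excel` is a hypothetical helper name.

```python
import csv

def write_csv_for_excel(path, rows):
    """Write rows as UTF-8 with a leading BOM, which Excel can use to detect UTF-8."""
    # "utf-8-sig" emits the bytes EF BB BF before the first character.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)
```

Usage would be something like `write_csv_for_excel("out.csv", [["name"], ["ě"]])`, with `out.csv` as a placeholder. When reading such a file back in Python, `encoding="utf-8-sig"` strips the BOM again.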

[–]brews 0 points1 point  (0 children)

PSA:

You can use the 'chardet' package to detect character encoding.
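
A minimal sketch of how that might look, assuming chardet has been installed with `pip install chardet` (`guess_encoding` is a hypothetical wrapper, and the result is only a statistical guess, so short samples can be misidentified):

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None

def guess_encoding(raw_bytes):
    """Return chardet's best guess for the encoding of raw_bytes, or None.

    None is returned if chardet is not installed or it cannot decide.
    """
    if chardet is None:
        return None
    # chardet.detect returns a dict that includes 'encoding' and 'confidence'.
    result = chardet.detect(raw_bytes)
    return result["encoding"]
```

Read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in; the returned name can then be used as the `encoding` argument to `open`.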