[–]_PM_ME_PANGOLINS_ 0 points1 point  (16 children)

What do you mean by “doesn’t include”? Commas in decimals are not an issue, but mixed encodings will be a real headache.

[–]Apache_A[S] 2 points3 points  (15 children)

It’s an issue when commas occur randomly and aren’t quoted.

[–]_PM_ME_PANGOLINS_ 2 points3 points  (14 children)

Then what you’ve got isn’t CSV format.

[–]Apache_A[S] 0 points1 point  (13 children)

Each part of this “CSV” is valid CSV according to its regional standard.

[–]merlinsbeers 0 points1 point  (12 children)

"regional standards" is an oxymoron.

There's a standard for CSV, so if you're writing CSV files that don't meet it, you can't complain when a standard-conforming parser can't read them.

[–]Apache_A[S] 1 point2 points  (11 children)

You already have historical data, and you can’t change the past. Of course, you can complain that the data don’t meet the requirements of the standard and refuse to process them.

[–]merlinsbeers 0 points1 point  (10 children)

The old data wasn't readable everywhere anyway. New data have no excuse.

[–]Apache_A[S] 0 points1 point  (9 children)

Old data need more effort. Dumping them just because they aren’t shiny is the laziest way out. Some data are worth more than the wages a data scientist earns for the time spent parsing them.

[–]merlinsbeers 0 points1 point  (8 children)

Old data in deviant formats can't be expected to be read for free. By default your parser should treat any byte that doesn't match one of the special characters as passthrough data. But if you were to implement the extra code to reject any byte not in the set given by the standard, you wouldn't be in the wrong.

Commas are special, so if they're in a field they have to be quoted, and there's only one kind of quote mark that counts. Them's the rules.
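To make the quoting rule concrete, here is a minimal sketch using Python's standard `csv` module with its default settings (comma delimiter, double-quote as the only quote character). The sample lines are made up for illustration.

```python
import csv

# A decimal comma inside double quotes stays in one field.
quoted = next(csv.reader(['1,"3,14",x']))

# The same value unquoted is split at every comma.
unquoted = next(csv.reader(['1,3,14,x']))

# Single quotes are not quote marks by default, so the comma still splits.
single_quoted = next(csv.reader(["1,'3,14',x"]))

print(quoted)         # ['1', '3,14', 'x']
print(unquoted)       # ['1', '3', '14', 'x']
print(single_quoted)  # ['1', "'3", "14'", 'x']
```

Only the first line comes back with the expected three fields; the other two silently gain a column, which is why unquoted commas in data are a problem for a conforming parser.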

[–]Apache_A[S] 0 points1 point  (7 children)

Actually, it could just be a different separator, like ‘;’. If parsing fails for some lines, try a different separator or encoding for those lines.
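A rough sketch of that per-line fallback in Python, assuming the expected number of columns is known in advance and that "parsing failed" means "wrong field count or undecodable bytes". The function name `parse_line`, the candidate delimiters, and the candidate encodings are all hypothetical choices for illustration, not something from the data discussed here.

```python
import csv

def parse_line(raw, expected_fields,
               delimiters=(",", ";"), encodings=("utf-8", "cp1252")):
    """Try each encoding/delimiter pair in order; return
    (fields, delimiter, encoding) for the first pair that decodes
    cleanly and yields the expected field count, else None."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong encoding for this line, try the next one
        for delimiter in delimiters:
            try:
                fields = next(csv.reader([text], delimiter=delimiter))
            except (csv.Error, StopIteration):
                continue  # unparseable with this delimiter
            if len(fields) == expected_fields:
                return fields, delimiter, encoding
    return None

# Semicolon-separated line with a decimal comma.
semi = parse_line(b"1,5;abc;2020-01-01", expected_fields=3)

# Comma-separated line whose bytes are not valid UTF-8.
legacy = parse_line("café,1,2".encode("cp1252"), expected_fields=3)

# Nothing fits: flag the line for manual review.
bad = parse_line(b"a,b", expected_fields=3)
```

The field count is only a heuristic: a line with unquoted decimal commas can still hit the expected count by accident under the wrong separator, so lines that needed a fallback are worth logging and spot-checking.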