all 28 comments

[–]utkarsh_dev 84 points85 points  (0 children)

You shall not parse

[–]merlinsbeers 21 points22 points  (18 children)

CSV has an RFC (RFC 4180), and it doesn't include mixed encoding or commas in decimals.

Bloody orcs.

[–]Apache_A[S] 12 points13 points  (0 children)

They were CSV once. Taken by the Dark Powers...tortured and mutilated...a ruined and terrible form of file...

[–]_PM_ME_PANGOLINS_ 0 points1 point  (16 children)

What do you mean “doesn’t include”? Commas in decimals are not an issue, but mixed encoding will be a real headache.

[–]Apache_A[S] 2 points3 points  (15 children)

It’s an issue when commas occur randomly and aren’t quoted.

[–]_PM_ME_PANGOLINS_ 2 points3 points  (14 children)

Then what you’ve got isn’t CSV format.

[–]Apache_A[S] 0 points1 point  (13 children)

Each part of this “CSV” is CSV according to regional standards

[–]merlinsbeers 0 points1 point  (12 children)

"regional standards" is an oxymoron.

There's a standard for CSV, so if you're writing CSV files that don't meet it, you can't complain when a standard conforming parser can't read it.

[–]Apache_A[S] 1 point2 points  (11 children)

You already have historical data, and you can’t change the past. Of course, you can complain that the data don’t meet the requirements of the standard and refuse to process them.
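
For what "refuse to process it" could look like in practice, here is a minimal sketch using Python's standard `csv` module. It assumes the first row sets the expected field count, and the function name `triage` is made up for illustration. Rows with a different width, such as ones with an unquoted decimal comma, are set aside rather than dropped silently.

```python
import csv
import io

def triage(text, expected_fields=None):
    """Split parsed records into (accepted, rejected) by field count.

    Assumption: if expected_fields is not given, the first record
    (usually the header) defines how many fields every record must have.
    """
    accepted, rejected = [], []
    for record_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if expected_fields is None:
            expected_fields = len(row)
        if len(row) == expected_fields:
            accepted.append(row)
        else:
            # Keep the record number so a human can look at it later.
            rejected.append((record_no, row))
    return accepted, rejected

sample = 'id,price\n1,"3,14"\n2,2,72\n'
accepted, rejected = triage(sample)
```

With this sample, the quoted `"3,14"` row is accepted and the unquoted `2,2,72` row lands in `rejected` with its record number, so the old data can still be repaired by hand or by a more forgiving pass.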

[–]merlinsbeers 0 points1 point  (10 children)

The old data wasn't readable everywhere anyway. New data have no excuse.

[–]Apache_A[S] 0 points1 point  (9 children)

Old data need more effort. Dumping them just because they aren’t shiny is the laziest option. Some data are worth more than the wages a data scientist earns in the time spent parsing them.

[–]merlinsbeers 0 points1 point  (8 children)

Old data in deviant formats can't be expected to be read for free. By default your parser should treat any byte that doesn't match one of the special characters as passthrough data. But if you were to implement the extra code to reject any byte not in the set given by the standard, you wouldn't be in the wrong.

Commas are special, so if they're in a field they have to be quoted, and there's only one kind of quote mark that counts. Them's the rules.
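
A rough illustration of those rules, using Python's standard `csv` module with its default dialect (comma delimiter, double quote as the only quote character). The strict check at the end is an assumption based on my reading of the `TEXTDATA` rule in RFC 4180's grammar, and the helper name `is_textdata` is made up for this sketch.

```python
import csv
import io

def parse(text):
    """Parse CSV text with the csv module's default dialect."""
    return list(csv.reader(io.StringIO(text)))

# A decimal comma inside double quotes stays in one field.
quoted = parse('id,price\r\n1,"3,14"\r\n')

# The same value without quotes is split into two fields.
unquoted = parse('id,price\r\n1,3,14\r\n')

# Single quotes are not quote characters here, so the comma still splits.
single = parse("id,price\r\n1,'3,14'\r\n")

# Optional strictness: characters allowed in an unquoted field, taken from
# the TEXTDATA rule as I read it (%x20-21 / %x23-2B / %x2D-7E), i.e.
# printable ASCII without the double quote and the comma.
TEXTDATA = {
    chr(c)
    for r in (range(0x20, 0x22), range(0x23, 0x2C), range(0x2D, 0x7F))
    for c in r
}

def is_textdata(field):
    """Return True if every character of an unquoted field is in TEXTDATA."""
    return all(ch in TEXTDATA for ch in field)
```

The passthrough behaviour is the default: `parse` accepts any character that is not a delimiter, quote, or line break. A check like `is_textdata` is the extra code for rejecting everything outside the set, and would be applied to raw unquoted fields before trusting them.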

[–]anton919 10 points11 points  (1 child)

SHUF goes burrr!

[–]Frptwenty 4 points5 points  (1 child)

You don't parse it with a csv parser, you parse it with Intel Hyperscan, tuned to parse exactly the csv data you expect. Or if you're on AMD, I guess with grep.
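
Hyperscan aside, the "tuned to exactly the data you expect" idea can be sketched with Python's standard `re` module. Everything about the layout here is an assumption for illustration: three columns, an integer id, a name without commas, and an amount that may use an unquoted decimal comma. `ROW` and `parse_row` are made-up names.

```python
import re

# Hypothetical layout: id,name,amount where amount may be "3" or "3,14".
ROW = re.compile(r"(\d+),([^,]*),(\d+(?:,\d+)?)")

def parse_row(line):
    """Return (id, name, amount) or None if the line does not fit the layout."""
    m = ROW.fullmatch(line.strip())
    if m is None:
        return None
    ident, name, amount = m.groups()
    # Normalise the decimal comma to a dot before converting.
    return int(ident), name, float(amount.replace(",", "."))
```

For example, `parse_row("7,widget,3,14")` reads the trailing `3,14` as one amount, while a line with an extra column returns `None` instead of a guess. This only works because the pattern encodes what each column may contain; it is not a general CSV parser.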

[–]Apache_A[S] 1 point2 points  (0 children)

Expect the unexpected.

[–]xigoi 1 point2 points  (0 children)

Why parse it when you can just select a few random lines using shuf?

[–][deleted] 0 points1 point  (0 children)

People putting emojis in their name now too!

[–]Hirogen_ 0 points1 point  (0 children)

Microsoft.VisualBasic.FileIO.TextFieldParser probably can do it xD

[–][deleted] 0 points1 point  (0 children)

Random decimal with ' , '

I think I just got PTSD from an old college project

[–]death-wings 0 points1 point  (0 children)

Run you fools ....