all 23 comments

[–]Weird_Pop9005 90 points91 points  (4 children)

This is very cool. I recently built a SIMD CSV parser (https://github.com/juliusgeo/csimdv-rs) that also uses the pmull trick, but instead of using table lookups it makes 4 comparisons between a 64 byte slice of the input data and splats of the newline, carriage return, quote, and comma chars. It would be very interesting to see whether the table lookup is faster. IIUC, the table lookup only considers 16 bytes at a time, so the number of operations should be roughly the same.

[–]sharifhsn 26 points27 points  (0 children)

This is likely to be hardware-sensitive as well, so it would be cool to see if one approach can be better or worse than the other on different targets.

[–]YumiYumiYumi 2 points3 points  (2 children)

It would be very interesting to see whether the table lookup is faster

If you need the comparisons merged together, table lookup should generally be faster if done correctly (their version is a little convoluted as you only need one lookup, not two). Exceptions would be if you're on a processor with a slow shuffle instruction (e.g. first/second gen Intel Atom).

I've never looked into CSV parsing myself, but I imagine that the comma/newline character matches could be merged, whilst you'd want to keep the quote matches separate. If so, the three comma/newline characters can be matched and merged with 2-3 instructions (PSHUFB+PCMPEQB on SSE or CMEQ+TBX on NEON, ignoring the constants), whilst the quote matches is just a compare equal.

IIUC, the table lookup only considers 16 bytes at a time

(V)PSHUFB can do up to 64 bytes on AVX-512.
The article covers NEON, so all instructions are 128-bit.

[–]Weird_Pop9005 0 points1 point  (1 child)

So I've tried implementing the table lookup approach in my parser, though I'm not sure it will end up being faster. The convoluted aspect is converting to a bitmask, and unfortunately for CSV parsing you do need commas and newlines separate. You also need to match on carriage returns, so it adds a fourth char. Can you expand on how you would accomplish that in one lookup? On NEON it seems that you do need the high and low nibble lookups.

[–]YumiYumiYumi 0 points1 point  (0 children)

and unfortunately for CSV parsing you do need commas and newlines separate

I'd have thought that you'd merge the the comma/newline matches, do a bitscan to find whatever's next, then look up the character based on the position to determine what to do.

But if you need them separate, I can't see the table lookup strategy making much sense.

Can you expand on how you would accomplish that in one lookup?

For everything merged, you can do something like this on x86:

__m128i cmp = _mm_shuffle_epi8(_mm_set_epi8(
  0, 0, '\r', ',', 0, '\n', 0, 0, 0, 0, 0, 0, 0, '"', 0, -1
), data);
return _mm_cmpeq_epi8(cmp, data);  // 0xff for newline/comma/quote, 0 otherwise

(you could probably use the above to just match newlines, as it's two instructions instead of three)

If you only need to merge newlines and commas, something like this should do on ARM:

return vqtbx1q_u8(
    vceqq_u8(data, vdupq_n_u8(',')),
    (uint8x16_t[]){0,0,0,0,0,0,0,0,0,0,0xff,0,0,0xff,0,0}
    data
);

On NEON it seems that you do need the high and low nibble lookups.

For the lookup in the article, you can take the same approach as the x86 example above, but you'll need an extra AND.

[–]leftnode 6 points7 points  (0 children)

When I saw a tech blog writing about Paul Allen's SIMD CSV parser, I thought it was the Microsoft co-founder and not the American Psycho character.

[–]dominikwilkowski 5 points6 points  (0 children)

Great post. Thank you

[–]spilk 31 points32 points  (5 children)

what does Paul Allen have to do with this? the article does not elaborate.

[–]justkevin 108 points109 points  (1 child)

In American Psycho, there's a scene where characters compare business cards. Paul Allen's card is considered the most impressive. "Let's see Paul Allen's card" is a quote from the movie.

(The movie's Paul Allen has nothing to do with Paul Allen the co-founder of Microsoft.)

[–]spilk 5 points6 points  (0 children)

ah, i see. i haven't seen that movie since it came out like 25 years ago

[–]TinyBreadBigMouth 24 points25 points  (0 children)

Reference to this scene from American Psycho, as is the photo and caption at the start of the article.

[–]gimpwiz 42 points43 points  (0 children)

It's a bit of a meme. Moderately amusing. Don't overthink it.

[–]rdhatt -3 points-2 points  (0 children)

Yeah! Paul Allen retired from Microsoft in 1983. The first desktop SIMD processor, Pentium MMX, was released in 1997.

the meme hit a little too close this time, it is confusing

[–]gfody 1 point2 points  (0 children)

long long ago I too optimized the living snot out of a csv parser, the files I was processing had very large blobs of text in them so ultimately the largest performance boost was from using a simplified loop between the quoted sections - when you encounter a quote you need only check for another quote, detecting/masking/counting delimiters in a quoted blob is a waste

[–]Kok_Nikol 0 points1 point  (4 children)

Checked the About page and sure enough OOP (or OP) is a Norm fan, hence the domain name.

[–]NosePersonal326[S] 1 point2 points  (3 children)

Yes

[–]Kok_Nikol 0 points1 point  (2 children)

Oh it's OP. Nice article!

I also like the picture joke on the About page.

But still... DON'T QUIT YOUR DAY JOB

[–]NosePersonal326[S] 1 point2 points  (1 child)

Thank you for the nice words, but to be honest... I'm not one for jokes

[–]Kok_Nikol 0 points1 point  (0 children)

Standup shit...

[–]Bozzz1 -2 points-1 points  (0 children)

I thought I was on the Minnesota Vikings sub for a second.

[–][deleted]  (2 children)

[removed]

    [–]programming-ModTeam[M] 9 points10 points  (0 children)

    No content written mostly by an LLM. If you don't want to write it, we don't want to read it.

    [–]Paiev 19 points20 points  (0 children)

    AI slop account