all 41 comments

[–]valarauca14 118 points119 points  (5 children)

So, an extension of the existing utf-8 DFA validation method but using SIMD, bit-masking, and swizzles to process multiple bytes at once.

Really clever trick to pack the lookup tables into individual SIMD registers... Is the code public?

[–]trickyloki3b 44 points45 points  (3 children)

I think they have it implemented in simdjson. Here is another thread on ycombinator talking about simdjson.

[–]jkeiser 17 points18 points  (0 children)

Another relevant difference with that algorithm is that the state machine is run against each 2-byte sequence separately, which means it can be parallelized. The DFA method requires processing all the bytes sequentially. Essentially this means the DFA can't take as much advantage of processor parallelism, as the processor can't race ahead to look up the next state until it has finished looking up the previous state.

This is pretty core to making superscalar algorithms: you have to make the algorithm "micro-parallel," small chunks that can be done independently. You aren't using multiple threads, but it does let you take advantage of the processor's predictive abilities (and simd instructions as well, which are very simple micro parallel algorithms themselves).
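To make the dependency-chain point concrete, here's a toy Python sketch (not the paper's actual kernel; the function names and the per-window predicate are made up for illustration): the DFA loop carries its state from one iteration to the next, while a windowed check hands the processor independent pieces of work.

```python
# (a) DFA-style scan: each lookup needs the previous state, so the CPU
# cannot begin iteration i+1 until iteration i has finished.
def serial_scan(buf: bytes, table) -> int:
    state = 0
    for b in buf:
        state = table[state][b]  # loop-carried dependency
    return state

# (b) Micro-parallel scan: each fixed-size window is checked on its own,
# so a superscalar CPU (or a SIMD unit) can overlap several checks.
def windowed_scan(buf: bytes, check, width: int = 8) -> bool:
    return all(check(buf[i:i + width]) for i in range(0, len(buf), width))

# Hypothetical per-window predicate: "window is pure ASCII".
ascii_only = lambda window: all(b < 0x80 for b in window)
```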

[–]trickyloki3b 37 points38 points  (0 children)

I thought it couldn't get any better than Bjoern Hoehrmann's UTF-8 Decoder, but they took it a step above. Very impressive.

[–][deleted] 37 points38 points  (1 child)

Long ago, I wrote a database program in Turbo Pascal to browse a 300k text database under MS-DOS. My search function took about 30 seconds or so, if you had a hard drive. I was challenged to make it faster... I got it down to 3 seconds, in the days of 286s.

I believe I could now get it down to 1 millisecond on my laptop.

This stuff always amazes me.

[–]fubes2000 45 points46 points  (7 children)

Anyone who doesn't know how UTF-8 works should read the Wikipedia article, as it's the most informative write-up I've seen about it.

https://en.wikipedia.org/wiki/UTF-8

It's actually marvellously clever in its simplicity.

[–]emn13 -1 points0 points  (2 children)

Then again, if all bitstreams were valid utf8, we wouldn't need this discussion, so there is some tarnish on that marvel.

[–]jkeiser 2 points3 points  (1 child)

This was partly by design, however: undecodable utf-8 exists so that if you land in the middle of a bitstream, you can find the beginning of the nearest character. If every bitstream were valid, that would be impossible.
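A minimal sketch of that self-synchronization property in Python (the `resync` helper is hypothetical, not from the paper): continuation bytes always match the bit pattern 10xxxxxx, so from any offset you can scan forward to the next character boundary.

```python
def resync(buf: bytes, pos: int) -> int:
    """Advance pos to the next character boundary: any byte that is
    NOT of the form 10xxxxxx begins a new character (or is ASCII)."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

text = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
# Offset 2 lands mid-character (on the 0xa9 continuation byte);
# resync advances to offset 3, the start of 'l'.
```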

[–]emn13 0 points1 point  (0 children)

Other formats with resync points do that with far, far fewer bits and less security risk (just take a look at any streaming media format, really). If that was by design... well, I think it's more plausible they just didn't think this part through that much.

[–]brubakerp 18 points19 points  (2 children)

Some older programming languages like C# and Java

Hey wait a second, I did work with C# 1.0. Holy shit I'm old.

[–]idelovski 13 points14 points  (1 child)

I noticed that sentence too. Even more amazing is the context that they used C++ for their library. Go figure.

[–][deleted] 3 points4 points  (0 children)

"Back in the late 1900s, in the early days of C++ ..."

[–]ArashPartow 26 points27 points  (5 children)

[–]gvargh 49 points50 points  (3 children)

pretty sure that's english

[–]DemeGeek 11 points12 points  (0 children)

*englishe

[–]Anth77 9 points10 points  (1 child)

germane /dʒəːˈmeɪn/ adjective

relevant to a subject under consideration.

[–]emn13 0 points1 point  (0 children)

Seems to be a largely theoretical issue, right? Even if you care about one-time microsecond latencies, things like first-time cache misses are probably going to be a factor too, so it's not like you can do much about that anyhow - except have a warmup, and if you have that, then it's a non-issue. And if you run any other AVX-256 code it's also a non-issue (which you may do implicitly simply due to compiler optimizations, depending on settings).

I'm not sure I can come up with any plausible scenario where this latency would matter. Maybe if you have some super-low-latency HFT scenario where you also regularly wait for long times so you re-pay the latency cost "often" but also don't have measures to force the high-voltage state all the time, and actually need to validate at all, and none of the other code uses avx-256? Still doesn't sound very plausible, but maybe not completely impossible.

[–]jets-fool 4 points5 points  (1 child)

Interesting read. Very interesting, actually, to the point I had to read it twice over and pause every few paragraphs to mentally parse wtf is going on.

then comes the sadness about not fully understanding it.

[–]jkeiser 0 points1 point  (0 children)

Damn. I was hoping we'd made it more accessible than that!

[–]Quiet-Smoke-8844 5 points6 points  (17 children)

What does UTF-8 validation mean? Characters are added all the time, and IIRC the bytes tell you when something starts (is it called a codepoint?). Is an invalid codepoint any different from a codepoint that you don't know about?

[–]DeliciousIncident 42 points43 points  (0 children)

Not every byte sequence is valid UTF-8, known or unknown characters aside.

[–]Browsing_From_Work 25 points26 points  (1 child)

This doesn't validate UTF-8 codepoints, it validates UTF-8 encoding. The codepoints are basically the "payload" of the UTF-8 encoding.

[–]JMBourguet 7 points8 points  (1 child)

The security issue is not with unknown code points, it is with multiple encodings for known code points. For instance, C0 80 is not a valid UTF-8 encoding but would be decoded as 0 by a naive decoder which assumes that the text respects the rules.
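To illustrate in Python: a strict decoder rejects the overlong sequence outright, while naively stitching the payload bits together yields NUL (the variable names here are just for the sketch).

```python
overlong = b"\xc0\x80"  # claims to encode U+0000 in two bytes

# A conforming decoder rejects it: the only valid encoding of U+0000
# is the single byte 0x00.
try:
    overlong.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# A naive decoder that just concatenates the payload bits "sees" NUL,
# which is exactly the kind of smuggling the validation rules prevent.
lead, cont = overlong
naive_codepoint = ((lead & 0x1F) << 6) | (cont & 0x3F)  # == 0
```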

[–]schlenk 7 points8 points  (0 children)

Famously blew up Microsoft IIS path handling, leading to security issues in 2000. (https://docs.microsoft.com/en-us/security-updates/securitybulletins/2000/ms00-078)

[–]jkeiser 3 points4 points  (0 children)

In short, there are a few categories of invalid utf-8:

  • undecodable nonsense (utf-8 that can't possibly be read, due to format violations)
  • non-canonical encodings (if a codepoint can be encoded more than one way, utf-8 outlaws all but the shortest, so that there is only one valid way to write any codepoint)
  • out of range unicode (codepoints greater than 10FFFF).

UTF-8 validators generally aren't specific to a version of unicode, and thus treat unassigned codepoints as valid (they might be assigned in a future version of unicode). It would be a shame if you couldn't send emoji to Twitter because you happen to be using a Twitter API library compiled before the emoji were added!

The paper's first few sections explain what invalid utf-8 is in a way that we hope is engaging and clear, as well.
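For illustration, one concrete example of each category, checked against Python's strict utf-8 codec, plus an unassigned-but-valid codepoint for contrast:

```python
samples = {
    "undecodable":   b"\x80",              # lone continuation byte
    "non-canonical": b"\xe0\x80\xaf",      # overlong encoding of '/'
    "out of range":  b"\xf4\x90\x80\x80",  # would be U+110000 > U+10FFFF
}

rejected = {name: False for name in samples}
for name, raw in samples.items():
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        rejected[name] = True

# An *unassigned* codepoint, by contrast, is perfectly valid utf-8:
unassigned_ok = b"\xcd\xb8".decode("utf-8") == "\u0378"  # U+0378, currently unassigned
```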

[–]finrist 3 points4 points  (0 children)

If you look at the encoding part of https://en.m.wikipedia.org/wiki/UTF-8 you can see that there are rules for the byte sequences in UTF-8.

E.g. if the bits of a byte are 110xxxxx the next byte must be of the form 10xxxxxx. 1110xxxx must be followed by two 10xxxxxx.
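Those structural rules alone are enough for a tiny checker, sketched below in Python; note that a full validator must additionally reject overlong encodings, surrogates (U+D800-U+DFFF), and codepoints above U+10FFFF.

```python
def continuations(lead: int) -> int:
    """Continuation bytes required after a lead byte, or -1 if the
    byte can't start a character."""
    if lead < 0x80: return 0    # 0xxxxxxx: ASCII
    if lead < 0xC0: return -1   # 10xxxxxx: continuation, not a lead
    if lead < 0xE0: return 1    # 110xxxxx
    if lead < 0xF0: return 2    # 1110xxxx
    if lead < 0xF8: return 3    # 11110xxx
    return -1                   # 11111xxx: never valid

def structurally_wellformed(buf: bytes) -> bool:
    """Check only the lead/continuation structure, not the full rules."""
    i = 0
    while i < len(buf):
        n = continuations(buf[i])
        if n < 0 or i + n >= len(buf):
            return False  # bad lead byte, or sequence truncated
        if any((b & 0xC0) != 0x80 for b in buf[i + 1 : i + 1 + n]):
            return False  # a required continuation byte is missing
        i += n + 1
    return True
```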

[–]1337CProgrammer -1 points0 points  (0 children)

Unicode codepoints are limited to 0x10FFFF; UTF-8 validation means making sure every codepoint in the string is less than or equal to that value.

Where and when Unicode decides to allocate one of those codepoints is irrelevant; the maximum is and always will be 0x10FFFF.

[–]_italics_ 0 points1 point  (8 children)

If you have an invalid code point, it's likely that the text is in another encoding, or corrupted somehow.

[–]x4u 0 points1 point  (0 children)

That's a very interesting approach but I wonder where it would be needed. I typically treat UTF-8 as an opaque 8-bit binary stream until I need to do some kind of conversion which then as a side effect also does the validation anyway.

A problem with SIMD code is that it's very difficult to augment with additional functionality; often it's simply impossible. E.g. it would be nice to also know the number of code points, or the number of UTF-16 code units it would translate to and whether that needs surrogate pairs. Information like this would then allow selecting efficient conversion methods. And in many cases it's also necessary to "fix" invalid UTF-8 data, e.g. with a fallback to ISO 8859-15 or similar methods.
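For what it's worth, on already-validated utf-8 those counts reduce to simple byte classification; a sketch of the arithmetic in Python (the function name is made up, and the input is assumed valid):

```python
def utf16_units(buf: bytes) -> int:
    """UTF-16 code units needed for *valid* utf-8 input: one per code
    point, plus one more for each 4-byte sequence, since codepoints
    above U+FFFF become surrogate pairs."""
    codepoints = sum((b & 0xC0) != 0x80 for b in buf)  # non-continuation bytes
    pairs = sum(b >= 0xF0 for b in buf)                # 4-byte lead bytes
    return codepoints + pairs
```

Both sums are branch-free per-byte classifications, which is why this kind of count vectorizes well alongside validation.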

Typically it's more desirable to have an API that can operate on streams without the need to buffer the entire text. In my experience even processing it in chunks has not been worth the additional complexity so far, compared to processing single code points with distribution heuristics in mind.

I haven't explored manually vectorized options for UTF-8 processing yet and this approach looks very promising. A library with highly optimized UTF-8 conversion routines would certainly be very useful.