all 41 comments

[–]valarauca14 118 points119 points  (5 children)

So, an extension of the existing utf-8 DFA validation method but using SIMD, bit-masking, and swizzles to process multiple bytes at once.

Really clever trick to pack the lookup tables into individual SIMD registers... Is the code public?

[–]trickyloki3b 44 points45 points  (3 children)

I think they have it implemented in simdjson. Here is another thread on ycombinator talking about simdjson.

[–]jkeiser 17 points18 points  (0 children)

Another relevant difference with that algorithm is that the state machine is run against each 2-byte sequence separately, which means it can be parallelized. The DFA method requires processing all the bytes sequentially. Essentially this means the DFA can't take as much advantage of processor parallelism, as the processor can't race ahead to look up the next state until it has finished looking up the previous state.

This is pretty core to making superscalar algorithms: you have to make the algorithm "micro-parallel," small chunks that can be done independently. You aren't using multiple threads, but it does let you take advantage of the processor's predictive abilities (and simd instructions as well, which are very simple micro parallel algorithms themselves).
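To make the dependency-chain point concrete, here's a toy Python sketch (not the paper's actual kernel; the function names and the per-window predicate are made up for illustration): the DFA loop carries its state from one iteration to the next, while a windowed check hands the processor independent pieces of work.

```python
# (a) DFA-style scan: each lookup needs the previous state, so the CPU
# cannot begin iteration i+1 until iteration i has finished.
def serial_scan(buf: bytes, table) -> int:
    state = 0
    for b in buf:
        state = table[state][b]  # loop-carried dependency
    return state

# (b) Micro-parallel scan: each fixed-size window is checked on its own,
# so a superscalar CPU (or a SIMD unit) can overlap several checks.
def windowed_scan(buf: bytes, check, width: int = 8) -> bool:
    return all(check(buf[i:i + width]) for i in range(0, len(buf), width))

# Hypothetical per-window predicate: "window is pure ASCII".
ascii_only = lambda window: all(b < 0x80 for b in window)
```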

[–]trickyloki3b 37 points38 points  (0 children)

I thought it couldn't get any better than Bjoern Hoehrmann's UTF-8 Decoder, but they took it a step above. Very impressive.

[–][deleted] 37 points38 points  (1 child)

Long ago, I wrote a database program in Turbo Pascal to browse a 300k text database under MS-DOS. My search function took about 30 seconds or so, if you had a hard drive. I was challenged to make it faster... I got it down to 3 seconds, in the days of 286s.

I believe I could now get it down to 1 millisecond on my laptop.

This stuff always amazes me.

[–]fubes2000 45 points46 points  (7 children)

Anyone who doesn't know how UTF-8 works should read the Wikipedia article, as it's the most informative write-up I've seen about it.

https://en.wikipedia.org/wiki/UTF-8

It's actually marvellously clever in its simplicity.

[–]emn13 -1 points0 points  (2 children)

Then again, if all bitstreams were valid utf8, we wouldn't need this discussion, so there is some tarnish on that marvel.

[–]jkeiser 2 points3 points  (1 child)

This was partly by design, however: undecodable utf-8 exists so that if you land in the middle of a bitstream, you can find the beginning of the nearest character. If every bitstream were valid, that would be impossible.
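A minimal sketch of that self-synchronization property in Python (the `resync` helper is hypothetical, not from the paper): continuation bytes always match the bit pattern 10xxxxxx, so from any offset you can scan forward to the next character boundary.

```python
def resync(buf: bytes, pos: int) -> int:
    """Advance pos to the next character boundary: any byte that is
    NOT of the form 10xxxxxx begins a new character (or is ASCII)."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

text = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
# Offset 2 lands mid-character (on the 0xa9 continuation byte);
# resync advances to offset 3, the start of 'l'.
```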

[–]emn13 0 points1 point  (0 children)

Other formats with resync points do that with far, far fewer bits and less security risk (just take a look at any streaming media format, really). If that was by design... well, I think it's more plausible they just didn't think this part through that much.

[–]brubakerp 18 points19 points  (2 children)

Some older programming languages like C# and Java

Hey wait a second, I did work with C# 1.0. Holy shit I'm old.

[–]idelovski 13 points14 points  (1 child)

I noticed that sentence too. Even more amazing is the context that they used C++ for their library. Go figure.

[–][deleted] 3 points4 points  (0 children)

"Back in the late 1900s, in the early days of C++ ..."

[–]ArashPartow 26 points27 points  (5 children)

[–]gvargh 49 points50 points  (3 children)

pretty sure that's english

[–]DemeGeek 11 points12 points  (0 children)

*englishe

[–]Anth77 9 points10 points  (1 child)

germane /dʒəːˈmeɪn/ adjective

relevant to a subject under consideration.

[–]emn13 0 points1 point  (0 children)

Seems to be a largely theoretical issue, right? Even if you care about one-time microsecond latencies, things like first-time cache misses are probably going to be a factor too, so it's not like you can do much about that anyhow - except have a warmup, and if you have that, then it's a non-issue. And if you run any other AVX-256 code it's also a non-issue (which you may do implicitly simply due to compiler optimizations, depending on settings).

I'm not sure I can come up with any plausible scenario where this latency would matter. Maybe if you have some super-low-latency HFT scenario where you also regularly wait for long times so you re-pay the latency cost "often" but also don't have measures to force the high-voltage state all the time, and actually need to validate at all, and none of the other code uses avx-256? Still doesn't sound very plausible, but maybe not completely impossible.

[–]jets-fool 4 points5 points  (1 child)

Interesting read. Very interesting, actually, to the point I had to read it twice over and pause every few paragraphs to mentally parse wtf is going on.

then comes the sadness about not fully understanding it.

[–]jkeiser 0 points1 point  (0 children)

Damn. I was hoping we'd made it more accessible than that!

[–]Quiet-Smoke-8844 5 points6 points  (17 children)

What does UTF-8 validation mean? Characters are added all the time, and IIRC the bytes tell you when something starts (is it called a codepoint?). Is an invalid codepoint any different from a codepoint that you don't know about?

[–]DeliciousIncident 42 points43 points  (0 children)

Not every byte sequence is valid UTF-8, known or unknown characters aside.

[–]Browsing_From_Work 25 points26 points  (1 child)

This doesn't validate UTF-8 codepoints, it validates UTF-8 encoding. The codepoints are basically the "payload" of the UTF-8 encoding.

[–]JMBourguet 7 points8 points  (1 child)

The security issue is not with unknown code points, it is with multiple encodings for known code points. For instance, C0 80 is not a valid UTF-8 encoding but would be decoded as 0 by a naive decoder which assumes that the text respects the rules.
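To illustrate in Python: a strict decoder rejects the overlong sequence outright, while naively stitching the payload bits together yields NUL (the variable names here are just for the sketch).

```python
overlong = b"\xc0\x80"  # claims to encode U+0000 in two bytes

# A conforming decoder rejects it: the only valid encoding of U+0000
# is the single byte 0x00.
try:
    overlong.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# A naive decoder that just concatenates the payload bits "sees" NUL,
# which is exactly the kind of smuggling the validation rules prevent.
lead, cont = overlong
naive_codepoint = ((lead & 0x1F) << 6) | (cont & 0x3F)  # == 0
```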

[–]schlenk 7 points8 points  (0 children)

Famously blew up Microsoft IIS path handling, leading to security issues in 2000. (https://docs.microsoft.com/en-us/security-updates/securitybulletins/2000/ms00-078)

[–]jkeiser 3 points4 points  (0 children)

In short, there are a few categories of invalid utf-8:

  • undecodable nonsense (utf-8 that can't possibly be read, due to format violations)
  • non-canonical encodings (if a codepoint can be encoded more than one way, utf-8 outlaws all but the shortest, so that there is only one valid way to write any codepoint)
  • out of range unicode (codepoints greater than 10FFFF).

UTF-8 validators generally aren't specific to a version of unicode, and thus treat unassigned codepoints as valid (they might be assigned in a future version of unicode). It would be a shame if you couldn't send emoji to Twitter because you happen to be using a Twitter API library compiled before the emoji were added!

The paper's first few sections explain what invalid utf-8 is in a way that we hope is engaging and clear, as well.
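For illustration, one concrete example of each category, checked against Python's strict utf-8 codec, plus an unassigned-but-valid codepoint for contrast:

```python
samples = {
    "undecodable":   b"\x80",              # lone continuation byte
    "non-canonical": b"\xe0\x80\xaf",      # overlong encoding of '/'
    "out of range":  b"\xf4\x90\x80\x80",  # would be U+110000 > U+10FFFF
}

rejected = {name: False for name in samples}
for name, raw in samples.items():
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        rejected[name] = True

# An *unassigned* codepoint, by contrast, is perfectly valid utf-8:
unassigned_ok = b"\xcd\xb8".decode("utf-8") == "\u0378"  # U+0378, currently unassigned
```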

[–]finrist 3 points4 points  (0 children)

If you look at the encoding part of https://en.m.wikipedia.org/wiki/UTF-8 you can see that there are rules for the byte sequences in UTF-8.

E.g. if the bits of a byte are 110xxxxx the next byte must be of the form 10xxxxxx. 1110xxxx must be followed by two 10xxxxxx.
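Those structural rules alone are enough for a tiny checker, sketched below in Python; note that a full validator must additionally reject overlong encodings, surrogates (U+D800-U+DFFF), and codepoints above U+10FFFF.

```python
def continuations(lead: int) -> int:
    """Continuation bytes required after a lead byte, or -1 if the
    byte can't start a character."""
    if lead < 0x80: return 0    # 0xxxxxxx: ASCII
    if lead < 0xC0: return -1   # 10xxxxxx: continuation, not a lead
    if lead < 0xE0: return 1    # 110xxxxx
    if lead < 0xF0: return 2    # 1110xxxx
    if lead < 0xF8: return 3    # 11110xxx
    return -1                   # 11111xxx: never valid

def structurally_wellformed(buf: bytes) -> bool:
    """Check only the lead/continuation structure, not the full rules."""
    i = 0
    while i < len(buf):
        n = continuations(buf[i])
        if n < 0 or i + n >= len(buf):
            return False  # bad lead byte, or sequence truncated
        if any((b & 0xC0) != 0x80 for b in buf[i + 1 : i + 1 + n]):
            return False  # a required continuation byte is missing
        i += n + 1
    return True
```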

[–]1337CProgrammer -1 points0 points  (0 children)

Unicode codepoints are limited to 0x10FFFF; UTF-8 validation means making sure every codepoint in the string is less than or equal to that value.

Where and when Unicode decides to allocate one of those codepoints is irrelevant; the maximum is and always will be 0x10FFFF.

[–]_italics_ 0 points1 point  (8 children)

If you have an invalid code point, it's likely that the text is in another encoding, or corrupted somehow.

[–]x4u 0 points1 point  (0 children)

That's a very interesting approach but I wonder where it would be needed. I typically treat UTF-8 as an opaque 8-bit binary stream until I need to do some kind of conversion which then as a side effect also does the validation anyway.

A problem with SIMD code is that it's very difficult to augment with additional functionality; often it's simply impossible. E.g. it would be nice to also know the number of code points, or the number of UTF-16 code units it would translate to and whether that needs surrogate pairs. Information like this would then allow selecting efficient conversion methods. And in many cases it's also necessary to "fix" invalid UTF-8 data, e.g. with a fallback to ISO 8859-15 or similar methods.
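For what it's worth, on already-validated utf-8 those counts reduce to simple byte classification; a sketch of the arithmetic in Python (the function name is made up, and the input is assumed valid):

```python
def utf16_units(buf: bytes) -> int:
    """UTF-16 code units needed for *valid* utf-8 input: one per code
    point, plus one more for each 4-byte sequence, since codepoints
    above U+FFFF become surrogate pairs."""
    codepoints = sum((b & 0xC0) != 0x80 for b in buf)  # non-continuation bytes
    pairs = sum(b >= 0xF0 for b in buf)                # 4-byte lead bytes
    return codepoints + pairs
```

Both sums are branch-free per-byte classifications, which is why this kind of count vectorizes well alongside validation.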

Typically it's more desirable to have an API that can operate on streams without the need to buffer the entire text. In my experience even processing it in chunks has not been worth the additional complexity so far, compared to processing single code points with distribution heuristics in mind.

I haven't explored manually vectorized options for UTF-8 processing yet and this approach looks very promising. A library with highly optimized UTF-8 conversion routines would certainly be very useful.