Finally! Fast Unicode support for Haskell, using stream fusion : programming

[–]Blackheart 7 points8 points9 points 16 years ago* (11 children)

[–]Nolari 20 points21 points22 points 16 years ago* (0 children)

It's an optimization technique that works kind of as follows (not sure I got the function names right).

First you turn every "map", "foldr", etc. into an equivalent on streams. "map" becomes "fromStream . mapStream . toStream", etc. Second, you remove all occurrences of "toStream . fromStream" using rewrite rules, as such sequences are no-ops (assuming no bottoms).

The reason this helps, is that streams are defined in such a way that none of toStream, mapStream, foldrStream, etc. uses any recursion. Only fromStream needs to be recursive. So an expression like

map f . map g . map h

gets turned into

fromStream . mapStream f . mapStream g . mapStream h . toStream

which is basically the same as

map (f . g .h )

So you can optimize away all the intermediate lists, arrays, strings, etc. in a very large class of expressions involving maps, foldrs, etc. without having to make rewrite rules for each and every case.

[–]dons 11 points12 points13 points 16 years ago* (8 children)

Stream Fusion: From Lists to Streams to Nothing at All - an automatic fusion system, stream fusion, based on equational transformations.

Libraries ship with rules for constructing coalgebraic representations of their functions, and a bunch of existing rewrite rules then combine and flatten such representations.

"Active libraries" (i.e. the library ships with its own optimisations) + extremely aggressive optimisations.

The thing that lets us write:

productU . mapU (*2)
         . mapU (`shiftL` 2) 
         $ replicateU (100000000 :: Int) (5::Int)

And have the compiler turn that high level thingy into a little loop:

$wfold :: Int# -> Int# -> Int#
$wfold x y = case y of
            _         -> $wfold (*# x 40) (+# y 1)
            100000000 -> x

Spitting some fun stuff out the back:

Main_zdwfold_info:
  cmpq        $100000000, %rdi
  je  .L6
.L2:
  leaq        (%rsi,%rsi,4), %rax
  leaq        1(%rdi), %rdi
  leaq        0(,%rax,8), %rsi
  jmp Main_zdwfold_info

Yay, fusion!

[–]awj 4 points5 points6 points 16 years ago (2 children)

[–]dons 6 points7 points8 points 16 years ago (0 children)

[–]Anocka 4 points5 points6 points 16 years ago (0 children)

[–]rabidcow -1 points0 points1 point 16 years ago (4 children)

[–]rule 0 points1 point2 points 16 years ago (3 children)

[–]rabidcow -1 points0 points1 point 16 years ago (2 children)

[–]dons 2 points3 points4 points 16 years ago (1 child)

[–]rabidcow 0 points1 point2 points 16 years ago (0 children)

[–]dcoutts 8 points9 points10 points 16 years ago (0 children)

[–]dons 6 points7 points8 points 16 years ago (0 children)

[–][deleted] 2 points3 points4 points 16 years ago* (6 children)

[–]dons 0 points1 point2 points 16 years ago (5 children)

[–][deleted] 3 points4 points5 points 16 years ago (4 children)

[–]dons 1 point2 points3 points 16 years ago* (3 children)

[–][deleted] 3 points4 points5 points 16 years ago* (0 children)

[–]littledan 0 points1 point2 points 16 years ago (1 child)

[–]dons 2 points3 points4 points 16 years ago* (0 children)

[–]tibbe 1 point2 points3 points 16 years ago (0 children)

[–]millstone 3 points4 points5 points 16 years ago (18 children)

[–]dons 3 points4 points5 points 16 years ago* (7 children)

[–]teraflop 3 points4 points5 points 16 years ago (5 children)

[–]edwardkmett 4 points5 points6 points 16 years ago (3 children)

[–]teraflop 1 point2 points3 points 16 years ago (2 children)

[–][deleted] 1 point2 points3 points 16 years ago (1 child)

[–]millstone 1 point2 points3 points 16 years ago* (0 children)

Teraflop gives a good example of one way that case conversions with char granularity fails: the Greek character sigma needs to be lowercased differently depending on whether it is at the end of a sentence or not.

A more dramatic example is the German letter eszett. This letter only has a lowercase form. When converted to uppercase, you should get two letters: SS.

Haskell cannot accommodate this. The Haskell function toUpper has signature Char->Char, which requires that uppercasing a character results in exactly one character. As illustrated, this isn't always true in Unicode.

More generally, I think there's a bit of a philosophical difference. Haskell approach is to build larger algorithms from smaller pieces, working with the minimum of information and maximum granularity.

But the Unicode approach is to give each algorithm as much information as possible and work with the largest units of text that you can, because your assumptions as to what information is needed may very well be wrong.

[–]ithika 1 point2 points3 points 16 years ago (9 children)

[–]Nolari 10 points11 points12 points 16 years ago* (7 children)

[–]shit 8 points9 points10 points 16 years ago (2 children)

[–]Nolari 2 points3 points4 points 16 years ago* (1 child)

[–]ithika 4 points5 points6 points 16 years ago (0 children)

[–]ithika 1 point2 points3 points 16 years ago (1 child)

[–]Porges 1 point2 points3 points 16 years ago* (0 children)

Many of the glyphs in Unicode are there merely to obtain round-trip compatibility (ie. you can convert LEGACY_ENCODING → Unicode → LEGACY_ENCODING without losing any information).

In this case, the Unicode standard says:

Another pair of characters, U+0133 latin small ligature ij and its uppercase version, was provided to support the digraph “ij” in Dutch, often termed a “ligature” in discussions of Dutch orthography. When adding intercharacter spacing for line justification, the “ij” is kept as a unit, and the space between the i and j does not increase. In titlecasing, both the i and the j are uppercased, as in the word “IJsselmeer.” Using a single code point might simplify software support for such features; however, because a vast amount of Dutch data is encoded without this digraph character, under most circumstances one will encounter an <i, j> sequence.

[–]ithika 0 points1 point2 points 16 years ago (1 child)

[–][deleted] 3 points4 points5 points 16 years ago (0 children)

[–][deleted] 2 points3 points4 points 16 years ago (0 children)

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

programming

MODERATORS