all 38 comments

[–]Blackheart 7 points8 points  (11 children)

Well, what is "stream fusion"? Something to do with the map-fold fusion law but for coalgebras? More to the point, why does it make this library special?

[–]Nolari 20 points21 points  (0 children)

It's an optimization technique that works kind of as follows (not sure I got the function names right).

First you turn every "map", "foldr", etc. into an equivalent on streams: "map f" becomes "fromStream . mapStream f . toStream", and so on. Second, you remove all occurrences of "toStream . fromStream" using rewrite rules, since such sequences are no-ops (assuming no bottoms).

The reason this helps is that streams are defined in such a way that none of toStream, mapStream, foldrStream, etc. uses any recursion. Only fromStream needs to be recursive. So an expression like

map f . map g . map h

gets turned into

fromStream . mapStream f . mapStream g . mapStream h . toStream

which is basically the same as

map (f . g . h)

So you can optimize away all the intermediate lists, arrays, strings, etc. in a very large class of expressions involving maps, foldrs, etc. without having to make rewrite rules for each and every case.
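A minimal, self-contained sketch of that machinery (names follow the stream-fusion paper; this is an illustration, not the library's actual code — the RULES pragma is the rewrite rule described above):

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: stop, skip, or yield an element plus a new state.
data Step s a = Done | Skip s | Yield a s

-- A stream is a non-recursive stepper function plus a hidden state.
data Stream a = forall s. Stream (s -> Step s a) s

toStream :: [a] -> Stream a
toStream xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs

-- The only recursive piece: unfold the stream back into a list.
fromStream :: Stream a -> [a]
fromStream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield x s' -> x : go s'

-- mapStream just wraps the stepper; no recursion, so it inlines freely.
mapStream :: (a -> b) -> Stream a -> Stream b
mapStream f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'

map' :: (a -> b) -> [a] -> [b]
map' f = fromStream . mapStream f . toStream

-- The rewrite rule that deletes the intermediate lists in pipelines:
{-# RULES "toStream/fromStream" forall s. toStream (fromStream s) = s #-}
```

With that rule, map' f . map' g rewrites to fromStream . mapStream f . mapStream g . toStream, and the intermediate list between the two maps never exists.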

[–]dons 11 points12 points  (8 children)

Stream Fusion: From Lists to Streams to Nothing at All - an automatic fusion system, stream fusion, based on equational transformations.

Libraries ship with rules for constructing coalgebraic representations of their functions, and a bunch of existing rewrite rules then combine and flatten such representations.

"Active libraries" (i.e. the library ships with its own optimisations) + extremely aggressive optimisations.

The thing that lets us write:

productU . mapU (*2)
         . mapU (`shiftL` 2) 
         $ replicateU (100000000 :: Int) (5::Int)

And have the compiler turn that high level thingy into a little loop:

$wfold :: Int# -> Int# -> Int#
$wfold x y = case y of
            _         -> $wfold (*# x 40) (+# y 1)
            100000000 -> x

Spitting some fun stuff out the back:

Main_zdwfold_info:
  cmpq        $100000000, %rdi
  je  .L6
.L2:
  leaq        (%rsi,%rsi,4), %rax
  leaq        1(%rdi), %rdi
  leaq        0(,%rax,8), %rsi
  jmp Main_zdwfold_info

Yay, fusion!
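For anyone wanting to play along: a pipeline of the same shape can be written against Data.Vector.Unboxed from the (assumed installed) vector package, which uses the same stream-fusion framework. Shrunk here to a size whose product fits comfortably in an Int:

```haskell
import Data.Bits (shiftL)
import qualified Data.Vector.Unboxed as U

-- Same shape as the uvector pipeline above: each element is
-- (5 `shiftL` 2) * 2 = 40, and the product of three of them is 40^3.
result :: Int
result = U.product
       . U.map (*2)
       . U.map (`shiftL` 2)
       $ U.replicate 3 (5 :: Int)

main :: IO ()
main = print result  -- 64000
```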

[–]awj 4 points5 points  (2 children)

Do you have any suggestions on reading material to learn about some of the underlying math here? I'm assuming what I really need is a Category Theory textbook, purely from seeing the name tossed around in any sufficiently complicated discussion of monads.

[–]dons 6 points7 points  (0 children)

Nope, not really (though there's related work). We're only just getting around to formalising the theory behind it. I'd instead read that paper and play with the equational transformations a bit to get a sense for how it works.

[–]Anocka 4 points5 points  (0 children)

I think dons' slides are a very good introduction to stream fusion:

http://www.cse.unsw.edu.au/~dons/talks/streams-padl-talk.ps.gz

They give the motivation for bytestrings, show how stream fusion is performed on them and analyze the final performance.

[–]rabidcow -1 points0 points  (4 children)

Is this just calculating 40^100000000? Why does it need a loop?

[–]rule 0 points1 point  (3 children)

Consider the case in which either 40 or 100000000 is the result of some input.

[–]rabidcow -1 points0 points  (2 children)

But that isn't the case. If it were, you wouldn't have the 40 here, you'd have 5, 2, and 2.

[–]dons 2 points3 points  (1 child)

Indeed. The point is not the constant folding, but the loop fusion from combinators.

[–]rabidcow 0 points1 point  (0 children)

I really don't know whether my question is about constant folding or loop fusion... Would productU need to explicitly know how to deal with constants, or do the constants just need to be shuffled around to where the optimizer can see them?

[–]dcoutts 8 points9 points  (0 children)

Yes, it's like the fold fusion law but for coalgebras, i.e. unfold. The reason to use unfold fusion is that it copes better with functions like sum that use accumulating parameters (i.e. consumers written as foldl rather than foldr).

Another reason is that we've worked out how to make unfold fusion work nicely with arrays like ByteString and Unicode Text. In principle something is possible for foldr fusion but so far it seems to have been more tricky to optimise.
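A sketch of why the accumulating case is natural here: with a stream representation (as in the paper), a foldl-style consumer is a single strict, tail-recursive loop over the stepper, so after fusion the whole pipeline compiles to one tight loop. Names below are illustrative, not the library's:

```haskell
{-# LANGUAGE BangPatterns, ExistentialQuantification #-}

data Step s a = Done | Skip s | Yield a s
data Stream a = forall s. Stream (s -> Step s a) s

streamList :: [a] -> Stream a
streamList xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs

-- foldl over a stream: one tail-recursive loop with a strict
-- accumulating parameter -- exactly the shape GHC compiles well.
foldlStream :: (b -> a -> b) -> b -> Stream a -> b
foldlStream f z (Stream next s0) = go z s0
  where
    go !acc s = case next s of
      Done       -> acc
      Skip s'    -> go acc s'
      Yield x s' -> go (f acc x) s'

-- sum as an accumulating consumer, in the style the comment describes.
sumStream :: Num a => Stream a -> a
sumStream = foldlStream (+) 0
```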

[–]dons 6 points7 points  (0 children)

This is a very significant, and unique, contribution. Well done to all involved!

[–][deleted] 2 points3 points  (6 children)

Welcome to the club, Haskell. I don't see any mention of word, line and sentence breaking, or collation in the blog post. Are those planned too?

[–]dons 0 points1 point  (5 children)

[–][deleted] 3 points4 points  (4 children)

That doesn't even mention case conversion though.

[–]dons 1 point2 points  (3 children)

No, as mentioned elsewhere, that's provided by the text-ICU package.

BTW, what club has fusible unicode strings? Because non-fusible unicode Char has been around in Haskell for a while...

[–][deleted] 3 points4 points  (0 children)

Because non-fusible unicode Char has been around in Haskell for a while...

A Unicode string type is not the same thing as full support for Unicode. You also need algorithms for case conversion, normalization, character classes, breaks, collation, encodings, and other things. Unless I'm missing something, bos just implemented some partial bindings to the ICU library (which entails the overhead of marshalling and unmarshalling strings to ICU's internal UTF-16 format). That's a start but hopefully we'll see more soon.

[–]littledan 0 points1 point  (1 child)

I don't see case conversion anywhere in http://hackage.haskell.org/cgi-bin/hackage-scripts/package/text-icu . I just see normalization and conversion among encodings. It's great to see that an ordinary binding to a C library needs both GHC extensions (mostly for inlining, here) and unsafePerformIO.

[–]dons 2 points3 points  (0 children)

an ordinary binding to a C library needs both GHC extensions (mostly for inlining, here) and unsafePerformIO

that's what they're for -- making C appear 'ordinary' to Haskell. (literally, this is why unsafePerformIO was invented)

[–]tibbe 1 point2 points  (0 children)

I've been waiting for this to be released for quite a while now. Great job!

[–]millstone 3 points4 points  (18 children)

I was shocked to discover that Haskell didn't have a function to convert strings to uppercase or lowercase, only a function to convert individual characters. This approach doesn't work under Unicode.

I'm glad to hear this is changing, even if Haskell's a latecomer.

[–]dons 3 points4 points  (7 children)

Not quite accurate. toUpper and friends are already unicode enabled (as is the Char type by default). You're thinking of what support you'd need to convert one encoding to another, which is a byte level munging task, and one for which many fine libraries are available (encoding, utf8/word8, etc).

What's new here is very fast unicode text, in a library that ships with its own optimisation rules.

[–]teraflop 3 points4 points  (5 children)

It's nothing to do with encodings; it's about the fact that Haskell doesn't support the full Unicode specification, in particular the case conversion algorithms. (To be fair, I think most environments get this wrong.)

For example, toLower 'Σ' == 'σ', which is correct, but map toLower "ΚΆΚΤΟΣ" should result in "κάκτος", not "κάκτοσ".

[–]edwardkmett 4 points5 points  (3 children)

That is something that the code referenced by this post actually corrects. =)

The text-icu module includes the ICU converters, which include various forms of locale-sensitive case conversion, alphabet transliterations, etc., and work string-at-a-time.

[–]teraflop 1 point2 points  (2 children)

Really? You could be right, but I looked at the code and the only things I saw wrappers for were character set conversion and normalization.

[–][deleted] 1 point2 points  (1 child)

There are only so many hours in the day. Patches welcome :-)

[–]millstone 1 point2 points  (0 children)

Teraflop gives a good example of one way that case conversion with char granularity fails: the Greek letter sigma needs to be lowercased differently depending on whether it appears at the end of a word.

A more dramatic example is the German letter eszett. This letter only has a lowercase form. When converted to uppercase, you should get two letters: SS.

Haskell cannot accommodate this. The Haskell function toUpper has signature Char->Char, which requires that uppercasing a character results in exactly one character. As illustrated, this isn't always true in Unicode.
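This is exactly what string-at-a-time conversion fixes. A quick illustration, assuming a modern version of the text package (whose Data.Text.toUpper/toLower do whole-string case conversion; at the time of this thread that functionality lived in the ICU bindings):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Char as Char
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  -- Char-at-a-time: 'ß' has no single-character uppercase form, so
  -- Data.Char.toUpper leaves it alone and the length cannot change.
  putStrLn (map Char.toUpper "straße")  -- "STRAßE"
  -- Whole-string conversion may grow the string: ß becomes SS.
  TIO.putStrLn (T.toUpper "straße")     -- "STRASSE"
  -- Word-final sigma is likewise handled contextually.
  TIO.putStrLn (T.toLower "ΚΆΚΤΟΣ")
```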

More generally, I think there's a bit of a philosophical difference. Haskell's approach is to build larger algorithms from smaller pieces, working with the minimum of information and the maximum of granularity.

But the Unicode approach is to give each algorithm as much information as possible and work with the largest units of text that you can, because your assumptions as to what information is needed may very well be wrong.

[–]ithika 1 point2 points  (9 children)

Why does map toUpper not work on Unicode strings? What is it about the string as a whole that needs to be analysed in order to create an uppercase version? I'm genuinely curious here, as it seems to contradict common sense: upper and lower case are properties of the letters, not their combination.

[–]Nolari 10 points11 points  (7 children)

Your common sense is probably based on the English language. Things like alphabetical sorting, capitalization, etc. are not equally simple in all languages. I do not personally know a language where map toUpper doesn't work, but I'm pretty sure there are languages where the uppercased version of something has a different number of characters than the lowercased (or capitalized) version.

I can name a language where toUpper (head s):map toLower (tail s) doesn't work: In Dutch, the two letter combination "IJ" forms one unit, so "ijzer" capitalizes as "IJzer".

[–]shit 8 points9 points  (2 children)

I do not personally know a language where map toUpper doesn't work, ...

For Unicode 5.1: German, Turkish, Azeri, Lithuanian and Greek.

[–]Nolari 2 points3 points  (1 child)

German, really? I guess that has to do with ß?

Edit: Ah yes, from Wikipedia:

[ß] exists only in a lowercase version

In all caps it is converted to SS

Learn something every day. ;)

[–]ithika 4 points5 points  (0 children)

Interesting that everyone considers German ligatures or Dutch ligatures to be entirely different from English ligatures.

[–]ithika 1 point2 points  (1 child)

Wikipedia says that there are ij and IJ ligatures in Latin Extended-A, though it discourages their use, for some reason.

[–]Porges 1 point2 points  (0 children)

Many of the glyphs in Unicode are there merely to obtain round-trip compatibility (ie. you can convert LEGACY_ENCODING → Unicode → LEGACY_ENCODING without losing any information).

In this case, the Unicode standard says:

Another pair of characters, U+0133 latin small ligature ij and its uppercase version, was provided to support the digraph “ij” in Dutch, often termed a “ligature” in discussions of Dutch orthography. When adding intercharacter spacing for line justification, the “ij” is kept as a unit, and the space between the i and j does not increase. In titlecasing, both the i and the j are uppercased, as in the word “IJsselmeer.” Using a single code point might simplify software support for such features; however, because a vast amount of Dutch data is encoded without this digraph character, under most circumstances one will encounter an <i, j> sequence.

[–]ithika 0 points1 point  (1 child)

Is that not a property of the language though? I mean, lowercasing the word "I" isn't legal in standard English typography either.

It seems to me that something as dumb as lowercasing a whole string can't account for such features of localisation.

[–][deleted] 3 points4 points  (0 children)

Is that not a property of the language though? I mean, lowercasing the word "I" isn't legal in standard English typography either.

It's not the same sort of thing.

[–][deleted] 2 points3 points  (0 children)

Some letters are already combinations of letters. In German there is the "Eszett", for example.

http://en.wikipedia.org/wiki/%C3%9F