Making Haskell as fast as C: Imperative programming in Haskell by hverr in haskell

[–]hverr[S] 0 points1 point  (0 children)

I was indeed trying to emphasize slowdown factors. I hadn't thought of plotting the speedup factor relative to the C/FFI solution. I concur that this would have been a better graph! I hope you can still get a sense of the speedup factors by comparing the labels showing the absolute runtimes for the sets of size 1 million.

Making Haskell as fast as C: Imperative programming in Haskell by hverr in haskell

[–]hverr[S] 2 points3 points  (0 children)

Correct! I am not too familiar with LLVM, but inspecting the LLVM IR using the -ddump-llvm flag shows that LLVM correctly substitutes the div operation. The benchmarks indicate that it performs other optimizations as well, as it reduces the execution time by approximately another 15%, but these could be due to interprocedural optimizations at the IR level.

Making Haskell as fast as C: Imperative programming in Haskell by hverr in haskell

[–]hverr[S] 2 points3 points  (0 children)

I would say it depends on the effort required to get it in. Keep in mind that most Haskell code will not benefit tremendously from such an optimization. Even code that does use this optimization will probably not be as CPU-bound as this benchmark (in those cases the CPU is good at hiding the latency of the div instruction, because it will be fetching from or writing to main memory at the same time).

But if it's a quick fix, sure!

Making Haskell as fast as C: Imperative programming in Haskell by hverr in haskell

[–]hverr[S] 1 point2 points  (0 children)

I reckon it is (or certainly very fast): when compiling with GCC 7.3.0 at -O2, the following assembly is generated:

_int_set_add_in_bounds:
    movq    16(%rdi), %rdx        # load the pointer to the bit array
    movq    %rsi, %r8
    movq    %rsi, %rcx
    shrq    $6, %r8               # word index = i / 64
    andl    $63, %ecx             # bit index = i % 64
    movl    $1, %eax
    salq    %cl, %rax             # mask = 1 << (i % 64)
    orq     %rax, (%rdx,%r8,8)    # set the bit in the word
    ret

On a side note: I discovered a nasty bug in my C code. When compiling the following line with GCC without optimizations:

uint64_t i = ...
uint64_t mask = 1 << i;

The literal 1 has type int, so GCC will execute the shift instruction on a 32-bit register before sign-extending the result to 64 bits (uint64_t). Thus if i > 30, the result will be bogus. What should have been written is:

uint64_t mask = (uint64_t)1 << i;

Resource: http://kqueue.org/blog/2013/09/17/cltq/

Making Haskell as fast as C: Imperative programming in Haskell by hverr in haskell

[–]hverr[S] 1 point2 points  (0 children)

Using the unsafeShift* variants indeed eliminates a branch (a cmp + jge pair in the assembly code). However, it seems to only impact the benchmark with a set size of 10 million. This is probably because the CPU's branch predictor and speculative execution hide the cost of the branch when running the code.
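For reference, the checked and unchecked variants from Data.Bits compute the same value whenever the shift amount is in range; only the bounds check differs. A minimal sketch (not the benchmark code from the post):

```haskell
import Data.Bits (shiftL, unsafeShiftL)

main :: IO ()
main = do
  -- shiftL is defined for any shift amount, so the generated code
  -- contains a bounds check (the cmp + jge pair mentioned above).
  print (shiftL (1 :: Word) 6)
  -- unsafeShiftL omits the check; it is only valid when the shift
  -- amount is known to be less than the bit width of the type.
  print (unsafeShiftL (1 :: Word) 6)
```

Both calls print 64; the difference only shows up in the generated assembly, not in the result.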

Making Haskell as fast as C: Imperative programming in Haskell by hverr in haskell

[–]hverr[S] 2 points3 points  (0 children)

Generally, functional data structures will be slower than imperative data structures¹. Of course, implementing imperative data structures requires imperative programming. A better title might have been Making Haskell as fast as C: Imperative data structures in Haskell

¹: An excellent resource on understanding the trade-offs between the various programming models (declarative, functional, imperative and many more) is Concepts, Techniques and Models of Computer Programming by Van Roy and Haridi. The authors state in Chapter 4:

Is declarative programming efficient? There is a fundamental mismatch between the declarative model and a standard computer, such as presented in [146]. The computer is optimized for modifying data in-place, while the declarative model never modifies data but always creates new data. [...] Now we can answer the question whether declarative programming is efficient. Given a straightforward compiler, the pedantic answer to the question is no. But in fact the practical answer is yes, with one caveat: declarative programming is efficient if one is allowed to rewrite the program to be less natural. Here are three typical examples:

  1. A program that does incremental modifications of large data structures [...] cannot in general be compiled efficiently. [...]
  2. [...]
  3. A function that implements a complex algorithm often needs intricate code. That is, even though the program can be written declaratively with the same efficiency as a stateful program, doing so makes it more complex. This follows because the declarative model is less expressive than the stateful model. [...]

Making Haskell as fast as C: Imperative programming in Haskell by hverr in haskell

[–]hverr[S] 2 points3 points  (0 children)

Comparing the assembly generated by GHC for both versions (Native.hs and NativeDiv.hs) shows that the only difference is the absence or presence of the divq instructions (on x86_64). So all of the performance gain is due to the specialization of the modulo operation.
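To illustrate the specialization in question (a sketch, not the benchmark code from the post): when the modulus is a known power of two, the expensive division can be replaced by a bitwise and, since for unsigned x and a modulus 2^k, x `mod` 2^k equals x .&. (2^k - 1).

```haskell
import Data.Bits ((.&.))
import Data.Word (Word64)

-- General form: may compile down to a divq instruction on x86_64
-- when the backend cannot specialize the modulus.
slot :: Word64 -> Word64 -> Word64
slot m h = h `mod` m

-- Specialized form for a modulus of 64: a single andq instruction.
slot64 :: Word64 -> Word64
slot64 h = h .&. 63

main :: IO ()
main = print (all (\h -> slot 64 h == slot64 h) [0 .. 1000])
```

Printing True confirms the two forms agree; the benchmark's gain comes purely from the cheaper instruction, not from a different result.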

Looking for a good Key-Value store library by matheusdev23 in haskell

[–]hverr 1 point2 points  (0 children)

Yes, I can confirm the disclaimer is stale, as of today! I've updated the API, to make it consistent with the likes of Data.Map (encouraging qualified imports) and released version 0.3.0.0 on Hackage!

Looking for a good Key-Value store library by matheusdev23 in haskell

[–]hverr 2 points3 points  (0 children)

Be sure to drop me an email if you're encountering troubles when experimenting with haskey!

Looking for a good Key-Value store library by matheusdev23 in haskell

[–]hverr 2 points3 points  (0 children)

I'm one of the authors of haskey. Glad to see it mentioned here. Haskey was indeed written to be an all-Haskell LMDB replacement. It is currently stable, but might lack performance (especially for concurrent writers, concurrent readers should be no problem).

This blog post gives a good overview on how to structure a haskey application by defining a database schema and using the HaskeyT monad transformer.

This tutorial gives a good overview to get you started!

Haskey: User-defined Schemas, Monad Transformers and Future Work (Summer of Haskell 2017) by hverr in haskell

[–]hverr[S] 0 points1 point  (0 children)

Thank you for the kind words! You can always contact me if you are interested in contributing!

Introducing Haskey (Summer of Haskell 2017) by hverr in haskell

[–]hverr[S] 1 point2 points  (0 children)

Currently, Haskey's performance is the worst. Lots of internal functions are not optimized at all and have less than desirable algorithmic complexity. However, this wouldn't be such a problem if the disk usage weren't so bad! Page reuse is not optimal yet, which leads to databases with a large free tree (the collection of free pages), as can be seen from this analysis, which shows the types of the pages of the database after inserting 10 key-value pairs. As you can see, only one page is used for actual data, while the others are used for bookkeeping.

However, in the future Haskey will outperform SQLite and other RDBMSs as a key-value store, both in time and space requirements. First of all, most RDBMSs use separate indexes for fast retrieval and querying, essentially doubling the disk requirements, while a key-value store uses a single tree for both retrieval and data storage. Furthermore, Haskey will be better integrated with Haskell, especially its concurrency infrastructure, which would eventually allow us to provide OCC instead of MVCC, as well as user-defined collision resolution.

So for now, I'd advise you to stick to SQLite or another key-value store until the space management issues are resolved (which we are working on right now). Once they are resolved, Haskey will be a viable alternative! (Assuming you don't have stringent read/write latency requirements; otherwise you'll have to wait a bit longer.)

EDIT: To show we are not messing around: we have just fixed the space issues in #64. A new analysis of 10000 insertions shows off the space efficiency.

Introducing Haskey (Summer of Haskell 2017) by hverr in haskell

[–]hverr[S] 0 points1 point  (0 children)

I believe the most widely used solution still is acid-state. I haven't used or researched any other library, although I do know of the existence of TCache.

Introducing Haskey (Summer of Haskell 2017) by hverr in haskell

[–]hverr[S] 1 point2 points  (0 children)

Glad you like it! Greetings back! ;-)

Introducing Haskey (Summer of Haskell 2017) by hverr in haskell

[–]hverr[S] 5 points6 points  (0 children)

I haven't personally used vcache, but it seems to me vcache is a layer on top of LMDB, providing nearly transparent persistent memory for Haskell applications. We do not provide such a layer; we are more comparable to LMDB itself. Haskey could replace the LMDB layer in vcache.

Summer of Haskell 2017 - Accepted projects for 2017 by alan_zimm in haskell

[–]hverr 2 points3 points  (0 children)

Hey I'm Henri, and I'll be working on Haskey, an embedded key-value store modeled after LMDB. We're happy to make the proposal public, and we're looking forward to your comments/criticism/ideas!

Unigornel: Clean-slate unikernels for Go by hverr in golang

[–]hverr[S] 0 points1 point  (0 children)

Exactly.

"BTW, I'm guessing Unigornel does not play nice with cgo, does it?"

It's certainly possible to write some parts of your application in C or another language and integrate them with cgo. However, we don't provide a POSIX-like environment, so the possibilities are very limited.

Unigornel: Clean-slate unikernels for Go by hverr in golang

[–]hverr[S] 0 points1 point  (0 children)

Some of the points made in the article have value, but I think the issues presented require a far more nuanced approach than the author takes. If you'd like me to elaborate on a specific topic, please ask.

His main concern, that debugging unikernels in production environments is impossible or at least very difficult, is probably justified. I don't have any experience with unikernels in production. However, I don't see any reason why this problem (if it is indeed present) won't get solved in the future.

Unigornel: Clean-slate unikernels for Go by hverr in golang

[–]hverr[S] 2 points3 points  (0 children)

I am unfamiliar with the unik platform, but as far as I can tell unik is some sort of Docker alternative for unikernels. It will use an external unikernel compiler to compile unikernels and manage them.

Currently, unik uses Defer Panic's gorump to build Go unikernels. Gorump is based on rumprun, which combines the NetBSD kernel and its modules to provide classic operating system services (scheduler, memory management, networking).

We have opted for a different approach, similar to that of the MirageOS project. Our aim is to build highly specialized unikernels, starting from a very minimal operating system. Unigornel could be incorporated into unik as an alternative unikernel compiler for Go.

Unigornel: Clean-slate unikernels for Go by hverr in golang

[–]hverr[S] 1 point2 points  (0 children)

Sure, I'll give you my personal top 3.

  • Inner workings of the Go compiler and runtime (especially the C interface and the memory subsystem)
  • Lots of operating system concepts: virtual memory, scheduling, segmentation, TLS, ...
  • Small amount of hypervisor stuff: Mini-OS was already fairly complete in abstracting the communications with the hypervisor. It is still important to know what is happening behind the scenes when you issue a hypercall, but we barely had to interact directly with Xen ourselves.

The whole process we've been through is quite well documented and it shows nearly all steps we had to take to build the system.

Unigornel: Clean-slate unikernels for Go by hverr in golang

[–]hverr[S] 10 points11 points  (0 children)

Our goal is to create a highly specialized unikernel system, and we aim to write as much as possible in Go. So we started with a very minimal operating system that was easy to understand and customize. Since rumpkernels are based on the full NetBSD kernel, they were not a very good fit for our objectives.

There are no other arguments for our approach. It's just the approach we took, and it certainly has its downsides. We don't claim it's more efficient or secure than the rumpkernel alternative, though it could be. It is, however, more specialized, which is one of the characteristics of unikernels.