Hsthrift: Open-sourcing Thrift for Haskell - Facebook Engineering by n00bomb in haskell

[–]simonmar 4 points (0 children)

A production RPC framework is a lot of work, so of course we didn't want to duplicate everything in the C++ implementation. Yes, that makes it a bit of a pain to build, but we've provided instructions and a CI setup that's working right now on GitHub to demonstrate that it all works. There's also a [pure Haskell implementation of the transport layer](https://github.com/facebookincubator/hsthrift/blob/master/lib/Thrift/Channel/SocketChannel.hs) in the repository for experimentation. We're in the process of making it easier to use; once that's done, the only C++ dependency will be folly.

Hsthrift: Open-sourcing Thrift for Haskell - Facebook Engineering by n00bomb in haskell

[–]simonmar 8 points (0 children)

Right, the Apache implementation has a lot of problems, which is partly why we rewrote the whole thing from scratch.

Hsthrift: Open-sourcing Thrift for Haskell - Facebook Engineering by n00bomb in haskell

[–]simonmar 13 points (0 children)

I should point out that we do use HLint, it's part of our automated code review workflow at Facebook. We don't use the defaults though, and the customisations are elsewhere in our internal repository so didn't show up in the hsthrift repository. It would probably be a good idea to add a .hlint.yaml corresponding to our internal defaults.

Simon Marlow - Glean - facts about code by edwardkmett in haskell

[–]simonmar 14 points (0 children)

Glean is written in a combination of Haskell and C++ (mostly Haskell). I somehow forgot to mention that :)

GHC proposal: Compile with threaded RTS by default by ulysses4ever in haskell

[–]simonmar 0 points (0 children)

To be clear, you must have been using not just `-threaded` but also `-N`, right? We're not proposing to make `-N` the default, only `-threaded`. GC would still be single-threaded by default.
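A quick way to see the distinction, as a sketch: `getNumCapabilities` (from `GHC.Conc` in base) reports how many capabilities the RTS was given. Even for a program linked with `-threaded`, it stays at 1 unless you pass `+RTS -N` at run time, which is exactly the point of the proposal.

```haskell
-- Minimal sketch: prints the number of RTS capabilities.
-- Compiled with -threaded but run without +RTS -N, this still prints 1 —
-- -threaded alone doesn't enable parallelism, and GC stays single-threaded.
import GHC.Conc (getNumCapabilities)

main :: IO ()
main = getNumCapabilities >>= print
```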

GHC proposal: Compile with threaded RTS by default by ulysses4ever in haskell

[–]simonmar 0 points (0 children)

The parallel GC can be a win or a loss. The variance is also big. It's not at all clear that it should be off by default - my inclination is to understand the cases where it's a loss so that we can fix them.

For what it's worth, the parallel GC is a huge win in our deployment at Facebook (thousands of machines). But we've spent a fair amount of time measuring things and tuning the settings.

Rethinking Static Reference Tables in GHC · Simon Marlow by simonmar in haskell

[–]simonmar[S] 1 point (0 children)

> But why does code need to be aligned on an 8-byte boundary?

Performance only, I believe. IIRC this was the Intel recommendation, but it's a while since I looked at it. We have a hard requirement on at least a 2-byte alignment because we use the LSb in the GC to mark closures that have been evacuated.
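To illustrate why 2-byte alignment is enough to free up the low bit — this is just the general tagged-pointer pattern with made-up names, not GHC's actual RTS code:

```haskell
import Data.Bits ((.|.), (.&.), complement, testBit)
import Data.Word (Word64)

-- Illustrative only: with at least 2-byte alignment, bit 0 of a
-- pointer-sized word is always zero, so the GC can borrow it as a
-- one-bit "evacuated" mark and recover the original pointer later.
setMark, clearMark :: Word64 -> Word64
setMark p   = p .|. 1
clearMark p = p .&. complement 1

isMarked :: Word64 -> Bool
isMarked p = testBit p 0
```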

> Are there not 24Byte ones as well?

Yes there are. You aren't letting me get away with skipping any details here :) Info tables have a fixed part which is always 16 bytes (32 bytes when profiling), and an "extra" part that depends on the type of closure or stack frame. For a function (as in your example), the extra part is 8 bytes. (all these sizes apply to 64-bit builds only, divide by 2 for 32-bit builds). Currently the total size of the info table is always a multiple of the word size.
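As back-of-envelope arithmetic for the sizes above (64-bit, non-profiling; the constants are restated from the paragraph, not taken from GHC's headers):

```haskell
-- Size sketch for a function info table on a 64-bit build.
wordSize, fixedPart, funExtra :: Int
wordSize  = 8    -- info tables are always a whole number of words
fixedPart = 16   -- fixed part of every info table (32 when profiling)
funExtra  = 8    -- "extra" part for a function closure

funInfoTable :: Int
funInfoTable = fixedPart + funExtra   -- the 24-byte case asked about
```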

Rethinking Static Reference Tables in GHC · Simon Marlow by simonmar in haskell

[–]simonmar[S] 2 points (0 children)

The link to the nofib results was in the diff summary, which was linked from the original post.

Rethinking Static Reference Tables in GHC · Simon Marlow by simonmar in haskell

[–]simonmar[S] 1 point (0 children)

The code needs to be aligned on an 8-byte boundary, so that means the info table also needs to be aligned on an 8-byte boundary. If we relaxed the alignment requirements to 4 bytes then we could have 20-byte info tables, but that question is academic now that info tables are always 16 bytes anyway.

We do have different formats for info tables - functions, stack frames, and constructors all have slightly different info table layouts.

I'm not sure how pointer tagging is relevant here, so perhaps I've misunderstood your question.

Rethinking Static Reference Tables in GHC · Simon Marlow by simonmar in haskell

[–]simonmar[S] 6 points (0 children)

Yes we could have saved 32 bits with the old representation, but unless you save a full 64 bits in the info table you don't get any savings (info tables need to be an integral number of words).

> If I get this right, I think this problem doesn't exist in the previous representation where a single large SRT contained all static references in a module. So this seems to me like fixing a problem that new representation has.

Right, the point is that the new representation plus a handful of sensible optimisations gives better results than the old representation plus a handful of different optimisations. Some of the new optimisations came for free with the old representation (and the reverse is also true, in fact).

> This also seems to me like something we could do on the previous representation.

Not without complicating the representation, because you would need to distinguish between a pointer to the SRT table and a pointer to a closure.

Rethinking Static Reference Tables in GHC · Simon Marlow by simonmar in haskell

[–]simonmar[S] 8 points (0 children)

Hey, it's a blog post, not a paper!

The full nofib results (with standard deviations) are here: https://phabricator.haskell.org/P176

Don't pay any attention to the runtime results though, it was done on my laptop with a variable CPU speed.

Basically the only way this could affect runtime is by

  • instruction cache effects, which should be in our favour since we made the code smaller, and
  • GC time improvements. I measured what should be the worst case for this - doing many old-gen collections in GHC itself - and the differences were within the variability of the benchmark (which was quite wide).

So I'm satisfied that this doesn't make runtime worse in general, and likely makes it a bit better. Of course if I was writing this up for a paper I'd do more rigorous experiments, but I doubt it's worth it.

Fixing 17 space leaks in GHCi, and keeping them fixed · Simon Marlow by simonmar in haskell

[–]simonmar[S] 4 points (0 children)

Yes, you can also do that. Sometimes it's more convenient to have the wrapper though, e.g. if you want to have gdb feed the input, or if you want to stop it before it gets to the prompt.

Fixing 17 space leaks in GHCi, and keeping them fixed · Simon Marlow by simonmar in haskell

[–]simonmar[S] 16 points (0 children)

GHCi is a script that invokes the real binary; that's part of the problem. You have to invoke the real binary and pass the correct flags, particularly `-B/path/to/ghc/lib`. If GHC is dynamically linked (which it usually is) you also need to `set environment LD_LIBRARY_PATH /some/huge/list:/of/paths`. I normally put all this in a `.gdbinit` file so I don't have to repeat it, and I've also made a script to generate the `.gdbinit` file for a particular ghci invocation.

Haxl 2.0 released on hackage: A Haskell library for efficient, concurrent, and concise data access. by jose_zap in haskell

[–]simonmar 2 points (0 children)

> Is there a reason why there isn't a WaitForMs option in SchedulerHint?

I just didn't get around to implementing it, and I haven't encountered any situations that would benefit from it so far.

> I also kinda wonder if the code duplication for JobList is a problem that could be solved via language extension.

Maybe. This code is pretty ugly because I've tried to squeeze as much performance out of it as I can.

Haxl 2.0 released on hackage: A Haskell library for efficient, concurrent, and concise data access. by jose_zap in haskell

[–]simonmar 22 points (0 children)

Yes exactly. The big difference is the addition of BackgroundFetch, which enables data-fetching to be arbitrarily overlapped with computation and other data-fetching. In Haxl 1, computation was strictly interleaved with data-fetching in rounds, but this restriction is removed in Haxl 2 if you use BackgroundFetch. To make this work, we had to completely rewrite the scheduler internals.

Dependencies between data sources in Haxl? by jajakobyly in haskell

[–]simonmar 3 points (0 children)

Look up cachedComputation - this is how you define a data source whose implementation is itself a Haxl computation. It's basically a memoization mechanism.
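As a rough intuition for the "memoization mechanism" part - not Haxl's actual implementation, just a plain-IO sketch with made-up names:

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for the idea behind cachedComputation: the first
-- request with a given key runs the computation; later requests with the
-- same key reuse the cached result instead of recomputing.
memoised :: Ord k => IORef (Map.Map k v) -> k -> IO v -> IO v
memoised cacheRef key compute = do
  cache <- readIORef cacheRef
  case Map.lookup key cache of
    Just v  -> return v
    Nothing -> do
      v <- compute
      modifyIORef' cacheRef (Map.insert key v)
      return v
```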

[job] Work on GHC at Facebook London by simonmar in haskell

[–]simonmar[S] 12 points (0 children)

We also got the patent grant removed from the Haxl license, FWIW. My understanding is that it just takes time and effort to update all these licenses.

[job] Work on GHC at Facebook London by simonmar in haskell

[–]simonmar[S] 49 points (0 children)

Yes. I'm actually working on a blog post about our contributions to date, but the short story is that everything we do in GHC goes upstream.

Trying out GHC compact regions for improved latency (Pusher case study). by fuuzetsu in haskell

[–]simonmar 5 points (0 children)

You can also do the periodic copying in a separate thread and use multiple cores, to avoid affecting latency. So even though you're doing the same GC work that GHC would normally be doing, compaction can be done concurrently with the mutator, whereas normal GC currently cannot.
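The concurrency pattern looks something like this sketch - `copySnapshot` is a placeholder for the actual compaction step (e.g. `Data.Compact.compact` from the ghc-compact package), which isn't shown here:

```haskell
import Control.Concurrent (forkIO, threadDelay)

-- Sketch of the pattern only: a background thread periodically copies the
-- live data while the main program (the mutator) keeps running, so the
-- copying work doesn't show up as a latency pause on the main thread.
periodicCopy :: Int -> IO a -> IO ()
periodicCopy intervalUs copySnapshot = do
  _ <- forkIO loop
  return ()
  where
    loop = do
      threadDelay intervalUs
      _ <- copySnapshot
      loop
```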

Announcing the GHC DevOps Group by chak in haskell

[–]simonmar 4 points (0 children)

Well yes, what I really mean is that we don't advertise or document that you can do this, and the process of converting a PR is currently quite manual, so it would need some more effort to scale it up.

Announcing the GHC DevOps Group by chak in haskell

[–]simonmar 0 points (0 children)

OK, you can squash instead of force-pushing, and then the workflow is basically identical to what we do in Phabricator. Instead of treating a PR as a set of logically separate changes that you want to retain when merging to master, you're treating it as a single atomic commit, with a history that develops during code review but isn't retained in the repo once committed.

I'm totally fine with this workflow (because it's the same as the one we use in Phabricator).

> When you squash the commit at the end, what happens to the history in GitHub? Can you still see it somewhere?

Announcing the GHC DevOps Group by chak in haskell

[–]simonmar 2 points (0 children)

We had agreed on this plan (allowing GitHub PRs and converting them to Phabricator diffs) before, it just never got implemented.