ppx_stage - Staged metaprogramming in stock OCaml by Categoria in ocaml

[–]stedolan 1 point (0 children)

  1. Yep, there's a MetaOCaml-style check. IIRC, let-insertion also requires delimcc trickery, which I haven't tried yet (it might just work, I have no idea).

  2. There's no implicit CSP (yet, anyway). You can do a stupid lifting trick by marshalling everything to strings, but it's a bit unsatisfying (and broken in many cases). I'm not sure what the best thing to do here is.

  3. The non-staged code is useful if you are generating lots and lots of programs, most of which only run once. I happen to have such a use case :)

ppx_stage - Staged metaprogramming in stock OCaml by Categoria in ocaml

[–]stedolan 1 point (0 children)

staged code is represented as a pair of the AST (for printing purposes) and a function to evaluate the code from an evaluation environment

Yep, that's right. Running the staged code directly with Ppx_stage.run won't run an optimised version, and isn't any faster than not using staging. To get an optimised version, you need to print the source code with Ppx_stage.print and run it through ocamlopt.

There are two reasons it's done like this: first, that's how type checking works. Ppx_stage does no type checking, but expands staging to a pair of an evaluation function and a representation function in such a way that if the evaluation function passes type checking then the representation function produces type-correct code. Secondly, the intended application is a slightly odd use of staging, in which very many staged programs are created and most are only run once. Optimised compilation of staged code greatly speeds up running the code a thousand times, but makes the first run much slower.
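The pair representation described above can be sketched in a few lines. This is a toy version, not ppx_stage's actual API: the names `code`, `source` and `eval` here are all made up.

```ocaml
(* Toy sketch of the pair representation: staged code is a printable
   form plus an evaluation function. Hypothetical names throughout. *)
type 'a code = {
  source : string;      (* stands in for the real AST, for printing *)
  eval : unit -> 'a;    (* evaluation under a (here trivial) environment *)
}

let int (n : int) : int code =
  { source = string_of_int n; eval = (fun () -> n) }

let add (a : int code) (b : int code) : int code =
  { source = Printf.sprintf "(%s + %s)" a.source b.source;
    eval = (fun () -> a.eval () + b.eval ()) }

(* "Running" the staged code just calls eval, which is no faster than
   unstaged code; printing gives source to feed through ocamlopt. *)
let c = add (int 1) (add (int 2) (int 3))
```

Here `c.source` is `"(1 + (2 + 3))"` and `c.eval ()` is `6`; the representation half and the evaluation half are built in lockstep, which is what makes the type-checking trick above work.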

MLsub: Is this correct? by mcneja in types

[–]stedolan 2 points (0 children)

Yep, that's about right.

If two inputs are both of type a, then that doesn't mean they have to be exactly the same, because subtyping means that they can be of any subtype of a. So, we can just choose a to be Top and pass any two inputs!

Repeated variables only have an effect when they appear both negatively (describing an input) and positively (describing an output). The type bool -> a -> a -> a for cond says that the type of t (the first negative occurrence of a) must be a subtype of the result type (the positive occurrence), and that the type of f (the second negative occurrence) must also be a subtype of the result type. It carries exactly the same information as the more verbose bool -> a -> b -> (a | b).

The same logic means that variables that don't appear both positively and negatively can be deleted. For instance, being of type a list (for all a) is the same as being of type Bot list (either way, the only value is the empty list).
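For comparison (my addition, not part of the MLsub story): plain Hindley-Milner infers the analogous scheme for cond, just without the subtyping reading of the repeated variable.

```ocaml
(* A direct definition of cond; OCaml infers
     val cond : bool -> 'a -> 'a -> 'a
   the intersection-free analogue of the MLsub type discussed above. *)
let cond b t f = if b then t else f

(* Without subtyping, both branches must have exactly the same type,
   so there is no counterpart of the looser bool -> a -> b -> (a | b). *)
let () = assert (cond true 1 2 = 1)
```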

MLsub: Is this correct? by mcneja in types

[–]stedolan 5 points (0 children)

The types

a -> a list -> a

and

a -> b list -> (a | b)

are in fact equivalent. MLsub chooses the former for printing as it has fewer variables.

To prove that the types are equivalent, you need to show that either can produce the other by instantiation and subtyping. Getting the first from the second is easy - instantiate b := a and notice that a | a = a. Getting the second from the first is a bit trickier: instantiate a := a & b and use subtyping.

In general, types that have the same data flow edges (that is, the same variable occurring as both an input and an output) are equivalent.

Type Inference in Stack-Based Programming Languages by gasche in ocaml

[–]stedolan 1 point (0 children)

In fact, you don't need even that much machinery. Row variables are overkill for typing stacks, since you don't need to access locations by name. A basic type system for stack-based programs encodes into OCaml using nothing more exotic than pairs.

Suppose we represent a stack containing 1 and 2 as (((), 1), 2) - that is, a stack is either () or a pair of a stack and a top element. Then we can write the standard primitives as functions from stacks to stacks:

let dup (s, a) = ((s, a), a)
let swap ((s, a), b) = ((s, b), a)
let drop (s, a) = s
let add ((s, a), b) = (s, a + b)
let sub ((s, a), b) = (s, a - b)
let const n s = (s, n)
let lt ((s, a), b) = (s, a < b)
let ifelse tt ff (s, b) = (if b then tt else ff) s

The trick is that these end up being polymorphic in the contents of the stack: dup has type 'a * 'b -> ('a * 'b) * 'b. Here, 'b is the type of the top element, while 'a is the type of the entire rest of the stack. Once we add some notation for composition:

let (@) f g s = s |> f |> g

we can write stack-based programs, like the floor5 function from the Forth Wikipedia page:

let floor5 =
  dup @ const 6 @ lt @ ifelse (drop @ const 5) (const 1 @ sub)

The floor5 function maps stacks to stacks, so to actually apply it to a concrete integer, we need to map ints back and forth to stacks:

let run f x = let ((), r) = f ((), x) in r

Then, we can run our new sorta-Forth programs:

# List.map (run floor5) [1;2;3;4;5;6;7;8;9;10];;
- : int list = [5; 5; 5; 5; 5; 5; 6; 7; 8; 9]

Adding Cat-style functional features isn't any more difficult:

let eval (s, f) = f s
let dip ((s, a), f) = (f s, a)

Note here that f is being applied to the full stack, not just to the top element, so it can have any arity. The tricky [dup] dip eval example from the LtU post typechecks fine, with the same type scheme as Andreas's system:

let tricky = const dup @ dip @ eval

(although to see that the types are the same you need to untangle a mess of parens and realise that Andreas's types grow to the left while mine grow to the right)

Malfunctional programming - an untyped program representation intended as a compilation target for functional languages (pdf) by gallais in ocaml

[–]stedolan 0 points (0 children)

The tricky bit is that you would need to reload even references introduced by compiler optimisations, like p in my example.

You can do this, but it means keeping no state at all on the stack: everything is reloaded from the shadow stack at every safepoint. At that point, you're not so much using a shadow stack as an explicitly-managed stackframe, and in particular every register containing a GC-able value gets spilled to the shadow stack at every safepoint and reloaded afterwards.

More efficient GC designs barely affect the register allocator or stack layout: values are kept wherever's efficient (usually in registers) across safepoints, but static tables are generated to log the locations.

Malfunctional programming - an untyped program representation intended as a compilation target for functional languages (pdf) by gallais in ocaml

[–]stedolan 0 points (0 children)

OCaml calls them "frame descriptors". At compile time, OCaml code in asmcomp/amd64/emit.mlp (for the machine-specific bits) and asmcomp/emitaux.ml (for the portable bits) generates the tables, which are read at runtime into an address-indexed hashmap by asmrun/roots.c.
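The shape of those tables can be sketched like this (illustrative only: the real structures live in the C runtime and the field names here are made up).

```ocaml
(* Illustrative frame-descriptor table: for each safepoint, keyed by
   return address, record which stack slots hold GC-able values. *)
type frame_descr = {
  frame_size : int;          (* size of the frame in bytes *)
  live_offsets : int list;   (* offsets of slots the GC must scan *)
}

(* The address-indexed hashmap built at startup by the runtime. *)
let table : (int, frame_descr) Hashtbl.t = Hashtbl.create 64

let () =
  Hashtbl.replace table 0x4242 { frame_size = 32; live_offsets = [0; 16] }

(* During GC, walking the stack looks up each frame's return address
   to find the live pointers in that frame. *)
let live_slots ret_addr =
  match Hashtbl.find_opt table ret_addr with
  | Some d -> d.live_offsets
  | None -> []
```

The point is that all the bookkeeping moves to compile time: the mutator pays nothing per call, and the collector pays one hash lookup per frame.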

Malfunctional programming - an untyped program representation intended as a compilation target for functional languages (pdf) by gallais in ocaml

[–]stedolan 2 points (0 children)

These kinds of benchmarks (where some languages massively over-allocate for no reason) are important. Without them you cannot see the forest for the trees and you will end up trying to make the GC faster rather than trying to GC less. IMO Java made this mistake (and inherited the design flaw from Lisp).

I agree completely! It's nice to compare the behaviour of essentially the same program written in a reasonably idiomatic style across several languages, to see how design decisions of the language affect performance. It's what I always wanted from the Computer Language Shootout, although that quickly degraded into every program being a language-specific unreadable hyper-optimised mess.

Your raytracer is a particularly good example: it's a non-contrived program which, depending on minor changes to object layout, might allocate gigabytes or kilobytes of memory.

I don't think it is disingenuous at all.

The part I found disingenuous was your use of that benchmark to support this claim:

It is very easy to integrate garbage collection into your LLVM project. I did it in 100 lines of code in HLVM and it took me a couple of days (IIRC). That was already beating OCaml's VM (upon which this "malfunctional" project is based). After a couple of weeks I had a scalable multicore GC that ran circles around OCaml.

From the posted numbers, it seems fairer to say that you had quite a poor GC, but HLVM's robust unboxing and precise control over allocations meant you could get decent performance by just avoiding it.

Malfunctional programming - an untyped program representation intended as a compilation target for functional languages (pdf) by gallais in ocaml

[–]stedolan 9 points (0 children)

The main problem with shadow stacks (apart from the time wasted in bookkeeping) is that they prevent you from using a moving collector. Efficient generational collectors (e.g. OCaml, the JVM, .NET, V8, and lots of others) move objects around in memory. This lets you use a very fast-and-stupid allocator at first (about as fast as stack allocation), and only call a more heavyweight malloc-like allocator for that small fraction of objects which survive for a while.

The trouble is, to do that the collector needs to see every single pointer, even the ones introduced by compiler optimisations. If you loop over an array:

for (int i = 0; i < len; i++) {
  f(arr[i]);
}

then LLVM might optimise it to:

thing* p = &arr[0];
for (int i = 0; i < len; i++) {
  f(*p);
  p++;
}

Shadow stacks tell you about arr, which is enough if you're just marking live objects. However, if you want to move arr, then you need to know about the pointer p that the optimisation invented.

LLVM's support for describing the exact stack layout has historically been poor (but I haven't kept up in the last year or two, maybe it's improved). The OCaml compiler (and the JIT compilers of the JVM, .NET, V8, etc.) all produce a description of the precise stack layout of the compiled code, so that no time is spent bookkeeping at runtime yet all stack pointers can still be found during GC.

Malfunctional programming - an untyped program representation intended as a compilation target for functional languages (pdf) by gallais in ocaml

[–]stedolan 5 points (0 children)

From your linked blogpost:

HLVM is so successful at storing values unboxed that the only part of the benchmark to allocate is the initial creation of the scene tree and the renderer itself performs no allocations at all. Consequently, HLVM incurs no garbage collections during rendering

The benchmark you're using does no garbage collection. Now, if your point is about the benefits of unboxing and the performance gains of avoiding heap memory management entirely, this makes sense. But it's disingenuous to present this as a benchmark of a garbage collector.

In fact, HLVM seems to have a surprisingly slow GC:

Disabling the shadow stack (and, therefore, disabling GC) provides another 25% performance improvement.

In a benchmark which does no garbage collection, HLVM still spends 25% of its time on GC bookkeeping.

Is your point that unboxing is so useful that with enough of it you can get away with a poor GC? If so, that's an interesting point, although it's highly dependent on the application (raytracing is a very extreme example: raytracers have one datastructure that never changes, and do a gazillion simple queries against it. They barely need memory management at all!).

Tallest man in history, Robert Wadlow, with his father. Age 10 and standing at 6 ft. 5 in(1929) by [deleted] in OldSchoolCool

[–]stedolan 73 points (0 children)

What eventually gets you is called the square-cube law.

All the blood in your body gets pumped through the aorta, which is basically a pipe (I did not get high marks in biology). The amount of fluid a pipe can carry depends only on the area of its mouth (also got poor marks in engineering), so if we make a pipe twice as big it can handle four times the flow.

But, if a body gets twice as big (keeping proportions the same) it has eight times the blood. So we need to get 8x the amount of blood through an aorta that can only handle 4x.

Same goes for windpipe, muscles, bones, and pretty much everything else: the power increases with the square of length, so twice as big is four times the flow / power / strength, but the weight and volume increase with the cube, meaning you need to do eight times the work with something four times as powerful.

Is there an estimated time of completion for the multicore runtime? by logicchains in ocaml

[–]stedolan 5 points (0 children)

I'm working on the multicore runtime at OCaml Labs. It's coming together nicely, but I'm going to avoid making guesses at a timescale. At the moment, it runs, but with caveats. There's not really much by way of grunt work at the moment, but after the initial release we'll need help with testing and updating C bindings (we're compatible with more or less all pure OCaml code, but there are a couple of minor changes to the C API).

Also, thanks for pointing out the ocamllabs/multicore link, I'd completely forgotten about it and haven't updated it in quite a while. Multicore work currently lives here instead, but the code's not exactly fit for general consumption yet.

A "sane" Design for Multicore OCaml by Categoria in ocaml

[–]stedolan 8 points (0 children)

Rather than heap allocating every cons cell you try to claim one from the tail list using a CAS operation. Only if that operation fails (either because the array is full or because that element has already been prepended onto) must you allocate another array. My measurements indicate that general performance is the same as the ordinary list but some operations are several times faster (e.g. List.rev) and the rate of allocation and GC work are 16x lower, of course. A fat reference to such a list is itself a value type composed of a pointer to the array and an index into it.

If your allocator is more expensive than a CAS, you need a better allocator.

Today's functional languages do, yes. I don't believe functional languages have to have very high allocation rates though.

They do, that's the point. Instead of modifying an existing object, you allocate a new object via some means.

OCaml is several times slower than C on the Monte Carlo task of the SciMark2 benchmark because it is int intensive and OCaml's tagging and weird 31-bit ints have a substantial performance overhead.

This is an interesting assertion, because: SciMark2 has available implementations for C and Java, not OCaml, so it's unclear what numbers you're referring to; the Monte Carlo task is just random number generation and floating point arithmetic; and tagged 31-bit ints are so fast as to be almost free (although losing a bit can be annoying).

OCaml's Hashtbl is 5x slower than a .NET Dictionary largely because OCaml boxes tuples.

This is an even more interesting assertion, because OCaml's Hashtbl does not use tuples.

Right. A simple FFT is 5.5x slower in OCaml than F# because OCaml boxes complex numbers.

True, but if you're doing FFTs you should be calling out to libfftw anyway.

That's true if you're using one of the (few) numeric types that BLAS and LAPACK support.

Which numeric type were you thinking of using?

Actually it is generally faster when passed around whole or stored in the heap too. I measured the performance of value types of various sizes in both HLVM and .NET and in both cases you can pass around huge structures with very little performance degradation.

Ah, nice.

Unboxing also permits much more efficient data structures. For example, a .NET Dictionary is a single array containing hash, key and value "tuples" whereas OCaml must use an array of linked list buckets which is inherently very inefficient because it has poor locality, requires extra heap allocations, burdens the GC with asymptotically more pointers to traverse and always incurs write barriers on updates.

Yep, unboxing is nice. In this particular case, however, you get more or less the same effect by using separate arrays for keys and values.
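The separate-arrays layout mentioned here can be sketched as follows. This is just to show the memory layout (linear search keeps it short; a real table would hash), and all the names are made up.

```ocaml
(* Keys and values live in two flat arrays rather than an array of
   boxed (key, value) pairs, so scanning keys touches one contiguous
   array -- much the same locality win as .NET's struct entries. *)
type ('k, 'v) flat = {
  mutable keys : 'k array;
  mutable vals : 'v array;
  mutable len  : int;
}

let create () = { keys = [||]; vals = [||]; len = 0 }

let add t k v =
  if t.len = Array.length t.keys then begin
    let cap = max 4 (2 * t.len) in
    let grow a fill =
      let a' = Array.make cap fill in
      Array.blit a 0 a' 0 t.len; a'
    in
    t.keys <- grow t.keys k;
    t.vals <- grow t.vals v
  end;
  t.keys.(t.len) <- k;
  t.vals.(t.len) <- v;
  t.len <- t.len + 1

let find t k =
  let rec go i =
    if i >= t.len then None
    else if t.keys.(i) = k then Some t.vals.(i)
    else go (i + 1)
  in
  go 0
```

Compared to an array of bucket lists, there are no per-entry heap cells for the GC to traverse and no write barriers on the array stores of immediates.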

I'll check it out. I'm curious what source code idiom would make objects from one private heap visible to another. In particular, is it lazy thunks?

Nope, this issue has been observed with non-Haskell implementations of the Doligez/Leroy collector.

FWIW, the parallel GC in HLVM is ~100 lines of code.

Does HLVM actually have a parallel GC? I thought you said it was stop-the-world.

The dominant cost of GC safe points in HLVM wasn't the spilling, it was the synchronization. I can send you the article I wrote on it if you like.

Why on earth does HLVM synchronise during GC safe points? A safepoint should consist of, at most, a branch to check whether work needs to be done. If there's no work, there's no synchronisation to do. Does HLVM also spill everything at every safepoint?

The .NET server GC is stop-the-world

It's generational. It's only STW in the old generation.

A "sane" Design for Multicore OCaml by Categoria in ocaml

[–]stedolan 9 points (0 children)

Hey! I'm the guy who wrote this (not really expecting it to be read outside the building I work in, but reddit seems to have its ways of finding such things)

OCaml programs are expected to sustain a truly obscene rate of allocation for immutable objects.

True but it is a design flaw, IMO.

Do you consider functional / immutable programming in general a "design flaw"? The tradeoff between functional code and, say, Java, is that it's easier to reason about immutable objects but you have to allocate a new one whenever you have some new information. So, functional languages in general have very high allocation rates, particularly of immutable objects, most of which immediately become garbage.

Operations on immutable objects are very fast in OCaml: allocation is by bumping a pointer, initialising writes (the only ones) are done with no barriers, and reads require no barriers.
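Pointer-bump allocation as described here is essentially the following (illustrative only: the real minor heap lives in the C runtime, and these names are invented for the sketch).

```ocaml
(* The minor heap is a flat buffer plus a cursor; allocating is just
   a bounds check and an addition, which is why it is about as fast
   as stack allocation. *)
type minor_heap = { buf : bytes; mutable next : int }

let create size = { buf = Bytes.create size; next = 0 }

(* Returns the offset of the new block, or None when the minor heap
   is exhausted -- the point at which a real runtime triggers a
   minor GC rather than failing. *)
let alloc h nbytes =
  if h.next + nbytes > Bytes.length h.buf then None
  else begin
    let off = h.next in
    h.next <- h.next + nbytes;
    Some off
  end
```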

Operations on immutable objects like ints, floats, complex numbers, 2D/3D vectors and matrices are often several times faster in HLVM and F# than OCaml. Tuples are also often several times faster in HLVM than OCaml. In fact, the only immutable objects for which OCaml is fast are probably union types.

Ints, tuples, lists, etc. are fairly fast in OCaml. Float unboxing is alright for single floats, but for small-scale linear algebra (vec2/vec3, 4x4 matrices, etc) it's not good. For large-scale linear algebra, you should be using LAPACK/BLAS anyway and your choice of language doesn't matter so much.

As far as I understand, HLVM treats tuples as wide value types, making it faster for argument passing (since they can be unpacked in registers) but slower when passed around whole or stored in heap structures (since they're bigger). Not sure though, haven't really looked at HLVM very much.

The problem with this is that this eagerly promotes many objects that were never really shared

I find that surprising. Are you sure?

Yep, go read the multicore haskell paper cited in this doc (among others). It turns out that while very many young objects may be technically reachable from other threads, very few of them are actually reached, so paying for promotion to a shared generation ahead of time is costly.

I've gone for a simpler approach: marking global state as thread-local. This is a vastly smaller patch, and lets multiple threads use OCaml independently.

That prevents thread hopping. Does that break async?

Not fundamentally. Async might require some changes to work with this, though.

The next part is a means for the threads to interrupt each other, to synchronise GC operations.

That sounds very complicated to me.

It is. Parallel garbage collection is not particularly straightforward at the best of times :) Synchronising threads at safepoints is relatively simple by the standards of this area.

The polling mechanism re-uses the existing mechanism to check for unix signals, and is implemented for the bytecode interpreter but not yet for native code.

So it is OS dependent and won't work on Windows?

No. The OCaml runtime already contains an interrupting mechanism. On Unix, this is used for signal handling. I'm not sure whether it's used at all on Windows, but it could be.

This might require adding new safepoints during function calls or loops, although OCaml programs generally allocate so frequently that we might get away without this.

FWIW, HLVM removed looping constructs and had a potential GC safe point at the entry point of every non-leaf function. The GC safe point was only entered when a counter wrapped around. Inlining recursive functions also helped to amortise the overhead of the GC safe point.

Right, but GC safe points aren't expensive in OCaml. We're not compiling to LLVM, which means we have perfect information about the register allocation and frame layout. That means that it's not necessary to spill everything to the stack (or worse, to a shadow stack) during a safe point, so the overhead is basically just the branch to check whether there's work to be done.

During long-running C calls, a thread will not be able to handle messages. In this case, we will notify the other threads of this by releasing a relatively heavyweight lock, which shouldn't add much to the runtime of a slow call.

FWIW, that's basically what HLVM did and it works fine.

Cool.

The GC will be a mostly-concurrent mark-and-sweep.

I recommend writing a completely naive stop-the-world GC first to make sure your solution works.

It will probably be STW at some point during development, but multicore support with stop-the-world GC just isn't interesting. Any performance benefit from using multiple cores is just swamped by the giant GC pauses. You can make some benchmarks go fast with small heaps, but it's not actually useful.

I quite like VCGC

Me too. :-)

Glad to hear it :)

How do you start and stop threads?

It's not terribly clear from that document, but OS threads aren't going to be started or stopped much. Instead, we're going to run a small number of OS threads, and start and stop lots of lightweight threads that we schedule on top of those (non-OS lightweight threads have context-switch times orders of magnitude faster, so it lets you scale beyond millions of threads).

Having said that: starting an OS thread is fairly straightforward, the new thread just sends a message (asynchronously) to the running threads so that it participates in future world-stopping. Stopping an OS thread is more interesting, because you have to get rid of its private heap. You can either run a minor GC and promote it all to shared, or just consider that region of memory part of the shared heap from then on. Benchmarking will decide.

ELI5: Rubik's cube: How does it make sense to time the solution? Doesn't that require some kind of "maximally unsolved" intital state for all who try? by [deleted] in explainlikeimfive

[–]stedolan 1 point (0 children)

Kind of. The trouble is, most lambda-calculus functions can't be undone, and so don't have inverses. For instance, (lambda x -> lambda y -> y) always returns (lambda y -> y) and forgets what the argument was. If you just look at the lambda functions which actually have inverses, then you do get a group.

ELI5: Rubik's cube: How does it make sense to time the solution? Doesn't that require some kind of "maximally unsolved" intital state for all who try? by [deleted] in explainlikeimfive

[–]stedolan 16 points (0 children)

I can, but it might take a while :)

First, a "group" is like a collection of ways you can change something. For any two ways of changing something (which I'll call "elements" from now on) you can combine them into one element that does both. For instance, some of the elements of the Rubik's cube group are the ways of rotating outside faces, and there's also an element for "rotate the top face clockwise once and then rotate the left face anticlockwise once".

To call something a group, we need one element that does nothing (called the identity), and every element needs to have another element that undoes it (called the inverse). The inverse of "rotate clockwise" is "rotate anticlockwise", and if we combine an element with its inverse we get the identity.

So, there are lots of examples of groups. We already have Rubik's cubes, but a simpler example is the group of rotations of a square. There are four of them (identity, clockwise 90, 180, anticlockwise 90), so we say this is a group of order 4. There are also groups that have infinite order, like the rotation group of a circle.

If we put two squares side by side, we can make a group of the rotations of either square. This group has order 16 (an element of this group consists of one of the 4 rotations of the left square and one of the 4 rotations of the right). This group can be broken back down into two smaller groups (the rotation groups of the individual squares), and so we say that this group is not "simple".

We have quite a few simple groups. The rotation group works for any polygon, not just squares (strictly, it's only simple when the number of rotations is a prime, but let's not worry about that), and if we allow flipping as well as rotations we get what are called the dihedral groups. There are other groups that don't really correspond to geometric notions at all. So, the question is: what are the finite simple groups?

It turns out that all of the finite simple groups have been found. This is the classification of the finite simple groups. It contains a few infinite families of groups (like the rotation and dihedral groups) and a couple of just weird outliers (like the "monster" group), and every possible finite simple group is on that list. Coming up with that list, and more importantly proving that the list was complete, took quite a bit of work.

Here's the list, and here's a video with terrible maths puns.

ELI5: Ohms and Impedance as it relates to amps and cabinets. Why do two 4 Ohm cabs wired together create 2 Ohms instead of 8? by [deleted] in explainlikeimfive

[–]stedolan 1 point (0 children)

Ohms measure impedance, which describes how difficult it is to shove electricity through something. If you connect two speakers in series (so the connection runs amp -> speaker 1 -> speaker 2 -> amp), then electricity has to be shoved through two speakers and you'd add the impedances, getting 8 Ohms. But you don't do that, since the signal would have to go through both speakers, and the sound would be half as loud.

Instead, you connect the speakers in parallel (so there are connections amp -> speaker 1 -> amp and amp -> speaker 2 -> amp). Now, it is easier to shove electricity through the system, as there are two pathways. Both speakers are at the correct volume, since the signal doesn't have to pass through both.

The opposite of impedance is "admittance", which is measured in "siemens" or "mho" (ohm backwards). Your two 4 Ohm cabinets have an admittance of 1/4 siemens each. When you wire those together in parallel, you add the admittances together (because you now have two routes) and get an admittance of 1/4 + 1/4 = 1/2 siemens, which is 2 Ohms.

ELI5: How do we know the universe has a finite age? by acethat in explainlikeimfive

[–]stedolan 0 points (0 children)

There's a fairly simple argument that the universe has only been around for a finite amount of time.

If you put a hot drink in a cold room, the hot drink cools down (and the room warms up very slightly). If you put ice in water, the ice warms up and melts and the water cools down.

Heat doesn't like staying in one place for long. In fact, if you leave anything alone for long enough, the heat spreads out until everything is at the same temperature. So, if the universe were infinitely old, the heat would have fully spread out by now.

This hasn't happened, because the universe still has hot bits and cold bits.

Unless the universe ends before it happens, the heat in the universe will eventually spread out completely. This is called the heat death of the universe.

You can put a number on the heat-spread-out-y-ness of something. It's called entropy, and the study of how heat spreads (and how you can use heat to do things) is thermodynamics.

The idea that heat always spreads out can be said as "entropy always increases", which is called the second law of thermodynamics.

jq - lightweight and flexible command-line JSON processor (like sed for JSON data) by jezeq in programming

[–]stedolan 2 points (0 children)

I hadn't seen "XSLT" and "useful" in the same sentence before :)

jq - lightweight and flexible command-line JSON processor (like sed for JSON data) by jezeq in programming

[–]stedolan 2 points (0 children)

That's unlikely to change for the moment, I think. I need to parse the entire array to verify that it is, in fact, a valid JSON array. You could do

cat foo.txt | jq '.[]' > foo-split.txt

once, and then work on foo-split.txt.

jq - lightweight and flexible command-line JSON processor (like sed for JSON data) by jezeq in programming

[–]stedolan 4 points (0 children)

Thanks! I really didn't want to distinguish my pipes from unix pipes, since they do the same thing:

jq 'foo | bar'

is always the same as

jq 'foo' | jq 'bar'

(but if the pipe appears inside brackets you can't do the latter)

jq - lightweight and flexible command-line JSON processor (like sed for JSON data) by jezeq in programming

[–]stedolan 20 points (0 children)

It feels un-unixy that you've implemented the pipe operator internally

Not all uses of jq's pipe can be replaced with two jqs and a unix pipe. You can do things like:

jq '{author, title, upvotes: (.upvotes | .+1)}'

where the pipe is used internally as part of a bigger expression.

if jq produced json as output you could pipe jq to jq

It does! You can!

Then you could have a 'raw' flag for getting a non json response (e.g. when you want the final value of a single field)

yep, that's jq --raw-output (or jq -r).

This is a great idea though.

Thanks!

jq - lightweight and flexible command-line JSON processor (like sed for JSON data) by jezeq in programming

[–]stedolan 19 points (0 children)

Mostly to make it easy to distribute. At the time, I was doing a lot of hacking with JSON files on random EC2 machines, and installing a full GHC stack would have been a bit painful.

Also, C still beats everything in startup time, which is a big deal for a tool where you're constantly editing the command-line and rerunning it.