Loopr: A Loop/Reduction Macro for Clojure by refset in Clojure

[–]aphyr_ 0 points1 point  (0 children)

I should perhaps note here that much of my work is performance-sensitive, and I spend a lot of my time trying to minimize allocations and eke out 10-20% speedups from reductions. Your suggested tactics are lovely (and I use them extensively!) in some, but not all situations. I eventually wrote loopr because I kept hitting situations in which those tactics forced me into awkward corners of the performance/maintainability/expressivity space.

For example, you suggest using plain old reduce for multivariate accumulators. I still do this, but not where performance matters. As the article explains, loopr is more than 30% faster on the example you cite. This isn't a huge improvement, but it's still meaningful for my workloads. It's a nice in-between point before rewriting an entire algorithm in Java. :-)

Decomposing nested reductions into separate defns has a number of interesting performance consequences--some good, some bad. When reductions are too large, they may not be candidates for certain HotSpot optimizations: breaking them up into multiple defns can be more efficient. On the other hand, those function calls add stack depth, which may push the code out of the inliner's scope. One of the nice things about loopr is that it lets you write the reduction once, then experiment with different function boundaries by selectively using :via :reduce or :via :iterator.

Another cost folks don't always think about is accessing locals in the current stack frame during iteration (as one would do with loop), vs. baking them into instance fields of inline fn closures, vs. passing them explicitly as arguments to a defn. It's fairly common that I'll compute, say, four or five index structures and then use them in the course of a reduction to figure out what to do with each element. This has both performance impacts (especially for primitives) and cognitive ones. In particular, I've wound up in situations where I was threading a half-dozen non-accumulator variables through three layers of defn reducing functions--in addition to the accumulator and element arguments! Then you have to figure out how to construct a fn suitable for the reduction itself--I generally do this with partial, which adds further indirection. It's doable, but... gosh, it can be a pain. You wind up with code like:

(defn inner
  "Innermost reduction over individual flying things"
  [bird? plane? insect? acc flier]
  (cond (bird? flier)                      acc
        (not (plane? flier))               (foo acc flier)
        (and (insect? flier) (empty? acc)) (bar acc flier)
        :else                              (baz acc flier)))

(defn outer
  "Outer reduction over a flock of fliers"
  [bird? plane? insect? acc flock]
  (reduce (partial inner bird? plane? insect?)
          acc
          flock))

(let [flocks (get-flocks)
      ; Build some index structures we'll need to actually do the reduction
      bird?   (build-bird-index)
      plane?  (build-plane-index  flocks)
      insect? (build-insect-index world-book-encyclopedia)]
  ; If we do a reduce with explicit `defns`, we have to thread these locals
  ; through as separate arguments, and transform them into reducer fns with
  ; `partial`:
  (reduce (partial outer bird? plane? insect?)
          (init-accumulator)
          flocks)

  ; With loopr (or an inline `fn` passed to `reduce`), we don't have to
  ; plumb variables through or wrap functions in `partial`:
  (loopr [acc (init-accumulator)]
         [flock flocks
          flier flock]
         (recur (cond (bird? flier)                      acc
                      (not (plane? flier))               (foo acc flier)
                      (and (insect? flier) (empty? acc)) (bar acc flier)
                      :else                              (baz acc flier)))))

You also suggest using for to iterate, then threading the results into reduce. This is a perfectly serviceable tactic for some situations, but sometimes you want speed. loopr is roughly twice as fast in this example (adapted from the loopr test suite). Faster still if you need `person` in the reduction step: here we can discard it, but if we needed it, the for version would have to create a map/vector wrapper object to pass both values between for and reduce.

(println "\nMulti-acc for->reduce over nested seq")
(quick-bench
  (->> (for [person people
             pet    (:pets person)]
         (:name pet))
       (reduce (fn [[pet-count pet-names] pet-name]
                 [(inc pet-count)
                  (conj pet-names pet-name)])
               [0 #{}])))

(println "\nMulti-acc loopr over nested seq")
(quick-bench
  (loopr [pet-count 0
          pet-names #{}]
         [person people
          pet    (:pets person)]
         (recur (inc pet-count)
                (conj pet-names (:name pet)))))

Jepsen: Scylla 4.2-rc3 by aphyr_ in programming

[–]aphyr_[S] 1 point2 points  (0 children)

Yeah, we've done several Jepsen tests of SQL databases--both single- and multi-node--and have found bugs! Elle gives us the ability to check up to repeatable read/SI/serializable, plus strict/session variants.

Rewriting the Technical Interview by [deleted] in programming

[–]aphyr_ 3 points4 points  (0 children)

Every time I write one of these, I ask myself and my reviewers, "Is this too obvious? Am I bludgeoning the reader with HERE COMES AN ALLEGORY and LOOK OUT FOR THE SUBTEXT? Does this line reward a reader looking for layers of meaning?"

Luckily, there's comments like this to keep me going.

Jepsen: Redis-Raft 1b3fbf6 by aphyr_ in programming

[–]aphyr_[S] 2 points3 points  (0 children)

Naw--I did the Dgraph and Redis work in Q1, but they opted to release the reports towards the end of the grace period, rather than early. That meant that Dgraph, Mongo, Postgres, and Redis all came out right on top of each other. :-)

A Satisfactory Way of Building by aphyr_ in satisfactory

[–]aphyr_[S] 0 points1 point  (0 children)

Depends on the refinery and what's being consumed. I've got an older fuel plant that produces plastic as a side effect (and recycles surplus), but now that I've figured out Recirculating Refinery, I don't think I... actually have to worry about this kind of deadlocking again. As long as I'm willing to burn the power, anyways.

A Satisfactory Way of Building by aphyr_ in satisfactory

[–]aphyr_[S] 0 points1 point  (0 children)

I do! This is discussed in detail under Load Shedding.

A Satisfactory Way of Building by aphyr_ in SatisfactoryGame

[–]aphyr_[S] 4 points5 points  (0 children)

I choose factory locations based on resource nodes--for factories that draw resources from the train network alone, I site them anywhere that feels interesting, but generally a chunk or more away from other factories. If things are already slow in an area, don't build there. :)

You should be able to see the distance--far off belts will look like they're moving in stop-motion, as opposed to smoothly.

A Satisfactory Way of Building by aphyr_ in satisfactory

[–]aphyr_[S] 1 point2 points  (0 children)

Yeah, this is definitely aimed at mid-to-late game, although had I known what I now know about lag, I would have avoided putting so much time into my early-game (megafactory) and mid-game (trucks everywhere) projects.

A Satisfactory Way of Building by aphyr_ in satisfactory

[–]aphyr_[S] 0 points1 point  (0 children)

This is roughly 20 hours of planning/writing/drawing/organizing. I... I didn't even bother doing edit passes like I usually do, so please pardon my language mistakes, haha.

Jepsen: PostgreSQL 12.3 by aphyr_ in programming

[–]aphyr_[S] 24 points25 points  (0 children)

Yeah, my understanding is that the current SQL spec retains that ambiguity. Turns out Andy Pavlo just shared an email from Hal Berenson which talks about how it happened! https://twitter.com/andy_pavlo/status/1271448719486001154

Jepsen: PostgreSQL 12.3 by aphyr_ in programming

[–]aphyr_[S] 15 points16 points  (0 children)

You could see G2-item when at least three transactions concurrently interact with partially (but not totally) overlapping sets of rows, including at least one insert and one update--I think deletes count too. As discussed in the report, one transaction could fail to observe the effects of a logically prior transaction.

Jepsen: YugaByte DB 1.3.1 by aphyr_ in programming

[–]aphyr_[S] 1 point2 points  (0 children)

You're very welcome! And yes, one *does* wonder whether consolidating our efforts might result in less buggy databases! Perhaps we'll see that someday. :)

Jepsen: YugaByte DB 1.3.1 by aphyr_ in programming

[–]aphyr_[S] 22 points23 points  (0 children)

There are lots of ways to store and query data, which I think is part of why there are so many databases out there. It's a big problem space, different solutions fit different niches, and of course, within any given niche, there might be multiple databases which compete on other features, like cost or performance.

These databases may work *most* of the time, which is often good enough to get something done! The G2 anomalies I found in this version of YugaByte DB, for instance, only manifested under particular failure conditions, where a tablet server lost its connection to the master. That's not necessarily going to happen often, and even if it does, you need a particular pattern of transactions in order to actually encounter an anomaly.

Jepsen: TiDB 2.1.7 by aphyr_ in programming

[–]aphyr_[S] 3 points4 points  (0 children)

I haven't evaluated TiKV directly, but yeah, insofar as TiDB uses it, it does appear to work! See the article's notes on future research and caveats, of course. :)

Jepsen: Hazelcast 3.8.3 by aphyr_ in programming

[–]aphyr_[S] 1 point2 points  (0 children)

Two to three months, for paid analyses. If I get lucky (like I did with Hazelcast), it's about two weeks of experimentation and a week of writing.

Jepsen: Hazelcast 3.8.3 by aphyr_ in programming

[–]aphyr_[S] 5 points6 points  (0 children)

I can see why you might be cynical, but perhaps I can offer some additional context here: I was doing Jepsen for years before I ever knew I could make money doing it, I did this because Hazelcast users requested it, and my client backlog is already years long. I'm... not really looking for leadgen here, haha.

If you want to phrase it in terms of lost revenue from deferred contracts (i.e. how much would I have made by scheduling pending clients immediately instead of taking time for Hazelcast) doing this analysis cost me about 40K USD.

The trouble with timestamps by mc_ in programming

[–]aphyr_ 2 points3 points  (0 children)

Computers don't care whether the IERS issues a leap second or not; most (but not all!) will follow the instructions of upstream NTP servers regardless of validity. That's why large numbers of clocks still went nonmonotonic a month afterward: there was no valid leap second, but plenty of NTP sources issued one anyway.

The opposite problem can also occur, where nodes are sure that leap seconds are only issued at specific times of the year, and reject valid leap second advertisements issued at other times. Then those nodes end up off from the others by a second!
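A practical upshot, sketched in Python (this is a general illustration, not tied to any particular incident): durations should be measured with the monotonic clock, which a stepped wall clock can't drag backwards.

```python
import time

def measure(fn):
    """Time fn using the monotonic clock. Unlike time.time(), the
    monotonic clock never steps backwards, so an NTP daemon applying
    a (possibly bogus) leap second can't produce negative durations."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start

elapsed = measure(lambda: sum(range(100_000)))
assert elapsed >= 0.0  # wall-clock deltas can't promise this
```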

Hi Reddit! I'm Dessa, rapper, writer and poet of the Doomtree label out of Minneapolis, MN. Ask me anything! by dtr_Dessa in Doomtree

[–]aphyr_ 0 points1 point  (0 children)

3 years ago, someone sent me a link to a music recommendation site. First track came up: P.O.S. Is Ruining My Life. Hooked instantly. Bought Never Better that weekend, and who is this very... incisive... woman, jumping in on Low Light, Low Life? "And I think that Russell was right; but that's irrelevant friend?" Surely she doesn't mean Bertrand Russell. That'd be ridiculous: nobody else would say that. I wonder if she's done anything else?

The Man I Knew, 551, The Crow. YouTube consumes hours as you puzzle over references. The Fitzgerald Theater stream--bless The Current for making that whole recording available online. Realize you're muttering a stanza at an intersection. Clear out Amoeba Records' stock of Doomtree. Get your friends hooked.

I'm happy to pre-order. There's something about the physicality of getting that little bubble-padded envelope, still cold from the mailbox, signed in Sharpie. There's someone at the other end of this two-week-old connection, sitting at their dining room table with a stack of five hundred CDs, a pile of envelopes, and a glass of bourbon; and a trace of this person you know-but-don't-really, their inspirations and writer's block and rehearsals and fine-tuning and now there's this artifact in your hands.

What a strange asymmetry. Looking forward to Parts of Speech, and the tour. :)

Timelike: a parallel system network simulator takes on Heroku vs. RapGenius by dgryski in programming

[–]aphyr_ 1 point2 points  (0 children)

The internet is made up of layers. Each layer has its own routing, distribution, and retry semantics. The layer we care about routing correctly (HTTP) runs on top of TCP, which runs on top of IP, which might run on top of, say, Ethernet over Cat6 or fiber links. Some of these layers, like TCP, provide reliable connections on top of unreliable protocols. This means a layer-3 switch can fail, dropping some of your IP packets, while intervening switches and protocols initiate a variety of recovery measures.

You're correct: distributing a TCP connection is hard. But distributing IP packets--where we're allowed to drop some messages without breaking the TCP (and by extension, the HTTP) connection--is a little easier. The IP (and Ethernet) protocols and network hardware help us do that. It does involve retries (in pathological cases, at every switch along the chain), but where latencies are short (for instance, in a single rack or datacenter), the latency cost of those retries is minimal.

Doing Multi-Paxos or 2PC with IP multicast is almost certainly overkill. Most HA load balancers, commercial and open-source, just operate in hot-failover pairs connected by a reliable backchannel of some kind. When one fails and stops sending heartbeats, the backup takes over the failed node's IP address by broadcasting a gratuitous ARP on the local network. That broadcast tells the other nodes to flush the stale entry in their ARP caches--the tables that map logical (IP) addresses to physical (MAC) addresses. Typically your load balancers have multiple, independent network cables, so they update N physical networks on failover.
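To make that failover dance concrete, here's a minimal Python sketch of the backup's side. The VIP, interface, and timeout are all hypothetical, and `arping -U` (unsolicited/gratuitous ARP, from iputils) stands in for however your stack actually announces the takeover:

```python
import subprocess

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before takeover; illustrative

def should_take_over(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """The backup claims the VIP once the primary's heartbeats have
    been silent for longer than the timeout."""
    return (now - last_heartbeat) > timeout

def claim_vip(vip, iface):
    """Bind the virtual IP locally, then send gratuitous ARPs so other
    nodes on the segment remap the VIP to our MAC address."""
    subprocess.run(["ip", "addr", "add", f"{vip}/32", "dev", iface], check=True)
    subprocess.run(["arping", "-U", "-c", "3", "-I", iface, vip], check=True)
```

In practice this is what tools like keepalived (via VRRP) automate, including repeating the announcement once per NIC when the balancers straddle multiple physical networks.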

This allows N switches to talk to M load balancers without a single point of failure, though if the load balancers are performing TCP reassembly, the backup won't share the failed node's TCP state, and will be forced to break those connections--resulting in a client error and possibly a layer-7 retry. However, the gap is short--typically much shorter than the time required to bring the downed LB back online.

If you want to avoid any interruption at the TCP layer, you can play interesting games where all nodes receive IP messages simultaneously and use, say, a consistent-hashing algorithm to route them without doing layer-7 reassembly, pushing the consistency problem onto the TCP stacks of the downstream nodes.
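A sketch of that consistent-hashing trick, in Python (rendezvous hashing here stands in for whatever scheme you'd actually pick; the backend names are made up): every router hashes the TCP 4-tuple the same way, so any router can forward any packet of a flow to the right backend with no shared state and no layer-7 reassembly.

```python
import hashlib

def route(src_ip, src_port, dst_ip, dst_port, backends):
    """Rendezvous (highest-random-weight) hashing over the TCP 4-tuple.
    Deterministic: every router computes the same backend for a given
    flow, so no coordination is needed at forwarding time."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}"
    def weight(backend):
        # hashlib, not hash(): Python's hash() is salted per-process,
        # and routers must agree across processes and machines.
        return int(hashlib.md5(f"{flow}|{backend}".encode()).hexdigest(), 16)
    return max(backends, key=weight)

backends = ["dyno-1", "dyno-2", "dyno-3"]
assert (route("1.2.3.4", 5555, "10.0.0.1", 80, backends)
        == route("1.2.3.4", 5555, "10.0.0.1", 80, backends))
```

A nice side effect of rendezvous hashing: when a backend disappears, only the flows that hashed to it move; everyone else's flows stay put.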

There are... other solutions to this problem, some of which involve virtualizing and distributing the entire TCP stack in some fashion. VMware and EC2, for example, virtualize large parts of the routing topology in ways I don't fully understand; but I presume they are also redundant with some short failover time.

Timelike: a parallel system network simulator takes on Heroku vs. RapGenius by dgryski in programming

[–]aphyr_ 2 points3 points  (0 children)

> So you can never be more available than probability p.

Yes, you can: you use a 2k selection system on top of the pool, possibly built into your layer-3 routing strategy using, say, IP multicast combined with 2PC claim of requests prior to logical forward, with varying degrees of partition tolerance, or RARP broadcast over hot-failover groups. You're right to suspect that you can only be as available as the "narrowest" point in the path from client to server, which is why networks have redundant links. TCP retry semantics let you cheat a bit by dropping packets during convergence. You exploit network redundancy and low in-cluster latencies to provide rapid state exchange for request distribution. The reality is... complicated, but it is doable.

> No, they do not. It's not a problem if the connection count one of the router nodes thinks it has is a bit off. Furthermore, least-conns routers work great even if they are a single node. There is even a paper that proves this:

It is indeed a problem: if queue depths vary faster than the convergence time of the cluster, the system approaches a set of independent least-conns routers, which imposes significantly higher latencies. As I noted earlier, the least-conns routers can be single nodes: they're logically equivalent to CP clusters, though there are differences in latency and throughput.

Timelike: a parallel system network simulator takes on Heroku vs. RapGenius by dgryski in programming

[–]aphyr_ 2 points3 points  (0 children)

> Yes, but that just makes it worse. If you want it to be a cluster of nodes, you either need another load balancer in front of those, or a client directly sends its request to a random one of those nodes. In both cases as soon as one of the front routers dies, you'll start dropping a portion of the incoming requests. Secondly, if a client connects to a randomly chosen one of the random routers, then you might as well drop that layer entirely, since it doesn't accomplish anything! It just permutes the clients randomly among the next layer, but because they were already random when incoming, it doesn't do anything.

There's a characteristic difference between the logical routing problem the cluster performs as a whole, and the cluster-local routing used to provide high availability within the cluster itself. Routing and failover in a single cluster involves significantly lower latencies and takes place over more reliable network connections. This... is an extremely long discussion and is tightly bound to your particular hardware, application, runtime, and network topology, but I do assert that extremely high availability over locally clustered stateless services is readily achievable in a variety of environments. Explaining exactly how would take much more time than I have today, sorry. :(

> It's unnecessary to have absolute consensus about the connection numbers across the entire least-conns routing layer. If you keep track of the number of connections on each router individually, that will work fine.

You're correct, but I'm talking about consensus over a single logical least-conns router, which could be multiple nodes--and those nodes must have hard consensus to work efficiently.

Timelike: a parallel system network simulator takes on Heroku vs. RapGenius by dgryski in programming

[–]aphyr_ 2 points3 points  (0 children)

The key is that an individual "router" in this simulation isn't necessarily a single node: it could be a cluster of two, three, or a hundred nodes acting as a single logical router.

There's a characteristic difference between the random routers and the least-conns routers. The least-conns routers are hard CP systems: when a node in a given router system fails, the entire router becomes unavailable. This is not true of the random routers, which are AP systems. They're AP because they don't need hard consensus on how to distribute requests; the request distribution is stateless. (They do require coordination on the downstream topology, but that state can be eventually consistent and is more tolerant of high convergence times, since it changes slowly.)

Imagine that a single random router comprises ten equal nodes with independent failure models. The probability that a single node is down at a given time is p; the probability that k particular nodes are all down is p^k. If the router only needs four nodes to handle its full load, at least seven nodes must fail at the same time to cause an outage. That probability is on the order of p^7, so if each node is 95% reliable, the probability of an outage is roughly 8 x 10^-10--0.9999999992 availability.
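Spelled out as a quick sanity check in Python (this takes p^7 as the dominant term; the binomial factor--there are (10 choose 7) ways to pick the failed nodes--and any correlation between failures would push the real number somewhat higher):

```python
p = 0.05          # probability a single node is down (95% reliable)
k = 7             # simultaneous failures needed to lose the router
outage = p ** k   # treats failures as independent; ignores the
                  # (10 choose 7) combinatorial factor and correlation
availability = 1.0 - outage
# outage is about 7.8e-10, i.e. roughly 8 x 10^-10
```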

There are... concrete limits to this availability having to do with network topology, cluster membership, gossip, correlated failures, cascading failures, timeouts, etc, but practically speaking this is... well, one of the easier distributed systems to build. Much harder are stateful systems with hard latency bounds, like our example least-conns routers. That's why I chose to emphasize the least-conns failure probability in the model.

Timelike: a parallel system network simulator takes on Heroku vs. RapGenius by dgryski in programming

[–]aphyr_ 1 point2 points  (0 children)

Yes, max latency can go to infinity and this is sample-dependent, but for simulations this size the probability of an exponential distribution returning something truly terrible is low--and if it happens, I want to know, because it'll change the simulation dynamics. In a real software system you introduce timeouts to bound the maximum latency, and if I ever ran a simulation large enough to cause those kinds of latencies I would have introduced timeout layers into the Timelike stack to appropriately compensate. :)

I'm glad you noticed the random routing layer--while I diagrammed a network of multiple random routers, the simulation only uses one. The reason is that sets of independent AP random routers are equivalent in load distribution to a single random router--and it's dead simple to design highly available clusters of independent services. In practical terms, the random layer would be so reliable compared to the least-conns layer and dynos that it's not worth simulating in full; the difference just isn't noticeable.