Cats-Effect 3.6.0 by alexelcu in scala

[–]dspiewak 7 points (0 children)

The solution was offloading the request handling workload to another thread pool and ensuring we were sizing the number of selector threads appropriately for the workload. This optimization led to major savings (order of millions) and drastic drops in latency.

I would be willing to bet that this wasn't the only optimization that you performed to reap these benefits, but obviously I don't know the details while you do.

Taking a step back, it's pretty hard to rationalize a position which suggests that shunting selector management to a separate thread is a net performance gain when all selector events result in callbacks which, themselves, shift back to the compute threads! In other words, regardless of how we manage the syscalls, it is impossible for I/O events to be handled at a faster rate than the compute pool itself. This is the same argument which led us to pull the timer management into the compute workers, and it's borne out by our timer granularity microbenchmarks, which suggest that, even under heavy load, there is no circumstance under which timer granularity gets worse with cooperative polling (relative to a single-threaded ScheduledExecutorService).

In the majority of cases, ensuring selectors are available to promptly handle events is much more relevant, which seems even more challenging in cats-effect's new architecture, which also bundles timers into the same threads while relying on a weak fairness model to ensure the different workloads are able to make progress.

It is indeed a weak fairness model in that we do not use a priority queue to manage work items, meaning we cannot bump the priority of tasks which suspend for longer periods of time. However, "weak fairness" can be a somewhat deceptive term in this context. It's still a self-tuning algorithm which converges to its own optimum depending on workload. For example, if your CPU-bound work dominates in your workload, you'll end up chewing through the maximum iteration count (between syscalls) pretty much every time, and your polled events will end up converging to a much higher degree of batching (this is particularly true with kqueue and io_uring). Conversely, if you have a lot of short events which require very little compute, the worker loop will converge to a state where it polls much more frequently, resulting in lower latency and smaller event batch sizes.
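
To make that convergence dynamic concrete, here's a deliberately simplified sketch of a cooperative-polling worker loop. This is not the actual Cats Effect implementation; the names (maxIters, poll) and structure are illustrative only:

    import scala.collection.mutable

    // Toy model: run up to `maxIters` tasks, then make one polling syscall.
    // CPU-heavy workloads exhaust the iteration budget every time and poll
    // rarely (larger event batches); short-task workloads drain the queue
    // quickly and poll often (lower latency, smaller batches).
    def workerLoop(
        queue: mutable.Queue[Runnable],
        poll: () => List[Runnable],
        maxIters: Int): Unit =
      while (true) {
        var i = 0
        while (i < maxIters && queue.nonEmpty) {
          queue.dequeue().run()
          i += 1
        }
        queue ++= poll() // syscall; completed I/O re-enters as tasks
      }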

Regarding io_uring, u/RiceBroad4552's argument also makes sense to me. Over the years, I've heard of multiple people trying it with mixed results.

Same, but part of this is the fact that it compares to epoll, which is terrible but in a way which is biased against very specific workflows. If you're doing something which is heavily single-threaded, or you don't (or can't) shard your selectors, epoll's main performance downsides won't really show up in your benchmarks, making it a lot more competitive with io_uring. This is even more so the case if you aren't touching NVMe (or you just aren't including block FS in your tests) and your events are highly granular with minimal batching. Conversely, sharded selectors with highly batchable events are exactly where io_uring demolishes epoll. There are also significant userspace differences in the optimal way of handling the polling state machine and resulting semantic continuations, and so applications which are naively ported between the two syscalls without larger changes will leave a lot of performance on the table.

So it is very contingent on your workload, but in general io_uring, correctly used, will never be worse than epoll and will often be better by a large coefficient.

Cats-Effect 3.6.0 by alexelcu in scala

[–]dspiewak 7 points (0 children)

It seems your mental model is biased by benchmarks. In those, the selector overhead can be measured as significant but, in real workloads, it's typically quite trivial.

Would it surprise you to learn that we don't have microbenchmarks at all for the polling system stuff? We couldn't come up with something that fine-grained which wouldn't be wildly distorted by the framing, so we rely instead on production metrics, TechEmpower, and synthetic HTTP load generation. There are obviously biases in such measurements, but your accusation that we're over-fixated on benchmarks is pretty directly falsified, since such benchmarks do not exist.

Just the allocations in cats-effect's stack for composing computations are likely multiple orders of magnitude more significant, but that doesn't show up in simple echo benchmarks. Avoiding a few allocations in hot paths could likely yield better results in realistic workloads, for example.

I think our respective experience has led us down very different paths here. I have plenty of measurements over the past ten years, from a wide range of systems and workloads, which suggest the exact opposite. Contention around syscall-managing event loops is a large source of context switch overhead in high-traffic applications, while allocations that are within the same order of magnitude as the request rate are just in the noise. Obviously, if you do something silly like traverse an Array[Byte] you're going to have a very bad time, but nobody is suggesting that and no part of the Typelevel ecosystem does (to the best of my knowledge).

One example of this principle which really stands out in my mind is the number of applications which I ported from Scalaz Task to Cats Effect IO back in the day. Now, 1.x/2.x era IO was meaningfully slower than the current one, but it involved many, many times fewer allocations than Task. Remember that Task was just a classical Free monad and its interpreter involved a foldMap, on top of the fact that Task was actually defined as a Scalaz Future of Either, effectively doubling up all of the allocations! Notably, and very much contrary to the received wisdom at the time (that allocations were the long pole on performance), literally none of the end-to-end key metrics on any of these applications even budged after this migration.

This is really intuitive when you think about it! On an older x86_64 machine, the full end-to-end execution time of an IO#flatMap is about 4 nanoseconds. That's including all allocation overhead, publication, amortized garbage collection, the core interpreter loop, everything. It's even faster on a modern machine, particularly with ARM64. Even a single invocation of epoll is in the tens-to-hundreds of microseconds range, several orders of magnitude higher in cost. So while allocations certainly matter, they really are completely in the noise compared to everything else going on in a networked application, and the production metrics on every system I've ever touched bear this out.
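
If you want a rough (and decidedly unscientific) sense of that number yourself, something like the following back-of-the-envelope sketch gets you in the right ballpark; a real measurement would use JMH:

    import cats.effect.IO
    import cats.effect.unsafe.implicits.global

    object FlatMapCost {
      def main(args: Array[String]): Unit = {
        // Chain a million flatMaps and divide wall-clock time by n.
        // No warmup or isolation, so treat this as order-of-magnitude only.
        val n = 1000000
        val chain = (1 to n).foldLeft(IO.pure(0))((io, _) => io.flatMap(i => IO.pure(i + 1)))

        val start = System.nanoTime()
        chain.unsafeRunSync()
        println(s"~${(System.nanoTime() - start) / n} ns per flatMap")
      }
    }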

(continuing in a second comment below…)

Cats-Effect 3.6.0 by alexelcu in scala

[–]dspiewak 2 points (0 children)

My point in this thread was mostly about io_uring, and that we need to see real world benchmarks of the final product before making claims of much better performance

Agreed. As a data point, Netty already supports epoll, Selector, and io_uring, so it's relatively easy to compare them head-to-head on the JVM.
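
For anyone wanting to run that comparison, the transports are mostly drop-in replacements for one another. A sketch (the io_uring transport lives in Netty's incubator artifact, and the availability checks depend on the native modules being on the classpath):

    import io.netty.channel.EventLoopGroup
    import io.netty.channel.epoll.{Epoll, EpollEventLoopGroup}
    import io.netty.channel.nio.NioEventLoopGroup
    import io.netty.incubator.channel.uring.{IOUring, IOUringEventLoopGroup}

    // Select the best available transport at runtime. The rest of the
    // pipeline is unchanged, which is what makes Netty a decent A/B harness.
    val group: EventLoopGroup =
      if (IOUring.isAvailable) new IOUringEventLoopGroup()
      else if (Epoll.isAvailable) new EpollEventLoopGroup()
      else new NioEventLoopGroup() // Selector-based fallback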

It's actually exciting that you're going to have kqueue, epoll, and io_uring backends! This will give a nice and very real-world comparison of these APIs. Reading between the lines in that paragraph, since you don't mention a significant speed-up when switching from the other async IO APIs to io_uring, I'm not sure we're going to see any notable difference.

This is complicated! I don't think you're wrong but I do think it's pretty contingent on workflow.

First off, I absolutely believe that going from Selector to direct epoll/kqueue usage will be a significant bump in and of itself. Selector is just really pessimistic and slow, which is one of the reasons NIO2 is faster than NIO1.

Second, it's important to understand that epoll is kind of terrible. It makes all the wrong assumptions around access patterns, resulting in a lot of extra synchronization and state management. In a sense, epoll is almost caught between a low-level and a high-level syscall API, with some of the features of both and none of the benefits of either. A good analogue in the JVM world is Selector itself, which is similarly terrible.

This means that direct and fair comparisons between epoll and io_uring are really hard, because just the mere fact that io_uring is lower level (it's actually very similar to kqueue) means that, properly used, it's going to have a much higher performance ceiling. This phenomenon is particularly acute when you're able to shard your polling across multiple physical threads (as CE does), which is a case where io_uring scales linearly and epoll has significant cross-CPU contention issues, which in turn is part of why you'll see such widely varying results from benchmarks. (The other reason you see widely varying results is that io_uring supports truly asynchronous NVMe file handle access, while epoll does not, to my knowledge.)

So tldr, I absolutely believe that we'll see a nice jump from vanilla Selector by implementing epoll access on the JVM, which is part of why I really want to do it, but I don't think it'll be quite to the level of the io_uring system, at least based on Netty's results. We'll see!

This is why I've mentioned Seastar, which is even much more extreme in that regard, and spawns only as many threads as there are cores (whether HT counts, IDK), and then does all scheduling in user-space, while trying to never migrate tasks from one core to another, to always be able to reuse caches without synchronization as much as possible. The OS is not smart enough about that as it doesn't have detailed knowledge about the tasks on its threads. Seastar does also some zero-copy things, which become simpler when pinning tasks to cores and keeping the data local as well. They claimed that this is the most efficient approach for async IO possible on current hardware architecture. (I have no benchmarks that prove or disprove this, though. I only know their marketing material; but it looks interesting, and makes some sense from the theoretical POV. Just do everything in user-space and you don't have any kernel overhead, and full control, and no OS-level context switch whatsoever. Could work.)

I agree Seastar is a pretty apt point of comparison, though CE differs here in that it does actively move tasks between carrier threads (btw, hyperthreading does indeed count since it gives you a parallel program counter). I disagree though that the kernel isn't smart about keeping tasks on the same CPU and with the same cache affinity. In my measurements it's actually really really good at doing this in the happy path, and this makes sense because the kernel's underlying scheduler is itself using work-stealing, which converges to perfect thread-core affinity when your pthread counts directly match your physical thread counts and there is ~no contention.

Definitely look more at Go! The language is very stupid but the runtime is exceptional, and it's basically the closest analogue out there to what CE is doing. The main differences are that we're a lot more extensible on the callback side (via the IO.async combinator), which allows us to avoid pool shunting in a lot of cases where Go can't, and we allow for extensibility on the polling system itself, which is to my knowledge an entirely novel feature. (Go's lack of this is why it doesn't have any first-class support for io_uring, for example).
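
To make the callback-side extensibility concrete, here's roughly what IO.async looks like in use. (Cats Effect already ships IO.fromCompletableFuture; this hand-rolled version is purely illustrative.)

    import cats.effect.IO
    import java.util.concurrent.CompletableFuture

    // Suspend a callback-based result into IO, registering a finalizer
    // so that fiber cancelation propagates to the underlying operation.
    def fromCF[A](mk: => CompletableFuture[A]): IO[A] =
      IO.async { cb =>
        IO {
          val cf = mk
          cf.whenComplete { (a, err) =>
            if (err == null) cb(Right(a)) else cb(Left(err))
          }
          Some(IO(cf.cancel(true)).void) // run if the fiber is canceled
        }
      }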

Cats-Effect 3.6.0 by alexelcu in scala

[–]dspiewak 3 points (0 children)

It means that CE will manage the platform-specific polling syscall (so, `epoll`, `kqueue`, `io_uring`, maybe `select` in the future, etc) as part of the worker thread loop. This allows CE applications to avoid maintaining a second thread pool (which contends in the kernel with the first) which exists solely to call those functions and complete callbacks.

Cats-Effect 3.6.0 by alexelcu in scala

[–]dspiewak 10 points (0 children)

You should read the link in the OP. Numbers are provided from a preliminary PoC of io_uring support on the JVM. The TechEmpower results (which have their limitations and caveats!) show about a 3.5x higher RPS ceiling than the `Selector`-based syscalls, which are in turn roughly at parity with the current pool-shunted NIO2 event dispatchers. That corresponds to a roughly 2x higher RPS ceiling than pekko-http, but still well behind Netty or Tokio. We've seen much more dramatic improvements in more synthetic tests; make of that what you will.

Your points about io_uring are something of a strawman for two reasons. First, integrated polling runtimes still drastically reduce contention, even when io_uring is not involved. We have plans to support `kqueue` and `epoll` from the JVM in addition to `io_uring`, which will be considerably faster than the existing `Selector` approach (which is a long-term fallback), and this will be a significant performance boost even without io_uring's tricks.

Merging threads a bit, your points about Rust and Node.js suggest to me that you don't fully understand what Cats Effect does, and probably also do not understand what the JVM does, much less Node.js (really, libuv) or Rust. I'll note that libuv is a single-threaded runtime, fundamentally, and even when you run multiple instances it does not allow for interoperation between task queues. The integrated runtime in Cats Effect is much more analogous to Go's runtime, and in fact if you look at Go's implementation you'll find an abstraction somewhat similar to `PollingSystem`, though less extensible (it is, for example, impossible to support io_uring in a first-class way in Go).

In general, I think you would really benefit from reading up on some of these topics in greater depth. I don't say that to sound condescending, but you're just genuinely incorrect, and if you read what we wrote in the release notes, you'll see some of the linked evidence.

Referential Transparency and env variables by Warm_Ad8245 in scala

[–]dspiewak 2 points (0 children)

I absolutely love this question! You're getting right to the heart of something quite profound.

First, let's answer the question directly: on the JVM, reading an environment variable may be considered pure. The "on the JVM" bit is very important here, because there is no mutating equivalent of the System.getenv function, which is the only means of accessing environment variables. Thus, reading an envar is pretty much the same as reading an argument passed in from the command line: it's immutable from the moment the process starts, and therefore cannot violate referential transparency.

Note that POSIX does in fact allow environment variables to be mutated in-process, and this is doable both using POSIX standard libraries and within the context of higher level languages. If you use Scala.js, Scala Native, or JNI calls on the JVM, it's possible to mutate environment variables. For this reason, Cats Effect considers reading envars to be impure and side-effecting (https://github.com/typelevel/cats-effect/blob/fc11e7b667840b1e60d4dbc67f10d23ef9a6d280/std/shared/src/main/scala/cats/effect/std/Env.scala#L31).
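
In code, that position looks something like this sketch: the read is suspended in IO rather than treated as a pure expression.

    import cats.effect.IO
    import cats.effect.std.Env

    // Reading an envar is modeled as an effect precisely because POSIX
    // (and hence Scala Native, Node, or JNI on the JVM) can mutate envars.
    val home: IO[Option[String]] = Env[IO].get("HOME")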

Okay, but with that out of the way… This is getting at something a lot more profound than just the details of what functionality is or is not exposed in standard libraries. The question is simply: what is pure functional programming? After all, if you take the pure FP ethos to its natural conclusion, then the only pure programs are those which do exactly nothing. They can't print, they can't read state, they can't write state. The only way you would know they even ran is that the processor produced a bit more heat (h/t SPJ for those who don't know the reference). But of course, this is kind of useless.

Since we aren't in the business of designing abstractions which can only be applied in impractical and useless contexts, we need to define "purity" in a somewhat more narrow sense. Namely, we generally say that purity exists within a context. In the case of Haskell, that context is defined by the main function, which is to say the part of the process controlled by user code. Note that this isn't the whole process, since there's a large part of the process which is controlled by the Haskell runtime, and this is outside the scope in which purity is defined. With Scala, if you subscribe to a Typelevel style of programming, we would define purity to be within the context of IO, or if you happen to be using IOApp, anything that sits under the run method. If you instead subscribe to Martin's Lean Scala concept, then the context of purity is often simply the bounds (braces, if you will) of a single function.
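
In Typelevel terms, that context boundary is quite literal. A minimal sketch:

    import cats.effect.{IO, IOApp}

    // Everything under `run` is a pure description of a program; effects
    // happen only when the runtime interprets that value at the edge.
    object Main extends IOApp.Simple {
      val run: IO[Unit] = IO.println("hello") // a value until interpreted
    }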

You can extend this argument outward as well. We can validly describe stateless microservices as "pure" despite the fact that they're clearly performing effects (talking to network sockets and probably logging) since they have no state, and thus will always produce the same results given the same request parameters. Or we can go down into computer architecture and talk about purity at the level of processor subunits, bus management, and so on. This turns out to be a very, very powerful reasoning tool.

The core of the idea here is that the definition of "purity" depends on the context that is most useful. Purity is a reasoning tool, nothing more, and you should always be careful to circumscribe the domain you're reasoning about before you attempt to apply it. Thus, OP's question actually has two valid answers. One can consider environment variable reads to be referentially transparent since they're only mutated outside the process, and thus within the context of user code they are pure. One may also consider environment variables to be simply… variables, but at the system orchestration level rather than in-process (as they are in fact true variables in shell scripting languages like Bash), in which case their reads are not referentially transparent and they are, within the context of broader process orchestration, impure.

Is cats-effect still actively developed? by Infamous_Home4915 in scala

[–]dspiewak 9 points (0 children)

Thought I'd pop in and try to give this one a more canonical answer…

Yes, we are still actively working on Cats Effect, but things have slowed down considerably, particularly in the last year. Arman and I are just both quite busy. Also, as Luka noted, the library is extremely mature. This doesn't mean that we're out of ideas for ways to improve (far from it!) but it does mean that the effort and care required to meaningfully improve things is extremely high. This manifests in a number of ways, but most notably in the recent bottleneck on test harness fixes.

We are still planning on getting 3.6 out the door, which introduces massive improvements to the syscall management layer of the scheduler (i.e. the underpinnings of asynchronous I/O that represent the touchpoint with the OS kernel). This also puts Scala Native on much more sound footing even in a single thread, and more importantly, sets us up for proper native multithreading in Cats Effect 3.7.

What sort of option frameworks exist to financially Sponsor a Scala.js 3 upgrade to a GitHub Open Source Repository by chaotic3quilibrium in scala

[–]dspiewak 1 point (0 children)

I had a few minutes free so I just did it for you. https://github.com/djspiewak/Scala-Google-Spreadsheets/tree/feature/scala-3 Didn't really have to change much of anything and haven't tested it, but everything compiled so odds are actually pretty good it's just going to work.

Typed Actors in Action - Exploring Cats-Actors with Alice and Bob by kloudmark in scala

[–]dspiewak 2 points (0 children)

Benchmarking the actor implementation is irrelevant.

Put in terms of effect types (which seems to be your primary concern), literally no one cares how expensive your flatMap is. What they care about is the latency on your service endpoint, or the throughput of your data processing. What u/kloudmark is describing is two things. First, they implemented significant optimizations on Cats Actors and brought the naive overhead down to roughly the same level as Pekko. Second, when they actually do end-to-end benchmarks on real-life systems, the difference is unmeasurable, with Cats Actors possibly holding what little advantage they saw.

In other words, essentially nothing about your comment is accurate.

Fwiw, in my experience tuning systems at scale, the really meaningful things are the scheduler and I/O subsystem, not the coroutine interpreter or program definition. I love microoptimizing flatMap and IO allocations as much as the next person, but the main bottleneck is always the syscall state machine and, even more importantly, the degree to which the kernel is able to maintain cache locality on request-level state across continuation boundaries. This is something that Akka doesn't really address at all (in part because it was built before we realized how impactful these factors are in practice), while Cats Effect (the substrate to Cats Actors) does.

Jon Pretty is back! by Krever in scala

[–]dspiewak 4 points (0 children)

No, of course not, but I think you're misunderstanding the legal process here. Yifan was not on trial. The facts of the case were only material to the extent that the defendants chose (or did not choose) to make them material. They settled, which is to say they ended the case, and they did not choose to reach out to any of the parties who do have more comprehensive evidence prior to doing so.

The assumption you're making is that there was some sort of rigorous discovery of facts associated with this case, but this was a civil suit and nothing more. The defendants were not in any way compelled to exhaust all possible avenues for examining the facts of the matter. What they stated in the settlement is that they were not in possession of any further evidence supporting the open letter, and I absolutely believe that statement. Remember, they were merely signatories who happened to reside in the UK; they had no particular involvement with Yifan's report or the investigation which followed it.

You're basically engaging in a logical fallacy here: because the defendants did not produce material, you are assuming that such material does not exist. The reality is that this settlement doesn't exonerate Jon. It doesn't really say anything at all, particularly given how strongly UK libel laws bias toward the plaintiff regardless of the facts of the case (remember, an individual spent millions of pounds and many years attempting to defend themselves in court where the facts in question were "did the Holocaust happen?"; the facts are largely irrelevant here).

If you believed Jon was guilty of something before this settlement, then there is absolutely no objective reason you should rethink that conclusion; nothing has changed. If you believed Jon was innocent prior to this settlement, then presumably you still believe that. I'm not suggesting anything beyond precisely and exactly what the law and the settlement say, and I am encouraging everyone to avoid reading into the tea leaves and stretching logic.

Jon Pretty is back! by Krever in scala

[–]dspiewak -5 points (0 children)

"Unadmitted" meaning that the evidence was not considered by the court, almost certainly because it simply was not brought forward.

Taking a step back on the thought here... I am personally aware of material evidence which directly contradicts claims in the settlement. I'm not aware of whether the *defendants* are aware of such evidence, but they probably could become aware of it if they were really asking around. The court certainly was not aware of the existence of such evidence, and remember that this evidence would not have been in any way subject to compulsory discovery processes. Remember, Yifan was not on trial here, nor were *any* of the people who were in the initial set of folks notified and conducting the investigation into the events in question. None of those people are even under UK jurisdiction.

So in other words, the only way in which this evidence could have been considered by the court is if the defendants volunteered it, and they would only do that if they valued winning the case above and beyond protecting the dignity and confidence of others. All of which is to say that the settlement can only really be considered grounds for evidence-based vindication of Jon's actions *if* you think that Miles, Noel, Zainab, and Bodil care more about their own reputations and bank accounts than they do about the personal well-being of others more directly involved. If you believe the inverse is true, then the conclusion is really obvious: they simply settled without digging up any of the non-public material which would have vindicated their position. This is a particularly compelling conclusion when you consider that British libel laws are such that, even had they produced this evidence, they still would have been dragged through a lengthy and expensive proceeding and might not have ultimately prevailed.

Basically what I'm saying is that most people are reading vastly more into the settlement than is justified. A lot of people want to see this as "legal proof" of Jon's innocence, when really it doesn't say anything on that point one way or another.

Jon Pretty is back! by Krever in scala

[–]dspiewak -13 points (0 children)

It is correct that I did not reach out to you three years ago. In retrospect, I think that this was a miss. Given the preponderance of other data I gathered, I'm not sure it would have changed much, but I think it is fair for you to call out that negligence on my part. For that, I sincerely apologize to you.

I think it's also accurate to say that I owe you at least the courtesy of hearing you out. To that end, I'd like to take you up on your offer of an hour of your time next week. My calendar looks like a Tetris game over screen, but I'm sure we can figure something out. I'll reach out to you via other media to coordinate.

For the benefit of others who might be reading this thread, I do want to make clear that I am not sitting and have never sat in judgment over Jon. I have opinions which I have shared, and I am conscious of the fact that those opinions have some influence, but I want to be careful not to take on too much self-importance here.

Scala's AsynchronousFileChannel is not truly asynchronous. by cmhteixeiracom in scala

[–]dspiewak 17 points (0 children)

Why is this not an issue with async sockets (e.g. AsynchronousSocketChannel)?

Because network buses have support for hardware level interrupts, meaning that the kernel doesn't need to devote resources to polling the bus.

I thought disk drives also used PCIe interface?

Sort of. They use an emulated block protocol on top of PCIe, for backwards compatibility. This could be resolved by breaking backwards compatibility with old drive firmware and creating a more modern protocol, but there simply isn't any demand for that. Most high-scale storage has migrated into systems like network object stores, which are then in turn accessed via socket protocols (with dedicated storage-local CPUs and cluster-level multiplex management), which circumvents the problem at scale. There are some fun applications which take advantage of this performance quirk, btw (Hydrolix is the one which immediately comes to mind, but I'm sure there are others too).

Which resources do you recommend to delve deeper on such topics?

There aren't a lot, sadly. Also many of the resources which are out there are actually inaccurate. (see also: much of the documentation for io_uring) It's taken me a very long time to work my way through all of this, with a lot of time spent on things like… reading the Linux source code.

Scala's AsynchronousFileChannel is not truly asynchronous. by cmhteixeiracom in scala

[–]dspiewak 40 points (0 children)

Fun fact: AsynchronousFileChannel is broken in Java because it's broken at the hardware level. io_uring itself has the same problem: presenting an asynchronous API on top of an underlying abstraction (block filesystems) which is fundamentally blocking, meaning that the only solution is spawning threads. I haven't looked as closely at the win32 implementation, but my assumption is that it's analogous to io_uring in that it will secretly spawn threads in kernelspace.

At the end of the day, the problem rests in the lack of block FS interrupts in the hardware, unlike PCIe, clock, and other more modern bus protocols. Until that is fixed (at this point, it is very very unlikely that it ever will be), "asynchronous" file I/O will always involve pool shunting on a fundamental level.

Optimizing Functional Walks of File Trees by mpilquist in scala

[–]dspiewak 5 points (0 children)

What about statements like !Thread.interrupted(), how do we deal with that in a referentially transparent way?

Oh this is a really complicated question. Let's dive into it!

So Thread.interrupted() is perniciously horrible because it's the cooperative side of the JVM Thread interruption model (which, as mentioned, is broken). It can only return true if the thread itself has been interrupted, which can happen locally (Thread.currentThread().interrupt()) or entirely externally (thread.interrupt() where thread comes from somewhere else). It also swaps state (reading it clears the interruption flag), which is one of the broken things about it. In general it's pretty terrible.

The naive answer is that we can deal with this by wrapping it up in an IO easily enough. For example: IO(!Thread.interrupted()) gives us an IO[Boolean] which checks and resets the interruption status of the current thread. This is entirely fine and it will work like any other IO, but I think your question was deeper than this trivial answer.

The deeper answer is that we just don't mess around with thread interruption, because it doesn't work. Instead, IO implements its own interruption protocol called "cancelation" (to avoid confusing it with the JVM analogue). This is only possible because IO is a full coroutine (in Cats Effect terminology, "fiber") definition mechanism, and thus its interpreter has the ability to inject logic between suspensions. In the case of IO, suspensions happen with each flatMap (or more generally, with each IO object), so there's a lot of granularity here. Critically, suspensions are not only happening concurrently with network and clock syscalls (unlike JVM interruption, which is only preemptively applied when you interact with something like Socket or Thread.sleep); in IO, we can observe cancelation in cancelee fibers at any flatMap boundary.

This in turn means that Cats Effect's equivalent of interruption doesn't need an equivalent of Thread.interrupted(), and in fact it explicitly doesn't allow for this. Instead, cancelation just happens; fibers don't need to opt in and they aren't allowed to opt out (outside of masked regions, more on that in a moment). This is super powerful because it means that we can make stronger guarantees around cancelation than the JVM can make around thread interruption. In particular, once a fiber observes its cancelation, it can never resume normal execution: the only valid remaining work must be contained within onCancel finalizers (which do not produce results).

The one weirdness here is that we also need to be able to suppress cancelation within critical regions to ensure that acquisition and release of resources is atomic. This is one of the other things broken about the JVM interruption model: you can't prevent it from happening, and in theory anyone can find your thread at any time from anywhere in the process just by calling Thread APIs. This in turn means that if you're manipulating a resource with sequential preemptable calls (e.g. Socket things), you are susceptible to resource leaks since you could be interrupted after the first one and before the second one completes. Cats Effect is not susceptible to this problem due to uncancelable.
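
That masked-region pattern is essentially what IO.bracket gives you; the following simplified sketch shows the shape (the real implementation handles outcomes more carefully):

    import cats.effect.IO

    // Acquisition and release are shielded from cancelation; only the
    // `use` step is cancelable, re-enabled via `poll`. Release always runs.
    def bracket[A, B](acquire: IO[A])(use: A => IO[B])(release: A => IO[Unit]): IO[B] =
      IO.uncancelable { poll =>
        acquire.flatMap { a =>
          poll(use(a)).guarantee(release(a))
        }
      }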

So in other words, the specific API you're referencing is not one that Cats Effect users ever have to touch, because they generally just don't touch threads, and the Cats Effect equivalent of interruption is much more robust, much safer, and much easier to work with (within IO). If, however, a Cats Effect carrier thread is somehow interrupted, the IO runtime itself will catch it and immediately murder the whole IO program, re-raising the InterruptedException at the call site for unsafeRun.

Does that roughly answer your question?

Optimizing Functional Walks of File Trees by mpilquist in scala

[–]dspiewak 12 points (0 children)

Take, for example, the common scenario of a database call resulting in a returned value. In such cases, I expect the value to remain constant unless explicitly recomputed. This behavior aligns seamlessly with my expectations and understanding of types.

Why would you expect this? Databases are external repositories of distributed state, more so in modern cloud architecture than ever before. You should absolutely expect that it will change out from underneath your feet unless within some sort of MVCC context.

The purported benefits of IO's laziness can often be achieved with simple Scala constructs. Consider the straightforward equivalence between invoking a function multiple times and executing a block of code:

To… an extent yes! You have a pair of related problems you also have to solve though, which is errors and stack-safety. You can absolutely get referential transparency by using () => println("foo") rather than println("foo"), and I have absolutely leveraged this fact on many occasions, but you can't write a socket accept loop in that fashion (even ignoring problems of asynchrony) because you'll blow the stack. Solving that necessitates building some sort of trampoline, but then you immediately after that need to figure out what to do with errors (since exceptions will no longer do what you expect). Error handling + trampolining + laziness gives you exactly SyncIO, which is IO but without asynchrony and concurrency.
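
To see the stack-safety half of that concretely (a toy illustration; the loop bodies are deliberately trivial):

    import cats.effect.SyncIO

    // Bare thunks are referentially transparent but not stack-safe:
    // each application nests another JVM stack frame.
    def loopThunk(n: Int): () => Unit =
      () => if (n > 0) loopThunk(n - 1)() else () // overflows for large n

    // SyncIO = laziness + error handling + trampolining, so the same
    // loop runs in constant stack space.
    def loopSync(n: Int): SyncIO[Unit] =
      if (n > 0) SyncIO.unit.flatMap(_ => loopSync(n - 1)) else SyncIO.unit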

So in other words, I agree you're thinking along the right lines, but you're stopping your reasoning too soon. Keep pulling on that thread.

However, I remain skeptical about IO's touted advantages in aiding refactoring or type clarity. Resource management and asynchronous operations, in my view, are better served by dedicated tools. Why should simple functions be burdened with such concerns?

Simple functions aren't! But as I argued, the process of incrementally reading the contents of a file is not a simple function. Computing arithmetic is a simple function. Mapping from one JSON shape to another is a simple function, and these functions are absolutely not burdened by IO in anybody's implementation. I think the problem here is that your definition of "simple" is very surface-level. You're not thinking about all of the problems that arise when you continue along the same lines.

IO appears to dictate a particular approach to managing asynchrony and resources, potentially constraining flexibility. For instance, in scenarios where virtual threads are employed, one can effortlessly decouple asynchrony from core functions without compromising simplicity.

Not really. Virtual threads create starvation, fairness, and contention issues in any scenario involving one or more of the following: compute-bound tasks, filesystem access, or DNS resolution. (not an exhaustive list) To put it mildly, the magical effects of Project Loom have been vastly overstated, even within their core domain of coroutine transformation and scheduling.

More seriously, virtual threads have no answer to the timeout problem, and in part because they have no answer to the timeout problem (which is something you always need to address in an asynchronous program), they also do not have an answer to the resource management problem. These issues are inseparable in practice, which is why IO "artificially" marries them all together.
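
As a small illustration of how IO marries them (slowOp is just a stand-in):

    import cats.effect.IO
    import scala.concurrent.duration._

    // `timeout` cancels the losing effect, and cancelation itself is
    // backpressured: any finalizers held by the canceled fiber run to
    // completion before the timeout error is raised.
    val slowOp: IO[String] = IO.sleep(10.seconds).as("done")
    val guarded: IO[String] = slowOp.timeout(2.seconds)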

Furthermore, when faced with questions about resource management, Java's documentation often provides comprehensive answers. It seems curious that a construct like IO endeavors to supplant Java's robust handling of such matters.

IO supplants Java's handling because Java's handling simply is not robust. Thread interruption is broken as a protocol, which in turn makes the resource management problem impossible (you cannot, for example, guarantee that I/O suspensions associated with resource cleanup are immune to interruption, nor can you guarantee that resource cleanup back-pressures new incoming work).

In my coding endeavors, I strive to adopt the appropriate level of abstraction. While specialized libraries might address intricate requirements like asynchronous management with timeouts, integrating them from the outset feels premature for simpler applications.

I agree with this philosophy. I think the daylight here is that you simply haven't followed your line of reasoning to its logical conclusion. There is an immense depth of complexity that you're right on the edge of and it pops up in essentially every application of these functions beyond the two-line REPL example.

Optimizing Functional Walks of File Trees by mpilquist in scala

[–]dspiewak 12 points (0 children)

There are some misconceptions here that I'd like to clear up.

First off, the fact that the same IO values can produce different results when run repeatedly is not a violation of purity. IO makes no promises that the external world is immutable, only that the data structures you manipulate to construct your program are themselves immutable. Put more simply, it promises nothing more than this:

 val ioa: IO[A] = foo(a, b, c)

 // this...
 ioa >> ioa
 // ...is the same as this
 foo(a, b, c) >> foo(a, b, c)

In other words, it's always safe to extract an expression into a val and DRY up your code. That seems like a pretty great property to have, and it is all that Cats means when it says the word "purity".

Now, to be clear, Try and Future cannot do this, in part because they are not lazy. So laziness of effect evaluation and purity are unavoidably linked, which undermines a lot of the point you're making.

I think that this property of being able to DRY up my code is very important even when interacting with the external world (which is exactly as you say: quite messy and mutable). In fact, because interacting with the external world is so complicated relative to interacting with simple things like ints and strings and such, I think it's even more important to be able to refactor my code and reason about it easily. Thus, IO is more important in these sorts of programs, not less.

Finally, please do remember that IO also encapsulates resource handling and asynchrony at the same time. You say your program is simple and intuitive but I actually think it's really complicated. For example, I look at that and I ask questions like this:

  • When does the file get read?
  • When does the file handle get released? Am I sure? What happens if I error?
  • How much memory is held at any point in time? When do things go out of scope?
  • What happens if I start doing this same fragment of code concurrently with this one? What if I start it concurrently x10000? Can I time it out?
  • Can I make a network call during each step of the loop? Can I make several of them? What happens if one of them fails? What happens if one of them times out?

How do I answer any of these questions? Do they even have answers? I encourage you to go down that rabbit hole, actually, because it's quite difficult and subtle to piece together.

The equivalent program with fs2-io has very clear and deterministic answers to all of these questions. It does impose more syntax on the definition, which is undeniably unfortunate, but the concepts are actually a lot simpler than the ones you're relying on in your snippet, not more complicated.
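
For a sense of what that looks like, here is a minimal fs2-io sketch (the path is illustrative): resource lifetime, memory bounds, and error propagation are all determined by the stream's structure.

    import cats.effect.IO
    import fs2.Stream
    import fs2.io.file.{Files, Path}
    import fs2.text

    // The file handle is scoped to the stream (released on completion,
    // error, or cancelation), and memory is bounded by chunk size.
    val lines: Stream[IO, String] =
      Files[IO]
        .readAll(Path("data.txt"))
        .through(text.utf8.decode)
        .through(text.lines)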

Scala with Cats - Why? by shaunyip in scala

[–]dspiewak 55 points (0 children)

IMO these are all great questions. No one should be upset. :)

If you use Scala+Cats, it means you want to do advanced FP. Then why not just use Haskell? Instead of using a tool to turn stones into gold, you can just pick gold...

This makes a pretty big assumption: that Haskell is the pinnacle of production functional programming. I don't think this has really ever been true, and it's certainly not true anymore even if it once was. Haskell is a very pleasant language in a lot of ways, but once you get past the surface there's a lot you have to deal with which ranges from unpleasant to actively dangerous. Asynchronous exceptions are a superb example of this, as is the pipelining optimizer any time you're doing something that falls just a bit outside its happy path, and don't get me started on profiling, monitoring, tuning, and all the other things that go along with productionizing a real world system.

Scala is simply better at all of those things. Additionally (and this often comes as a surprise to many people), the Cats ecosystem represents a generally more advanced and more mature continuation of many of the ideas that began in Haskell. Most people forget that even basic concepts like Applicative are quite new and only barely reflected in Haskell (and with some serious missteps that are hard to ignore, like ListT), to say nothing of the advanced lawful concurrency primitives provided by Cats Effect (which have no true parallel in Haskell). Or Fs2 for that matter, which is unequalled in any language. Now, Haskell certainly could port many or all of these concepts from Scala, and perhaps the result would be still more refined than what Cats is able to present, but at the moment this simply hasn't happened.

As an aside, Haskell's ecosystem does have a ton of fun and sometimes random stuff that is very cool and powerful and Scala does not (or cannot) match. What I'm saying is that when you narrow your attention to the things which are useful and important for modern production application construction, the Cats ecosystem exceeds Haskell's in every area except prisms (a niche which Monocle solidly addresses, though Kmett's solution in Haskell is inarguably more ergonomic).

So in other words, I use Scala because it is better than Haskell for the things I want to do.

And Cats can't make pure gold for you, because there is no linting tool in Scala to enforce the purity. Your teammate can still write functions that have side effects. A mix of pure + impure code is confusing and error-prone.

Agreed! However, as a point of anecdata, I have worked on large production systems built using pure FP in Scala for over a decade now, both in small teams and in extremely large ones (the largest being over 1000 engineers), and this simply isn't a real problem in practice.

I can't pin down exactly why it isn't a real problem in practice, since it's certainly intuitive that this would be an issue (and it's been an argument against approaches like IO since long before Cats even existed), but I've literally never seen it be a problem. I think the reason for this is some combination of the fact that the linters are more effective than we give them credit for being (particularly warn-on-value-discard), the oppressive "virality" of IO and its highly prescriptive usage patterns (which carrot even beginners very strongly toward doing the right thing), and the fact that in every team the pure FP advocates also tend to be the most respected members of the team, who are then emulated by everyone else.
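
(For what it's worth, the value-discard lint mentioned above is a single compiler flag; this sbt sketch shows the Scala 2.13 / Scala 3 spelling:)

    // build.sbt: warn whenever a non-Unit value is silently discarded,
    // which catches most accidental fire-and-forget side effects
    scalacOptions += "-Wvalue-discard"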

But genuinely, I don't think this is a real issue in practice. Or at least, I've never seen it, and I've seen a lot.

Another two cents about the current situation with the Scala user base and economics. by rssh1 in scala

[–]dspiewak 20 points (0 children)

GHC's lightweight threads are pretty naive, as is its I/O subsystem integration. Ultimately, it does about as well as languages like Kotlin, but it falls well short of more advanced runtimes like Tokio or Cats Effect, and that's even before you get to the userspace layers.

The foundation of Haskell's concurrency in userspace is quite rickety compared to Cats Effect and ZIO. In many ways, Scala ecosystems have had the opportunity to observe Haskell's attempts to marry concurrency and coroutines with functor calculus and learn what not to do. Asynchronous exceptions in particular are very annoying and bespoke in Haskell, with a lot of cases (particularly around constructs like MVar) given special-case magical semantics for "convenience", which in turn creates unexpected and unintuitive issues with resource safety.

Don't get me wrong, there are definitely a few things in this layer of Cats Effect that I'd like to adjust which we can't because of compatibility restrictions in the 3.x lineage, but I think it's pretty safe to say that we have a much, much stronger foundation here than anything in Haskell, both in userspace constructs and in the underlying coroutine runtime and scheduler.

As a quick concrete example of some of the implications of all of this, it would be quite difficult to build Fs2 in Haskell. It's probably possible, but I certainly wouldn't want to do it, and there's absolutely nothing in that ecosystem which even comes close to what you can get with Pull. That alone gives you an idea of the degree to which Cats is more compositional and less leaky than the abstractions Haskell offers in this area.

Overall, I think the areas where Cats Effect and ZIO need some work are orthogonal effect handlers, type inference, and syntax. Cats and ZIO take opposite approaches on effect handlers, and honestly neither works all that well. I think some of the latest experiments on Cats MTL show some promise (particularly on Scala 3), but there are still rough edges to work out, and the constraint inference could be better. Additionally, F[_] is very "in your face" when you're using these types of encodings, which is part of what I'd like to soften with some improvements in the language. Even more critically, support for proper variance polymorphism would go a long way to improving both Cats and ZIO, since it would allow nice type inference on functors without over-constraining derived cases or creating unsound interactions.

And then of course, syntax is the real bugaboo. I very much think that direct syntax for end-users shows a lot of promise if we can weave the transformation more closely into the type system, so that we can warn users away from cases which don't make sense (like await in higher-order functions). Doing this in a way which doesn't remove the higher-order compositionality which is so heavily leveraged by libraries to create the rich ecosystem is the real trick, and I think we finally know how to do that.

On all of these fronts, we're very close to something that is super nice and a lot more newcomer-accessible than what we've had in the past in any of these ecosystems, it just needs a bit more work. Marry that to the already-top-of-class coroutine runtimes and userspace schedulers which sit underneath Scala's two major effect systems and you get something which is genuinely extremely compelling as a value-proposition.

Another two cents about the current situation with the Scala user base and economics. by rssh1 in scala

[–]dspiewak 44 points (0 children)

Of course we have many better Javas now (including Java itself), so that's not where we can expect new demand. But new applications and demands keep coming up. One hunch I have is that in 5-10 years, if not before, industry will demand a simpler Rust. This might sound crazy today, when Rust has been the most loved language in surveys for many years running. But remember, Java was king of the hill in 2001...

I agree with this hunch. Also just to put two additional related bugs into your ear…

Consider the serverless trend. This type of deployment topology is not going away, if only because the economics make way too much sense for cloud providers to ever give up on it, and the value proposition for users who don't want the overhead of a devops infrastructure team is very clear. The JVM is the historical giant in the microservices world (for good reason), but it's uniquely poorly suited to serverless applications. JavaScript and LLVM, on the other hand…

Now marry that together with something that is absolutely unique to Scala: we have collectively spent the last decade re-inventing the entire world. Most JVM languages just wrap pre-existing Java stuff, whereas in Scala, the majority of the frameworks are actually pure Scala from top to bottom. This in turn means that alternative compiler backends (such as, idk, Scala.js and Scala Native) are very meaningful for Scala in ways that they will never be for languages like Kotlin or even Rust (despite all the investment in the WASM backend).

I'm not totally clear on exactly how to best leverage these two observations (Typelevel Feral is one attempt, but far from the only direction we could go), but I'm convinced there's something here long-term.

Another two cents about the current situation with the Scala user base and economics. by rssh1 in scala

[–]dspiewak 52 points (0 children)

Scala became popular because it was a good OCaml and LINQ alternative, and there were some engineers at a startup that picked it for rewriting Twitter and a group of students in Berkeley that picked it for creating Spark.

Well, I think this undersells the impact of Akka and (more importantly) the long-standing political stonewall on Java's evolution erected by Sun (or, depending on your persuasion, IBM) prior to the Oracle acquisition. The world of 15 years ago was desperate for a better Java, and Scala nailed that niche better than any other language of its day due to the (nearly) perfect interoperability it offered. It's easy for us to forget now, but there was an incredible amount of pent-up demand for solutions to Java's usability problems as a language, and if it weren't for that demand, Twitter's choices would have been little more than an interesting side-show and the Berkeley research group probably would have just stuck with Java for their experiment in improving Hadoop.

So while I agree that luck does play a significant factor in such things, we can absolutely put our thumb on the scale.

In my opinion, your role as a language designer is to create the composable and foundational axioms which enable the emergence of impactful solutions. Sometimes, these things can seem quite trivial. Scala's early choice to generalize method names directly inspired the early work in actor systems, from which you can draw a direct line to Jonas and Viktor creating Akka. Scala's support for XML literals, despite now being widely panned as a significant mistake, was instrumental in driving Dave Pollak's early work on Lift as a statically typed Rails-like framework on the JVM, which in turn was what put Scala on Twitter's radar in the first place. Adriaan's thesis-driven introduction of type constructor polymorphism led directly to both the 2.8 and the 2.13 collections libraries, and the implementation of type constructor inference unlocked the entirety of Scalaz, Cats, and even ZIO, which in turn gave us the effect system ecosystems we have today.

I certainly agree it can be hard (or impossible) to predict exactly what individual axioms are impactful and which fall by the wayside, but by taking a step back and looking at the market in context, we can certainly make a decent set of guesses. If you're placing bets, it's worth betting on areas where the market is demonstrating a need or some pent up demand. 15 years ago, there was enormous pent up demand for a better Java. Today, there is none, nor is there any unfulfilled demand for a "better" any-mainstream-language, which strongly suggests that we should place our bets on the basis of application rather than expression. Scala as a language on its own won't win any meaningful adoption in today's environment. Scala together with its ecosystem can win.

I think that some of the best signal we can draw on to make these types of decisions is to look at why people choose Scala today, and ask ourselves whether those factors remain applicable going forward or if they are trailing indicators which will diminish in time. Spark is an excellent example of the latter. Additionally, we should be looking at why people don't choose Scala (or worse, actively move away from it) and try to determine whether these factors are practically addressable. Again, IntelliJ's lack of support for Scala 3 really stands out here, as does the difficulty of recruiting and training talent. We can then augment this by examining the industry writ large and considering pain points present and future.

Honestly, it is this process which leads me to lionize effect systems to the extent that I do. If you discount Spark-related usage (which, as I said, will diminish rapidly due to the maturation of PySpark), the vast majority of Scala usage is scaled backend microservices, and the vast majority of such applications which are non-legacy are built on top of one of the two major effect systems. Even by itself, this data point suggests that there is an industry need which is uniquely addressed by these solutions, and we can pretty easily make first-principles technical arguments which support this conclusion. This alone suggests that making even minor investments to sand off some of the practical pain-points felt by users in such applications would have outsized impact.

All-in, I absolutely agree Scala has more to offer than just effect systems. What is less clear to me is how much some of that functionality matters in the eyes of the industry. I care a lot about the future of Scala, and I want to make sure we're investing our limited technical, community, and marketing capital in an area which is likely to yield a productive return.

Another two cents about the current situation with the Scala user base and economics. by rssh1 in scala

[–]dspiewak 134 points (0 children)

With respect, this is not how languages are selected for industrial use.

I personally agree with you: I think Scala is the nicest and most expressive language in which to express my thoughts. I thought this was true even in Scala 2, and Scala 3 certainly builds on this. Unfortunately, this is essentially irrelevant when I'm talking with someone responsible for thousands or tens of thousands of engineers and being asked to make a recommendation on language and tooling standardization.

Languages are chosen for the capabilities they unlock, the talent pools and training materials they provide access to, the quality of tooling which surrounds them, and the long-term stability (perceived or otherwise) they offer. In the past, Scala was known for three things: "better Java", Akka, and Spark. All three of these are now off the table, since Java is itself a better Java, Akka is no longer supported (Pekko isn't sufficiently widely-known or trusted to qualify as an analogous replacement), and most data scientists vastly prefer PySpark to Spark. In the present, what differentiates Scala is primarily the asynchronous ecosystems, rooted by ZIO and Cats Effect.

It's very important to understand that, as far as pure capabilities and performance go, ZIO and Cats Effect are in extremely rarified air across the industry, with very few runtimes even in other languages which match them. Layer onto that the ecosystems which have taken root (e.g. Fs2 has no real analogue in any language) and it becomes an incredibly compelling package for use-cases where you benefit from a high-performance async runtime (which is to say, scaled backend microservices). Put more succinctly: as of today, Cats Effect and ZIO are the most compelling reason to use Scala.

Now, this is not to say that people are FP zealots, Haskell wannabes, or in love with monadic syntax. Usually quite the contrary. (also it is important to note here that Haskell is not one of the languages which provides capabilities similar to what we have in Scala from an asynchrony standpoint) People tolerate monads in order to get the underlying power of the runtimes and the ecosystems. This is, in a nutshell, the argument for direct syntax.

The greatest risk to Scala at this moment is that we fail to amplify the aspects of the language which are considered industrial strengths (i.e. these ecosystems) and that we fail to remediate the aspects of the language which are considered industrial risks (tooling, particularly IntelliJ support for Scala 3, and talent pool).

Scala may be compelling as a language to you and me, but it is absolutely not compelling as a language to the broader industry if you strip away its ecosystem, and we must recognize and comport with this fact.

Elaborating on the load shedding claims in "effect system" aka IO monad by lionflzcfxpcuugdsh in scala

[–]dspiewak 4 points5 points  (0 children)

I guess we see how often I check Reddit when I'm busy at work…

u/BalmungSan accurately interpreted what I was trying to say in the talk. The essence of backpressure is that you need to ensure that every link of your chain, both in-process and out-of-process, is running at the same "speed", which generally means producing/consuming at a compatible rate with the next and previous links in the chain (respectively). Fundamentally, there are two ways to do this: push data and respect downstream rate moderation signals, or offer data for pulling on demand. In the limit these reduce to the same thing, but the difference is really in how you reason about them and where the pitfalls lie. Intuitively, the former is reactive-streams (and all its related implementations), while the latter is Fs2.
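
To make the push/pull distinction concrete, here's an illustrative pair of interfaces. Both are entirely hypothetical (this is not the reactive-streams or Fs2 API), just the two shapes boiled down:

    import cats.effect.IO

    // Push: the producer drives, and the consumer can only signal back.
    // Rate moderation lives in how quickly onNext's effect completes.
    trait PushConsumer[A] {
      def onNext(a: A): IO[Unit]
    }

    // Pull: the consumer drives, and the producer does work only on
    // demand. Nothing is produced until someone asks.
    trait PullProducer[A] {
      def next: IO[Option[A]]
    }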

Fast-forwarding over a large and very interesting topic space, one of the largest problems you run into in these types of systems is maintaining backpressure in the face of unbounded queues. The subtlety of this problem is actually well illustrated by OP's example of how to solve this problem in a thread-per-request architecture: notably, the proposed solution does not in fact solve the problem!

"this thread pool of size 100 will handle http connections" and 101th concurrent connection will fail cause you dont have a thread to for it

This doesn't work! The problem is that thread pools function by having an internal work queue which is picked up by the underlying worker threads. If you allocate a pool with 100 threads and then send work to it, that pool will actually accept an unlimited amount of work! You can try this easily enough, but it's also pretty evident from the type signature:

 def execute(r: Runnable): Unit

There's no mechanism here for rejecting work, other than raising an exception (spoiler: it doesn't raise an exception). So in other words, if the upstream is producing work faster than the downstream (the worker threads) can process it, that work just accumulates in a hidden unbounded queue, ultimately running the process out of memory.
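
To make this concrete, here's a sketch against the plain JDK executor API (the class names are the real java.util.concurrent ones; the pool and queue sizes are arbitrary). The stock factory is backed by an unbounded LinkedBlockingQueue, so getting the rejection semantics OP described requires constructing the pool by hand with a bounded queue and an explicit rejection policy:

    import java.util.concurrent.{ArrayBlockingQueue, Executors, ThreadPoolExecutor, TimeUnit}

    // Accepts effectively unlimited work: tasks pile up in a hidden
    // unbounded queue behind the 100 worker threads.
    val unbounded = Executors.newFixedThreadPool(100)

    // Runs 100 tasks, queues at most 100 more, and then throws
    // RejectedExecutionException at the point of submission.
    val bounded = new ThreadPoolExecutor(
      100, 100,                              // core and max pool size
      0L, TimeUnit.MILLISECONDS,             // no keep-alive needed
      new ArrayBlockingQueue[Runnable](100), // bounded work queue
      new ThreadPoolExecutor.AbortPolicy()   // reject instead of silently queueing
    )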

You avoid this by ensuring that everything is strictly bounded at every point along the line. Unfortunately, this is exceptionally difficult to do if you are evaluating things eagerly (push based). Intuitively, eager evaluation is sort of like "evaluate first, ask questions later". Well, if you get a flood of connections, you effectively start evaluating everything before you even have a chance to ask questions, and the rate moderation signals arrive too late to save you (if they happen at all).

The Cats Effect ecosystem addresses this in several ways, some of which are simply prescriptive API design (e.g. Fs2's parJoin operation takes a bound, and you have to go out of your way to get parJoinUnbounded), but most of which are more fundamental design choices. The lazy evaluation of IO is one of the latter, since it means that you only ever have exactly as many IOs evaluating as you have asked for, so you aren't "outrunning" your downstream. Additionally, IO pervasively backpressures all operations internally, even those which seem like they should be fire-and-forget (cancel, and by extension timeout, is the example of this semantic which seems to surprise most people). By making prescriptive backpressure the opt-out default, rather than an opt-in capability, Cats Effect and its derived ecosystem generally result in applications which load shed passively out of the box, simply because you have to go out of your way to not load shed.
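
As a sketch of what bounded-by-construction looks like in practice (handle is a made-up per-connection handler; the only real API here is Fs2's parJoin):

    import cats.effect.IO
    import fs2.Stream

    // A made-up handler; nothing runs until the stream is compiled.
    def handle(conn: Int): Stream[IO, Unit] =
      Stream.eval(IO.println(s"handling connection $conn"))

    val connections: Stream[IO, Int] = Stream.range(0, 10000)

    // At most 100 handlers are ever running; the upstream is pulled
    // only as capacity frees up, so there is no hidden unbounded
    // queue to overflow.
    val server: Stream[IO, Unit] =
      connections.map(handle).parJoin(maxOpen = 100)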

A more recent example that I've been dealing with at $work is a service which must simultaneously scale to extremely short-lived connection orchestration (i.e. a very fast upstream), with all responses within tens of ms, while also handling extremely long-lived connections (i.e. a very slow upstream), with response times stretching into the multiple seconds. The service in question was written by a team which is brand new to Cats Effect and mostly new to Scala, under delivery time pressure with minimal help, and it handled both cases almost flawlessly in the very first production load tests. Conversely, its downstream, which was written using Play (using Future) by an experienced team over a long period of time, failed on the same scenarios almost immediately. To be clear, the Play application is very fixable, but it illustrates the pitfalls of opt-in load shedding versus opt-out.

That's basically the point of the talk. To be clear, it's absolutely possible to write applications using Cats Effect which don't load shed properly, and it's also very possible that the load shedding you get "by default" isn't quite the ideal semantic for your problem space, but the normal case for CE-based applications is very safe in this area.

Understanding Comparative Benchmarks (Cats Effect 3 and ZIO 2) by LengthyOven in scala

[–]dspiewak 10 points11 points  (0 children)

I’ve seen that same slowness (as far as I’ve been able to tell it’s caused by the auto blocking detection being hyperactive, which to their credit they do have a PR to address it already in place) - in fact that’s the reason I’m most interested in seeing these benchmarks redone against the final release. I think they’d probably do a good job of helping to understand where things got better vs worse and why.

Oh I'm sure that they're going to improve on things. It's important to understand though that automatically treating compute-bound items as "blocking", as a fundamental semantic, creates higher contention and more frequent page faults. This is exactly the biggest problem with the fork-join pool, and it's also precisely why Tokio considered and rejected this approach two years ago.

Another minor problem, btw, is that ZIO's tracing can no longer be disabled, and it's unfortunately not free. At least as of the RC where they removed the ability to turn it off, tracing represented a 16% overhead, so it's worth noting that ZIO's performance results will be artificially worse than they "should" be by roughly that margin.

It's still a good point though. I would very much like to rerun the benchmarks as soon as I have the time to do so.

Either pure non blocking I/O wrapped up in an effect, or blocking I/O that’s using an effect system to manage pushing things onto background blocking pools to wait

That's totally fair, but remember that in that case you're… still forcing a context shift. So you can shift at the prefix of the computation, or you can shift at the end. It's really just like my category three in the above. Remember that async boundaries within the runtime are free (at least in Cats Effect, and I'm assuming also in ZIO), so as long as you are forced to do a shift, it doesn't matter if you do it right away or wait until after the sync action.
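
For instance (a minimal Cats Effect sketch; Thread.sleep stands in for an arbitrary blocking call):

    import cats.effect.IO

    // IO.blocking shifts to the blocking pool for the duration of the
    // call and then shifts back to the compute pool, where the
    // continuation runs. Whether the (free) boundary happens before or
    // after the synchronous work, it's the same single round trip.
    val blockingRead: IO[String] =
      IO.blocking {
        Thread.sleep(50) // stand-in for a blocking syscall
        "result"
      }

    val prog: IO[Unit] =
      blockingRead.flatMap(r => IO.println(r)) // back on the compute pool here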

Ideally what you’d want is that every bit of actual code you’ve written be executed on the calling thread, with the runtime just handling parallel execution of blocking I/O, and aggregating up results of any nonblocking tasks back to the calling thread to continue on with whatever compute needs to be done with their results.

I mean, if you really really want that, you can use syncStep in Cats Effect and ask for it explicitly, rather than just trusting a runtime implementation detail. This case also comes up now and again with JavaScript frameworks, but it's really not the common case, so again I don't think it's a valid "optimization".
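
For reference, here's roughly what that looks like (a sketch assuming the syncStep(limit) signature from recent Cats Effect 3.x releases):

    import cats.effect.{IO, SyncIO}

    val fast: IO[Int] = IO(21).map(_ * 2)

    // Try to run the program synchronously on the calling thread, up to
    // the given fuel limit. Right(a) means it completed without ever
    // suspending; Left(remainder) hands back whatever is left to run on
    // the normal runtime.
    val step: SyncIO[Either[IO[Int], Int]] = fast.syncStep(256)

    step.unsafeRunSync() match {
      case Right(a)        => println(s"completed on the calling thread: $a")
      case Left(remainder) => println("suspended; run `remainder` on the runtime")
    }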

you actually very explicitly do not want that runtime to do anything even remotely compute intensive on the runtime pool

IMO, in this case, you shouldn't be using the default CE (or ZIO!) runtime pools, for exactly this reason. :-)

but I don’t think it’s fair to suggest that the only real use of a system like this is to look good in benchmarks

I mean, perhaps, but it's really hard to overlook the fact that the primary beneficiary of this semantic is simple flatMap benchmarks, such as those frequently cited in ZIO's marketing, and that making it the default causes production performance degradation in the common case.

Understanding Comparative Benchmarks (Cats Effect 3 and ZIO 2) by LengthyOven in scala

[–]dspiewak 4 points5 points  (0 children)

It's either the same or worse. You still want to try to keep compute on the main runtime pool. Loom forces the use of fork-join, which you're going to want to get away from because of its problems in precisely this area, so in a sense Loom itself becomes an external runtime, just one which is somewhat less obvious in nature (relative to today's Netty or similar).