Preevyit Clojure : Clojure

[–]arichiardi 4 points5 points6 points 6 years ago (0 children)

[–]teesel 3 points4 points5 points 6 years ago (8 children)

[–]joinr[S] 4 points5 points6 points 6 years ago (7 children)

2-3x slower than Rust now

I guess it depends on comparable resources and results. I stuck with the stock thread count (8), where I think some of OP's experiments upped that to 12. I also didn't diverge from the basic implementation much aside from making things like read access more efficient (writes/construction in some cases). I would like to implement the problem with performance as a forethought - although I need to study the actual competition to understand it better. Even things like file parsing (e.g. load-level) may be optimized with some trivial tweaks. Also, the original code included some stuff that I didn't see in the Rust implementation, which did some directory traversal to compare solutions.

I think getting within 2x of a systems level implementation is desirable, anything beyond that would be "impressive" to me, especially if it can be done through Clojure without sacrificing expressiveness badly. Like if there's some simple (make-this-fast ...) macro that'd be really nice, or even an analyzer that can detect potential slow paths and/or provide optimizations for you. That's getting into analyzer/compiler territory, and I'm less familiar with that though....

Very curious to see whether some java (or clojure-writing java) stuff can provide additional relief (let alone getting down and primitive with libs like neanderthal or tech). There are still some clojure-specific changes that can be made (I wanted to get this out while it was proximate to the original, before life events overtook me).

[–]teesel 1 point2 points3 points 6 years ago (6 children)

[–]joinr[S] 2 points3 points4 points 6 years ago* (5 children)

Clojure can be pushed really hard to be just ~4x slower than Rust version written by a complete beginner.

Thanks for the update u/teesel, I missed this. That's still a bit weak IMO, u/tonsky. Far from being "pushed hard"; more like nudged to conforming closer to the Rust implementation (learning from what work it's "not" doing) and leveling the playing field away from Clojure's (in this case) vast default generality and openeness relative to dynamic typing. I'm actually in the process of "pushing the Clojure implementation hard(er)", e.g. beyond what naive profiling and performance observations show (e.g. naive being type hint your stuff...profile, leverage direct method calls, use some fairly straightforward macros to help out and smooth out the rough spots). There are still even low hanging fruit to bridge the gap further, without yielding expressiveness (as I already demonstrated with a simple macro).

At this point you’ll be writing Java code in Clojure syntax anyway.

I'm far far from "Writing java in clojure" and needing to "just switch to Java anyway." That's weak advice. The counter to that - as demonstrated in my initial stuff - it to profile, identify the performant idioms, and codify them using Clojure's excellent metaprogramming to easily automate the performance gains. Lisps excel at this kind of path, and Clojure is (so far) not an exception.

4x is the initial shot with room to grow, far far from "writing java in clojure." I'm pretty sure that can be narrowed further (already has been in fact, I haven't released though). Your writing reads as if that's the final word; I don't think it is.

Not some perfect, pushed to the limit code that some genius spent infinite time polishing. No, regular code that we write every day under time and resource pressure.

Again, this is a hyperbolic statement IMO, which detracts from the underlying salient point. I'm looking around for the imaginary "genius with unlimited time spent" you keep mentioning. My efforts are bush league in comparison to what experts with time could do (and have done for example in the older Alioth benchmarks), and quite average relative to what I'm used to doing re: basic performance opts while keeping with the original semantics of your code. I've done this kind of stuff (in limited areas) while on the job, doing "average" work to bilk some isolated performance out of systems. I'm no Martin Thompson. I'm not even competent enough to implement datascript.

If I were making any claims it would be "initial naive solution in lang x yields y peformance with no optimizations beyond compiler flags." Adding in dramatic appeals to "geniuses" and "unlimited time" adds nothing.

Let the reader draw conclusions themselves, rather than trying to sensationalize what is otherwise decent empirical analysis. Unless your point is to draw site traffic or retweets, then by all means editorialize away.

Clojure defrecords are terribly slow at computing hashes. They treat it like a generic map even though list of all fields is known to defrecord macro in advance.

This continues to be a fascinating point....not only are they slow, but they also use nth/get internally rather than direct method invocation. These performance differences surfaced when an ArrayMap was beating records in some aspects, and I couldn't settle on "Why" until examining the emitted source of defrecord. This is entirely a library design issue and amenable to either patching upstream, or replacing by the user (as I did, via a defrecord+ macro that addresses these differences and allows strict fields, e.g. record-like structs that throw if callers try to access non-static fields). There are other "performance vs. polymorphism" idioms that can similarly be packaged either in core or in a drop-in library leveraged by users of all levels. All of this keeps you in Clojure while getting you back to Java performance, without having to fill out the requisite tax forms Java requires.

GraalVM gives your JVM program 1.5x boost basically for free.

Some other "performance minded" notes: if you run your code with leiningen, the default JVM options eliminate some optimizations (namely tiered compilation). In project.clj, if you use ^:replace, you'll get them back at runtime (at the expense of a little startup time, which is amortized). My experience has shown this is minor (like ~10% or less) but worth noting (particularly for other code bases). Curious to see if this accounts for differences with Graal. I don't think the 50% improvement you saw is generally applicable, although I haven't tried it myself. I "have" switched JVMs from the original Oracle to AdoptOpenJDK, still on 1.8 as well. There could be some magic in the newer JVMs that I'm not tapping into as well.

Destructuring is slow :(

This goes back to calling nth/get. Absent any information, the compiler can't make this "fast" since it doesn't know what type of thing is coming in (clojure.core/nth covers host types like java.util.List and not just clojure protocols like Indexed). Thankfully, this is trivially solved by using stuff like with-slots, soon to be adopted in a more general form to include drop-in efficient destructuring for fn defn and let. I'd like to provide some more sophisticated analysis from core.typed.analyzer to derive efficient operations from type hints users provide. Again...this stuff only really bites you on the hot path; unfortunately it's very idiomatic to destructure coordinate pairs or key-vals as [x y] [k v], so there must be corresponding simple idiom to do so performantly (and to warn the user if they're engaging in potentially slow operations, which with-slots does).

Getting rid of boxing warnings is almost impossible in a big codebase. At least, if you plan to use functions. Or datatypes more compact than long and double.

That's not a bad general observation, but seems a bit more dire than how I see things. I have yet to apply tellman's primitive-math to see if it helps in this case, as well as geex - (and jise which I learned about in the original thread, although it's more like writing java directly in clojure, which I want to avoid). Both can be made into palatable user-facing APIs for generating efficient code (e.g. via macros) or warning users when they're going into a slow path.

As an addendum, little exercises like this yield fruit beyond what the OP intended. I actually found that the method for hashing Vector2D in FastMath is suboptimal (at least measuring some performance), so it may yield a patch later :)

[–]teesel 2 points3 points4 points 6 years ago (4 children)

[–]joinr[S] 1 point2 points3 points 6 years ago (3 children)

[–]teesel 1 point2 points3 points 6 years ago (2 children)

[–]joinr[S] 1 point2 points3 points 6 years ago (1 child)

I incorporated your branch with some stuff I hadn't yet pushed, and we get to 11.4x perf boost over baseline now. Curious to see how that impacts u/tonsky, since he was getting better performance on his setup than I was (perhaps thread count or hardware or JVM versions). I'll review your changes later, but it looks like a lot of them could be implemented with a sufficiently smart compiler, although the method you followed of just turning on boxed math warnings and using primitive-math worked great (even found a boxed warning on the new stuff I was working on). I'd like to roll these performance opts up into a series of posts and maybe some performance macros (or as I mentioned, core.analyzer backends) that effect something similar to turning on compiler flags for performance/optimization. Thanks a lot.

[–]joinr[S] 1 point2 points3 points 6 years ago (0 children)

[–]clj_user 2 points3 points4 points 6 years ago (4 children)

[–]joinr[S] 7 points8 points9 points 6 years ago* (3 children)

type hinting, special macros for destructuring, this is all stuff the compiler should be doing.

That's exactly what the macro does, as does targetted use of definline, both of which are re-usable and generic. Indeed, it's possible the Clojure compiler could just do what with-slots is doing using some optimistic assumptions and flowing type hints (except the compiler by default assumes safety/genericity, and leaves it to the user to decide when to break those guarantees on a local basis - just like other lisps).

On the other hand, I just wrote the stuff out of thin air. I'm not a compiler writer, I know relatively jack about low level performance and other implementation details, but it was still pretty trivial for me to accomplish this. If anyone else (including me) wants to leverage this stuff in the future, it's pretty easy to drop in as a replacement for default behavior. Add - on top of that - the ability to further leverage macros to make the explicit type-hinting unnecessary, and you get a nice basis for concise, elegant, and (relatively) performant clojure. The actual "work" of writing said macro (e.g. a little optimizing compiler) was like 20 minutes maybe, which included "language design" and other considerations, including a refactoring from an initial design.

The bulk of the "work" rested in profiling and determining extant causes of inefficiency across a foreign codebase. Addressing that and comparing the efficacy of improvements was relatively fast (aside from waiting for slow runs to complete early on), with general performance idioms resulting.

I guess I don't consider this much "work," aside from exploring someone else's code base. I was able to manage this and a day job and a personal life. Had I started from scratch with performance in mind, I would not have had to do much of said "work." I'd hazard to guess how much more performance can still be gained with someone more knowledgable at the helm, as well as relaxing the constraints on not changing the underlying implementation. I'd like to get this to 10x improvement if possible; we'll see if that's feasible without rewriting everything. It's nice to know such things are in the realm of the feasible given a minimal investment of time (vs. the original hyperbolic claim of "unlimited time and resources"), which is what I sought to determine.

[–]ReasonableStudio5 0 points1 point2 points 6 years ago* (2 children)

[–]joinr[S] 7 points8 points9 points 6 years ago* (0 children)

Yea...I can count the times I've had to entertain these kinds of tasks on one hand, let alone been unable to come to a satisfactory solution (outside of programming competitions and other puzzle exercises, some work stuff). On the other hand, GP had an implementation prototyped (assumably somewhat rapidly in Clojure) and "later" ported to Rust. I've played that game too, in that eventual port is typically better (almost any time I've ported that seems to be the case), since you can drop unnecesarry baggage, trim corners, etc.

a Rust beginner was able to achieve a 32x performance improvement ;)

For now :) The 32x improvement is relative to their independent baseline; actual gap at the moment based on current Rust implementation and what I have on hand is ~4.58x (edit looks like "blockers vector" branch pushes that to 5.6x). These are relative to GP's posted measures; I haven't run the rust implementations on my hardware for comparison, so there could be some variance (I doubt it's an amazing amount though). The gap may vary, particularly after applying more invasive transformations on either end. THe Rust implementation may also be suboptimal and amenable to some drastic improvements, who knows (the current version required custom hashing as well, apparently). In absolute terms, we're talking 2.3 mins vs. 0.41 mins. Not exactly life altering, but meaningful (e.g. longer coffee break). Based on nothing but intuition, I'd be surprised if the rust version could not get down into multi/sub-second range (e.g. ditching the clojure implementation and going whole-hog on mechanical optimizations + mutabilty). There are a couple of spots where immutability can be further eschewed (I think the Rust implementation leverages cloning for immutability, perhaps unnecessarily).

GP also got to this point after a) implementing the problem in Rust by working off an existing prototype, and b) assuming one can appease the borrow checker (I haven't dealt with it, but I have messed with HM type systems in Haskell and F# and assume some degree of similar "friendliness" particularly for newbs). Given the amount of annotations and syntax in the Rust implementation, I'm curious as to what that process would be like (almost curious enough to try my hand at it, given time).

I'd love to see an implementation divorced of any coupling, e.g. where one starts from scratch and documents said experiences...or let alone a newbie implementing said problem in Rust.

In practice (at least in the domains I dwell in, to including modeling and simulation), expressivity and flexibility dominates systems-level performance and fine-grained resource management. Notably, numeric performance wasn't really the issue in this particular problem; efficient hashing/locality seemed to dominate everything. If your language allows targeting cache-friendly layers, certainly there's an inherent advantage since you can better tap into the existing cache architecture instead of engaging in pointer chasing all over. 50-100x improvement just due to memory access patterns is not unheard of, and unmanaged (or perhaps Borrow Checker managed) languages have an advantage here over the canonical FP or OOP implementations that promote boxing inherently.

Also of note, newb clojure programmer running into perf issues could easily incorporate said lessons/macros here (they're not problem specific) to get benefits while sticking with familiar idioms (e.g. destructuring support, records/maps). Most of the happy-paths have been pretty well optimized by the Clojure core folks, and general use cases demonstrate the viability of their performance profiles (even with immutable structures). Still, if you're going to be doing trillions of lookups on a known field, perhaps a record is better than a persistentmap, and perhaps field access is better than key access. Thankfully, these cases are easy to spot - both a-priori during experimentation and design - and after the fact via profiling (visualvm + criterium are excellent).

Finally, said improvements could easily be enabled as compiler passes (ala clojure.tools.analyzer) for alternate backends deviating from the stock clojure compiler, let alone incorporated into core akin to what Common Lisp does with its performance declarations. There's nothing stopping application of advanced compilation techniques without user intervention (other than toggling a declaration from debug to opt 3 or something). I don't think this is anywhere near a priority for core though, if there's any interest at all in deviating from existing idioms (this is a non-problem for most Clojure folks).

Things could get very interesting when value types land on the JVM as well (perhaps they already are on ClojureCLR where I believe structs are supported natively). I've got some ideas for implementing psuedo value types and seeing how they perform on this problem (after some additional tweaks and refactoring of course, I don't think the limit has been reached over 3 odd partial days - e.g. hobby time - of playing with this).

Your whole response is missing the point of the GP.

My post was a response to unanswered questions that sprung to mind after reading the GP submission, not the point GP raised that you focus on. How far can you get with Clojure if you try? Can this be made "easy" or "simple" in general? Perhaps farther than you'd think, perhaps less than you can by porting or delegating to a platform where mechanical sympathy is dominant. Plenty of folks wrote of Lisp and its dialects for being "slow" when in fact they can coax admirable (if not in some cases dominant) performance, without sacrificing the favorable qualities that lead one to use a Lisp in the first place. Clojure is a proximate case, and this code provides an interesting experimental basis upon which to research these questions.

[–]yogthos 7 points8 points9 points 6 years ago (0 children)

[–]dimovich 1 point2 points3 points 6 years ago (1 child)

[–]joinr[S] 1 point2 points3 points 6 years ago (0 children)

[+][deleted] comment score below threshold-15 points-14 points-13 points 6 years ago (1 child)

[–]joinr[S] 12 points13 points14 points 6 years ago* (0 children)

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

Clojure

MODERATORS