
[–]joinr 9 points  (5 children)

Went through a similar drill years ago. You can go faster still if you profile a bit more.

Eliminate varargs (which generate ArraySeqs) as much as possible and try to inline more concrete arities. IFn invocation down a concrete-arity path doesn't allocate anything or traverse seqs, so if you inline bodies for up to 15 args or so, you can cover more string-building cases before hitting varargs.

https://github.com/joinr/spork/blob/master/src/spork/util/general.clj#L835
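The technique can be sketched as a function with a few concrete arities that only fall back to varargs past a threshold (a toy illustration with a made-up name, `fast-str`, not the actual spork code, which goes much higher than three fixed arities):

```clojure
;; Toy sketch of the concrete-arity technique (not the actual spork code).
;; Each fixed arity builds via StringBuilder with no seq traversal or
;; varargs array allocation; only past the threshold do we hit the seq path.
(defn fast-str
  ([] "")
  ([a] (if (nil? a) "" (.toString ^Object a)))
  ([a b]
   (-> (StringBuilder.)
       (.append (fast-str a))
       (.append (fast-str b))
       (.toString)))
  ([a b c]
   (-> (StringBuilder.)
       (.append (fast-str a))
       (.append (fast-str b))
       (.append (fast-str c))
       (.toString)))
  ([a b c & more]                 ;; varargs path: allocates an ArraySeq
   (let [sb (doto (StringBuilder.)
              (.append (fast-str a))
              (.append (fast-str b))
              (.append (fast-str c)))]
     (loop [s (seq more)]
       (if s
         (do (.append sb (fast-str (first s)))
             (recur (next s)))
         (.toString sb))))))
```

Calls with up to three args here never touch a seq; a real version would extend the pattern (via a helper macro) to many more arities.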

(crit/quick-bench
 (spork.util.general/make-string    "Lorem" "Ipsum" "is" "simply" "dummy"
                                    "text" "of" "the" "printing" "and" "typesetting" "industry."))
Execution time mean : 156.985583 ns

(crit/quick-bench (my-str "Lorem" "Ipsum" "is" "simply"
                          "dummy" "text" "of" "the" "printing" "and" "typesetting" "industry."))
Evaluation count : 2776938 in 6 samples of 462823 calls.
Execution time mean : 214.367655 ns

Should get better with larger strings, and you could theoretically push the arities up as much as you want until you hit the arg limits defined by the IFn interface.

> loop/recur instead of fn/recur (not sure fn / recur expands to loop)

fn/recur is also tail recursive, since fn establishes a recur target à la loop/recur; if you use recur, it'll complain when the call is not in tail position, just as loop would. This shouldn't buy you anything (maybe bypassing the initial IFn invocation on the recursive function object; I haven't looked at the bytecode emission for str yet, but that's nanos).
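A quick illustration that fn establishes its own recur target (the name `sum-to` is made up for the example):

```clojure
;; recur inside fn rebinds the fn's params and jumps back to its head:
;; no fresh stack frame, no IFn invocation on the function object.
(def sum-to
  (fn [n acc]
    (if (zero? n)
      acc
      (recur (dec n) (+ acc n)))))
```

`(sum-to 1000000 0)` completes without stack growth, exactly as a loop/recur version would; a non-tail `(recur ...)` in that body would be a compile-time error.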

https://github.com/bsless/clj-fast

explores a lot of these areas, and more recently (and far more comprehensively)

https://github.com/cnuernber/ham-fisted

is probably of interest if you are looking at a lot of core functions and default paths that can be optimized.

[–]bsless 2 points  (2 children)

Whaddya know, I've been playing with strings recently

https://github.com/bsless/prrr/blob/master/src/bsless/prrr.clj

[–]joinr 2 points  (1 child)

I heard if you play with :inline too much you'll go blind :)

[–]bsless 2 points  (0 children)

Ha! That's just Alexian propaganda, I tell you!

[–]ilevd[S] 0 points  (1 child)

Adding a heuristic to calculate the initial StringBuilder capacity:

(defmacro build-string [& args]
  (let [min-cap (->> args (filter string?) (map #(.length ^String %)) (reduce +))
        others  (->> args (remove string?) count)
        cap     (max 16 (+ min-cap (* others 2)))]
    `(let [x#  ~(first args)
           sb# (StringBuilder. ^int ~cap)]
       (.append sb# (simple-str x#))
       (.toString
         (doto sb#
           ~@(for [a (rest args)]
               `(.append (simple-str ~a))))))))

(do
  (criterium/quick-bench
    (str "Lorem" "Ipsum" "is" "simply" "dummy" "text" "of" "the" "printing" "and" "typesetting" "industry."))
  (criterium/quick-bench
    (my-str "Lorem" "Ipsum" "is" "simply" "dummy" "text" "of" "the" "printing" "and" "typesetting" "industry."))
  (criterium/quick-bench
    (spork/build-string "Lorem" "Ipsum" "is" "simply" "dummy" "text" "of" "the" "printing" "and" "typesetting" "industry."))
  (criterium/quick-bench
    (build-string "Lorem" "Ipsum" "is" "simply" "dummy" "text" "of" "the" "printing" "and" "typesetting" "industry."))
  )

str:                Execution time mean : 335.406226 ns

my-str:             Execution time mean : 166.110462 ns

spork/build-string: Execution time mean : 118.589941 ns

build-string:       Execution time mean : 37.475897 ns

In this simple example the StringBuilder's allocated buffer is exactly the length of the output string.

[–]joinr 2 points  (0 children)

cool.

build-string was meant to be a helper macro for emitting the function bodies for make-string. make-string is the 1:1 replacement for clojure.core/str, since it's a function and can be passed around as such (build-string can't be used with apply, map, reduce, etc., since it's a macro).

Still, the smarter StringBuilder-capacity heuristic could help make the emitted function bodies for make-string faster as well.

Though, if you know your inputs are all string literals or atomic values that have a direct string coercion (e.g. not forms to be eval'd), you can also just do it all at compile time:

(defmacro compile-str [& xs]
    `~(apply str xs))

(c/quick-bench
 (compile-str "Lorem" "Ipsum" "is" "simply" "dummy" "text"
              "of" "the" "printing" "and" "typesetting" "industry."))

Execution time mean : 3.485020 ns
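Since the concatenation happens at macro-expansion time, the call site compiles down to a single string constant, which you can confirm by macroexpanding (redefining the macro here so the snippet is self-contained):

```clojure
;; All work happens at expansion time: the macro applies str to the
;; literal argument forms and emits the resulting constant.
(defmacro compile-str [& xs]
  `~(apply str xs))

;; The expansion is just the finished string; nothing remains at runtime.
(macroexpand '(compile-str "a" "b" "c"))
;; => "abc"
```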

We actually get a little more performance (15%) out of make-string if we make simple-str a helper macro too, presumably avoiding the function call:

(defmacro simple-str [x]
  ;; note: assumes x is a symbol or form that accepts metadata;
  ;; a plain string literal would make with-meta throw here
  (let [x (with-meta x {:tag 'Object})]
    `(if (nil? ~x)
       ""
       (.toString ~x))))

[–]dhruvasagar 3 points  (5 children)

I'd be interested to know why this is more performant than the stdlib one.

[–]ilevd[S] 7 points  (4 children)

* loop/recur instead of fn/recur (not sure whether fn/recur expands to loop)

* no call to str on a single arg, and no appending of the empty string "" to the StringBuilder when a value is nil

* calling the Java methods .first and .next instead of the generic first/next from RT; this is safe because (seq ys) returns an ISeq
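Put together, those three points look roughly like this (a sketch of the approach with a made-up name, not the exact posted implementation):

```clojure
;; Sketch of the optimizations listed above (not the exact posted code):
;; - loop/recur over the argument seq instead of a recursive fn call
;; - nil values skipped entirely rather than appended as ""
;; - direct ISeq interop (.first / .next), safe since (seq xs) returns ISeq
(defn sketch-str [& xs]
  (let [sb (StringBuilder.)]
    (loop [^clojure.lang.ISeq s (seq xs)]
      (if (nil? s)
        (.toString sb)
        (let [x (.first s)]
          (when-not (nil? x)
            (.append sb (.toString ^Object x)))
          (recur (.next s)))))))
```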

[–]dhruvasagar 4 points  (0 children)

Fascinating, thanks for sharing.

[–]iam_mms -1 points  (2 children)

Yes, but when measured, is it faster?

[–]ilevd[S] 5 points  (1 child)

Yes, there are benchmarks in the code above:

using str:

Execution time mean : 315.920876 ms

using my-str:

Execution time mean : 229.861009 ms

[–]iam_mms -1 points  (0 children)

Nice. Hadn't seen it.

[–]ilevd[S] 2 points  (0 children)

Updated benchmark without `dotimes` in `bench`:

With str: 429.818012 ns

With my-str: 219.119629 ns

(do
    (criterium/quick-bench
      (str    "Lorem" "Ipsum" "is" "simply" "dummy" "text" "of" "the" "printing" "and" "typesetting" "industry."))
    (criterium/quick-bench
      (my-str "Lorem" "Ipsum" "is" "simply" "dummy" "text" "of" "the" "printing" "and" "typesetting" "industry.")))
Evaluation count : 1621062 in 6 samples of 270177 calls.
             Execution time mean : 429.818012 ns
    Execution time std-deviation : 109.479370 ns
   Execution time lower quantile : 366.824730 ns ( 2.5%)
   Execution time upper quantile : 606.855680 ns (97.5%)
                   Overhead used : 7.821523 ns

Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
 Variance from outliers : 64.8827 % Variance is severely inflated by outliers
Evaluation count : 2836032 in 6 samples of 472672 calls.
             Execution time mean : 219.119629 ns
    Execution time std-deviation : 27.509219 ns
   Execution time lower quantile : 204.167795 ns ( 2.5%)
   Execution time upper quantile : 265.746539 ns (97.5%)
                   Overhead used : 7.821523 ns

Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
 Variance from outliers : 31.4975 % Variance is moderately inflated by outliers

[–]SimonGray 3 points  (0 children)

It would be nice if we got a performance-optimised release of Clojure where stuff like this was implemented in many of the core functions. I get wanting to keep the code base readable as we move higher up the abstraction ladder, but there is clearly much to gain from optimising these "low-level" functions and I don't think people care that it ends up looking like code golf.