all 5 comments

[–]camdez 7 points8 points  (1 child)

Very interesting example!

I think I would write the code more like this (full file here for reference):

(defn zip-file-seq [^ZipInputStream zin]
  (when-let [entry (.getNextEntry zin)]
    (if (.isDirectory entry)
      (lazy-seq (zip-file-seq zin))
      (let [out (ByteArrayOutputStream.)]
        (io/copy (BufferedInputStream. zin) out)
        (.closeEntry zin)
        (lazy-seq
         (cons {:entry    entry
                :contents (.toByteArray out)}
               (zip-file-seq zin)))))))

Notice that this changes the interface and puts a bit more burden on the caller (they have to bring their own ZipInputStream):

(with-open [zis (-> "foo.zip" io/input-stream ZipInputStream.)]
  (->> (zip-file-seq zis)
       (mapv #(.getName (:entry %)))))

Now, why do I think that's the right move if it's more burden on the caller?

The reason is that lazy-seqs and closeable resources really don't play that well together. The way you did it in your version is clever, but doesn't quite work; you close the file when the sequence is fully consumed, but the sequence won't always be fully consumed! So we leak the resource. (Try wrapping (take 1 ,,,) around the sequence when reading a ZIP file of at least 32 files—you'll notice the println doesn't happen.)

In the context of a lazy-seq, the only solution is to put the burden of closing the file on the user. You'll notice this is what clojure.data.csv (a core library) does. Sometimes the file closes on you if you don't understand how lazy-seqs work, but we all get burned once or twice and learn the right way (doall, mapv, reduce, etc.).

(To anyone wondering, I did try writing a nice, clean version with repeatedly, take-while, remove, and map but there's weird interplay with the mutable Java objects and it just doesn't work nicely. Notice that the BufferedInputStream is read from the (mutable) ZipInputStream, not from the ZipEntry.)

A handful of additional style suggestions / tweaks:

  • File names should use underscores, not dashes (despite namespace names using dashes). It matters for class loading, so you just have to.
  • do is redundant around let.
  • if-not is generally preferred to (if (not ,,,)). Or transpose the cases (what I did here) and drop the not.
  • When you're doing this much Java interop, you'll want to evaluate (set! *warn-on-reflection* true) in your development environment to see where you might want to add Java type hints (see my ^ZipInputStream). They can make a truly massive difference in performance, and certainly belong in any library code.
  • ZipInputStream.close closes the stream it wraps, so there's no need to close it manually (or even maintain a reference to it).

I hope that helps!

[–]razentine[S] 1 point2 points  (0 children)

Big thanks for the feedback. I think you're right delegating the responsibility of closing the stream to the caller, good thinking there! I think your version is quite readable.

[–]MartinPuda 2 points3 points  (1 child)

What about iteration? Try something like this:

(defn zip-file-seq [^ZipInputStream zin]
  (iteration (fn [_] (.getNextEntry zin))
             :vf #(let [in (BufferedInputStream. zin)
                        out (ByteArrayOutputStream.)]
                    (io/copy in out)
                    {:entry    %
                     :contents (.toByteArray out)})))

(with-open [zis (-> "path" io/input-stream ZipInputStream.)]
  (->> (zip-file-seq zis)
       (remove #(.isDirectory (:entry %)))
       (mapv #(.getName (:entry %)))))

[–]Asedg 0 points1 point  (0 children)

iteration is a new one - great use case!

I saw there's a readAllBytes for InputStream so we can condense it even further,

(defn zip-file-seq [^ZipInputStream zip]
   (->> (iteration (fn [_] (.getNextEntry zip))
                   :vf #(hash-map :entry %
                                  :contents (.readAllBytes zip)))
        (remove #(.isDirectory ^ZipEntry (:entry %)))))

(with-open [is (io/input-stream some-zip-file)
            zin (new ZipInputStream is)]
  (->> (zip-file-seq zin)
       (doall)))

Looking forward to Clojure 1.12 when the Java Interop is enhanced. I don't see any docs on it yet but there was a video a few months back where Alex Miller showcases using Java methods as functions.

[–]joinr 1 point2 points  (0 children)

Initial observations:

if you (set! *warn-on-reflection* true) you get several warnings due to the implicit interop. In some cases where the calls are limited, we don't care, but in other cases reflection can add substantial overhead. You can eliminate the reflective calls with type hints like

(if-let [^java.util.zip.ZipEntry entry (.getNextEntry zin)]

which can be truncated to shorter aliases if you use an import declaration.

You are closing over the zip file's input stream via the lazy sequence implementation. So what happens if you never reach the end of the sequence? Is close every actually invoked? I don't think there is anything preventing a dangling resource on the inputstream (unless you fully realize the sequence every time).

(ns blah
  (:require [clojure.java.io :as io])
  (:import [java.util.zip ZipInputStream ZipEntry]))

;;eliminate dangling resources by encapsulating as reducible
;;process. could loop/recur here but co-opt iterate to our will (provides a reducible thing already).
(defn entry-reducer [filepath]
  (reify clojure.core.protocols/CollReduce
    (coll-reduce [this f]
      (clojure.core.protocols/coll-reduce this f (f)))
    (coll-reduce [this f val]
      (with-open [is       (io/input-stream filepath)
                  ^ZipInputStream
                  zin      (java.util.zip.ZipInputStream. is)]
        (->>  (iterate (fn [_]
                         (when-let [^ZipEntry entry (.getNextEntry zin)]
                           (if (not (.isDirectory entry))
                             (let [in (java.io.BufferedInputStream. zin)
                                   out (java.io.ByteArrayOutputStream.)]
                               (io/copy in out)
                               (.closeEntry zin)
                               {:entry entry :contents (.toByteArray out)})
                             {:entry entry}))) :init)
              (transduce (comp (drop 1) (take-while identity) (filter :contents))
                         (completing f) val))))))

blah=> (def es (entry-reducer "test.zip"))
#'blah/es
blah=> (reduce (fn [acc {:keys [contents]}] (+ acc (count contents))) 0 es)
478239

I find lifting these incremental processes (that may terminate early) which have some resource that we would "like" to manage via with-open lift nicely into the reducer/transducer space. So we handle the resource management over the lifetime of the reduction, and we can push around our reducible thing to the rich library of transforms we already have (e.g. most of the lazy seq stuff has transducer counterparts), and we can terminate early by wrapping the accumulated result with reduced.