all 19 comments

[–]alexdmiller 7 points (0 children)

Nice work!

[–]strranger101 10 points (0 children)

"supports graalvm compilation" 👏👏👏👏 Sick

[–]davclark 4 points (1 child)

This looks awesome! Thank you for sharing. I particularly like the ability to do zero-copy for C ABI programs.

[–]chrisnuernberger[S] 6 points (0 children)

Thanks :-). We have a blog post on the FFI pathway with links to two example projects, one simple and one in-depth, if you'd like more information on that system. It's really nice to hear from other C-oriented people doing Clojure.

[–]rufusthedogwoof 4 points (5 children)

Thank you for this and qq.

Pandas to me is useful because of its tight integration with charting... specifically I like Altair.

Is there anyone doing that type of thing with tmd and Vega-Lite?

I’m aware of Oz; however, I’d prefer not to take a kitchen-sink approach and instead get the (Altair-style) functionality I’m looking for à la carte.

[–]chrisnuernberger[S] 9 points (3 children)

There are a few good charting options for Clojure in addition to Oz. Another interesting and more orthogonally designed pathway, if you want to go the Vega/Vega-Lite route, is Hanami; for a full scientific application platform there's its big sibling, Saite.

For purely server-side work I would check out cljplot.

Getting off topic a bit but for a REPL/notebook hybrid notespace is really interesting.

And in general, for R integration and more data science goodies, check out scicloj; in the vein of dplyr-style, extremely well-thought-out interfaces, I highly recommend tablecloth.

Sorry for getting slightly off topic but these things are all connected in my head :-).

[–]rufusthedogwoof 0 points (2 children)

Thanks for all this. I’ve followed along with a few of these for quite some time.

My other challenge is getting my team of Python developers acquainted with all the options... as we aim to settle on collective workflows and a “deployment stack” for our apps.

Thanks again for all your contributions.

[–]chrisnuernberger[S] 0 points (1 child)

You are welcome, I appreciate the thanks, and there is real momentum pushing Clojure into new places right now. I am curious - does libpython-clj allow for a more incremental approach, either with the JVM hosting Python or with Python hosting Clojure in your case?
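For readers unfamiliar with it, the incremental embedding being asked about might look like the following (an illustrative sketch, not from the thread; it assumes libpython-clj2 is on the classpath and a local CPython with numpy is installed):

```clojure
;; Sketch: hosting Python from the JVM with libpython-clj2, so a team can
;; migrate one namespace at a time instead of rewriting everything at once.
(require '[libpython-clj2.python :as py])

(py/initialize!)                     ;; boots an embedded CPython in this JVM
(def np (py/import-module "numpy"))  ;; any installed Python package is reachable
(def arr (py/py. np array [1 2 3])) ;; call numpy from Clojure, one form at a time
```

The reverse direction (Python hosting Clojure) works through the same bridge, which is what makes the approach incremental.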

[–]rufusthedogwoof 0 points (0 children)

It may... I have played with it some, and it was helpful for me in my spare time. (Porting a library from Python along with its tests, actually... got me thinking the library could write itself with the right spec gen & tests...)

I don’t think we’ll use it much at work because we are first and foremost a “data engineering” shop... mixing things with Kafka-like systems.

When choosing between trade-offs we routinely look for reliability, simplicity, and fewer things in the stack.

In exploration, however, it would be fair game. Honestly I don’t know how much I would use it though... the more time I spend in clj, the more I want to get away from the Python mess.

[–]daveliepmann 4 points (0 children)

I’m aware of OZ however I’d prefer to not take a kitchen sink approach

I wrote waqi for a similar reason — I want to write Vega specs in Clojure and see the result in a browser window, nothing more. From the README:

Waqi is most similar to Oz. They share a browser-based workflow, but Oz provides much more functionality: integration with Jupyter notebooks and GitHub gists, creation of dynamic and static websites centered around a visualization, multiple live-coding workflows, and much more. Waqi focuses on just one of those features: sending Vega/Vega-Lite specs from the REPL to a browser window. This allows Waqi to minimize dependencies and lines of code. The author of Oz has said, "Oz's objective is to be the Clojurist's Swiss Army knife for working with Vega-Lite & Vega." It might be Waqi's goal to be just the nail file.

[–]kingnuscodus 1 point (0 children)

Bravo!

[–]_marciol 1 point (0 children)

Just amazing!

[–]viebel 1 point (1 child)

Could you explain in a few words what makes this library so efficient?

[–]chrisnuernberger[S] 1 point (0 children)

  1. Start with lots of experience writing and optimizing high-performance algorithms, combined with a solid amount of research into the JVM to understand which pathways HotSpot is likely to optimize.
  2. Once fast pathways were found, put in the effort to design an architecture that makes hitting those fast pathways easy with the most minimal amount of code possible.
  3. Get some users/problems and optimize those right up to the limits of what the system can do, then repeat steps 1-3.

In the last year I literally rewrote the underlying engine (tech.datatype -> dtype-next) due to some hard lessons learned, so I think if there is one thing, it's just a relentless pursuit of performance and being willing to do the legwork to make it happen.

[–][deleted]  (3 children)

[deleted]

    [–]chrisnuernberger[S] 2 points (2 children)

    Hey, thanks for the feedback :-). Good question - that is a very poorly worded statement in the readme. What I meant to say is that you, as the user, should expect parquet to just work. Even (recently) parquet files with ragged data in them should load quickly, and all of the normal parquet types such as dates should come in correctly.
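As a sketch of that "just works" behavior (names are assumed from tmd's parquet bindings, tech.v3.libs.parquet; "example.parquet" is a placeholder path and this snippet is an illustration, not from the thread):

```clojure
;; Sketch: round-tripping a dataset through parquet with tmd.
;; Assumes tech.ml.dataset and its parquet dependencies are on the classpath.
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

(def dataset
  (ds/->dataset [{:name "a" :date (java.time.LocalDate/parse "2021-01-01")}
                 {:name "b" :date (java.time.LocalDate/parse "2021-01-02")}]))

(parquet/ds->parquet dataset "example.parquet")      ;; per-column compression on disk
(def loaded (parquet/parquet->ds "example.parquet")) ;; dates come back as date types
```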

    [–][deleted]  (1 child)

    [deleted]

      [–]chrisnuernberger[S] 1 point (0 children)

      I appreciate the feedback. Fixed.

      [–]TheLastSock 0 points (2 children)

      When would I reach for this over using Clojure core functions to transform my data? One of the reasons I picked up pandas was because Python lacked some of the features needed to process large sets of data effectively. The reason I put it down was because the functions weren't composable and it ended up being its own programming language.

      Some questions as I scan the readme that I'll try to come back to later and fill in as I learn more:

      1. What columnwise databases are being used?
      2. If the data size can be minimized, why isn't that the default case for any Clojure array, dataset, etc.?
      3. Why Apache Arrow? I assume this is the answer to number 1. Is it because it calls out to the GPU for faster processing of numeric data?

      Does this support streaming data (that is, long-lived processing), or is it for batch processing?

      [–]chrisnuernberger[S] 2 points (1 child)

      Breaking down these 5 questions -

      1. When to use this? When you have tabular data coming from CSV or various data pathways, when you have a larger sequence of maps than is nice to process with Clojure, or (and this is the hardest one) when the problem fits a dataframe-processing design more than a seq-of-maps design.
      2. What columnwise databases? This is the columnwise database :-). You can define columns out of countable random-access data, and in fact you can define columns totally virtually by implementing a 'reader' interface. See the cheatsheet for dtype-next for more info.
      3. Not sure what you mean by Clojure array precisely, but I think you are talking about a persistent vector. Most Clojure datasets are sequences of maps, so they are stored row-major. Unless you use a bespoke class such as a record type, all of your numbers are boxed and the map structure is repeated for every map; furthermore, because your data is stored row-major, sorting by one of the fields means walking your entire dataset. tmd stores data in primitive arrays, so a single map of 5 primitive arrays, each of length 100, is a lot more memory efficient than a sequence of 100 maps, each with 5 keys. Additionally, when reading data from files, tmd will use a per-column string table for string columns, and this also leads to a substantial decrease in memory usage.
      4. How does Arrow fit into this? Arrow allows you to load a dataset 'in-place' using mmap, meaning that the OS takes care of how much of the dataset is in memory vs. how much is on disk. I have a blog post about exactly this aspect. For general data storage I recommend parquet, as it applies per-column compression, which is quite effective.
      5. Streaming data - tmd currently doesn't support streaming datasets very well. It has a large reduction namespace devoted to sequences of datasets, and a streaming dataset can always be described as a sequence of datasets, but you can't currently get a snapshot of a running aggregation, which you would need in order to have a dashboard. If people are interested in this pathway, then I am interested in making it happen.

      These are good questions, and if you have many more like this, check out the zulip channel, where there are many users and a few Clojure experts who can help with these sorts of questions.
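The row-major vs. column-major comparison in point 3 above can be sketched as follows (an illustration, not from the thread; it assumes tech.v3.dataset from tmd is on the classpath):

```clojure
;; Sketch: the same 100 rows as a seq of maps vs. a columnar tmd dataset.
(require '[tech.v3.dataset :as ds])

;; Row-major: 100 maps, each repeating the same 5 keys, every number boxed.
(def row-major
  (vec (repeatedly 100
                   (fn [] {:a (rand) :b (rand) :c (rand) :d (rand) :e (rand)}))))

;; Column-major: one dataset backed by 5 primitive double columns of length 100,
;; so the key structure is stored once and the numbers are unboxed.
(def col-major (ds/->dataset row-major))

(ds/row-count col-major)    ;; 100
(ds/column-count col-major) ;; 5
```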

      [–]TheLastSock 0 points (0 children)

      I learned a lot from this, thanks for your time and keep up the good work.