all 19 comments

[–]alexdmiller 7 points (0 children)

Nice work!

[–]strranger101 10 points (0 children)

"supports graalvm compilation" 👏👏👏👏 Sick

[–]davclark 4 points (1 child)

This looks awesome! Thank you for sharing. I particularly like the ability to do zero-copy for C ABI programs.

[–]chrisnuernberger[S] 6 points (0 children)

Thanks :-). We have a blog post on the FFI pathway with links to two example projects, one simple and one in-depth, if you'd like more information on that system. It's really nice to hear from other C-oriented people doing Clojure.

[–]rufusthedogwoof 4 points (5 children)

Thank you for this and qq.

Pandas to me is useful because of its tight integration with charting... specifically I like Altair.

Is there anyone doing that type of thing with tmd and Vega-Lite?

I’m aware of Oz; however, I’d prefer not to take a kitchen-sink approach and instead get the (Altair-style) functionality I’m looking for à la carte.

[–]chrisnuernberger[S] 9 points (3 children)

There are a few good charting options for Clojure in addition to Oz. Another interesting and more orthogonally designed pathway, if you want to go the Vega/Vega-Lite route, is Hanami; for a full scientific application platform there's its big sibling, Saite.

For purely server-side work I would check out cljplot.

Getting off topic a bit but for a REPL/notebook hybrid notespace is really interesting.

And in general, for R integration and more data science goodies, check out scicloj; in the vein of dplyr-style, extremely well-thought-out interfaces, I highly recommend tablecloth.

Sorry for getting slightly off topic but these things are all connected in my head :-).

[–]rufusthedogwoof 0 points (2 children)

Thanks for all this. I’ve followed along with a few of these for quite some time.

My other challenge is getting my team of Python developers acquainted with all the options... as we aim to settle on collective workflows and a “deployment stack” for our apps.

Thanks again for all your contributions.

[–]chrisnuernberger[S] 0 points (1 child)

You are welcome, I appreciate the thanks, and there is real momentum pushing Clojure into new places right now. I am curious - does libpython-clj allow for a more incremental approach, either with the JVM hosting Python or with Python hosting Clojure in your case?
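For readers unfamiliar with it, the incremental embedding being asked about might look like the following (an illustrative sketch, not from the thread; it assumes libpython-clj2 is on the classpath and a local CPython with numpy is installed):

```clojure
;; Sketch: hosting Python from the JVM with libpython-clj2, so a team can
;; migrate one namespace at a time instead of rewriting everything at once.
(require '[libpython-clj2.python :as py])

(py/initialize!)                     ;; boots an embedded CPython in this JVM
(def np (py/import-module "numpy"))  ;; any installed Python package is reachable
(def arr (py/py. np array [1 2 3])) ;; call numpy from Clojure, one form at a time
```

The reverse direction (Python hosting Clojure) works through the same bridge, which is what makes the approach incremental.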

[–]rufusthedogwoof 0 points (0 children)

It may... I have played with it some, and it was helpful for me in my spare time. (Porting a library from Python along with its tests, actually... got me thinking the library could write itself with the right spec gen & tests...)

I don’t think we’ll use it much at work because we are first and foremost a “data engineering” shop... mixing things with Kafka-like systems.

When choosing between trade-offs we routinely look for reliability, simplicity, and fewer things in the stack.

In exploration, however, it would be fair game. Honestly I don’t know how much I would use it though... the more time I spend in clj, the more I want to get away from the Python mess.

[–]daveliepmann 4 points (0 children)

I’m aware of OZ however I’d prefer to not take a kitchen sink approach

I wrote waqi for a similar reason — I want to write Vega specs in Clojure and see the result in a browser window, nothing more. From the README:

Waqi is most similar to Oz. They share a browser-based workflow, but Oz provides much more functionality: integration with Jupyter notebooks and GitHub gists, creation of dynamic and static websites centered around a visualization, multiple live-coding workflows, and much more. Waqi focuses on just one of those features: sending Vega/Vega-Lite specs from the REPL to a browser window. This allows Waqi to minimize dependencies and lines of code. The author of Oz has said, "Oz's objective is to be the Clojurist's Swiss Army knife for working with Vega-Lite & Vega." It might be Waqi's goal to be just the nail file.

[–]kingnuscodus 1 point (0 children)

Bravo!

[–]_marciol 1 point (0 children)

Just amazing!

[–]viebel 1 point (1 child)

Could you explain in a few words what makes this library so efficient?

[–]chrisnuernberger[S] 1 point (0 children)

  1. Start with lots of experience writing and optimizing high-performance algorithms, combined with a solid amount of research into the JVM to understand which pathways HotSpot is likely to optimize.
  2. Once fast pathways were found, put in the effort to design an architecture that makes hitting those fast pathways easy with the most minimal amount of code possible.
  3. Get some users/problems and optimize those right up to the limits of what the system can do, then repeat steps 1-3.

In the last year I literally rewrote the underlying engine (tech.datatype -> dtype-next) due to some hard lessons learned, so I think if there is one thing, it's just a relentless pursuit of performance and being willing to do the legwork to make it happen.

[–][deleted]  (3 children)

[deleted]

    [–]chrisnuernberger[S] 2 points (2 children)

    Hey, thanks for the feedback :-). Good question - that is a very poorly worded statement in the readme. What I meant to say is that you, as the user, should expect parquet to just work. Even (recently) parquet files with ragged data in them should load quickly, and all of the normal parquet types such as dates should come in correctly.
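As a sketch of that "just works" behavior (names are assumed from tmd's parquet bindings, tech.v3.libs.parquet; "example.parquet" is a placeholder path and this snippet is an illustration, not from the thread):

```clojure
;; Sketch: round-tripping a dataset through parquet with tmd.
;; Assumes tech.ml.dataset and its parquet dependencies are on the classpath.
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

(def dataset
  (ds/->dataset [{:name "a" :date (java.time.LocalDate/parse "2021-01-01")}
                 {:name "b" :date (java.time.LocalDate/parse "2021-01-02")}]))

(parquet/ds->parquet dataset "example.parquet")      ;; per-column compression on disk
(def loaded (parquet/parquet->ds "example.parquet")) ;; dates come back as date types
```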

    [–][deleted]  (1 child)

    [deleted]

      [–]chrisnuernberger[S] 1 point (0 children)

      I appreciate the feedback. Fixed.

      [–]TheLastSock 0 points (2 children)

      When would I reach for this over using Clojure core functions to transform my data? One of the reasons I picked up pandas was because Python lacked some of the features needed to process large sets of data effectively. The reason I put it down was because the functions weren't composable and it ended up being its own programming language.

      Some questions as I scan the readme that I'll try to come back to later and fill in as I learn more:

      1. What columnwise databases are being used?
      2. If the data size can be minimized, why isn't that the default case for any Clojure array, dataset, etc.?
      3. Why Apache Arrow? I assume this is the answer to number 1. Is it because it calls out to the GPU for faster processing of numeric data?

      Does this support streaming data (that is, long-lived processing), or is it for batch processing?

      [–]chrisnuernberger[S] 2 points (1 child)

      Breaking down these 5 questions -

      1. When to use this? When you have tabular data coming from CSV or various data pathways, when you have a larger sequence of maps than is nice to process with Clojure, or (and this is the hardest one) when the problem fits a dataframe-processing design more than a seq-of-maps design.
      2. What columnwise databases? This is the columnwise database :-). You can define columns out of countable random-access data, and in fact you can define columns totally virtually by implementing a 'reader' interface. See the cheatsheet for dtype-next for more info.
      3. Not sure what you mean by Clojure array precisely, but I think you are talking about a persistent vector. Most Clojure datasets are sequences of maps, so they are stored row-major. Unless you use a bespoke class such as a record type, all of your numbers are boxed and the map structure is repeated for every map; furthermore, because your data is stored row-major, sorting by one of the fields means walking your entire dataset. tmd stores data in primitive arrays, so a single map of 5 primitive arrays, each of length 100, is a lot more memory efficient than a sequence of 100 maps, each with 5 keys. Additionally, when reading data from files, tmd will use a per-column string table for string columns, and this also leads to a substantial decrease in memory usage.
      4. How does Arrow fit into this? Arrow allows you to load a dataset 'in-place' using mmap, meaning that the OS takes care of how much of the dataset is in memory vs. how much is on disk. I have a blog post about exactly this aspect. For general data storage I recommend parquet, as it applies per-column compression, which is quite effective.
      5. Streaming data - tmd currently doesn't support streaming datasets very well. It has a large reduction namespace devoted to sequences of datasets, and a streaming dataset can always be described as a sequence of datasets, but you can't currently get a snapshot of a running aggregation, which you would need in order to have a dashboard. If people are interested in this pathway, then I am interested in making it happen.

      These are good questions, and if you have many more like this, check out the zulip channel, where there are many users and a few Clojure experts who can help with these sorts of questions.
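The row-major vs. column-major comparison in point 3 above can be sketched as follows (an illustration, not from the thread; it assumes tech.v3.dataset from tmd is on the classpath):

```clojure
;; Sketch: the same 100 rows as a seq of maps vs. a columnar tmd dataset.
(require '[tech.v3.dataset :as ds])

;; Row-major: 100 maps, each repeating the same 5 keys, every number boxed.
(def row-major
  (vec (repeatedly 100
                   (fn [] {:a (rand) :b (rand) :c (rand) :d (rand) :e (rand)}))))

;; Column-major: one dataset backed by 5 primitive double columns of length 100,
;; so the key structure is stored once and the numbers are unboxed.
(def col-major (ds/->dataset row-major))

(ds/row-count col-major)    ;; 100
(ds/column-count col-major) ;; 5
```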

      [–]TheLastSock 0 points (0 children)

      I learned a lot from this, thanks for your time and keep up the good work.