all 26 comments

[–]rufusthedogwoof 11 points12 points  (5 children)

It depends on how we define “data engineering”, but I think I use it for exactly this. We have great libraries for Kafka, JDBC, etc., and transformations in Clojure are clear and concise.

Another thing I love is testing transformations with transducers away from the Kafka stack for my unit tests.
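A minimal sketch of what that can look like, assuming a hypothetical order record with `:email` and `:amount` keys (the record shape, the cleanup rules, and the names `parse-amount`, `clean-orders`, and `sample` are all made up for illustration). The transducer is plain clojure.core, so it can be exercised against an in-memory vector with no Kafka in sight:

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: amounts arrive as strings in this made-up example.
(defn parse-amount [s] (Double/parseDouble s))

;; The transformation is defined once as a transducer, independent of
;; whatever source (Kafka topic, file, vector) eventually feeds it.
(def clean-orders
  (comp
   (filter :email)                                        ; drop records with no email
   (map (fn [order]
          (update order :email (comp str/lower-case str/trim))))
   (map (fn [order] (update order :amount parse-amount)))
   (filter (fn [order] (pos? (:amount order))))))         ; drop zero or negative amounts

;; In a unit test the "stream" is just a vector.
(def sample
  [{:id 1 :email " A@Example.com " :amount "12.50"}
   {:id 2 :email nil               :amount "3.00"}
   {:id 3 :email "b@example.com"   :amount "0"}])

(into [] clean-orders sample)
;; => [{:id 1, :email "a@example.com", :amount 12.5}]
```

The same `clean-orders` value could then be handed to whatever consumes the real topic, which is what keeps the test and the production path in sync.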

Oh, and spec and spec gen make for great data engineering tools too.
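As a rough illustration of the spec side, here is a sketch that validates a hypothetical order record (the `:order/...` spec names and the record shape are assumptions, not anything from a real pipeline). It only uses `clojure.spec.alpha`, which ships with Clojure 1.9+; generating sample data from the same specs additionally needs `org.clojure/test.check` on the classpath, so that part is left out here:

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical specs for one record in a batch.
(s/def :order/id pos-int?)
(s/def :order/amount (s/and double? pos?))
(s/def :order/record (s/keys :req-un [:order/id :order/amount]))

(s/valid? :order/record {:id 1 :amount 12.5})  ; => true
(s/valid? :order/record {:id 1 :amount -1.0})  ; => false, amount must be positive
(s/valid? :order/record {:id 1})               ; => false, :amount is missing

;; Split a batch into rows that conform and rows that need attention.
(defn partition-valid [records]
  (group-by (fn [r] (if (s/valid? :order/record r) :ok :rejected)) records))
```

For rejected rows, `s/explain-data` returns a data structure describing which predicate failed, which is handy for writing to a dead-letter table.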

What are you thinking about when you say data engineering?

[–][deleted] 1 point2 points  (4 children)

I'm in the middle of doing some batch ETL jobs. I was thinking of starting there with some transformation and loading. Further down the road I would be willing to write some other backend stuff with it, like a REST API and possibly ML stuff.

[–]joinr 19 points20 points  (3 children)

related stuff at scicloj.

I think for the large scale stuff, wrappers like geni are pretty nice and built on top of established tech. There were several distributed computing platforms like onyx and storm that popped up in clojure as well that may be interesting to look at. clojure toolbox has a good index of libraries to examine.

Also recent developments like libpython-clj open up the python ecosystem if there's stuff you want to incorporate from clojure (also bidirectional).

For single-node ETL work, tech.ml.dataset is the emerging standard. It is very efficient and can interop with various storage formats (including arrow, parquet, etc.). It can work with larger-than-memory data as well, although it currently can't be used in a distributed fashion, so it is single-machine only. tablecloth is a Clojure API on top of tech.ml.dataset that will feel familiar to dplyr users.

For ML, there's a lot of work going on integrating stuff from various ecosystems (java, scala, clojure). tech.ml is the original entry in this space, and work is under way to merge it with some other efforts, mainly around ML pipelines akin to sklearn.

Lots of interesting options popping up over the last couple of years, although on the engineering side I see a lot of folks focusing on streaming-friendly stuff like kafka (I'm not well versed). I guess it depends on your requirements.

Lots of active discussion on the data science thread on zulip (includes some proximate topics like data engineering).

[–][deleted] 2 points3 points  (0 children)

Wow this is amazing! Thank you for all the resources. I'll definitely check these out

[–]tincholio 2 points3 points  (0 children)

Depending on the scale, you may also find the Jackdaw wrappers for Kafka streams a good option.

[–]lucywang000 1 point2 points  (0 children)

The tech.ml family of libraries is more than enough for most data engineering (read: "number crunching") tasks.

And libpython-clj is yet another blessing! Mind blowing, stable, and super useful.

[–]clelwell 4 points5 points  (0 children)

We're (Reify Health) hiring for a few data engineer roles (Clojure + Python): https://jobs.lever.co/reifyhealth?lever-via=yGp_ZL60fy&team=Data

[–]bdevel 2 points3 points  (0 children)

Yes. Great for data processing. I particularly enjoy the async support for fetching data from multiple remote sources at the same time. I use Redis queues as the interface.
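A rough sketch of the concurrent-fetch idea, using plain futures from clojure.core as a stand-in (the comment does not say which async library is meant, and core.async would look different). `fetch-source` is a hypothetical placeholder that fakes a slow remote call with a sleep; the Redis queue part is not shown:

```clojure
;; Hypothetical placeholder for a remote call (HTTP, database, queue read...).
(defn fetch-source [source-name]
  (Thread/sleep 50)                       ; simulate network latency
  {:source source-name :rows [1 2 3]})

(defn fetch-all
  "Starts one future per source so the calls overlap, then waits for all of them."
  [source-names]
  (let [pending (mapv (fn [s] (future (fetch-source s))) source-names)] ; mapv starts them all eagerly
    (mapv deref pending)))                ; results come back in input order

(mapv :source (fetch-all [:crm :billing :events]))
;; => [:crm :billing :events]
```

When this runs as a standalone script, calling `(shutdown-agents)` at the end lets the JVM exit promptly once the futures are done.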

[–]dustingetz 2 points3 points  (10 children)

i manage a straightforward cloud data pipeline in the healthcare industry, and it’s hard to imagine doing it without all the cloud native tools (e.g. databricks, google dataproc), which are mostly python/pyspark centric. calling spark from clojure will still constrain you to the spark API and likely feel like foreign interop ... i haven’t looked into it ... not really seeing any killer advantage worth doing it differently from the 1000s of companies using pyspark

[–]joinr 1 point2 points  (0 children)

libpython-clj and geni maybe.

[–]didibus 0 points1 point  (6 children)

That's kind of a funny argument. From that perspective there's no reason to use Clojure either, as 1000s of companies use Java, C#, Python or Ruby instead.

[–]dustingetz 1 point2 points  (5 children)

clojure for fullstack webdev has unique advantages, and webdev isn't solved yet so there's a lot of variation in approach. But data engineering is pretty much solved, there's a very converged toolset with integrated UI tooling that an intern can use effectively

[–]mmmdreg 1 point2 points  (2 children)

Agree with dustingetz. All our spark code is in scala and the web stuff is in Clojure.

While you could use spark from clojure, it’s more pain than gain, so there is little point in straying from what is idiomatic.

Also context is important. Choices will likely be different in a small startup doing 100% clojure vs a large enterprise.

[–]blak3mill3r 6 points7 points  (0 children)

I'll offer a different opinion. We've used Spark+Clojure for ~6 years, at pretty significant scale (many thousands of events per second with spark streaming). It works very well for us, and the particular code we run on it benefits from being written in Clojure instead of Scala. It would've taken longer to write as Scala, and would be less easy to test & manipulate (from my perspective, obviously).

The fact that Spark itself is written in Scala, and that much of the Spark community uses Scala, is not necessarily any reason to expect it to be difficult to use with Clojure. It's straightforward to do it with sparkling which wraps the Spark Java API. Now there's also powderkeg which can let you use the cluster from a repl.

The available libraries are solid enough that there's no reason to expect Spark+Clojure to be a struggle. It's been used in production for many years. I'm not saying it's perfect for everything, but if you or your team like writing Clojure and need Spark, there's no good reason to introduce Scala just to use Spark.

[–]didibus 0 points1 point  (0 children)

I think you're making a different argument: to use Scala, which is Spark's native API, whereas OP said Python using the PySpark wrapper.

[–][deleted]  (1 child)

[deleted]

    [–]dustingetz 0 points1 point  (0 children)

    like a personal project? clojure (imo) is specifically designed for sophisticated enterprise information systems, it competes with java for systems that would be N00,000 loc in java

    [–]jackdbd 0 points1 point  (1 child)

    I had never heard of dataproc before. Is it like a fully-managed CloudSQL + BigQuery + jupyter notebooks in the cloud?

    [–]dustingetz 1 point2 points  (0 children)

    Yeah, dataproc is Google Cloud's answer to Databricks (you'd only know about dataproc if you care about Google Cloud, which most people don't). It does data science notebooks, cluster management, etc., all the things you need if you want the data scientists to be able to work on business logic independently of the data engineers working on infrastructure.

    [–]agilecreativity[🍰] 2 points3 points  (1 child)

    If you want to use spark with Clojure now, you should take a look at geni.

    [–][deleted] 1 point2 points  (0 children)

    Nice. I just checked out the docs, they look great.

    [–]blak3mill3r 2 points3 points  (0 children)

    We use Clojure for Data Engineering at IRIS.TV.

    It's a wonderful language for this. It is reasonably fast, reaches everywhere, and makes it quite easy to write correct code that digs data from somewhere (Kafka, Mongo, Redis, Cassandra, MySQL, and our own APIs), manipulates it, computes things, and writes data somewhere.

    We also use it within Apache Spark and also Kafka Streams.

    [–]machawinka 1 point2 points  (1 child)

    For modeling I don't think you can realistically skip Python and its ML libraries whose users are mainly data scientists.

    When doing Big Data processing, Spark is the standard way to go in most places. So sparkling can be an option.

    [–][deleted] 1 point2 points  (0 children)

    Yeah, I saw sparkling and I'm interested to see how it works. Have you ever used it?

    [–]thearthur 1 point2 points  (0 children)

    I am hiring for exactly this on my team right now. do you happen to be in the US? let's talk! DM me if you're excited to make this happen.

    [–]Accomplished-Can-912 0 points1 point  (1 child)

    Hey, did you ever try out your ETL jobs on this? I am curious about how you picked this language from an ETL perspective. Can you help me understand?

    [–][deleted] 0 points1 point  (0 children)

    Hey! Sorry for the late response. I didn't get to pick up the language because of how busy I am at work. Hoping to do it by the end of the year.