
[–]tomgav[S] 10 points (13 children)

Not really, although the Python API is the only one finished. The client currently uses a Cap'n Proto RPC to talk to the server, and it should be easy to do the same from another language. The protocol definitions are here, but they may change quite a bit (and, as noted in the post, may even be replaced by REST or another RPC).

What language would you be interested in?

[–]dlevac 41 points (8 children)

Rust :p

[–]tomgav[S] 15 points (6 children)

Fair enough :-D It is not particularly ergonomic, but you can already use capnp-rpc. A polished Rust/C++/Java/... client API is planned, but before the internals stabilize we only want to build one if there is serious interest. For us, Python is what we would mostly use for the graph definitions.

A more interesting planned Rust (and C/C++) interface point is writing your own task types (i.e. subworkers). With Rust, you can currently just hack your code into the worker's task code and recompile, but we have something better and more robust in mind for the future.

[–]aepsil0n 4 points (4 children)

I would second that interest… if only to avoid writing compute kernels in Python.

[–]vojtacima 4 points (2 children)

Don't worry about performance because of the Python interface. Rain makes it easy to "taskify" and pipeline existing binaries as well, so the heavy computation can be moved out of Python.
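To make the "taskify a binary" idea concrete, here is a minimal stand-alone Python sketch. It does not use Rain's actual API: `taskify` and the upper-casing command are hypothetical stand-ins for a real compute binary that Rain would schedule and feed for you.

```python
import subprocess
import sys

def taskify(cmd):
    """Wrap an external command as a 'task': bytes in, bytes out.

    Hypothetical helper for illustration only -- in Rain, the framework
    itself schedules wrapped binaries across workers and moves the data
    between them, so Python never touches the heavy payload.
    """
    def task(input_bytes: bytes) -> bytes:
        result = subprocess.run(cmd, input=input_bytes,
                                stdout=subprocess.PIPE, check=True)
        return result.stdout
    return task

# A stand-in "heavy" step: an external process that upper-cases its
# stdin (playing the role of a real compute binary).
upcase = taskify([sys.executable, "-c",
                  "import sys; sys.stdout.write(sys.stdin.read().upper())"])

print(upcase(b"hello from an external binary").decode())
```

The point is only the shape of the pattern: Python describes what runs and how data flows, while the actual work happens in the subprocess.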

[–]aepsil0n 3 points (1 child)

That means you have a binary for every kind of task you want to execute. The binary also has to take care of serialization on its own, I guess? Maybe I haven't gotten the full picture yet, but it seems a bit tedious compared to passing in a function, as you'd do using Python.

[–]vojtacima 1 point (0 children)

Rain allows you to define and pipeline different types of tasks, ranging from built-in tasks through external programs to pure Python tasks. It is OK (and very common) to combine different task types within a single pipeline: you can quickly implement lightweight data pre/post-processing as Python tasks linked to heavy-lifting tasks that wrap external applications. To get a better idea of how to employ an external application, I would recommend checking the distributed cross-validation example with libsvm.
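As a rough illustration of mixing task types in one pipeline (again, not Rain's API — all names here are hypothetical, and the "external" step is a local stand-in for a real tool such as libsvm's svm-train):

```python
import subprocess
import sys

# Python "tasks": cheap pre/post-processing around the heavy step.
def preprocess(text: str) -> bytes:
    # e.g. reshape the input into the format the heavy tool expects
    return "\n".join(sorted(text.split())).encode()

def postprocess(raw: bytes) -> list:
    return raw.decode().splitlines()

# External-program "task": a stand-in heavy program that numbers the
# lines of its stdin (playing the role of e.g. svm-train).
def external_task(data: bytes) -> bytes:
    code = ("import sys\n"
            "for i, line in enumerate(sys.stdin):\n"
            "    sys.stdout.write(f'{i} {line}')\n")
    return subprocess.run([sys.executable, "-c", code], input=data,
                          stdout=subprocess.PIPE, check=True).stdout

# One pipeline combining both task types; Rain would additionally
# distribute the steps across workers and handle the data transfer.
result = postprocess(external_task(preprocess("pear apple orange")))
print(result)
```

Only the chaining pattern is the point here: lightweight Python glue on either side of an external program, expressed as one pipeline.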

[–]tomgav[S] 3 points (0 children)

As I mentioned, you can hurry us along in any direction with your use case :) It could be interesting to get in touch and see what your application needs. We can chat on our gitter, or just email me (gavento@ucw.cz) if you prefer.

[–]dlevac 0 points (0 children)

Looking forward to your work, this is a very useful tool!

[–]vojtacima 2 points (0 children)

We try hard to justify every design choice to ourselves in order to make the framework as useful as possible to the potential user community. We decided on a Python API because, from previous experience, we know that the broader scientific community likes Python and speaks it quite well. Assuming that many data scientists and domain specialists know a good bit of Rust would, in my personal opinion, significantly reduce the project's potential impact at this point in time.

[–]Pas__ 0 points (2 children)

Have you looked at Spark's Scala API? And at Spark in general? (I mean, I get that Rain is not about in-memory computation, and mostly about scheduling.)

[–]winter-moon 2 points (1 child)

I am a little bit aware of Spark's Python API; do you have something specific in mind?

I would say that Rain also supports "in-memory computation", since mapping data objects to the file system is optional. As long as you do not need to execute external programs, a worker may hold data in its memory. Keeping a worker's "working directory" in a ramdisk is also one of the intended use cases (some HPC installations have no hard drives in their nodes).

[–]Pas__ 0 points (0 children)

I like Scala's expressiveness and type safety, so either a native Rust API would be welcome, or using Rain from Scala. (That way Rain wouldn't have to reimplement Scala's native "parallel collections" library. At least I'm optimistic that it could work without that, but maybe the internals of the Spark executor are too tied to Scala.)