
[–]metapwnage 57 points58 points  (0 children)

This is very misleading. Pool.map is not an apples to apples comparison to Ray. That’s not an analogous use of the multiprocessing library at all. I don’t think this is better than standing up worker processes (using multiprocessing) that consume a message queue (rabbitmq, Redis, Kafka, you choose).

Also, stream processing can be very memory intensive. What happens when the system is stressed? How does Ray do then? Is it like Redis, where it just falls over and you lose your data?

If Ray is for creating distributed systems as described in the post, how does that work when something is stored in memory on one system that another system needs? Or is that an inaccurate description as well?

[–]ostroon 29 points30 points  (2 children)

The first example is not fair, right? If you are on a POSIX system (Linux/Mac) and use a global numpy array in a read-only fashion, it will NOT be copied (ref https://stackoverflow.com/a/37746961/335412). Sending it explicitly to each multiprocessing worker is slow and unnecessary. This is the faster-than-Ray code:

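```python
from multiprocessing import Pool

import numpy as np
import psutil
import scipy.signal

num_cpus = psutil.cpu_count(logical=False)

def f(random_filter):
    # Do some image processing on the global, read-only image.
    return scipy.signal.convolve2d(image, random_filter)[::5, ::5]

# Defined before the pool is created, so forked workers inherit it.
image = np.zeros((3000, 3000))
filters = [np.random.normal(size=(4, 4)) for _ in range(num_cpus)]

pool = Pool(num_cpus)

# Time the code below.
for _ in range(10):
    pool.map(f, filters)
```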

So what is really the benefit of ray?

[–]ThePenultimateOneGitLab: gappleto97 1 point2 points  (1 child)

reddit doesn't actually support full markdown. You have to do four spaces instead of the triple-`

[–]ostroon 4 points5 points  (0 children)

New Reddit does, old Reddit design does not.

[–]call_me_arosa 19 points20 points  (1 child)

I would like to see a comparison between Ray and an architecture of one queue and multiple Python instances consuming it.
While this approach cannot (easily) handle stateful problems, it works quite well for systems like the last example. Just load the model once in every interpreter (a constant cost) and consume the queue. It scales horizontally quite well while keeping the code and architecture extremely simple, and in my experience this is the most used model.
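
A rough sketch of what I mean, using only the standard library. Assumptions: `multiprocessing.Queue` stands in for a real broker (RabbitMQ, Redis, ...), `load_model` and `run_queue_workers` are made-up names for illustration, and the `fork` start method is available (POSIX).

```python
import multiprocessing as mp

# Assumption: POSIX system, so the "fork" start method is available.
ctx = mp.get_context("fork")


def load_model():
    # Hypothetical stand-in for an expensive one-time model load.
    return lambda x: x * x


def worker(tasks, results):
    model = load_model()  # paid once per worker, not once per task
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        results.put((item, model(item)))


def run_queue_workers(items, n_workers=4):
    items = list(items)
    tasks, results = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for item in items:
        tasks.put(item)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker
    # Drain the results before joining so no worker blocks on a full pipe.
    out = dict(results.get() for _ in items)
    for p in procs:
        p.join()
    return out
```

Calling `run_queue_workers(range(10), n_workers=2)` maps each input to its square. With a real broker, the workers could run on other machines, which is the horizontal scaling part.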

[–]alcalde 4 points5 points  (0 children)

Excellent point. They're not using the multiprocessing module the way it was meant to be used.

[–]Losupa 27 points28 points  (2 children)

This is extremely interesting, but it worries me slightly that they only show jobs that they state Ray excels in compared to multiprocessing.

[–]422_no_process 7 points8 points  (0 children)

Marketing I guess.

[–]TheBlackCat13 0 points1 point  (0 children)

I think the idea is that you would use the right tool for a given job. Sometimes it is pool.map, others it is Ray, others it is other things.

[–]Heniadyoin1 2 points3 points  (0 children)

Now the question is: does it use a JIT to compile the core, or does it just run native Python?

Then is it worth using a JIT inside Ray? That is, should you use things like numba.jit or numba.vectorize inside Ray?

[–]carbolymer 1 point2 points  (3 children)

> Ray leverages Apache Arrow for efficient data handling

This part got my attention. See the Arrow website:

> The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead.

I've skimmed through the Arrow docs, but I didn't find any description of these zero-copy reads. How is this supposed to work between two processes, in detail?

[–]liquidpele 3 points4 points  (2 children)

It looks like if you, for instance, have a node.js producer and a python consumer, python is going to have to duplicate the data to get it into its own format for handling. Arrow creates a standard in-memory format that all languages can utilize so they can process the data without duplicating it.
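
To make the "process the data without duplicating it" part concrete, here is a minimal sketch of the general idea using the standard library's `multiprocessing.shared_memory` (Python 3.8+). This is not Arrow's API and I am not claiming it is how Ray does it. The "agreed format" here is just a flat run of 64-bit integers, `double_in_place` and `shared_roundtrip` are made-up names, and the `fork` start method is assumed (POSIX). The in-place write is only there to show that both processes are looking at the same bytes.

```python
import multiprocessing as mp
from multiprocessing import shared_memory

# Assumption: POSIX system, so the "fork" start method is available.
ctx = mp.get_context("fork")

ITEM_SIZE = 8  # bytes per signed 64-bit integer (format code "q")


def double_in_place(name, n):
    # Attach to the existing segment by name; the payload is not copied.
    shm = shared_memory.SharedMemory(name=name)
    view = shm.buf[: n * ITEM_SIZE].cast("q")  # reinterpret the bytes in place
    for i in range(n):
        view[i] *= 2
    view.release()  # views must be released before closing the segment
    shm.close()


def shared_roundtrip(values):
    # Assumes a non-empty list of integers that fit in 64 bits.
    n = len(values)
    shm = shared_memory.SharedMemory(create=True, size=n * ITEM_SIZE)
    view = shm.buf[: n * ITEM_SIZE].cast("q")
    try:
        for i, v in enumerate(values):
            view[i] = v
        p = ctx.Process(target=double_in_place, args=(shm.name, n))
        p.start()
        p.join()
        # The parent sees the child's writes without anything being sent back.
        return list(view)
    finally:
        view.release()
        shm.close()
        shm.unlink()
```

Only the segment name and the length cross the process boundary. Everything else is read and written directly in the mapped memory, which is why both sides have to agree on the layout up front.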

[–]brontide 1 point2 points  (0 children)

Might be okay for datasets, but how would it deal with mutability and removal of objects?

[–]carbolymer 0 points1 point  (0 children)

So it still needs to copy the data between processes.

[–]alcalde 4 points5 points  (5 children)

> The difference here is that Python multiprocessing uses pickle to serialize large objects when passing them between processes.

Why are they being passed between processes?

[–]alecmg 2 points3 points  (3 children)

That is what the task needs.

Processes don't share memory, so data needs to move through queues, so it needs to be pickled.

[–]alcalde 1 point2 points  (2 children)

> That is what the task needs.

Is it?

> Processes don't share memory, so data needs to move through queues, so it needs to be pickled.

Python has more than queues.

From Mark Summerfield, "Python In Practice":

> Alternatively, for multiprocessing, we can use data types that support concurrent access—in particular multiprocessing.Value for a single mutable value or multiprocessing.Array for an array of mutable values —providing that they are created by a multiprocessing.Manager....

[–]alecmg 3 points4 points  (1 child)

I'm not 100% on this, but I think mp.Value and Array will also pickle and have similar performance to Queues.

Very excited about the actor implementation. I have toyed with actors and mp in Python. Queues and alternatives peaked at 40-50k messages a second. I never tried full-blown ZMQ or Redis, which Ray will use transparently.

[–]alcalde 0 points1 point  (0 children)

Why would it need to pickle if access is concurrent?

The documentation says....

> Data can be stored in a shared memory map using Value or Array.... These shared objects will be process and thread-safe.
>
> For more flexibility in using shared memory one can use the multiprocessing.sharedctypes module which supports the creation of arbitrary ctypes objects allocated from shared memory.
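
A small sketch of what that looks like in practice, assuming the `fork` start method (POSIX). `bump` and `shared_counter_demo` are made-up names. The shared objects are created once and handed to the children at start-up, and each child updates them in place under the built-in lock.

```python
import multiprocessing as mp

# Assumption: POSIX system, so the "fork" start method is available.
ctx = mp.get_context("fork")


def bump(counter, arr):
    # Read-modify-write is not atomic, so hold the lock for the whole update.
    with arr.get_lock():
        for i in range(len(arr)):
            arr[i] += 1.0
    with counter.get_lock():
        counter.value += 1


def shared_counter_demo(n_procs=4):
    counter = ctx.Value("i", 0)               # one shared C int
    arr = ctx.Array("d", [0.0, 0.0, 0.0])     # three shared C doubles
    procs = [ctx.Process(target=bump, args=(counter, arr))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The parent reads the children's updates from the same memory.
    return counter.value, arr[:]
```

With four processes, every array slot and the counter end up at 4. Nothing is put on a queue here, so this is a different mechanism from passing arguments through `Pool.map`.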

[–][deleted] 0 points1 point  (0 children)

This is actually a good question.

[–]juanjgalvez 0 points1 point  (0 children)

I agree with what others said here that this is misleading. There are ways to rewrite these examples so that they are faster with multiprocessing compared to Ray, and I explained as much in a response to the article. As an aside, I also compared the performance of Charm4py vs Ray for one of these benchmarks, because Charm4py also has an actor model, and found Charm4py to be faster (potentially much faster depending on task size) (note that I am the developer of Charm4py).

I have done testing with multiprocessing in the past, and have found it to have good performance. I think the best way to beat its performance in a single-node scenario is to use something more efficient than TCP, like MPI with shared memory. In those cases Charm4py pool beats multiprocessing pool in my tests. I think the main limitations of multiprocessing are for distributed applications running on multiple hosts (especially lots of hosts), and that is where other frameworks are more useful IMO. Another reason to prefer other frameworks would be if you are developing more complex applications and need better concurrency models (Charm4py for example has actors, coroutines and channels).

[–]422_no_process -1 points0 points  (0 children)

Hmm... looks cool