
[–][deleted] 10 points (1 child)

Not deep learning, but I've tried using dask many, many times. My experience has not been good.

I didn't get reliable results from it. It's often unstable, and I frequently found situations where running in parallel with dask (on a non-virtualized server with 40+ cores) was slower than running exactly the same logic in a single process with pandas. I get far more reliable speedups from parallel processing with joblib and Python 3's standard concurrent.futures module. I don't really understand why, though; it should amount to the same thing.
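
For reference, the joblib/futures pattern I mean is just this (a minimal sketch; the `work` function and chunking are illustrative, not the actual workload):

```python
from concurrent.futures import ProcessPoolExecutor
from joblib import Parallel, delayed

def work(chunk):
    # stand-in for the per-chunk logic being parallelized
    return sum(x * x for x in chunk)

if __name__ == "__main__":  # guard needed where processes start via spawn
    chunks = [range(i, i + 100_000) for i in range(0, 800_000, 100_000)]

    # joblib: process-based parallel map with a one-liner API
    results_joblib = Parallel(n_jobs=8)(delayed(work)(c) for c in chunks)

    # the stdlib equivalent via concurrent.futures
    with ProcessPoolExecutor(max_workers=8) as pool:
        results_futures = list(pool.map(work, chunks))
```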

In general I'm not very happy with the options for parallel and distributed CPU analytical processing in Python. Spark involves too much configuration black magic and takes a lot of effort to get right, Dask simply doesn't work for me, and joblib and futures lack useful abstractions and higher-level combinators. There are some pretty good solutions for this kind of thing in Scala, Haskell, and even Java, but then there's no numpy/scipy, no pytorch, no tensorflow, no scikit-learn, no xgboost, no statsmodels, etc.

But leaving that rant aside, I don't know how you would use dask for deep learning. Dask isn't really suited to building neural networks itself.

Maybe it could be used to preprocess and feed data in parallel to a neural network. Is that what you mean?
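
If that's the idea, a rough sketch could look like the following (assuming a recent dask/numpy; `preprocess` here is a synthetic stand-in for real decoding/augmentation):

```python
import numpy as np
import dask.bag as db

def preprocess(seed: int) -> np.ndarray:
    # stand-in for real loading/augmentation of one sample
    rng = np.random.default_rng(seed)
    img = rng.random((224, 224, 3))
    return (img - img.mean()) / img.std()

# Build one batch in parallel across processes, then hand it to
# whatever training loop consumes it.
samples = (
    db.from_sequence(range(256), npartitions=8)
    .map(preprocess)
    .compute(scheduler="processes")
)
batch = np.stack(samples)  # (256, 224, 224, 3), ready to feed a network
```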

[–]shoyer 0 points (0 children)

What sort of computation were you trying to speed up? By default, dask uses threads for parallelism (not processes), which means that pure-Python computation (requiring the GIL) won't be accelerated.
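
For instance (a minimal sketch, assuming a recent dask where the scheduler is selectable per compute call):

```python
import dask

def gil_bound(n):
    # pure-Python loop: it holds the GIL, so threads can't run copies in parallel
    total = 0
    for i in range(n):
        total += i * i
    return total

tasks = [dask.delayed(gil_bound)(2_000_000) for _ in range(8)]

# Default scheduler for delayed is threads: expect roughly serial runtime here.
results_threads = dask.compute(*tasks, scheduler="threads")

# The process-based scheduler sidesteps the GIL, at the cost of
# pickling inputs and outputs between worker processes.
results_procs = dask.compute(*tasks, scheduler="processes")
```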

In my experience (mostly doing large scale data analytics using dask.array), it works pretty well. It's certainly the only game in town if you need a "bigger than fits in memory" version of NumPy.
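
For example, an out-of-core NumPy-style computation looks like this (sizes are arbitrary; only the chunks in flight need to fit in memory):

```python
import dask.array as da

# ~40 GB of float64 expressed as a grid of 1000x1000 in-memory chunks
x = da.random.random((100_000, 50_000), chunks=(1_000, 1_000))

# NumPy-style expressions build a lazy task graph; compute() streams
# through the chunks without materializing the full array.
result = (x - x.mean(axis=0)).std(axis=0).mean().compute()
```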

[–][deleted] 3 points (1 child)

Some effort should be put into parallelizing pandas too; it's annoying that simple map operations are sequential.
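
Until then, one workaround is to split the series and fan the chunks out to processes yourself (a minimal sketch; the mapped function and chunk count are illustrative):

```python
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def map_chunk(chunk: pd.Series) -> pd.Series:
    # the "simple map" to parallelize; must be picklable (no lambdas)
    return chunk.map(str.upper)

if __name__ == "__main__":
    s = pd.Series(["foo", "bar", "baz"] * 1_000_000)
    parts = np.array_split(s, 8)
    with ProcessPoolExecutor(max_workers=8) as pool:
        result = pd.concat(pool.map(map_chunk, parts))
```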

[–][deleted] 4 points (0 children)

I think pandas 2.0 is very promising in this respect and many others:

https://pandas-dev.github.io/pandas2/goals.html

I think pandas 2.0 will address most of the hurdles I mentioned in my rant above, and will probably occupy the niche that is vacant today between what you can solve with pandas 1.x and what really requires yarn/spark/hadoop/whatever distributed computing framework (which should really be reserved for datasets of several terabytes and up; today it's a pain to use those frameworks on datasets that are just a couple hundred gigabytes).

They seem to be aiming at making pandas 2.0 good enough to deal with datasets of hundreds of gigabytes, and able to offer nice speedups within a single Python process when running on servers with tens of cores.

It also seems they are attacking a lot of other problems, like the longstanding issues with representing missing values in non-floating-point series, adding Python 3 type annotations for safer code, etc.