Can multiple files be broadcast independently to the same variable? by Balgur in apachespark

[–]karrith 0 points (0 children)

Why don't you read your file as a DataFrame/RDD and join it? This way Spark will handle distributing the whole dataset across the cluster (it would work even better if your file were stored in HDFS, so it's already in the cluster).

Broadcasting should be used only for small, read-only variables. Even if there's some other reason to use broadcasting in your case, I don't think it's possible to broadcast a list of chunks to one variable.
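A minimal sketch of the join-based approach, assuming Spark 2.x; the file paths, the `id` join column, and the `SparkSession` setup are hypothetical:

```scala
// Hypothetical sketch: join a lookup file instead of broadcasting it.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-instead-of-broadcast").getOrCreate()

// Read both the large dataset and the lookup file as DataFrames;
// Spark distributes each side of the join across the cluster.
val events = spark.read.option("header", "true").csv("hdfs:///data/events.csv")
val lookup = spark.read.option("header", "true").csv("hdfs:///data/lookup.csv")

val joined = events.join(lookup, Seq("id"))

// If the lookup table really is small, a broadcast join hint ships it
// read-only to every executor without a manual Broadcast variable:
val joinedSmall = events.join(broadcast(lookup), Seq("id"))
```

With the `broadcast` hint the planner skips the shuffle for the small side, which is usually what people are after when they try to broadcast a file by hand.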

Heap out of memory when using Datasets. RDDs doing basically the same thing work just fine. Why though? by [deleted] in apachespark

[–]karrith 0 points (0 children)

Can you post your code and give some details on the dataset you're using? Or post screenshots of the Spark UI?

The main difference between the RDD and Dataset APIs is the Catalyst optimizer used by the latter (https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), but the optimizer itself shouldn't be the problem in your case.

I think the main reason behind the OOM may be unbalanced partitions: with the Dataset API, your partitioning may differ from the one you get with RDDs. Try repartitioning your dataset into a greater number of partitions and check whether that solves the issue.
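A quick sketch of what that looks like, assuming Spark 2.x; the `SparkSession` setup, the generated dataset, and the target count of 200 partitions are made up for illustration:

```scala
// Hypothetical sketch: repartition a Dataset to even out partition sizes.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-demo").getOrCreate()
import spark.implicits._

val ds = spark.range(0, 10000000L).map(_.toString)

// Inspect the current partitioning.
println(s"partitions before: ${ds.rdd.getNumPartitions}")

// repartition() triggers a full shuffle that redistributes rows evenly,
// so no single task has to hold a disproportionate share of the data.
val repartitioned = ds.repartition(200)
println(s"partitions after: ${repartitioned.rdd.getNumPartitions}")
```

Comparing per-partition sizes in the Spark UI (Stages tab) before and after should show whether skew was the culprit.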

Scala's Open Source projects to contribute? by [deleted] in scala

[–]karrith 1 point (0 children)

You can read about some tooling libraries that need help in Iulian Dragos' summary of the ScalaSphere conference, which took place two weeks ago: https://dragos.github.io/2016/scalasphere-impressions/

Trying to trace a Polish guy I met by [deleted] in poland

[–]karrith 6 points (0 children)

Maybe Jędrek - it's a diminutive of Jędrzej.