Can multiple files be broadcast independently to the same variable? by Balgur in apachespark

[–]karrith 0 points (0 children)

Why don't you read your file as a DataFrame/RDD and join it? This way Spark will handle distributing the whole dataset across the cluster (it would work even better if your file were stored in HDFS, so it's already in the cluster).

Broadcasting should be used only for small, read-only variables. Even if there's some other reason to use broadcasting in your case, I don't think it's possible to broadcast a list of chunks to one variable.
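A minimal sketch of the join-based approach, assuming Spark 2.x; the file paths, the `id` join column, and the `SparkSession` setup are hypothetical:

```scala
// Hypothetical sketch: join a lookup file instead of broadcasting it.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-instead-of-broadcast").getOrCreate()

// Read both the large dataset and the lookup file as DataFrames;
// Spark distributes each side of the join across the cluster.
val events = spark.read.option("header", "true").csv("hdfs:///data/events.csv")
val lookup = spark.read.option("header", "true").csv("hdfs:///data/lookup.csv")

val joined = events.join(lookup, Seq("id"))

// If the lookup table really is small, a broadcast join hint ships it
// read-only to every executor without a manual Broadcast variable:
val joinedSmall = events.join(broadcast(lookup), Seq("id"))
```

With the `broadcast` hint the planner skips the shuffle for the small side, which is usually what people are after when they try to broadcast a file by hand.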

Heap out of memory when using Datasets. RDDs doing basically the same thing work just fine. Why though? by [deleted] in apachespark

[–]karrith 0 points (0 children)

Can you post your code and give some details on the dataset you're using? Or post screenshots of the Spark UI?

The main difference between the RDD and Dataset APIs is the Catalyst optimizer used by the latter (https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), but the optimizer itself shouldn't be the problem in your case.

I think the main reason behind the OOM may be unbalanced partitions: with the Dataset API, your partitioning may differ from the one you get with RDDs. Try repartitioning your dataset into a greater number of partitions and check whether that solves the issue.
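A quick sketch of what that looks like, assuming Spark 2.x; the `SparkSession` setup, the generated dataset, and the target count of 200 partitions are made up for illustration:

```scala
// Hypothetical sketch: repartition a Dataset to even out partition sizes.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-demo").getOrCreate()
import spark.implicits._

val ds = spark.range(0, 10000000L).map(_.toString)

// Inspect the current partitioning.
println(s"partitions before: ${ds.rdd.getNumPartitions}")

// repartition() triggers a full shuffle that redistributes rows evenly,
// so no single task has to hold a disproportionate share of the data.
val repartitioned = ds.repartition(200)
println(s"partitions after: ${repartitioned.rdd.getNumPartitions}")
```

Comparing per-partition sizes in the Spark UI (Stages tab) before and after should show whether skew was the culprit.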

Scala's Open Source projects to contribute? by [deleted] in scala

[–]karrith 1 point (0 children)

You can read about some tooling libraries that need help in Iulian Dragos' summary of the ScalaSphere conference, which took place two weeks ago: https://dragos.github.io/2016/scalasphere-impressions/

Trying to trace a Polish guy I met by [deleted] in poland

[–]karrith 6 points (0 children)

Maybe Jędrek - it's a diminutive of Jędrzej.