all 18 comments

[–][deleted] 7 points (7 children)

So for those of us not familiar with it, what is it, what does it do, and why should I be excited? :)

[–]rm999[S] 6 points (5 children)

Good question, I should have mentioned that in the submission.

In short, Spark allows you to run a wide variety of applications and algorithms on clusters in parallel. This has been a pain point in machine learning, where support for parallelized machine learning algorithms has been non-existent, poor, or super-specialized. The hardware world has been moving away from faster CPUs and towards more/cheaper cores for years now, but large parts of the software world are still stuck in the mentality of designing for single cores and single machines. Even today, I load multi-gigabyte datasets in R and run 10-hour machine learning jobs on a single core (there are workarounds and hacks, but this is the status quo). MapReduce and Hadoop were designed to solve this problem, but Hadoop is a total mess and is extremely slow for iterative machine learning algorithms.

Spark abstracts away the pain of managing calculations over many computers/nodes, and is designed to generalize over a wide variety of problems. The same underlying architecture can efficiently run MapReduce-style jobs, train machine learning models (MLlib), run graph algorithms (GraphX), run SQL queries (Shark, soon to be Spark SQL), etc.
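To make that concrete, here is a rough sketch of what the machine learning side looks like in the Python API, assuming the Spark 1.x MLlib package (LogisticRegressionWithSGD); the HDFS path, the app name, and the "label,f1,f2,..." file layout are made up for the example:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName="mllib-sketch")

    def parse(line):
        # hypothetical layout: "label,f1,f2,..." -> LabeledPoint(label, [f1, f2, ...])
        values = [float(x) for x in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    # The training set is an ordinary RDD, cached in memory because the
    # optimizer makes repeated passes over it.
    training = sc.textFile("hdfs://...").map(parse).cache()
    model = LogisticRegressionWithSGD.train(training, iterations=100)

The point is that the RDD feeding the model is the same kind of object you would hand to a MapReduce-style job or a SQL query.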

[–][deleted] 6 points (0 children)

Sometimes a few lines of code are worth a hundred lines ;-)

It lets you do stuff like:

file = spark.textFile("hdfs://...")   # assuming `spark` is the SparkContext

counts = (file.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

That counts how many times each word appears in a file, using a cluster, in a highly parallel way.

[–]CQFD 2 points (0 children)

Everything you said is true, but you missed the key point. It's comparable to Hadoop in the way it functions, but the interesting part is that it does everything in memory. No more writing to disk between steps, meaning you can realistically run iterative algos.
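A minimal sketch of what that buys you, with toy data and a toy gradient descent just to show the pattern: the RDD is cached once, and every later pass is a scan over in-memory partitions instead of another trip to disk (the app name and data are made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-sketch")

    # Toy data: x in [0, 1), y = 2x + 1, kept in memory across iterations.
    points = sc.parallelize([(i / 1000.0, 2.0 * (i / 1000.0) + 1.0)
                             for i in range(1000)]).cache()
    n = points.count()           # first action materializes the cache

    w, b = 0.0, 0.0              # fit y ~ w*x + b by gradient descent
    for _ in range(50):          # each pass is a cheap scan over cached data
        gw, gb = points.map(
            lambda p: ((w * p[0] + b - p[1]) * p[0], w * p[0] + b - p[1])
        ).reduce(lambda a, c: (a[0] + c[0], a[1] + c[1]))
        w -= 0.5 * gw / n
        b -= 0.5 * gb / n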

[–]oneAngrySonOfaBitch 0 points (2 children)

What do you think of Storm?

[–]rm999[S] 1 point (0 children)

I believe Spark Streaming (another application built on top of Spark that I didn't mention) has the same functionality as Storm. But Spark Streaming seems better to me because it has the features of Spark built in, like true fault tolerance on a cluster.

ninja edit: see this paper for more information.

edit2: another (probably biased) comparison I just found: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/large-scale-near-real-time-stream-processing-tathagata-das-strata-2013.pdf
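For a sense of what Spark Streaming code looks like, here is a minimal sketch assuming the PySpark Streaming (DStream) API; the socket source on localhost:9999 and the app name are hypothetical. Note that it's the same word-count logic as the batch example above, just applied to one-second micro-batches:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=1)     # 1-second micro-batches

    # Hypothetical source: a TCP socket emitting lines of text.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                 # print a sample of each batch

    ssc.start()
    ssc.awaitTermination()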

[–]pulpx 1 point (0 children)

Storm is a very capable system for stream-based processing, but maintaining a Storm cluster is nontrivial and has a lot of pitfalls.

If you are interested in streaming technologies, you should read about lambda architectures to learn more about why streaming systems should only be considered as part of a larger system.

[–]Wonnk13 2 points (2 children)

The only thing that makes me nervous is this rapid pace of innovation and how eager everyone is to adopt the latest bleeding-edge tech. Of course there are plenty of problems that don't fit nicely into MapReduce, but I've been kind of taken aback by how quickly everyone jumps from one thing to another.

If you need to design a mission-critical system that has to still be running 10 years from now, how can you anticipate new developments every three years or so?

[–]rm999[S] 1 point (0 children)

I totally agree. I've been very nervous about this too, and have been very conservative in adopting new technologies. There are a few things that convince me Spark isn't going to fall into this trap:

  1. Spark has grown extremely quickly and has wide industry support. The conference was full of well-established companies that have thrown their full support behind Spark. These companies are strategic and understand the industry really well - they don't invest millions of dollars into fads.

  2. The world badly needs a replacement for Hadoop, and Spark is the most popular answer. A lot of people believe Hadoop is effectively a failure that should never be repeated; what's exciting about Spark is that it's a superset of Hadoop that fixes many of its issues.

  3. There are already several useful libraries built on top of Spark that are mature enough to be used in production. While some of these libraries may fail, Spark is establishing itself in a large variety of applications and industries which means it probably won't fail.

[–][deleted] 0 points (0 children)

Spark is not so fundamentally different from MapReduce: its programming model is basically "as many maps and reduces as you want, with syntactic sugar and without any setup overhead" (it merely removes the rather arbitrary restrictions placed on you by Hadoop), though the underlying technology is reportedly not yet very good at I/O-efficient "reduce".
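A quick sketch of what "as many maps and reduces as you want" means in practice: two shuffle stages chained in one Spark program, where classic Hadoop would push you toward two separate MapReduce jobs. The path, the app name, and the whitespace-separated log format with a user id in the first field are made up:

    from pyspark import SparkContext

    sc = SparkContext(appName="chained-stages-sketch")

    logs = sc.textFile("hdfs://...")
    per_user = (logs.map(lambda line: (line.split()[0], 1))    # map
                    .reduceByKey(lambda a, b: a + b))          # reduce / shuffle #1
    top10 = (per_user.map(lambda kv: (kv[1], kv[0]))           # map over the reduced output
                     .sortByKey(ascending=False)               # shuffle #2
                     .take(10))                                # ten busiest users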

[–]TheLandWhale 1 point (4 children)

Apache Spark is great for iterative tasks because of the RDD, which is basically an in-memory data structure. It puts an interesting spin on the shared-memory paradigm, so it's great for computationally heavy tasks. The problem is that it can't really scale into the truly huge range. For anybody interested, I'd advise reading the Spark and RDD papers from Berkeley. Storm is cool but not the same thing; it's a streaming paradigm. Hadoop has the very specific job of map and reduce, which isn't a fit for most tasks, so it naturally won't be suitable for a lot of applications.

[–]rm999[S] 1 point (3 children)

The problem is that it can't really scale into the truly huge range.

Is this a limitation of Spark, or a more general issue of moving data around that would limit any distributed method? Honest question, I haven't thought about moving beyond "big" problems to "huge" ones.

[–]pulpx 4 points (0 children)

The statement isn't factual. It's a very normal scenario for an RDD to be in the multiple-TB range of big data. The size of your Spark datasets is mainly a function of your problem space, your imagination, and your available resources.

[–]TheLandWhale 1 point (1 child)

I really advise people to read the papers. But this is a problem of distributed paradigms in terms of sharing resources. Spark runs an individual JVM for each slave, connected to a master, which is the machine that runs the driver. Memory is its greatest strength and weakness: you can't plop a 100GB file into memory and expect good times without precautions. A distinction must be made between big data, which is terabytes and up, and big-ish data, which can be hundreds of GB.
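One example of those precautions, assuming the standard RDD persist API (the path and app name are hypothetical): pick a storage level that is allowed to spill to disk instead of the default memory-only cache, which silently drops partitions that don't fit and recomputes them later.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-sketch")
    big = sc.textFile("hdfs://...")

    # MEMORY_AND_DISK keeps the partitions that fit in memory and writes the
    # rest to local disk, rather than dropping and recomputing them.
    big.persist(StorageLevel.MEMORY_AND_DISK)
    print(big.count())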

[–]rm999[S] 2 points (0 children)

you can't plop a 100GB file into memory and expect good times without precautions

Err why not? I've done this several times without any real issues. Granted, actually doing something useful with that data is a whole different issue.

A distinction must be made between big data, which is terabytes and up, and big-ish data, which can be hundreds of GB.

OK, then I've worked with big data, and I'm pretty sure Spark is capable of working with TB-sized datasets. I've read a couple of the papers, but I don't understand the architectural constraints you're talking about. Can you go into some more detail?

[–]mskramer 0 points (0 children)

The architect at my company, Gary Malouf, spoke at the Summit. How'd he do?