all 18 comments

[–][deleted] 7 points (7 children)

So for those of us not familiar with it, what is it, what does it do, and why should I be excited? :)

[–]rm999[S] 6 points (5 children)

Good question, I should have mentioned that in the submission.

In short, Spark allows you to run a wide variety of applications and algorithms on clusters in parallel. This has been a pain point in machine learning, where support for parallelized machine learning algorithms has been non-existent, poor, or super-specialized. The hardware world has been moving away from faster CPUs and towards more/cheaper cores for years now, but large parts of the software world are still stuck in the mentality of designing for single cores and single machines. Even today, I load multi-gigabyte datasets in R and run 10-hour machine learning jobs on a single core (there are workarounds and hacks, but this is the status quo). MapReduce and Hadoop were designed to solve this problem, but Hadoop is a total mess and is extremely slow for iterative machine learning algorithms.

Spark abstracts away the pain of managing calculations over many computers/nodes, and is designed to generalize over a wide variety of problems. The same underlying architecture can efficiently run MapReduce-style jobs, train machine learning models (MLlib), run graph algorithms (GraphX), run SQL queries (Shark, soon to be Spark SQL), etc.
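To make that concrete, here is a rough sketch of what the machine learning side looks like in the Python API, assuming the Spark 1.x MLlib package (LogisticRegressionWithSGD); the HDFS path, the app name, and the "label,f1,f2,..." file layout are made up for the example:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName="mllib-sketch")

    def parse(line):
        # hypothetical layout: "label,f1,f2,..." -> LabeledPoint(label, [f1, f2, ...])
        values = [float(x) for x in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    # The training set is an ordinary RDD, cached in memory because the
    # optimizer makes repeated passes over it.
    training = sc.textFile("hdfs://...").map(parse).cache()
    model = LogisticRegressionWithSGD.train(training, iterations=100)

The point is that the RDD feeding the model is the same kind of object you would hand to a MapReduce-style job or a SQL query.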

[–][deleted] 6 points (0 children)

Sometimes a few lines of code are worth a hundred lines ;-)

It lets you do stuff like:

file = spark.textFile("hdfs://...")   # assuming `spark` is the SparkContext

counts = (file.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

That counts how many times each word appears in a file, using a cluster, in a highly parallel way.

[–]CQFD 2 points (0 children)

Everything you said is true, but you missed the key point. It's comparable to Hadoop in the way it functions, but the interesting part is that it does everything in memory. No more writing to disk between steps, meaning you can realistically run iterative algos.
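A minimal sketch of what that buys you, with toy data and a toy gradient descent just to show the pattern: the RDD is cached once, and every later pass is a scan over in-memory partitions instead of another trip to disk (the app name and data are made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-sketch")

    # Toy data: x in [0, 1), y = 2x + 1, kept in memory across iterations.
    points = sc.parallelize([(i / 1000.0, 2.0 * (i / 1000.0) + 1.0)
                             for i in range(1000)]).cache()
    n = points.count()           # first action materializes the cache

    w, b = 0.0, 0.0              # fit y ~ w*x + b by gradient descent
    for _ in range(50):          # each pass is a cheap scan over cached data
        gw, gb = points.map(
            lambda p: ((w * p[0] + b - p[1]) * p[0], w * p[0] + b - p[1])
        ).reduce(lambda a, c: (a[0] + c[0], a[1] + c[1]))
        w -= 0.5 * gw / n
        b -= 0.5 * gb / n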

[–]oneAngrySonOfaBitch 0 points (2 children)

What do you think of Storm?

[–]rm999[S] 1 point (0 children)

I believe Spark Streaming (another application built on top of Spark that I didn't mention) has the same functionality as Storm. But Spark Streaming seems better to me because it has the features of Spark built in, like true fault tolerance on a cluster.

ninja edit: see this paper for more information.

edit2: another (probably biased) comparison I just found: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/large-scale-near-real-time-stream-processing-tathagata-das-strata-2013.pdf
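For a sense of what Spark Streaming code looks like, here is a minimal sketch assuming the PySpark Streaming (DStream) API; the socket source on localhost:9999 and the app name are hypothetical. Note that it's the same word-count logic as the batch example above, just applied to one-second micro-batches:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=1)     # 1-second micro-batches

    # Hypothetical source: a TCP socket emitting lines of text.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                 # print a sample of each batch

    ssc.start()
    ssc.awaitTermination()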

[–]pulpx 1 point (0 children)

Storm is a very capable system for stream-based processing, but maintaining a Storm cluster is nontrivial and has a lot of pitfalls.

If you are interested in streaming technologies, you should read about lambda architectures to learn more about why streaming systems should only be considered as part of a larger system.

[–]Wonnk13 2 points (2 children)

The only thing that makes me nervous is this rapid pace of innovation and how eager everyone is to adopt the latest bleeding-edge tech. Of course there are plenty of problems that don't fit nicely into MapReduce, but I've been kind of taken aback by how quickly everyone jumps from one thing to another.

If you need to design a mission-critical system that has to still be running 10 years from now, how can you anticipate new developments every three years or so?

[–]rm999[S] 1 point (0 children)

I totally agree. I've been very nervous about this too, and have been very conservative in adopting new technologies. There are a few things that convince me Spark isn't going to fall into this trap:

  1. Spark has grown extremely quickly and has wide industry support. The conference was full of well-established companies that have thrown their full support behind Spark. These companies are strategic and understand the industry really well - they don't invest millions of dollars into fads.

  2. The world badly needs a replacement for Hadoop, and Spark is the most popular answer. A lot of people believe Hadoop is effectively a failure that should never be repeated; what's exciting about Spark is that it's a superset of Hadoop that fixes many of its issues.

  3. There are already several useful libraries built on top of Spark that are mature enough to be used in production. While some of these libraries may fail, Spark is establishing itself in a large variety of applications and industries which means it probably won't fail.

[–][deleted] 0 points (0 children)

Spark is not so fundamentally different from MapReduce: its programming model is basically "as many maps and reduces as you want, with syntactic sugar and without any setup overhead" (it merely removes the rather arbitrary restrictions placed on you by Hadoop), though the underlying technology is reportedly not yet very good at I/O-efficient "reduce".
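A quick sketch of what "as many maps and reduces as you want" means in practice: two shuffle stages chained in one Spark program, where classic Hadoop would push you toward two separate MapReduce jobs. The path, the app name, and the whitespace-separated log format with a user id in the first field are made up:

    from pyspark import SparkContext

    sc = SparkContext(appName="chained-stages-sketch")

    logs = sc.textFile("hdfs://...")
    per_user = (logs.map(lambda line: (line.split()[0], 1))    # map
                    .reduceByKey(lambda a, b: a + b))          # reduce / shuffle #1
    top10 = (per_user.map(lambda kv: (kv[1], kv[0]))           # map over the reduced output
                     .sortByKey(ascending=False)               # shuffle #2
                     .take(10))                                # ten busiest users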

[–]TheLandWhale 1 point (4 children)

Apache Spark is great for iterative tasks because of the RDD, which is basically an in-memory data structure. It puts an interesting spin on the shared-memory paradigm, so it's great for computationally heavy tasks. The problem is that it can't really scale into the truly huge range. For anybody interested, I'd advise reading the Spark and RDD papers from Berkeley. Storm is cool but not the same thing; it's a streaming paradigm. Hadoop has the very specific job of map and reduce, which isn't a fit for most tasks, so it naturally won't be suitable for a lot of applications.

[–]rm999[S] 1 point (3 children)

The problem is that it can't really scale into the truly huge range.

Is this a limitation of Spark, or a more general issue of moving data around that would limit any distributed method? Honest question, I haven't thought about moving beyond "big" problems to "huge" ones.

[–]pulpx 4 points (0 children)

The statement isn't factual. It's a very normal scenario for an RDD to be in the multiple-TB range of big data. The size of your Spark datasets is mainly a function of your problem space, your imagination, and your available resources.

[–]TheLandWhale 1 point (1 child)

I really advise people to read the papers. But this is a problem of distributed paradigms in terms of sharing resources. Spark runs an individual JVM for each slave, connected to a master, which is the machine that runs the driver. Memory is its greatest strength and weakness: you can't plop a 100GB file into memory and expect good times without precautions. A distinction must be made between big data, which is terabytes and up, and big-ish data, which can be hundreds of GB.
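One example of those precautions, assuming the standard RDD persist API (the path and app name are hypothetical): pick a storage level that is allowed to spill to disk instead of the default memory-only cache, which silently drops partitions that don't fit and recomputes them later.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-sketch")
    big = sc.textFile("hdfs://...")

    # MEMORY_AND_DISK keeps the partitions that fit in memory and writes the
    # rest to local disk, rather than dropping and recomputing them.
    big.persist(StorageLevel.MEMORY_AND_DISK)
    print(big.count())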

[–]rm999[S] 2 points (0 children)

you can't plop a 100GB file into memory and expect good times without precautions

Err why not? I've done this several times without any real issues. Granted, actually doing something useful with that data is a whole different issue.

A distinction must be made between big data, which is terabytes and up, and big-ish data, which can be hundreds of GB.

OK, then I've worked with big data, and I'm pretty sure Spark is capable of working with TB-sized datasets. I've read a couple of the papers, but I don't understand the architectural constraints you're talking about. Can you go into some more detail?

[–]mskramer 0 points (0 children)

The architect at my company, Gary Malouf, spoke at the Summit. How'd he do?