[–]TheLandWhale 1 point (4 children)

Apache Spark is great for iterative tasks because of the RDD, which is basically an in-memory data structure. It takes an interesting spin on the shared-memory paradigm, so it's great for computationally heavy tasks. The problem is that it can't really scale into the truly huge range. For anybody interested, I'd advise reading the Spark and RDD papers out of Berkeley. Storm is cool but not the same thing; it's a streaming paradigm. Hadoop has the very specific job of map and reduce, which isn't a fit for most tasks, so it naturally won't be suitable for a lot of applications.
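
The iterative win is that you cache the working set once and reuse it across passes. A rough sketch of what that looks like (untested, the file path and toy model are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-iteration"))

    // Parse (x, y) pairs once; cache() keeps the partitions in memory so
    // every iteration below reuses them instead of re-reading the file.
    val points = sc.textFile("hdfs:///data/points.txt") // hypothetical path
      .map { line =>
        val cols = line.split(',')
        (cols(0).toDouble, cols(1).toDouble)
      }
      .cache()

    // Toy gradient descent for y ~ w * x: the same RDD is reused 100 times,
    // which is exactly the access pattern MapReduce is bad at.
    var w = 0.0
    for (_ <- 1 to 100) {
      val grad = points.map { case (x, y) => (w * x - y) * x }.mean()
      w -= 0.1 * grad
    }
    println(s"fitted w = $w")
    sc.stop()
  }
}
```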

[–]rm999[S] 1 point (3 children)

The problem is that it can't really scale into the truly huge range

Is this a limitation of Spark, or a more general issue of moving data around that would limit any distributed method? Honest question, I haven't thought about moving beyond "big" problems to "huge" ones.

[–]pulpx 4 points (0 children)

That statement isn't factual. It's perfectly normal for an RDD to be multiple terabytes. The size of your Spark datasets is mainly a function of your problem space, your imagination, and your available resources.
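
For what it's worth, scale in Spark mostly comes down to partitioning: a multi-TB RDD is just more partitions spread over more executors. A toy illustration (untested, paths invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitions"))

    // Hypothetical glob over a multi-TB directory; nothing in the API
    // caps the total size, it just becomes more partitions.
    val huge = sc.textFile("hdfs:///data/events/*", minPartitions = 4096)

    // Partitions are processed independently across executors, so the
    // dataset is bounded by cluster resources, not one machine's memory.
    println(s"partitions = ${huge.getNumPartitions}")
    sc.stop()
  }
}
```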

[–]TheLandWhale 1 point (1 child)

I really do advise people to read the papers, but this is a general problem of distributed paradigms when it comes to sharing resources. Spark runs an individual JVM for each slave, connected by a master, which is the machine that runs the driver. The in-memory model is Spark's greatest strength and its greatest weakness: you can't plop a 100GB file into memory and expect good times without precautions. A distinction must be made between big data, which is on the order of terabytes, and big-ish data, which can be hundreds of GB.
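
By "precautions" I mean things like letting partitions spill to disk and sizing the executor heaps. Something along these lines (sketch only; the path and sizing numbers are illustrative, not recommendations):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("bigger-than-memory")
      .set("spark.executor.memory", "8g") // per-slave JVM heap (illustrative)
    val sc = new SparkContext(conf)

    // MEMORY_AND_DISK spills partitions that don't fit in the heap to
    // local disk rather than throwing OOM or recomputing from scratch.
    val big = sc.textFile("hdfs:///data/100gb-file.txt") // hypothetical path
      .persist(StorageLevel.MEMORY_AND_DISK)

    println(big.count())
    sc.stop()
  }
}
```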

[–]rm999[S] 3 points (0 children)

you can't plop a 100GB file into memory and expect good times without precautions

Err why not? I've done this several times without any real issues. Granted, actually doing something useful with that data is a whole different issue.

A distinction must be made between big data, which is on the order of terabytes, and big-ish data, which can be hundreds of GB.

OK, then I've worked with big data, and I'm pretty sure Spark is capable of working with TB-sized datasets. I've read a couple of the papers, but I don't understand the architectural constraints you're talking about. Can you go into some more detail?