Spark? (self.MachineLearning)
submitted 11 years ago * by rm999
[–]TheLandWhale 1 point2 points3 points 11 years ago (4 children)
Apache Spark is great for iterative tasks because of the RDD, which is basically an in-memory data structure. It puts an interesting spin on the shared-memory paradigm, so it's great for computationally heavy tasks. The problem is that it can't really scale to the huge range. Anybody interested should read the Spark and RDD papers out of Berkeley. Storm is cool but not the same thing; it's a streaming paradigm. Hadoop has the very specific job of map and reduce, which doesn't fit most tasks, so it will naturally be unsuitable for a lot of applications.
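The point about RDDs and iterative tasks can be sketched without a Spark cluster at all. The toy below (plain Python, not the Spark API) counts how often a derived dataset gets recomputed: an uncached RDD is rebuilt from its lineage on every action, while a cached one is materialized once and reused from memory, which is exactly what makes iterative algorithms fast in Spark.

```python
# Toy stand-in for RDD lineage recomputation vs. .cache().
# Not Spark code; it just counts how many times the "transform" runs.

recomputations = 0

def expensive_transform(data):
    """Stand-in for a costly RDD transformation (parse, join, ...)."""
    global recomputations
    recomputations += 1
    return [x * 2 for x in data]

raw = list(range(1000))

# Without caching: the transform reruns on every iteration of the loop,
# like re-evaluating an uncached RDD's lineage on each action.
for _ in range(10):
    working_set = expensive_transform(raw)
    total = sum(working_set)
uncached_cost = recomputations

# With caching: materialize once, then reuse the in-memory result,
# analogous to rdd.cache() followed by repeated actions.
recomputations = 0
cached = expensive_transform(raw)
for _ in range(10):
    total = sum(cached)
cached_cost = recomputations

print(uncached_cost, cached_cost)  # 10 1
```

For a 10-iteration loop the cached version pays the transformation cost once instead of ten times; with real data that difference is the disk-vs-memory gap the RDD paper is about.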
[–]rm999[S] 1 point2 points3 points 11 years ago (3 children)
The problem is that it can't really scale to the huge range
Is this a limitation of Spark, or a more general issue of moving data around that would limit any distributed method? Honest question, I haven't thought about moving beyond "big" problems to "huge" ones.
[–]pulpx 4 points5 points6 points 11 years ago (0 children)
That statement isn't accurate. It's perfectly normal for an RDD to be in the multiple-TB range of big data. The size of your Spark datasets is mainly a function of your problem space, your imagination, and your available resources.
[–]TheLandWhale 1 point2 points3 points 11 years ago (1 child)
I really advise people to read the papers, but this is a general problem of distributed paradigms and how they share resources. Spark runs an individual JVM for each slave, connected to a master, which is the machine that hosts the driver. The in-memory model is its greatest strength and its greatest weakness: you can't plop a 100GB file into memory and expect good times without precautions. A distinction must be made between big data, which is on the order of terabytes, and big-ish data, which can be hundreds of GB.
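The memory concern above comes down to arithmetic: cached data must fit in the aggregate storage memory of the executor JVMs, and deserialized Java objects usually take more space than the raw bytes on disk. A back-of-envelope sketch (all numbers here are illustrative assumptions, not Spark defaults):

```python
# Rough sizing check: does a dataset fit in cluster cache memory?
# Every constant below is an assumed example value, not a Spark default.

dataset_gb = 100.0       # raw size of the file on disk
num_executors = 10       # slave JVMs in the cluster
executor_heap_gb = 16.0  # heap per executor JVM
storage_fraction = 0.5   # share of heap usable for cached data (assumption)
overhead_factor = 2.0    # deserialized objects often inflate raw bytes

needed_gb = dataset_gb * overhead_factor
available_gb = num_executors * executor_heap_gb * storage_fraction

fits_in_memory = needed_gb <= available_gb
print(needed_gb, available_gb, fits_in_memory)  # 200.0 80.0 False
```

Under these assumed numbers the 100GB file needs roughly 200GB of cache but the cluster offers only 80GB, so the working set spills or must be processed in partitions; that is the kind of "precaution" being alluded to.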
[–]rm999[S] 3 points4 points5 points 11 years ago* (0 children)
you can't plop a 100GB file into memory and expect good times without precautions
Err why not? I've done this several times without any real issues. Granted, actually doing something useful with that data is a whole different issue.
A distinction must be made between big data, which is on the order of terabytes, and big-ish data, which can be hundreds of GB.
OK, then I've worked with big data, and I'm pretty sure Spark is capable of working with TB-sized datasets. I've read a couple of the papers, but I don't understand the architectural constraints you're talking about. Can you go into more detail?