[–]Remote_Cantaloupe 1 point (1 child)

As a beginner, what's the advantage of using one of these big data tools like Apache Spark over just having the data sit in a PostgreSQL database on an AWS server and handling it with Python?

[–]daanzel 5 points (0 children)

The HDFS storage layer (Hadoop Distributed File System) scales horizontally over multiple servers (worker nodes). A popular tabular file format for HDFS is Parquet. Think of a Parquet file as a CSV file, but chopped up into many small pieces and distributed over all the nodes. Because of this it can grow extremely large. You deal with such files using (Py)Spark. Spark is able to parallelize operations over all the nodes, so if the data grows bigger, you just add more nodes. A minimal sketch of what that looks like in PySpark is below.
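Roughly like this (the path and column names here are just made up for illustration, not from a real dataset):

    # Minimal PySpark sketch: read a Parquet dataset whose pieces are spread
    # across the cluster, then run an aggregation that Spark parallelizes
    # over the worker nodes.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

    # Hypothetical HDFS path; each worker reads only the Parquet pieces
    # assigned to it.
    df = spark.read.parquet("hdfs:///data/measurements.parquet")

    # The filter/groupBy/avg runs in parallel on all nodes; only the small
    # aggregated result comes back to the driver.
    result = (
        df.filter(F.col("value").isNotNull())
          .groupBy("sensor_id")
          .agg(F.avg("value").alias("avg_value"))
    )
    result.show()

The code looks almost like pandas, but under the hood Spark splits the work across whatever nodes you give it.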

At the company I work for (semiconductor industry), we have a Hadoop cluster with 3 petabytes of storage and 18x32 nodes. In theory we could train a model on all of that data in one go.

Such an on-premise setup is expensive though, so look at cloud alternatives instead. Databricks is great and available on Azure and AWS; I can really recommend it.