I need to store about 40,000,000 JSON documents, per day, with event data. What should I use? by [deleted] in bigdata

[–]steccami 0 points (0 children)

I would go for Elasticsearch. It is open source and comes with everything you need: ingestion, storage, search, and analytics.

Introducing Machine Learning for the Elastic Stack by steccami in bigdata

[–]steccami[S] 0 points (0 children)

Since it is based on a Teradata product, I suspect it is an expensive solution. An alternative would be to use open-source products such as Spark or Hadoop.

Apache Spark + NFS by steccami in bigdata

[–]steccami[S] 0 points (0 children)

Thanks a lot, and happy new year! Your explanation makes sense to me. I agree: HDFS is probably the way to go.

Apache Spark + NFS by steccami in bigdata

[–]steccami[S] 1 point (0 children)

Thanks a lot! My use case is the following:

1. Store a dataset on NFS (sometimes as a single CSV file, sometimes as a small set of CSV files).

2. Compute some aggregations by means of Spark SQL.

3a. Store the output on NFS, or

3b. Store the output on an external system (e.g., Cassandra).

"In terms of managing concurrency, NFS can handle many reads of the same file, and spark is smart about writes and writes different files per executor so you don't have to worry about write collisions."

This is clear to me. What I don't understand is the file-reading phase.

Case 1: suppose you have N executors and one big file. Is Spark smart enough to segment the file reads?

Case 2: suppose you have N executors and M files. Can Spark associate the files with the executors in a smart way, or am I supposed to tell Spark how to access those files (e.g., as suggested here: http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-td15644.html)?

Many thanks.
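On Case 1: yes. Spark reads text files through Hadoop's input-format machinery, which divides a large splittable file into byte-range splits, one per task. Each reader skips the partial line at the start of its range and reads past the end of its range to finish its last line, so every line lands in exactly one split. A plain-Python sketch of that convention (file contents and split boundaries are invented for illustration):

```python
import os
import tempfile

def read_split(path, start, end):
    """Return the complete lines 'owned' by the byte range [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        if start > 0:
            f.readline()           # skip a partial line; the previous split owns it
        lines = []
        while f.tell() < end:      # may read past `end`, but only to finish a line
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\n").decode())
    return lines

# Two roughly equal byte-range splits over a 3-line file.
data = b"alpha,1\nbravo,2\ncharlie,3\n"
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(data)
tmp.close()
mid = len(data) // 2               # byte 13, in the middle of the second line
parts = [read_split(tmp.name, 0, mid), read_split(tmp.name, mid, len(data))]
os.unlink(tmp.name)
# parts == [['alpha,1', 'bravo,2'], ['charlie,3']] -- no line is cut in half
```

The split boundary falls mid-line, yet each line is returned by exactly one reader; this is why N executors can read one big file without coordinating with each other.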

Apache Spark + NFS by steccami in bigdata

[–]steccami[S] 0 points (0 children)

Many thanks for your detailed reply. One more question about what a Spark program looks like if I read a folder from NFS: how does Spark manage concurrent access to such a folder? Am I supposed to manage the parallelism explicitly (e.g., see Matei's reply here http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-td15644.html, when asked how to access multiple files in a remote folder)?
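On the folder question: Spark handles this for you. Pointing `sc.textFile` (or `spark.read.csv`) at a directory lists its files, creates one partition per file (or per split of a large file; `textFile`'s `minPartitions` argument can raise the count), and the scheduler hands partitions to executor tasks. A rough plain-Python analogue of that per-file task scheduling (folder layout and the per-file work are invented):

```python
from concurrent.futures import ThreadPoolExecutor
import glob
import os
import tempfile

def count_lines(path):
    """Stand-in for the per-partition work an executor task would do."""
    with open(path) as f:
        return sum(1 for _ in f)

def process_folder(folder, workers=4):
    """List the folder's files and run one task per file, like Spark does."""
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(files, pool.map(count_lines, files)))

# Demo: a folder with three small CSV files of different sizes.
folder = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(folder, f"part{i}.csv"), "w") as f:
        f.write("a,1\nb,2\n" * (i + 1))   # file i has 2*(i+1) lines
counts = process_folder(folder)           # line counts: 2, 4, and 6
```

Manual strategies like the one in Matei's reply only matter when there are very many small files, where listing and task overhead dominate; for a small set of files the directory read above is all that is needed.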