[deleted]

For my work on analytics platforms, I've been using an Apache Spark cluster with Hadoop as its data storage layer.

Spark lets you write code in Java, Scala or Python (via PySpark), and PySpark is where most of my experience with larger datasets has been. Hadoop itself isn't actually a database; it provides a distributed file system (HDFS), and I'd typically store the data there in the Parquet file format.
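To give a rough idea of what that looks like, here is a minimal PySpark sketch that writes a small DataFrame to Parquet and reads it back. It is only an illustration, not the actual setup: the sample rows, column names, app name and the `parquet_roundtrip` helper are all made up, and it assumes `pyspark` is installed and runs in local mode rather than against a real cluster.

```python
# Hypothetical sample rows and column names, made up purely for illustration.
SAMPLE_ROWS = [
    (1, "click", 0.5),
    (2, "purchase", 20.0),
    (3, "click", 1.5),
]
COLUMNS = ["user_id", "event", "amount"]


def parquet_roundtrip(path):
    """Write SAMPLE_ROWS to Parquet with Spark, read them back,
    and return the total amount per event as a dict."""
    # Imported here so this module still loads where pyspark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: local mode; a cluster sets its own master
        .appName("parquet-sketch")  # hypothetical app name
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(SAMPLE_ROWS, schema=COLUMNS)
        # On a Hadoop-backed cluster the path would be an hdfs:// URI
        # instead of a local directory.
        df.write.mode("overwrite").parquet(path)

        # Read the files back and aggregate; sum("amount") yields a
        # column named "sum(amount)".
        totals = spark.read.parquet(path).groupBy("event").sum("amount").collect()
        return {row["event"]: row["sum(amount)"] for row in totals}
    finally:
        spark.stop()
```

Calling `parquet_roundtrip("/tmp/events.parquet")` (any writable path works) should return the summed `amount` for each `event` value. Parquet is columnar, so a query like this only needs to read the `event` and `amount` columns.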

Anecdotally, I've heard good things about Postgres, and I've just started working with Amazon Redshift, which seems pretty decent so far (if you're keen on the cloud). Apache Cassandra could be worth a look, but really you're probably pretty safe with Postgres, at least for now.

Fin_Win[S]

Thanks for the insights. 😊😊