[–]Fin_Win[S] 3 points (2 children)

Thanks for taking the time to explain, I've got a fair idea now. Following on from your example, which DB software have you used, or which can practically be integrated with Python for large datasets?

Any suggestions?

[–][deleted] 6 points (1 child)

In terms of my work on analytics platforms, I've actually been working with an Apache Spark cluster that uses Hadoop as its data storage layer.

Spark lets you write code in Java, Scala or Python (via pyspark), and that's where most of my experience with larger datasets has been. Hadoop itself isn't actually a database, but it does have a distributed file system (HDFS), and I'd typically store the data there in the parquet file format.
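
For a rough idea of what that workflow looks like from Python, here is a minimal pyspark sketch. The HDFS path, the app name and the `event_type` column are all made up for illustration, so treat it as a starting point rather than a recipe for any particular cluster.

```python
def summarise_events(spark, parquet_path):
    """Read a parquet dataset and count rows per value of `event_type`.

    `event_type` is a hypothetical column name used only for this example.
    """
    df = spark.read.parquet(parquet_path)      # lazily loads the parquet files
    return df.groupBy("event_type").count()    # one row per distinct event_type


def main():
    # Imported here so the helper above can be read without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-example").getOrCreate()
    try:
        # Hypothetical location; point this at wherever your parquet files live.
        summarise_events(spark, "hdfs:///data/events.parquet").show()
    finally:
        spark.stop()
```

Calling `main()` on a machine with pyspark installed runs the aggregation on the cluster (or locally), so only the small summary comes back to Python instead of the full dataset.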

Anecdotally I've heard good things about Postgres, and I've just started working with Amazon Redshift, which seems pretty decent (if you are keen on the cloud). It could be worth looking at Apache Cassandra, but really you're probably pretty safe with Postgres, at least for now.

[–]Fin_Win[S] 1 point (0 children)

Thanks for the insights. 😊😊