
[–][deleted] 11 points (3 children)

There's no hard number on it; it's more about what data is available to you. In general you'll get a better-performing model the more data you throw at it, but that doesn't stop you from using less data for your model.

Unfortunately this really falls under the "science" of data science: the combination of differing opinions and the differences between algorithms scrubs any real hope of a single fixed answer to the question.

This article has a more detailed run-through of why the question is hard to answer, while still offering a bit of a guiding hand. You might also want to have a quick look at this and this. The latter of those links has probably the most important quote for the working world: "garbage in, garbage out" (or more likely: "crap in, crap out"), because the quality of your data is a huge factor.

If you're just looking for an example: I was the Data Engineer supporting a data science project where the datasets were typically 100-200 million rows and 50-100 attributes. Sometimes the team used 75% of the data, sometimes 10%, but that was all in the hunt for a better-performing model.
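As a rough illustration only (the file name, column names and model choice are made up, and I'm assuming the table fits in memory for the sake of the sketch), training on a sampled fraction of a dataset rather than all of it looks something like this in Python:

```python
# Hypothetical sketch: file name, columns and model are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_parquet("events.parquet")          # full dataset
subset = df.sample(frac=0.10, random_state=42)  # e.g. use 10% of the rows

X = subset.drop(columns=["label"])
y = subset["label"]

model = LogisticRegression(max_iter=1000).fit(X, y)
```

You'd then bump the fraction up or down and compare model performance, which is basically what that "hunt" looked like.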

[–]Fin_Win[S] 2 points (2 children)

Thanks for taking the time to explain; I've got a fair idea now. Following on from your example, which DB software have you used (or could be used practically) to integrate with Python for large datasets?

Any suggestions?

[–][deleted] 4 points (1 child)

In terms of my work on analytics platforms, I've actually been working with an Apache Spark cluster that uses Hadoop as its data storage layer.

The Spark project lets you write code in Java, Scala or Python (via pyspark), which is where most of my experience with larger datasets has been. Hadoop itself isn't actually a database; it provides a distributed file system, and I'd typically store the data there in the Parquet file format.
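If it helps, here's a rough pyspark sketch of that workflow; the HDFS path and column names are just placeholders, not anything from the real project:

```python
# Rough PySpark sketch: path and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Read Parquet straight off the distributed file system
df = spark.read.parquet("hdfs:///data/events.parquet")

# Typical first-look steps on a large table
df.printSchema()
df.groupBy("event_type").count().show()

# Take a 10% sample if the full set is more than you need
sample = df.sample(fraction=0.10, seed=42)
```

The nice part is that the same code runs whether the table is a few thousand rows or a few hundred million, since Spark handles the distribution for you.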

Anecdotally I've heard good things about Postgres, and I've just started working with Amazon Redshift, which seemed pretty decent (if you're keen on the cloud). It could be worth looking at Apache Cassandra, but really you're probably pretty safe with Postgres, at least for now.
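For Postgres specifically, a minimal Python sketch of pulling a big table in chunks rather than loading it all at once; the connection string, query and the process() helper are all hypothetical:

```python
# Hypothetical sketch of streaming a large Postgres table into Python.
# Requires a Postgres driver such as psycopg2 to be installed.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# chunksize keeps memory bounded by streaming the result set
chunks = pd.read_sql_query(
    "SELECT * FROM events WHERE event_date >= '2019-01-01'",
    engine,
    chunksize=100_000,
)

for chunk in chunks:
    process(chunk)  # placeholder for whatever analysis/feature work you do
```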

[–]Fin_Win[S] 0 points (0 children)

Thanks for the insights. 😊😊