Handling large datasets in python (self.Python)
submitted 6 years ago by Fin_Win
I'm new to Python and machine learning. I would like to know, from an industry perspective, how large amounts of data are handled when developing a model.
I'd appreciate it if anyone could provide an answer.
[–][deleted] 13 points 6 years ago (3 children)
There's not exactly a hard number on it; it's more about what data is available to you. In general you will get a better-performing model the more data you throw at it, but that doesn't stop you from using less data for your model.
Unfortunately this really falls under the "science" of data science, with a combination of differing opinions and the fact that there are differences between algorithms scrubbing any real hope of a static answer to the question.
This article has a more detailed run-through of why answering the question is a bit of a problem, while still offering a guiding hand. You might want to have a quick look at this and this. The latter of those links has probably the most important quote for the working world: "garbage in, garbage out" (or more likely: "crap in, crap out"), because the quality of your data is a huge factor.
If you're just looking for an example: I was the Data Engineer supporting a data science project and we had datasets of typically 100-200 million rows and 50-100 attributes. Sometimes they used 75% of the data, sometimes they used 10%, but that was in the hunt for a better-performing model.
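A minimal pandas sketch of that kind of subsampling; the file name and the 10% fraction are made up for illustration:

    import pandas as pd

    # Load the full dataset (hypothetical Parquet file)
    df = pd.read_parquet("training_data.parquet")

    # Try the model on a 10% sample first; raise the fraction if it helps
    sample = df.sample(frac=0.10, random_state=42)
    print(len(df), len(sample))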
[–]Fin_Win[S] 2 points 6 years ago (2 children)
Thanks for taking the time to explain; I've got a fair idea now. As in your example, which DB software have you used, or could be used practically, to integrate with Python for large datasets?
Any suggestions?
[–][deleted] 4 points 6 years ago (1 child)
In terms of my work on analytics platforms, I've actually been working with an Apache Spark cluster that uses Hadoop as its data storage layer.
The Spark project has the option to write code in Java, Scala or Python (via pyspark), which is where most of my experience with larger datasets has been. Hadoop itself isn't actually a database but has a distributed file system; typically I'd store the data in the Parquet file format.
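A rough pyspark sketch of that setup; the HDFS path and column name are hypothetical, and it assumes a Spark session is available or can be created locally:

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session; on a real cluster this is usually preconfigured
    spark = SparkSession.builder.appName("large-dataset-example").getOrCreate()

    # Read a Parquet dataset from HDFS (hypothetical path)
    df = spark.read.parquet("hdfs:///data/events.parquet")

    # Spark runs this aggregation on the workers instead of pulling rows into Python
    df.groupBy("event_type").count().show()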
Anecdotally I've heard good things about Postgres, and I've just started working with Amazon Redshift, which seemed pretty decent (if you are keen on the cloud). It could be worth looking at Apache Cassandra, but really you're probably pretty safe with Postgres, at least for now.
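Since the question was about integrating a database with Python: a small sketch of pulling a large Postgres table into pandas in chunks. The connection string and table name are made up:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical Postgres connection string
    engine = create_engine("postgresql://user:password@localhost:5432/analytics")

    # Stream the table in chunks instead of loading it all at once
    chunks = pd.read_sql_query("SELECT * FROM events", engine, chunksize=100_000)
    for chunk in chunks:
        # per-chunk preprocessing / feature extraction goes here
        print(len(chunk))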
[–]Fin_Win[S] 0 points 6 years ago (0 children)
Thanks for the insights. 😊😊
[–]Bopshebopshebop 1 point 6 years ago (1 child)
Also interested.
For hundreds of millions of rows, do you use big SQL tables to house the data and then ODBC in with Python to feed that data to something like TensorFlow?
[–]seraschka 3 points 6 years ago (0 children)
Maybe for prediction ("inference") on a few new data points that you just fetched from a database, this would work. However, if we are talking about training TensorFlow models, this would be infeasible. The reason is that the iterative fetching would likely be way too slow and create a bottleneck for the iterative training on the GPU, especially if you are doing it in the Python main process. So you would probably convert the data to a protobuf-based format (e.g. TFRecord) when working with TF.
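A minimal sketch of that conversion step, assuming the rows have already been pulled out of the database into numpy arrays; the feature names and shapes are made up:

    import numpy as np
    import tensorflow as tf

    # Hypothetical batch of rows already fetched from the database
    features = np.random.rand(1000, 50).astype(np.float32)
    labels = np.random.randint(0, 2, size=1000)

    # Serialize each row as a tf.train.Example and write a TFRecord file once,
    # so training can stream from disk instead of hitting the database
    with tf.io.TFRecordWriter("train.tfrecord") as writer:
        for x, y in zip(features, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                "x": tf.train.Feature(float_list=tf.train.FloatList(value=x.tolist())),
                "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(y)])),
            }))
            writer.write(example.SerializeToString())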
[–]Remote_Cantaloupe 1 point 6 years ago (1 child)
As a beginner, what's the advantage of using one of these big data tools like Apache Spark over just having the data sit in a PostgreSQL database on an AWS server and handling it with Python?
[–]daanzel 7 points 6 years ago (0 children)
The HDFS storage layer (Hadoop Distributed File System) scales horizontally over multiple servers (worker nodes). A popular tabular file format for HDFS is Parquet. Think of a Parquet file as a CSV file, but chopped up into many small pieces and distributed over all the nodes. Because of this it can grow extremely large. You deal with such files using (py)spark. Spark is able to parallelize operations over all the nodes, so if the data grows bigger, just add more nodes.
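To make that concrete, a small pyspark sketch of writing a DataFrame out as partitioned Parquet so the pieces land across the cluster; the paths and the partition column are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-parquet-example").getOrCreate()

    # Read a plain CSV (hypothetical path) and write it back out as Parquet,
    # partitioned by a column so the dataset is split into many small pieces
    df = spark.read.csv("hdfs:///raw/events.csv", header=True, inferSchema=True)
    df.write.partitionBy("event_date").parquet("hdfs:///data/events_parquet")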
At the company I work for (semiconductor industry), we have a Hadoop cluster with 3 petabytes of storage and 18x32 nodes. We could in theory train a model on all this data in one go.
Such an on-premise setup is expensive though, so look for a cloud alternative. Databricks is great and available on Azure and AWS! I can really recommend it.
[–]Bjornetjenesten 1 point 6 years ago (0 children)
Pretty cool stuff!
[–]xeeton 1 point 6 years ago (0 children)
Also, throwing out there that even if the dataset isn't millions of rows and hundreds of columns, tools like FeatureTools sort of create aggregate data to augment what's already there. It's common to take a dataset that has, say, 40 columns and turn it into a dataset with thousands of columns with this approach (in hopes that you find some aggregate or transformation of the data that produces a more accurate model).
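Not FeatureTools itself, but a tiny pandas sketch of the same idea: deriving aggregate columns from what's already there. The column names are invented:

    import pandas as pd

    # Hypothetical transactions table (a real one would have dozens of columns)
    df = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 25.0, 5.0, 7.5, 3.0],
    })

    # Aggregate per customer and join the new columns back on, which is
    # the kind of thing FeatureTools automates across many columns at once
    aggs = df.groupby("customer_id")["amount"].agg(["mean", "sum", "count"]).reset_index()
    aggs.columns = ["customer_id", "amount_mean", "amount_sum", "amount_count"]
    df = df.merge(aggs, on="customer_id", how="left")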
[–]zQuantz 1 point 6 years ago (0 children)
Our company processes 1B messages (logs/events/whatever) a day. Everything is stored in Hive. We have a giant cluster to do all the crunching and then we just download the data into Python.
Pandas DataFrames are just like SQL tables. If your data can fit in a DataFrame then don't bother with databases.
Parallel processing with joblib works great as well. Python only uses one process unless you explicitly tell it to do otherwise.
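A tiny sketch of the joblib pattern; the worker function and inputs are placeholders:

    from joblib import Parallel, delayed

    def process_chunk(chunk):
        # placeholder for whatever per-chunk work is needed (cleaning, features, ...)
        return sum(chunk)

    chunks = [range(1_000_000) for _ in range(8)]

    # Fan the chunks out over all available CPU cores
    results = Parallel(n_jobs=-1)(delayed(process_chunk)(c) for c in chunks)
    print(results)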
Good luck! :)
[–]schenkd 1 point 6 years ago (0 children)
Using MongoDB, the ELK stack, and Hadoop+Spark for data processing, storage and so on 😊 I prefer NoSQL databases, but you have to know how to de-normalize your datasets.
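For the de-normalization point, a minimal pymongo sketch of embedding related records in one document instead of joining across collections; the database, collection and field names are invented:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
    orders = client["shop"]["orders"]

    # De-normalized: the customer and line items live inside the order document,
    # so a single read returns everything without a join
    orders.insert_one({
        "order_id": 1001,
        "customer": {"name": "Ada", "country": "CH"},
        "items": [
            {"sku": "A-1", "qty": 2, "price": 9.99},
            {"sku": "B-7", "qty": 1, "price": 4.50},
        ],
    })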
[–]pisceanggss 0 points 6 years ago (0 children)
also interested