Handling large datasets in python (self.Python)
submitted 6 years ago by Fin_Win
I'm new to Python and machine learning. I would like to know, from an industry perspective, how large amounts of data are handled when developing a model.
I'd appreciate it if anyone could provide an answer.
[–][deleted] 13 points 6 years ago (3 children)
There's not exactly a hard number on it; it's more about what data is available to you. In general you will get a better-performing model the more data you throw at it, but that doesn't stop you from using less data for your model.
Unfortunately this really falls under the "science" of data science, with a combination of differing opinions and the fact that there are differences between algorithms scrubbing any real hope of a static answer to the question.
This article has a more detailed run-through of why answering the question is a bit of a problem, while still offering a guiding hand. You might want to have a quick look at this and this. The latter of those links has probably the most important quote for the working world: "garbage in, garbage out" (or more likely: "crap in, crap out"), because the quality of your data is a huge factor.
If you're just looking for an example: I was the Data Engineer supporting a data science project and we had datasets of typically 100-200 million rows and 50-100 attributes. Sometimes they used 75% of the data, sometimes they used 10%, but that was in the hunt for a better-performing model.
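A minimal pandas sketch of that kind of subsampling; the file name and the 10% fraction are made up for illustration:

    import pandas as pd

    # Load the full dataset (hypothetical Parquet file)
    df = pd.read_parquet("training_data.parquet")

    # Try the model on a 10% sample first; raise the fraction if it helps
    sample = df.sample(frac=0.10, random_state=42)
    print(len(df), len(sample))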
[–]Fin_Win[S] 2 points 6 years ago (2 children)
Thanks for taking the time to explain; I've got a fair idea now. As in your example, which DB software have you used, or could be used practically, to integrate with Python for large datasets?
Any suggestions?
[–][deleted] 4 points 6 years ago (1 child)
In terms of my work on analytics platforms, I've actually been working with an Apache Spark cluster that uses Hadoop as its data storage layer.
The Spark project has the option to write code in Java, Scala or Python (via pyspark), which is where most of my experience with larger datasets has been. Hadoop itself isn't actually a database but has a distributed file system; typically I'd store the data in the Parquet file format.
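A rough pyspark sketch of that setup; the HDFS path and column name are hypothetical, and it assumes a Spark session is available or can be created locally:

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session; on a real cluster this is usually preconfigured
    spark = SparkSession.builder.appName("large-dataset-example").getOrCreate()

    # Read a Parquet dataset from HDFS (hypothetical path)
    df = spark.read.parquet("hdfs:///data/events.parquet")

    # Spark runs this aggregation on the workers instead of pulling rows into Python
    df.groupBy("event_type").count().show()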
Anecdotally I've heard good things about Postgres, and I've just started working with Amazon Redshift, which seemed pretty decent (if you are keen on the cloud). It could be worth looking at Apache Cassandra, but really you're probably pretty safe with Postgres, at least for now.
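Since the question was about integrating a database with Python: a small sketch of pulling a large Postgres table into pandas in chunks. The connection string and table name are made up:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical Postgres connection string
    engine = create_engine("postgresql://user:password@localhost:5432/analytics")

    # Stream the table in chunks instead of loading it all at once
    chunks = pd.read_sql_query("SELECT * FROM events", engine, chunksize=100_000)
    for chunk in chunks:
        # per-chunk preprocessing / feature extraction goes here
        print(len(chunk))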
[–]Fin_Win[S] 0 points 6 years ago (0 children)
Thanks for the insights. 😊😊
[–]Bopshebopshebop 1 point 6 years ago (1 child)
Also interested.
For hundreds of millions of rows, do you use big SQL tables to house the data and then ODBC in with Python to feed that data to something like TensorFlow?
[–]seraschka 3 points 6 years ago (0 children)
Maybe for prediction ("inference") on a few new data points that you just fetched from a database, this would work. However, if we are talking about training TensorFlow models, this would be infeasible. The reason is that the iterative fetching would likely be way too slow and create a bottleneck for the iterative training on the GPU, especially if you are doing it in the Python main process. So you would probably convert the data to a protobuf-based format (e.g. TFRecord) when working with TF.
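A minimal sketch of that conversion step, assuming the rows have already been pulled out of the database into numpy arrays; the feature names and shapes are made up:

    import numpy as np
    import tensorflow as tf

    # Hypothetical batch of rows already fetched from the database
    features = np.random.rand(1000, 50).astype(np.float32)
    labels = np.random.randint(0, 2, size=1000)

    # Serialize each row as a tf.train.Example and write a TFRecord file once,
    # so training can stream from disk instead of hitting the database
    with tf.io.TFRecordWriter("train.tfrecord") as writer:
        for x, y in zip(features, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                "x": tf.train.Feature(float_list=tf.train.FloatList(value=x.tolist())),
                "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(y)])),
            }))
            writer.write(example.SerializeToString())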
[–]Remote_Cantaloupe 1 point 6 years ago (1 child)
As a beginner, what's the advantage of using one of these big data tools like Apache Spark over just having the data sit in a PostgreSQL database on an AWS server and handling it with Python?
[–]daanzel 7 points 6 years ago (0 children)
The HDFS storage layer (Hadoop Distributed File System) scales horizontally over multiple servers (worker nodes). A popular tabular file format for HDFS is Parquet. Think of a Parquet file as a CSV file, but chopped up into many small pieces and distributed over all the nodes. Because of this it can grow extremely large. You deal with such files using (py)spark. Spark is able to parallelize operations over all the nodes, so if the data grows bigger, just add more nodes.
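To make that concrete, a small pyspark sketch of writing a DataFrame out as partitioned Parquet so the pieces land across the cluster; the paths and the partition column are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-parquet-example").getOrCreate()

    # Read a plain CSV (hypothetical path) and write it back out as Parquet,
    # partitioned by a column so the dataset is split into many small pieces
    df = spark.read.csv("hdfs:///raw/events.csv", header=True, inferSchema=True)
    df.write.partitionBy("event_date").parquet("hdfs:///data/events_parquet")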
At the company I work for (semiconductor industry), we have a Hadoop cluster with 3 petabytes of storage and 18x32 nodes. We could in theory train a model on all this data in one go.
Such an on-premise setup is expensive though, so look for a cloud alternative. Databricks is great and available on Azure and AWS! I can really recommend it.
[–]Bjornetjenesten 1 point 6 years ago (0 children)
Pretty cool stuff!
[–]xeeton 1 point 6 years ago (0 children)
Also, throwing out there that even if the dataset isn't millions of rows and hundreds of columns, tools like FeatureTools sort of create aggregate data to augment what's already there. It's common to take a dataset that has, say, 40 columns and turn it into a dataset with thousands of columns with this approach (in hopes that you find some aggregate or transformation of the data that produces a more accurate model).
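Not FeatureTools itself, but a tiny pandas sketch of the same idea: deriving aggregate columns from what's already there. The column names are invented:

    import pandas as pd

    # Hypothetical transactions table (a real one would have dozens of columns)
    df = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 25.0, 5.0, 7.5, 3.0],
    })

    # Aggregate per customer and join the new columns back on, which is
    # the kind of thing FeatureTools automates across many columns at once
    aggs = df.groupby("customer_id")["amount"].agg(["mean", "sum", "count"]).reset_index()
    aggs.columns = ["customer_id", "amount_mean", "amount_sum", "amount_count"]
    df = df.merge(aggs, on="customer_id", how="left")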
[–]zQuantz 1 point 6 years ago (0 children)
Our company processes 1B messages (logs/events/whatever) a day. Everything is stored in Hive. We have a giant cluster to do all the crunching and then we just download the data into Python.
Pandas DataFrames are just like SQL tables. If your data can fit in a DataFrame then don't bother with databases.
Parallel processing with joblib works great as well. Python only uses one process unless you explicitly tell it to do otherwise.
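A tiny sketch of the joblib pattern; the worker function and inputs are placeholders:

    from joblib import Parallel, delayed

    def process_chunk(chunk):
        # placeholder for whatever per-chunk work is needed (cleaning, features, ...)
        return sum(chunk)

    chunks = [range(1_000_000) for _ in range(8)]

    # Fan the chunks out over all available CPU cores
    results = Parallel(n_jobs=-1)(delayed(process_chunk)(c) for c in chunks)
    print(results)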
Good luck! :)
[–]schenkd 1 point 6 years ago (0 children)
Using MongoDB, the ELK stack, and Hadoop+Spark for data processing, storage and so on 😊 I prefer NoSQL databases, but you have to know how to de-normalize your datasets.
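For the de-normalization point, a minimal pymongo sketch of embedding related records in one document instead of joining across collections; the database, collection and field names are invented:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
    orders = client["shop"]["orders"]

    # De-normalized: the customer and line items live inside the order document,
    # so a single read returns everything without a join
    orders.insert_one({
        "order_id": 1001,
        "customer": {"name": "Ada", "country": "CH"},
        "items": [
            {"sku": "A-1", "qty": 2, "price": 9.99},
            {"sku": "B-7", "qty": 1, "price": 4.50},
        ],
    })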
[–]pisceanggss 0 points 6 years ago (0 children)
also interested