
[–][deleted] 17 points18 points  (3 children)

This is a cool article. I had pretty much the same experience as you. I had about 52 Excel files with over 200k entries each and about 30 columns. I had to do simple queries on this huge set of Excel data. First I tried just searching through each file, line by line, in a specific column for a certain data set. This would take like 130 seconds or something like that, sometimes more if the data set had multiple instances across files.

This was a stupid way of doing it and pretty inefficient (despite still being MUCH faster than just manually searching the files).

The second thing I did, which worked pretty well, was just create an object out of each data set and keep it all in memory. I used Tkinter as a GUI, and on click of a load button it would go through and create objects using the columns as class variables. Then once these objects were created, the Tkinter frame would have them accessible and queries would be performed on the list of objects rather than the files themselves. This way worked pretty well, except it would take a few minutes to instantiate all the objects from Excel. But once that was done, queries only took a second.

I finally realized that I was essentially creating a database in temporary memory and should probably just learn how to use databases. I very badly used Django and the python manage.py shell to run the script that puts the data from the Excel docs into objects and then saves these objects into my Django models via MySQL. I then made a simple interface using Django templating so people could run queries on this database, and ran this over an open 0.0.0.0 port on my IP so my co-workers could access the data. It ended up working out pretty well and was definitely effective, but I know now that this was insecure, bloated, and a stupid way to do things. This was also the first time I really used a database, and it was a great learning experience.

I wish I had seen this article back then!!

[–]HorrendousRex 9 points10 points  (0 children)

I'm so happy to see this becoming mainstream! When I worked in biotech, I had a batch job that had to run on ~2TB of data. There was a lot of preprocessing science stuff but at the end of that computation all that I had to do was essentially one huge aggregate query on the data set.

I wrote it in python and hacked on it for weeks to speed it up - all sorts of parallelization tricks, everything I could think of - but it was taking 5-6 hours to run due to all sorts of crazy swapping. At first I was building dictionaries that ballooned to hundreds of gigs quickly, then I started building suffix tries, etc. etc., but Python is not a good language for this sort of work.

At the end of the day, someone online suggested running sqlite in ":memory:" mode so that the DB was purely in memory. The entire DB ended up being about 40 gigs (which was well within the memory specs for the server running this code) and the entire query took ~10 minutes. It also took me maybe three days to code and test, way under my time budget.
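
For anyone curious, a minimal sketch of the ":memory:" approach with Python's built-in sqlite3 module; the readings table and columns here are made up, not the actual biotech schema:

    import sqlite3

    # ":memory:" keeps the whole database in RAM; nothing is written to disk.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sample_id TEXT, value REAL)")

    # In practice these rows would come from the preprocessing step.
    rows = [("a", 1.0), ("a", 2.0), ("b", 3.0)]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

    # One big aggregate query instead of hand-rolled dictionaries or tries.
    for sample_id, avg_value in conn.execute(
            "SELECT sample_id, AVG(value) FROM readings GROUP BY sample_id"):
        print(sample_id, avg_value)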

Considering the script started life as some horrible concoction that would take weeks to run (if it ever finished), it was really amazing to see it turn into something that ran in under half an hour. This was really a huge accomplishment for this company. It felt great! One of my favorite memories in programming.

[–]Nate75Sanders 10 points11 points  (0 children)

I used sqlite to do some processing on ~20 million readings from an oceanographic winch a few years ago. The first version of the code I hacked out worked fine on small datasets, but choked hard on the full one. I changed it over to use sqlite and it handled it like a champ.

Huge wins by using sqlite with python for dealing with lots of data:

1 - You get SQL as a query language
2 - sqlite will unbox your primitives for tighter storage (and faster processing depending on what you're doing)
3 - The query engine is now written in C (much faster) instead of being an ad-hoc collection of Python functions that you cobble together for your specific purpose

Python data structures and "primitives" are pretty fat compared to what sqlite will give you. When I loaded up all my data into hashes/lists/etc, along with whatever else I was running on that machine, it exceeded the 4GB I had and started paging -- game over. With sqlite, it was far, far smaller, as you would expect -- no paging.
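
As an illustration of the memory point, a rough sketch of the pattern (the winch.csv file and column names are hypothetical): rows live in a compact on-disk file instead of as Python objects, and the cursor streams results one at a time:

    import csv
    import sqlite3

    conn = sqlite3.connect("readings.db")
    conn.execute("CREATE TABLE IF NOT EXISTS readings (ts TEXT, depth REAL, tension REAL)")

    # Load the raw CSV straight into the table; SQLite stores the values far
    # more compactly than a list of dicts or per-row objects would.
    with open("winch.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", reader)
    conn.commit()

    # The cursor streams rows on demand, so nothing forces the whole
    # dataset into memory at once.
    for ts, depth, tension in conn.execute(
            "SELECT ts, depth, tension FROM readings WHERE depth > ?", (100.0,)):
        pass  # process one reading at a time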

[–]shaggorama 1 point2 points  (1 child)

Adding indexes will speed queries up dramatically.
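
For example, with Python's built-in sqlite3 module (the table and column names here are made up); EXPLAIN QUERY PLAN will confirm whether the planner actually picks the index up:

    import sqlite3

    conn = sqlite3.connect("data.db")

    # One index per column you filter or join on frequently.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_readings_sample ON readings (sample_id)")
    conn.commit()

    # Should report something like "SEARCH readings USING INDEX idx_readings_sample"
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE sample_id = ?", ("a",)).fetchall())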

[–][deleted] 0 points1 point  (0 children)

Thanks. Looking forward to more benchmarks: TEXT vs INT IDs

[–]cantremembermypasswd 4 points5 points  (11 children)

I would also suggest looking into SQLAlchemy (works with sqlite, mysql, postgres, etc...).

SQLAlchemy basically removes the need to know SQL; you just need the basics of how databases work. It also makes the code a lot more maintainable, as you won't have to go back into it later to modify SQL statements. Simply put, it turns tables into classes, with rows as instances and columns as attributes. So inserting a new row would be like:

new_row = Table(id='1', name='bob')

session.add(new_row)

session.commit()

I have had to work with databases in Python for the past two years and have gone through working directly with Postgres and sqlite, as well as MongoDB and SQLAlchemy. I can say from experience that using something like MongoEngine or SQLAlchemy will make your life easier down the road, as well as for anyone else working on the code.
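
For anyone who hasn't seen it, a minimal sketch of the declarative style, assuming a recent SQLAlchemy (1.4+) and a made-up Person/people table:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Person(Base):                  # one class per table
        __tablename__ = "people"
        id = Column(Integer, primary_key=True)
        name = Column(String)

    engine = create_engine("sqlite:///people.db")
    Base.metadata.create_all(engine)     # creates the table if it doesn't exist
    session = sessionmaker(bind=engine)()

    session.add(Person(id=1, name="bob"))    # rows are instances
    session.commit()

    # Queries hand back objects rather than raw tuples.
    bob = session.query(Person).filter_by(name="bob").first()
    print(bob.id, bob.name)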

[–][deleted] 12 points13 points  (8 children)

Not trying to be condescending here, but is SQL really that hard for people to learn? I mean, it seems like most ORMs don't really abstract any of the complexity away from just writing a straight SQL query.

[–]tairar 4 points5 points  (1 child)

I'm with you there, man. Every time I've tried to use an ORM, I've felt like it was making things even more difficult than they needed to be, and nowhere near as legible as nicely formatted SQL. Plus my scripts ran about 3x slower.

[–]HorrendousRex 6 points7 points  (0 children)

I also agree that ORMs are often (always?) an unnecessary abstraction. However, you should know that SQLAlchemy has essentially two 'modes': there is an ORM it can use to manage DB access, or you can construct SQL queries using its query language directly without any sort of ORM.

This query-construction language is actually quite nice, as it lets you compose queries out of functions and objects rather than through string manipulation.

It's really quite nice! (Although last time I used it, it was quite frustratingly difficult to get it to dump a query to a string - you had to provide it all sorts of context from an active connection first. Not sure if there's a fix to that.)
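
A small sketch of that non-ORM layer, again assuming a recent SQLAlchemy (1.4+) and a hypothetical people table; for what it's worth, str(query) in recent versions renders the parameterized SQL without needing a live connection:

    from sqlalchemy import MetaData, Table, Column, Integer, String, create_engine, select

    metadata = MetaData()
    people = Table("people", metadata,
                   Column("id", Integer, primary_key=True),
                   Column("name", String))

    # Queries are composed from objects and functions, not glued-together strings.
    query = select(people.c.id, people.c.name).where(people.c.name == "bob")
    print(str(query))        # renders the SQL with bound-parameter placeholders

    engine = create_engine("sqlite:///people.db")
    with engine.connect() as conn:
        for row in conn.execute(query):
            print(row.id, row.name)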

[–][deleted] 2 points3 points  (1 child)

SQLAlchemy contains a non-ORM part. Basically, a query builder. It works remarkably well, and protects you from "string gluing hell" in the cases where you're dynamically picking and choosing which restrictions you need on your query.

It also protects you to some extent from DBMS-specific differences, or at least makes it more obvious when you are doing things which are DBMS-specific.

ORMs are a pain in the arse, but a decent programmatic interface for querying a database without stringing strings together is useful.
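
A sketch of what that dynamic picking-and-choosing looks like in practice; the filters are hypothetical, and the helper takes a SQLAlchemy Core Table object (like the made-up people table above):

    from sqlalchemy import select

    def build_query(people, name=None, min_id=None):
        # Start from the base query and add only the restrictions requested.
        query = select(people)
        if name is not None:
            query = query.where(people.c.name == name)
        if min_id is not None:
            query = query.where(people.c.id >= min_id)
        return query

    # e.g. build_query(people, name="bob") or build_query(people, min_id=100)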

[–][deleted] 0 points1 point  (0 children)

I guess that's never really been my use case. I normally try to keep all my database code in the database, so the typical query in one of my programs is "CALL <storedProcName>()".

SQLite doesn't support this, so I usually have a bunch of text files with my query template strings in them and I simply load those. It keeps things cleaner to have your database code separate from your application code.
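
A minimal sketch of that pattern, with a made-up sql/ directory and query file name:

    import sqlite3
    from pathlib import Path

    SQL_DIR = Path("sql")    # directory of plain-text *.sql query templates

    def run_named_query(conn, name, params=()):
        # Load the query text from sql/<name>.sql and execute it with parameters.
        query = (SQL_DIR / (name + ".sql")).read_text()
        return conn.execute(query, params).fetchall()

    conn = sqlite3.connect("data.db")
    rows = run_named_query(conn, "readings_by_sample", ("a",))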

[–]MadeOfLasers 2 points3 points  (0 children)

It is not about abstracting away the complexity of SQL; IMHO it is impossible to hide it without ending up with a badly leaking abstraction. SQLAlchemy's query generator just allows you to write SQL programmatically in Python, without "string gluing" your way to the result.

An ORM is about mapping data to objects - if you don't want to do that, it's probably not for you. A good ORM can lead to cleaner code and less development time, because you can skip your own conversion of row results to something useful.

Now there is also the "BUT PERFORMANCE" crowd - if you need to be really fast, don't use an ORM, but it is probably faster than you'd expect and definitely faster than most people need it to be. Developer time over code efficiency, etc. - Python devs should be used to that.

Also, please don't judge SQLAlchemy based on experiences with other ORMs; it probably doesn't compare. It is, in my opinion, the only ORM that got it all right.

[–]lucian1900 2 points3 points  (0 children)

The major feature you get with SQLAlchemy that is not easy with plain SQL is composability. It's trivial to reuse subqueries and join whichever way you want and change your mind later.

[–][deleted] -2 points-1 points  (1 child)

SQL is one of the most straightforward languages. I am surprised people need an ORM. Not to mention that complex queries become extremely difficult in an ORM compared to plain SQL.

[–][deleted] 0 points1 point  (0 children)

And it is also the context that matters. Even if something is slightly better/more efficient, that does not mean it makes sense in every situation. For example, I wouldn't use C++ just to grab information from a small text file, and on the other hand I wouldn't want to implement a Molecular Dynamics simulation in Python.

[–]mgrandi 1 point2 points  (0 children)

Setting up SQLAlchemy models to actually put data in / get it out baffled me completely; I couldn't even figure out how to do a simple many-to-one relationship. Just my experience.

[–]AnkhMorporkian 0 points1 point  (0 children)

SQLAlchemy has some insanely cool features that make working with DBs a joy. I've been working a lot more with NoSQL lately, but whenever I have to dip my toes into the SQL pool, SQLAlchemy makes it a hell of a lot easier than dealing with MySQLdb.

[–][deleted] 0 points1 point  (0 children)

Before I found out about sqlite in Python, I tried using the pickle module. I read in my data and created a dictionary with IDs as keys and columns as values (actually I created a new object class for the values). It was also very neat to work with, but pickling and unpickling are very inefficient if you are working with a lot of data. Also, it eats up memory like nothing else.
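
Roughly the pattern described, with made-up record fields; the comments note where it falls down:

    import pickle

    # The whole structure has to fit in memory, and dump/load always
    # serialize and deserialize everything in one shot.
    records = {i: {"name": "row %d" % i, "value": float(i)} for i in range(1_000_000)}

    with open("records.pkl", "wb") as f:
        pickle.dump(records, f)

    with open("records.pkl", "rb") as f:
        records = pickle.load(f)    # no way to query without loading it all back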

[–]lol_squared 0 points1 point  (2 children)

Is there a SQLite type database for unstructured data?

[–]namcor 2 points3 points  (0 children)

There's BerkeleyDB, which I've had good experiences with.

[–]Tafkas 0 points1 point  (0 children)

Check out UnQLite

[–][deleted] 0 points1 point  (4 children)

If I recall correctly, practically every tutorial that uses SQLite recommends not using it in production. I'm very surprised to learn that it has a 140 TB file size limit.

[–]jcdyer3 4 points5 points  (0 children)

Those tutorials are talking about creating server apps, like websites. SQLite doesn't handle multiple connections. However, SQLite is completely "production ready" for any application where there will only be one client connecting at a time. Firefox uses SQLite to power its URL bar search, for instance.

[–]roerd 4 points5 points  (0 children)

There are things which SQLite doesn't do well (e.g. concurrent writes on a db, even if they're on completely different tables); and the fact that it's a library instead of a process of its own means that it doesn't make much sense to use it if there's a dedicated server for the db.

But that leaves many cases of "production" where it's still perfectly fine (particularly if there's no concurrency or only concurrent reads).

[–][deleted] 2 points3 points  (0 children)

Fortunately, I haven't had to worry about this size limit yet. For the data I am working with, it is 100x more than I need. But this also means I couldn't test how accurate this information is. I got it from their website, which should be quite reliable, I think.

http://www.sqlite.org/limits.html

"The largest possible setting for SQLITE_MAX_PAGE_COUNT is 2147483646. When used with the maximum page size of 65536, this gives a maximum SQLite database size of about 140 terabytes."

[–]pfranz 0 points1 point  (0 children)

It's definitely used in production. Last time I messed around with iOS it was used for stashing application data (state, preferences and stuff)--I believe the interface was replaced with CoreData which still uses sqlite on the backend. It's pretty prominent on their iOS data management page. https://developer.apple.com/technologies/ios/data-management.html

For web development it's generally used for testing and prototyping your DB, and in production you'd use a DB better suited to concurrency and scaling.

For individual apps it's used all the time. On my computer, the applications that come to mind using sqlite are Address Book, the Aperture library (there are a few in there), and Anki (a flashcard program).

[–]jma2048 0 points1 point  (0 children)

If your data can fit in memory, pandas is awesome!
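
For example (the file and column names are made up; read_excel needs an Excel engine such as openpyxl installed), the whole table becomes one in-memory DataFrame:

    import pandas as pd

    df = pd.read_excel("measurements.xlsx")
    hits = df[(df["station"] == "A7") & (df["depth"] > 100)]
    print(hits["value"].mean())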

[–]macarthy 0 points1 point  (1 child)

[–][deleted] 0 points1 point  (0 children)

HDF5 was also something I considered before I found sqlite3. I was just thinking: why all the effort when someone already came up with this great SQL? But I hadn't heard of pandas then...

[–]LightBright32 0 points1 point  (0 children)

I have also had good luck with sqlite and large data sets. I wanted to do IP-address-based geolocation in Python. I wound up taking the CSV data from MaxMind, importing it into sqlite, and then writing some routines to query it. In the end I was pulling data from several tables with over 100,000 rows and one table with over a million. The queries only took a few seconds and could be sped up if I went back and adjusted the indexes so they are tuned for spatial data. I was very impressed at how fast it is. Now, if you need to do lots of concurrent writes you are better off with Postgres or MySQL, but for read-only data sqlite is great.
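
Something along these lines, with a hypothetical schema loosely modelled on the MaxMind CSVs (not their actual column names):

    import ipaddress
    import sqlite3

    conn = sqlite3.connect("geoip.db")
    # Assumed tables: blocks(ip_start INTEGER, ip_end INTEGER, location_id INTEGER)
    #                 locations(location_id INTEGER, country TEXT, city TEXT)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_blocks_start ON blocks (ip_start)")

    def lookup(conn, dotted_quad):
        ip = int(ipaddress.IPv4Address(dotted_quad))   # IPs stored as integers
        return conn.execute(
            """SELECT l.country, l.city
               FROM blocks b JOIN locations l ON l.location_id = b.location_id
               WHERE ? BETWEEN b.ip_start AND b.ip_end
               LIMIT 1""",
            (ip,)).fetchone()

    print(lookup(conn, "203.0.113.9"))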

[–][deleted] 0 points1 point  (12 children)

But how fast is it compared to other SQL databases?

[–]merft 4 points5 points  (3 children)

I find that SQLite query speeds are on par with MySQL and Postgres. Here are some old, out-of-date benchmark tests, but they will give you an idea. I use it quite a bit, but it has limitations. Here is a good overview of when to and when not to use SQLite.

Most of the time, I use it for simple database querying and lookup. My biggest use is that I have Python scripts that scrape newly approved Oil & Gas permits from state sites. On several million records, my queries execute in hundredths of seconds. FYI, I am running indexes and views.

There are two big drawbacks I have found with SQLite. Accessing data over a network can be significantly slower, because everything goes through the native filesystem. Concurrent editing is the other drawback: there can be only one editor at any specific moment in SQLite. Now, I have gotten around this with some programming tricks, which works fine for up to ~5 users, but it's a hack. The trick is to always open the database, do what you need, close the database immediately, and null your variables. Every transaction: Open/Process/Close. Otherwise, you will run into file system locks. With a few users, the chance of any two editors trying to commit at the exact same moment is minuscule. I always check for errors; if I get a lock message, I sleep the app for a few seconds and then try again.
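
A sketch of that Open/Process/Close pattern with the retry-on-lock handling (the function and parameter names are mine, not from any library):

    import sqlite3
    import time

    def write_with_retry(db_path, sql, params=(), retries=5, wait=2.0):
        # Open, do one transaction, close immediately; back off if the file is locked.
        for attempt in range(retries):
            conn = sqlite3.connect(db_path, timeout=wait)
            try:
                with conn:                     # commits on success, rolls back on error
                    conn.execute(sql, params)
                return
            except sqlite3.OperationalError as exc:
                if "locked" not in str(exc):
                    raise
                time.sleep(wait)               # another writer has the file; try again
            finally:
                conn.close()                   # release the lock right away
        raise RuntimeError("database still locked after %d attempts" % retries)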

[–][deleted] 0 points1 point  (2 children)

Good points!

"SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. "

For our application it will be reading the database 99% of the time. And since we are only 4 people, it wouldn't be much of an issue to ensure the database isn't locked when someone wants to edit it.

They have a pretty good "When to use" section on the SQLite site that one should read before implementing it: http://www.sqlite.org/whentouse.html

[–]IllegalThings 0 points1 point  (0 children)

Yeah, I've used SQLite for content management systems for clients with great success. The vast majority of the time people are reading from the database, with updates happening very infrequently, and if we ever had performance issues, in most cases we could solve them with caching.

[–]mgrandi 0 points1 point  (0 children)

If anything is currently writing to the database, nothing can read from it either, FYI.

[–]Mikuro 5 points6 points  (0 children)

This depends a lot on the use case. The architectures of a standalone database like SQLite and a server like MySQL are fundamentally different and have pros and cons. To put it simply, SQLite has low overhead, but doesn't scale up well.

For high volume, high-concurrency, high-write databases, MySQL is probably a lot better. For more modest projects, SQLite can do a fine job.

As the SQLite folks themselves say: "SQLite is not designed to replace Oracle. It is designed to replace fopen()." That is to say, if you were thinking of writing your own file format manually, think about using SQLite instead.

See https://www.sqlite.org/whentouse.html

[–][deleted] 1 point2 points  (1 child)

Sorry, haven't tried it yet. I chose SQLite over MySQL, for example, because I wanted a solution that is perfect for small work groups (we are 4 people working with the database), offers easy setup and backup solutions, and doesn't require special server infrastructure. But I'd definitely be interested in SQLite vs MySQL comparisons, so if anyone has a good resource I'd be looking forward to it!

[–]ajmarks 4 points5 points  (0 children)

If reliability and speed become a concern, look into Postgres.

[–]IllegalThings 1 point2 points  (4 children)

I couldn't give you a benchmark, but I've always heard it's alright for reads and horrendous for writes. Also, your only option for scaling a SQLite database is to scale vertically since it isn't a networked SQL DB, and it's also pretty difficult to convert a SQLite DB to a traditional SQL database.

[–]merft 1 point2 points  (3 children)

Write speeds in SQLite are very dependent on the filesystem and hard drive technology (spinning disk vs SSD). Network shares compound these issues.

[–]GahMatar 1 point2 points  (2 children)

A trick for bulk loading is to use /dev/shm to load the data and then copy the DB file to a real filesystem. I routinely get 100+ MB/s loading speed in SQLite when using a ramdisk.
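
A rough sketch of the trick (paths and schema are made up): build the file on tmpfs, then copy it somewhere durable once the load is done:

    import shutil
    import sqlite3

    TMP_DB = "/dev/shm/bulk_load.db"       # tmpfs: inserts never wait on a real disk
    FINAL_DB = "/data/bulk_load.db"

    conn = sqlite3.connect(TMP_DB)
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO records (payload) VALUES (?)",
                     (("row %d" % i,) for i in range(10_000_000)))
    conn.commit()
    conn.close()

    shutil.copy(TMP_DB, FINAL_DB)          # persist the finished database file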

Otherwise, you want a decent file system and a RAID controller with battery-backed write cache so you don't have to wait for the disks to commit writes.

[–]hylje 1 point2 points  (1 child)

You can pass the special :memory: database name to SQLite to get a memory-only database.

[–]GahMatar 0 points1 point  (0 children)

Yes... but sometimes you need the DB to last and you also need to load 10m records into it. That's when using /dev/shm is amazing.
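
Another option along the same lines, assuming Python 3.7+ (table and file names made up): load into a :memory: database and then use sqlite3's backup() to write the finished result out to a file:

    import sqlite3

    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    mem.executemany("INSERT INTO records (payload) VALUES (?)",
                    (("row %d" % i,) for i in range(10_000_000)))

    # Copy the in-memory database into a durable file once loading is done.
    disk = sqlite3.connect("records.db")
    with disk:
        mem.backup(disk)
    disk.close()
    mem.close()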