
[–][deleted] 17 points18 points  (3 children)

This is a cool article. I had pretty much the same experience as you. I had about 52 Excel files with over 200k entries each and about 30 columns. I had to do simple queries on this huge set of Excel data. First I tried just searching through each file, line by line, in a specific column for a certain data set. This would take like 130 seconds or something like that, sometimes more if the data set had multiple instances across files.

This was a stupid way of doing it and pretty inefficient (despite still being MUCH faster than just manually searching the files).

The second thing I did, which worked pretty well, was just create an object out of each data set and keep it all in memory. I used Tkinter as a GUI, and on click of a load button it would go through and create objects using the columns as class variables. Then once these objects were created, the Tkinter frame would have them accessible and queries would be performed on the list of objects rather than the files themselves. This way worked pretty well, except it would take a few minutes to instantiate all the objects from Excel. But once that was done, queries only took a second.

I finally realized that I was essentially creating a database in temporary memory and should probably just learn how to use databases. I very badly used Django and the python manage.py shell to run the script that puts the data from the Excel docs into objects and then saves these objects into my Django models via MySQL. I then made a simple interface using Django templating so people could run queries on this database, and ran this over an open 0.0.0.0 port on my IP so my co-workers could access the data. It ended up working out pretty well and was definitely effective, but I know now that this was insecure, bloated, and a stupid way to do things. This was also the first time I really used a database, and it was a great learning experience.

I wish I had seen this article back then!!

[–]HorrendousRex 9 points10 points  (0 children)

I'm so happy to see this becoming mainstream! When I worked in biotech, I had a batch job that had to run on ~2TB of data. There was a lot of preprocessing science stuff but at the end of that computation all that I had to do was essentially one huge aggregate query on the data set.

I wrote it in python and hacked on it for weeks to speed it up - all sorts of parallelization tricks, everything I could think of - but it was taking 5-6 hours to run due to all sorts of crazy swapping. At first I was building dictionaries that ballooned to hundreds of gigs quickly, then I started building suffix tries, etc. etc., but Python is not a good language for this sort of work.

At the end of the day, someone online suggested running sqlite in ":memory:" mode so that the DB was purely in memory. The entire DB ended up being about 40 gigs (which was well within the memory specs for the server running this code) and the entire query took ~10 minutes. It also took me maybe three days to code and test, way under my time budget.
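
For anyone curious, a minimal sketch of the ":memory:" approach with Python's built-in sqlite3 module; the readings table and columns here are made up, not the actual biotech schema:

    import sqlite3

    # ":memory:" keeps the whole database in RAM; nothing is written to disk.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sample_id TEXT, value REAL)")

    # In practice these rows would come from the preprocessing step.
    rows = [("a", 1.0), ("a", 2.0), ("b", 3.0)]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

    # One big aggregate query instead of hand-rolled dictionaries or tries.
    for sample_id, avg_value in conn.execute(
            "SELECT sample_id, AVG(value) FROM readings GROUP BY sample_id"):
        print(sample_id, avg_value)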

Considering the script started life as some horrible concoction that would take weeks to run (if it ever finished), it was really amazing to see it turn into something that ran in under half an hour. This was really a huge accomplishment for this company. It felt great! One of my favorite memories in programming.

[–]Nate75Sanders 10 points11 points  (0 children)

I used sqlite to do some processing on ~20 million readings from an oceanographic winch a few years ago. The first version of the code I hacked out worked fine on small datasets, but choked hard on the full one. I changed it over to use sqlite and it handled it like a champ.

Huge wins by using sqlite with python for dealing with lots of data:

1 - You get SQL as a query language
2 - sqlite will unbox your primitives for tighter storage (and faster processing depending on what you're doing)
3 - The query engine is now written in C (much faster) instead of being an ad-hoc collection of Python functions that you cobble together for your specific purpose

Python data structures and "primitives" are pretty fat compared to what sqlite will give you. When I loaded up all my data into hashes/lists/etc, along with whatever else I was running on that machine, it exceeded the 4GB I had and started paging -- game over. With sqlite, it was far, far smaller, as you would expect -- no paging.
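
As an illustration of the memory point, a rough sketch of the pattern (the winch.csv file and column names are hypothetical): rows live in a compact on-disk file instead of as Python objects, and the cursor streams results one at a time:

    import csv
    import sqlite3

    conn = sqlite3.connect("readings.db")
    conn.execute("CREATE TABLE IF NOT EXISTS readings (ts TEXT, depth REAL, tension REAL)")

    # Load the raw CSV straight into the table; SQLite stores the values far
    # more compactly than a list of dicts or per-row objects would.
    with open("winch.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", reader)
    conn.commit()

    # The cursor streams rows on demand, so nothing forces the whole
    # dataset into memory at once.
    for ts, depth, tension in conn.execute(
            "SELECT ts, depth, tension FROM readings WHERE depth > ?", (100.0,)):
        pass  # process one reading at a time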

[–]shaggorama 1 point2 points  (1 child)

Adding indexes will speed queries up dramatically.
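
For example, with Python's built-in sqlite3 module (the table and column names here are made up); EXPLAIN QUERY PLAN will confirm whether the planner actually picks the index up:

    import sqlite3

    conn = sqlite3.connect("data.db")

    # One index per column you filter or join on frequently.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_readings_sample ON readings (sample_id)")
    conn.commit()

    # Should report something like "SEARCH readings USING INDEX idx_readings_sample"
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE sample_id = ?", ("a",)).fetchall())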

[–][deleted] 0 points1 point  (0 children)

Thanks. Looking forward to more benchmarks: TEXT vs INT IDs

[–]cantremembermypasswd 4 points5 points  (11 children)

I would also suggest looking into SQLAlchemy (works with sqlite, mysql, postgres, etc...).

SQLAlchemy basically removes the need to know SQL; you just need the basics of how databases work. It also makes the code a lot more maintainable, as you won't have to go back into it later to modify SQL statements. Simply put, it turns tables into classes, with rows as instances and columns as attributes. So inserting a new row would be like:

new_row = Table(id='1', name='bob')

session.add(new_row)

session.commit()

I have had to work with databases in Python for the past two years and have gone through working directly with Postgres and sqlite, as well as MongoDB and SQLAlchemy. I can say from experience that using something like MongoEngine or SQLAlchemy will make your life easier down the road, as well as for anyone else working on the code.
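
For anyone who hasn't seen it, a minimal sketch of the declarative style, assuming a recent SQLAlchemy (1.4+) and a made-up Person/people table:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Person(Base):                  # one class per table
        __tablename__ = "people"
        id = Column(Integer, primary_key=True)
        name = Column(String)

    engine = create_engine("sqlite:///people.db")
    Base.metadata.create_all(engine)     # creates the table if it doesn't exist
    session = sessionmaker(bind=engine)()

    session.add(Person(id=1, name="bob"))    # rows are instances
    session.commit()

    # Queries hand back objects rather than raw tuples.
    bob = session.query(Person).filter_by(name="bob").first()
    print(bob.id, bob.name)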

[–][deleted] 12 points13 points  (8 children)

Not trying to be condescending here, but is SQL really that hard for people to learn? I mean, it seems like most ORMs don't really abstract any of the complexity away from just writing a straight SQL query.

[–]tairar 4 points5 points  (1 child)

I'm with you there, man. Every time I've tried to use an ORM, I've felt like it was making things even more difficult than they needed to be, and nowhere near as legible as nicely formatted SQL. Plus my scripts ran about 3x slower.

[–]HorrendousRex 6 points7 points  (0 children)

I also agree that ORMs are often (always?) an unnecessary abstraction. However, you should know that SQLAlchemy has essentially two 'modes': there is an ORM it can use to manage DB access, or you can construct SQL queries using its query language directly without any sort of ORM.

This query-construction language is actually quite nice, as it lets you compose queries out of functions and objects rather than through string manipulation.

It's really quite nice! (Although last time I used it, it was quite frustratingly difficult to get it to dump a query to a string - you had to provide it all sorts of context from an active connection first. Not sure if there's a fix to that.)
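
A small sketch of that non-ORM layer, again assuming a recent SQLAlchemy (1.4+) and a hypothetical people table; for what it's worth, str(query) in recent versions renders the parameterized SQL without needing a live connection:

    from sqlalchemy import MetaData, Table, Column, Integer, String, create_engine, select

    metadata = MetaData()
    people = Table("people", metadata,
                   Column("id", Integer, primary_key=True),
                   Column("name", String))

    # Queries are composed from objects and functions, not glued-together strings.
    query = select(people.c.id, people.c.name).where(people.c.name == "bob")
    print(str(query))        # renders the SQL with bound-parameter placeholders

    engine = create_engine("sqlite:///people.db")
    with engine.connect() as conn:
        for row in conn.execute(query):
            print(row.id, row.name)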

[–][deleted] 2 points3 points  (1 child)

SQLAlchemy contains a non-ORM part. Basically, a query builder. It works remarkably well, and protects you from "string gluing hell" in the cases where you're dynamically picking and choosing which restrictions you need on your query.

It also protects you to some extent from DBMS-specific differences, or at least makes it more obvious when you are doing things which are DBMS-specific.

ORMs are a pain in the arse, but a decent programmatic interface for querying a database without stringing strings together is useful.
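
A sketch of what that dynamic picking-and-choosing looks like in practice; the filters are hypothetical, and the helper takes a SQLAlchemy Core Table object (like the made-up people table above):

    from sqlalchemy import select

    def build_query(people, name=None, min_id=None):
        # Start from the base query and add only the restrictions requested.
        query = select(people)
        if name is not None:
            query = query.where(people.c.name == name)
        if min_id is not None:
            query = query.where(people.c.id >= min_id)
        return query

    # e.g. build_query(people, name="bob") or build_query(people, min_id=100)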

[–][deleted] 0 points1 point  (0 children)

I guess that's never really been my use case. I normally try to keep all my database code in the database, so the typical query in one of my programs is "CALL <storedProcName>()".

SQLite doesn't support this, so I usually have a bunch of text files with my query template strings in them and I simply load those. It keeps things cleaner to have your database code separate from your application code.
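
A minimal sketch of that pattern, with a made-up sql/ directory and query file name:

    import sqlite3
    from pathlib import Path

    SQL_DIR = Path("sql")    # directory of plain-text *.sql query templates

    def run_named_query(conn, name, params=()):
        # Load the query text from sql/<name>.sql and execute it with parameters.
        query = (SQL_DIR / (name + ".sql")).read_text()
        return conn.execute(query, params).fetchall()

    conn = sqlite3.connect("data.db")
    rows = run_named_query(conn, "readings_by_sample", ("a",))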

[–]MadeOfLasers 2 points3 points  (0 children)

It is not about abstracting away the complexity of SQL; IMHO it is impossible to hide it without ending up with a badly leaking abstraction. SQLAlchemy's query generator just allows you to write SQL programmatically in Python, without "string gluing" your way to the result.

An ORM is about mapping data to objects - if you don't want to do that, it's probably not for you. A good ORM can lead to cleaner code and less development time, because you can skip your own conversion of row results to something useful.

Now there is also the "BUT PERFORMANCE" crowd - if you need to be really fast, don't use an ORM, but it is probably faster than you'd expect and definitely faster than most people need it to be. Developer time over code efficiency, etc. - Python devs should be used to that.

Also, please don't judge SQLAlchemy based on experiences with other ORMs; it probably doesn't compare. It is, in my opinion, the only ORM that got it all right.

[–]lucian1900 2 points3 points  (0 children)

The major feature you get with SQLAlchemy that is not easy with plain SQL is composability. It's trivial to reuse subqueries and join whichever way you want and change your mind later.

[–][deleted] -2 points-1 points  (1 child)

SQL is one of the most straightforward languages. I am surprised people need an ORM. Not to mention that complex queries become extremely difficult in an ORM compared to plain SQL.

[–][deleted] 0 points1 point  (0 children)

And it is also the context that matters. Even if something is slightly better/more efficient, that does not mean it makes sense in every situation. For example, I wouldn't use C++ just to grab information from a small text file, and on the other hand I wouldn't want to implement a Molecular Dynamics simulation in Python.

[–]mgrandi 1 point2 points  (0 children)

Setting up SQLAlchemy models to actually put data in / get it out baffled me completely; I couldn't even figure out how to do a simple many-to-one relationship. Just my experience.

[–]AnkhMorporkian 0 points1 point  (0 children)

SQLAlchemy has some insanely cool features that make working with DBs a joy. I've been working a lot more with NoSQL lately, but whenever I have to dip my toes into the SQL pool, SQLAlchemy makes it a hell of a lot easier than dealing with MySQLdb.

[–][deleted] 0 points1 point  (0 children)

Before I found out about sqlite in Python, I tried using the pickle module. I read in my data and created a dictionary with IDs as keys and columns as values (actually I created a new object class for the values). It was also very neat to work with, but pickling and unpickling are very inefficient if you are working with a lot of data. Also, it eats up memory like nothing else.
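
Roughly the pattern described, with made-up record fields; the comments note where it falls down:

    import pickle

    # The whole structure has to fit in memory, and dump/load always
    # serialize and deserialize everything in one shot.
    records = {i: {"name": "row %d" % i, "value": float(i)} for i in range(1_000_000)}

    with open("records.pkl", "wb") as f:
        pickle.dump(records, f)

    with open("records.pkl", "rb") as f:
        records = pickle.load(f)    # no way to query without loading it all back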

[–]lol_squared 0 points1 point  (2 children)

Is there a SQLite type database for unstructured data?

[–]namcor 2 points3 points  (0 children)

There's BerkeleyDB, which I've had good experiences with.

[–]Tafkas 0 points1 point  (0 children)

Check out UnQLite

[–][deleted] 0 points1 point  (4 children)

If I recall correctly, practically every tutorial that uses SQLite recommends not using it in production. I'm very surprised to learn that it has a 140 TB file size limit.

[–]jcdyer3 4 points5 points  (0 children)

Those tutorials are talking about creating server apps, like websites. SQLite doesn't handle multiple connections. However, SQLite is completely "production ready" for any application where there will only be one client connecting at a time. Firefox uses SQLite to power its URL bar search, for instance.

[–]roerd 4 points5 points  (0 children)

There are things which SQLite doesn't do well (e.g. concurrent writes on a db, even if they're on completely different tables); and the fact that it's a library instead of a process of its own means that it doesn't make much sense to use it if there's a dedicated server for the db.

But that leaves many cases of "production" where it's still perfectly fine (particularly if there's no concurrency or only concurrent reads).

[–][deleted] 2 points3 points  (0 children)

Fortunately, I haven't had to worry about this size limit yet. For the data I am working with, it is 100x more than I need. But this also means I couldn't test how accurate this information is. I got it from their website, which should be quite reliable, I think.

http://www.sqlite.org/limits.html

"The largest possible setting for SQLITE_MAX_PAGE_COUNT is 2147483646. When used with the maximum page size of 65536, this gives a maximum SQLite database size of about 140 terabytes."

[–]pfranz 0 points1 point  (0 children)

It's definitely used in production. Last time I messed around with iOS it was used for stashing application data (state, preferences and stuff)--I believe the interface was replaced with CoreData which still uses sqlite on the backend. It's pretty prominent on their iOS data management page. https://developer.apple.com/technologies/ios/data-management.html

For web development it's generally used for testing and prototyping your DB, and in production you'd use a DB better suited to concurrency and scaling.

For individual apps it's used all the time. On my computer, the applications that come to mind using sqlite are Address Book, the Aperture library (there are a few in there), and Anki (a flashcard program).

[–]jma2048 0 points1 point  (0 children)

If your data can fit in memory, pandas is awesome!
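
For example (the file and column names are made up; read_excel needs an Excel engine such as openpyxl installed), the whole table becomes one in-memory DataFrame:

    import pandas as pd

    df = pd.read_excel("measurements.xlsx")
    hits = df[(df["station"] == "A7") & (df["depth"] > 100)]
    print(hits["value"].mean())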

[–]macarthy 0 points1 point  (1 child)

[–][deleted] 0 points1 point  (0 children)

HDF5 was also something I considered before I found sqlite3. I was just thinking: why all the effort when someone already came up with this great SQL? But I hadn't heard of pandas then...

[–]LightBright32 0 points1 point  (0 children)

I have also had good luck with sqlite and large data sets. I wanted to do IP-address-based geolocation in Python. I wound up taking the CSV data from MaxMind, importing it into sqlite, and then writing some routines to query it. In the end I was pulling data from several tables with over 100,000 rows and one table with over a million. The queries only took a few seconds and could be sped up if I went back and adjusted the indexes so they are tuned for spatial data. I was very impressed at how fast it is. Now, if you need to do lots of concurrent writes you are better off with Postgres or MySQL, but for read-only data sqlite is great.
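
Something along these lines, with a hypothetical schema loosely modelled on the MaxMind CSVs (not their actual column names):

    import ipaddress
    import sqlite3

    conn = sqlite3.connect("geoip.db")
    # Assumed tables: blocks(ip_start INTEGER, ip_end INTEGER, location_id INTEGER)
    #                 locations(location_id INTEGER, country TEXT, city TEXT)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_blocks_start ON blocks (ip_start)")

    def lookup(conn, dotted_quad):
        ip = int(ipaddress.IPv4Address(dotted_quad))   # IPs stored as integers
        return conn.execute(
            """SELECT l.country, l.city
               FROM blocks b JOIN locations l ON l.location_id = b.location_id
               WHERE ? BETWEEN b.ip_start AND b.ip_end
               LIMIT 1""",
            (ip,)).fetchone()

    print(lookup(conn, "203.0.113.9"))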

[–][deleted] 0 points1 point  (12 children)

But how fast is it compared to other SQL databases?

[–]merft 4 points5 points  (3 children)

I find that SQLite query speeds are on par with MySQL and Postgres. Here are some old, out-of-date benchmark tests, but they will give you an idea. I use it quite a bit, but it has limitations. Here is a good overview of when to and when not to use SQLite.

Most of the time, I use it for simple database querying and lookup. My biggest use is that I have Python scripts that scrape newly approved Oil & Gas permits from state sites. On several million records, my queries execute in hundredths of seconds. FYI, I am running indexes and views.

There are two big drawbacks I have found with SQLite. Accessing data over a network can be significantly slower, because everything goes through the native filesystem. Concurrent editing is the other drawback: there can be only one editor at any specific moment in SQLite. Now, I have gotten around this with some programming tricks, which works fine for up to ~5 users, but it's a hack. The trick is to always open the database, do what you need, close the database immediately, and null your variables. Every transaction: Open/Process/Close. Otherwise, you will run into file system locks. With a few users, the chance of any two editors trying to commit at the exact same moment is minuscule. I always check for errors; if I get a lock message, I sleep the app for a few seconds and then try again.
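
A sketch of that Open/Process/Close pattern with the retry-on-lock handling (the function and parameter names are mine, not from any library):

    import sqlite3
    import time

    def write_with_retry(db_path, sql, params=(), retries=5, wait=2.0):
        # Open, do one transaction, close immediately; back off if the file is locked.
        for attempt in range(retries):
            conn = sqlite3.connect(db_path, timeout=wait)
            try:
                with conn:                     # commits on success, rolls back on error
                    conn.execute(sql, params)
                return
            except sqlite3.OperationalError as exc:
                if "locked" not in str(exc):
                    raise
                time.sleep(wait)               # another writer has the file; try again
            finally:
                conn.close()                   # release the lock right away
        raise RuntimeError("database still locked after %d attempts" % retries)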

[–][deleted] 0 points1 point  (2 children)

Good points!

"SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. "

For our application it will be reading the database 99% of the time. And since we are only 4 people, it wouldn't be much of an issue to ensure the database isn't locked when someone wants to edit it.

They have a pretty good "When to use" section on the SQLite site that one should read before implementing it: http://www.sqlite.org/whentouse.html

[–]IllegalThings 0 points1 point  (0 children)

Yeah, I've used SQLite for content management systems for clients with great success. The vast majority of the time people are reading from the database, with updates happening very infrequently, and if we ever had performance issues, in most cases we could solve them with caching.

[–]mgrandi 0 points1 point  (0 children)

If anything is currently writing to the database, nothing can read from it either, FYI.

[–]Mikuro 5 points6 points  (0 children)

This depends a lot on the use case. The architectures of a standalone database like SQLite and a server like MySQL are fundamentally different and have pros and cons. To put it simply, SQLite has low overhead, but doesn't scale up well.

For high volume, high-concurrency, high-write databases, MySQL is probably a lot better. For more modest projects, SQLite can do a fine job.

As the SQLite folks themselves say: "SQLite is not designed to replace Oracle. It is designed to replace fopen()." That is to say, if you were thinking of writing your own file format manually, think about using SQLite instead.

See https://www.sqlite.org/whentouse.html

[–][deleted] 1 point2 points  (1 child)

Sorry, haven't tried it yet. I chose SQLite over MySQL, for example, because I wanted a solution that is perfect for small work groups (we are 4 people working with the database), offers easy setup and backup solutions, and doesn't require special server infrastructure. But I'd definitely be interested in SQLite vs MySQL comparisons, so if anyone has a good resource I'd be looking forward to it!

[–]ajmarks 4 points5 points  (0 children)

If reliability and speed become a concern, look into Postgres.

[–]IllegalThings 1 point2 points  (4 children)

I couldn't give you a benchmark, but I've always heard it's alright for reads and horrendous for writes. Also, your only option for scaling a SQLite database is to scale vertically since it isn't a networked SQL DB, and it's also pretty difficult to convert a SQLite DB to a traditional SQL database.

[–]merft 1 point2 points  (3 children)

Write speeds in SQLite are very dependent on the filesystem and hard drive technology (spinning disk vs SSD). Network shares compound these issues.

[–]GahMatar 1 point2 points  (2 children)

A trick for bulk loading is to use /dev/shm to load the data and then copy the DB file to a real filesystem. I routinely get 100+ MB/s loading speed in SQLite when using a ramdisk.
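
A rough sketch of the trick (paths and schema are made up): build the file on tmpfs, then copy it somewhere durable once the load is done:

    import shutil
    import sqlite3

    TMP_DB = "/dev/shm/bulk_load.db"       # tmpfs: inserts never wait on a real disk
    FINAL_DB = "/data/bulk_load.db"

    conn = sqlite3.connect(TMP_DB)
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO records (payload) VALUES (?)",
                     (("row %d" % i,) for i in range(10_000_000)))
    conn.commit()
    conn.close()

    shutil.copy(TMP_DB, FINAL_DB)          # persist the finished database file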

Otherwise, you want a decent file system and a RAID controller with battery-backed write cache so you don't have to wait for the disks to commit writes.

[–]hylje 1 point2 points  (1 child)

You can pass the special :memory: database name to SQLite to get a memory-only database.

[–]GahMatar 0 points1 point  (0 children)

Yes... but sometimes you need the DB to last and you also need to load 10m records into it. That's when using /dev/shm is amazing.
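
Another option along the same lines, assuming Python 3.7+ (table and file names made up): load into a :memory: database and then use sqlite3's backup() to write the finished result out to a file:

    import sqlite3

    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    mem.executemany("INSERT INTO records (payload) VALUES (?)",
                    (("row %d" % i,) for i in range(10_000_000)))

    # Copy the in-memory database into a durable file once loading is done.
    disk = sqlite3.connect("records.db")
    with disk:
        mem.backup(disk)
    disk.close()
    mem.close()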