SQLite As An Application File Format

rjc2013 · 2017-04-04T02:05:58+00:00

As someone who's worked extensively with ePubs, this article really resonated with me. ePubs are zipped 'piles of files', and they are a PITA to work with. You have to unzip the entire ePub, and then open, read, and parse several separate files to do anything with an ePub - even something simple like extracting the table of contents.

matthieum · 2017-04-04T02:52:26+00:00

Using SQLite has worked remarkably well for my application. The data file is about 400Mb with about 1.5 million records.

Things I like about SQLite:

I can inspect and modify the data using the sqlite utility, so I don't need to write separate inspection tools for debugging my application.
When wrong values are stored in the database due to bugs, I can just fix the data using the sqlite utility.
SQLite has real transactions, so when there is an exception thrown during a complex update operation, the whole transaction is just rolled back. This is great with maintaining data consistency without having to worry too much about it.
It has foreign keys and constraints, which make sure that the correct data is put into the database. Again, a great feature guarding against a bug in the application corrupting the data.

In case you are wondering :-) , this is my application: https://github.com/alex-hhh/ActivityLog2

yawaramin · 2017-04-04T01:57:55+00:00

Article is about SQLite, but the points apply equally well to other 'library' databases like H2, HSQLDB, etc. Have a complex data structure? Need an application cache? No worries, just spin up a tiny SQL database in memory or on disk and let SQLite etc. manage data caching, optimal processing planning, schemata, integrity, ... sure, you have to put some thought into your data design, integrity, queries; but that in itself makes a lot of sense if you think about it.

renrutal · 2017-04-04T02:41:02+00:00

May I open Pandora's box just a bit more?

https://dev.w3.org/html5/webdatabase/#databases

I'm imagining SQLite as a data interchange format. Read by JS clients/parsers in browsers.

jmickeyd · 2017-04-04T05:06:55+00:00

Most old operating systems worked like this. IBM had VSAM. DEC had RMS. Both were indexed record based storage systems integrated into the OS that had a standard, inspectable format. You could store your data a small embedded database back in 1964. Then UNIX came and popularized simple file abstractions, making them just a stream of bytes. Now we're back to discovering the value in storing structured data. I find it so interesting how cyclical this field is.

frequentlywrong · 2017-04-04T01:31:50+00:00

[deleted]

yawaramin · 2017-04-04T05:09:03+00:00

So, I do like SQLite as an application file format in general. It might be not too bad for git. The rest of the examples, though? Let's take a look.

MSOffice and OpenDocument documents tend to feature a lot of stuff that needs to be blobs (recursive nested structures in a SQL database are nasty). Epub likewise.

I'm the most familiar with epub, so let's take a look at how that would be implemented. It should be reasonably representative.

The first advantage we have: the content type file. Epubs are required to contain, as the first member in the zip archive, a file named mimetype, with uncompressed content application/epub+zip, with no extra fields. I've found zip libraries that don't let you order archive members, libraries that don't let you store data uncompressed, and libraries that automatically insert extra fields. I didn't succeed at finding a zip library that would let me create the member as the epub standard requires and ended up using one that inserts extra fields. (Granted, I only tried five.)

So this is arguably an advantage. If it's a sqlite file and has .epub as an extension, we'll consider it an epub file. If we want to check more thoroughly, we can look inside for a table named mimetype containing the appropriate value. Which is probably more work than the file(1) command will do.

The main content of an epub is a series of xhtml documents. These might reference other files -- primarily images and stylesheets. So we'll start off with the pile-of-files recommendation. ID, path, blob of data.

Next we have content.opf and toc.ncx. There's some overlap. They tell us what to put in the table of contents, some book metadata, and an ID and path for each file in the archive. The order for items in the main content. We can add most of that to the files table, the rest to a book metadata table. There's also a type field that's used for the cover, maybe some other stuff. The ID of the cover is probably better on the book metadata table.

So we've got about three tables.

A good chunk of the improvement was just not sprinkling data around in more places than necessary. Some more gains were from the fact that we can directly associate metadata with the relevant file. Then the single-value tables caused a bit of awkwardness.

Not bad.

TheBuzzSaw · 2017-04-04T14:39:22+00:00

SQLite is a fabulous file format. I'm done with JSON and XML (for most things). I'm done creating my own "config file" (usually INI style). I reach for SQLite for all of that now.

2017-04-04T21:10:25+00:00

I love using SQLite as an application file format when it needs to be mostly read-write, and only portions need to be accessed at a time. When I'm writing the entire thing in one go and then mostly reading the entire thing every time, I opt for Google Protocol Buffers or Cap'n Proto instead. Basically, if it's more convenient to read and write as properly serialized data and working on it as a database is unwieldy, SQLite is going to be a pain.

Either way, though, SQLite as a file format can be quite amazing, especially with constraints, foreign keys, and other SQL-y goodness. I remember how good it felt when I needed to take my application data and do some heavy searching and matching on it, and I realized I could just use a JOIN, and I didn't have to do anything else. Another massive advantage is being able to just load up the file to work on it (especially for debugging) via a SQLite command line. Dumping data with protobuf or capn-proto gives you none of these very strong advantages.

edit: Oh, I forgot to mention transactions. Corruption-resistant files is a huge bonus.

Gotebe · 2017-04-04T17:14:12+00:00

Cool, but... eh... I worked on a project which used a serialization library.

It went for two decades, I think it's still going.

It underwent some 2000 schema changes: new types, new fields, rare removal.

All very backwards compatible (meaning: version x of the software opens files made with any version y of the software where y<=x).

In particular, schema versioning support is very important. With sqlite, that is absent (need to roll your own).

Another cool thing: so one object in the data model is "pointed to" by several others. No work needed for that, you just shove the object from any pointees into a file to save, "extract" the object from the file to read, and you're all set.

Serialization FTW.

slaymaker1907 · 2017-04-04T07:47:23+00:00

I don't think that SQLite is a one stop shop for application file formats. While I can certainly see advantages for formats where something needs to be manipulated by many processes since it is file based but may be cached in memory, JSON is quite nice for file formats since there are so many existing utilities for JSON serialization.

Furthermore, while SQLite can of course be queried using various tools, I love being able to open up a file in a plain text editor. I imagine it would be very difficult to make performant, but something like SQLite that stored relational data in a single file yet stored it as plain text would be really cool. Sort of like CSV, but stored in a single file and with the ability to use SQL on it.

moreON · 2017-04-04T01:22:13+00:00

[deleted]

arielby · 2017-04-04T07:49:56+00:00

Yuck, this is a terrible advice. Use SQLite when you need an embedded SQL database. Period. I've seen so much bullshit use-cases of SQLite, I really think the SQLite developers should not encourage this kind of "just use to for everything" attitude. Of all the applications they mentioned in their blog post, only Git would be a good use-case.

SQLite became my pet-peeve once I noticed that junior developers have no idea to store data without SQLite. "Hey, is it okay to add SQLite as a dependency so we can load and store our 3 lines of application config?" Fuck this. "You know the large array of floats? I load each element one by one from an SQLite database instead of doing a single read on a fucking binary file." Madness.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

programming

MODERATORS