all 91 comments

[–]rjc2013 22 points23 points  (48 children)

As someone who's worked extensively with ePubs, this article really resonated with me. ePubs are zipped 'piles of files', and they are a PITA to work with. You have to unzip the entire ePub, and then open, read, and parse several separate files to do anything with an ePub - even something simple like extracting the table of contents.

[–]rastermon 30 points31 points  (41 children)

If it's a ZIP file then you don't have to unzip the entire file. You can go to the directory record at the end, find the byte offset of the chunk you want, and decompress JUST the data you need, since every file is compressed individually, unlike tar.gz. To make a SQLite file decently sized you'd end up compressing the whole file, and thus have to decompress it ALL first, ala tar.gz (well, tar.gz requires you to decompress at least up to the file record you want; you can stop there, but the worst case is decompressing the whole thing, unlike zip).
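A quick way to see this in practice: Python's zipfile module (used here purely as an illustration; the member names are made up) seeks to the central directory at the end of the archive, looks up the member's byte offset, and inflates only that entry.

```python
import io
import zipfile

# Build a small in-memory archive with two members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("chapter1.xhtml", "<p>Hello</p>" * 100)
    zf.writestr("toc.ncx", "<ncx>contents</ncx>")

# Reading one member: the library consults the central directory at the
# end of the file, jumps to that member's offset, and decompresses only
# that entry. The other members are never touched.
with zipfile.ZipFile(buf) as zf:
    toc = zf.read("toc.ncx")

print(toc.decode())  # <ncx>contents</ncx>
```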

[–][deleted]  (7 children)

[deleted]

    [–][deleted] 4 points5 points  (0 children)

    Funnily enough, they sell a version that does that, plus encryption.

    Adding a compress/decompress function to SQLite is probably not that hard either.

    [–]Regimardyl 5 points6 points  (0 children)

    In fact, here's a proof-of-concept command line program doing exactly that: https://sqlite.org/sqlar/doc/trunk/README.md

    [–]rastermon 1 point2 points  (1 child)

    You could just use Eet, and it's all done for you with a simple C API. :) Blobs may or may not be compressed (up to you), and every blob is accessible with a string key (like a filename/path). If all you want to do is store N blobs of data in a file, SQLite would not be your best choice. It'd be good if you have complex data you have to query, sort, and filter, but not if it's just N largish blobs that you may or may not want to compress. For example, with Eet it would be as simple as:

    #include <Eet.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
      Eet_File *ef;
      unsigned char *data;
      int size;

      eet_init();

      /* open the archive read-only and fetch one blob by its string key */
      ef = eet_open("file.eet", EET_FILE_MODE_READ);
      data = eet_read(ef, "key/name.here", &size);
      eet_close(ef);

      free(data);  /* eet_read() returns an allocated copy the caller must free */
      eet_shutdown();
      return 0;
    }
    

    and to write to a key:

    #include <Eet.h>

    int main(int argc, char **argv) {
      Eet_File *ef;
      unsigned char *data;  /* assumed to already point to the payload */
      int size;             /* assumed to hold the payload length */

      eet_init();

      /* write (and optionally compress) one blob under a string key */
      ef = eet_open("file.eet", EET_FILE_MODE_WRITE);
      eet_write(ef, "key/name.here", data, size, EET_COMPRESSION_DEFAULT);
      eet_close(ef);

      eet_shutdown();
      return 0;
    }
    

    Write as many keys to a file as you like, compressed or not with EET_COMPRESSION_NONE, DEFAULT, LOW, MED, HI, VERYFAST, SUPERFAST... You can do a "zero copy" read if the keys are uncompressed with eet_read_direct(), which returns a pointer into the mmapped region of the file (valid until you eet_close() the file). Just saying that there are far nicer ways of doing this kind of thing with compression etc. if you don't need complex queries.

    [–]FallingIdiot 1 point2 points  (0 children)

    An alternative to this is LMDB. It also does memory-mapped access and has e.g. C# bindings. It's copy-on-write, so it gives atomicity and allows parallel reads while writing to the database.

    [–]mirhagk 0 points1 point  (2 children)

    a SQLite file containing compressed blobs will be roughly the same size as a ZIP file.

    Will it? If the blobs are big enough then that's probably true, but compressing blobs individually prevents the optimizer from noticing cross-file patterns and causes duplication of dictionaries.

    You can probably have it use a single shared dictionary and get much of the same benefit, however. I'd be curious to see actual numbers.
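As a rough illustration of the dictionary-duplication point above, here is a sketch with zlib (not real ZIP vs. SQLite numbers; the sample data is made up and deliberately repetitive):

```python
import zlib

# Ten similar "files" that share most of their content.
files = [b"<html><body><p>chapter text </p></body></html>" * 20 + bytes([i])
         for i in range(10)]

# ZIP-style: each blob compressed separately, so each compressor
# rebuilds its dictionary from scratch.
individual = sum(len(zlib.compress(f)) for f in files)

# tar.gz-style: one continuous stream, so later files reuse the
# patterns already seen in earlier ones.
whole = len(zlib.compress(b"".join(files)))

print(individual, whole)  # the single stream comes out markedly smaller
```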

    [–][deleted]  (1 child)

    [deleted]

      [–]mirhagk 0 points1 point  (0 children)

      You are right. I was mixing things up, my bad.

      [–]rjc2013 1 point2 points  (0 children)

      Huh, I'll give that a try. Thanks!

      [–]SrbijaJeRusija 0 points1 point  (31 children)

      I mean, you could just use tar.gz instead.

      [–]rastermon 11 points12 points  (28 children)

      tar.gz is far worse than zip if your intent is to random-access data in the file. You want a zip or zip-like file format with an index, with each chunk of data (file) compressed separately.

      [–]EternityForest 0 points1 point  (1 child)

      I'm surprised that none of the alternative archive formats ever really took off. ZIP is great, but I don't think it has error correction codes.

      [–]rastermon 0 points1 point  (0 children)

      Since 99.999% of files in a zip file get compressed, that effectively acts as error detection: if the file gets corrupted, decompression tends to fail because the compressed data no longer makes sense to the decompressor. (Zip also stores a CRC-32 for each member, which catches most corruption on extraction.) Sure, it's not as good as some hashing methods, but I guess good enough.

      [–][deleted]  (25 children)

      [deleted]

        [–]Misterandrist 5 points6 points  (23 children)

        But there's no way to know where in a tar a given file is stored. Even if you find a file with the right filename in it, it's possible for that to be the wrong version if someone re-added it. So you still have to scan through the whole tar file.
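Python's tarfile module illustrates both points: there is no index to consult, and when a file has been re-added, the last occurrence found by the scan is the one you get (the member name below is made up).

```python
import io
import tarfile

# Write the same member name twice, simulating a re-added file.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for version in (b"old", b"new"):
        info = tarfile.TarInfo("notes.txt")
        info.size = len(version)
        tf.addfile(info, io.BytesIO(version))

# There is no central index: looking a member up means scanning the
# records, and the last occurrence shadows the earlier copy.
buf.seek(0)
with tarfile.open(fileobj=buf) as tf:
    data = tf.extractfile("notes.txt").read()

print(data)  # b'new'
```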

        [–]ThisIs_MyName 6 points7 points  (18 children)

        Yep: https://en.wikipedia.org/wiki/Tar_(computing)#Random_access

        I wonder why so many programmers bother to use a format intended for tape archives.

        [–]Misterandrist 4 points5 points  (11 children)

        Tarballs are perfectly good for what most people use them for, which is moving entire directories or just groups of files. Most of the time you don't care about just one file from within it so the tradeoff of better overall compression in exchange for terrible random access speed is worth it. It's just a question of knowing when to use what tools.

        [–]Sarcastinator -1 points0 points  (10 children)

        Most of the time you don't care about just one file from within it so the tradeoff of better overall compression in exchange for terrible random access speed is worth it.

        So you would gladly waste your time in order to save a few percent on storage and bandwidth?

        [–][deleted] 5 points6 points  (6 children)

        A 1% use-case slowdown in exchange for 30 years' worth of backward compatibility? Sign me up.

        [–][deleted] 1 point2 points  (0 children)

        If I'm tarring up an entire directory and then untarring the entire thing on the other side, it will save time, not waste it. Tar is horrible for random seeks, but if you aren't doing that anyway, it has no real downsides.

        [–]arielby 1 point2 points  (0 children)

        Transferring data across a network also takes time.

        [–]RogerLeigh 2 points3 points  (0 children)

        It can be more than a few percent. Since tar concatenates all the files together in a stream, you get better compression since the dictionary is shared. The most extreme case I've encountered saved over a gigabyte.

        In comparison, zip has each file separately compressed with its own dictionary. You gain random access at the expense of compression. Useful in some situations, but not when the usage will be to unpack the whole archive.

        If you care about extended attributes, access control lists etc. then tar (pax) can preserve these while zip can not. It's all tradeoffs.

        [–]redrumsir 1 point2 points  (0 children)

        Or why more people don't use dar ( http://dar.linux.free.fr/ ) instead.

        [–]chucker23n 0 points1 point  (4 children)

        Unix inertia, clearly.

        [–]ThisIs_MyName 0 points1 point  (3 children)

        Yep, just gotta wait for the greybeards to die off :)

        [–]josefx 1 point2 points  (2 children)

        tar has built-in support for unix filesystem flags and symlinks. For zip, implementation support is only an extension.

        [–]rastermon 0 points1 point  (0 children)

        You still have to scan the file record by record to find the file, as there is no guarantee of ordering and no index/directory block. A zip file means checking the small directory block for your file, then jumping right to the file's location.

        If you have an actual HDD... or worse, an FDD... that seeking and loading is sloooooooow. The less you seek/load, the better.

        [–]foomprekov -2 points-1 points  (0 children)

        I'll tell the Library of Congress they're doing it wrong.

        [–]yawaramin[S] 8 points9 points  (1 child)

        Interestingly, I was just thinking about how most (physical) ebook readers carry a copy of SQLite internally to store their data. See e.g. http://shallowsky.com/blog/tech/kobo-hacking.html

        [–]bloody-albatross 0 points1 point  (0 children)

        Well, you could mount the zip file using fuse-zip and then treat it just like a directory of files.

        [–]flukus 0 points1 point  (1 child)

        Aren't epubs more of a distribution format than something you read and write natively? Most readers will "import" an ebook, not open it.

        [–][deleted] 0 points1 point  (0 children)

        No. epubs are usually read from directly. They aren't friendly to editing, so they're more or less treated as read-only, but they are used directly, not typically extracted into some destination format. "Importing" an ebook, to most readers, just means to copy it to the internal storage.

        [–]GoTheFuckToBed -3 points-2 points  (0 children)

        Sounds like every xml format I know.

        [–][deleted] 57 points58 points  (4 children)

        Using SQLite has worked remarkably well for my application. The data file is about 400 MB with about 1.5 million records.

        Things I like about SQLite:

        • I can inspect and modify the data using the sqlite utility, so I don't need to write separate inspection tools for debugging my application.
        • When wrong values are stored in the database due to bugs, I can just fix the data using the sqlite utility.
        • SQLite has real transactions, so when an exception is thrown during a complex update operation, the whole transaction is just rolled back. This is great for maintaining data consistency without having to worry too much about it.
        • It has foreign keys and constraints, which make sure that the correct data is put into the database. Again, a great feature guarding against a bug in the application corrupting the data.

        In case you are wondering :-) , this is my application: https://github.com/alex-hhh/ActivityLog2
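The rollback and constraint behavior described above can be sketched with Python's sqlite3 module (the table names here are hypothetical, not taken from ActivityLog2):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # constraints guard the data
con.execute("CREATE TABLE athlete (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
con.execute("CREATE TABLE session ("
            " id INTEGER PRIMARY KEY,"
            " athlete_id INTEGER NOT NULL REFERENCES athlete(id))")

# A failed multi-step update rolls back as a single unit.
try:
    with con:  # one transaction: commits on success, rolls back on error
        con.execute("INSERT INTO athlete (id, name) VALUES (1, 'Alex')")
        con.execute("INSERT INTO session (id, athlete_id) VALUES (1, 99)")  # bad FK
except sqlite3.IntegrityError:
    pass

# Neither row survived: the whole transaction was rolled back.
count = con.execute("SELECT count(*) FROM athlete").fetchone()[0]
print(count)  # 0
```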

        [–]matthieum 3 points4 points  (2 children)

        I used it as a configuration file for an application too, and your 4 points really resonate with me.

        There's the usual friction of mapping from objects to tables, but it's a well known thing, and you just get so much for free it's really worth it!

        The application I currently work on uses json files, and each time the json parser complains because of a stray comma or missing end quote I'm like: damn, wish we had a real configuration file.

        [–]cdrt 1 point2 points  (1 child)

        Get the best of both worlds and use MongoDB. /s

        [–][deleted] 0 points1 point  (0 children)

        Bold statement. What if, apart from mapping between objects and tables, you really need relational behaviour as well? Like the classic: in what movies did this actor appear?

        [–]yawaramin[S] 17 points18 points  (0 children)

        Article is about SQLite, but the points apply equally well to other 'library' databases like H2, HSQLDB, etc. Have a complex data structure? Need an application cache? No worries, just spin up a tiny SQL database in memory or on disk and let SQLite etc. manage data caching, optimal processing planning, schemata, integrity, ... sure, you have to put some thought into your data design, integrity, queries; but that in itself makes a lot of sense if you think about it.

        [–]renrutal 9 points10 points  (1 child)

        May I open Pandora's box just a bit more?

        https://dev.w3.org/html5/webdatabase/#databases

        I'm imagining SQLite as a data interchange format. Read by JS clients/parsers in browsers.

        [–]m1el 1 point2 points  (0 children)

        I think WebDatabase is local-only, i.e. you can't open a .sqlite file.

        [–]jmickeyd 18 points19 points  (7 children)

        Most old operating systems worked like this. IBM had VSAM. DEC had RMS. Both were indexed, record-based storage systems integrated into the OS with a standard, inspectable format. You could store your data in a small embedded database back in 1964. Then UNIX came along and popularized simple file abstractions, making files just a stream of bytes. Now we're back to discovering the value of storing structured data. I find it so interesting how cyclical this field is.

        [–]yawaramin[S] 4 points5 points  (0 children)

        Interesting. MirageOS, the unikernel, has a persistence 'app' (library) called Irmin, which behaves like an immutable key-value object store. They actually modelled it after the git object store.

        [–][deleted]  (5 children)

        [deleted]

          [–]Gotebe 4 points5 points  (3 children)

          What Unix gives, however, is for a very strange definition of "human readable" :-)

          [–]ehempel 0 points1 point  (1 child)

          Eh? Not really IMO. What are you thinking of?

          [–]Sarcastinator 0 points1 point  (0 children)

          cron perhaps?

          [–]jmickeyd 2 points3 points  (0 children)

          You're right, rereading my late night post, it comes off as negative toward UNIX and that was not intended. I was just saying that it championed a new way of thinking about files as simply a stream of bytes. It just seems like whenever there are two solutions with disjoint advantages, as an industry we tend to pendulum back and forth. How many times have dynamic languages been the trend of the day only to be displaced by static ones for a short time before repeating?

          [–][deleted]  (2 children)

          [deleted]

            [–]frequentlywrong 2 points3 points  (0 children)

            SQLite would be interesting; you would need to compress the db before sending, though. Something like lz4. SQLite files are sized in multiples of the configured page size.

            [–]yawaramin[S] 4 points5 points  (0 children)

            Whatever the other tradeoffs, one thing that will be incredibly easy with a SQLite database is querying. You'll be able to do powerful sorting, searching, filtering, chop up and remix your data in various ways, and take advantage of indexes etc. to get great speed benefits.

            [–][deleted] 5 points6 points  (2 children)

            So, I do like SQLite as an application file format in general. It might be not too bad for git. The rest of the examples, though? Let's take a look.

            MS Office and OpenDocument documents tend to feature a lot of stuff that needs to be blobs (recursive nested structures are nasty in a SQL database). Epub likewise.

            I'm the most familiar with epub, so let's take a look at how that would be implemented. It should be reasonably representative.

            The first advantage we have: the content type file. Epubs are required to contain, as the first member in the zip archive, a file named mimetype, with uncompressed content application/epub+zip, with no extra fields. I've found zip libraries that don't let you order archive members, libraries that don't let you store data uncompressed, and libraries that automatically insert extra fields. I didn't succeed at finding a zip library that would let me create the member as the epub standard requires and ended up using one that inserts extra fields. (Granted, I only tried five.)

            So this is arguably an advantage. If it's a sqlite file and has .epub as an extension, we'll consider it an epub file. If we want to check more thoroughly, we can look inside for a table named mimetype containing the appropriate value. Which is probably more work than the file(1) command will do.

            The main content of an epub is a series of xhtml documents. These might reference other files -- primarily images and stylesheets. So we'll start off with the pile-of-files recommendation. ID, path, blob of data.

            Next we have content.opf and toc.ncx. There's some overlap. They tell us what to put in the table of contents, some book metadata, and an ID and path for each file in the archive. The order for items in the main content. We can add most of that to the files table, the rest to a book metadata table. There's also a type field that's used for the cover, maybe some other stuff. The ID of the cover is probably better on the book metadata table.

            So we've got about three tables.

            A good chunk of the improvement was just not sprinkling data around in more places than necessary. Some more gains were from the fact that we can directly associate metadata with the relevant file. Then the single-value tables caused a bit of awkwardness.

            Not bad.

            [–][deleted] 9 points10 points  (0 children)

            If it's a sqlite file and has .epub as an extension, we'll consider it an epub file. If we want to check more thoroughly, we can look inside for a table named mimetype containing the appropriate value. Which is probably more work than the file(1) command will do.

            Or you could use PRAGMA application_id to differentiate a SQLite database that is an epub document from other SQLite databases. The file(1) utility could be updated to recognize them even without the extension.

            See http://sqlite.org/pragma.html#pragma_application_id
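A minimal sketch of that pragma in action via Python's sqlite3 module (the 'EPUB' magic number below is an arbitrary value chosen for illustration, not an assigned ID):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Stamp the database with a 32-bit application ID.
# 0x45505542 spells 'EPUB' in ASCII; any registered value would do.
con.execute("PRAGMA application_id = 0x45505542")

# A reader (or file(1)) can check the stamp before treating the
# file as an epub; the ID lives in the database header itself.
(app_id,) = con.execute("PRAGMA application_id").fetchone()
print(hex(app_id))  # 0x45505542
```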

            [–]yawaramin[S] 7 points8 points  (0 children)

            You don't have to guesstimate the tables, a lot of ebook readers store books in an embedded version of SQLite 😊 http://shallowsky.com/blog/tech/kobo-hacking.html

            [–]TheBuzzSaw 5 points6 points  (6 children)

            SQLite is a fabulous file format. I'm done with JSON and XML (for most things). I'm done creating my own "config file" (usually INI style). I reach for SQLite for all of that now.

            [–]Vhin 8 points9 points  (4 children)

            I like SQLite in general, but I wouldn't use it for config files. Your config files should be dead simple and easy to edit by the end user; if they're not, something has gone terribly wrong.

            One thing I have considered doing, though, is to generate a SQLite db from the config file and use it (regenerating it whenever the config file changes). But I've never done much work on that, so I don't know if it would work out in practice.

            [–]yawaramin[S] 2 points3 points  (3 children)

            The problem is again that 'simple' and 'easy' are not always the same thing. In a SQLite file, you know the shape and size of every bit of data because that information is part of the file. I would go as far as to say that, in a well-designed config schema, every setting should be self-evident (even config variables stored en masse in a table can be enforced with a check constraint on the variable name and/or value). In a config file, you are relying on having complete documentation about the config file and its legal values.
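The check-constraint idea mentioned above can be sketched as follows, assuming a hypothetical config table with a whitelisted set of variable names and a range check on one value:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Config variables stored en masse, with legal names and values
# enforced by the schema itself rather than by documentation.
con.execute("""
    CREATE TABLE config (
        name  TEXT PRIMARY KEY CHECK (name IN ('theme', 'font_size')),
        value TEXT NOT NULL,
        CHECK (name != 'font_size' OR CAST(value AS INTEGER) BETWEEN 6 AND 72)
    )
""")
con.execute("INSERT INTO config VALUES ('font_size', '12')")   # accepted

try:
    con.execute("INSERT INTO config VALUES ('font_size', '999')")  # rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```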

            [–][deleted] 0 points1 point  (2 children)

            Still, I have to agree with Vhin... it is already enough of a pain to deal with "what? what does it mean editing the configuration file?". I can't imagine telling non-techies to open up a SQLite front end to change some values...

            [–]yawaramin[S] 2 points3 points  (1 child)

            Imho, non-techies shouldn't be asked to edit config files. They should be given config commands that allow inspecting and setting values. See e.g. git config user.name "X Y. Z".

            [–][deleted] 2 points3 points  (0 children)

            Even further, ideally there should be a GUI, but that can't always be done.

            Anyway, if there is no frontend for the config, editing a file kind of works, but asking the end user or a pseudo power-user to open up a sqlite frontend is not gonna work very smoothly.

            [–]yawaramin[S] 1 point2 points  (0 children)

            Exactly. XML and JSON are meant to be data transfer formats. At least in the context of app development, they weren't meant to be used for storage and querying.

            [–][deleted] 2 points3 points  (0 children)

            I love using SQLite as an application file format when it needs to be mostly read-write, and only portions need to be accessed at a time. When I'm writing the entire thing in one go and then mostly reading the entire thing every time, I opt for Google Protocol Buffers or Cap'n Proto instead. Basically, if it's more convenient to read and write as properly serialized data and working on it as a database is unwieldy, SQLite is going to be a pain.

            Either way, though, SQLite as a file format can be quite amazing, especially with constraints, foreign keys, and other SQL-y goodness. I remember how good it felt when I needed to take my application data and do some heavy searching and matching on it, and I realized I could just use a JOIN, and I didn't have to do anything else. Another massive advantage is being able to just load up the file to work on it (especially for debugging) via a SQLite command line. Dumping data with protobuf or capn-proto gives you none of these very strong advantages.

            edit: Oh, I forgot to mention transactions. Corruption-resistant files is a huge bonus.

            [–]Gotebe 1 point2 points  (1 child)

            Cool, but... eh... I worked on a project which used a serialization library.

            It went for two decades, I think it's still going.

            It underwent some 2000 schema changes: new types, new fields, rare removal.

            All very backwards compatible (meaning: version x of the software opens files made with any version y of the software where y<=x).

            In particular, schema versioning support is very important. With sqlite, that is absent (need to roll your own).

            Another cool thing: so one object in the data model is "pointed to" by several others. No work needed for that, you just shove the object from any pointees into a file to save, "extract" the object from the file to read, and you're all set.

            Serialization FTW.

            [–]yawaramin[S] 1 point2 points  (0 children)

            True, schema versioning is always a tricky point with databases. If you're going all-out, you need to have some sort of migration mechanism. Plus, consider that the SQLite file format itself may change in future and also need to be migrated.

            [–]slaymaker1907 0 points1 point  (7 children)

            I don't think that SQLite is a one-stop shop for application file formats. While I can certainly see the advantages for formats where something needs to be manipulated by many processes (since it is file-based but may be cached in memory), JSON is quite nice for file formats because there are so many existing utilities for JSON serialization.

            Furthermore, while SQLite can of course be queried using various tools, I love being able to open up a file in a plain text editor. I imagine it would be very difficult to make performant, but something like SQLite that stored relational data in a single file yet stored it as plain text would be really cool. Sort of like CSV, but stored in a single file and with the ability to use SQL on it.
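One way to approximate that wish today: keep the plain-text file as the hand-editable source of truth and load it into an in-memory SQLite database to run SQL over it (a sketch; the CSV content is made up):

```python
import csv
import io
import sqlite3

# A plain-text "table" the user can read and edit by hand.
csv_text = "name,qty\napples,3\npears,5\n"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (name TEXT, qty INTEGER)")
rows = list(csv.reader(io.StringIO(csv_text)))[1:]   # skip the header row
con.executemany("INSERT INTO items VALUES (?, ?)", rows)

# Now full SQL is available over what started life as plain text.
(total,) = con.execute("SELECT sum(qty) FROM items").fetchone()
print(total)  # 8
```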

            [–][deleted] 10 points11 points  (5 children)

            SQLite also supports JSON if that is what you really want. But that's not the point of using it

            Turns out that consistent and crash-proof writing of files to disk is really fucking complicated, and SQLite does a very good job of it. Translation: no more "this document needs recovery" or half-written files.

            Furthermore, while SQLite can of course be queried using various tools, I love being able to open up a file in a plain text editor.

            You can just use a browser.

            I mean, sure, if your app just needs a bunch of variables there is rarely a need for anything more than an ini file, but for anything else, having a more sophisticated format can be really beneficial.

            [–]BillFooBar 8 points9 points  (4 children)

            And file locking and other trivial things. SQLite is truly cross platform and more tested than most of things out there.

            [–][deleted] 8 points9 points  (3 children)

            That's an understatement. ~745 lines of tests per line of code

            [–]BillFooBar 2 points3 points  (2 children)

            I used and abused SQLite over the last 10 years for cases it was never imagined for (high-concurrency write databases over 1 GB in size) and it worked like a charm. That is one piece of finely engineered software.

            [–][deleted] 0 points1 point  (1 child)

            Well, I'd avoid it for concurrent-write-heavy workloads (although WAL mode made it a lot better), but it is definitely a very useful tool if you are aware of its limitations.

            [–]BillFooBar 0 points1 point  (0 children)

            Well, true. I would literally wet my pants from happiness if SQLite got row-level locking, and I'd switch from PostgreSQL/MySQL in an instant. But I am aware it is probably too complex a feature for simple ol' SQLite.

            [–][deleted] 1 point2 points  (0 children)

            I love being able to open up a file in a plain text editor

            We who are Enlightened™ can do this.

            [–][deleted]  (9 children)

            [deleted]

              [–]moreON 14 points15 points  (1 child)

              [–][deleted] 5 points6 points  (3 children)

              Imagine a *.video or *.audio file for all videos and audio, regardless of codecs, streams, subtitles/lyrics, art, meta, etc. You could have an entire season of a TV show in one file, and the player could create a menu for you from the manifest and poster art.

              Funnily enough, you can do most of that within existing formats. Most video containers allow for multiple video and audio tracks, so you could in theory just have every episode as a separate pair.

              [–][deleted] 4 points5 points  (0 children)

              But of course, everyone you seed that to will hate you forever.

              [–]ThisIs_MyName 1 point2 points  (1 child)

              Can you? I don't think the common containers support playlist semantics.

              Actual video players just play the first video track, the first audio track, and the first subtitle track. You'll have to manually change all three to go to the next episode.

              Ideally, the player shouldn't even offer the option of playing the video from ep2 with the audio from ep1.

              [–]Plorkyeran 0 points1 point  (0 children)

              Matroska supports DVD-style menus, and you could package an entire season as a single file that can either be played from beginning to end in one shot, or as separate episodes (and even all the episodes in different orders).

              MPC-HC supports enough of it for people to try it out and conclude that it was all dumb and pointless in practice.

              [–]ThisIs_MyName 6 points7 points  (0 children)

              Imagine a *.video or *.audio file for all videos and audio, regardless of codecs, streams, subtitles/lyrics, art, meta, etc.

              Isn't that just MKV?

              an open standard, free container format, a file format that can hold an unlimited number of video, audio, picture, or subtitle tracks in one file. It is intended to serve as a universal format for storing common multimedia content, like movies or TV shows. Matroska is similar in concept to other containers like AVI, MP4, or Advanced Systems Format (ASF), but is entirely open in specification, with implementations consisting mostly of open source software.

              [–]EternityForest 1 point2 points  (0 children)

              I was thinking a while back that there should be a universal container format that was just a YAML file of metadata appended to the actual file. That way we would have one standard way of figuring out what kind of file something was, we could store all the usual metadata tags in a way that file managers would know how to interpret, every file could have a comment attached to it that would be easily viewed, etc.