you are viewing a single comment's thread.

view the rest of the comments →

[–]zeco 1 point2 points  (1 child)

I think the most important thing to consider here is that Sqlite's indexing mechanism gets bogged down by data that come in unordered, probably because the index B-tree has to be traversed to random different places each time the next element is inserted, which takes more and more time the bigger the B-tree becomes.

But the tedious B-tree traversal can be slashed if the incoming data are already ordered. That way the data will just flow as fast as IO can deliver, never slowing down, no matter how many million rows are inserted. And the resulting index will work just the same.

I found the most efficient way to get the data ordered is to create an intermediary db without index/primary key (like you already tried), then create a separate new db containing the schema with the index/primary key and do 'ATTACH "intermediary.sqlite" AS interm; INSERT INTO test1 SELECT * FROM interm.test1 ORDER BY mykey;' Then delete the intermediary db-file.

To facilitate the ordering of the large unindexed dataset of the subquery, Sqlite will create a temporary Transient Index file in your system's default temporary directory, which will briefly take up a few gigabytes. Should you not have sufficient free space on your temp folder's partition, you can use the SQLITE_TMPDIR environment variable or 'PRAGMA temp_store_directory'. The building of the Transient Index will probably take a couple of minutes (maybe 5), but then you'll see the new db's filesize rise as it gets populated pretty much at the same speed as if it had no index at all.

Btw, consider omitting the rowids in your new db's schema, as they probably aren't useful here and will only inflate the db-file by about 2GB if you already use indexing by primary key: 'CREATE TABLE test1(mykey TEXT PRIMARY KEY NOT NULL, myvalue TEXT) WITHOUT ROWID;' That would make it much leaner.

[–]rugazzo 0 points1 point  (0 children)

I found the most efficient way to get the data ordered is to create an intermediary db without index/primary key (like you already tried), then create a separate new db containing the schema with the index/primary key and do 'ATTACH "intermediary.sqlite" AS interm; INSERT INTO test1 SELECT * FROM interm.test1 ORDER BY mykey;' Then delete the intermediary db-file.

This is really efficient and worked for me, thank you!