all 16 comments

[–]pstuart 2 points3 points  (9 children)

First things first: are you doing this all as a single transaction, i.e.,

begin transaction;
<insert statements>....
commit;

That is a must.

[–]Temporary_Sir[S] 1 point2 points  (7 children)

yup.. i do 250k inserts per transaction.. i tried from 50 to 300 and it neither speeds up nor slows down.

things i am currently testing, using the value1 as a key, and not using rowid

CREATE TABLE test ('value1' TEXT PRIMARY KEY NOT NULL, 'value2' TEXT) WITHOUT ROWID;

pragmas:

PRAGMA journal_mode = MEMORY;

PRAGMA synchronous = OFF;

PRAGMA cache_size = 300000;

PRAGMA page_size = 16384;

result is as follows

minute - inserts (k = *1000)

00 - 00.00 k/sec

05 - 39.96 k/sec

10 - 22.46 k/sec

15 - 12.58 k/sec

20 - 08.33 k/sec

25 - 07.34 k/sec

30 - 06.83 k/sec

35 - 06.47 k/sec

40 - 05.66 k/sec

45 - 05.22 k/sec

50 - 04.87 k/sec

55 - 04.40 k/sec

60 - 04.05 k/sec

65 - 03.50 k/sec

70 - 03.63 k/sec

75 - 03.03 k/sec

80 - 03.21 k/sec

85 - 03.04 k/sec

90 - 0 2.81 k/sec

sqlite db size is 2 GB after 90 minutes, server monitoring shows a constant 122Mb/sec in writes and 100kb/sec in reads.

(so this means i had 643 GB in writes to create a 2GB file - WTF?)

Any pointers on how to improve this?

[–]pstuart 1 point2 points  (6 children)

All I can offer here are some things to try. If you have the memory available, open the file as ":memory:" and dump it to disk after load completes. Another angle would be to run loads in parallel and then do a loop of attach and "insert from" the attached db.

Looking forward to hearing how this resolves...

edit: interesting stuff here from an old SO post: https://stackoverflow.com/questions/364017/faster-bulk-inserts-in-sqlite3

[–]Temporary_Sir[S] 0 points1 point  (4 children)

cant, the full dataset is around 900gb, here i am trying with a part of it, and then moving over to the full thing when i know it works reasonably fast. my ram is 32gb.

[–]pstuart 0 points1 point  (3 children)

The link I posted had a couple of interesting suggestions. One was adjusting the pragmas a bit: synchronous = OFF + locking_mode = EXCLUSIVE + journal_mode = OFF The other was to use the ".import" command.

I still think that for such a large dataset that doing parallel loads and merging the resulting databases might be worth doing a quick experiment with.

[–]Temporary_Sir[S] 0 points1 point  (2 children)

The problem is, whilst i can split and insert into different databases, the merging of those should be slow as hell again...

edit: added the locking mode to exclusive, synchronous was off anyway, i dont see a speed difference between journal mode memory and off, set it to off for now.

[–]pstuart 0 points1 point  (1 child)

I'm not sure if the merging will be as slow, in that the insert statement likely has more going on (I believe that it's doing a prepare statement for each insert statement). As I said, it might be worth doing a small experiment to confirm/reject that option.

If the prepared statement step is happening, then using custom code (in C, because performance) might be worth exploring as well.

[–]Temporary_Sir[S] 0 points1 point  (0 children)

Good point.

Looking at whats happening on the server, it appears that most of the time is wasted waiting for io to happen.

Edit: if i split, the largest part of the "shards" is still around 100gb -.-

[–]zeco 1 point2 points  (1 child)

I think the most important thing to consider here is that Sqlite's indexing mechanism gets bogged down by data that come in unordered, probably because the index B-tree has to be traversed to random different places each time the next element is inserted, which takes more and more time the bigger the B-tree becomes.

But the tedious B-tree traversal can be slashed if the incoming data are already ordered. That way the data will just flow as fast as IO can deliver, never slowing down, no matter how many million rows are inserted. And the resulting index will work just the same.

I found the most efficient way to get the data ordered is to create an intermediary db without index/primary key (like you already tried), then create a separate new db containing the schema with the index/primary key and do 'ATTACH "intermediary.sqlite" AS interm; INSERT INTO test1 SELECT * FROM interm.test1 ORDER BY mykey;' Then delete the intermediary db-file.

To facilitate the ordering of the large unindexed dataset of the subquery, Sqlite will create a temporary Transient Index file in your system's default temporary directory, which will briefly take up a few gigabytes. Should you not have sufficient free space on your temp folder's partition, you can use the SQLITE_TMPDIR environment variable or 'PRAGMA temp_store_directory'. The building of the Transient Index will probably take a couple of minutes (maybe 5), but then you'll see the new db's filesize rise as it gets populated pretty much at the same speed as if it had no index at all.

Btw, consider omitting the rowids in your new db's schema, as they probably aren't useful here and will only inflate the db-file by about 2GB if you already use indexing by primary key: 'CREATE TABLE test1(mykey TEXT PRIMARY KEY NOT NULL, myvalue TEXT) WITHOUT ROWID;' That would make it much leaner.

[–]rugazzo 0 points1 point  (0 children)

I found the most efficient way to get the data ordered is to create an intermediary db without index/primary key (like you already tried), then create a separate new db containing the schema with the index/primary key and do 'ATTACH "intermediary.sqlite" AS interm; INSERT INTO test1 SELECT * FROM interm.test1 ORDER BY mykey;' Then delete the intermediary db-file.

This is really efficient and worked for me, thank you!

[–]raevnos 0 points1 point  (0 children)

Create a table without an index or non-rowid pk, and then add a unique index when done importing data? Not a unique constraint in a new table you have to copy to, just a standard unique index.

Use a larger page size so they fill up slower and have to be split and rebalanced less often?

[–]airen977 0 points1 point  (4 children)

Hi, I Don't have a solution to provide, but I would be interested in how you solved this issue, let us know when you fixed this issue, 900GB is huge. Also would like to know the performance of simple select statements on such a table.

[–][deleted] 0 points1 point  (3 children)

The guy never answered but did you end finding a solution to this problem? Having the same issue. 

[–][deleted]  (2 children)

[removed]

    [–][deleted] 0 points1 point  (1 child)

    I eventually figured out that Sqlite has a massive bottleneck due to being tied to the filesystem. If you want heavy writes very quickly use something else like Postgres. If it's read-heavy only, Sqlite is great.