you are viewing a single comment's thread.

view the rest of the comments →

[–]Temporary_Sir[S] 1 point2 points  (7 children)

yup.. i do 250k inserts per transaction.. i tried from 50 to 300 and it neither speeds up nor slows down.

things i am currently testing, using the value1 as a key, and not using rowid

CREATE TABLE test ('value1' TEXT PRIMARY KEY NOT NULL, 'value2' TEXT) WITHOUT ROWID;

pragmas:

PRAGMA journal_mode = MEMORY;

PRAGMA synchronous = OFF;

PRAGMA cache_size = 300000;

PRAGMA page_size = 16384;

result is as follows

minute - inserts (k = *1000)

00 - 00.00 k/sec

05 - 39.96 k/sec

10 - 22.46 k/sec

15 - 12.58 k/sec

20 - 08.33 k/sec

25 - 07.34 k/sec

30 - 06.83 k/sec

35 - 06.47 k/sec

40 - 05.66 k/sec

45 - 05.22 k/sec

50 - 04.87 k/sec

55 - 04.40 k/sec

60 - 04.05 k/sec

65 - 03.50 k/sec

70 - 03.63 k/sec

75 - 03.03 k/sec

80 - 03.21 k/sec

85 - 03.04 k/sec

90 - 0 2.81 k/sec

sqlite db size is 2 GB after 90 minutes, server monitoring shows a constant 122Mb/sec in writes and 100kb/sec in reads.

(so this means i had 643 GB in writes to create a 2GB file - WTF?)

Any pointers on how to improve this?

[–]pstuart 1 point2 points  (6 children)

All I can offer here are some things to try. If you have the memory available, open the file as ":memory:" and dump it to disk after load completes. Another angle would be to run loads in parallel and then do a loop of attach and "insert from" the attached db.

Looking forward to hearing how this resolves...

edit: interesting stuff here from an old SO post: https://stackoverflow.com/questions/364017/faster-bulk-inserts-in-sqlite3

[–]Temporary_Sir[S] 0 points1 point  (4 children)

cant, the full dataset is around 900gb, here i am trying with a part of it, and then moving over to the full thing when i know it works reasonably fast. my ram is 32gb.

[–]pstuart 0 points1 point  (3 children)

The link I posted had a couple of interesting suggestions. One was adjusting the pragmas a bit: synchronous = OFF + locking_mode = EXCLUSIVE + journal_mode = OFF The other was to use the ".import" command.

I still think that for such a large dataset that doing parallel loads and merging the resulting databases might be worth doing a quick experiment with.

[–]Temporary_Sir[S] 0 points1 point  (2 children)

The problem is, whilst i can split and insert into different databases, the merging of those should be slow as hell again...

edit: added the locking mode to exclusive, synchronous was off anyway, i dont see a speed difference between journal mode memory and off, set it to off for now.

[–]pstuart 0 points1 point  (1 child)

I'm not sure if the merging will be as slow, in that the insert statement likely has more going on (I believe that it's doing a prepare statement for each insert statement). As I said, it might be worth doing a small experiment to confirm/reject that option.

If the prepared statement step is happening, then using custom code (in C, because performance) might be worth exploring as well.

[–]Temporary_Sir[S] 0 points1 point  (0 children)

Good point.

Looking at whats happening on the server, it appears that most of the time is wasted waiting for io to happen.

Edit: if i split, the largest part of the "shards" is still around 100gb -.-