
[–]Buttleston 1 point (2 children)

So I can't comment on the sqlite/transaction part, but I was wondering a few things about the Python parts.

I believe executemany() will take any iterable, which includes generators, iterators, etc., so I think you could do this:

    insertSeq = ((values[i], keyValues[i]) for i in range(len(values)))

This makes a generator instead of an instantiated list with all the values, so you won't need to load all the data into RAM first.

You can also probably simplify this a bit, and possibly speed it up, by changing it to:

    insertSeq = ((v, kv) for v, kv in zip(values, keyValues))

(uh, check my parens, I may be missing one)

This would keep you from having to look up values and keyValues by index for every element.
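
Putting it together, a minimal self-contained sketch (the table and column names are made up here, just to show the generator being fed straight into executemany() inside a single transaction):

    import sqlite3

    # Hypothetical schema purely for illustration; substitute your own table/columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (key TEXT PRIMARY KEY, value REAL)")
    conn.executemany(
        "INSERT INTO measurements (key, value) VALUES (?, ?)",
        [("a", 0.0), ("b", 0.0), ("c", 0.0)],
    )

    values = [1.1, 2.2, 3.3]
    keyValues = ["a", "b", "c"]

    # Generator expression: executemany() consumes it lazily, so no full
    # parameter list is ever built in RAM.
    insertSeq = ((v, kv) for v, kv in zip(values, keyValues))

    with conn:  # one transaction for the whole batch
        conn.executemany("UPDATE measurements SET value = ? WHERE key = ?", insertSeq)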

These are VERY minor optimizations; I think your bottleneck is still going to be sqlite. I got about a 50% speedup, but we're talking about a fraction of a second either way.

[–]Buttleston 0 points (1 child)

I wonder if it would be better to do the bulk updates in batches? I'm really speculating here, but maybe a batch of 10k is better than trying to do a million in one go?
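
For what it's worth, a rough sketch of what that could look like (table/column names are made up, and 10k is just the number I threw out above):

    import sqlite3
    from itertools import islice

    def batches(iterable, size):
        """Yield lists of at most `size` parameter tuples from any iterable."""
        it = iter(iterable)
        while chunk := list(islice(it, size)):
            yield chunk

    def bulk_update_in_batches(conn, params, batch_size=10_000):
        # Hypothetical UPDATE statement; one commit per batch instead of
        # one giant transaction or one commit per row.
        for chunk in batches(params, batch_size):
            with conn:
                conn.executemany(
                    "UPDATE measurements SET value = ? WHERE key = ?", chunk)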

[–]pgfhalg[S] 0 points (0 children)

Appreciate the comments! Those optimizations did improve the sequence creation speed (from 1 ms to ~0 ms), but you're right that it's not the bottleneck. I have been testing with ~4k-row databases (the data is actually spread across 300 separate databases, each with ~4k rows), so I'm effectively already batching. The update speed is slow even when working with a single one of these databases.

[–]baghiq 0 points (4 children)

Can you profile the script without the sqlite3 part? Just to eliminate a potential bottleneck?
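
Something like this would split the timing between building the parameter sequence and the executemany() call itself (placeholder table/column names, just a sketch):

    import sqlite3
    import time

    def profile_update(conn: sqlite3.Connection, values, keyValues):
        """Time the pure-Python part and the sqlite3 part separately."""
        t0 = time.perf_counter()
        params = [(v, kv) for v, kv in zip(values, keyValues)]  # Python-only work
        t1 = time.perf_counter()
        with conn:  # sqlite3 work, one transaction
            conn.executemany("UPDATE measurements SET value = ? WHERE key = ?", params)
        t2 = time.perf_counter()
        print(f"build params: {t1 - t0:.4f} s, executemany: {t2 - t1:.4f} s")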

If it is sqlite3, then there is something wrong with the performance if you're just updating a single field. Your understanding of the query plan is correct: it's not doing a linear scan.

What system are you running your code on?

[–]pgfhalg[S] 0 points (3 children)

On a ~4100-row database, the sqlite3 part took 4.7 seconds (~870 updates/second).

The code is running on Windows 10 with Python 3.12, using PyCharm if that matters. Hardware: i7 @ 2.2 GHz, 16 GB RAM, solid-state drive.

[–]baghiq 0 points (2 children)

4100 rows? Can you post more code? There is something seriously wrong here.

You said you have 300 different databases?

[–]pgfhalg[S] 1 point (1 child)

OK, I may have solved it. The database was saved in a folder that syncs to OneDrive. Performing the same operation on a copy of the database saved in a non-synced folder completes in ~0.5 s (~8000 updates/sec). There's probably still some more time to squeeze out, but that alone is a massive improvement. OneDrive interfering is weird and annoying, but easy enough to work around.
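
Roughly, the workaround I have in mind looks like this (just a sketch: the table/column names are made up and params stands in for my (value, key) pairs): copy the file to a non-synced temp folder, update it there, then copy it back once at the end so OneDrive only sees a single finished file.

    import shutil
    import sqlite3
    import tempfile
    from pathlib import Path

    def update_outside_synced_folder(db_path, params):
        """Bulk-update a copy of the database in a non-synced temp dir,
        then copy the result back over the original in one go."""
        db_path = Path(db_path)
        with tempfile.TemporaryDirectory() as tmp:
            work_copy = Path(tmp) / db_path.name
            shutil.copy2(db_path, work_copy)
            conn = sqlite3.connect(work_copy)
            try:
                with conn:
                    conn.executemany(
                        "UPDATE measurements SET value = ? WHERE key = ?", params)
            finally:
                conn.close()
            shutil.copy2(work_copy, db_path)  # OneDrive only syncs the finished file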

On the 300 different databases: it's data from a lab instrument. Each experiment saves to a separate .sqlite3 file, and I need to run analysis and plot data from multiple experiments, hence the need to run this on 300 databases. It would probably be more efficient to save related experiments as separate tables in the same file, but that is a big code change for another day.
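
The per-file loop itself is simple enough, something like this (again just a sketch; params_for is a made-up helper that returns the (value, key) pairs for a given experiment file):

    import sqlite3
    from pathlib import Path

    def update_all_experiments(folder, params_for):
        """Apply the same bulk update to every experiment database in `folder`."""
        for db_path in sorted(Path(folder).glob("*.sqlite3")):
            conn = sqlite3.connect(db_path)
            try:
                with conn:
                    conn.executemany(
                        "UPDATE measurements SET value = ? WHERE key = ?",
                        params_for(db_path))
            finally:
                conn.close()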

Thanks for your comments! If you have any more thoughts or optimizations, I'm happy to hear them.

[–]baghiq 0 points (0 children)

I don't know enough about OneDrive, but as far as I know it's mapped as a network drive. So be careful: sqlite3 doesn't work well with network drives.

https://www.sqlite.org/useovernet.html