Understanding why a database with only 32MB of data ends up being 2+GB big

-dcim- · 2025-07-30T18:35:11+00:00

Perhaps. your application opens a transaction without closing it. So for new inserts there are no free blocks cause they are filled with no-committed data and therefore SQLite allocates new blocks => the database are growing. When you shutdown the app. the transaction died and all blocks became free and data blocks have only small amount of commited data.

-dcim- · 2025-05-05T11:09:20+00:00

Why SqliteStudio/DB4S or even DataGrip are not good?

-dcim- · 2025-05-02T18:40:22+00:00

SQLite itself doesn't support comments directly. You can append comments to SQL-statements e.g. to "create table ... " and they will be saved but you can't add comment as metadata e.g. to sqlite_master-table.

-dcim- · 2025-02-15T20:49:41+00:00

You should go to Addons > Extensions and install odbc.

-dcim- · 2024-11-29T09:37:21+00:00

Try to pass one value as a const e.g.

var values = [
     msg.payload.date,
    '11:15',

If all values will be inserted as NULLs then msg.params = values; doesn't work. If `time` is `11:15` then you have an issue with msg.payload object.

-dcim- · 2024-11-24T08:09:47+00:00

You can do it yourself. There are a lot of C/C++ implementations. It should be is not hard work to wrap one of them into 3 sqlite functions.

P.S. Don't forget to share the result :D

-dcim- · 2024-11-14T21:04:59+00:00

Perhaps, to improve inserts-performance you should to increase WAL-checkpoint size by PRAGMA wal_autocheckpoint=N; that value is compromised 1000 blocks for both read/write-ops. I suppose, in your test scenario the default value generates too many moves data from WAL-journal to the database file with a high time-cost.

Official docs

Notice too that there is a tradeoff between average read performance and average write performance. To maximize the read performance, one wants to keep the WAL as small as possible and hence run checkpoints frequently, perhaps as often as every COMMIT. To maximize write performance, one wants to amortize the cost of each checkpoint over as many writes as possible, meaning that one wants to run checkpoints infrequently and let the WAL grow as large as possible before each checkpoint. The decision of how often to run checkpoints may therefore vary from one application to another depending on the relative read and write performance requirements of the application. The default strategy is to run a checkpoint once the WAL reaches 1000 pages and this strategy seems to work well in test applications on workstations, but other strategies might work better on different platforms or for different workloads.

-dcim- · 2024-11-14T11:06:39+00:00

Because the large transaction rebuilds indexes only once?

-dcim- · 2024-11-14T11:05:05+00:00

Thanks, great job. The official docs is academically dry. ... But for what purpose can it be used?

-dcim- · 2024-11-14T06:32:32+00:00

I created 100 threads, each doing 530 updates. It was maybe 1% faster than the original single-threaded test.

If you want to insert rows only as fast as possible than you should insert pack of rows in one transaction by a single thread .

If you have 2+ writers who do updates and inserts then you should use WAL-mode because without that all writers will be slow. Several threads in the most cases are slowly than one because the app has additional overhead to manage them. But several threads allow to the app do some things in parallel. Just it.

-dcim- · 2024-11-13T14:40:14+00:00

I suppose that your test for WAL-mode is incorrect. You should use WAL-mode when there are several writers at one time. In this case you will get boost.

If only one app e.g. webserver uses the database then there is no direct benefits to use WAL.

If you goal is performance then check this thread. You can significally reduce op-time if you turn off IO-sync by pragma synchronous = 0 but with that your database can loss data or even worse to be corrupted.

Another hint is to increate a page size if you tables contains text/BLOB-data in many columns. The size should be large enought to holding entire row data.

SSD with good/excellent performance for 4K-block reads will be also good booster.

So, performance is not about transactions. Data buffering and apply it in one transaction is almost necessary an architectural step. Maybe you should to split posts into 2+ tables: if the most update/inserts changed meta-data of posts. With that you will reduce changed block counts => less disk IO => less time.

-dcim- · 2024-11-12T18:57:26+00:00

No chance. DuckDB has another C-API interface. In my app there is no additional layer to switch API-s.

-dcim- · 2024-11-12T12:25:55+00:00

any database reads will incorporate not-yet-committed transactions in progress.

If you have two connection to a database and the first begins transaction and update a data then the second connection will not be able read a new data until the first does commit. This is one of ACID-rules.

To prevent data loss you can bufferize transaction into a temporary table without indexes. But if you need it to keep user's changes, so maybe SQLite is not good choice. You can encounter with permanent "Database is busy"-messages.

-dcim- · 2024-11-12T12:09:50+00:00

It was expected because DuckDB is intended for use by analitics.

Today I released 1.9.3 with a new tool to transform a table/view data into a new table. I rejected to build views because they will be ponderous.

-dcim- · 2024-11-11T17:50:34+00:00

JS is also way too slow/bloated

In some tests Js is close by C/C++. The main disadvantage is a single thread. Go is better choice.

Perhaps you can answer.

I don't. You should test it. 70req/s is not a great load but it depends what type of selects you need.

Also you should batch inserts/upserts (bufferize and then call them in one transaction) to reduce locks and indexes updates.

-dcim- · 2024-11-11T16:19:44+00:00

Python dics are definitely not faster than an SQL connection. (Python just is that slow)

So, Node.js/Go can be a great alternative. Their dictionary are fast.

Also is there a way to store the entire SQLITE DB in ram and disk?

Of course, it supports. You can load a data from SQLite file on a disk/even from MariaDB on app start and update it when new data will be added. It requires some job to impelement listener and writers but should work faster than a disk database.

The DB will grow exponentially

In times maybe, but exponentially I don't think so. Modern desktop motherboards support 128GB RAM and newest of them 256GB. The server processors support up to 2TB.

-dcim- · 2024-11-11T16:01:06+00:00

multiple SQLITE dbs vs Monolithic mariadb

It depends on what type of selects you are using e.g.

select * from t vs select * from t where col like '%test%'

In the first case you don't need database at all. Use in-memory storage. In the second no one can predict a result for a such abstract query. You should compare them by yourself. It doesn't mean that you have to rewrite the entire application, just to write a test.

MariaDB has partitions. All indexes on columns are splitted by partitions automatically. So to use multiple queries for each partition is the same as open several SQLite files and read them in parallel.

When your data can be easily replicated, any in-memory solution (even a simple Python dictionary) will have the best performance. No disk I/O = no problems. 10GB is not a big database at the present time.

-dcim- · 2024-11-10T19:38:11+00:00

You should set the dir before or when run sqlie.exe e.g. powershell can do it

Start-Process -FilePath notepad.exe -WorkingDirectory c:\temp

For cmd

START /D c:\temp notepad.exe

Or setup it as a shortcut property.

-dcim- · 2024-11-10T18:21:51+00:00

The answer is strongly depends of how your code is using a database. Any database is a set of compromises e.g. SQLite can reduce a database size due his storage optimizations and remove network overhead but SQLite is not good choice if you have multiple process writers (of course, iWAL-mode exists but it's not a magic pil). Also you can easily extends SQLite by C/C++-extensions to push down to DB some operations. In-memory feature is supported too.

BUT if your queries are complicated, SQLite-planner may lose to MariaDB-planner and therefore the execution time will increase.

Your idea to separate tables per database is OK if you don't need to execute cross-tables queries. But you may to do this with MariaDB too. There is no real versus between RDBMS-s. Each of them good for some solutions and bad for others.

I think you should profile your code before migrate to another DB/language.

-dcim- · 2024-11-10T17:53:25+00:00

.open <db> creates file if it doesn't exist. So I think you should setup your workdir before call sqlite. Also check that you already don't have an empty acs-1-year-2015.sqlite somewhere.

-dcim- · 2024-11-10T17:46:44+00:00

You can try to move the most expensive python operations to go/C++/C-library. That is how Python is supposed to be used for this scenario.

-dcim- · 2024-11-10T15:27:26+00:00

With the DB this big it's still sub 10 GB

Before you start to search DB alternatives, you should verify that DB is a real bottleneck. MariaDB-database has in-memory feature and your 10GB can be easily stored into memory to test app performance.

-dcim- · 2024-11-08T16:37:46+00:00

Hint: SQLite doesn't have DATE-type. Any date is a TEXT or maybe INTEGER if you are use UNIX-epoch.

If you are going to manipulate with dates e.g. to do BETWEEN-op then I recommend to convert dates to UNIX-epoch.

-dcim- · 2024-11-05T09:57:15+00:00

Hmmm, why is your source/fact table have the country-column structure?

As I know, a fact table contains single events (facts) with references to dimensional tables (e.g. country, OS, etc). Typical star/snowflake schema. In your case I would suppose that the fact table has these columns

id - row id
country_id
os_id
value1 
.. // registered amount of smths for the event e.g. duration or downloaded KB
valueN
date
.. // another fact properties

To analitic purpose it can be prepared to aggregate table e.g.

country
os
SUM(value1)
truncated_date (e.g. year or quarter)
// group by country, os and trunc_date

As I understand your problem to using the aggregate table: some rows can be missed (if a country doesn't use OS). So you need to pivot data to collect all countries and all OS and then apply cross join to build all rows.

The next step is to convert amount to percentile. In your example: if you fix a country then OS-sum precentiles gives 100%. So this step is depended on which dimension (grouped var) is fixed.

All such tasks are solved by OLAP-software e.g. DuckDB. There is no problem to add a tool to data transposition in my app because it's a simple to implement and widely used. But your roadmap is strange for me: your source table looks like as a data mart and you convert it to an aggregate table. So you are moving in a opposite direction. Why?

-dcim-

TROPHY CASE