What is SQLite?
In the past few years SQLite (not SQL-light) has surged in popularity as people have come to realise its power as an in-process, highly reliable SQL database engine for server backends, rather than in its traditional role in client and edge applications. This change in stance has happened despite the authors almost actively discouraging its use for this purpose.
I am interested in SQLite for some key reasons:
- It is conceptually simple: Imagine a B-tree of rows/tuples partitioned by a table’s primary key that has been exhaustively tested to persist that data to disk reliably, then add a SQL interaction layer.
- It has a practical backup strategy via Litestream, which performs backups and continuous replication of the write-ahead log to a remote location. The backups can then be restored automatically on startup with a trivial command. Does your business really need the 99.99% uptime offered by expensive managed cloud databases from AWS, Azure or Google?
- It can be run locally, as I still like having a complete development environment that works offline.
- It can be run in-memory via `file::memory:`, allowing test code to easily start up and tear down an instance per test if required.
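That last point is easy to demonstrate. The sketch below uses Python's stdlib `sqlite3` driver; the `users` table and `make_test_db` helper are hypothetical, purely for illustration:

```python
import sqlite3

def make_test_db():
    # Each call opens a private in-memory database that vanishes on close,
    # giving every test a clean, isolated instance.
    conn = sqlite3.connect("file::memory:", uri=True)
    conn.execute("CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT)")
    return conn

# per-test setup...
conn = make_test_db()
conn.execute("INSERT INTO users(name) VALUES ('alice')")
assert conn.execute("SELECT count(*) FROM users").fetchone()[0] == 1
conn.close()  # ...and teardown: the database is gone
```

Because each connection gets its own database, tests can run in parallel without sharing state.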
The single-writer limitation
The limitations of SQLite on servers have been well documented by the SQLite authors, and the best server-side configurations dissected; the standout limitation is "high-volume websites", by which they mean write-heavy websites.
In Write-Ahead Logging mode (required by Litestream) SQLite employs a single writer by design, allowing at most one write transaction but many read-only transactions to run concurrently. This places the bottleneck of a high-volume, write-heavy website squarely on how to manage throughput on that single writer… and that brings us back to one of the core building blocks of modern technology:
ACID
If you have forgotten the Atomic, Consistent, Isolated, Durable guarantees provided by most modern databases, now is the time to go and refresh your knowledge.
SQLite
SQLite, by default, offers strictly `SERIALIZABLE` transactions, which is the strongest isolation guarantee (unless certain conditions are met: https://sqlite.org/isolation.html). By employing a single writer, SQLite is using a form of pessimistic concurrency control that makes it easy to guarantee that the underlying data has not changed while the write transaction is in progress.
Postgres
Postgres, on the other hand, actually departs from the `SERIALIZABLE` default defined in the SQL standard and chooses the more relaxed `READ COMMITTED` (and this despite having the much more complex multiversion concurrency control). This reduction in strictness means it is at risk of non-repeatable reads: even within the same transaction, executing the same read query multiple times can return different results if the values have been changed by another committed transaction in the background. By choosing this isolation level Postgres opens up the risk of a transaction operating on stale data. The developer needs to keep this in mind - imagine what might happen when performing financial transactions with stale balance data.
If set to `SERIALIZABLE`, Postgres uses an optimistic concurrency control scheme where the data accessed during the transaction is tracked and, before commit, verified not to have changed. Postgres does this with either finer-grained (row-level) or coarser-grained (page-level) locking depending on the transaction, so as to manage memory usage. This pattern is called optimistic because it is expected that the underlying data will not have changed by the time the transaction commits - which is less likely the more fine-grained the data monitored by the transaction is.
FoundationDB
Transactions are not just limited to relational databases. FoundationDB, in my view one of the finest open-source projects (which incidentally has used SQLite's storage code for many years), also employs optimistic concurrency control to achieve `SERIALIZABLE` guarantees across a distributed key-value store. At the time when NoSQL came onto the scene, distributed NoSQL stores with ACID guarantees were not common, and FoundationDB wrote the Transaction Manifesto to highlight how greatly the developer can benefit from the ACID guarantees.
FoundationDB goes one step further and offers advice on how to write code for optimistic concurrency control, given that data will sometimes change in response to a concurrent transaction conflict and the transaction will be automatically retried:
Idempotence
> An idempotent transaction is one that has the same effect when committed twice as when committed once. – FoundationDB
In making every transaction idempotent, FoundationDB provides patterns for avoiding issues if a transaction needs to be retried multiple times due to a conflict. Some of these, like avoiding creating random identifiers inside the transaction, might be immediately recognisable to functional programmers, but the logic needed to guard against state that may have changed since the initial execution may be less obvious:
```python
# Increases account balance and stores a record of the deposit with a unique depositId
@fdb.transactional
def deposit(tr, acctId, depositId, amount):
    # If the deposit record exists, the deposit already succeeded, and we can quit
    depositKey = fdb.tuple.pack(('account', acctId, depositId))
    if tr[depositKey].present(): return
    amount = struct.pack('<i', amount)
    tr[depositKey] = amount
    # The above check ensures that the balance update is executed only once
    balanceKey = fdb.tuple.pack(('account', acctId))
    tr.add(balanceKey, amount)
```
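The same guard translates naturally to SQLite. Below is a minimal sketch using Python's stdlib `sqlite3`; the `accounts` and `deposits` tables are hypothetical, invented for this example:

```python
import sqlite3

def deposit(conn, acct_id, deposit_id, amount):
    # Idempotent: replaying the same deposit_id has no further effect.
    with conn:  # one atomic transaction (commit on success, rollback on error)
        if conn.execute("SELECT 1 FROM deposits WHERE acct_id=? AND deposit_id=?",
                        (acct_id, deposit_id)).fetchone():
            return  # the deposit already succeeded, so we can quit
        conn.execute("INSERT INTO deposits(acct_id, deposit_id, amount) VALUES (?, ?, ?)",
                     (acct_id, deposit_id, amount))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE acct_id = ?",
                     (amount, acct_id))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts(acct_id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE deposits(acct_id TEXT, deposit_id TEXT, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('acct1', 0)")
conn.commit()

deposit(conn, 'acct1', 'dep1', 50)
deposit(conn, 'acct1', 'dep1', 50)  # a retry of the same deposit: no double credit
```

Because the `deposit_id` acts as the uniqueness guard, an automatic retry after a conflict is harmless.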
With all that in our heads, what options does SQLite give us?
BEGIN …
SQLite offers multiple ways for developers to indicate to the engine how the transaction will behave, in the form of the `IMMEDIATE`, `EXCLUSIVE` and `DEFERRED` keywords which, in Write-Ahead Logging mode, reduce to `DEFERRED` vs `IMMEDIATE`.
It is worth mentioning the `SQLITE_BUSY_TIMEOUT` parameter: a configurable delay, in milliseconds, that a transaction can wait before returning `SQLITE_BUSY`, as seen below:
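Most drivers expose this timeout per connection. A small sketch with Python's stdlib `sqlite3` (the values are illustrative):

```python
import sqlite3

# connect()'s timeout installs a busy handler that retries for up to 5 seconds
conn = sqlite3.connect(":memory:", timeout=5.0)

# the same knob is available at the SQL level, in milliseconds
conn.execute("PRAGMA busy_timeout = 5000")
print(conn.execute("PRAGMA busy_timeout").fetchone()[0])  # 5000
```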
DEFERRED
Deferred means that the transaction starts in read mode, which can run concurrently with any existing read or write transactions, and is only upgraded to a blocking read-write transaction when a query that modifies the database state is executed (like `INSERT`, `UPDATE` or `DELETE`). If at the time of upgrade the database is locked by another transaction, a `SQLITE_BUSY` error is returned and it is up to the client to handle it (e.g. fail/retry). Note that this state cannot be resolved by waiting, so `SQLITE_BUSY_TIMEOUT` does not apply.
IMMEDIATE
Immediate means that the transaction starts immediately in read-write mode and, if a write transaction is already running, will immediately (plus `SQLITE_BUSY_TIMEOUT`) return `SQLITE_BUSY`. Once again, it is up to the client to decide how to handle it.
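Both failure modes can be observed with two connections to the same WAL-mode database. A sketch using Python's stdlib `sqlite3` (`isolation_level=None` gives us manual `BEGIN` control; `timeout=0` disables the busy handler so failures surface immediately):

```python
import sqlite3, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
w = sqlite3.connect(path, isolation_level=None, timeout=0)
w.execute("PRAGMA journal_mode=WAL")
w.execute("CREATE TABLE t(x)")
r = sqlite3.connect(path, isolation_level=None, timeout=0)

w.execute("BEGIN IMMEDIATE")               # takes the single write lock now
w.execute("INSERT INTO t VALUES (1)")

busy_immediate = busy_upgrade = None
try:
    r.execute("BEGIN IMMEDIATE")           # a second writer fails straight away
except sqlite3.OperationalError as e:
    busy_immediate = e                     # "database is locked" (SQLITE_BUSY)

r.execute("BEGIN DEFERRED")                # starts as a reader: allowed in WAL
r.execute("SELECT count(*) FROM t")        # concurrent read succeeds
try:
    r.execute("INSERT INTO t VALUES (2)")  # upgrade to read-write: SQLITE_BUSY
except sqlite3.OperationalError as e:
    busy_upgrade = e

w.execute("COMMIT")
```

Note how the deferred transaction reads happily alongside the writer and only fails at the moment it tries to upgrade.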
CONCURRENT
SQLite has an experimental branch that moves transactions from pessimistic to limited optimistic concurrency control - limited in that the optimistic locking operates at the database page level (by default `4096` bytes), not the row/tuple level.
In `CONCURRENT` mode, SQLite allows multiple write transactions to be active at once but, before commit, verifies that none of the pages accessed while performing the transaction have changed since the transaction began. If no conflict occurs, the changes are committed sequentially and strict `SERIALIZABLE` guarantees are achieved. If a conflict occurs, SQLite returns `SQLITE_BUSY` and the client can decide how to handle it. Note that this state cannot be resolved by waiting, so `SQLITE_BUSY_TIMEOUT` does not apply.
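Since the client must handle `SQLITE_BUSY` itself, a retry wrapper is the natural pattern. Below is a generic sketch in Python; note that `BEGIN CONCURRENT` only exists on the experimental branch, so this demo uses stock `BEGIN IMMEDIATE`, and the wrapper assumes the transaction body is idempotent:

```python
import sqlite3

def with_retry(conn, txn_fn, begin="BEGIN IMMEDIATE", attempts=5):
    # Run txn_fn inside a transaction, retrying when SQLITE_BUSY is raised.
    # On the begin-concurrent branch you would pass begin="BEGIN CONCURRENT",
    # where the page-level conflict check means SQLITE_BUSY can also surface
    # at COMMIT - which this loop handles too.
    for _ in range(attempts):
        try:
            conn.execute(begin)
            txn_fn(conn)
            conn.execute("COMMIT")
            return True
        except sqlite3.OperationalError:
            try:
                conn.execute("ROLLBACK")
            except sqlite3.OperationalError:
                pass  # no transaction was active
    return False
```

Usage: `with_retry(conn, lambda c: c.execute("INSERT INTO t VALUES (?)", (x,)))` - with the parameters generated outside the retry loop so a replay has the same effect.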
HC-Tree
Another experimental branch of SQLite is HC-Tree, a work in progress that aims to provide optimistic row/tuple-level locking. One interesting outcome is that they provide an excellent set of benchmarks demonstrating the performance benefits of such a design against the `BEGIN CONCURRENT` branch.
Why don’t we take their benchmarking approach and run it against the standard options?
Benchmarking
Setup
This script is set up to seed the database with `n` (default 1000000) rows, which creates a database file of around 350MB.
```sql
PRAGMA journal_mode = WAL;
PRAGMA mmap_size = 1000000000;
PRAGMA synchronous = off;
PRAGMA journal_size_limit = 16777216;

CREATE TABLE tbl(
  a INTEGER PRIMARY KEY,
  b BLOB(200),
  c CHAR(64)
);

-- https://www.sqlite.org/series.html
WITH RECURSIVE generate_series(value) AS (
  SELECT 0
  UNION ALL
  SELECT value+1 FROM generate_series
  WHERE value < {rows}
)
INSERT INTO tbl
SELECT value, randomblob(200), hex(randomblob(32))
FROM generate_series;

CREATE INDEX tbl_i1 ON tbl(substr(c, 1, 16));
CREATE INDEX tbl_i2 ON tbl(substr(c, 2, 16));
```
Strategies
The `hc-tree` benchmarks run two queries: one random `SELECT` and one `UPDATE`.

The `SELECT` query performs a random index lookup using the `tbl_i1` index. `frandomblob` has been replaced with a variable so the transaction is idempotent.
```sql
-- repeat nScan times:
-- SELECT * FROM tbl WHERE substr(c, 1, 16)>=hex(frandomblob(8)) ORDER BY substr(c, 1, 16) LIMIT 10;
SELECT * FROM tbl WHERE substr(c, 1, 16)>=? ORDER BY substr(c, 1, 16) LIMIT 10;
```
The `UPDATE` query refers to an `updateblob` function which can update part of the `b` `BLOB`. As this function does not exist, the query has been changed to update the entire `b` and `c` values and to be idempotent.
```sql
-- repeat nUpdate times:
-- UPDATE tbl SET b=updateblob(b, ?, ?), c=hex(frandomblob(32)) WHERE a = ?;
UPDATE tbl SET b=?, c=? WHERE a=?;
```
Behavior
The code has been written to start `n` threads that try to execute as many transactions as possible in a loop for 30 seconds (measured by incrementing a shared atomic counter).

- In all cases the code has been written to retry on `SQLITE_BUSY`.
- Each transaction has been written to be idempotent, so the `Scan` and `Update` parameters are generated outside of the retry loop to simulate how an automatic retry might operate.
The final piece of setup is to control the file location to protect a solid-state drive from wear. This command on macOS creates a 16G ramdisk mounted at `/Volumes/ramdisk`; the sector count is RAM in MB (16384) × 2048 (512-byte sectors per MB).
```
diskutil erasevolume apfs 'ramdisk' `hdiutil attach -nobrowse -nomount ram://33554432`
```
Or on Linux:

```
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk
```
Results
nUpdate=1, nScan=0
This write-only transaction shows:

- The benefits of `IMMEDIATE` vs `DEFERRED` are clear, as the locking happens immediately and the transaction does not suffer the upgrade cost.
- `CONCURRENT` does show increased throughput as the number of threads, and with it the number of conflicts, increases. At the lower end it performs slightly worse, indicating some level of bookkeeping cost. Even with only one update per transaction there were more than 1000 retried transactions at 16 threads.
nUpdate=10, nScan=0
This write-only batched transaction shows:

- As expected, batching writes helps a lot: the number of updated rows at 16 threads with `CONCURRENT` increases from ~12k/sec to ~19k/sec.
- `IMMEDIATE` vs `DEFERRED` matters less, presumably because the cost of performing the underlying updates becomes more significant than the cost of upgrading the transaction (the upgrade happens only once per transaction vs the 10 updates).
nUpdate=1, nScan=10
This transaction should expose the weakness of the page-level `CONCURRENT` locking, as the random reads should increase collisions:

- It immediately shows why using `IMMEDIATE` for any transaction that will update is important, versus the cost of upgrading a `DEFERRED` transaction.
- For `CONCURRENT`, these results hold up only because the underlying collisions did not actually increase very much. See the discussion below.
nUpdate=10, nScan=10
This test does not show anything of significance compared with what we have already observed.
nUpdate=0, nScan=10
This read-only batched transaction shows:

- `IMMEDIATE` vs `DEFERRED` shows the impact of pessimistic concurrency control and why you should not default to `IMMEDIATE` for all transactions.
- `CONCURRENT` vs `IMMEDIATE` indicates there is a slight downside to using `CONCURRENT` mode under heavy load, where it performs slightly worse in all cases, but `CONCURRENT` would be a good default option.
Discussion
Some comments about `CONCURRENT`:
Write-ahead Logging
To gain optimistic concurrency control, `CONCURRENT` makes heavy use of the write-ahead log; if you watch the target directory while the `CONCURRENT` stage is running, you will see the `-wal`-suffixed file growing rapidly. This means you should be aware of a few things:

- These benchmarks were run against a temporary ram disk. If the heavy writes of `CONCURRENT` mode do need to be written to real storage, you can expect the write performance of your storage to become important, and it is likely that `IMMEDIATE` mode will be faster. So it is a good idea to benchmark against your actual device.
- Sustained writes to the write-ahead log will not allow SQLite to execute checkpointing (copying the WAL data back into the main SQLite file and truncating the log), so the WAL will continue to grow until storage runs out. SQLite maintains another experimental branch which uses two rotating WAL files to prevent this problem. The bedrock branch incorporates both the `CONCURRENT` and `WAL2` patches; however, Litestream does not support `WAL2`.
Collisions
This chart shows the number of retried transactions per second when executing with `nUpdate=10, nScan=10`.

- Each row should be a maximum of 336 bytes, made up of 8 bytes (`INTEGER` is variable-length encoded) + 200 bytes (`BLOB` raw encoding) + 128 bytes (`TEXT` is `UTF-8` encoded). This means one `4096`-byte page holds a maximum of `12` rows (before overhead), so we expect ~80k pages are used to store the 1 million rows.
- These tests perform updates over those ~80k pages, so the probability of collision is low. If you reduce the number of rows in the table, or change your sampling to something skewed such as most-recent records, the chance of collisions goes up. See this article for programming notes, which advises explicitly against using integer `ROWID` tables, to try to reduce data-locality issues.
- If the number of rows seeded is reduced to `10000`, collisions become so frequent that performance drops hugely with `CONCURRENT` and `IMMEDIATE` takes a huge lead.
Run it yourself
Firstly, the absolute numbers here are not the purpose of this exercise; they should be used for relative comparison only. You can run the benchmarks on your own hardware to see what throughput you receive.

- A multi-platform Docker image is available at ghcr.io/seddonm1/sqlite-bench:latest.
- The code has been pushed to https://github.com/seddonm1/sqlite-bench.
- The excellent https://github.com/rusqlite/rusqlite has been forked to make the minor changes required to build the SQLite begin-concurrent branch at version 3.46.0. To see the modifications, see this compare.