Python is, in a lot of ways, a very rich language. After years of using it, I still regularly discover new parts of the ecosystem, even in the standard library. In particular, there are a few modules which are not very well known, but can be very useful in some situations. Today I discovered `dbm`, a persistent key/value store.
Quick start:

```python
import dbm

with dbm.open('my_store', 'c') as db:
    db['key'] = 'value'
    print(db.keys())    # [b'key']
    print(db['key'])    # b'value'
    print('key' in db)  # True
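The second argument is the flag: `'r'` opens the store read-only (the default), `'w'` opens it read-write, `'c'` also creates it if it doesn't exist, and `'n'` always creates a new, empty store. For instance, to reopen the store above read-only:

```python
import dbm

# Reopen the existing store in read-only mode
with dbm.open('my_store', 'r') as db:
    print(db['key'])  # b'value'
```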
It behaves a lot like a dict, except:

- It persists its values on disk
- You can only use `str` or `bytes` as keys and values (a workaround for richer values is sketched below)
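If you need richer values, a common workaround is to serialize them yourself. A minimal sketch using `json` (the key naming and the encoding scheme here are my own choices, not something `dbm` provides):

```python
import dbm
import json

with dbm.open('my_store', 'c') as db:
    # Serialize the value to a JSON string before storing it
    db['user:1'] = json.dumps({'name': 'Ada', 'age': 36})
    # json.loads accepts the bytes that dbm returns
    user = json.loads(db['user:1'])
    print(user['name'])  # Ada
```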
It is also slower than a plain dict, but faster than `sqlite3`.
The benchmark consists of performing 10k writes and 10k random reads:
```python
from random import random
import time

operations = 10000
writes = [str(i) for i in range(operations)]
reads = [str(int(random() * operations)) for _ in range(operations)]

# db is the store under test (a dict, a dbm handle, or a sqlite3 wrapper)

# Create some records
for i in writes:
    db[i] = 'x'

# Read values in random order
for i in reads:
    x = db[i]
```
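The totals and per-record figures below come from timing those two loops; a minimal sketch of the measurement (the actual dbm_bench.py may differ in the details):

```python
import time

start = time.perf_counter()
for i in writes:
    db[i] = 'x'
for i in reads:
    x = db[i]
elapsed = time.perf_counter() - start

# One record = one key that is written once and read back once
print(f'total: {elapsed:.3f} seconds, '
      f'{elapsed / operations * 1e6:.5f} μs/record')
```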
Here are the results (the "x N" figure is the slowdown relative to dict):

- `dict` – total: 0.002 seconds, 0.23398 μs/record, x 1
- `dbm` – total: 0.054 seconds, 5.35984 μs/record, x 27
- `sqlite3` (in-memory) – total: 2.468 seconds, 246.84346 μs/record, x 1234
- `sqlite3` (file) – total: 42.407 seconds, 4240.69593 μs/record, x 21207
Why is `sqlite3` so slow? Well, the benchmark is probably not representative of the typical workload for SQLite (lots of individual insertions and selections). If we perform the same operations using `executemany` and one `SELECT` of all the keys at once, we get:
- `:memory:` – total: 0.038 seconds, 3.81415 μs/record, x 19
- file – total: 0.071 seconds, 7.07369 μs/record, x 35
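For reference, a minimal sketch of what that batched variant could look like, assuming the same `writes` list as above (the actual dbm_bench.py may differ):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE store (key TEXT, value TEXT)')

# Insert all records in one batch instead of one INSERT per key
conn.executemany('INSERT INTO store VALUES (?, ?)',
                 [(key, 'x') for key in writes])
conn.commit()

# Read everything back with a single SELECT
rows = dict(conn.execute('SELECT key, value FROM store'))
```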
It’s much better, but still not as fast as `dbm` (when we persist to a file). So, if you have a workload where keys and values are added and retrieved very often, are always `str` or `bytes`, and need to be persisted on disk, `dbm` is a serious contender!
The code used for the benchmark can be found here: dbm_bench.py
Edit 06-11-2022: There was some great feedback on Hacker News yesterday. I’d like to address some of it.
- Using a primary key – It was suggested that adding a `PRIMARY KEY` when creating the table improves the performance of SQLite3, so I’ve updated the benchmark accordingly. It does indeed help (see the results below and the sketch after this list).
- Using the WAL pragma – It was suggested that using WAL journaling instead of the default mode would also make more sense. I have updated the benchmarking script accordingly.
- Using one INSERT at a time vs. `executemany` – It was called out that executing one INSERT command per key instead of batching the operations is less efficient. The benchmark measures both scenarios: one where single INSERTs are made (“sqlite3 (:memory:)” and “sqlite3 (file)”) and one where `executemany` is used to insert and read everything in one batch (“sqlite3 (executemany)”). I’ve added one more benchmark for the use of `executemany` on a `:memory:` SQLite3 database. The bottom line is that in all these cases, SQLite3 is still slower than DBM.
- Using an index with SQLite3 – In order to improve the performance of SQLite3 operations, creating an index can also be beneficial, and this was suggested in the comments. I’ve added two new entries to the benchmark to show the performance when creating an index before insertions, and also after insertions but before reading the values back.
- Which DBM implementation is used – This is something that the initial blog post missed. The results are about the GNU implementation of DBM, which the new benchmarking code now explicitly imports. The performance might vary vastly between implementations (the worst case being if the `dbm.dumb` module is used).
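To make these changes concrete, here is roughly what the optimized SQLite3 setup and the explicit DBM import look like (a sketch; the exact statements are in the updated dbm_bench.py):

```python
import sqlite3

conn = sqlite3.connect('bench.db')
# WAL journaling instead of the default rollback journal
conn.execute('PRAGMA journal_mode=WAL')
# Declaring the key as PRIMARY KEY also indexes it
conn.execute('CREATE TABLE store (key TEXT PRIMARY KEY, value TEXT)')
# The "index before/after insertions" variants use a plain table instead,
# and run this statement either before or after the INSERTs:
# conn.execute('CREATE INDEX idx_key ON store (key)')
```

And for pinning the DBM implementation:

```python
import dbm.gnu

# Use the GNU implementation explicitly, rather than whatever dbm.open() picks
db = dbm.gnu.open('my_store', 'c')
```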
With all the above, I have updated the benchmarks and here are the new results obtained with CPython 3.11.0, best of 5 runs:
- `dict` – took 0.001 seconds, 0.11792 microseconds / record
- `dbm` – took 0.009 seconds, 0.88954 microseconds / record
- `sqlite3, executemany, :memory:` – took 0.029 seconds, 2.86062 microseconds / record
- `sqlite3, executemany, file` – took 0.045 seconds, 4.52690 microseconds / record
- `sqlite3, executemany, index after insertions` – took 0.044 seconds, 4.37975 microseconds / record
- `sqlite3, executemany, index before insertions` – took 0.048 seconds, 4.84734 microseconds / record
- `sqlite3, :memory:` – took 0.082 seconds, 8.19964 microseconds / record
- `sqlite3, file` – took 4.661 seconds, 466.07621 microseconds / record
With all the optimizations brought to the SQLite3 version, `dbm.gnu` still has an edge on this very synthetic benchmark. And that is, I think, the bottom line of this article. It is only that, a synthetic benchmark, and you should probably not read too much into it, unless your particular use case involves writing 10,000 random values then reading them back and nothing else. As with everything, you should first perform your own measurements based on your own use case and constraints, then decide which technology best serves your needs.
3 comments
Discussion on Hacker News: https://news.ycombinator.com/item?id=32849592
The consensus seems to be that sqlite3 is faster after all if you add the obvious index.
Thanks @arthur-tacca, I’ve taken some time to look at the feedback and update the article (see bottom edit). The benchmarking script has also been updated with the suggested optimizations (index, primary key, etc.), but at least when I run the benchmark, `dbm` still comes out ahead. Happy to get any feedback on that.

Replit has an internal kvstore, and I was searching for a local alternative and found your post. Thanks for sharing your knowledge, and making it more informative with the benchmarks. Learnt more from the benchmarks too.