../

Getting the most out of SQLite3 with Python

└─ 2017-10-21 • Reading time: ~7 minutes

I’ve recently made heavy use of sqlite3 for a project involving a lot of data and processing. My first attempt involved no database at all, and all data would be kept in memory and queries would consist in a mix of dictionary lookups, iteration, conditions, etc. This was nice, but there is only so much you can fit in memory, and re-generating/loading the data from disk to memory became a tedious and time consuming process.

I decided to give sqlite3 a try. This allowed an increase in the amount of data that could be processed, and reduced the loading time of the application to nothing, since only opening a connection to the database was needed. Moreover, I could replace a lot of Python logic by SQL queries.

I’d like to share a few learnings and findings about this experience.

TL;DR:

  1. Use bulk operations (a.k.a. executemany).
  2. You don’t need cursors (most of the time).
  3. Cursors can be iterated upon.
  4. Use context managers.
  5. Use pragmas (when it makes sense).
  6. Postpone index creation.
  7. Use placeholders to interpolate python values.

Use Bulk Operations

If you need to insert a lot of rows at once in your database, you really should not use execute. The sqlite3 module provides a way to bulk insertions: executemany.

Instead of doing something like:

for row in iter_data():
    connection.execute('INSERT INTO my_table VALUES (?)', row)

You can leverage the fact that executemany accepts as argument a generator of tuples:

connection.executemany(
    'INSERT INTO my_table VALUE (?)',
    iter_data()
)

This is not only more concise, it’s also much more efficient. In fact, sqlite3 implements execute using executemany behind the scene, but the former inserts a single row instead of many.

I wrote a small benchmark which consists in inserting a million rows into an empty table (the database lives in memory):

  • executemany: 1.6 seconds
  • execute: 2.7 seconds

You Don’t Need Cursors

most of the time.

One thing I often found confusing at the beginning, was cursor management. Examples online and in the documentation often look like:

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
# Do something with cursor

But most of the time you don’t need a cursor at all, and you can directly use the connection object (it is mentioned at the end of the documentation).

Operations such as execute and executemany can be called directly on the connection and will return a cursor. Here is an example to demonstrate that:

import sqlite3

connection = sqlite3.connect(':memory:')

# Create a table
connection.execute('CREATE TABLE events(ts, msg)')

# Insert values
connection.executemany(
    'INSERT INTO events VALUES (?,?)',
    [
        (1, 'foo'),
        (2, 'bar'),
        (3, 'baz')
    ]
)

# Print inserted rows
for row in connection.execute('SELECT * FROM events'):
    print(row)

Cursors Can Be Iterated Upon

You might often see examples making use of fetchone or fetchall on the result of a SELECT query. But I find that the most natural way to consume the results is to actually iterate on the cursor directly:

for row in connection.execute('SELECT * FROM events'):
    print(row)

This way, you can stop as soon as you got enough results and not waste resources. Of course, if you know beforehand how many results you want, you can use the LIMIT SQL statement instead, but Python generators are very handy and allow you to decouple data generation from data consumption.

Use Context Managers

Shit happens, even in the middle of a SQL transaction. To avoid having to deal manually with rollback or commit, you can simply use the connection object as a context manager. In the following example we create a table, and insert by mistake duplicated values:

import sqlite3
connection = sqlite3.connect(':memory:')

with connection:
    connection.execute(
        'CREATE TABLE events(ts, msg, PRIMARY KEY(ts, msg))')

try:
    with connection:
        connection.executemany('INSERT INTO events VALUES (?, ?)', [
            (1, 'foo'),
            (2, 'bar'),
            (3, 'baz'),
            (1, 'foo'),
        ])
except (sqlite3.OperationalError, sqlite3.IntegrityError) as e:
    print('Could not complete operation:', e)

# No row was inserted because transaction failed
for row in connection.execute('SELECT * FROM events'):
    print(row)

connection.close()

Use Pragmas

when it makes sense

There are a few pragmas you can use to tweak the behavior of sqlite3 in your program. In particular, one that could improve the performance is synchronous:

connection.execute('PRAGMA synchronous = OFF')

You should be aware though that this can be dangerous. If the application crashes unexpectedly in the middle of a transaction, the database will probably be left in an inconsistent state. So use with care! But if you want to insert a lot of rows faster, that can be an option. A safer option is to use the WAL option instead of disabling synchronous completely.

connection.execute('PRAGMA journal_mode = WAL')

Postpone Index Creation

Let’s say you need a few indices on your database, and you also need to insert a lot of rows while creating them. Postponing the creation of the indices to after all rows have been inserted could result in a substantial performance improvement.

Use Placeholders to Interpolate Python Values

It is tempting to use Python string operations to include values into queries. Do not! This is highly insecure, and sqlite3 gives you a better way to do it:

# Do not do this!
my_timestamp = 1
c.execute("SELECT * FROM events WHERE ts = '%s'" % my_timestamp)

# Do this instead
my_timestamp = (1,)
c.execute('SELECT * FROM events WHERE ts = ?', my_timestamp)

Also, string interpolation using Python %s (or format, or formatted string literals) does not go well with executemany. So there is really no point in trying!


Keep in mind though that these tips might or might not give you a benefit, depending on your specific use-case. You should always try for yourself and decide if it’s worth it or not.

Edit 28-12-2017: Thanks for all the great feedback from Reddit and Hacker News. I took the liberty to amend the original article with a few of the suggestions.

There are a few topics that were not mentioned in this article but are definitely worth reading:

  • Using transactions can dramatically improve the speed of your code if you need to run several SQL statements in a row (it should not be needed if you use executemany).
  • If you don’t need to re-use your database across sessions, you can use an in-memory database by specifying :memory: as a location, which should give you a nice speed-up.
  • You can customize row_factory to get something more useful than tuples as results from SELECT queries.
  • Consider changing isolation_level to DEFERRED or IMMEDIATE.

Edit 30-05-2021: Fixed a missing .connect in the “You Don’t Need Cursors” section. Thanks to Gelma for noticing.

Leave a comment on GitHub
5 comments
  • GitHub avatar from user bk322 bk322 commented 3 years ago

    How do I execute dot directives:

    .mode csv
    .header on
    
    .once "output.csv"
    
    select
      *
    from my_table
    ;
    

    I tried to run it via conn.executescript but it doesn’t work.

  • GitHub avatar from user bk322 bk322 commented 3 years ago

    Nevermind, seems to be impossible: https://stackoverflow.com/q/2346074/788700

  • GitHub avatar from user Justsoos Justsoos commented a year ago
    conn.execute('''CREATE TABLE dlt (
        id integer not null primary key autoincrement unique,
        a integer,
        b integer,
        c integer,
        d integer,
        e integer,
        f integer,
        g integer,
        time timestamp default current_timestamp
    )''')
    ####
    # data is a List of list
    conn.executemany("INSERT OR IGNORE INTO dlt(id,a,b,c,d,e,f,g) VALUES (?,?,?,?,?,?,?,?)", [x.split()[0:8] for x in data])
     
    

    a complex use which

  • GitHub avatar from user Gelma Gelma commented 3 months ago

    Thanks a lot for your work.

    Just a note: even when you don’t use cursor you have to use

    connection = sqlite3.connect(‘:memory:’)

    instead of

    connection = sqlite3(‘:memory:’)

    or it complains with:

    TypeError: ‘module’ object is not callable

    Also you have to invoke

    connection.commit()

    to keep data (yes, in this case it doesn’t matter because of db in memory, but could be confusing for people trying on files).

    Thanks a lot, Gelma

  • Thanks @Gelma. I have updated the post with the missing .connect.