Monday, May 11, 2009

Fastest bulk import into sqlite

I just wanted to populate a sqlite database with 100 million rows of data from a 1.8 gigabyte tab delimited file (tsv).

.echo ON

.read create_table_without_pk.sql

PRAGMA cache_size = 400000;
PRAGMA synchronous = OFF;
PRAGMA journal_mode = OFF;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA count_changes = OFF;
PRAGMA temp_store = MEMORY;
PRAGMA auto_vacuum = NONE;

.separator "\t"
.import a_tab_seprated_table.txt mytable

.read add_indexes.sql


PRAGMA cache_size = 400000 lets sqlite allocate about 460MB of RAM to work with. You can find more detailed explaination of these PRAGMA options on the Sqlite site.

sqlite3 mydb.db < commands.txt

Using this approach, the inserting of data took only about 20 mins.

However, adding a composite primary key (i.e. unique index) and a non-unique composite index ran for 70+ hours before I just canceled it.

If you can pre-sort your data in the order it is indexed, the import speed is practically the same whether you create the index before or after the import. However, if you create the index before the import, the database data and indexes will be intermingled with each other, thus fragmented (but this won’t affect the file size). The fragmentation can be corrected by running the VACUUM command.

There is a powerful ExternalSort implementation as part of the Java Small Text library. Look under DocumentationSmall Text Utils tutorial for info about it.


    I was populating a table with about 50 million records and had estimated it would take about 50 hours (reading and writing to SSD).

    I tried your PRAGMA options about 10 minutes ago and I estimate it'll be finished before I click publish on this comment... ;)

    Thanks! Matt

  2. To confirm, it took just 17 minutes to read 50 million records from a 7GB text file (generated using sqlite ".mode insert newtable") creating a database of 3.6GB on a SSD.