I just wanted to populate a SQLite database with 100 million rows of data from a 1.8 gigabyte tab-delimited file (TSV).
PRAGMA cache_size = 400000 (the value is a page count) lets SQLite allocate about 460MB of RAM to work with. You can find a more detailed explanation of these PRAGMA options on the SQLite site.
Using this approach, inserting the data took only about 20 minutes.
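For what it's worth, the core of that kind of bulk load looks roughly like the sketch below, using Python's built-in sqlite3 module. The file names, table name, column layout, and the journal_mode/synchronous settings are illustrative assumptions; only cache_size is the setting discussed above.

```python
import csv
import sqlite3

conn = sqlite3.connect("data.db")
cur = conn.cursor()

# Large page cache, as described above (value is a number of pages).
cur.execute("PRAGMA cache_size = 400000")
# Assumption: durability is relaxed for a one-off import that can simply be rerun.
cur.execute("PRAGMA journal_mode = OFF")
cur.execute("PRAGMA synchronous = OFF")

# Hypothetical schema; no indexes yet -- they are created after the load.
cur.execute("CREATE TABLE IF NOT EXISTS records (a TEXT, b TEXT, c TEXT)")

with open("data.tsv", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    # sqlite3 opens an implicit transaction before the first INSERT;
    # committing once at the end avoids a commit per row.
    cur.executemany("INSERT INTO records VALUES (?, ?, ?)", reader)
conn.commit()

conn.close()
```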
However, adding a composite primary key (i.e., a unique index) and a non-unique composite index afterwards ran for 70+ hours before I canceled it.
If you can pre-sort your data in the order it is indexed, the import speed is practically the same whether you create the index before or after the import. However, if you create the index before the import, the table data and index pages will be intermingled in the file, and thus fragmented (this won’t affect the file size). The fragmentation can be corrected by running the VACUUM command.
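To make the ordering concrete, here is a minimal sketch of the index-after-import step, with VACUUM shown for the case where you indexed first. The index and column names are placeholders, not an actual schema:

```python
import sqlite3

# Autocommit mode, so VACUUM (which cannot run inside a transaction) is safe.
conn = sqlite3.connect("data.db", isolation_level=None)
cur = conn.cursor()

# Build the indexes only after the bulk load has finished.
# Column names (a, b, c) are hypothetical placeholders.
cur.execute("CREATE UNIQUE INDEX idx_records_pk ON records (a, b)")
cur.execute("CREATE INDEX idx_records_bc ON records (b, c)")

# Only needed if the indexes were created before the import: defragment the file.
cur.execute("VACUUM")

conn.close()
```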
There is a powerful ExternalSort implementation as part of the Java Small Text library. Look under Documentation→Small Text Utils tutorial for info about it.
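If you would rather not pull in a Java dependency, a basic external merge sort is straightforward to sketch yourself. The version below (chunk size, sort key, and file handling are purely illustrative, and it is not the Small Text implementation) sorts a large TSV in memory-sized chunks and merges them with heapq:

```python
import csv
import heapq
import os
import tempfile

def external_sort_tsv(src, dest, key=lambda row: (row[0], row[1]), chunk_rows=1_000_000):
    """Sort a large TSV by `key` using sorted temporary chunks merged with heapq."""
    chunk_paths = []
    with open(src, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        while True:
            # Read up to chunk_rows rows, sort them in memory, spill to a temp file.
            chunk = [row for _, row in zip(range(chunk_rows), reader)]
            if not chunk:
                break
            chunk.sort(key=key)
            fd, path = tempfile.mkstemp(suffix=".tsv")
            with os.fdopen(fd, "w", newline="") as out:
                csv.writer(out, delimiter="\t").writerows(chunk)
            chunk_paths.append(path)

    # Merge the sorted chunks into the destination file.
    files = [open(p, newline="") for p in chunk_paths]
    readers = [csv.reader(f, delimiter="\t") for f in files]
    with open(dest, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for row in heapq.merge(*readers, key=key):
            writer.writerow(row)

    for f in files:
        f.close()
    for p in chunk_paths:
        os.remove(p)
```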
BRILLIANT!
I was populating a table with about 50 million records and had estimated it would take about 50 hours (reading and writing to SSD).
I tried your PRAGMA options about 10 minutes ago and I estimate it'll be finished before I click publish on this comment... ;)
Thanks! Matt
To confirm, it took just 17 minutes to read 50 million records from a 7GB text file (generated using sqlite ".mode insert newtable"), creating a database of 3.6GB on an SSD.
Sweet!
Thank you so much for this! -Lawrence