Monday, May 11, 2009

Fastest bulk import into sqlite

I just wanted to populate a sqlite database with 100 million rows of data from a 1.8 gigabyte tab delimited file (tsv).

commands.txt
.echo ON

.read create_table_without_pk.sql

PRAGMA cache_size = 400000;
PRAGMA synchronous = OFF;
PRAGMA journal_mode = OFF;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA count_changes = OFF;
PRAGMA temp_store = MEMORY;
PRAGMA auto_vacuum = NONE;

.separator "\t"
.import a_tab_seprated_table.txt mytable

BEGIN;
.read add_indexes.sql
COMMIT;

.exit

PRAGMA cache_size = 400000 lets sqlite allocate about 460MB of RAM to work with. You can find more detailed explaination of these PRAGMA options on the Sqlite site.

sqlite3 mydb.db < commands.txt

Using this approach, the inserting of data took only about 20 mins.

However, adding a composite primary key (i.e. unique index) and a non-unique composite index ran for 70+ hours before I just canceled it.

If you can pre-sort your data in the order it is indexed, the import speed is practically the same whether you create the index before or after the import. However, if you create the index before the import, the database data and indexes will be intermingled with each other, thus fragmented (but this won’t affect the file size). The fragmentation can be corrected by running the VACUUM command.

There is a powerful ExternalSort implementation as part of the Java Small Text library. Look under DocumentationSmall Text Utils tutorial for info about it.

Friday, May 1, 2009

Compounding Hibernate Problems

Platform information
Eclipse Platform
Version: 3.4.2
Build id: M20090211-1700

Hibernate Tools 3.2.4.GA-R200903141626-H5
and
Hibernate Tools 3.2.4.GA-N200904280907-H52

Windows XP Pro SP3

JRE 6

Tested on Oracle 10g and Apache Derby

Two problems with Hibernate Tools reverse engineering:

  1. Unable to create a relationship from a non-primary key column in one table to one column of a composite primary key in another table, and visa-versa.

  2. Creating a relationship from a column in a composite primary key of one table to a non-primary key column in another table produces incorrect mapping file (I have submitted a poorly written bug report for this one).

These issues will be explorered in the following example. For simplicity we’ll use an Apache Derby database.

Start the database, and connect to it.

jdbc:derby://localhost:1527/myDB;create=true;user=me;password=mine

Create these tables in the myDB database, in the ME schema.

ddl
create table vehicles (
    make_id integer,
    model_id integer,
    name varchar(50),
    PRIMARY KEY (make_id, model_id)
);

create table owners (
    owner_id integer PRIMARY KEY,
    favorite_make_id integer,
    name varchar(20)
);

insert into vehicles values (1, 0, 'Chevrolet Corvette');
insert into vehicles values (1, 1, 'Chevrolet Malibu');
insert into vehicles values (2, 0, 'Ford Focus');
insert into vehicles values (2, 1, 'Ford Mustang');
insert into vehicles values (3, 0, 'Honda Accord');
insert into vehicles values (3, 1, 'Honda Odyssey');
insert into vehicles values (4, 0, 'Toyota Avalon');
insert into vehicles values (4, 1, 'Toyota Camry');

insert into owners values (50, 3, 'Jim');
insert into owners values (51, 1, 'Alex');
insert into owners values (52, 4, 'Sue');
insert into owners values (53, 1, 'Casey');
insert into owners values (54, 3, 'Ted');
insert into owners values (55, 2, 'Mary');

Create this Hibernate Configuration file.

.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-configuration PUBLIC
        "-//Hibernate/Hibernate Configuration DTD 3.0//EN"
        "http://hibernate.sourceforge.net/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
    <session-factory>
        <property name="hibernate.connection.driver_class">org.apache.derby.jdbc.EmbeddedDriver</property>
        <property name="hibernate.connection.url">jdbc:derby://localhost:1527/myDB</property>
        <property name="hibernate.connection.username">me</property>
        <property name="hibernate.connection.password">mine</property>
        <property name="hibernate.dialect">org.hibernate.dialect.DerbyDialect</property>
        <property name="hibernate.current_session_context_class">thread</property>
        <property name="hibernate.default_schema">ME</property>
        <property name="hibernate.show_sql">true</property>
    </session-factory>
</hibernate-configuration>

Setup a Hibernate Console Configuration with the configuration above and connection to the Derby database.

Create this Hibernate reverse engineering configuration.

.reveng.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-reverse-engineering PUBLIC
    "-//Hibernate/Hibernate Reverse Engineering DTD 3.0//EN"
    "http://hibernate.sourceforge.net/hibernate-reverse-engineering-3.0.dtd" >

<hibernate-reverse-engineering>
    <table-filter match-name="OWNERS"/>
    <table-filter match-name="VEHICLES"/>

    <table name="OWNERS">
        <foreign-key foreign-table="VEHICLES">
            <column-ref local-column="FAVORITE_MAKE_ID" foreign-column="MAKE_ID" />
        </foreign-key>
    </table>

</hibernate-reverse-engineering>

Setup a Hibernate Code Generation configuration like this:
[Main tab] [Exporters tab]

Trying to run it produces this error:

org.hibernate.MappingException: Foreign key (FK8BBD58205C954C50:OWNERS [FAVORITE_MAKE_ID])) must have same number of columns as the referenced primary key (VEHICLES [MAKE_ID,MODEL_ID])
Foreign key (FK8BBD58205C954C50:OWNERS [FAVORITE_MAKE_ID])) must have same number of columns as the referenced primary key (VEHICLES [MAKE_ID,MODEL_ID])

Searching for "must have same number of columns as the referenced primary key" on Google returns 6 pages of results. You can find references to this error as early as 2005.

The Other Way (incorrect)

So let’s try the other way, even though it’s incorrect.

Change the Hibernate reverse engineering configuration to this.

.reveng.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-reverse-engineering PUBLIC
    "-//Hibernate/Hibernate Reverse Engineering DTD 3.0//EN"
    "http://hibernate.sourceforge.net/hibernate-reverse-engineering-3.0.dtd" >

<hibernate-reverse-engineering>
    <table-filter match-name="OWNERS"/>
    <table-filter match-name="VEHICLES"/>

    <table name="VEHICLES">
        <foreign-key foreign-table="OWNERS">
            <column-ref local-column="MAKE_ID" foreign-column="FAVORITE_MAKE_ID" />
        </foreign-key>
    </table>

</hibernate-reverse-engineering>

Now the Code Generation configuration will run, but it produces an incorrect mapping file.

Vehicles.hbm.xml
<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
"http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
    <class name="Vehicles" table="VEHICLES">
        <composite-id name="id" class="VehiclesId">
            <key-property name="makeId" type="int">
                <column name="MAKE_ID" />
            </key-property>
            <key-property name="modelId" type="int">
                <column name="MODEL_ID" />
            </key-property>
        </composite-id>
        <many-to-one name="owners" class="Owners" update="false" insert="false" fetch="select">
            <column name="MAKE_ID" not-null="true" />
        </many-to-one>
        <property name="name" type="string">
            <column name="NAME" length="50" />
        </property>
    </class>
</hibernate-mapping>

The many-to-one tag should have this additional attribute:

property-ref="favoriteMakeId"

Because it doesn’t, this code will fail:

Main.java
import org.hibernate.cfg.AnnotationConfiguration;
import org.hibernate.*;

public class Main {

    public static void main(String[] args) {
        System.out.println("Connecting to Derby database.");
        SessionFactory sf = new AnnotationConfiguration().configure().buildSessionFactory();
        Session session = sf.getCurrentSession();

        System.out.println("Querying for some vehciles.");
        session.beginTransaction();
        Query q = session.createQuery("from Vehicles as v where v.id.modelId = 0");
        java.util.List<Vehicles> vehicleList = q.list();

        System.out.println("Got " + vehicleList.size() + " matches.");

        for (Vehicles vehicle : vehicleList) {

            System.out.println("Getting an owner that likes the maker of " + vehicle.getName());
            Owners owner = vehicle.getOwners();
            try {
                System.out.println(owner.getName() + " likes it.");
            } catch (ObjectNotFoundException ex) {
                ex.printStackTrace();
                System.err.println("This fails because it was matching VEHCILES.MAKE_ID to OWNERS.OWNER_ID");
                System.err.println("when it SHOULD have matched to OWNERS.FAVORITE_MAKE_ID like the reveng says.");
            }
        }
    }

}

Program output:

Connecting to Derby database.
Querying for some vehciles.
Hibernate: select vehicles0_.MAKE_ID as MAKE1_0_, vehicles0_.MODEL_ID as MODEL2_0_, vehicles0_.NAME as NAME0_ from ME.VEHICLES vehicles0_ where vehicles0_.MODEL_ID=0
Got 4 matches.
Getting an owner that likes the maker of Chevrolet Corvette
Hibernate: select owners0_.OWNER_ID as OWNER1_1_0_, owners0_.FAVORITE_MAKE_ID as FAVORITE2_1_0_, owners0_.NAME as NAME1_0_ from ME.OWNERS owners0_ where owners0_.OWNER_ID=?
org.hibernate.ObjectNotFoundException: No row with the given identifier exists: [Owners#1]
    at org.hibernate.impl.SessionFactoryImpl$2.handleEntityNotFound(SessionFactoryImpl.java:409)
    at org.hibernate.proxy.AbstractLazyInitializer.checkTargetState(AbstractLazyInitializer.java:108)
    at org.hibernate.proxy.AbstractLazyInitializer.initialize(AbstractLazyInitializer.java:97)
    at org.hibernate.proxy.AbstractLazyInitializer.getImplementation(AbstractLazyInitializer.java:140)
    at org.hibernate.proxy.pojo.javassist.JavassistLazyInitializer.invoke(JavassistLazyInitializer.java:190)
    at Owners_$$_javassist_1.getName(Owners_$$_javassist_1.java)
    at Main.main(Main.java:23)
This fails because it was matching VEHCILES.MAKE_ID to OWNERS.OWNER_ID
...

However, even if you manually add the property-ref, when you run it with the example program, it returns an error of “More than one row with the given identifier was found”.

Connecting to Derby database.
Querying for some vehciles.
Hibernate: select vehicles0_.MAKE_ID as MAKE1_0_, vehicles0_.MODEL_ID as MODEL2_0_, vehicles0_.NAME as NAME0_ from ME.VEHICLES vehicles0_ where vehicles0_.MODEL_ID=0
Hibernate: select owners0_.OWNER_ID as OWNER1_1_0_, owners0_.FAVORITE_MAKE_ID as FAVORITE2_1_0_, owners0_.NAME as NAME1_0_ from ME.OWNERS owners0_ where owners0_.FAVORITE_MAKE_ID=?
Exception in thread "main" org.hibernate.HibernateException: More than one row with the given identifier was found: 1, for class: Owners
    at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:92)
    at org.hibernate.loader.entity.EntityLoader.loadByUniqueKey(EntityLoader.java:108)
    at org.hibernate.persister.entity.AbstractEntityPersister.loadByUniqueKey(AbstractEntityPersister.java:1672)
    at org.hibernate.type.EntityType.loadByUniqueKey(EntityType.java:641)
    at org.hibernate.type.EntityType.resolve(EntityType.java:415)
    at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:139)
    at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:877)
    at org.hibernate.loader.Loader.doQuery(Loader.java:752)
    at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:259)
    at org.hibernate.loader.Loader.doList(Loader.java:2228)
    at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2125)
    at org.hibernate.loader.Loader.list(Loader.java:2120)
    at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:401)
    at org.hibernate.hql.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:361)
    at org.hibernate.engine.query.HQLQueryPlan.performList(HQLQueryPlan.java:196)
    at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1148)
    at org.hibernate.impl.QueryImpl.list(QueryImpl.java:102)
    at Main.main(Main.java:14)

This is because the example column has non-unique entries. If it only had unique entires, our manual property-ref fix would work.

Doing it manually

Hibernate reverse engineering has pretty much no support for special table relationships (e.g. a non-inverse one-to-many). The only real solution to this problem is to manually create the mapping files and classes yourself.

Some links that probably won’t help at all:

Trying to use a composite-id with key-many-to-one won’t work because the foreign column needs to be a primary key.

Here is how the mapping of Vehicles → Owners should be:

Vehicles.hbm.xml
<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
"http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
    <class name="Vehicles" table="VEHICLES">
        <composite-id mapped="false" >
            <key-property name="makeId" type="int" >
                <column name="MAKE_ID" />
            </key-property>
            <key-property name="modelId" type="int">
                <column name="MODEL_ID" />
            </key-property>
        </composite-id>
        <property name="name" type="string">
            <column name="NAME" length="50" />
        </property>
        <bag name="owners">
            <key>
                <column name="FAVORITE_MAKE_ID" />
            </key>
            <one-to-many class="Owners" />
        </bag>
    </class>
</hibernate-mapping>

Some problems here:

  • Any kind of collection mapping (set, list, bag, array, map) require a key element be defined.
  • The key generates an error because it doesn’t map the “same number of columns” (i.e. it needs 2 columns to match the 2 column primary key). It’s saying it needs another key defined.
  • So why not just get rid of the primary key? Because the class element requires a id or composite-id element. And we can’t make just MAKE_ID or MODEL_ID the primary key because those won’t be unique.

Solution

At long last! For Owners → Vehicles, this actually works.

Owners.hbm.xml
<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
"http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
    <class name="Owners" table="OWNERS">
        <id name="ownerId" type="int">
            <column name="OWNER_ID" />
            <generator class="assigned" />
        </id>
        <property name="favoriteMakeId" type="int">
            <column name="FAVORITE_MAKE_ID" />
        </property>
        <property name="name" type="string">
            <column name="NAME" length="20" />
        </property>
        <set name="vehicles" inverse="false">
            <key property-ref="favoriteMakeId" column="MAKE_ID" />
            <one-to-many class="Vehicles" />
        </set>
      </class>
</hibernate-mapping>

There still doesn’t appear to be any method to properly associate Vehicles → Owners, but thankfully the mapping file above is all I really need for my case.

There also doesn’t appear to be any way to produce a mapping file like this using Hibernate Tools reverse engineering, so I just have to make it manually.

Going back through some of the many links I visited while searching for help, I noticed this page also basically explains the solution above.