Wednesday, March 19, 2014

OpenJUMP and GeoPackage Part 3: Make it fast

GeoPackage file is a full relational SQLite database which is stored into one physical data file. Knowing this reveals the two most effective ways for making GeoPackage faster:

  1. Create appropriate indexes. A database query without an index must check every row of the table. If the aim is to select only features which  have certain attribute values a search from index can cut down the the number of rows which need to be scanned dramatically.
  2. SQLite data file can get fragmented. Reading data from the data file is most efficient when the data are stored into contiguous chunks. VACUUM is a tool for optimizing the data file.

When to index?

There is no need for indexing if GeoPackage is used only as a substitute for shapefiles and layers are read as they are without any filtering. Indexes are good to have if:

  • SQL query contains a selective WHERE clause which leaves lots of features out of the result set.
  • The query is executed frequently. Ten seconds once in a lifetime is not much but a ten second lag in refreshing a map  after each panning or zooming may be intolerable.
  • Table is relatively big. SQLite is fast and if a table has less than 10000 rows index does not make much difference.
  • Size limit is lower if there are joins in the query. One thousand rows in one table tested against thousand rows in another table makes a million comparisons.

How to index with OpenJUMP

Let's check first the initial speed of an unindexed GeoPackage by executing some reasonable SQL SELECT statement. It is good to repeat the query couple of times and ensure that operating system is utilizing caching effectively or otherwise the result can be unrealistically slow. Following query selects a few hundred features from a table that contains 1.2 million rows and it seems to take about 3 seconds. That is not especially slow but because the table is biggish and most rows are discarded it should be possible to make it faster with indexes.

Query is using column "knro" and "pinta_ala" for filtering so we will index them both. Indexes can be created directly from OpenJUMP with the DB Query plugin.


 Now we can repeat the test. This time we do not want to get unrealistically fast result that may come directly from cache so we will use new unused values in the WHERE clause.

 DB Query plugin shows execution times in full seconds but reference measurement with another tool revealed that the initial query took 2.78 seconds. Indexes made the query to run about 30 times faster in only 0.08 seconds.

When and how to VACUUM

SQLite data file may get fragmented after subsequent deletes and inserts. Deletes leave empty spaces into the data file and new inserted records are first written into these holes. However, it is also possible to create a GeoPackage database which is suboptimal from the very beginning. This happens if data are appended into same database tables from, for example, very many shapefiles. New data are written into the end of the GeoPackage file and extented indexes must also be written into end of the file. Finally both the data and indexes of the table are stored into the GeoPackage data file into at least as many chunks as was the number of original shapefiles.

Vacuum is a tool that re-organizes SQLite data file and corrects fragmentation. Fragmented GeoPackage tables are a real concern only if the final table has been collected from thousands or tens of thousands appends and that alone does not usually make vacuuming necessary. However, a GeoPackage data file can be unnecessary big and vacuuming corrects also that issue. The reason for a possibly unnecessary big file size is that deleting data from SQLite does not shrink the file but unused space is only marked as free. On the other hand, vacuuming is forcing a rebuild of the database and that reduces the size of the data file.

So how to know if there is a need to vacuum? When is comes to unused space, it is possible to evaluate how much the file size could be reduced by comparing the total page count of the database and the count of free pages. Commands can be sent with DB Query and result can be read from attributes of the new layer that will be created.

pragma page_count;
pragma freelist_count;

However, the fragmentation rate of tables cannot be checked with any SQL or pragma query. As a rule of a thumb, consider vacuuming: 
  • When you receive a new GeoPackage which is at least a few gigabytes in size and you want to feel comfortable.
  • When you are going to send a GeoPackage to other users.
  • If you have done a massive number of inserts, updates and deletes and speed matters.
  • After large deletes if file size matters.

 Vacuuming itself is simple and it can be done with DB Query plugin. Notice that vacuuming needs  at least two times more free space in the data directory than the size of the database file. Vacuuming a multi-gigabyte data file will also take quite a long time.