Wednesday, March 19, 2014

OpenJUMP and GeoPackage Part 3: Make it fast

GeoPackage file is a full relational SQLite database which is stored into one physical data file. Knowing this reveals the two most effective ways for making GeoPackage faster:

  1. Create appropriate indexes. A database query without an index must check every row of the table. If the aim is to select only features which  have certain attribute values a search from index can cut down the the number of rows which need to be scanned dramatically.
  2. SQLite data file can get fragmented. Reading data from the data file is most efficient when the data are stored into contiguous chunks. VACUUM is a tool for optimizing the data file.

When to index?

There is no need for indexing if GeoPackage is used only as a substitute for shapefiles and layers are read as they are without any filtering. Indexes are good to have if:

  • SQL query contains a selective WHERE clause which leaves lots of features out of the result set.
  • The query is executed frequently. Ten seconds once in a lifetime is not much but a ten second lag in refreshing a map  after each panning or zooming may be intolerable.
  • Table is relatively big. SQLite is fast and if a table has less than 10000 rows index does not make much difference.
  • Size limit is lower if there are joins in the query. One thousand rows in one table tested against thousand rows in another table makes a million comparisons.

How to index with OpenJUMP

Let's check first the initial speed of an unindexed GeoPackage by executing some reasonable SQL SELECT statement. It is good to repeat the query couple of times and ensure that operating system is utilizing caching effectively or otherwise the result can be unrealistically slow. Following query selects a few hundred features from a table that contains 1.2 million rows and it seems to take about 3 seconds. That is not especially slow but because the table is biggish and most rows are discarded it should be possible to make it faster with indexes.


Query is using column "knro" and "pinta_ala" for filtering so we will index them both. Indexes can be created directly from OpenJUMP with the DB Query plugin.

  

 Now we can repeat the test. This time we do not want to get unrealistically fast result that may come directly from cache so we will use new unused values in the WHERE clause.

 DB Query plugin shows execution times in full seconds but reference measurement with another tool revealed that the initial query took 2.78 seconds. Indexes made the query to run about 30 times faster in only 0.08 seconds.

When and how to VACUUM


SQLite data file may get fragmented after subsequent deletes and inserts. Deletes leave empty spaces into the data file and new inserted records are first written into these holes. However, it is also possible to create a GeoPackage database which is suboptimal from the very beginning. This happens if data are appended into same database tables from, for example, very many shapefiles. New data are written into the end of the GeoPackage file and extented indexes must also be written into end of the file. Finally both the data and indexes of the table are stored into the GeoPackage data file into at least as many chunks as was the number of original shapefiles.


Vacuum is a tool that re-organizes SQLite data file and corrects fragmentation. Fragmented GeoPackage tables are a real concern only if the final table has been collected from thousands or tens of thousands appends and that alone does not usually make vacuuming necessary. However, a GeoPackage data file can be unnecessary big and vacuuming corrects also that issue. The reason for a possibly unnecessary big file size is that deleting data from SQLite does not shrink the file but unused space is only marked as free. On the other hand, vacuuming is forcing a rebuild of the database and that reduces the size of the data file.

So how to know if there is a need to vacuum? When is comes to unused space, it is possible to evaluate how much the file size could be reduced by comparing the total page count of the database and the count of free pages. Commands can be sent with DB Query and result can be read from attributes of the new layer that will be created.

pragma page_count;
pragma freelist_count;

However, the fragmentation rate of tables cannot be checked with any SQL or pragma query. As a rule of a thumb, consider vacuuming: 
  • When you receive a new GeoPackage which is at least a few gigabytes in size and you want to feel comfortable.
  • When you are going to send a GeoPackage to other users.
  • If you have done a massive number of inserts, updates and deletes and speed matters.
  • After large deletes if file size matters.

 Vacuuming itself is simple and it can be done with DB Query plugin. Notice that vacuuming needs  at least two times more free space in the data directory than the size of the database file. Vacuuming a multi-gigabyte data file will also take quite a long time.


 



Friday, February 21, 2014

OpenJUMP and GeoPackage Part 2: Spatial index with libgpkg

OpenJUMP DB Query Plugin adds native parsing of geometry BLOBs into standard SQLite database engine. Therefore no additional libraries are needed for reading geodata from OGC GeoPackages or Spatialite databases. Normal attribute queries with SQL are also supported out-of-the-box as long as they use the SQL language as understood by the standard SQLite engine. However, by installing libpgkg as an extension OpenJUMP users can create spatial indexes on GeoPackage tables and make BBOX queries much faster. A Drawback is that libgpkg is a platform dependent component which makes installation a little bit harder.

How to install, test and use libgpkg with OpenJUMP


Libgpkg is used as a SQLite Run-Time Loadable Extension. Windows users can get ready made binaries directly from libgpkg download page. Linux and Mac users must read the project home page and follow the installation instructions. OpenJUMP users must notice that we are leaving the platform independent zone now and also Windows users must select if they are going to run OpenJUMP on 32-bit or 64-bit jre and download the right binaries.

Place the libgpkg extention (on Windows "gpkg.dll") into directory [home-of-OpenJUMP-PLUS]\lib\ext\. If you plan to run both 32-bit and 64-bit jre sometimes you can rename the files into "gpkg_32.dll" and "gpkg_64.dll" and keep them both available in the directory. Now you are ready to start OpenJUMP but read first the following note:


Inportant note:
By now the DB Query Plugin comes with SQLite jdbc driver file "sqlite-jdbc-3.7.2.jar". The libgpkg extension will not work with it. Download a new driver "sqlite-jdbc4-3.8.2-SNAPSHOT.jar" from the Xerial downloads and place the new file into directory [home-of-OpenJUMP-PLUS]\lib\ext\jumpdbquery-1.0.0\.  Close OpenJUMP, delete the old jdbc driver and start again.

Now we are ready for a test.

The DB Query Plugin tries to load a SQLite extension if it finds an option "?spatialite=file_name" after the name of the SQLite database file. Extension loader was made for loading Spatialite extension but it works similarly for all valid extensions. If there is any error with loading the extension, all SQL queries will fail too.


This test proves that a special function "ST_MinX" from the libgpkg extension is available and obviously all the other libgpkg functions as well. See the list of functions at https://bitbucket.org/luciad/libgpkg/overview







  
Next something useful that we can't do with the standard SQLite functions. Let's build a spatial query for selecting features which are totally or partly inside a 10 km by 10 km selection box.  Or to be exact, whose bounding boxes are totally or partly inside the selection box.



Nine seconds to select 1567 polygons out of 1,2 million.  It is not hopeless, but thanks to libgpkg we can be make it a lot faster.


One of the libgpkg functions is "CreateSpatialIndex" which, not surprisingly, creates a spatial index.  Let's run it and modify our spatial query so that it utilizes this brand new  index.








Much better now. Accurate timing is obviously slightly more than zero seconds but for sure less than one second. It may be surprising, but BBOX queries from SQLite are often considerably faster than from the heavy weight spatial databases. What is interesting is that the query that hits the spatial index is using only standard SQLite functions. If GeoPackage database is delivered so that it contains spatial indexes OpenJUMP can do flaming fast spatial BBOX queries without platform dependent extensions. However, libgpkg is needed for creating the spatial indexes because it is the only open source library that can do it at the moment.

Template for making a spatial query from index


Spatial index in GeoPackage, as well as in Spatialite, does not work transparently as it does in PostGIS. For utilizing the spatial index a spatial query must always contain a subquery which selects the IDs of the candidate features from the index which is actually a virtual table that holds the minx, miny, maxx and maxy values for each feature.  The logic of the subquery may feel odd at the beginning but the image below hopefully clarifies why it makes sense to compare minimum bounds of features to maximum bounds of the BBOX and vice versa.


Here is a copy-paste template for selecting all features whose bounding boxes are partly or totally inside the BBOX:

select * from tablename
where rowid in
(select id from rtree_tablename_geom
where
minx<BBOX_maxx and
maxx>BBOX_minx and
miny<BBOX_maxy and
maxy>BBOX_miny);



Troubleshooting



Probable reason:
Sqlite-jdbc driver must be updated.





Probable reasons:
  • Typo in the dll name
  • Dll file not on the OpenJUMP search path
  • Wrong version of dll-file (32-bit/64-bit) 


Obvious reason:
Missing parameter "?spatialite=" for loading the extension

Wednesday, February 19, 2014

OpenJUMP PLUS reads OGC Geopackages

OGC GeoPackage is a new international standard which defines an open, non-proprietary, platform-independent SQLite container for distribution and direct use of geospatial data, including vector features and tile matrix sets. It was adopted by Open Geospatial Consortium (OGC®) as an international standard on 13 February 2014.

Thanks to Larry Reeder, OpenJUMP PLUS has already a fast, native, platform-independent support for reading vector data from OGC GeoPackage data files.


Create GeoPackage with GDAL from a shapefile or OpenStreetMap data


First we need a GeoPackage with some vector data. An excellent tool for creating such is GDAL, which has had an GPKG driver since end of January, 2014. The GPKG driver will be included in GDAL version 1.11 but by now the GeoPackage support is available only in GDAL development versions.  The following instruction shows hot to create GeoPackeges on Windows but GDAL works in a similar way also on Linux and Mac. Just remember that before GDAL 1.11 is released a development version from February 2014 or later is needed.

GDAL-dev versions can be downloaded from http://gisinternals.com/sdk/. A direct link to Windows 64-bit zipfile package is http://gisinternals.com/sdk/Download.aspx?file=release-1600-x64-gdal-mapserver.zip. No installation is needed, just unzip the file to some directory on your disk. After unzipping find the batch file "SDKShell.bat" from the new GDAL main directory.  Double click it and GDAL command window opens.

A shapefile "test.shp" is converted into GeoPackage "test.gpkg" with the following command:

ogr2ogr -f GPKG test.gpkg test.shp

This command converts an OpenStreetMap data extract downloaded from Geofabrik into GeoPackage:

ogr2ogr -f GPKG --config OSM_COMPRESS_NODES YES germany.gpkg germany.osm.pbf -progress

Read data from GeoPackage with OpenJUMP PLUS



The recent OpenJUMP PLUS snaphots (download from http://sourceforge.net/projects/jump-pilot/files/OpenJUMP_snapshots/) come with DB Query plugin which adds a new menu item into File menu.


 DB Query Plugin is extremely simple and robust. On the other hand, it lacks some fancy features. It does not have a file browser to search where your SQLite/Spatialite/GeoPackage file is. It does not read any metadata tables for building drop down menus. This is the feature that makes DB Query so robust. Missing or faulty metadata don't make it fail - the database query that is sent to the database is exactly what you see in the SQL window.

Without bells and whistles, the DB Query plugin does the job. It sends a query to the database and tries to find geometry BLOBs from the response. The main jar file which takes care of this is merely 47 kB in size and it recognizes geometry blobs in PostGIS, Oracle, MySQL, FDO, Spatialite and GeoPackage formats.


At the moment the GDAL GPKG driver does not support creating spatial or even attribute indexes. However, SQLite database, which is what GeoPackage essentially is, can filter quite fast even without indexes. In this example, walking through 1.2 million polygons and selecting a subset of 3121 features without index took 4 seconds.



DB Query supports almost all that can be done with SQL As Understood By SQLite.  Main exception is that query must not return more than one geometry.

If a record has no geometry at all the plugin creates a null geometry as
"GEOMETRYCOLLECTION EMPTY" and new layer with all the records is added to the OpenJUMP map anyhow.

 User with some knowledge of SQL can utilise free queries as workarounds. For example, run a query and check the table list of the GeoPackage DB from the attributes of a new map layer.


Limitations


The DB Query Plugin is shipped with basic SQLite3 binaries and, as mentioned, that is enough for using SQL As Understood By SQLite. However, the OGC GeoPackage standard introduces some additional SQL functions and those cannot be used with standard DB Query installation. Using the additional functions with OpenJUMP is possible by using the libpgkg SQLite extension by Luciad but that makes installation and running a bit more difficult and platform dependent. Obviously that is worth another OpenJUMP blog post.

Remember also that OpenJUMP is a totally memory bound Java program. For reading big datasets a 64-bit jre and lots of RAM is needed. Even then, reading the 6.1 million OpenStreetMap highways of Germany takes about 5.1 GB of committed memory and makes OpenJUMP to react somewhat slow even on a powerful computer.

More reading:

OpenJUMP DB Query Plugin: http://sourceforge.net/p/jumpdbqplugin/wiki/Documentation/
The page is not up to date
Source code: http://sourceforge.net/projects/jumpdbqplugin/files/jumpdbqplugin/jumpdbqplugin-1.0.0/
OGC GeoPackage standard: http://www.opengeospatial.org/standards/geopackage
GDAL GeoPackage driver: http://www.gdal.org/ogr/drv_geopackage.html
GDAL OpenStreetMap driver: http://gdal.org/ogr/drv_osm.html
GDAL ogr2ogr utility: http://gdal.org/ogr2ogr.html
Luciad libgpkg site: https://bitbucket.org/luciad/libgpkg/overview

Saturday, October 05, 2013

The OpenJUMP Blog is now team work

Dear Blog readers and new or existing OpenJUMP users,

The Sunburned Surveyor was writing this blog over the past few years. This month after a couple of users found that they would like to present some OpenJUMP - we asked The Sunburned Surveyor if he would be willing to extent the writer-ship of this blog and/or to hand over the administration to the main OJ team. Well you can guess it already... from now on, this blog is run by a team of authors. Hopefully this will bring more frequent posts on "How-To"s (e.g. how to use a function or create something) and on new functions in new versions.

An integration of this blog into our website and wiki, will hopefully follow soon as well.

stay tuned,
stefan & all the others who work on OJ

Posted on 5:49 AM | Categories:

Sunday, May 12, 2013

Deactivating Tools in OpenJUMP

There was a question on the OpenJUMP mailing list the other day about how to deactivate or remove the tools that come with the standard OpenJUMP installation. (This can be done because OpenJUMP is built with a pluggable architecture.)


Ede responded with the answer. Turns out you might have to make a tweak in one of three (3) different places. They are:


  1. The "default-plugins.xml" file in the "bin" folder of your OpenJUMP installation.
  2. The JumpConfiguration class Java source code file.
  3. The OpenJumpConfiguration class Java source code file.

Note: If you have to make a change to items 2 or 3 listed above, then you will need to rebuild OpenJUMP from the source code with the source code files that include your changes.


The Sunburned Surveyor
Posted on 9:05 PM | Categories:

Friday, May 06, 2011

Moving to Wordpress

I'm consolidating most of my blogs and moving them to Wordpress. You can catch my new blog posts here:

www.redefinedhorizonsblog.com.

The Sunburned Surveyor
Posted on 3:32 PM | Categories:

Tuesday, October 19, 2010

Tortured Execution Path For Union By Attribute Plug-In

I've been doing some work on the Union By Attribute Plug-In. One of the first thing I tried to identify when I began working with the program was the execution path. I started by taking a look at the plug-ins execute method. In this plug-in, the execute method only displayed the dialog to collect the parameters from the user needed to peform the union operation. After that, I lost the path of execution. I knew that the union method did the main work of the plug-in, but I couldn't figure out how it got called from the execute method.

The "open call hiearchy" tool in Eclipse may have provided the answer. Here is what I believe is the execution path of the UnionByAttributePlugIn class:

(1) The TaskMonitorManager. execute method is called. It receives an instance of a ThreadedPlugIn class as an argument, which is in this case the UnionByAttributePlugIn.

(2) The TaskMonitorManager.execute method creates a TaskWrapper object. The run method of the TaskWrapper object is then called.

(3) The TaskWrapper.run method calls the UnionByAttributePlugIn.run method.

(4) The UnionByAttributePlugIn.run method calls the UnionByAttributePlugIn.unionFeaturesWithSameAttributeValue method.

(5) The UnionByAttributePlugIn.unionFeaturesWithSameAttributeValue method calls the UnionByAttributePlugIn.union method, which does the main work of the plug-in.

I still haven't figured out the direct link between the execute method and the TaskMonitorManager. I'll be working on that. It would be nice if the path of execution was a little more clear in the execute method. This is something to remember when writing your own plug-ins for OpenJUMP. Try to keep a direct link between your plug-ins execute method and the methods that do the main work of your plug-in. One way this can be achieved is by putting your code to display and collect data from a GUI in its own method that is called from execute. Here is an example:

public boolean execute(PlugInContext context)
{
this.showGUI();
boolean workDoneSuccessfully = this.doWork();
return workDoneSuccessfully;
}

I hope to write more on this blog about the TaskMonitorManager class, TaskWrapper class, and threads in OJ as I learn more about them.

The Sunburned Surveyor
Posted on 9:40 AM | Categories: