Monday, October 20, 2008

A FeatureCache (Almost) For OpenJUMP

For most of last week and this past weekend I have been sick with the flu. Since I was stuck at home without a great deal to do I spent some time on a programming challenge related to OpenJUMP that I have wanted to tackle for a long time. (I made some attempts at tackling this challenge before, but they weren’t successful.)

The challenge has to do with the way OpenJUMP manages Feature objects. All of the features in a data source (like an ESRI Shapefile) are currently read from the data source and put into a computer’s Random Access Memory (RAM). This has some advantages over the alternative ways you can access a data source, including faster operations on the features and the ability to overcome some limitations when writing modifications to the features in a data source back to the data source.

However, this approach also has some limitations. Every computer has a limit to how much information it can put into RAM, and most computer operating systems will only give a program like OpenJUMP a certain percentage of the available RAM. It is quite possible (especially on older computers) to run out of RAM when working with really large data sources in OpenJUMP.

One possible solution to this problem is to move your large data sources into a database like PostgreSQL or MySQL, and then to use OpenJUMP to connect to and view this data. This approach comes with its own technical challenges, not the least of which is the requirement to install and operate a relational database.

Another solution that has always interested me is the idea of a Feature Cache. In this solution Features read from a data source into OpenJUMP are not kept entirely in RAM, but are left on the hard disk. This would allow OpenJUMP to (at least in theory) work with some very, very large datasets, even on old computers. This isn’t a perfect solution. Operations on features in a Feature Cache will be slower (I’m not sure how much slower) than they would be on features stored in a computer’s RAM. This solution also imposes some limitations on the type of data that can be stored in a Feature Cache. A Feature Cache is much easier to make read-write (instead of read-only) if you put a practical limit on the size of textual (String) attribute values and feature geometries.

There is also a part of the OpenJUMP API that makes implementation of a FeatureCache somewhat tricky. The FeatureCollection interface, which a FeatureCache must implement to be very useful, defines a method named getFeatures which must return a list of objects that implement the Feature interface. This causes a problem because you need to return a collection of Feature objects from this method that are presumably all in RAM. This requirement sort of negates the whole point of a Feature Cache to begin with.

The Feature Cache solution that I worked on this past week gets around this tricky problem by using a class called a FeatureFacade. A FeatureFacade object is a proxy that forwards all of its method calls to the FeatureCache object that is its parent. This means that a FeatureFacade object doesn’t need to keep all of its geometry and attribute values in RAM. The FeatureCache can read this data from disk and return it to the FeatureFacade.
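To make the proxy idea concrete, here is a minimal sketch. It is not the code from the SurveyOS repository. MiniFeature is a hypothetical, cut-down stand-in for OpenJUMP's Feature interface, MiniFeatureCache and its methods are invented for illustration, geometry is represented as a WKT string, and two in-memory maps stand in for the files on disk.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, cut-down stand-in for OpenJUMP's Feature interface. */
interface MiniFeature {
    int getID();
    Object getAttribute(int index);
    String getGeometryWkt();
}

/** Hypothetical cache. The two maps stand in for the on-disk data files. */
class MiniFeatureCache {
    private final Map<Integer, Object[]> attributeStore = new HashMap<>();
    private final Map<Integer, String> geometryStore = new HashMap<>();
    int reads = 0; // counts how often a facade asked the cache for data

    void put(int id, String wkt, Object... attributes) {
        geometryStore.put(id, wkt);
        attributeStore.put(id, attributes.clone());
    }

    Object readAttribute(int id, int index) {
        reads++;
        return attributeStore.get(id)[index];
    }

    String readGeometryWkt(int id) {
        reads++;
        return geometryStore.get(id);
    }

    /** Hands out a lightweight proxy instead of a fully loaded feature. */
    MiniFeature facadeFor(int id) {
        return new FeatureFacade(this, id);
    }
}

/** Holds only an ID and a reference to its parent cache. */
class FeatureFacade implements MiniFeature {
    private final MiniFeatureCache parent;
    private final int id;

    FeatureFacade(MiniFeatureCache parent, int id) {
        this.parent = parent;
        this.id = id;
    }

    @Override public int getID() { return id; }

    // Every data request is forwarded to the parent cache.
    @Override public Object getAttribute(int index) { return parent.readAttribute(id, index); }

    @Override public String getGeometryWkt() { return parent.readGeometryWkt(id); }
}
```

In this sketch each facade costs only an int and an object reference, no matter how large the geometry or attribute values of the feature behind it are.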

My FeatureCache implements methods that are very similar to those defined in the Feature interface to pull this off. It also implements the FeatureCollection interface, which means it can be wrapped with a Layer object and displayed in OpenJUMP.
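One way a method like getFeatures could return a List without loading every feature is a list view that builds each element on demand. The sketch below is an assumption about how that might look, not what my code does today. LazyFeatureList is a hypothetical name, and the loader function stands in for "create a facade for the feature at this position". It uses lambdas, so it needs a newer Java than OpenJUMP currently targets.

```java
import java.util.AbstractList;
import java.util.function.IntFunction;

/** Hypothetical read-only list that creates its elements only when asked. */
class LazyFeatureList<T> extends AbstractList<T> {
    private final int size;
    private final IntFunction<T> loader; // builds the element at a given position

    LazyFeatureList(int size, IntFunction<T> loader) {
        this.size = size;
        this.loader = loader;
    }

    @Override
    public T get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return loader.apply(index); // nothing is kept after the call returns
    }

    @Override
    public int size() {
        return size;
    }
}
```

Because AbstractList supplies iterator() on top of get and size, code that walks this list sees an ordinary Iterator, which is the behavior callers of a FeatureCollection expect.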

Internally the FeatureCache uses two binary data files to store its data. One is for the attribute values, and the other is for geometry values. My FeatureCache also uses indexes for each of the files and RandomAccessFile objects to make the read and write operations as fast as possible. It also keeps track of empty “slots” in both files to keep them from growing any larger than necessary.
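Here is a rough sketch of the slot idea using a single file. It is a simplified illustration rather than my actual file format: SlotStore, the 64-byte slot size, and the length-prefixed string records are all assumptions, and the index lives only in memory.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical fixed-size-slot store backed by a RandomAccessFile. */
class SlotStore implements AutoCloseable {
    static final int SLOT_SIZE = 64; // assumed slot size: 4-byte length prefix plus data

    private final RandomAccessFile file;
    private final Map<Integer, Integer> index = new HashMap<>();  // feature ID -> slot number
    private final Deque<Integer> freeSlots = new ArrayDeque<>();  // slots released by remove()
    private int slotCount = 0;

    SlotStore(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        file.setLength(0); // this sketch always starts with an empty file
    }

    void put(int id, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > SLOT_SIZE - 4) {
            throw new IllegalArgumentException("value too long for one slot");
        }
        Integer slot = index.get(id);
        if (slot == null) {
            // Reuse an empty slot before growing the file.
            slot = freeSlots.isEmpty() ? slotCount++ : freeSlots.pop();
            index.put(id, slot);
        }
        file.seek((long) slot * SLOT_SIZE);
        file.writeInt(bytes.length);
        file.write(bytes);
    }

    String get(int id) throws IOException {
        Integer slot = index.get(id);
        if (slot == null) {
            return null;
        }
        file.seek((long) slot * SLOT_SIZE);
        byte[] bytes = new byte[file.readInt()];
        file.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    void remove(int id) {
        Integer slot = index.remove(id);
        if (slot != null) {
            freeSlots.push(slot); // the bytes stay on disk until the slot is reused
        }
    }

    long fileLength() throws IOException {
        return file.length();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Fixed-size slots are what make the reuse simple, and they are also why a practical limit on the size of String values and geometries makes a read-write cache easier to build.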

The only major shortcoming of the FeatureCache at this point is that it lacks a spatial index. This is a challenge I may tackle in the future. I also didn't include a buffer in the FeatureCache. Some sort of buffer, like a first-in first-out queue, could potentially speed up FeatureCache operations. I left the buffer out of this implementation because of the complexity it adds.
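For what it's worth, a first-in first-out buffer does not have to be large. The sketch below is one possible shape for it, not something that exists in the FeatureCache. FifoBuffer is a hypothetical name, and the loader function stands in for a read from disk. It relies on java.util.LinkedHashMap, which keeps entries in insertion order and lets a subclass drop the oldest entry through removeEldestEntry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical first-in first-out buffer placed in front of a slow loader. */
class FifoBuffer<K, V> {
    private final Function<K, V> loader; // stands in for a read from disk
    private final Map<K, V> entries;
    int loads = 0; // counts how often the loader had to be called

    FifoBuffer(final int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // Insertion-ordered map: the eldest entry is the one added first.
        this.entries = new LinkedHashMap<K, V>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value); // may evict the oldest entry
        }
        return value;
    }

    boolean isBuffered(K key) {
        return entries.containsKey(key);
    }
}
```

The hard part this sketch skips is writes: a buffered feature that has been modified must be written back or invalidated, and that is the kind of complexity that made me leave the buffer out.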

At any rate, the guts of the FeatureCache I describe here are in the SurveyOS SVN:

http://surveyos.svn.sourceforge.net/viewvc/surveyos/java/openjump_feature_cache/trunk/src/

The code is ugly, and I’m sure it is full of bugs because I haven’t tested anything yet. But I thought I would put it online in case others were interested. I’ve got a fair amount of work left to complete the FeatureCache implementation and plug it into OpenJUMP. Still, I’m excited about the concepts and seeing how it will work. If it is successful it could make a real difference for users of OpenJUMP on computers with modest RAM. (At least for those that work with big datasets.) I'd like to do some more work on the FeatureCache when I get finished with my next release of the Super Select Tool.

Some interesting challenges I ran into (so far) while working on the FeatureCache:

- Creating an object that implements the Iterator interface that will step through the Feature attributes and geometries stored in a FeatureCache. This Iterator had to look no different than one obtained from a FeatureCollection that stores all of its features in RAM.

- Storing a FeatureSchema object in a binary file, and restoring a FeatureSchema object from this binary file.
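The schema round trip in the second item can be sketched with the standard DataOutputStream and DataInputStream classes. This is a simplified illustration and not my actual file format: SchemaCodec is a hypothetical name, and the schema is reduced to an ordered map of attribute name to type name instead of a real FeatureSchema object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical binary codec for a schema reduced to name -> type pairs. */
class SchemaCodec {

    /** Writes the attribute count, then each name and type as a UTF string. */
    static byte[] write(Map<String, String> schema) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(schema.size());
            for (Map.Entry<String, String> attribute : schema.entrySet()) {
                out.writeUTF(attribute.getKey());
                out.writeUTF(attribute.getValue());
            }
        }
        return bytes.toByteArray();
    }

    /** Reads the pairs back in the order they were written. */
    static Map<String, String> read(byte[] data) throws IOException {
        Map<String, String> schema = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                String type = in.readUTF();
                schema.put(name, type);
            }
        }
        return schema;
    }
}
```

Writing the count first means the reader knows exactly how many pairs to expect, and a LinkedHashMap keeps the attribute order intact, which matters because features address their attributes by position.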

The Sunburned Surveyor

Preview of org.geotools.gpx2 code.

I've made some more improvements to the GPX support code I'm working on as part of an experimental module for GeoTools. Here's how the code works:

A SimpleGpxReader parses a GPX file and provides BasicWaypoint objects. (BasicWaypoint objects implement the SimpleWaypoint interface.) The SimpleWaypointToFeatureConverter class is then used to create a BasicFeature object based on a BasicWaypoint object. These features can be stored in a FeatureCollection and wrapped in a Layer object for display in OpenJUMP.

Most of the code described above is completed. I need to do a little unit testing and then I can make a release. I hope to get the code in the GeoTools SVN soon as well.

Future plans for this module include support for GPX tracks and routes, not just waypoints. I'd also like to support GPX file metadata and queries, and the ability to work with some of the temporal attributes of GPX entities.

The Sunburned Surveyor

Posted on 10:29 AM