DB2 Spatial Extender performance tuning

Configure your spatial database to improve performance

In almost every database system it is important to achieve good performance for the workload that is run on the system. With the help of the IBM® DB2® Spatial Extender, spatial data can be seamlessly integrated into your database together with any other data you might want to store. The extender provides a set of data types to represent spatial information. However, the DB2 Universal Database™ (DB2 UDB) database engine does not (yet) know the exact properties of spatial data, and the specialized data types are not (yet) available as built-in types. So tuning your spatial database system towards optimal performance requires a few considerations above and beyond the usual tuning steps you take for other DB2 databases. This article introduces and explains in detail the basic optimization steps that you should consider when working with spatial information.

Knut Stolze (stolze@de.ibm.com), Information Integration Development, IBM

Knut Stolze started his work with DB2 when he joined IBM as a visiting scientist at the Silicon Valley Lab, where he worked on the DB2 Image Extender. He moved on to the DB2 Spatial Extender Version 8 and was responsible for several enhancements to improve the usability, the performance, and the standards conformance of the Extender for more than two years. Currently, he works as a teaching assistant at the University of Jena, Germany, and continues his work for IBM in the area of federated databases. He can be reached through the newsgroups comp.databases.ibm-db2 and ibm.software.db2.udb.spatial or at stolze@de.ibm.com.



06 October 2005

Introduction

DB2 UDB for Linux®, UNIX®, and Windows® has offered the DB2 Spatial Extender since Version 7.1 to support the storage, management, and modification of spatial data (see the Resources section to download the DB2 Spatial Extender). You can extend the data types available for all your data with specialized structures solely dedicated to spatial information and its properties.

Every database system, however superb its functionality might be, is worthless if it cannot deliver the performance needed by the applications built on top of it. Performance tuning is, therefore, a critical task, and not only for the database administrator but also for the application developer. DB2 UDB already offers a variety of tools to help tune and monitor the system, for example the "Configuration Advisor" and the "Design Advisor". Both advisors can be accessed through the Control Center.

The DB2 Spatial Extender comes with an Index Advisor that you can use to optimize indexes defined on columns that contain spatial data. However, index tuning is just one piece in the performance tuning puzzle, and there are a few other parameters like the inline length, clustering, or the type of the table space that you can adjust to improve the performance of spatial operations. In the following sections we explain each of those parameters and show the effect they have compared to an unmodified system. The measurements were done on an IBM ThinkPad T30 with a 1.2 GHz CPU and 1 GB physical RAM. SUSE Linux 9.3 was installed as the operating system, and all measurements are based on DB2 UDB Version 8.2 with FixPak 9 applied. Please note that we did not perform a real benchmark; rather, we focused on individual changes to show the effect of each specific tuning option. You cannot simply compare the absolute numbers shown in the results with those of other systems.


Basic system tuning

The initial step in any performance tuning is to achieve a good baseline for the system. Running the DB2 Configuration Advisor helps with that task (see the Resources section for an introduction to the DB2 Configuration Advisor and for information on tuning database performance). The Configuration Advisor can be started from the Control Center by right-clicking on a database, as shown in Figure 1.

Figure 1. Starting the Configuration Advisor
Starting the Configuration Advisor

You just have to answer some simple questions about the overall system, which you need to know in any case. Here is a condensed list of the questions, together with the choices we made shown in italics. As you can see from our choices, we focused on performance right from the start. Your specific scenario might require different choices, for example with respect to the locking and recovery questions.

  1. How much memory would you like to dedicate to the DB2 instance? 809 MB (80%)
  2. Do you perform more data warehousing or more transactional operations? Mixed
  3. Are your transactions long or rather short, and how many transactions per minute are expected? Short transactions; 60 per minute
  4. Does faster recovery have a higher priority for you than faster transactions? Faster transactions
  5. Does the database already contain some data? No data available yet
  6. What's the average number of local and remote connections? 5 local, 2 remote connections on average
  7. What's the default isolation level that you want to use? Cursor stability (fewer locks)

At the end of the questionnaire you get a summary of the modifications suggested by the advisor. You can apply the suggested modifications right away or save them as a task in the Task Center. You should review the individual changes and decide whether they make sense for your system. If you want to adjust anything to your own needs, you can create a task that contains the configuration modifications in the Task Center and then modify the task script in whichever way you like. Running the Configuration Advisor on our system gives the results shown in Figure 2. Most notable is the change to the size of the default buffer pool. Increasing the size of the buffer pool is usually the single most important step to improve the performance of the overall system because it determines how much data can be cached in memory and, thus, reduces the need for (slow) disk I/O.

Figure 2. Suggestions made by the Configuration Advisor
Suggestions of the Configuration Advisor

The other changes are primarily due to dividing up the available memory among necessary caches like the package and catalog cache, or to adjusting the settings for short transactions (logging and sorting). All changes were accepted without further intervention, except the decision to reduce the number of secondary log files. We set the number of secondary log files to 10 to avoid potential issues during the spatial import operations. Your mileage may vary depending on your system and the answers you give in the questionnaire.

Performance comparison

First, we run a set of spatial operations and queries on a newly installed system without the suggested configuration changes. The execution time of the operations is measured using the operating system command time or DB2's db2batch tool. Then the database is cleaned up, the suggestions of the DB2 Configuration Advisor are applied, and the same operations are repeated. To obtain reliable results, we executed the import operation and the queries several times and averaged the measured times. The spatial operations are the following:

  • Load the shapefile europe/roads.shp from the first Data & Maps CD that is shipped with the DB2 Spatial Extender into a table called ROADS using the Spatial Extender command line tool db2se.
  • Perform a spatial operation that compares all geometries in the table with a fixed linestring using a simple spatial query.
  • Create a grid index on the spatial column.

Listing 1 shows the exact steps and the performance results they yield. You can find the SQL script test_config_advisor.sql in the download section.

Listing 1. Spatial operations on an untuned database
$ time db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -idColumn id \
        -commitScope 1500 -messagesFile /home/stolze/import.msg
GSE0000I  The operation was completed successfully.

real    2m19.086s
user    0m0.050s
sys     0m0.021s

$ db2batch -d testdb -f test_config_advisor.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT id
FROM   roads
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1

Prepare Time is:           0.000      seconds
Execute Time is:           1.248      seconds
Fetch Time is:             0.000      seconds
Elapsed Time is:           1.248      seconds

---------------------------------------------

Statement number: 2

CREATE INDEX roads_grid_index ON roads(shape)
   EXTEND USING db2gse.spatial_index(0.27, 0.54, 1.6)

Elapsed Time is:           25.503     seconds

---------------------------------------------

Note that the configuration of the test system is rather sub-optimal because the imported data is read from a file that resides on the same hard drive as the database and the database logs. Thus, the read operations compete with log writing and with the writing of data pages from the buffer pool. You might want to distribute the various disk-related tasks across different file systems. When importing shapefiles with the DB2 Spatial Extender import tools, it is recommended that you do not load the data directly from the CD but rather copy it to a hard disk first. The access pattern used to read the shapefile is not well suited to CD-ROM drives and, therefore, slows the whole operation down dramatically.

Repeating the steps listed above after applying the suggestions of the Configuration Advisor gave the results shown in Listing 2. As you can see, the import alone improved by 11%, the query ran 28% faster, and even the index creation took only 90% of the previous time. So even the most basic performance tuning should not be ignored.

Listing 2. Spatial operations on a tuned database
$ time db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -idColumn id \
        -commitScope 1500 -messagesFile /home/stolze/import.msg
GSE0000I  The operation was completed successfully.

real    2m2.848s
user    0m0.051s
sys     0m0.027s

$ db2batch -d testdb -f test_config_advisor.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT id
FROM   roads
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1

Prepare Time is:           0.000      seconds
Execute Time is:           0.895      seconds
Fetch Time is:             0.000      seconds
Elapsed Time is:           0.895      seconds

---------------------------------------------

Statement number: 2

CREATE INDEX roads_grid_index ON roads(shape)
   EXTEND USING db2gse.spatial_index(0.27, 0.54, 1.6)

Elapsed Time is:           22.980     seconds

---------------------------------------------

DB2-internal handling of spatial data

Spatial data can become rather complex and can require quite a lot of space to store the points that form a geometry. For example, a geometry that represents the whole area of the United States consists of 60 single polygons with a combined total of 198569 points defining those polygons. The complete definition of that geometry uses 0.9 MB of disk space in the Spatial Extender internal format (which uses compression). Encoded in the ESRI shape format, it actually requires 3.1 MB of disk space (see the Resources section for more information on the ESRI shape format). All this information is encapsulated in an ST_Geometry value, implying that this value also needs about one MB of disk space in the database. At the other extreme, a spatial value may represent just a single point: merely two 8-byte floating point values for the X and Y dimensions. Representing a point as an ST_Point value adds some overhead, but we are still only talking about a few hundred bytes.

The total size of all values stored in a row of a DB2 table is limited by the page size that is used for the table space. An exception is large objects (LOBs), which can be up to 2 GB each. The largest page size supported by DB2 is 32K. So storing a geometry that requires 1 MB of space requires some sort of LOB-like storage mechanism. But always using that mechanism, even for point data, would be complete overkill. To address such widely differing requirements, DB2 implements a hybrid approach to storing spatial data (or structured data in general). If a spatial value exceeds a certain size (the so-called inline length), then the value is stored like a BLOB. Otherwise, the value is stored like a VARCHAR FOR BIT DATA value. The next section goes into the details of setting the inline length for spatial columns and the benefits you can get from such a change. After that, we discuss clustering spatial data, give guidance for writing spatial SQL queries and tuning the spatial grid index, and finally explain which type of table space is recommended if you have many data modifications.
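The hybrid decision can be modeled with a small sketch. This is a simplified illustration of the rule described above, not DB2's internal code; the function name and its exact threshold semantics are assumptions for this example.

```python
# Simplified model of DB2's hybrid storage decision for structured types:
# values no larger than the column's inline length are stored in the row
# (like VARCHAR FOR BIT DATA); larger values are "lobified".

def storage_mode(value_size_bytes: int, inline_length: int) -> str:
    """Return how a structured-type value of the given size would be stored."""
    if value_size_bytes <= inline_length:
        return "inline"    # lives on a data page, cached by the buffer pool
    return "lobified"      # stored like a BLOB, read directly from disk

# A 245-byte point fits even a small inline length of 292 bytes,
# while the 0.9 MB United States geometry mentioned above does not:
print(storage_mode(245, 292))      # inline
print(storage_mode(900_000, 292))  # lobified
```

The point of the model is that the decision is made per value, so a single column can hold both inline and lobified geometries at the same time.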


Setting the inline length of spatial columns

We already explained in a previous section the DB2-internal mechanism to store spatial data with widely varying size requirements. The deciding factor that determines whether a geometry is stored as VARCHAR FOR BIT DATA or as a BLOB is the so-called inline length that applies to a spatial column in any table. If the internal representation of the spatial value requires fewer bytes than the inline length setting specifies, then it is stored inline as VARCHAR FOR BIT DATA. Otherwise, the value is lobified and stored as a LOB in the table's LONG table space.

INLINE LENGTH

When a new structured type is created in a database, DB2 will calculate the default inline length of that data type based on the attributes specified in the type definition. You can find the default inline length of a structured type in the system catalog view SYSCAT.DATATYPES, column INLINE_LENGTH. That default is always used if no explicit inline length is specified when a column of a table is defined in a CREATE TABLE or ALTER TABLE ... ADD COLUMN ... statement.

You can modify (increase) the inline length of existing spatial columns using the ALTER TABLE ... ALTER COLUMN ... SET INLINE LENGTH ... statement. This alteration affects only the DB2 catalog and subsequent data modifications; existing values are not converted unless you reorganize the data stored in the table with the REORG TABLE command, including the LONGLOBDATA option. The reorganization converts lobified spatial values to inline values if their size is less than the new inline length.

You should remember that storing the data inline is very much preferable to lobifying it. The reason is simply that inlined data is treated exactly like VARCHAR FOR BIT DATA: the value is stored on a data page together with all other attributes of the same row. And once the data is on such a page, the page is accessed through the buffer pool, which provides sophisticated caching techniques and avoids file I/O as much as possible. Things are quite different with LOBs, which are always read directly from disk.

So the rule of thumb is pretty straightforward: set the inline length as high as possible so that as many spatial values as possible are stored inline. Of course, this is often not as simple as it appears. A high inline length tells DB2 that the spatial values can actually occupy that many bytes in a single row. The maximum size of each row is restricted by the page size and the other attributes (columns) defined for the table. For example, if you have a table space with a page size of 4 KB (4096 bytes), then the maximum length of a row is restricted to 4005 bytes (see the Resources section for more information on SQL limits). If the table in question has a not-nullable INTEGER column and a nullable VARCHAR(100) column in addition to the spatial column, then you can set the inline length to at most 4005 - 6 - 4 - (1+2+100) - 1 = 3891 bytes, where 6 bytes are for the row prefix, 4 bytes for the INTEGER column, (1+2+100) bytes are reserved for the VARCHAR(100), and the final 1 byte is the NULL indicator for the spatial column (see the Resources section for an article on database objects and the CREATE TABLE statement). As you can see, the length of the other columns and the inline length compete with each other. To further increase the inline length, you can switch to a larger page size of 8K, 16K, or even 32K. In the previous example, you could then set the inline length to 7987, 16179, or 32563 bytes, respectively.
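As a sanity check, the row-budget arithmetic above can be scripted. The 4 KB row limit of 4005 bytes is stated in the text; the limits for the larger page sizes are back-derived from the inline lengths quoted above and should be verified against the SQL limits documentation.

```python
# Maximum inline length for a spatial column in a row that also holds a
# NOT NULL INTEGER and a nullable VARCHAR(100), per page size.
PAGE_ROW_LIMITS = {4096: 4005, 8192: 8101, 16384: 16293, 32768: 32677}

def max_inline_length(page_size: int) -> int:
    row_prefix = 6              # per-row overhead
    int_col = 4                 # NOT NULL INTEGER column
    varchar_col = 1 + 2 + 100   # null indicator + length info + VARCHAR(100)
    spatial_null = 1            # NULL indicator of the spatial column
    return (PAGE_ROW_LIMITS[page_size]
            - row_prefix - int_col - varchar_col - spatial_null)

for size in (4096, 8192, 16384, 32768):
    print(size, max_inline_length(size))  # 3891, 7987, 16179, 32563
```

If you change the non-spatial columns of your own table, adjust the per-column terms accordingly before picking an inline length.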

Choosing a proper inline length

Before you go ahead and move all your spatial data to 32K table spaces and set the inline length to the highest possible value, you should first analyze how big your data really is and which other parameters might influence the page size. If you only have ST_Point values, then each point requires at most 245 bytes of physical storage, as Listing 3 demonstrates. In that case, it might even be helpful to reduce the inline length because you might be able to use a smaller page size and/or fit additional columns into your table. Note, however, that the ALTER TABLE statement only allows you to increase the inline length. If you want to use a smaller value, it must be specified when the table is created. The Spatial Extender import procedures allow you to explicitly specify the inline length for the spatial column.

Nesting LOBs in structured types

Despite the points attribute being defined as a BLOB, DB2 does not store it separately. Instead, the whole geometry information, including the BLOB data, is stored together -- either inline or lobified, depending on the column's inline length. The implementation of structured types is such that all attribute values are concatenated into a single binary stream, any necessary meta information is added, and the resulting binary stream is inlined or lobified when it is materialized, in other words, stored in a table.

That approach offers a way for any application that deals with LOBs to store short LOB values inline and exploit DB2's buffer pool.

Listing 3 shows how to calculate how much disk space each geometry would require if it were stored inline. We assume that all geometries reside in the column GEOMETRY of a table named SPATIAL_DATA. The first query employs the LENGTH function. That function shows the size of a value that is stored inline; if the value is lobified, it shows the size of the locator referencing the actual value. So the function can only be used reliably if all spatial values are known to be inlined. Therefore, the subsequent queries calculate the size of the data based on the attributes of the spatial data types. The information about the attributes can be retrieved from the DB2 catalog views SYSCAT.DATATYPES and SYSCAT.ATTRIBUTES. Given that spatial data is implemented using structured types, the rules for determining the size of values of a structured type, as explained in the SQL Reference for the CREATE TYPE statement, apply (see the Resources section for the CREATE TYPE statement). In particular, the ST_Geometry type defines 16 attributes, and none of its sub-types adds any attributes of its own. All attributes are so-called short attributes, except three of them. The two non-short attributes anno_text and ext are not used, and the third non-short attribute points contains the internally encoded geometry information as a BLOB. In addition to the actual data, DB2 needs to maintain the mandatory null indicator (1 byte) and length information (4 bytes). Thus, the size of a geometry can be calculated by the formula "32 + 16*10 + 5 + LENGTH(points) = 197 + LENGTH(points)".
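The derivation can also be captured in a tiny helper for experimentation. The constants mirror the formula above (32 bytes of base information, 16 short attributes of 10 bytes each, plus 5 bytes for the null indicator and length information of the points attribute); the helper itself is an illustration, not part of the Spatial Extender.

```python
# Inline size of an ST_Geometry value: fixed overhead plus the length of
# the internally encoded points BLOB attribute.
def geometry_inline_size(points_length: int) -> int:
    base = 32                # fixed meta information
    short_attrs = 16 * 10    # 16 short attributes, 10 bytes each
    points_overhead = 1 + 4  # null indicator + length information
    return base + short_attrs + points_overhead + points_length

print(geometry_inline_size(0))     # 197 bytes of fixed overhead
print(geometry_inline_size(1803))  # 2000: fits an inline length of 2000
```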

Listing 3. Space requirements of geometries
-- maximum space requirement for spatial point data
CREATE TABLE test ( p db2gse.ST_Point INLINE LENGTH 3800 )@
INSERT INTO test VALUES ( db2gse.ST_Point(
                             1234567890123456, 1234567890123456,
                             1234567890123456, 1234567890123456, 0) )@
SELECT LENGTH(p) FROM test@

1
-----------
        245

  1 record(s) selected.

-- determining the size of all geometries in a table
SELECT 197 + LENGTH(geometry..points)
FROM   spatial_data@

-- calculating #geometries that would be stored inline/lobified
-- for a given inline length
SELECT SUM(inline) AS inline, COUNT(*) - SUM(inline) AS lobified
FROM   ( SELECT CASE
                   WHEN 197 + LENGTH(geometry..points) <= <inline_length>
                   THEN 1
                   ELSE 0
                END
         FROM   spatial_data ) AS t(inline)@

Performance comparison

To demonstrate the effect of a small versus a large inline length, we first import the shapefile europe/roads.shp into a table with an inline length of 292, the minimum value mandated by DB2. Following that, an SQL script is run that determines how many geometries are stored inline and how many had to be lobified. The time to execute a simple spatial query is measured afterwards, and excerpts from a statement snapshot taken during the execution illustrate the most prominent factors that influence performance. The whole procedure is repeated with an inline length of 2000, which is sufficient to cause all geometries in that shapefile to be inlined. The script test_inline_length.sql that we used for the db2batch run is included in the download section.

Listing 4. Effect of different inline length settings
$ time db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -inlineLength 292 \
        -idColumn id -commitScope 1500
GSE0000I  The operation was completed successfully.

real    3m15.604s
user    0m0.050s
sys     0m0.026s

$ db2batch -d testdb -f test_inline_length.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT SUM(inline) AS inline_storage,
       COUNT(*) - SUM(inline) AS lobified_storage
FROM   ( SELECT CASE
                   WHEN 197 + LENGTH(shape..points) <=
                           ( SELECT inline_length
                             FROM   syscat.columns
                             WHERE  colname = 'SHAPE' AND
                                    tabname = 'ROADS' )
                   THEN 1
                   ELSE 0
                END
         FROM   roads ) AS t(inline)

INLINE_STORAGE LOBIFIED_STORAGE  
-------------- ----------------
         89595            21384

---------------------------------------------

Statement number: 2

SELECT id
FROM   roads
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1

Prepare Time is:           0.000      seconds  
Execute Time is:           0.854      seconds  
Fetch Time is:             0.000      seconds  
Elapsed Time is:           0.855      seconds

Buffer pool data logical reads               = 16818
Buffer pool index logical reads              = 19731
Direct reads                                 = 3088
Direct read requests                         = 1544
Direct read elapsed time (ms)                = 18

---------------------------------------------

$ db2 "DROP TABLE roads"
$ time db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -inlineLength 2000 \
        -idColumn id -commitScope 1500
GSE0000I  The operation was completed successfully.

real    1m57.212s
user    0m0.049s
sys     0m0.027s

$ db2batch -d testdb -f test_inline_length.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT SUM(inline) AS inline_storage,
       COUNT(*) - SUM(inline) AS lobified_storage
FROM   ( SELECT CASE
                   WHEN 197 + LENGTH(shape..points) <=
                           ( SELECT inline_length
                             FROM   syscat.columns
                             WHERE  colname = 'SHAPE' AND
                                    tabname = 'ROADS' )
                   THEN 1
                   ELSE 0
                END
         FROM   roads ) AS t(inline)

INLINE_STORAGE LOBIFIED_STORAGE  
-------------- ----------------
        110979                0  

---------------------------------------------

Statement number: 2

SELECT id
FROM   roads
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1

Prepare Time is:           0.000      seconds
Execute Time is:           0.792      seconds
Fetch Time is:             0.000      seconds
Elapsed Time is:           0.792      seconds

Buffer pool data logical reads               = 17337
Buffer pool index logical reads              = 19731
Buffer pool index physical reads             = 0
Direct reads                                 = 2
Direct read requests                         = 1
Direct read elapsed time (ms)                = 0

---------------------------------------------

As the results show, using a larger inline length yields a significant speed-up of 40% for the import operation, and the query on the sample data runs 7% faster. The underlying reason for both differences is that more than 20000 geometries (only about 20% of the data) are lobified with the smaller inline length. DB2 reads (and writes) LOB data directly from disk. In the second scenario, the majority of the direct reads were not necessary at all because the requests could be satisfied by inlined data cached in the buffer pool. Note that a ratio of 20% lobified data is actually not that bad; had more data failed to fit inline due to slightly more complex geometries, the demonstrated differences would have been even larger.


Clustering spatial data

Clustering data according to a certain property is a common and very useful technique to physically organize the data of a table. The underlying observation is that data with similar values are very often accessed together. So it makes sense to store similar data close to each other so that access to that data does not have to be scattered across many different pages in the table space but can be served from just a few pages right next to each other. What is more natural than clustering spatial data according to its spatial properties, that is, the distance between the geometries? Spatial queries are among the best examples of locality of data access; in other words, geometries that are close to each other in the real world are often accessed together. For example, if you look at a street map of a city, you are interested in all the streets of that city but not the streets of some other city on the other side of the globe. So storing the rows of that city physically close together on disk makes a lot of sense.

The way to establish a certain data clustering property in DB2 is to reorganize the table according to an index. However, this is not so easy for spatial data because the DB2 REORG TABLE command does not understand spatial indexes due to their complex nature. There is an easy way around this deficiency: use a column whose values are computed from the geometry in question. That column needs a declared data type on which DB2 can create a native B-tree index. The generation employs space filling curves to preserve the spatial and topological properties (see the Resources section for the H. Sagan book). We take a point of the geometry -- the centroid -- compute that point's value on the space filling curve, and store the result in an additional column. Finally, we create an index on the additional column and reorganize the table based on that index.
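To illustrate why a space filling curve helps with clustering, here is a Morton (Z-order) encoding sketch in Python. This is not the Spatial Extender's internal key generator, only the general bit-interleaving idea: points with similar X and Y coordinates tend to receive numerically close key values, which a regular B-tree index can then use for clustering.

```python
# Morton (Z-order) key: interleave the bits of the integer X and Y
# coordinates. Nearby points usually map to nearby keys.
def morton_key(x: int, y: int, bits: int = 16) -> int:
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # X bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # Y bits at odd positions
    return key

# The four cells of a 2x2 block receive four consecutive keys:
for x, y in [(2, 2), (3, 2), (2, 3), (3, 3)]:
    print(x, y, morton_key(x, y))  # keys 12, 13, 14, 15
```

Sorting rows by such a key, as the reorganization by the B-tree index effectively does, places spatially adjacent geometries on adjacent pages.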

Defining the clustering column

Before we can actually extend the table that contains the spatial data and add the clustering column, we have to define a function that computes the value of the geometry's centroid on the space filling curve. Fortunately, the DB2 Spatial Extender already provides most of the logic and we just have to tie it together for our purposes. First, there is a method ST_Centroid defined for polygons. Applying this method on the result of ST_Envelope, which always returns a polygon for any non-empty geometry, gives us the single point we need. The next piece is the key generator function of the Z-order based index mechanism that comes with the DB2 Spatial Extender (see the Resources section for information on Index Extensions). The Z-order is actually a space filling curve -- exactly what we had in mind. The information about the key generator function can be gathered with the db2look tool or directly from the DB2 catalog views. You will find that the key generator is defined as shown in Listing 5.

Listing 5. Key generator function of the Z-Order index
CREATE FUNCTION db2gse.GseZordIdxKeyGen ( srsId INTEGER,
      xMin DOUBLE, xMax DOUBLE, yMin DOUBLE, yMax DOUBLE,
      xOffset DOUBLE, xScale DOUBLE, yOffset DOUBLE, yScale DOUBLE )
   RETURNS TABLE ( coordsysId INTEGER, zValue BIGINT )
   SPECIFIC ST_ZixKeyGen
   EXTERNAL NAME 'db2gsefn!gseZOrderIndexKeyGenerator'
   ...@

Now we can create an additional function that combines the aforementioned functions and produces the scalar value on the space filling z-curve for any given geometry. The definition of the resulting function computeZValue1 is shown in Listing 6, and it can also be found in the file zvalue_fct.sql in the download section. It takes a geometry as input and, additionally, 4 parameters that define how to convert the floating point coordinates in the X and Y dimensions of the point geometry to the integer values from which the Z-value is computed. Of course, you could use constants directly in the function body instead of parameters if you don't need the additional flexibility. Note that, following the paradigm of the DB2 Spatial Extender, the offset is subtracted from the floating point value before the scale factor is multiplied in. The second function computeZValue in Listing 6 is just a variation of the first function. It uses a different approach to come up with the input arguments for the function db2gse.GseZordIdxKeyGen. The computation of the envelope and the extraction of the centroid of that envelope requires a certain amount of communication between the DB2 engine and the Spatial Extender itself. This communication can be avoided by using the methods ST_MinX, ST_MaxX, ST_MinY, and ST_MaxY to extract the minimum and maximum X and Y coordinates of a geometry, and calculating the coordinates of the centroid directly.

Listing 6. Function to generate Z-Value for a geometry
CREATE FUNCTION computeZValue1 ( g db2gse.ST_Geometry,
      xOffset DOUBLE, xScale DOUBLE, yOffset DOUBLE, yScale DOUBLE )
   RETURNS BIGINT
   LANGUAGE SQL
   DETERMINISTIC
   NO EXTERNAL ACTION
   RETURN SELECT t2.zValue
          FROM   TABLE ( VALUES (g..ST_Envelope()..ST_Centroid()) ) AS t1(p),
                 TABLE ( db2gse.GseZordIdxKeyGen ( p..ST_SrsId(),
                            p..ST_X(), p..ST_X(),  p..ST_Y(), p..ST_Y(),
                            xOffset, xScale, yOffset, yScale ) ) AS t2@

CREATE FUNCTION computeZValue ( g db2gse.ST_Geometry,
      xOffset DOUBLE, xScale DOUBLE, yOffset DOUBLE, yScale DOUBLE )
   RETURNS BIGINT
   LANGUAGE SQL
   DETERMINISTIC
   NO EXTERNAL ACTION
   RETURN SELECT zValue
          FROM   TABLE ( db2gse.GseZordIdxKeyGen ( g..ST_SrsId(),
                             ((g..ST_MinX() + g..ST_MaxX()) / 2),
                             ((g..ST_MinX() + g..ST_MaxX()) / 2),
                             ((g..ST_MinY() + g..ST_MaxY()) / 2),
                             ((g..ST_MinY() + g..ST_MaxY()) / 2),
                            xOffset, xScale, yOffset, yScale ) ) AS t@
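
The internals of db2gse.GseZordIdxKeyGen are not documented here, but the underlying idea of a Z-value (also known as a Morton code) is simple: scale the coordinates to non-negative integers and interleave their bits. A minimal, illustrative Python sketch of that idea (the function name and bit width are our own, not the Spatial Extender's):

```python
def z_value(x, y, x_offset, x_scale, y_offset, y_scale, bits=31):
    """Illustrative Morton (Z-order) code: convert the floating point
    coordinates to integers (offset subtracted first, then the scale
    factor multiplied in, as the Spatial Extender does), then
    interleave the bits of the two integers. The scaled values are
    assumed to be non-negative."""
    xi = int((x - x_offset) * x_scale)
    yi = int((y - y_offset) * y_scale)
    z = 0
    for i in range(bits):
        z |= ((xi >> i) & 1) << (2 * i)      # x bits at even positions
        z |= ((yi >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return z
```

Points that are close in space share long common bit prefixes in their Z-values, so sorting rows by this value places spatial neighbors on nearby pages.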

Subsequently we will use the second version of the function because it runs dramatically faster, as Listing 7 demonstrates. The summary of the db2batch run shows that a lot of time is spent inside the db2agent process, which is the process that executes the spatial functions ST_Envelope and ST_Centroid. Although it may seem that ST_MinX and the related methods would incur the same communication overhead, this is not the case: those four methods merely access attributes of the structured types of the geometries, and that is handled completely inside the DB2 engine.

Listing 7. Comparing the functions that generate Z-Values
$ db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -idColumn id \
        -commitScope 1500
GSE0000I  The operation was completed successfully.
$ db2batch -d testdb -f test_zvalue_fct.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT computezValue1(shape, -180, 1000, -90, 1000)
FROM   roads

Prepare Time is:           0.000      seconds
Execute Time is:           5.737      seconds
Fetch Time is:             106.706    seconds
Elapsed Time is:           112.443    seconds

---------------------------------------------

Statement number: 2

SELECT computezValue(shape, -180, 1000, -90, 1000)
FROM   roads

Prepare Time is:           0.000      seconds
Execute Time is:           0.130      seconds
Fetch Time is:             2.702      seconds
Elapsed Time is:           2.832      seconds

---------------------------------------------

Summary of Results
==================
                Elapsed             Agent CPU         Rows      Rows
Statement #     Time (s)            Time (s)          Fetched   Printed
1                   112.443             101.000     110979         0
2                     2.832               2.560     110979         0

The final step is to extend the table(s) whose data shall be organized according to the topological information of the geometries. We assume that a table named ROADS has already been created based on the shapefile europe/roads.shp. The first thought might be to add a new column and use the GENERATED ALWAYS AS ( ... ) syntax to populate and maintain it. However, both variations of the function use sub-selects (and the spatial functions in the first version employ a scratchpad to carry information from one call to the next), which violates certain restrictions on generated columns. So we have to resort to triggers, as in Listing 8. To ensure that all existing rows are handled properly, the UPDATE statement is run once. The benefits of clustering the rows by their spatial properties rely on accessing just a subset of the data at query time. If a table scan is used, the clustering would not really help because DB2 still has to read all rows, and then the physical ordering is not very important. So we create a spatial grid index that will be used at query time. All of these steps can also be found in the attached file zvalue_clustering_setup.sql.

Listing 8. Extending the table by the Z-value column
ALTER TABLE roads ADD COLUMN z_value BIGINT@

CREATE INDEX z_value_idx ON roads(z_value)@

CREATE TRIGGER roads_zval_insert NO CASCADE BEFORE INSERT ON roads
   REFERENCING NEW AS n
   FOR EACH ROW MODE DB2SQL
   SET n.z_value = computeZValue(n.shape, -180, 1000, -90, 1000)@

CREATE TRIGGER roads_zval_update NO CASCADE BEFORE UPDATE OF shape ON roads
   REFERENCING NEW AS n
   FOR EACH ROW MODE DB2SQL
   SET n.z_value = computeZValue(n.shape, -180, 1000, -90, 1000)@

UPDATE roads
SET    z_value = computeZValue(shape, -180, 1000, -90, 1000)@

REORG TABLE roads INDEX z_value_idx@

Performance comparison

We run a single query, first on unclustered and then on clustered data. The REORG TABLE step between the two scenarios establishes the clustering order. To ensure that we got meaningful results, we also verified that the grid index is actually exploited for this query, as explained in a later section. Due to the rather small amount of data, no significant differences in the execution time could be measured. But a snapshot was collected, and the interesting information for the table space USERSPACE1 is extracted here to highlight the difference. The data in the table was imported -- as usual -- from the shapefile europe/roads.shp, and the inline length was set to 2000. The exact switches for the measurement are listed in the SQL script test_clustering.sql, which is available in the download section.

Listing 9. Testing the effect of clustering spatial data
$ db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -inlineLength 2000 \
        -idColumn id -commitScope 1500
GSE0000I  The operation was completed successfully.
$ db2 connect to testdb
$ db2 "CREATE INDEX roads_grid_index ON roads(shape) \
        EXTEND USING db2gse.spatial_index(0.27, 0.54, 1.6)"
$ db2stop force && db2start
$ db2batch -d testdb -f test_clustering.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT id
FROM   roads
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1 SELECTIVITY 0.00000001

            *** Tablespace Snapshot ***

Tablespace Name                               = USERSPACE1          
Buffer pool data logical reads                = 13161
Buffer pool data physical reads               = 1949
Total buffer pool read time (ms)              = 39

---------------------------------------------

$ db2 connect to testdb
$ db2 -td@ -f zvalue_fct.sql
$ db2 -td@ -f zvalue_clustering_setup.sql
$ db2stop force && db2start
$ db2batch -d testdb -f test_clustering.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT id
FROM   roads
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1 SELECTIVITY 0.00000001

            *** Tablespace Snapshot ***

Tablespace Name                               = USERSPACE1          
Buffer pool data logical reads                = 13161
Buffer pool data physical reads               = 749
Total buffer pool read time (ms)              = 16

---------------------------------------------

You will notice that the number of data physical reads is significantly lower in the clustered scenario -- only 38% (749 versus 1949) of the reads needed for the unclustered table. This difference can be fully attributed to the effect of the clustering; in other words, the relevant data is stored close together on the physical level, so fewer data pages need to be accessed during query execution.

Using the Hilbert space filling curve

Clustering based on spatial properties has proven to be highly effective at several customer sites. Of course, you can imagine different approaches to come up with the values for the generated clustering column. For example, you could follow the idea of the grid index and use the identifier of a single grid cell that overlaps with the geometry. Also, your approach might depend heavily on your spatial data. The space-filling Z-order curve might not be the best choice for geodetic data, given the closed nature of the earth's surface.

For your convenience, we provide an implementation of another space-filling curve, the Hilbert curve (see the Resources section for information on the Hilbert curve), in the download section. Studies have shown that the Hilbert curve preserves topological locality very well, even better than the Z-ordering. However, using the Hilbert curve requires that you compile the user-defined function written in C and register it in the database. The compilation can be performed with the help of the bldrtn script, which can be found in the sqllib/samples/cpp directory. The SQL statement to register the function can be found in the file hilbert_fct.sql.
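
The C source in the download section is not reproduced here, but the classic iterative mapping from grid coordinates to a position on the Hilbert curve is compact enough to sketch. The following Python version shows the textbook algorithm for a 2^k x 2^k grid; it is an illustration of the curve, not the code registered by hilbert_fct.sql:

```python
def xy_to_hilbert(x, y, n):
    """Map integer cell coordinates (x, y) on an n-by-n grid
    (n a power of two) to the distance along the Hilbert curve.
    Classic iterative algorithm: descend quadrant by quadrant,
    rotating each sub-quadrant into canonical orientation."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so the sub-curve is in standard position
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Because consecutive curve positions are always adjacent grid cells (the curve never jumps, unlike the Z-curve), sorting by this value tends to keep spatial neighbors even closer together on disk.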

The steps to add an indexed clustering column are very similar to the ones shown in Listing 8. However, this time we are able to use a generated column because no sub-selects occur in the definition of the function computeHilbertVal and no other restrictions on generated columns are violated. Listing 10 shows the specific steps, which can be executed once the function computeHilbertVal is registered in your database. You will notice that we access four attributes of the spatial data type ST_Geometry directly so as not to run into trouble with the generated column. Accessing those attributes is only possible if the necessary privileges were granted and if you do not use a geodetic coordinate system; the latter could lead to incorrect results due to the differing use of the attributes in the geodetic case. But as was mentioned before, you should pick a different clustering criterion for geodetic data in any case. The two SET INTEGRITY statements are only necessary if your table already contains data; they force the generation of the Hilbert value for the existing rows. The same steps are also collected in the hilbert_clustering_setup.sql script. You must ensure that enough log space is available for the generation process.

Listing 10. Extending the table by the Hilbert-value column
SET INTEGRITY FOR roads OFF@

ALTER TABLE roads ADD COLUMN hilbert_value BIGINT
   GENERATED ALWAYS AS ( computeHilbertVal(
      (shape..xMin + shape..xMax) / 2,
      (shape..yMin + shape..yMax) / 2,
      -180, 1000, -90, 1000) )@

SET INTEGRITY FOR roads IMMEDIATE CHECKED FORCE GENERATED@

CREATE INDEX hilbert_value_idx ON roads(hilbert_value)@

REORG TABLE roads INDEX hilbert_value_idx@

Now we repeat the operations from Listing 9. The results are shown for the clustered case only because the unclustered table yields the same numbers as before. The result shows that the Hilbert curve indeed clusters the data better according to its topology, giving us an additional 1.5% reduction in data physical reads compared to the Z-ordering. As was already mentioned, using the Hilbert curve might not be an option if you don't want to compile the source code into a shared library yourself.

Listing 11. The performance impact of the clustered Hilbert-value column
$ db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -inlineLength 2000 \
        -idColumn id -commitScope 1500
GSE0000I  The operation was completed successfully.
$ db2 connect to testdb
$ db2 "CREATE INDEX roads_grid_index ON roads(shape) \
        EXTEND USING db2gse.spatial_index(0.27, 0.54, 1.6)"
$ db2 -td@ -f hilbert_fct.sql
$ db2 -td@ -f hilbert_clustering_setup.sql
$ db2stop force && db2start
$ db2batch -d testdb -f test_clustering.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT id
FROM   roads
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1 SELECTIVITY 0.00000001

            *** Tablespace Snapshot ***

Tablespace Name                               = USERSPACE1          
Buffer pool data logical reads                = 13161
Buffer pool data physical reads               = 723
Total buffer pool read time (ms)              = 16

---------------------------------------------

The theoretically optimal number of pages that need to be read can be calculated with the formula "#rows-returned-by-index * ( #pages-of-table / #rows-in-table )". You can obtain the total number of rows in the table and the total number of pages allocated for it from the columns CARD and NPAGES of the SYSSTAT.TABLES catalog view after you have collected statistics on the table ROADS. The number of rows returned by the index must be used (instead of the number of rows that the query itself returns) because the index only identifies a set of candidate rows for which the spatial function -- here ST_Intersects -- needs to be called to determine the final result of the spatial predicate. You can use the function ST_MBRIntersects to determine the set of candidates yourself. In our example, we get 6574 candidates from the index scan. The table stores 110979 rows on 11018 pages.

Now we have all the required information and can calculate "6574 * ( 11018 / 110979 ) = 652.7". So at least 653 pages would need to be read from disk if the rows were distributed in such an ideal way that only rows accessed by the query are found on the loaded pages. The bottom line is that an additional 11% of pages had to be read from the physical device in the clustered case, whereas the unclustered case caused an excess of not quite 200%.
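
The arithmetic above can be checked mechanically. A small Python sketch using the numbers from this example:

```python
import math

def optimal_pages(candidate_rows, table_pages, table_rows):
    """Theoretical minimum number of pages to read:
    #rows-returned-by-index * ( #pages-of-table / #rows-in-table ),
    rounded up, since a fraction of a page is still a whole page read."""
    return math.ceil(candidate_rows * table_pages / table_rows)

pages = optimal_pages(6574, 11018, 110979)  # 653 pages for this example
excess_clustered = 723 / pages - 1          # about 11% extra physical reads
excess_unclustered = 1949 / pages - 1       # just under 200% extra
```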


Hints and tips on writing spatial SQL statements

The paradigm of SQL is that the user (or application) just tells the database system what to do, and the system itself figures out the best way to do it. Thus, SQL is a purely descriptive language. The DB2 optimizer tries hard to fulfill this idea, and it is generally known to be very good at its job. However, there are still some situations where user intervention can further improve things. In the following, we describe a few concepts that you should keep in mind for spatial queries and spatial DML statements.

Process multiple rows

Originally, SQL was (and it still is) a set-oriented language and not a procedural one. The implication is to process all relevant rows together as a set instead of handling each row individually, one after the other. A good example is INSERT statements: if more than one row needs to be inserted into the same table, process all those rows in a single SQL statement instead of firing several statements. This is especially interesting for spatial data because pretty much all of the spatial functions are implemented in such a way that multiple executions of a function within the same SQL statement result in a performance benefit. The internal processing performs some necessary initializations at the first call made to the function, for example, to set up memory areas needed for the computation; subsequent calls take advantage of that established infrastructure before it is torn down in the last call made to the function in the scope of that one statement.

So if you were to execute a separate INSERT statement for each row to be inserted, the initialization and destruction of the mentioned infrastructure would be repeated for each execution, that is, for each row. Combining the statements into a single one avoids the excessive initialization and destruction steps, resulting in an overall performance boost. Additionally, processing multiple rows together improves performance because there are fewer round trips between the application (DB2 client) and the database engine.

Let's demonstrate the effects with a rather simple example where we insert 50 different rows. First, all rows are inserted separately, using SQL statements like the first one shown in Listing 12. The second statement in the listing combines all 50 rows in a single insert operation. The idea, on a descriptive level, is to build in the FROM clause a temporary table that consists of the data to be imported before calling any of the spatial functions. Next, all rows of that temporary table are scanned, the spatial function is applied, and a new table is constructed whose data is finally inserted.

Listing 12. Approach to single-row versus multi-row operations
-- insert just a single row
INSERT INTO tab
VALUES ( 1, db2gse.ST_LineString(
               'linestring (11.9963970 54.9979739, 11.9947259 55.0000000)',
               1003) )@

-- insert multiple rows at once
INSERT INTO tab
SELECT id, db2gse.ST_LineString(wkt, 1003)
FROM   TABLE ( VALUES
       ( 1, 'linestring (11.9963970 54.9979739, 11.9947259 55.0000000)' ),
       ( 2, 'linestring (11.9872250 55.0000000, 11.9963970 54.9979739)' ),
       ... ) AS t(id, wkt)@

The script test_multi_row.sql available in the download section runs the two different approaches with the same set of data. The data that is being inserted stems originally from the europe/roads.shp shapefile, which we used before. The first approach took 0.050 seconds in total, whereas the combined statement could be completed in 0.012 seconds.

It should be noted that a major part of the time is spent parsing and compiling the SQL statement. If prepared statements are involved, the differences could increase or decrease, depending on your application, system configuration, and the data involved. However, it is rather unlikely that the first scenario would outperform the multi-row statement. Furthermore, you should keep in mind that the constructor function ST_LineString does not perform any complicated spatial computations; it is just a single-sweep scan over the well-known text representation, combined with a conversion of the coordinates to the internal encoding.

The function also has a very low memory footprint. Also remember that prepared statements can be used for the multi-row insert -- for example, in static embedded SQL applications -- and the multi-row insert can be executed several times, taking advantage of the described benefit. The very same technique is used in the spatial import procedures, and that is the reason for an informational message like "Using 342 rows per single INSERT statement" appearing at the beginning of the messages file. During import, the number of rows per statement is constrained by the maximum possible size of the SQL statement, the commit scope, and the total number of rows being imported.
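
The exact rule by which the import procedure picks the number of rows per statement is internal to the Spatial Extender. Purely as an illustration of the three constraints named above, here is a hypothetical sketch (the function name, the overhead constant, and the per-row size estimate are all assumptions, not the actual implementation):

```python
def rows_per_insert(max_stmt_bytes, avg_row_literal_bytes,
                    commit_scope, rows_remaining, overhead_bytes=256):
    """Hypothetical bound on rows combined into one multi-row INSERT:
    whichever of the three limits -- statement size, commit scope,
    rows still to be imported -- is reached first wins. At least one
    row per statement is always possible."""
    by_size = (max_stmt_bytes - overhead_bytes) // avg_row_literal_bytes
    return max(1, min(by_size, commit_scope, rows_remaining))
```

With a 2 MB statement limit and roughly 6 KB of well-known text per row, the statement size is the binding limit; with a small commit scope, the commit scope wins instead.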

The same considerations as for INSERT statements also apply to SELECT and UPDATE statements. For example, the DB2 Spatial Extender stored procedure ST_run_gc geocodes a set of rows. If you specify a value for the commit scope, the procedure cannot process all affected rows at once; it has to count the rows and execute a COMMIT in between. The naive approach uses a cursor to scan the table and executes a positioned update statement for each row. However, the positioned update runs into exactly the issue we just discussed: the geocoder has to be initialized anew for each and every execution.

Given that geocoding is much more complex than simply constructing a linestring, the performance impact is much more noticeable. So a different solution is used if an identifying column, such as a primary key, exists in the table. We use the DB2 OLAP function ROW_NUMBER() to assign a unique number to each affected row and then run a block-wise searched update, followed by a COMMIT. Listing 13 illustrates this. The inner-most sub-select identifies all rows that need to be updated; the user might have given some conditions to restrict the rows. Additionally, a row number is assigned to each row. The next step filters, based on the row numbers, only those rows that belong to the current block that shall be geocoded. And finally, the outer-most UPDATE feeds those rows to the geocoder function one after the other within the scope of the same SQL statement. That way, the geocoder has to do its initialization only once per block and not once for each row.

Listing 13. Using searched update for geocoding
UPDATE <user_table>
SET    location = <geocoder_function> ( <parameters> )
WHERE  id IN ( SELECT t.id
               FROM   ( SELECT id, ROW_NUMBER() OVER ( ORDER BY id ) AS rn
                        FROM   <user_table>
                        WHERE  <geocoding_selection> ) AS t
               WHERE  t.rn BETWEEN <first_row_num> AND <last_row_num> )@

Reduce number of calls to spatial functions

When you look at the spatial query that we used before in this article, you will notice that a new geometry is constructed as the parameter for the overlap test. Now, the ST_LineString constructor is a deterministic function without any side effects. The DB2 optimizer is aware of those conditions, and it could decide that it does not hurt to call the function several times instead of just once. Depending on your system and query, this might be a smart choice, but it could also be a sub-optimal one. For example, in a partitioned environment using DB2's DPF feature, constructing the linestring on each partition might very well outperform constructing the geometry on just a single partition and distributing it through table queues to the other partitions where the value is needed. In another scenario, it could be counter-productive to generate the linestring anew for each and every row against which the spatial overlap test shall be performed. Therefore, it might be worthwhile to restructure your spatial query, using common table expressions, to ensure just a single call to the constructor function. In Listing 14 you can find first the original query, followed by a reformulated one. The SQL script that is executed is included in the download section. Both queries express the same semantics, but the second one runs about 9% faster on our system and database configuration.

Listing 14. Rephrasing spatial queries
$ db2batch -d testdb -f test_cte.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT id
FROM   roads2
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1

Prepare Time is:           0.000      seconds  
Execute Time is:           0.819      seconds
Fetch Time is:             0.000      seconds  
Elapsed Time is:           0.819      seconds  

---------------------------------------------

Statement number: 2

WITH t(g) AS
   ( VALUES ( db2gse.ST_LineString('linestring(10 50, 20 40)', 1003) ) )
SELECT r.id
FROM   roads2 AS r, t
WHERE  db2gse.ST_Intersects(r.shape, t.g) = 1

Prepare Time is:           0.000      seconds  
Execute Time is:           0.744      seconds
Fetch Time is:             0.000      seconds  
Elapsed Time is:           0.745      seconds  

---------------------------------------------

Using a spatial grid index

So far we have discussed many different ways to improve the performance of spatial operations. We will now discuss the most obvious one, the exploitation of spatial indexes. Here we explain what you have to do to give the DB2 optimizer the choice of using a spatial index. The tuning of the index itself is explained in the next section.

A spatial index is an extended index built on top of DB2's extensible indexing framework (see the Resources section for an article on index extensions). Due to the multi-dimensional nature of spatial data, DB2's usual B*-tree indexes are not well suited, so the DB2 Spatial Extender provides specialized indexing mechanisms. A DB2 index extension comprises three individual pieces:

  • Key generator function to construct index key during INSERT and UPDATE operations.
  • Range producer function to define the search ranges for the spatial index during query execution.
  • Spatial predicates that tell the DB2 optimizer under which conditions a spatial index could be exploited.

All those pieces are already defined by the DB2 Spatial Extender. Besides actually creating a spatial index, however, you have to be familiar with the last piece. The spatial predicates are associated with the functions that compare two geometries, namely ST_Contains, ST_Within, ST_Intersects, ST_Crosses, ST_Overlaps, ST_Touches, ST_EnvIntersects, ST_MBRIntersects, ST_Equals, and ST_Distance. None of the other spatial functions can exploit a grid index. Furthermore, a spatial grid index can only be used if one of those functions occurs in the WHERE clause of the query and if at least one of its parameters identifies a column on which such a grid index was defined. This sounds like a lot of conditions, but so far it is straightforward: through the column, the index can be found, and through the function, the spatial predicate is known.

Additionally, DB2 requires that you follow a basic syntax rule so that it can detect the potential application of a spatial predicate. The function call must occur on the left-hand side of an equality comparison with the value 1. An exception is the function ST_Distance, which must be used in a less-than comparison with an arbitrary distance. Listing 15 sketches the two correct specifications.

Listing 15. Syntax rules to exploit the spatial index
SELECT ...
FROM   <user_table>
WHERE  ST_Intersects(<indexed_shape_column>, ...) = 1@

SELECT ...
FROM   <user_table>
WHERE  ST_Distance(..., <indexed_shape_column>) < <some_distance>@

Even if all conditions are met, including the simple syntax rule, it is still not guaranteed that a spatial grid index is used to satisfy the query. The DB2 optimizer computes different access plans and tries to find the one with the minimal overall execution costs. To check whether a spatial grid index is actually exploited at query time, you should have a look at the access plan. The access plan can be collected with the db2expln command line tool, or by right-clicking your database in the DB2 Control Center and choosing the option Explain SQL. Either way, you provide the query, and the access plan is generated and shown to you. Figure 3 depicts the access plan for the very query that we used extensively before in this article. If the plan includes a scan of the extended index, you will see the EISCAN operator (highlighted in Figure 3). Additionally, you find the name that was assigned to the grid index right below the EISCAN.

Figure 3. Access plan for a spatial query with grid index scan
Access plan for a spatial query with grid index scan

As was explained before, the DB2 optimizer is not fully aware of all the details of the spatial grid index. In particular, it is rather hard for the optimizer to determine the costs and the selectivity of such an index scan. The approach chosen by the DB2 development team to close this gap is to give the user the ability to provide an estimate for the selectivity of a spatial predicate. To that end, you can append the SELECTIVITY keyword right after the predicate in the WHERE clause, followed by an estimated or guessed selectivity value in the range between 0 and 1. The lower this value is, the more likely the optimizer will choose to scan the grid index. Listing 16 gives an example of a query where the optimizer is given the hint that the spatial predicate will qualify only very few rows.

Listing 16. Specifying the selectivity for a spatial predicate
SELECT ...
FROM   <user_table>
WHERE  ST_Intersects(<indexed_shape_column>, ...) = 1 SELECTIVITY 0.000001@

Tuning the spatial grid index

The Spatial Extender provides an index advisor that helps you tune your spatial index. The advisor is available through the command line tool gseidx, and it follows a verbose syntax, very much like SQL itself. The tool can not only be used to get suggestions for the various grid sizes but also to collect statistics on an existing or planned (virtual) index. So you can find out how many index entries would be generated on each grid level if you chose certain grid sizes -- without actually creating and materializing the index. Note that the recommendations given by the index advisor are a starting point for your index optimization efforts.

Listing 17. Sample output of the Spatial Extender index advisor
$ gseidx "CONNECT TO testdb GET GEOMETRY STATISTICS FOR COLUMN roads(shape) \
        USING GRID SIZES (0.5) SHOW HISTOGRAM WITH 10 BUCKETS"

Number of Rows: 110979
Number of non-empty Geometries: 110979
Number of empty Geometries: 0
Number of null values: 0

Extent covered by data:
    Minimum X: -31.257690
    Maximum X: 66.074104
    Minimum Y: 34.824085
    Maximum Y: 72.000000


Grid Level 1
------------

Grid Size                     : 0.5
Number of Geometries          : 110973
Number of Index Entries       : 147461

Number of occupied Grid Cells : 6596
Index Entry/Geometry ratio    : 1.328801
Geometry/Grid Cell ratio      : 16.824287
Maximum number of Geometries per Grid Cell: 257
Minimum number of Geometries per Grid Cell: 1

Index Entries :  1      2      3      4      10    
---------------  ------ ------ ------ ------ ------
Absolute      :  82240  24962  236    3361   174   
Percentage (%):  74.11  22.49  0.21   3.03   0.16  


Grid Level X
------------

Number of Geometries          : 6
Number of Index Entries       : 6


Histogram:
----------
    MBR Size             Geometry Count
    -------------------- --------------------
                0.340000               105777
                0.680000                 4750
                1.020000                  334
                1.360000                   80
                1.700000                   22
                2.040000                    4
                2.380000                    5
                2.720000                    5
                3.400000                    2


$ gseidx "CONNECT TO testdb GET GEOMETRY STATISTICS FOR COLUMN roads(shape) \
        ADVISE GRID SIZES"

Query Window Size:     Suggested Grid Sizes:           Index Entry Cost:
--------------------   -----------------------------   ----------------------
      0.01:             0.27,      0.54,       1.6             8.3
      0.02:             0.27,      0.54,       1.6             8.7
      0.05:             0.27,      0.54,       1.6             9.9
       0.1:             0.27,      0.54,       1.6              12
       0.2:             0.17,      0.51,       1.8              17
       0.5:             0.17,      0.51,       1.8              40
         1:             0.27,      0.68,       1.7             100
         2:             0.43,       1.1,       2.2             290
         5:             0.68,       2.4,       4.8            1300
        10:              1.1,         5,         0            4500
        20:              1.7,        10,         0           15000

If you are familiar with spatial data and the grid index, the results are pretty much self-explanatory. More details on the mechanisms of the spatial grid index and the index advisor can be found in the DB2 Spatial Extender User's Guide and Reference (see the Resources section). It should be mentioned that past experience has shown that clustering the data according to its spatial properties and using a properly defined buffer pool has a much more noticeable effect on spatial performance than an extremely finely tuned grid index -- as long as the parameters of the index are generally in the right ballpark.
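
To get an intuition for the "Index Entry/Geometry ratio" reported by gseidx: on a given grid level, one index entry is generated for every grid cell that a geometry's minimum bounding rectangle overlaps. A simplified Python estimate of that count (the real index also promotes large geometries to coarser levels, which this sketch ignores):

```python
import math

def index_entries(xmin, ymin, xmax, ymax, grid_size):
    """Number of grid cells overlapped by a geometry's MBR at one
    grid level; one index entry is generated per overlapped cell."""
    nx = math.floor(xmax / grid_size) - math.floor(xmin / grid_size) + 1
    ny = math.floor(ymax / grid_size) - math.floor(ymin / grid_size) + 1
    return nx * ny
```

This shows why the grid size matters: a grid that is too fine relative to the typical MBR size inflates the number of index entries per geometry, while a grid that is too coarse puts many geometries into the same cell and weakens the filtering.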


Choosing the table space type

If you have very frequent data modification operations and fewer queries, you should focus on good write performance. DB2 places all the data of a database into table spaces. The administrator can choose the type of a table space as well as the containers that make up the table space. The type of a table space can be database managed (DMS) or system managed (SMS), and that decision can have an impact on the write performance for spatial data.

It is recommended that you choose a DMS table space for your LONG data, in other words, the table space that holds LOBs. The effect is that lobified spatial data goes to that table space. The reason for that recommendation lies in the internal workings of DB2: LOBs that go to a DMS table space can be written in a more asynchronous fashion, whereas LOBs written to an SMS table space require synchronous file I/O.

Once you have settled on DMS table spaces, you can furthermore choose between raw devices and file system based containers for the table space. For lobified spatial data, it is usually better to go with the file system. The rationale is this: access to a file system based container goes through the operating system kernel, and the operating system provides a file system cache to speed up repeated access to files. Access to a raw device bypasses that cache, resulting in direct read and write operations on the physical device. Now, for normal database operation you do not really have to worry about the file system cache because DB2's buffer pools do the necessary caching already. But things are different when it comes to LOBs: due to the different internal storage model and the potentially huge objects (up to 2GB), no buffer pool is involved. So the file system cache can help quite a bit to avoid read and write operations on the disk.

Performance comparison

The effect of the different table space types is shown in Listing 18. First, the import operation is performed into an SMS table space, once with a small inline length and once with all spatial values stored inline; then the same imports are run against a DMS table space, again with both settings for the inline length. The target table space is specified with the tableCreationParameters option. Finally, our spatial query is run against both tables (those created with the small inline length) to show that query performance is not affected by the table space type.

Listing 18. Performance comparison of SMS versus DMS table spaces
$ time db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads_sms \
        -tableCreationParameters "IN userspace1" -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -inlineLength 292 \
        -idColumn id -commitScope 1500
GSE0000I  The operation was completed successfully.

real    3m5.618s
user    0m0.056s
sys     0m0.026s

$ time db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads_sms \
        -tableCreationParameters "IN userspace1" -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -inlineLength 2000 \
        -idColumn id -commitScope 1500
GSE0000I  The operation was completed successfully.

real    1m56.643s
user    0m0.049s
sys     0m0.026s

$ time db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads_dms \
        -tableCreationParameters "IN dms" -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -inlineLength 292 \
        -idColumn id -commitScope 1500
GSE0000I  The operation was completed successfully.

real    0m49.310s
user    0m0.053s
sys     0m0.028s

$ time db2se import_shape testdb -fileName /home/stolze/europe/roads \
        -srsName WGS84_SRS_1003 -tableName roads_dms \
        -tableCreationParameters "IN dms" -createTableFlag 1 \
        -spatialColumn shape -typeName ST_LineString -inlineLength 2000 \
        -idColumn id -commitScope 1500
GSE0000I  The operation was completed successfully.

real    0m38.766s
user    0m0.054s
sys     0m0.024s

$ db2batch -d testdb -f test_tablespace.sql -i complete -s on

---------------------------------------------

Statement number: 1

SELECT id
FROM   roads_sms
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1

Prepare Time is:           0.000      seconds  
Execute Time is:           0.942      seconds
Fetch Time is:             0.000      seconds  
Elapsed Time is:           0.943      seconds

---------------------------------------------

Statement number: 2

SELECT id
FROM   roads_dms
WHERE  db2gse.ST_Intersects(shape, db2gse.ST_LineString(
          'linestring(10 50, 20 40)', 1003)) = 1

Prepare Time is:           0.000      seconds  
Execute Time is:           0.953      seconds
Fetch Time is:             0.000      seconds  
Elapsed Time is:           0.954      seconds

---------------------------------------------

Inserting the data into a table residing in a DMS table space takes about one-fourth of the time needed for the SMS table space. When interpreting the numbers, remember the fundamental differences between DMS and SMS table spaces. A DMS table space is preallocated when it is created, which means the pages onto which the data is placed already exist. SMS table spaces, on the other hand, grow and shrink dynamically at run time, so the import operation causes many new pages to be allocated and, as part of that process, the table space (and its files) grows. A major portion of the performance improvement can therefore be attributed to the pre-allocation of pages in DMS. However, comparing the run times achieved with the different inline lengths shows that with the small inline length, where more geometries are lobified, the improvement from SMS to DMS is about 73%, while with the large inline length, where hardly any geometries are lobified, it is only about 67%. So an additional part of the improvement apparently stems from the better handling of lobified data.
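The percentages quoted above can be recomputed from the wall-clock times in Listing 18. The following short Python sketch does the arithmetic:

```python
# Wall-clock import times from Listing 18, in seconds.
times = {
    ("SMS", 292):  3 * 60 + 5.618,   # roads_sms, inline length 292
    ("SMS", 2000): 1 * 60 + 56.643,  # roads_sms, inline length 2000
    ("DMS", 292):  49.310,           # roads_dms, inline length 292
    ("DMS", 2000): 38.766,           # roads_dms, inline length 2000
}

def improvement(sms, dms):
    """Relative improvement of DMS over SMS, in percent."""
    return 100.0 * (sms - dms) / sms

small = improvement(times[("SMS", 292)], times[("DMS", 292)])
large = improvement(times[("SMS", 2000)], times[("DMS", 2000)])
print(f"inline length  292: {small:.1f}% faster on DMS")
print(f"inline length 2000: {large:.1f}% faster on DMS")
```

The gap between the two percentages is what suggests that DMS helps lobified data beyond the pure page pre-allocation effect.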


Summary

In this article we have shown several important techniques to improve the performance of spatial databases. The steps covered range from basic system tuning to setting the inline length of spatial data, clustering the rows of a table according to their spatial properties, tuning the spatial grid index, and choosing the proper type of table space. The reasons for certain decisions were explained and some general guidelines given. Additionally, the effect of each individual option was demonstrated on a very simple scenario. The results you obtain from applying the suggested guidelines can and will depend on the actual data in your database and your overall system and database configuration.


Download

Description: SQL scripts used in this article
Name: spatial_tuning.zip ( HTTP | FTP )
Size: 19 KB

Resources
