When it comes to assuring optimal performance of enterprise applications, programmers have an advantage in the capabilities of the DB2 optimizer to handle even "problem" SQL statements and give efficient access paths. Nevertheless, poorly coded SQL and application code can still give you performance problems that can be easily avoided by learning a few basic guidelines. I'll show you how the DB2 optimizer works, and give you guidelines for writing SQL that will get the most out of the optimizer. But even with the optimization abilities of DB2, writing efficient SQL statements can be a tricky proposition. This is especially true for programmers and developers new to a relational database environment. So, before we delve into the specifics of coding SQL for performance, let's take a few moments to review SQL basics.
Because of the higher level of abstraction it provides, SQL, unlike procedural languages, lets programmers focus on what data they need, not how to retrieve it. You code SQL without embedded data-navigational instructions. DB2 analyzes the SQL and formulates data-navigational instructions "behind the scenes." These data-navigational instructions are called access paths. Having the DBMS determine the optimal access path to the data lifts a heavy burden off the programmer's shoulders. In addition, the database can have a better understanding of the state of the data it stores, and thereby can produce a more efficient and dynamic access path to the data. The result is that SQL, used properly, can provide for quicker application development.
Another SQL feature is that it's not merely a query language. You also use it to define data structures; control access to the data; and insert, modify, and delete occurrences of the data. By providing a common language, SQL eases communication among DBAs, systems programmers, application programmers, systems analysts, and end users. When all the participants in a project speak the same language, they create a synergy that can reduce overall system-development time.
Arguably, though, the single most important feature of SQL that has solidified its success is its capability to retrieve data easily using English-like syntax. Understand this is much easier than understanding pages and pages of program source code:
SELECT LASTNAME FROM EMP WHERE EMPNO = '000010';
Think about it: When accessing data from a file, the programmer would have to code instructions to open the file, start a loop, read a record, check whether the EMPNO field equals the proper value, check for end of file, go back to the beginning of the loop, and so on.
SQL is, by nature, quite flexible. It uses a free-form structure that lets the user develop SQL statements to suit his or her needs. The DBMS parses each SQL request before execution to check for proper syntax and to optimize the request. SQL statements do not need to start in any given column, and you can string them together on one line or break them apart on several lines. For example, this single-line SQL statement is equivalent to the three-line example I used previously:
SELECT LASTNAME FROM EMP WHERE EMPNO = '000010';
Another flexible feature of SQL is that you can formulate a single request in a number of different and functionally equivalent ways. One example: SQL can join tables or nest queries. You can always convert a nested query into an equivalent join. You can see other examples of this flexibility in the vast array of functions and predicates. Examples of features with equivalent functionality include:
- BETWEEN versus <= / >=
- IN versus a series of predicates tied together with OR
- INNER JOIN versus tables strung together in the FROM clause separated by commas
- OUTER JOIN versus a simple SELECT, with a UNION, and a correlated subselect
- CASE expressions versus complex UNION ALL statements
This flexibility that SQL exhibits is not always desirable, because different but equivalent SQL formulations can deliver wildly different performance. I'll discuss the ramifications of this flexibility later in this article and provide guidelines for developing efficient SQL.
As I mentioned, SQL specifies what data to retrieve or manipulate, but does not specify how the database accomplishes these tasks. This keeps SQL intrinsically simple. If you can remember the set-at-a-time orientation of a relational database, you will begin to grasp SQL's essence and nature. A single SQL statement can act upon multiple rows. The capability to act on a set of data coupled with the lack of need for establishing how to retrieve and manipulate data defines SQL as a non-procedural language
Because SQL is a non-procedural language, a single statement can take the place of a series of procedures. Again, this is possible because SQL uses set-level processing and DB2 optimizes the query to determine the data-navigation logic. Sometimes one or two SQL statements can accomplish tasks that otherwise would require entire procedural programs to do.
The optimizer is the heart and soul of DB2. It analyzes SQL statements and determines the most efficient access path available for satisfying each statement (see Figure 1). DB2 accomplishes this by parsing the SQL statement to determine which tables and columns must be accessed. The DB2 optimizer then queries system information and statistics stored in the DB2 system catalog to determine the best method of accomplishing the tasks necessary to satisfy the SQL request.
Figure 1. DB2 optimization in action.
The optimizer is equivalent in function to an expert system. An expert system is a set of standard rules that, when combined with situational data, returns an "expert" opinion. For example, a medical expert system takes the set of rules determining which medication is useful for which illness, combines it with data describing the symptoms of ailments, and applies that knowledge base to a list of input symptoms. The DB2 optimizer renders expert opinions on data retrieval methods based on the situational data housed in DB2's system catalog and a query input in SQL format.
The notion of optimizing data access in the DBMS is one of the most powerful capabilities of DB2. Remember, you access DB2 data by telling DB2 what to retrieve, not how to retrieve it. Regardless of how the data is physically stored and manipulated, DB2 and SQL can still access that data. This separation of access criteria from physical storage characteristics is called physical data independence. DB2's optimizer is the component that accomplishes this physical data independence.
If you remove the indexes, DB2 can still access the data (although less efficiently). If you add a column to the table being accessed, DB2 can still manipulate the data without changing the program code. This is all possible because the physical access paths to DB2 data are not coded by programmers in application programs, but are generated by DB2.
Compare this with non-DBMS systems in which the programmer must know the physical structure of the data. If there is an index, the programmer must write appropriate code to use the index. If someone removes the index, the program will not work unless the programmer makes changes. Not so with DB2 and SQL. All this flexibility is attributable to DB2's capability to optimize data manipulation requests automatically.
The optimizer performs complex calculations based on a host of information. To visualize how the optimizer works, picture the optimizer as performing a four-step process:
- Receive and verify the syntax of the SQL statement.
- Analyze the environment and optimize the method of satisfying the SQL statement.
- Create machine-readable instructions to execute the optimized SQL.
- Execute the instructions or store them for future execution.
The second step of this process is the most intriguing. How does the optimizer decide how to execute the vast array of SQL statements that you can send its way?
The optimizer has many types of strategies for optimizing SQL. How does it choose which of these strategies to use in the optimized access paths? IBM does not publish the actual, in-depth details of how the optimizer determines the best access path, but the optimizer is a cost-based optimizer. This means the optimizer will always attempt to formulate an access path for each query that reduces overall cost. To accomplish this, the DB2 optimizer applies query cost formulas that evaluate and weigh four factors for each potential access path: the CPU cost, the I/O cost, statistical information in the DB2 system catalog, and the actual SQL statement.
Guidelines for performance
So, keeping the information about the DB2 optimizer in mind, you can implement these guidelines to facilitate better SQL performance:
1) Keep DB2 statistics up-to-date: Without the statistics stored in the DB2' system catalog, the optimizer will have a difficult time optimizing anything. These statistics provide the optimizer with information pertinent to the state of the tables that the SQL statement being optimized will access. The type of statistical information stored in the system catalog include:
- Information about tables, including the total number of rows, information about compression, and total number of pages;
- Information about columns, including number of discrete values for the column and the distribution range of values stored in the column;
- Information about table spaces, including the number of active pages;
- Current status of the index, including whether an index exists, the organization of the index (number of leaf pages and number of levels), the number of discrete values for the index key, and whether the index is clustered;
- Information about the table space and index node groups or partitions.
Statistics populate the DB2 system catalog when you execute the RUNSTATS or RUN STATISTICS utility. You can invoke this utility from the Control Center, in batch jobs, or using the command-line processor. Be sure to work with your DBA to ensure you accumulate statistics at the appropriate time, especially in a production environment.
2) Build appropriate indexes: Perhaps the most important thing you can do to assure optimal DB2 application performance is create correct indexes for your tables based on the queries your applications use. Of course, this is easier said than done. But we can start with some basics. For example, consider this SQL statement:
SELECT LASTNAME, SALARY FROM EMP WHERE EMPNO = '000010' AND DEPTNO = 'D01'
What index or indexes would make sense for this simple query? ""'First, think about all the possible indexes that you could create. Your first short list probably looks something like this:
- Index1 on EMPNO
- Index2 on DEPTNO
- Index3 on EMPNO and DEPTNO
This is a good start, and Index3 is probably the best of the lot. It lets DB2 use the index to immediately look up the row or rows that satisfy the two simple predicates in the WHERE clause. Of course, if you already have a lot of indexes on the EMP table, you might want to examine the impact of creating yet another index on the table. Factors to consider include:
- Modification impact: DB2 will automatically maintain every index you create. This means every INSERT and every DELETE to this table will insert and delete not just from the table, but also from its indexes. And if you UPDATE the value of a column that is in an index, you also update the index. So, indexes speed the process of retrieval but slow down modification.
- Columns in the existing indexes: If an index
already exists on EMPNO or DEPTNO, it might not be wise to create
another index on the combination. However, it might make sense to
change the other index to add the missing column. But not always,
because the order of the columns in the index can make a big
difference depending on the query. For example, consider this query:
SELECT LASTNAME, SALARY FROM EMP WHERE EMPNO = '000010' AND DEPTNO > 'D01';
In this case, EMPNO should be listed first in the index. And DEPTNO should be listed second, allowing DB2 to do a direct index lookup on the first column (EMPNO) and then a scan on the second (DEPTNO) for the greater-than.
Furthermore, if indexes already exist for both columns (one for EMPNO and one for DEPTNO), DB2 can use them both to satisfy this query so creating another index might not be necessary.
- Importance of this particular query: The more important the query, the more you might want to tune by index creation. If you are coding a query that the CIO will run every day, you want to make sure it delivers optimal performance. So building indexes for that particular query is important. On the other hand, a query for a clerk might not necessarily be weighted as high, so that query might have to make do with the indexes that already exist. Of course, the decision depends on the application's importance to the business-not just on the user's importance.
Index design involves much more than I have covered so far. For example, you might consider index overloading to achieve index-only access. If all the data that a SQL query asks for is contained in the index, DB2 might be able to satisfy the request using only the index. Consider our previous SQL statement. We asked for LASTNAME and SALARY, given information about EMPNO and DEPTNO. And we also started by creating an index on the EMPNO and DEPTNO columns. If we include LASTNAME and SALARY in the index as well, we never need to access the EMP table because all the data we need exists in the index. This technique can significantly improve performance because it cuts down on the number of I/O requests.
Keep in mind that making every query an index-only access is not prudent or even possible. You should save this technique for particularly troublesome or important SQL statements.
SQL coding guidelines
When you write SQL statements that access DB2 data, be sure to follow these three guidelines for coding SQL for performance.Of course, SQL performance is a complex topic, and understanding every nuance of how SQL performs can take a lifetime. That said, these simple rules put you on the right track for developing high-performing DB2 applications.
- The first rule is to always provide only the exact
columns that you need to retrieve in the
SELECT-list of each SQL SELECT statement. Another way of stating this
is, "do not use SELECT *". The shorthand SELECT * means you want to
retrieve all columns from the table(s) being accessed. This is fine
for quick and dirty queries, but is bad practice for application
- DB2 tables might need to change in the future to include additional columns. SELECT * will retrieve those new columns, too, and your program might not be capable of handling the additional data without requiring time-consuming changes.
- DB2 will consume additional resources for every column that requested to be returned. If the program does not need the data, it should not ask for it. Even if the program needs every column, it is better to ask for each column explicitly by name in the SQL statement for clarity and to avoid the previous pitfall.
- Do not ask for what you already know. This
might sound obvious, but most programmers violate this rule at one
time or another. For a typical example, consider what's wrong with
this SQL statement:
SELECT EMPNO, LASTNAME, SALARY FROM EMP WHERE EMPNO = '000010';
Give up? The problem is that EMPNO is included in the SELECT-list. You already know that EMPNO will be equal to the value '000010' because that is what the WHERE clause tells DB2 to do. But with EMPNO listed in the WHERE clause, DB2 will dutifully retrieve that column, too. This incurs additional overhead, thereby degrading performance.
- Use the WHERE clause to filter data in the
SQL instead of bringing it all into your program to filter. This, too,
is a common rookie mistake. It is much better for DB2 to filter the
data before returning it to your program. This is because DB2 uses
additional I/O and CPU resources to obtain each row of data. The fewer
rows passed to your program, the more efficient your SQL will be:
SELECT EMPNO, LASTNAME, SALARY FROM EMP WHERE SALARY > 50000.00;
This SQL is better than simply reading all the data without the WHERE clause and then checking each row to see if the SALARY is greater than 50000.00 in your program.
- Use parameterized queries. A parameterized
SQL statement contains variables, also known as parameters (or
parameter markers). A typical parameterized query uses these
parameters instead of literal values, so that WHERE clause conditions
can be changed at run time. Usually the program is designed such that
the end user can provide the values for the parameters before running
the query. This allows one query to be used to return different
results based on the different values provided to the parameter.
The key performance benefit of parameterized queries is that the optimizer can formulate an access path that can be reused over repeated executions of the statement. This can accrue a large performance gain for the program as compared to issuing a completely new SQL statement every time a new value is required in a WHERE clause.
These rules, though, are not the be-all, end-all of SQL performance tuning-not by a long shot. You might need additional, in-depth tuning. But following the preceding rules will ensure you are not making "rookie" mistakes that can kill application performance.
Specific database application development tips
Regardless of the programming language and environment you use to build DB2 applications, there are certain techniques and guidelines you can deploy to help assure good performance when accessing DB2 data.
Be sure to issue the COMMIT statement frequently in your application. A COMMIT statement controls the unit of work. Issuing a COMMIT records all the work since the last COMMIT statement "for keeps" in the database. Before issuing the COMMIT, you could roll back the work using the ROLLBACK statement. When you modify data (using INSERT, UPDATE, and DELETE statements) but don't issue a COMMIT, DB2 will hold and maintain locks on the data-which can cause other applications to time out waiting to retrieve locked data. By issuing COMMIT statements when work is complete, and ensuring that the data is accurate, you free up that data for other applications to use.
Also, build your applications with usage in mind. For example, proceed cautiously when a particular query returns thousands of rows to the end user. Rarely will more than a few hundred rows be useful for an online interaction between a program and an end user. You can use the FETCH FIRST n ROWS ONLY clause on your SQL statements to limit the amount of data returned to a query. For example, consider this query:
SELECT EMPNO, LASTNAME, SALARY FROM EMP WHERE SALARY > 10000.00 FETCH FIRST 200 ROWS ONLY;
This query will return only 200 rows. It does not matter if more than 200 rows would have qualified; DB2 will signal end of data with a +100 SQLCODE if you try to FETCH more than 200 rows from the query. This approach is useful when you want to limit the amount of data to return to your program.
DB2 supports another clause called OPTIMIZE FOR n ROWS, which will not limit the rows returned to the cursor, but can be helpful from a performance perspective. Use the OPTIMIZE FOR n ROWS clause to tell DB2 how to process the SQL statement. For example:
SELECT EMPNO, LASTNAME, SALARY FROM EMP WHERE SALARY > 10000.00 OPTIMIZE FOR 20 ROWS;
This tells DB2 to try to fetch the first 20 rows as quickly as possible. This is useful if your Delphi application displays rows retrieved from the database 20 at a time.
For read-only cursors, make the cursor unambiguous by using the FOR READ ONLY clause. For example:
SELECT EMPNO, LASTNAME, SALARY FROM EMP WHERE SALARY > 10000.00 FOR READ ONLY;
Understanding the basics of SQL coding for performance will give your enterprise applications an immediate performance boost. But I've only scraped the tip of the iceberg. You will need to learn about increasingly complex types of SQL, including joins, subselects, unions, and more. You'll also need to learn how best to write these SQL statements, and how to discover the access paths DB2 chose to satisfy your SQL requests. Indeed, there is much more to learn. But content yourself with the knowledge that you have embarked on the path of getting the most out of DB2 SQL.
- IBM DB2 Web site: http://www.ibm.com/software/data/db2/
- DB2 Magazine: http://www.db2mag.com
- International DB2 User's Group: http://www.idug.org
- The SQL Reference for Cross-Platform Development (IBM): http://www.ibm.com/developerworks/db2/library/techarticle/0206sqlref/0206sqlref.html
- Developing for DB2: http://www.db2mag.com/db_area/archives/2002/q2/eaton.shtml
- Existence Checking in DB2: http://www.dbazine.com/custard1.html
- A Quick Reference for Tuning DB2 Universal Database EEE: http://www.ibm.com/developerworks/db2/library/techarticle/0205parlapalli/0205parlapalli.html