DB2 SQL development: Best practices
What every developer should know
I personally agree that the majority of performance issues in DB2 relational database development are caused by program code. Of all the IT shops for which I have provided performance and tuning consulting work in, most of the performance issues are directly related to modifying application and SQL code, adding and altering indexes, or improving data statistics. Improvements are mostly found in the area of application and SQL code.
I have found that a relatively high percentage of developers really don't know the performance issues involved with SQL programming or how to fix them. And in defense of those developers, there is just too much pressure to get changes and new code into production, not allowing for any time for performance tuning. But there are a number of basics that all developers should know and shops need to educate in. This can save IT costs and minimize production incident reporting.
SQL 101 rules
All developers should know what I call a few of the SQL 101 rules:
- Never put SQL scalar functions on columns in the WHERE predicates. For example: WHERE YEAR(HIREDATE) = 2009 should be recoded as WHERE HIREDATE BETWEEN '2009-01-01'and '2009-12-31'. When placing SQL scalar functions on columns in the SELECT portion of the SQL statement does incur some minimal overhead, but applying it to a column in the WHERE clause causes the predicate to become stage 2 non indexable.
- Same applies for mathematics. For example: WHERE HIREDATE – 7 DAYS > :HV-DATE should be recoded as WHERE HIREDATE > :HV_DATE + 7 days. Having the mathematics on the host variable is not a problem.
- Code only the columns needed in the SQL SELECT. Having extra columns can have an effect on the optimizer's choice of index only processing, larger sort costs, and at times the join type chosen.
- Watch out for any ORDER BY, GROUP BY, DISTINCT, UNION, INTERSECT, and EXCEPT. These may cause sorts to occur in the processing. Make sure they are truly needed in the query.
- Minimize SQL requests from your application program. This is especially true in batch processing where the number of statements executed is typically high. Every time a SQL request is sent to DB2, there is overhead involved because it has to be sent from one address space in the operating system (for example Cobol or Java™) to the DB2 address space. In general, developers should minimize these requests by reducing the number of times they open and close cursors, execute select statements, and so on. DB2 has multi-row processing for developers specifically for this reason, where you can process more than one row at a time. Developers need to code more relationally and less procedurally.
All developers should know how to read a DB2 Explain. There are many pieces of information that come out of the Explain, but you should know how to read the basics. For example:
- Is there indexing being used or a table space scan?
- Which index is chosen, and what are the matching columns?
- Are there any sorts going on? Why? What is the number of rows going into the sort?
- Are there any Stage 2 predicates?
If there is a table space scan occurring, you need to figure out why the DB2 optimizer may have chosen this access path:
- Is the table involved small? Sometimes DB2 chooses to perform a scan over an index because of a small number of rows.
- Do the Runstat statistics say the number of rows is small, even though there may be a large number of rows in the table? Check the tables involved and see when the last Runstats utility was run.
- Are there any appropriate indexes to match the SQL predicates? What are the indexes on the tables?
- Is the query retrieving a large number of rows from the table in question, or does DB2 think there will be a large number of rows returned? What DB2 thinks the number the rows returned for a predicate is called a Filter Factor and comes out in the Explain as a percentage. The filter factor can be seen in the Visual Explain output of the Data Studio tool, or by querying the DSN_PREDICAT_TABLE after an Explain is executed.
- Are the SQL predicates involved with the table written in a poorly coded, non-indexable way? For example, scalar function on the column in predicates?
- Did the optimizer choose a non-clustering index? What is the amount of data to be returned? Sometimes, the optimizer might see that a table contains 1500 pages of data and the query returns a high enough number of rows that end up accessing a high percentage of those pages.
You need to look and see which index was chosen, and then further look in the Explain to see how efficient the index is being probed: There is a column in the Explain output that states the matching columns used in the chosen index. The more columns that get matched, the faster the index probing. If the Explain states 0 matching columns, then an index scan is occurring. This is typically not a good access path, you should research to understand why there are no matching columns.
If there are any sorts occurring, understand why and see if you can eliminate them. Sorts are as expensive as their size. Sort processing in DB2 can be very fast, but obviously the larger number of rows and the size of the rows makes a difference. If there are sorts occurring, and the number of rows is small, then that is unlikely the cause of a performance issue, but the most efficient sort is the one that never executes. Questions to ask: Can the sort be eliminated? Does it need to have a Distinct, Group By, Order By, Union, Intersect, Except, and so on? Can you eliminate any of these or can you rewrite the query differently to avoid a sort?
Stage 1 versus stage 2 predicates can make a difference in the performance of a SQL query. DB2 uses two stage architecture in processing of SQL predicates. In general, stage 2 predicates do not perform as well and consume extra CPU resources. The IBM Data Studio tool highlights any stage 2 predicates in a query, or the DSN_PREDICAT_TABLE can be queried after an Explain is executed. Many stage 2 predicates can be rewritten as a stage 1 predicate. If you search on DB2 Summary of Predicate Processing, you will be taken to the IBM manuals that list all the stage 1 and stage 2 predicates.
You should always code with two goals in mind. The first and most important is to get the data right. Whatever data you are processing or retrieving needs to be correct. Data integrity and correct reporting is number one in processing. The second goal is to get the processing executed as quickly as possible. Many times, you might lose sight of the second goal. You either do not know what to do to get a process to run faster, don't have time to worry about it, and/or blame other pieces of your environment. And often, you might think that the time spent processing was good compared to the number of rows processed, not ever being educated on how fast DB2 can process data if everything is right. The one thing (standard) that you should always know is how much data was processed during the runtime. Did the program process 10,000 rows of data or 10 million? How many OPEN CURSORS, SELECTS, UPDATES, DELETES, and so on occurred? The first question when there is a performance issue is typically: How much data was processed? Every program should display or log somewhere the amount of data processed on each execution.
And if you are working on a query or program for performance, where do you start? I would suggest following these steps:
- Check every predicate. Are they indexable and stage 1? Can you rewrite them more efficiently? Any stage 2 predicates?
- Can you rewrite any predicate different and contain the same logic? Sometime even rewriting 1 predicate may send the optimizer down a different access path.
- Check the number of times SQL requests are being sent to DB2. Is there any way to rewrite/redesign the program to minimize the number of requests being processed?
- Check the DB2 statistics. Are they current and correct? Often in test environments, the Runstat statistics are not current.
- Check the DB2 Explain output. Is there a tablespace scan? Any sorts? Any index processing with 0 matching column? Most times, developers want to see index processing chosen for their queries.
- Can you rewrite the query differently and obtain the same results? Sometimes queries can be written two, three, four, or more different ways to achieve the same output and they don't always optimize the same. One way can be much faster than another.
- Any subqueries involved? Try rewriting an IN subquery to an EXISTS subquery. At times, they optimize differently.
- Are there multiple subqueries? Make both subqueries the same type (correlated versus non-correlated) and then put most restrictive one first.
I have worked on many extremely large projects with many developers (locally and outsourced) and find it surprising how many developers do not know the basics of SQL development and the DB2 Explain tool. In my experience, projects that have SQL programming standards and guidelines and code walkthroughs save CPU costs and incident reporting when implemented in production. Often, management says there is no time for developers to execute DB2 Explains and perform code walkthroughs. I say there is definitely time and we will still be done by last Tuesday. I also often see poorly written code in production. I think it is the responsibility of the IT shops to educate their developers (whether local or out sourced) to the basics of SQL development. It all adds up to CPU costs. I have found that a little bit of training, shop coding standards, and program walkthroughs can totally change the culture of IT development departments, along with minimizing performance issues.
In summary, the following is a short list of SQL optimization coding standards and guidelines that you should follow:
- Do not code functions on columns in predicates.
- Do not code mathematics on columns in predicates.
- Watch out for any tablespace scans.
- Minimize the number of times DB2 SQL statements are sent.
- Watch out for any sorts.
- Only code the columns needed.
- Only sort on the columns needed. No need to ORDERY BY EMPNO, LASTNAME when you can ORDER BY EMPNO.
- Watch out for Union versus Union all. Union causes a sort for uniqueness. Same for INTERSECT and EXCEPT.
- Code stage 1 predicates only. Rewrite any stage 2 predicates.
- Stay away from 'OR' logic if possible in connecting predicates. Boolean term predicates are best.
- Watch out for the 'LIKE' predicate. 'Begins With' logic is indexable. 'Contains' is not indexable. 'Ends With' is not indexable.
- Do not code 'Not Between'. Rewrite it as > HV or < HV.
- Use 'Fetch First XX Rows' whenever possible.
- Use FOR FETCH ONLY on all read only cursors.
- All Case logic should have an 'else' coded. This will eliminate DB2 returning nulls by default if all the Case conditions are not met.
- Hard code values in predicates whenever possible. Have special statistics on these hard code values if there are uneven distributions of values.
- Make sure cardinality statistics exist for all columns in all tables.
- Code most restrictive to least restrictive predicates in order.
- Stay away from 'Not' logic if possible.
- Code 'Not Exists' over 'Not In'. Both are stage 2 predicate, but 'Not Exists' typically outperforms the not in, especially if the list is long.
- Minimize the number of times cursors are opened and closed.
- Refer to DB2 for z/OS Best Practices.
- Refer to DB2 for z/OS: Technical Resources.
- Refer to DB2 Developers Guide by Craig Mullins.
- Refer to DB2 SQL Tuning Tips for z/OS Developersby Tony Andrews.
- Download the IBM Data Studio client.