Use SELECTIVITY to supply missing information

Some query predicates are inherently difficult for the optimizer to analyze, yet effective optimization depends on accurate information about the data being selected. When other statistics are unavailable, careful use of the SELECTIVITY clause can give the optimizer the information it needs to produce the best access plan for a query.

The optimizer uses a variety of techniques to evaluate and estimate the number of rows a query predicate will select. This information may come from key range estimates, from column statistics, or from inferences based on the cardinality of the data. Sometimes, however, the optimizer has no way to determine how a predicate will apply to the underlying data. In these cases, the SELECTIVITY clause can be added to difficult predicates to provide the extra information that the optimizer needs.

Using SELECTIVITY to correct inaccurate estimates

One common type of incomplete information involves data that is transformed before it is used in a predicate. The transformation could be performed by a user defined function or it could be performed by a built in function. In either case, something about the function prevents the optimizer from making inferences about how the transformation will affect the data.

Consider the following query:

 SELECT EXCHANGE, S.SYMBOL, NAME, CURRENT_POSITION 
  FROM SECURITIES S INNER JOIN POSITIONS P ON S.SYMBOL = P.SYMBOL 
  WHERE RISK_SCORE(EXCHANGE, S.SYMBOL) > .8 
    AND EXCHANGE = 'NYSE' 
    AND CURRENT_POSITION > 1000 
ORDER BY EXCHANGE, S.SYMBOL

Because the optimizer does not know the internal logic of the RISK_SCORE function, it uses a default selectivity value. The optimizer will generally assume 33% of the rows satisfy the predicate RISK_SCORE(EXCHANGE, S.SYMBOL) > .8. If this assumption is not accurate, it could cause the optimizer to select an inappropriate query plan. To mitigate this, a user who understands the behavior of the RISK_SCORE function and knows that only 1% of the rows in the table will satisfy the predicate can rewrite the WHERE clause as:

WHERE RISK_SCORE(EXCHANGE, S.SYMBOL) > .8 SELECTIVITY .01
  AND EXCHANGE = 'NYSE' 
  AND CURRENT_POSITION > 10000

Another common cause of missing information is when the query selects a small number of rows that have been added to the table very recently. These rows may not yet be reflected in the statistical information automatically maintained by the statistics manager, leading the optimizer to believe that the query selects no rows. As in the above example, the solution is to add a SELECTIVITY clause that accurately represents the percentage of rows selected by the predicate.

To determine when the use of SELECTIVITY could be helpful for a query, use Visual Explain to identify operations with significantly inaccurate Percent Selectivity attributes under the Estimated rows selected and query join info heading. In many cases, you may have a reasonable intuition for the correct selectivity for the operation. In cases where you need additional information, either Run and Explain or use Visual Explain on a plan that is cached in the plan cache, since these options provide you with information about the actual number of rows processed by each operation. These values are given as Actual Rows Selected Per Plan Step Iteration and can be compared to the estimated Rows Selected Per Plan Step Iteration. Keep in mind that a given operation could process multiple predicates that have been logically combined. Since SELECTIVITY applies to an individual predicate, the effect of SELECTIVITY on the overall Percent Selectivity of the operation can be difficult to predict. In general, the effect of adding SELECTIVITY is easiest to predict and understand when the operation has only a single simple predicate (for example, a UDF.)

Using SELECTIVITY to reduce plan volatility

The optimizer uses a variety of factors when re-validating whether a cached access plan should be re-used. One key factor is the selectivity of the query's predicates. The same predicate with different host variable values may select a very different number of rows and thus run best with a different access plan. For this reason, the optimizer will not re-use a cached plan that it estimates will select a significantly different number of rows. Instead, it will re-optimize the query with the new values.

Under some circumstances, you may prefer to use a single plan for all host variable values rather than relying on the dynamic behavior of the optimizer. If the plan is changing due to the selectivity of a host variable value changing, the SELECTIVITY clause can help to lock in a preferred plan.

For example, consider the following query where the plan changes depending on the given value (for example, 1000 rows vs 1,000,000 rows are returned.)

SELECT ... FROM EMPLOYEE WHERE SALARY > ?

If the preferred plan is known to occur when SALARY > 40000 and it is known that 20% of the rows are selected by this predicate, then this query can be rewritten as:

SELECT ... FROM EMPLOYEE WHERE SALARY > ? SELECTIVITY .2

When running this query, changes to the host variable value will no longer influence the optimized plan.

Considerations for using SELECTIVITY

While SELECTIVITY can be useful when applied carefully, it can also produce some surprising results. Used without care, SELECTIVITY can cause more harm than benefit.

The optimizer integrates multiple pieces of information - including I/O costs, predicate selectivity, and correlation between columns - to form a coherent model of the data and the environment. By overriding the internal model, a user-provided SELECTIVITY value may introduce inconsistencies into and reduce some of the benefits of the model. As a result, SELECTIVITY can lead to less accurate estimates of I/O costs. It may also prevent the optimizer from detecting correlation between related columns and thereby reduce the accuracy of the estimates for the overall set of predicates.

Furthermore, by locking in a single value, SELECTIVITY prevents the optimizer from automatically responding to changes in the underlying data. As the data grows and changes, the SELECTIVITY value may become inaccurate, requiring you to occasionally re-evaluate and adjust it.

For these reasons, SELECTIVITY is best reserved for use in the limited situations described in this section. Before using SELECTIVITY to solve a query performance problem, follow the other recommendations in this document. For example, creating a new index or column statistic might help the optimizer not simply with one but with many queries. Keep in mind that a hard-to-optimize query may be a sign that re-writing the query or improving the data model is a better long-term strategy. SELECTIVITY should be a last resort when no other options exist for providing the optimizer with the needed information.