DB2utor

May 13, 2008

How to Write Efficient SQL (Part 2)

Last week I reviewed concepts and terms that are used to explain how DB2 filters and retrieves rows of data based on different types of predicates. I also offered some general rules for coding efficient SQL.

To write more efficient SQL, you must understand how DB2 processes predicates. DB2 has two sets of rules that dictate the order of predicate evaluation: the first set describes the order of processing by stage; the second describes the order of processing within a stage.

Rule Set 1
1) Indexable predicates are processed first and applied as key-column filters in the index space. Predicates that match on key columns of the index are processed first, followed by stage 1 predicates that weren't chosen as matching predicates but still reference columns in the index (index screening).
2) Other stage 1 predicates are applied next. After index processing, these predicates are applied to the data in the table space.
3) Stage 2 predicates are applied to the rows returned from stage 1.

Rule Set 2
1) All equal predicates are evaluated first. This includes predicates written as an IN list with only one value and as column BETWEEN value1 AND value1.
2) All range predicates and predicates in the form of “column IS NOT NULL” are evaluated.
3) All other predicate types are evaluated.

Rule sets 1 and 2 are applied in the given order, regardless of the order in which you've written the predicates in the query. Once these two rule sets have been applied, predicates of the same type are evaluated in the order in which they appear in the query, so you do have some control over the order of evaluation.
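To make the second rule set concrete, here's a hypothetical query against the sample employee table (the predicates and values are my own); the comments indicate where each predicate falls in the evaluation order described above:

SELECT EMPNO, LASTNAME, SALARY
  FROM DSN8810.EMP
 WHERE SALARY = 50000               -- equal predicate: evaluated first
   AND BONUS BETWEEN 500 AND 1000   -- range predicate: evaluated second
   AND LASTNAME LIKE '%SON'         -- other predicate type: evaluated last

All three predicates here are stage 1, so rule set 2 alone determines the order: the equal predicate on SALARY, then the range predicate on BONUS, then the LIKE predicate.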

The exception to this rule is subqueries. DB2 always evaluates a non-correlated subquery before a correlated subquery, unless it correlates, de-correlates, or transforms the subquery into a join. New to DB2 V9 is the capability to look at the cost of processing the subquery and rewrite it as either correlated or non-correlated, regardless of how you originally wrote it.

A general rule to remember: when you form a compound predicate by combining two simple predicates with an OR operator, the result takes on the characteristics of the simple predicate that is evaluated last. For example, if two indexable stage 1 predicates are combined with an OR operator, each predicate is evaluated as stage 1. However, if a stage 1 predicate is combined with a stage 2 predicate by an OR operator, the compound predicate is evaluated at stage 2. Any predicate that's associated with a DECFLOAT data type is stage 2.
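For example, assuming (as is typical on these releases) that applying a function to a column makes the predicate stage 2, the following hypothetical compound predicate is processed entirely at stage 2 because of the OR:

WHERE SALARY = 50000                -- stage 1, indexable on its own
   OR SUBSTR(LASTNAME, 1, 1) = 'S'  -- column expression: stage 2
-- Because the two simple predicates are ORed together, the whole
-- compound predicate is evaluated at stage 2.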

The DB2 9 for z/OS Performance Monitoring and Tuning Guide has some excellent examples that show how predicates are processed.

For the following examples of predicate evaluation, I used the sample table DSN8810.EMP with a new non-unique index I defined on (SALARY,BONUS,COMM).
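The index definition looked something like this (the index name is my own invention; only the column order matters for these examples):

CREATE INDEX DSN8810.XEMPSBC
    ON DSN8810.EMP
       (SALARY, BONUS, COMM);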

*WHERE SALARY = value AND BONUS = value

Both predicates are stage 1 and the compound predicate is indexable. A matching index scan could be used with SALARY and BONUS as matching columns.

*WHERE SALARY = value OR BONUS = value

Both predicates are stage 1, but they're not Boolean terms. The compound predicate is indexable. However, multiple index access is unavailable because we don't have an index with BONUS as a leading column. For single-index access, SALARY and BONUS can only be index screening columns.

*WHERE E1.SALARY IN (SELECT E2.SALARY
                       FROM SBTTEST.EMPL E2
                      WHERE E2.EMPNO = '123456'
                        AND E2.EMPNO = E1.EMPNO)

The correlated subquery is a stage 2 predicate. However, DB2 might transform the query from a correlated to a non-correlated form during processing, in which case the predicate becomes stage 1 and indexable.
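Conceptually, a join transformation of that subquery might look something like the following sketch (this is just to illustrate the idea, not the optimizer's actual rewrite, and it assumes the outer query reads the same employee table with correlation name E1):

SELECT E1.EMPNO, E1.LASTNAME
  FROM SBTTEST.EMPL E1
     , SBTTEST.EMPL E2
 WHERE E2.EMPNO  = '123456'
   AND E2.EMPNO  = E1.EMPNO
   AND E2.SALARY = E1.SALARY

Written as a join, the same conditions become ordinary equality and join predicates, which DB2 can evaluate at stage 1.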

The key to writing efficient SQL is learning the rules for indexable, stage 1, and stage 2 predicates, and understanding how a compound predicate can turn what you thought was a stage 1 predicate into a stage 2 predicate. Once you understand this, you can see why the order in which you write the predicates can be important.

Most people will tell you that the order doesn't matter because the DB2 optimizer is very smart and will rewrite the query for you. There's some truth to this, and of course the optimizer gets better with each DB2 release. However, if you understand why the order is important and practice writing SQL queries with these rules in mind, your queries will run efficiently even in situations where DB2 can't rewrite the predicates. The general rule is to write the query with the highest-filtering predicates first. This reduces the number of rows that the next predicate must evaluate.

Filter Factor

The amount of filtering is determined by what's known as the filter factor. A predicate's filter factor is always a number between 0 and 1. This number is an estimate of the proportion of rows in the table for which the predicate is true. The rows for which the predicate is true qualify and are passed to the next step of evaluation.

For example, say you have an employee table with an 'M' or 'F' value in the SEX column. In the absence of column statistics, DB2 will give SEX = 'F' a filter factor of .5, or 50 percent.

The three main components that determine the filter factor are:

1) The constant (literal) value used in the predicate.
2) The operator in the predicate.
3) Statistics on the column in the predicate.

The variable in the above list that most people miss is statistics. You must run the RUNSTATS utility to collect the needed column cardinality statistics used by the optimizer to choose a good access path.
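For example, a minimal RUNSTATS control statement for the sample employee table might look like the sketch below (DSN8D81A.DSN8S81E is the V8 sample database and table space; substitute your own object names):

RUNSTATS TABLESPACE DSN8D81A.DSN8S81E
  TABLE(DSN8810.EMP) COLUMN(ALL)
  INDEX(ALL)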

When no statistics are collected on the column being used in the predicate, DB2 determines the number of rows that qualify based on these defaults:

Predicate Type                  Filter Factor
=============================   ==========================
COL = constant                  1/25
COL <> constant                 1 – (1/25)
COL IS NULL                     1/25
COL IS NOT DISTINCT FROM        1/25
COL IS DISTINCT FROM            1 – (1/25)
COL IN (constant list)          (number of constants)/25
COL OP constant                 1/3
COL LIKE constant               1/10
COL BETWEEN const1 AND const2   1/10

When you have statistics collected, the formula changes to:

Predicate Type                  Filter Factor
=============================   ==============================
COL = constant                  1/COLCARDF
COL <> constant                 1 – (1/COLCARDF)
COL IS NULL                     1/COLCARDF
COL IS NOT DISTINCT FROM        1/COLCARDF
COL IS DISTINCT FROM            1 – (1/COLCARDF)
COL IN (constant list)          (number of constants)/COLCARDF
COL OP constant                 interpolation formula
COL LIKE constant               interpolation formula
COL BETWEEN const1 AND const2   interpolation formula

Example: I have 500,000 rows in the employee table, and the LASTNAME column has COLCARDF = 1728.

What's the filter factor for this predicate?

WHERE LASTNAME = 'value'

The default with no statistics collected is calculated as:

1/25 = .04, so .04 * 500,000 = 20,000 rows are estimated to qualify.

The calculation with statistics would be:

1/1728 = .000579, so .000579 * 500,000 = about 289 rows are estimated to qualify.
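If you want to see the numbers DB2 will use, you can query the catalog directly. This quick check assumes statistics have been collected (COLCARDF is -1 when they haven't) and that you're using the sample table qualifiers:

SELECT NAME, COLCARDF, HIGH2KEY, LOW2KEY
  FROM SYSIBM.SYSCOLUMNS
 WHERE TBCREATOR = 'DSN8810'
   AND TBNAME    = 'EMP'
   AND NAME      = 'LASTNAME';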

Interpolation Formula

You may have noticed the term interpolation formula in the above chart. For range predicates, DB2 uses this formula to estimate the ratio of the number of values in the range to the number of values in the entire column of the table.

In the following formulas, the HIGH2KEY and LOW2KEY values (the second-highest and second-lowest column values recorded by RUNSTATS) are taken from SYSIBM.SYSCOLUMNS for the column and table referenced. The value of total entries is calculated as HIGH2KEY – LOW2KEY.

* When the operator is < or <=, and the constant is not a host variable:
(constant value – LOW2KEY) / total entries

* When the operator is > or >=, and the constant is not a host variable:
(HIGH2KEY – constant value) / total entries

* When the operator is LIKE or BETWEEN:
(high constant value – low constant value) / total entries
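As a quick worked example (with made-up statistics), suppose RUNSTATS recorded LOW2KEY = 20000 and HIGH2KEY = 120000 for the SALARY column, so total entries = 100000. The predicate WHERE SALARY <= 45000 would then get a filter factor of (45000 - 20000) / 100000 = 0.25, and WHERE SALARY BETWEEN 40000 AND 60000 would get (60000 - 40000) / 100000 = 0.2.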

Column Statistics

DB2 can collect two kinds of distribution statistics:

* Frequency – This is the percentage of rows in the table that contain a given value for a column or a set of columns. Frequency statistics are collected by using the RUNSTATS FREQVAL option.
* Cardinality – This is the number of distinct values in a column or set of columns (COLCARDF), collected whenever RUNSTATS gathers statistics on those columns.

New to DB2 9 is the RUNSTATS HISTOGRAM option, which collects histogram statistics describing how the data is distributed across ranges of column values.

Don't get overwhelmed by all of these formulas -- just remember that RUNSTATS is your friend. RUNSTATS is the utility used to collect column statistics, and the DB2 optimizer uses these stats to determine how to access the data. The best way to help the optimizer find a good access path is to collect statistics; if your data is skewed, make sure you run RUNSTATS with the FREQVAL and HISTOGRAM options.
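As a sketch of what that looks like in DB2 9 (the database and table space names are the V9 samples; check the Utility Guide for the exact syntax on your release), a control statement that collects frequency and histogram statistics on skewed columns might be:

RUNSTATS TABLESPACE DSN8D91A.DSN8S91E
  TABLE(DSN8910.EMP)
  COLGROUP(SEX) FREQVAL COUNT 2
  COLGROUP(SALARY) HISTOGRAM NUMQUANTILES 10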

Today we looked at the order of predicate evaluation and how DB2 uses filter factors to estimate the number of rows that qualify for a predicate. Putting the predicates with the lowest filter factor (the most filtering) first reduces the number of rows that must be processed by the next predicate in the query.

Next week I'll wrap up this series with some problem queries and the solutions that make them more efficient. I'll also demonstrate how I use tools like Visual Explain for DB2 V8 and the Optimization Service Center (OSC) for V9 to quickly determine filter factors, which speeds up the process of writing efficient SQL.