October 05, 2010

Star Schema and Pairwise Join

DBAs who design and organize data warehouses should take the time to read the IBM Redbook, "Enterprise Data Warehousing with DB2 9 for z/OS." In addition to offering a detailed explanation of data warehousing's evolution on z/OS, it provides a great description of star schema database design.

Star schema is commonly used when a database's primary function is to allow users to answer BI questions. The design is typically composed of a large fact table and several smaller dimension tables. These dimension tables contain values that are given IDs. These IDs are stored in the fact table in place of the real values.

One performance issue that can emerge from star join processing (JOIN_TYPE = 'S') is that DB2 filters the dimension tables before the fact table. To deal with this, these tables must be brought into memory (indicated with ACCESS_TYPE = 'T' in the plan_table), and dedicated virtual memory pools must be created to support the star join operation. The DB2 V8 administration guide has a formula to help estimate the size of the virtual memory pool based on the number of concurrent star joins along with the number of work files and the estimated number of rows in the work files. As you can imagine, as dimensions grow larger in number and size, you need to support and the size of these dimensions can have a negative impact on performance.

DB2 9 introduced pairwise join (JOIN_TYPE = 'P') to support dynamic index ANDing. The pairwise join considers each dimension as an independent object to match the fact table on single column indexes.

Pairwise join offers these benefits:
•    DB2 filters the dimension tables before accessing the fact table.
•    Poorly filtering processes on dimension tables are automatically terminated.
•    Greater parallelism.
•    The ability to fall back on RID-list failure by writing the RID pool to a work table space and continue processing.

To take advantage of pairwise join, a single column index must be created on the fact table. The STARJOIN, SJMXPOOL and SJTABLES zparms must also be set. (See the Redbook for much more about these parameters.)

To enable pairwise join:
•    Make sure all DB2 for z/OS maintenance is on. The minimum level is PUT0903.
•    Enable star schema features and pairwise processing (zparm EN_PJSJ = ON).
•    See PK76100 APAR/PTF – Rebind static SQL

If you like tuning SQL, you'll want to use the EXPLAIN command to see how DB2 is accessing the data. The DB2 Version 9.1 for z/OS Performance Monitoring and Tuning Guide has some very good information on designing and tuning star schema processing.

The pairwise join EXPLAIN output found in the PLAN_TABLE is much more complex than typical join processing. A sample SQL statement and EXPLAIN output is found on page 605 of the guide.

(Note: The ACCESS_TYPE = "MI" should only occur once -- the Guide errantly displays "MI" multiple times. I'm told IBM will soon correct this. The "MI" is the process of merging the RID list with pairwise join. DB2 is really doing multiple MI functions to merge all the RID lists using pairwise join; however, IBM wants to simplify the explain output, so MI is shown only once to represent all MI processing.)

More star schema and pairwise join resources:
•    APAR to enable pairwise joins (APAR PK76100)
•    List of APARs needed for pairwise processing
•    IDUG 2008 -- DB2 z/OS Optimizer: What have you done for me lately?
•    HoldData for DB2 9.1 PUT Level 0903
•    IBM presentation on DB2 9 Optimizer and sparse index processing used by the optimizer
•    IBM Redbook: Enhancing SAP by Using DB2 9 for z/OS

If you work with star schema processing on z/OS, please share your experiences in Comments.