October 18, 2011

Collation Key an Obscure but Helpful Function

Prior to the proliferation of client/server Web-based applications, it was simpler. When a customer name ("Coleman") was entered into DB2, we knew it would be stored in all caps ("COLEMAN"). Upper-case. However, unless your application converts the data before it's inputted into DB2, this is no longer true. Data is stored as is: Coleman could be "Coleman," "COLEMAN" or maybe even "coleman."

And if you didn't happen to input the data yourself, you might not know how to access it. Really, it's an age-old problem, but fortunately, there are ways to make database searches less rigid. One option is to use the UPPER() function on both the column and the comparison string entered for the search.

SELECT account_information
FROM customer_tbl

Of course, this only works when the data is entered correctly. If the data entry person mistypes the name (e.g., "cole man"), UPPER won't help. That's why you should take advantage of the collation key (COLLATION_KEY) function. I've tested it, and it ignores extra spaces and allows searches in upper- or lower-case. This function can also be used in sorting to ensure data is grouped together.

Say, for instance, that the LASTNAME column is set to "HAAS," which is a row on the sample table DSN8910.EMP.  The search is for "Haa s." Even with the inadvertent space, the search returns the row for the customer named "HAAS."

SELECT e.lastname                                   
FROM DSN8910.emp e                                  
where collation_key(e.lastname, 'UCA400R1_AS_LSV_S2')  
          = collation_key('Haa s'     ,'UCA400R1_AS_LSV_S2')

To be honest, the documentation for this function is extremely poor. The SQL Reference Guide sends you to "Support for Unicode: Unicode Services," but I couldn't find a thing in this manual.

So I stuck with SQL reference, and on pages 303-304 there's a keyword reference table that defines the collation key codes:

    UCA400R1:   tells DB2 to use conversion services UCA version 4.0.1.
    AS   : Alternate Shifted
    LSV : Locale – use Swedish linguistic conventions
    S2   : Strength Secondary

When used in tandem, these codes tell DB2 to ignore spaces, punctuation marks, symbols and letter cases, using the Swedish linguistic conventions.

This the most helpful (really, the only helpful) documentation I've found. The DB2 10 SQL Reference Guide doesn't have much on collation key, either.

If you can get your mind around the different codesets in the reference table, the collation key function can help you handle string comparisons and sorting collating issues. (Note: When using functions such as UPPER and COLLATION_KEY in a where predicate on a column on the table, remember that DB2 can't use the index unless it's created with an expression. For more about this, see DB2 9 Index with Expressions.)

If you're familiar collation key and its related codes, please post in Comments. We all could benefit from your experience.