November 15, 2011

Storing Data in Unicode

I recently heard a podcast on code pages, Unicode and internationalization in which Dan Luksetich and Susan Lawson interviewed IBM Distinguished Engineer and DB2 developer genius Chris Crone.

Chris discusses the implications that a CCSID and code page have on data translation, and how the choice of code page can affect both storage and the CPU consumed by translation services. From there he gets into UTF-8 and UTF-16 and how each can affect the performance of your Java applications. He also covers the different types of drivers needed to connect to the database from a remote versus a local application, and the tradeoffs that must be made when, for example, local-running COBOL batch programs need the data in EBCDIC and Java-based applications need the data in Unicode.
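
To make the storage point concrete, here is a minimal Java sketch of my own (it isn't from the podcast) that counts how many bytes the same string needs in UTF-8 versus UTF-16. The class name, method names and sample strings are just illustrations, and this measures Java's own encoders rather than anything inside DB2.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16
    // (big-endian, so no byte order mark is added to the count).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only:
        // "M\u00FCller" has a u-umlaut, the last sample is three CJK characters.
        String[] samples = { "DB2", "M\u00FCller", "\u65E5\u672C\u8A9E" };
        for (String s : samples) {
            System.out.println(s + ": UTF-8 = " + utf8Bytes(s)
                    + " bytes, UTF-16 = " + utf16Bytes(s) + " bytes");
        }
    }
}
```

Plain ASCII text takes half the space in UTF-8 that it does in UTF-16, while the three CJK characters take 9 bytes in UTF-8 but only 6 in UTF-16. Which encoding is cheaper depends on what your data actually looks like.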

After listening to this podcast I wanted to learn more about Unicode and the implications of storing data as UTF-8 or UTF-16, so I went to the place I normally go to get more information: IBM Redbooks. Sure enough, I found "DB2 UDB for z/OS Version 8: Everything You Ever Wanted to Know... and More."

Even though this publication came out with DB2 V8, it's still a great starting point if you're planning to store data in Unicode. I began with chapter 6, "Unicode in DB2 for z/OS." Section 6.1 (Conversion Basics) covers conversion services and describes some of the performance implications of using multiple CCSIDs in an SQL statement. In the podcast Chris talks a bit about this as well. He also gets into the ramifications that incorrect coding can have on your result set. The Redbook, meanwhile, shows you how to write SQL code that won't cause performance degradation.

I followed up with Dan because I had a question on JDBC drivers. I like the way he summarizes this topic: type 4 drivers "talk" in UTF-8; type 2 drivers "talk" in UTF-16. In general, if a remote application is accessing a database encoded with UTF-8, use type 4 drivers; for a local application encoded with UTF-16, use type 2 drivers. For a local-running COBOL program, use UTF-16 if possible. Mixed application environments, understandably, don't have a simple solution. In these cases you must examine the frequency of execution and weigh the costs. Chris does point out during the podcast that a locally connected type 4 driver can take advantage of HiperSockets, thus mitigating the overhead of translation.
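
For anyone who hasn't worked with both driver types, the difference usually shows up first in the connection URL. The sketch below is mine, not Dan's, and it assumes the IBM Data Server Driver for JDBC and SQLJ, where a type 4 URL names a host and port and a type 2 URL names only the database or location. The class, the helper methods, the host name, the port and the location name are all hypothetical placeholders, and the code only builds the strings without opening a connection.

```java
public class Db2JdbcUrls {

    // Type 4 connectivity: pure Java over TCP/IP, so the URL
    // carries a host and port in front of the location name.
    static String type4Url(String host, int port, String location) {
        return "jdbc:db2://" + host + ":" + port + "/" + location;
    }

    // Type 2 connectivity: goes through a local attachment,
    // so the URL has no host or port.
    static String type2Url(String location) {
        return "jdbc:db2:" + location;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(type4Url("zoshost.example.com", 446, "DB2LOC1"));
        System.out.println(type2Url("DB2LOC1"));
    }
}
```

Check the driver documentation for your own release before relying on these formats, since the accepted URL variants depend on the driver and platform.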

If you have experience tuning Java applications and using JDBC drivers, please register your thoughts in Comments.