July 22, 2008

Batch Tuning: Checkpoint Restart

This is the fourth in a five-part series on tuning batch jobs. In part three I covered reducing I/O.

Checkpoint Restart is a facility, implemented in database applications, that "commits" atomic units of work and, at each commit point, records the information needed to restart the program from that point should the application fail before the next commit.

Figure 1 (db2utor_072208_fig_1.bmp) shows DB2 along with QSAM and VSAM file processing. In this sample, the top row shows the normal processing time of 4 hours for the job. The second row shows the application failing 3:50 (3 hours and 50 minutes) into the job. The rollback takes up to 7:40, roughly twice the elapsed processing time. It then takes 1-2 hours to fix the problem, after which you must rerun the job from the beginning, which takes about 4 hours to complete.

The last row shows what happens with Checkpoint Restart. When the job fails 3:50 in, instead of rolling back for 7:40, it rolls back only 5-15 minutes of work, depending on the commit frequency. It still takes 1-2 hours to fix the problem, but now only 10-20 minutes to finish running the job from the point of failure.
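Taking the midpoint of the 1-2 hour repair window as an assumption, the two recovery paths in Figure 1 can be tallied directly:

```python
# Tally the Figure 1 scenario in minutes. The 90-minute fix time is an
# assumed midpoint of the quoted 1-2 hour repair window.
FIX = 90

# Without Checkpoint Restart: full rollback, repair, then rerun from the top.
without_cr = (7 * 60 + 40) + FIX + (4 * 60)

# With Checkpoint Restart (worst case of the quoted ranges): short rollback,
# repair, then finish from the last commit point.
with_cr = 15 + FIX + 20

print(without_cr, with_cr)  # 790 vs. 125 minutes: roughly 11 hours saved
```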

For years I've participated in application design and coding reviews. My job as a DBA is to ensure programs use proper SQL coding standards and that the SQL will perform well. Another aspect of the review process is to ensure that poorly written programs won't move into production and negatively impact other applications. 

As a consultant who's worked with many different clients, I noticed that some companies understood Checkpoint Restart's importance and implemented standards requiring checkpoint processing in all batch jobs. However, others didn't get it, and when I tried to suggest and promote the concept of Checkpoint Restart, I was usually met with resistance. Why? My theory is that these companies didn't have a tool that could do Checkpoint Restart for them, and they didn't have time to write one.

However, in shops where I helped design and implement Checkpoint Restart, the resistance soon faded. Developers figured out that it actually made testing during development much faster when they could just restart from where the program failed, rather than start the test from the top. So while the primary benefit of Checkpoint Restart is that it reduces outage time in production, a side benefit is it improves the speed and quality of application testing during the development stage.

You may ask how this fits with performance and tuning. When a program that performs a lot of insert, update and delete processing encounters an error, all of the work performed must be rolled back to the last commit point or to the beginning of execution. DB2 treats rollback work as a priority, so it allocates more resources to complete the task; all other tasks receive fewer resources until the rollback finishes. This negatively impacts every other program executing while DB2 is rolling back, so you'll see a slowdown and an increase in overall elapsed times.

Next week we wrap this series up with designing for continuous availability.