July 29, 2008

Batch Tuning: Continuous Availability

This is the final installment in a five-part series on tuning batch jobs. In part four I covered Checkpoint Restart.

Previously I looked at ways to reduce elapsed times, CPU usage and I/O. Now we'll cover ways to improve continuous availability.

By continuous availability, I refer to how your batch job is designed to run with other batch jobs and/or online transactions. The part on Checkpoint Restart touches on this topic. The program can commit more often so it can run with other programs, and if it fails, it can quickly recover and start from where it leaves off.

Here are some design techniques that can help you in your efforts to achieve continuous availability:

  • Retry logic--If your application receives a timeout sqlcode -911 with reason code 00C9008E, you may be able to retry the SQL statement and continue. The other application may have issued a commit and released the locks that are holding you up. I'd have the program retry the SQL 2-3 times before issuing a final ABEND. On the other hand, if you receive a deadlock sqlcode -911 with reason code 00C90088, you should abort and see if you can rewrite your application or the one with which you deadlock. In either case, the aim is to process the tables in the same order or change the scheduled time of processing. The reason code is available in the SQLCA field SQLERRD(3). Most programmers don't provide this information. I highly recommend returning this information with your error message. Knowing whether it's a deadlock or a timeout makes it easier to pinpoint the problem, and make changes accordingly.
  • COMMITs--Taking commit points in a long-running job may increase CPU and log writes, but you gain the following:
    • Reduced lock contention especially in a data sharing environment
    • Improved the effectiveness of lock avoidance especially in a data sharing environment
    • Reduces the elapsed time for DB2 system restart following a system failure
    • Reduces the elapsed time for a unit of work to rollback after an application failure or explicit rollback.
    • Allows concurrency with utilities like online REORG.
  • Lock avoidance--DB2 provides a lock-avoidance mechanism. It does just what its name implies. DB2 reads the data without taking a lock on the page. If your application updates the page, then the lock is taken. Lock avoidance is accomplished by using these bind options:       CURRENTDATA(NO) and ISSOLATION(CS)                                          This is designed to reduce locking contention and greatly reduce your CPU overhead. When using CURRENTDATA(NO) and ISSOLATION(CS), consider changing system parameter SKIPUNCI to YES. By doing this, the uncommitted inserts are treated as if they never occurred. Your program doesn't wait for the outcome of the uncommitted inserts, but instead just skips over them.
  • Dirty read--The best lock-avoidance feature is running your application with ISSOLATION(UR) or uncommitted reads. You can read data that's being updated, deleted and inserted even before the unit of work is committed. This option is very fast and causes little contention because DB2 doesn't take a lock at the row or page level. I've seen this used as a silver bullet because it greatly reduces lock contention problems and can provide a high reduction in CPU usage and improvements in overall elapse times. However, use ISSOLATION(UR) with extreme caution. Ensure the end user can deal with logical inconsistencies in the data which can occur.
  • Optimistic concurrency control--When an application is using optimistic concurrency control, locks are taken immediately before a read operation and released immediately. Update locks are taken immediately before the update and then held to the end of transaction. This is all controlled through the use of RID and the row change timestamp token. I wrote about this process in a previous DB2utor blog post.
  • Close cursor -- In most batch jobs, you define cursors that hold locks past a commit point. Once you know you've completed processing for a cursor, you should close it immediately--otherwise shared locks are being held and could cause other jobs to fail.To close a cursor when you're done with it seems obvious, yet many developers don't do this and allow the locks to be held until the program completes. This can cause some deadlock/timeout problems because you have a job waiting to perform an update while another job is running and holding onto the page with a shared lock.
  • Sequence numbers--You may need to insert rows with a unique sequence number. Years ago we'd design the program to do a SELECT MAX(seq_id) FROM TABLE, which would cause contention when another transaction updated the sequence. But now you can take advantage of a DB2 sequence. DB2 doesn't wait for a transaction that's incremented a sequence to commit before allowing another transaction to increment the sequence again.

IBM continues to improve DB2 by providing new performance availability features like sequence numbers, lock avoidance and optimistic concurrency control. But it's up to you to figure out how these new features can function in your environment and help you reduce the cost of maintaining your legacy applications. While some features, like ISSOLATION(UR), don't require any coding changes, using sequences instead of SELECT MAX(seqid) does. Hopefully you designed the application and encapsulated this process so you only have to change it in one place. Sequence is a great feature that reduces both the cost of running the application and the risk of outages due to contention on a sequence table.

Remember: You can reduce elapsed times (by scheduling single-step jobs), CPU (by selecting only needed columns) and I/O (by denormalizing when needed). You can also take steps to minimize failure with Checkpoint Restart and increase availability by designing for continuous availability using optimistic concurrency control and lock avoidance. So you shrink your batch windows, and get more done in less time.