The following PowerUp blog entry was written by Jim Oberholtzer, owner of Agile Technology Architects, LLC, a consulting firm that assists companies in the implementation and maintenance of an agile/Scrum development environment.
Recently a customer called with an emergency: two disk units in the same RAID set had failed, taking the company’s POWER4 model 825 (V5R4M5) machine completely down. I was asked to help rebuild the system. While the hardware issues were being dealt with, I started working with the customer to prepare for the system restore. I asked to see the recovery plan; the company didn’t have one. However, a backup strategy was in place--a full system save every weekend with a cumulative save every night--so we would be able to get the system back to within about 12 hours of the crash. That was good news, but it struck me that once again I had run into a customer without a plan in the face of a disaster. The programs that managed the backup were about six years old and had not been modified in that time, and I wondered what would be missing from the backups. My other concern was that beyond the weekly/nightly backup, nothing was being done to keep the system up to date; it just kept running, until it didn’t.
We backed into a recovery plan based on the backups that were available and started to rebuild the system. The restore went as planned and most of the data was recovered. Then the real news hit the customer: all of the spool files were gone. Remember, the nightly backups were driven by a custom program that hadn’t been touched in years. Despite the critical nature of the spool files, the customer had never updated the backup routines to take advantage of the newer features of even V5R4, and so lost business-critical data. As I investigated the system further, I discovered the customer owned the enterprise edition of IBM i, which meant Backup Recovery Media Services (BRMS) was fully licensed on the system--but it wasn’t being used. That wasn’t the only part of the system the customer was unaware of. And the last thing I noticed: the PTF/group levels were more than two years old.
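The V5R4 feature the customer missed is worth naming: starting in V5R4, spooled file data can be saved and restored along with output queues using the SPLFDTA parameter on the save and restore commands. A minimal sketch in CL, where the output queue PRTOUTQ, library QUSRSYS, and tape device TAP01 are placeholder names:

```
/* Save the output queue together with its spooled files.    */
/* SPLFDTA(*ALL) is available starting in V5R4.              */
SAVOBJ OBJ(PRTOUTQ) LIB(QUSRSYS) OBJTYPE(*OUTQ) +
       DEV(TAP01) SPLFDTA(*ALL)

/* On restore, SPLFDTA(*NEW) restores the saved spooled      */
/* files that do not already exist on the system.            */
RSTOBJ OBJ(PRTOUTQ) SAVLIB(QUSRSYS) OBJTYPE(*OUTQ) +
       DEV(TAP01) SPLFDTA(*NEW)
```

Had the six-year-old backup program been revisited after the V5R4 upgrade, adding one parameter like this to the output queue saves would have preserved the data the customer lost.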
When senior management came for the after-action meetings, they were quite angry that the system had been lost and were adamant about leaving the platform for a “more reliable Microsoft system.” At this point I asked how long the oldest Intel system in the data center had been in service. Under three years, I was told; the company swaps out all Intel servers at least every 36 months. So they were comparing a system that had run faithfully in the data center for nearly eight years without a failure (until now), that boasts better than 99-percent uptime, and that had received almost no attention beyond an operating system upgrade from V5R2 to V5R4, against Microsoft systems where patches were applied weekly and the hardware was swapped out every 36 months. Four full-time employees run the Intel environment; one developer/operator/system administrator/business analyst handles the most business-critical system in the data center. Hmm, you just have to wonder how the math adds up. They wouldn’t listen to the point that the IBM system had cost far less to manage over its lifetime than an equivalent number of Intel servers. Sadly, this is the story of another former IBM i customer on their way to a massive increase in TCO.
So the takeaway from this comes down to four key points:
- I don't care how old your system is--from shiny and new to old and tired--build a recovery plan that fits the business needs, and then create the backup routines from that recovery plan. Most importantly, review the recovery plan yearly for updates!
- Know what software you’re licensed to use, and use it! BRMS in this case would have made the recovery much easier, and while it wouldn’t have saved the spool files on its own, chances are the administrator would have at least thought about them when the new features were announced or the recovery plan was reviewed.
- Keep current on PTFs. In this case, hardware PTFs that directly affected RAID cache were missing from the system, and I suspect that was the root cause of the initial failure.
- When our favorite systems are being compared to other technologies, be ready for the discussion. Research IBM i's real TCO vs. other systems in your environment. Understand the criticality of the data on the system, and never forget what the "i" stands for--integrated. How many Intel servers does it take to match up to just one Power Systems server running IBM i? Lots and lots.
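On the PTF point, finding out where a system stands takes only a couple of commands. A minimal sketch, assuming a V5R4 system (where the base operating system is licensed program 5722SS1); the results should be compared against IBM's published current levels:

```
/* List the installed group PTF levels (HIPER, database,     */
/* and so on) with their status.                             */
WRKPTFGRP

/* Display individual PTFs for the operating system,         */
/* including the cumulative package level.                   */
DSPPTF LICPGM(5722SS1)
```

A system answering these commands with two-year-old levels, as this one did, is running on borrowed time.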