By Frank Rodegeb, IT Management Consultant in IBM's High Availability Center of Competency
As part of our discussion on Architecture Matters we will have a series on high availability (HA)–what it is and how to go about achieving it. I plan to discuss a number of myths, inhibitors, fundamental concepts and other considerations regarding HA and availability management (AM). I look forward to sharing our experience and examining different approaches together.
Let me first introduce myself. I have been with IBM for 47 years, with the last 23 years in an IT management consulting role specializing in HA and service management. I am currently a member of IBM's High Availability Center of Competency, where I facilitate HA assessment workshops and provide service management expert support for clients worldwide. I've held a variety of positions within services, technical sales and operating system development, primarily in a customer-facing, problem-solving role. I have developed, and continue to maintain and teach, an AM seminar for IBM consultants and customers.
Information technology plays an integral role in corporate strategy, and IT service availability is critical. With IT services directly facing customers, outages and other service disruptions can tarnish the company image and be very costly to the business. Highly available IT services are a business essential and can be a competitive advantage. However, IT often faces a number of availability and service quality challenges in meeting the business needs:
- Are you meeting your availability objectives, but the business is still unhappy?
- Do you measure service availability, or is it really just component availability?
- Are you able to restore service quickly to meet business requirements?
- Do you really have an HA architecture or are you dependent on component reliability?
- Do you need to negotiate planned (or scheduled) downtime with your business users to make upgrades or apply maintenance?
- Are you making trade-offs between availability and cost or performance?
If you answered yes to any of these questions, you may find these discussions useful.
In this initial discussion I'll start with definitions, because there are many perceptions of what is meant by HA. Some people think HA is defined by some number of nines expressing the percentage of uptime (e.g., four nines or 99.99 percent). One of our competitors suggests four nines is HA today. I find it difficult to put a number behind the definition, since the number changes over time as business requirements increase and technology continues to improve. And, frankly, if a service is down during any business-critical period, it really doesn't matter how many nines there are. In my mind, HA is a concept. I like the definition developed by a group of SHARE members: The attribute of a system to provide service during defined periods, at acceptable or agreed upon levels and masks unplanned outages from end-users and customers.
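To make the "number of nines" concrete, here is a minimal sketch that converts an availability percentage into the downtime it permits per year. It assumes a 365-day year and a service that is expected to be up around the clock; the function name is purely illustrative.

```python
# Assumption: a 365-day year with service expected 24 hours a day.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


for label, percent in [("two nines", 99.0), ("three nines", 99.9),
                       ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = allowed_downtime_minutes(percent)
    print(f"{label:<12} {percent:>7}%  {minutes:9.2f} min/year")
```

Four nines works out to roughly 52.6 minutes of downtime a year. The number says nothing about when those minutes fall, which is why a percentage alone cannot tell you whether the business was hurt.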
Let's examine some of the key terms to help us better understand what it's really telling us. First, what is meant by service? In this context, service is what IT is charged with providing to business users and their customers. It's IT's mission to manage the information needs of the business and provide information services. To provide service to business users, a system must then include all components necessary to collect, store, analyze and distribute the enterprise's information, and the information entrusted to it by its customers. This means a system is made up of not only the components we traditionally think of as infrastructure, such as processors, storage and system software, but also applications, network, data and even people. I interpret the term “masks unplanned outages from end-users” to mean that unplanned component failures should not impact end users. I would, therefore, say HA is the attribute of a system to provide service by isolating unplanned component failures from the business users.
So, I feel it's important to recognize that any discussion about HA is really about service availability. Certainly, that's where I intend to focus this series of posts, along with discussing any questions and comments you may have. While components may have availability characteristics, I do not plan to talk about availability in the context of components. Rather, when looking at components we should be thinking about the factors that can impact service availability: reliability, recoverability and serviceability.
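One way to see why component availability and service availability are different numbers is a simple worked calculation. The sketch below assumes a service that needs every component in its chain to be working and assumes component failures are independent, so the availabilities multiply. The component names and figures are hypothetical, chosen only for illustration.

```python
from math import prod


def service_availability(component_availabilities):
    """Availability of a service that needs ALL of its components working.

    Assumption: failures are independent, so the fractions multiply.
    """
    return prod(component_availabilities)


# Hypothetical chain: each component individually meets "three nines".
chain = {
    "network": 0.999,
    "server": 0.999,
    "storage": 0.999,
    "database": 0.999,
    "application": 0.999,
}

overall = service_availability(chain.values())
print(f"Each component: 99.9%  Service as a whole: {overall * 100:.2f}%")
```

Under these assumptions, five components that each report 99.9 percent yield a service at about 99.5 percent, so every component team can meet its objective while the business user sees noticeably less.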
Continuous availability is defined as a combination of HA and continuous operations: The attribute of a system to deliver non-disruptive service to the end user seven days a week, 24 hours a day across several time zones (there are no planned or unplanned outages). What this means to me is that a continuously available system must provide the capability to remove and isolate any component at any time for whatever reason, whether for planned maintenance or unplanned failures, while maintaining service to the business without disruption. Obviously, this requires redundancy of every component at every layer, but it requires a whole lot more.
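The effect of redundancy can be sketched with the same kind of arithmetic. This is an idealized illustration, not a sizing method: it assumes the redundant copies fail independently and that failover between them is instantaneous and always succeeds. The function name and the 99 percent figure are hypothetical.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability of a group that is up while at least one copy is up.

    Assumptions: independent failures and perfect, instantaneous failover.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single_availability) ** copies


for copies in (1, 2, 3):
    result = redundant_availability(0.99, copies)
    print(f"{copies} copy/copies at 99%: {result * 100:.4f}%")
```

On paper, two copies of a 99 percent component reach 99.99 percent. In practice, the assumptions are where the difficulty lies: shared dependencies, failover that takes time or fails, and the need to take components out for planned maintenance are not captured by this formula.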
Do you have concerns that prevent your company from delivering highly available IT services? Over the next several months we'll discuss several HA- and AM-related topics to address many of the challenges noted above. We'll explore some of the underlying causes that must be addressed before these concerns and challenges can be overcome. We'll discuss good technology and management practices for HA, seven steps to IT process improvements and more. I'd also like to hear your suggestions on topics of interest to you.
Which of the factors affecting how users perceive availability (reliability, recoverability and scope of impact) can have the most impact on overall availability? In the next HA article we'll discuss where we can focus our time to achieve the most benefit in improving overall availability.