July 22, 2014

Lessons from a Physical Server Move

A customer planned to use Live Partition Mobility (LPM) to move running workloads from frame 2 to frame 1. The steps were: shut down frame 1, physically move frame 1, recable frame 1 and power it back on, then use LPM to bring the workload from frame 2 to frame 1, and, finally, repeat the process to physically move frame 2.

The task at hand was simple enough, but there was a problem. The physical server that was being moved had been up for 850 days. Do not make the mistake of moving a machine that's been running continuously for more than two years without first logging in and checking on the server's health. Furthermore, make sure you've set up alerting and monitoring of your servers.

I got a call after step one of the customer's plan was complete and the damage had been done. Nonetheless, much can be learned from this episode.

Was errpt showing tons of unread errors? Yes. Had the error log been looked at? No. Had someone cleared the error log before support got involved with the issue? Yes. Was support still able to help? Yes. When you send a snapshot to IBM support, they can access the error log even if it's been cleared from the command line, assuming those errors have not been overwritten in the actual error log file in the meantime.

Were there filesystems full? Yes. In this case one of the culprits was the /var/opt/tivoli/ep/runtime/nonstop/bin/ script, which wrote a file -- /dev/null 2>&1 -- that filled up the / filesystem.

To make matters worse, the machines were part of a shared storage pool (SSP), and after the physical move, frame 1 would not rejoin the SSP cluster. This left only two of four VIO servers as part of the SSP.

It turned out that after the physical move, the network ports weren't working. As a result, Multicast wasn't working. At least getting Multicast back up was easy enough. However, the two VIO servers were still unable to join the cluster, and the third VIO server on frame 2 (vio3) had protected itself by placing rootvg in read-only mode as it logged physical disk errors. So from a 4-VIO server cluster, only one was actually functional, and that one had its own issues. If things weren't fixed quickly, production would be impacted.

The problem with the one operable VIO server was that, because its rootvg had switched to read-only, SSP errors occurred whenever someone tried to start or stop any of the cluster nodes. In other words, it was keeping the cluster in a locked state:

            clstartstop -start -n clustername -m vio3
            cluster_utils.c get_cluster_lock 6096 Could not get lock: 2
            clmain.c cl_startstop 3030 Could not get clusterwide lock.

Fortunately, rebooting the third VIO server cleared up this issue. And with that, the other VIO servers came back into the SSP cluster. Ultimately, the customer was able to use LPM to move clients to frame 1, which had already been physically moved. This allowed the customer to then shut down frame 2 and physically move it as well.

So what have we learned? Check your error logs. Check your filesystems. Schedule occasional reboots of your machines. Make sure you're applying patches to your VIO servers and LPARs. Make sure you have good backups.
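As one illustration of the "check your filesystems" advice, here is a minimal sketch of a fullness check. The function name full_filesystems and the 90 percent default are my own choices for illustration, and it assumes your df accepts the POSIX -P flag so the capacity lands in the fifth column and the mount point in the sixth. Treat it as a starting point, not a finished monitor.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print mount points whose
# usage is at or above a threshold. Reads `df -P` style output on stdin,
# where column 5 is the capacity (for example "97%") and column 6 is the
# mount point. Mount points containing spaces are not handled.
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)          # strip the percent sign before comparing
        if (pct + 0 >= limit) print $6, $5
    }'
}

# Example use on a live system:
#   df -P | full_filesystems 90
```

Run from cron and paired with a look at errpt, something like this would likely have flagged the full / filesystem long before the move.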

Finally, note that in this instance, having the capability to perform LPM operations really made a huge difference. Despite the severity of these problems, the users of these systems had no idea that anything had been going on at all.

July 15, 2014

System Monitoring Shouldn't Be Neglected

What are you doing to monitor your systems from both the hardware and OS levels? Are you using a commercial product? Are you using an open source product? Are you using hand-built scripts that run from cron? Are you using anything?

Have you logged into your HMC lately? Does anything other than green appear in the system status, attention LEDs or Serviceable Events sections of the display? Countless times I've seen machines where the HMC messages were being ignored. Is your HMC set up to contact IBM when your servers run into any issues?

When your machines have issues, are you deluged with alerts? One customer I know of had a script that monitored their machine and sent emails when errors were detected. During one event, the PowerHA system actually failed over because the node became unresponsive due to the volume of errors being generated and the way the script was written. This forced the customer to go into the mail queue and clean up a huge number of unsent messages. Then they had to go into the email client and clean up all of the messages they'd received. Finally, they had to schedule downtime to fail the application back to the node it was supposed to be running on.
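One way to keep a monitoring script from burying you (and the node) in mail is to throttle it. The sketch below is hypothetical and is not the customer's script: should_alert, the stamp file path and the 900-second gap are names and values I made up, and it assumes a date command that supports +%s.

```shell
#!/bin/sh
# Hypothetical throttle (illustrative only): return 0 when no alert has been
# recorded in the last $2 seconds, otherwise return 1. The stamp file ($1)
# holds the epoch time of the last alert that was allowed through.
# The optional third argument overrides "now", which makes testing easier.
should_alert() {
    stamp_file="$1"
    min_gap="$2"
    now="${3:-$(date +%s)}"
    last=0
    if [ -f "$stamp_file" ]; then
        last=$(cat "$stamp_file")
    fi
    if [ $((now - last)) -ge "$min_gap" ]; then
        echo "$now" > "$stamp_file"   # remember when this alert went out
        return 0
    fi
    return 1
}

# Example use from a cron-driven monitor (address is a placeholder):
#   should_alert /tmp/errmon.stamp 900 && echo "new errors logged" | mail -s "alert" admin@example.com
```

With a gap of 900 seconds, a burst of thousands of errors produces at most one message every 15 minutes instead of one per error.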

I know of multiple customers that simply route error messages to a mail folder -- and then never bother checking them. What's the point of monitoring a system if you never analyze the information you collect?

How diligent are you about deactivating monitoring during periods of scheduled maintenance? In many organizations where a help desk monitors systems, cycles are wasted because techs are so often called to follow up on alerts and error messages triggered by scheduled events.

Of course there are other impacts that can result from neglecting systems. If internal disks are going bad, and you're not monitoring and fixing them, eventually you will lose your VIOS rootvg (assuming that's how you have it set up). And just as some customers will ignore the system monitoring messages they collect, other customers don't take action on hardware events that are being logged. Having robust hardware that notifies you when it needs maintenance is only useful if you actually heed the notifications.

Deploying your OS and installing your application is relatively simple, but along with that we must make decisions and take actions to manage and maintain these systems during the operational production phase of service. Sure, everyone is busy, and some tools cost money -- but try explaining that to someone who cares when production goes down.

On a totally unrelated topic, I want to acknowledge that AIXchange is having a birthday. Seven years ago this week -- July 16, 2007 -- the first article was posted on this blog. Many thanks to everyone who takes the time to read this blog, and special thanks to those who have suggested topics. I welcome your input, and it does make a difference.

Here's to the next seven years.

July 08, 2014

Webinars Cover the World of AIX

Hopefully you regularly listen to the AIX Virtual User Group webinars, either live or on replay. Recent sessions have been devoted to the POWER8 server announcements, Linux on Power and SR-IOV.

If you're outside of the U.S., you should know that similar webinars are taking place worldwide. For instance, there's the IBM Power Systems technical webinar series that originates from the U.K. This group's next event, which is set for July 16, covers PowerKVM. Dr. Michael Perzl is the presenter, and as someone who's already working with PowerKVM, I look forward to what he has to say.

Previously, this group presented "More tricks of the Power Masters," which, as you might imagine, was an hour-long session consisting of tips and tricks for using IBM Power Systems hardware. Thirty-eight total replays of these sessions can be found here. Specifically, I recommend this video of several presentations by Gareth Coates. Gareth is an excellent speaker who's always on the lookout for tips he can use in future sessions, and he mentioned that he is on the lookout for IBM i content as well. (He'll be sure to give you credit for your help.)

As I've mentioned on numerous occasions, there's little I love more than learning, finding and sharing AIX tips and tricks. With that in mind, please indulge me while I cite some specific information that's available in the "Power Masters" videos:

* For starters, to force a refresh of the operating system level information on the HMC, run:


            lssyscfg -r lpar -m <managed system> --osrefresh


(In addition, Power Masters offers good info on performing HMC updates from the network, which I've also written about here and here.)

* To find out how many virtual processors are active on your system, use the kdb command (and use it carefully):

            echo vpm | kdb

* To protect AIX processes when AIX is out of memory, use:

            vmo -o nokilluid=X

* To test your RSCT connection, use:

            /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

Some other Power Masters topics:

* Using Live Partition Mobility checklists. (I wanted to point this out so I have a reason to add that FLRT now has LPM checks available.)

* viosbr (which I've also covered here).

Some of the other information presented was first used in a session that took place in 2013, called Power "Ask the Experts." I covered that here.

Of course there's much, much more on not just AIX but also IBM i topics, so check out the Power Masters videos on YouTube. And if you don't already, be sure to tune into the AIX Virtual User Group and IBM Power Systems technical series webinars.

July 01, 2014

We're Not the Only Techies

As I've noted previously, I work with Boy Scouts. Recently I took a group of boys to an airport to work on their aviation merit badge.

We found a pilot who was willing and able to spend time on a Saturday with the troop. He invited the scouts to visit a maintenance and training facility and spend time on an airplane simulator.

Although he had interesting information to share, I quickly figured that, as a pilot, he hadn't spent a lot of time creating PowerPoint presentations. Prior to taking the scouts to the hangar so they could learn how to conduct a pre-flight inspection of an aircraft, he showed them a presentation covering the merit badge requirements. At one point, he clicked on what he hoped was a link to a video, but it turned out he had inadvertently made a screen capture of the video rather than an actual link to it. (Not that this issue wasn't easily addressed; he ended up going directly to YouTube and showing us things like this.)

But indeed, our pilot guide did admit that he hadn't used PowerPoint in years. On top of that, during the presentation, the overhead projector had an issue. For those of us who spend our time in meetings and conference rooms, fixing projector issues is second nature. Once again though, he wasn't immediately sure what to do.

All of us -- even the scouts themselves -- were pretty smug about our computer and projector knowledge at this point. Then we went into the next room and got into the simulator. Long story short:  I'm not cut out to land an airplane, or even to keep one riding smoothly through the air. So we all have our different skills. Frankly, as long as my pilots are experts at flying, I'll excuse their shortcomings when it comes to using software programs and projectors.

Of course the scouts, most of whom have considerable experience with computer games, made me feel even more inept on the simulator. A lot of those kids had a pretty light touch on the airplane controls and managed a reasonably good landing on the first try.

As an AIX pro, I'm generally surrounded by others with similar professional backgrounds. Quite possibly, it's the same for you. But we should all keep in mind that while most people need computers to do their jobs, they don't live and breathe technology the way that many of us do.

Ultimately, my day at the airport reminded me that, even if most people don't know computers like we do, we're far from the only smart folks out there doing challenging, technical work. And thank goodness for all these people and their unique specialties, because you really wouldn't want to see me at the controls of your plane.

June 24, 2014

More POWER8 Docs

I love reading about new computing technologies, particularly the latest IBM Power Systems releases. It doesn't hurt that, as a consultant, I have opportunities to work with the newest hardware, but even if that weren't the case, I'd still want to know everything about what's coming out of IBM. I guess I'm like those folks who read automotive magazines, even though I don't plan on buying a new Tesla anytime soon.

With this in mind, I'd like to point you to three new IBM documents -- draft Redpapers -- that cover the recently unveiled POWER8 models.  All three publications are scheduled to be finalized by the end of this month.

As you might expect, given that the models have many of the same features, there's some overlap in the information presented. For instance, this is the table of contents for all three publications:

            Chapter 1. General description
            Chapter 2. Architecture and technical overview
            Chapter 3. Virtualization
            Chapter 4. Continuous availability and manageability

So if you read these Redpapers back to back, you might have a case of déjà vu. Nonetheless, I believe the information is well worth your time.

Let's start with redp5097, which covers the 4U models, the S814 and the S824. As a reminder, the S in the model number stands for scale out, the 8 stands for POWER8, the 1 or 2 stands for the number of sockets, and the 4 stands for 4U.

Redp5098 covers the S812L and the S822L. Again, as a reminder, S for scale out, 8 for POWER8, 1 or 2 for the number of sockets, and 2 for 2U. L designates that these are Linux-only servers. I wrote about my experiences with the S822L here.

Finally, there's redp5102, which covers the S822. For completeness, the S is scale out, the 8 is POWER8, the 2 is 2 socket and the 2 is 2U.
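If the naming scheme is easier to remember as code, here is a small sketch that decodes it. The function decode_model is hypothetical and only understands the pattern described in this post (S, 8, socket count, rack units, optional L); it checks the shape of the name, not whether IBM actually sells that model.

```shell
#!/bin/sh
# Hypothetical decoder (illustrative only) for the scale-out naming scheme
# described in this post, e.g. S814, S824, S822, S812L, S822L.
decode_model() {
    model="$1"
    case "$model" in
        S8[12][24]|S8[12][24]L) ;;
        *) echo "unrecognized model: $model"; return 1 ;;
    esac
    sockets=$(printf '%s' "$model" | cut -c3)   # third character: socket count
    height=$(printf '%s' "$model" | cut -c4)    # fourth character: rack units
    suffix=""
    case "$model" in
        *L) suffix=", Linux only" ;;
    esac
    echo "$model: scale out, POWER8, ${sockets} socket(s), ${height}U${suffix}"
}

# Example use:
#   decode_model S822L
```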

At the bottom of the splash page for each publication there's a link to a blog post that lists five things to know about the IBM POWER8 architecture. I suggest checking this out as well.

So what are your plans to run POWER8 in your shop?