July 29, 2014

Can We Talk? Yes, and it's So Much Easier Now

A friend living overseas recently emailed me. He was having issues with an older HACMP cluster and wanted another set of eyeballs to check it. At the time I happened to be talking with a PowerHA guru, so I invited him to take a look as well.

Our small troubleshooting group reminded me of the people who work on their cars in their driveway. At least in my formative years, the sight of someone tinkering with a car would inevitably draw curious neighbors eager to see the mechanic do his thing. In this case, the attraction was an old HACMP cluster that -- via a WebEx session -- my guru friend and I could examine from several time zones away.

I'm still amazed at the relative ease with which it is now possible to communicate with anyone, anywhere. I have family members in South Africa. Years ago they actually sent a telegram to my door because they couldn't reach me on the phone. (Not that transnational phone service was inherently unreliable in those days, but occasionally calls didn't get through.) Surprised as I was to discover that telegrams still existed, it was the best alternative for delivering time-sensitive information at that time.

A while back, I sent them a magicJack VOIP system so they could have a local U.S. number. This means that any time I want, I can pick up the phone and make what's essentially a free call to the other side of the world.

Admittedly, VOIP technologies aren't yet completely reliable. My friend with the HACMP cluster experienced issues with his VOIP solution. We tried IM, but weren't satisfied waiting for each side to type out messages. Ultimately, he opted to call me on his cell phone. Of course that wasn't free, but calling internationally is much cheaper than it was even a few years ago.

As for the HACMP issue, it was fairly straightforward. A change had been made in the environment: someone added NFS to the cluster nodes, but not to the HACMP resource groups. The admin then decided to remove NFS, but didn't remove it completely. As a result, the cluster was out of sync, and HACMP wouldn't start at the next failover:

            ERROR: The nodes in resource group HA_RG are configured with more than one NFS domain. All nodes in a resource group must use the same NFS domain.

            Use the command 'chnfsdom <domain name>' to set the domain name.

With this error message pointing us in the right direction, the issue was quickly resolved.
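The suggested fix can be sketched as a quick cross-node consistency check. This is only a sketch: the node names and domain are placeholders, get_nfs_domain is a stub standing in for running chnfsdom remotely (e.g. over ssh), and on a real cluster you'd still need to synchronize the configuration afterward.

```shell
#!/bin/sh
# Sketch: confirm every cluster node reports the same NFS domain.
# On AIX, 'chnfsdom' with no argument prints the current local NFS domain,
# and 'chnfsdom <name>' sets it. Names below are hypothetical placeholders.

WANTED_DOMAIN="prod_domain"   # hypothetical domain name
NODES="nodeA nodeB"           # hypothetical cluster nodes

# Stub standing in for something like: ssh "$1" chnfsdom | awk '{print $NF}'
get_nfs_domain() {
    case "$1" in
        nodeA) echo "prod_domain" ;;
        nodeB) echo "old_domain"  ;;   # the mismatched node
    esac
}

check_nfs_domains() {
    for n in $NODES; do
        current=$(get_nfs_domain "$n")
        if [ "$current" = "$WANTED_DOMAIN" ]; then
            echo "$n: OK ($current)"
        else
            echo "$n: MISMATCH ($current); fix with: chnfsdom $WANTED_DOMAIN"
        fi
    done
}

check_nfs_domains
```

After correcting the domain on the odd node out, the cluster configuration still has to be verified and synchronized before HACMP will start cleanly.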

We're fortunate enough to work with some impressive technology, and that includes the older systems that continue to function effectively. But do you ever stop and really think about the amazing communication capabilities we have these days? Do you just take it for granted that these devices that fit in our pockets and purses allow us to interact in realtime with people from around the world for a relatively low cost and with very little effort?

July 22, 2014

Lessons from a Physical Server Move

A customer planned to use Live Partition Mobility (LPM) to move running workloads from frame 2 to frame 1. The steps were: shut down frame 1; physically move frame 1; recable frame 1 and power it back on; use LPM to bring the workload from frame 2 to frame 1; and, finally, repeat the process to physically move frame 2.

The task at hand was simple enough, but there was a problem: the physical server being moved had been up for 850 days. Do not make the mistake of moving a machine that's been running continuously for more than two years without first logging in and checking on the server's health. Furthermore, make sure you've set up alerting and monitoring on your servers.
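A pre-move health check can be as simple as a script run before anyone pulls a cable. The checks below are a hedged sketch: the error-count source is a stub (on AIX it would be something like errpt | wc -l), and the 90 percent threshold is an arbitrary example.

```shell
#!/bin/sh
# Sketch: minimal pre-move health check. Flags unread errors and any
# filesystem above a usage threshold. Values here are example placeholders.

# Stub standing in for: errpt | wc -l  (AIX error report line count)
error_count() { echo 42; }

health_check() {
    errs=$(error_count)
    [ "$errs" -gt 0 ] && echo "WARN: $errs unread error report entries"

    # Flag filesystems over 90% full (portable df output parsing)
    df -P | awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > 90) print "WARN: " $6 " at " $5 "%" }'

    return 0
}
```

Run as part of the change plan, a check like this would have surfaced the error log and full filesystems before the frame was ever powered off.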

I got a call after step one of the customer's plan was complete and the damage had been done. Nonetheless, much can be learned from this episode.

Was errpt showing tons of unread errors? Yes. Had the error log been looked at? No. Had someone cleared the error log before support got involved with the issue? Yes. Was support still able to help? Yes. When you send a snapshot to IBM support, they can access the error log even if it's been cleared from the command line, assuming those errors have not been overwritten in the actual error log file in the meantime.

Were there filesystems full? Yes. In this case one of the culprits was a script in /var/opt/tivoli/ep/runtime/nonstop/bin/, which -- apparently thanks to a mangled output redirection -- created a file literally named /dev/null 2>&1 and filled up the / filesystem.
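Hunting down what filled a filesystem is a routine exercise. A minimal, portable sketch (the directory argument and the find/du options are generic, not specific to this incident):

```shell
#!/bin/sh
# Sketch: list the largest files under a directory without crossing into
# other mounted filesystems (-xdev), which is what you want when / fills up.

biggest_files() {
    find "${1:-.}" -xdev -type f -exec du -k {} + 2>/dev/null |
        sort -rn | head -5
}

# Example (run as root to scan the whole root filesystem):
#   biggest_files /
```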

To make matters worse, the machines are part of a shared storage pool, and after the physical move frame 1 would not rejoin the shared storage pool (SSP) cluster. This left only two of four VIO servers as part of the SSP.

It turned out that after the physical move, the network ports weren't working. As a result, multicast wasn't working either. At least getting multicast back up was easy enough. However, the two VIO servers were still unable to join the cluster, and the third VIO server on frame 2 (vio3) had protected itself by placing rootvg in read-only mode as it logged physical disk errors. So of the four VIO servers in the cluster, only one was actually functional, and even that one had issues of its own. If things weren't fixed quickly, production would be impacted.

The problem with the one operable VIO server was that, because its rootvg had switched to read-only, SSP errors occurred whenever someone tried to start or stop any of the cluster nodes. In other words, it was keeping the cluster in a locked state:

            clstartstop -start -n clustername -m vio3
            cluster_utils.c get_cluster_lock 6096 Could not get lock: 2
            clmain.c cl_startstop 3030 Could not get clusterwide lock.

Fortunately, rebooting the third VIO server cleared up this issue. And with that, the other VIO servers came back into the SSP cluster. Ultimately, the customer was able to use LPM to move clients to frame 1, which had already been physically moved. This allowed the customer to then shut down frame 2 and physically move it as well.

So what have we learned? Check your error logs. Check your filesystems. Schedule the occasional reboots of your machines. Make sure you're applying patches to your VIO servers and LPARs. Make sure you have good backups.

Finally, note that in this instance, having the capability to perform LPM operations really made a huge difference. Despite the severity of these problems, the users of these systems had no idea that anything had been going on at all.

July 15, 2014

System Monitoring Shouldn't Be Neglected

What are you doing to monitor your systems from both the hardware and OS levels? Are you using a commercial product? Are you using an open source product? Are you using hand-built scripts that run from cron? Are you using anything?

Have you logged into your HMC lately? Does anything other than green appear in the system status, attention LEDs or Serviceable Events sections of the display? Countless times I've seen machines where the HMC messages were being ignored. Is your HMC set up to contact IBM when your servers run into any issues?

When your machines have issues, are you deluged with alerts? One customer I know of had a script that monitored their machine and sent emails when errors were detected. During one event, the PowerHA system actually failed over because the node became unresponsive due to the volume of errors being generated and the way the script was written. This forced the customer to go into the mail queue and clean up a huge number of unsent messages. Then they had to go into the email client and clean up all of the messages they'd received. Finally, they had to schedule downtime to fail the application back to the node it was supposed to be running on.

I know of multiple customers that simply route error messages to a mail folder -- and then never bother checking them. What's the point of monitoring a system if you never analyze the information you collect?
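One simple improvement over blindly mailing every error is to keep a small state file and alert only on entries you haven't seen before. Here's a sketch of that dedup logic -- collect_errors is a stub standing in for the real source (on AIX, something like errpt), and the mail step is left as a comment:

```shell
#!/bin/sh
# Sketch: report only NEW error-log entries since the last run, suitable
# for a cron job. Seen entries are remembered in a state file so repeats
# don't generate a flood of duplicate alerts.

STATE="${STATE:-/tmp/errlog.seen}"

# Stub standing in for the real error source (on AIX: errpt | tail -n +2)
collect_errors() {
    printf 'DISK_ERR4 hdisk0\nSC_DISK_ERR2 hdisk1\n'
}

report_new_errors() {
    touch "$STATE"
    collect_errors | while IFS= read -r line; do
        if ! grep -qxF -- "$line" "$STATE"; then
            echo "NEW: $line"
            echo "$line" >> "$STATE"
        fi
    done
    # Real alerting would pipe the NEW lines to something like:
    #   mail -s "error log alert" admin@example.com
}
```

The first run reports everything; subsequent runs stay quiet until something new shows up, which keeps the mail queue from becoming the next outage.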

How diligent are you about deactivating monitoring during periods of scheduled maintenance? In many organizations where a help desk monitors systems, cycles are wasted because techs are so often called to follow up on alerts and error messages triggered by scheduled events.

Of course there are other impacts that can result from neglecting systems. If internal disks are going bad, and you're not monitoring and fixing them, eventually you will lose your VIOS rootvg (assuming that's how you have it set up). And just as some customers will ignore the system monitoring messages they collect, other customers don't take action on hardware events that are being logged. Having robust hardware that notifies you when it needs maintenance is only useful if you actually heed the notifications.

Deploying your OS and installing your application is relatively simple, but along with that we must make decisions and take actions to manage and maintain these systems during the operational production phase of service. Sure, everyone is busy, and some tools cost money -- but try explaining that to someone who cares when production goes down.

On a totally unrelated topic, I want to acknowledge that AIXchange is having a birthday. Seven years ago this week -- July 16, 2007 -- the first article was posted on this blog. Many thanks to everyone who takes the time to read this blog, and special thanks to those who have suggested topics. I welcome your input, and it does make a difference.

Here's to the next seven years.

July 08, 2014

Webinars Cover the World of AIX

Hopefully you regularly listen to the AIX Virtual User Group webinars, either live or on replay. Recent sessions have been devoted to the POWER8 server announcements, Linux on Power and SR-IOV.

If you're outside of the U.S., you should know that similar webinars are taking place worldwide. For instance, there's the IBM Power Systems technical webinar series that originates from the U.K. This group's next event, which is set for July 16, covers PowerKVM. Dr. Michael Perzl is the presenter, and as someone who's already working with PowerKVM, I look forward to what he has to say.

Previously, this group presented "More tricks of the Power Masters," which, as you might imagine, was an hour-long session consisting of tips and tricks for using IBM Power Systems hardware. Thirty-eight total replays of these sessions can be found here. Specifically, I recommend this video of several presentations by Gareth Coates. Gareth is an excellent speaker who's always on the lookout for tips he can use in future sessions, and he mentioned that he is on the lookout for IBM i content as well. (He'll be sure to give you credit for your help.)

As I've mentioned on numerous occasions, there's little I love more than learning, finding and sharing AIX tips and tricks. With that in mind, please indulge me while I cite some specific information that's available in the "Power Masters" videos:

* For starters, to force a refresh of the operating system level information on the HMC, run:


            lssyscfg -r lpar -m <managed system> --osrefresh


(In addition, Power Masters offers good info on performing HMC updates from the network, which I've also written about here and here.)

* To find out how many virtual processors are active on your system, use the kdb command (and use it carefully):

            echo vpm | kdb

* To protect AIX processes when AIX is out of memory, use:

            vmo -o nokilluid=X

* To test your RSCT connection, use:

            /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

Some other Power Masters topics:

* Using Live Partition Mobility checklists. (I wanted to point this out so I have a reason to add that FLRT now has LPM checks available.)

* viosbr (which I've also covered here).

Some of the other information presented was first used in a session that took place in 2013, called Power "Ask the Experts." I covered that here.

Of course there's much, much more on not just AIX but also IBM i topics, so check out the Power Masters videos on YouTube. And if you don't already, be sure to tune into the AIX Virtual User Group and IBM Power Systems technical series webinars.

July 01, 2014

We're Not the Only Techies

As I've noted previously, I work with Boy Scouts. Recently I took a group of boys to an airport to work on their aviation merit badge.

We found a pilot who was willing and able to spend time on a Saturday with the troop. He invited the scouts to visit a maintenance and training facility and spend time on an airplane simulator.

Although he had interesting information to share, I quickly figured that, as a pilot, he hadn't spent a lot of time creating PowerPoint presentations. Prior to taking the scouts to the hangar so they could learn how to conduct a pre-flight inspection of an aircraft, he showed them a presentation covering the merit badge requirements. At one point, he clicked on what he hoped was a link to a video, but it turned out he had inadvertently made a screen capture of the video rather than an actual link to it. (Not that this issue wasn't easily addressed; he ended up going directly to YouTube and showing us things like this.)

But indeed, our pilot guide did admit that he hadn't used PowerPoint in years. On top of that, during the presentation, the overhead projector had an issue. For those of us who spend our time in meetings and conference rooms, fixing projector issues is second nature. Once again though, he wasn't immediately sure what to do.

All of us -- even the scouts themselves -- were pretty smug about our computer and projector knowledge at this point. Then we went into the next room and got into the simulator. Long story short: I'm not cut out to land an airplane, or even to keep one riding smoothly through the air. So we all have our different skills. Frankly, as long as my pilots are experts at flying, I'll excuse their shortcomings when it comes to using software programs and projectors.

Of course the scouts, most of whom have considerable experience with computer games, made me feel even more inept on the simulator. A lot of those kids had a pretty light touch on the airplane controls and managed a reasonably good landing on the first try.

As an AIX pro, I'm generally surrounded by others with similar professional backgrounds. Quite possibly, it's the same for you. But we should all keep in mind that while most people need computers to do their jobs, they don't live and breathe technology the way that many of us do.

Ultimately, my day at the airport reminded me that, even if most people don't know computers like we do, we're far from the only smart folks out there doing challenging, technical work. And thank goodness for all these people and their unique specialties, because you really wouldn't want to see me at the controls of your plane.