AIXchange

August 19, 2014

More Resources for AIX Newbies

As I've noted previously, there are more newcomers to the AIX platform than you might imagine. A company may acquire an AIX system through a merger or replace an old Solaris or HP-UX box with a current IBM Power Systems model. As a result, one of their IT pros suddenly becomes the AIX guy. So, now what? How does an AIX newbie get up to speed with virtualization and AIX?

I've mentioned the QuickSheets and QuickStarts from William Favorite. I've also highlighted conferences, classes and free monthly user group meetings that you can look into. Recently though, I was pointed to this old IBM web page featuring various AIX learning resources. I call it old because some of the links no longer work, but what's still available is surprisingly useful.

Some of the material covers concepts from AIX 5.3, but much of this information remains valid today. It's also nice that some of the links take you to current Redbook offerings and IBM training courses.

The working links cover:

* AIX security and migration (this is AIX 5.3 material)

* Virtualization introduction

* Systems Director

* Power Systems Redbooks (updated here)

* IT technical training

* IBM business partner training

* IBM professional certification

On a related note, I've always believed that the simplest thing employers can do to help their IT staff members get started with AIX or any operating system that's new to them is to invest in a small lab/sandbox machine and HMC.

I'm continually amazed to see companies spend big bucks on the latest hardware and software, but then neglect to foot the bill for additional test systems. It's great that some companies devote an LPAR or two to testing, but you can only do so much in that environment. (In addition, there can be pressure to repurpose virtual test labs into running other production workloads. Then before you know it, the production needs grow so critical that these LPARs are made off-limits to reboots and testing.)

With Windows and x86 Linux servers especially, it's relatively easy and cheap to get access to test machines. I also know of people who've purchased old Power hardware on eBay just to have something that they can run AIX on.

With actual test boxes, you can safely reboot servers, install firmware and upgrade operating systems without touching production. If you make a mistake on a test system, not only have you not hurt anything, you've learned a valuable lesson.

How do you learn, and keep learning? How do you stay current with your skills? If your machine is happily running along and you have little need to touch it, how can you ever expect to be able to support the machine when an issue hits?

August 12, 2014

Connecting Your HMC to IBM Support

You've been asked to connect your HMC to IBM Support. The network team wants to know about the different connectivity options. They need to know which IP addresses must be opened across the firewall.

What do you do? First, read this:

 "This document describes data that is exchanged between the Hardware Management Console (HMC) and the IBM Service Delivery Center (SDC) and the methods and protocols for this exchange. This includes the configuration of Call Home (Electronic Service Agent) on the HMC for automatic hardware error reporting. All the functionality that is described herein refers to Power Systems HMC version V6.1.0 and later as well as the HMC used for the IBM Storage System DS8000.

"Outbound configurations are used to configure the HMC to connect back to IBM. The HMC uses the IBM Electronic Service Agent tool to connect to IBM for various situations including reporting problems, reporting inventory, transmitting error data, and retrieving system fixes. The types of data the HMC sends to IBM are covered in more detail in Section 4."

Included are diagrams that show different scenarios for sending data to IBM, including with/without a proxy server, using a VPN, or even using a modem (though IBM does recommend Internet connectivity). Specific options include pass-through server connectivity, multi-hop VPN, and remote modem. IBM states that there are no inbound communications; all communications are outbound only.

Further, IBM explains why your machine may need to "call home":

            * To report to IBM a problem with the HMC or one of the systems it's managing.

            * To download fixes for systems managed by the HMC.

            * To report to IBM inventory and system configuration information.

            * To send extended error data for analysis by IBM.

            * To close an open problem.

            * To report heartbeat and status of monitored systems.

            * To send performance and utilization data for system I/O, network, memory, and processors.

There's also a list of the files that are sent to IBM, and the authors point out that no client data is sent to IBM.

On that note, here's IBM's statement on data retention:

"When Electronic Service Agent on the HMC opens up a problem report for itself, or one the systems that it manages, that report will be called home to IBM. All the information in that report will be stored for up to 60 days after the problem has been closed. Problem data that is associated with that problem report will also be called home and stored. That information and any other associated packages will be stored for up to three days and then deleted automatically. Support Engineers that are actively working on a problem may offload the data for debugging purposes and then delete it when finished. Hardware inventory reports and other various performance and utilization data may be stored for many years.

"When the HMC sends data to IBM for a problem, the HMC will receive back a problem management hardware number. This number will be associated with the serviceable event that was opened. The HMC may also receive a filter table that is used to prevent duplicate problems from being reported over and over again."

Finally, there's this list of the IP addresses that need to be allowed across any firewalls. All connections use port 443 TCP:

            Americas

            • 129.42.160.48

            • 129.42.160.49

            • 207.25.252.200

            • 207.25.252.204

 

            Non-Americas

            • 129.42.160.48

            • 129.42.160.50

            • 207.25.252.200

            • 207.25.252.205
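
If your network team wants a concrete starting point, here's a minimal sketch of those outbound rules in Linux iptables syntax. This is purely for illustration -- your firewall product, chain names and HMC address will differ -- and it uses the Americas addresses from the list above:

            # Allow outbound HTTPS (443/TCP) from the HMC to IBM's call-home servers
            # HMC_IP is a placeholder for your HMC's source address
            iptables -A FORWARD -s HMC_IP -d 129.42.160.48 -p tcp --dport 443 -j ACCEPT
            iptables -A FORWARD -s HMC_IP -d 129.42.160.49 -p tcp --dport 443 -j ACCEPT
            iptables -A FORWARD -s HMC_IP -d 207.25.252.200 -p tcp --dport 443 -j ACCEPT
            iptables -A FORWARD -s HMC_IP -d 207.25.252.204 -p tcp --dport 443 -j ACCEPT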

 

IBM adds that when an inbound remote service connection to the HMC is active, only these ports are allowed through the firewall for TCP and UDP:

            * 22, 23, 2125, 2300 -- These ports are used for access to the HMC.

            * 9090, 9735, 9940, 30000-30009 -- These ports are used for Web-based System Manager (POWER5).

            * 443, 8443 -- These ports are used for Web-based user interface (POWER6).

            * 80 -- This port is used for code downloads.
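
If you want to confirm that these ports actually answer once a remote service connection is up, a simple probe from a management workstation will do. A quick sketch, assuming telnet is available and myhmc is a placeholder hostname:

            # Confirm the HMC's web-based UI port (POWER6) answers
            telnet myhmc 443
            # Confirm SSH access to the HMC
            telnet myhmc 22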

Take a few moments to read this document. Or, even better, send it to your network team so they can read it for themselves.

August 05, 2014

On Going Dark

I have another quick story involving my work with Boy Scouts.

Each summer we try to get the older boys involved in some high adventure activities. Last year this included target shooting (shotguns and .22 caliber rifles), archery, spelunking, rappelling and hatchet throwing. I didn't bring my laptop, but with my cellphone I could check in with the office and answer emails. Really, it was the best of both worlds. I was able to camp out, but at the same time I could help out the people I work with.

This summer's adventures consisted of backpacking, canoeing and canyoneering. Everything went smoothly in our case, though we do know members of the troop who had to be rescued around the time we were out.

For me, the main difference between this year and last was that I didn't have cellphone coverage during our recent trek into the Arizona mountains. Honestly, I'm not sure this was a bad thing.

Where we were, there was absolutely no cellular coverage of any kind (though just 20 miles down the mountain, the service was fine). Of course when you're responsible for the well-being of a bunch of kids, you'd prefer to have a means of instant communication should an emergency arise. The troop leaders were talking about satellite phones. Perhaps next year we'll look at something like this or this.

However, just looking at it from a work perspective, what would you do? Would you be OK knowing that cell phone service was a 15-20 minute drive away, or do you need to be constantly in touch? I will admit that I like to know what's going on, not only in my world but in general. I had no way of checking headlines or sports scores or emails. I was completely cut off.

And yet, I think I enjoyed it. 

It takes a while to truly unplug, and I might have gone through some withdrawal symptoms initially upon losing my access. Eventually though, I felt relieved. Since I knew that checking in wasn't an option, I could focus on enjoying the trip. Since I couldn't check messages, I didn't feel guilty about not responding to them. The "out of office" auto message option exists for a reason, after all. I was finally, truly, away.

For another perspective on what it's like to go a few days without the Internet at your fingertips, Jon Paris and Susan Gantner share this story about "going dark" during a cruise.

I guess there's something to be said for being unplugged, especially if you're out in nature. Even though I returned to tons of messages, when I got back I was recharged and ready to get back to work.

How about you? When you go on vacation, do you escape from technology?

July 29, 2014

Can We Talk? Yes, and it's So Much Easier Now

A friend living overseas recently emailed me. He was having issues with an older HACMP cluster and wanted another set of eyeballs to check it. At the time I happened to be talking with a PowerHA guru, so I invited him to take a look as well.

Our small troubleshooting group reminded me of the people who work on their cars in their driveway. At least in my formative years, the sight of someone tinkering with a car would inevitably draw curious neighbors eager to see the mechanic do his thing. In this case, the attraction was an old HACMP cluster that -- via a WebEx session -- my guru friend and I could examine from several time zones away.

I'm still amazed at the relative ease with which it is now possible to communicate with anyone, anywhere. I have family members in South Africa. Years ago they actually sent a telegram to my door because they couldn't reach me on the phone. (Not that transnational phone service was inherently unreliable in those days, but occasionally calls didn't get through.) Surprised as I was to discover that telegrams still existed, it was the best alternative for delivering time-sensitive information at that time.

A while back, I sent them a magicJack VOIP system so they could have a local U.S. number. This means that any time I want I can pick up the phone and make what's essentially a free phone call to the other side of the world.

Admittedly, VOIP technologies aren't yet completely reliable. My friend with the HACMP cluster experienced issues with his VOIP solution. We tried IM, but weren't satisfied waiting for each side to type out messages. Ultimately, he opted to call me on his cell phone. Of course that wasn't free, but calling internationally is much cheaper than it was even a few years ago.

As for the HACMP issue, it was fairly straightforward. A change had been made in the environment. Someone added NFS to the cluster nodes, but not to the HACMP resource groups. The admin then decided to remove NFS, but didn't remove it completely. As a result, the cluster was out of sync, and HACMP wouldn't start at the next failover:

            ERROR: The nodes in resource group HA_RG are configured with more than one NFS domain. All nodes in a resource group must use the same NFS domain.

            Use the command 'chnfsdom <domain name>' to set the domain name.

With this error message pointing us in the right direction, the issue was quickly resolved.
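
For reference, the fix looks roughly like this, based on the chnfsdom usage shown in the error message. Run it on every node in the resource group; the domain name below is a placeholder:

            # With no argument, chnfsdom displays the node's current NFS domain
            chnfsdom
            # Set the same NFS domain on every node in the resource group
            chnfsdom mycompany.com
            # Then verify and synchronize the cluster before attempting another failover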

We're fortunate enough to work with some impressive technology, and that includes the older systems that continue to function effectively. But do you ever stop and really think about the amazing communication capabilities we have these days? Do you just take it for granted that these devices that fit in our pockets and purses allow us to interact in real time with people from around the world for a relatively low cost and with very little effort?

July 22, 2014

Lessons from a Physical Server Move

A customer planned to use Live Partition Mobility (LPM) to move running workloads from frame 2 to frame 1. The steps were: shut down frame 1, physically move frame 1, recable frame 1 and power it back on, then use LPM to bring the workload from frame 2 to frame 1, and, finally, repeat the process to physically move frame 2.

The task at hand was simple enough, but there was a problem. The physical server that was being moved had been up for 850 days. Do not make the mistake of moving a machine that's been running continuously for more than two years without first logging in and checking on the server's health. Furthermore, make sure you've set up alerting and monitoring of your servers.
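
A basic pre-move health check doesn't take long. Here's a minimal sketch using standard AIX commands; adapt it to your environment:

            # How long has the machine been up?
            uptime
            # Any unread entries in the error log?
            errpt | more
            # Any filesystems at or near capacity?
            df -g
            # Is rootvg healthy, with no stale physical partitions?
            lsvg rootvg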

I got a call after step one of the customer's plan was complete and the damage had been done. Nonetheless, much can be learned from this episode.

Was errpt showing tons of unread errors? Yes. Had the error log been looked at? No. Had someone cleared the error log before support got involved with the issue? Yes. Was support still able to help? Yes. When you send a snap to IBM support, they can access the error log even if it's been cleared from the command line, assuming those errors have not been overwritten in the actual error log file in the meantime.
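
If you're gathering data for IBM yourself, the snap command collects the raw error log (/var/adm/ras/errlog) along with the rest of the system configuration. A typical invocation:

            # Collect a full, compressed system snap for IBM support
            snap -ac
            # By default the archive is written to /tmp/ibmsupt/snap.pax.Z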

Were there filesystems full? Yes. In this case one of the culprits was the /var/opt/tivoli/ep/runtime/nonstop/bin/cas_src.sh script, which created a file literally named "/dev/null 2>&1" -- apparently the result of a mangled shell redirect -- that filled up the / filesystem.
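
Tracking down that kind of culprit is simple once you actually look. For example:

            # How full is the root filesystem?
            df -g /
            # Find the largest files and directories directly under / (sizes in KB)
            du -sk /* 2>/dev/null | sort -n | tail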

To make matters worse, the machines are part of a shared storage pool, and after the physical move frame 1 would not rejoin the shared storage pool (SSP) cluster. This left only two of four VIO servers as part of the SSP.

It turned out that after the physical move, the network ports weren't working. As a result, multicast wasn't working. At least getting multicast back up was easy enough. However, the two VIO servers were still unable to join the cluster, and the third VIO server on frame 2 (vio3) had protected itself by placing rootvg in read-only mode as it logged physical disk errors. So of the four VIO servers in the cluster, only one was actually functional, and that one had its own issues. If things weren't fixed quickly, production would be impacted.
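
On the VIO servers themselves, the cluster command shows which nodes the SSP considers healthy. A quick sketch from the padmin shell; the cluster name is a placeholder:

            # List the state of each node in the shared storage pool cluster
            cluster -status -clustername mysspcluster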

The problem with the one operable VIO server was that, because it had switched to read-only, SSP errors were occurring whenever someone tried to start or stop any of the cluster nodes. In other words, it was keeping the cluster in a locked state:

            clstartstop -start -n clustername -m vio3
            cluster_utils.c get_cluster_lock 6096 Could not get lock: 2
            clmain.c cl_startstop 3030 Could not get clusterwide lock.

Fortunately, rebooting the third VIO server cleared up this issue. And with that, the other VIO servers came back into the SSP cluster. Ultimately, the customer was able to use LPM to move clients to frame 1, which had already been physically moved. This allowed the customer to then shut down frame 2 and physically move it as well.
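
For reference, LPM validation and migration can be driven from the HMC command line with migrlpar. A minimal sketch; frame and partition names are placeholders:

            # Validate that lpar1 can be moved from frame2 to frame1
            migrlpar -o v -m frame2 -t frame1 -p lpar1
            # Perform the actual migration
            migrlpar -o m -m frame2 -t frame1 -p lpar1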

So what have we learned? Check your error logs. Check your filesystems. Schedule occasional reboots of your machines. Make sure you're applying patches to your VIO servers and LPARs. Make sure you have good backups.

Finally, note that in this instance, having the capability to perform LPM operations really made a huge difference. Despite the severity of these problems, the users of these systems had no idea that anything had been going on at all.