If You're Going to Shoot Trouble,
You'd Better Know Where to Aim
REPRINTED FROM WWW.CRAMSESSION.COM
This week, I want to detour just slightly from my usual article format, which might be described as hypertechnical (I will assume that's a word, since the Florida high court ruled it so – at least until the U.S. Supremes say differently).
I want to discuss a topic that is really at the core of the skillset required to do the job of a network administrator. This applies whether you're an MCSE, a CNE, a Linux pro, a Solaris specialist, a Mac man, or the Keeper of the Mainframe. The topic is troubleshooting: recognizing, pinpointing, and solving the problems that occur in an environment of interconnected computers and other devices.
Students in our certification classes, who have no experience in the industry, have often asked me something along the lines of: “Okay, I know I need to know all this stuff – memorize all possible user authentication methods and be able to describe the process that a DHCP server goes through when it hands out an IP address and know the difference between host name resolution and NetBIOS name resolution. But assuming I do learn it all, and pass the tests, and actually get hired, what I want to know is: what does a network administrator do all day?”
It’s a fair question, and if you had to sum up the answer in 25 words or less (about 24 words less, to be exact), it would be “troubleshooting.” The sad fact is, like a cop (which I was in my Life Before IT) or a doctor (which my husband was during the same), a net admin is most popular when things go wrong. No one ever pages you at 2:00 a.m. to tell you “the network is running smoothly” (and you’d probably shoot them the next morning if they did).
9-1-1 R Us
Consider yourself an Emergency Services professional – your primary duty (whatever else the job description might say) is to be there when the proverbial biological waste matter hits the oscillating instrument. To boldly go where the users – even the “power” type – and the managers and the “paper pros” dare not tread. To roll up your sleeves, unholster your pistola (or your LANalyzer), and wade into the danger zone of downed servers and dropped packets, the Land of the Lost Connectivity. No matter how fancy your title, how big your office, or how fat your paycheck, when it gets down to the bottom line, you’re the “fix-it” man (or woman). It’s your responsibility to find out what’s wrong and set it right. To “just make it work right,” as one user I know frequently pleads in frustration.
And that is why you must understand the troubleshooting process.
The good news is that network troubleshooting is just like any other type of problem-solving, and the problem-solving skills you may have picked up in other areas of life transfer with full credit. Likewise, the problem-solving techniques you learn in the IT field will come in handy when you face difficulties off the job.
Big IT companies like Microsoft consider problem-solving to be such an important job qualification that many of them include “brain teaser” type questions in their “technical” interviews. A famous (or infamous) example is the old fox-chicken-grain dilemma. It goes like this:
You’re on one side of a river and you have a fox, a chicken and a bag of grain that you need to get to the other side. You must take them across by boat, but only one item can go in the boat with you at a time. You can’t leave the fox and chicken alone together, because the fox will eat the chicken. You also can’t leave the chicken and grain alone together; the chicken will eat the grain. How do you get them all across the river?
The solution is so simple that many candidates overlook it. Although the question stipulates that you can only have one of the three in the boat at a time, it does not say the fox, chicken and grain can only travel one way. Once you realize this, the task becomes fairly straightforward: take the chicken across, go back for the fox, drop the fox off and take the chicken back with you in the boat, leave the chicken on the first shore and take the grain across to join the fox, then go back to retrieve the chicken.
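For readers who like to see a puzzle solved mechanically, here is a minimal sketch of a breadth-first search over the possible crossings. The names (ITEMS, UNSAFE, is_safe, solve) and the "near"/"far" labels are my own for illustration and not part of any interview answer. It simply tries every legal boat trip and returns the shortest sequence that gets everything across.

```python
from collections import deque

ITEMS = frozenset({"fox", "chicken", "grain"})
# Pairs that cannot be left together without you there to supervise.
UNSAFE = [{"fox", "chicken"}, {"chicken", "grain"}]


def is_safe(bank):
    """Return True if nothing on an unattended bank will eat anything else."""
    return not any(pair <= bank for pair in UNSAFE)


def solve():
    """Breadth-first search for the shortest list of crossings.

    A state is (items still on the near bank, bank you are standing on).
    Each entry in the returned list is the item carried on that crossing,
    or None for a trip with an empty boat.
    """
    start = (ITEMS, "near")
    goal = (frozenset(), "far")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (near, side), path = queue.popleft()
        if (near, side) == goal:
            return path
        here = near if side == "near" else ITEMS - near
        for cargo in [None, *sorted(here)]:
            moved = frozenset() if cargo is None else frozenset({cargo})
            if side == "near":
                new_near, new_side = near - moved, "far"
            else:
                new_near, new_side = near | moved, "near"
            # The bank you just left is the one nobody is watching.
            left_behind = new_near if new_side == "far" else ITEMS - new_near
            if not is_safe(left_behind):
                continue
            state = (new_near, new_side)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo]))
    return None


if __name__ == "__main__":
    for trip, cargo in enumerate(solve(), start=1):
        print(trip, cargo or "(empty boat)")
```

The search needs seven crossings, and the chicken makes three of them, which is exactly the "items can travel both ways" insight that candidates tend to miss.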
That’s the “correct” answer – but one of my students was a little more creative. He first dutifully quoted the solution above, then said “okay, that’s the Microsoft answer for the test. Now in real life, what you’d do is use a third party hardware solution, such as putting the fox and chicken in plastic pet carriers.”
When I finished laughing, I told him that if it had been a job interview, I’d have hired him on the spot. Knowing the traditional answers (that’s called your knowledge base) is absolutely essential to being a good troubleshooter. But if you’re to be a great troubleshooter, you also have to be able to think even further outside the box – creativity has solved many problems that were thought to be unsolvable.
So let’s look first at some tried-and-true problem-solving methods.
Most of the “helping professions” concern themselves with solving problems, albeit of widely differing types. It’s interesting that the problem solving models used in different fields almost always break down to the same basic steps, even though those steps may be given different names.
In law enforcement, for instance, we relied on a problem solving methodology known as SARA. This translated to: Scanning, Analysis, Response, and Assessment.
In other words, when a police officer approaches a situation, he/she is advised to first scan the area, observe what’s going on, get an overall view of the scene. This step is commonly known as “using your observation skills.” In this first phase of an investigation, the officer gathers information. That is done not just by observing, but also by talking to witnesses, and collecting physical evidence.
Next, the officer should analyze the information gathered in the previous step: spot patterns and put clues together to come up with a logical evaluation of what the problem is. The third step is to respond, that is, to take action of some sort to “fix” the situation. Finally, the officer must assess the effects of that action and determine whether it solved the problem. If not, it’s back to step one to start over.
It’s pretty obvious that this system can be applied to network administration as well. When we’re called to the “scene of the crime,” we must first get all the info we can about the problem. This means listening to the user’s description of what happened, using our own observation skills to note what’s on the screen, and perhaps even gathering physical “evidence” such as a printout of an error log file. Then we analyze this information.
Let’s suppose that in step one, the user told us that he rebooted his Windows 2000 Professional computer and now he can’t connect to any servers on the network. Our own observation, using ping and ipconfig, shows that the computer’s TCP/IP stack is working, but its IP address is 169.254.0.2 and its subnet mask is 255.255.0.0. This address is not on the same subnet as the other computers on the network, which use addresses in the 192.168.1.0 network with a mask of 255.255.255.0.
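If you want to double-check the subnet arithmetic rather than eyeball it, a few lines with Python’s standard ipaddress module will do. This is only a sketch using the addresses from the scenario; the helper name same_subnet is my own.

```python
import ipaddress

# What ipconfig reported on the troubled client in the scenario.
client = ipaddress.ip_interface("169.254.0.2/255.255.0.0")
# The subnet the rest of the LAN is using.
lan = ipaddress.ip_network("192.168.1.0/255.255.255.0")


def same_subnet(interface, network):
    """Return True if the interface's address falls inside the network."""
    return interface.ip in network


if __name__ == "__main__":
    print("Client believes it is on:", client.network)
    print("On the LAN's subnet?", same_subnet(client, lan))
```

The client thinks it lives on 169.254.0.0/16, so same_subnet returns False for it, while any host with a 192.168.1.x address and the /24 mask would pass.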
Now let’s analyze our data. We know what the problem is. The computer isn’t communicating with the other computers on the network because its IP address is wrong. But how did this happen? We must consider possible reasons. Could the user have gone into the TCP/IP properties and changed his IP address for some reason? That’s possible, but when we go back for more information gathering (something that you’ll find yourself doing quite often during the analysis stage), we find that the computer is configured to be a DHCP client. Aha! Now we have our suspect – the DHCP server. An inexperienced investigator might deduce that it’s started handing out addresses from the wrong scope all of a sudden.
Right suspect, wrong crime. An experienced investigator (administrator) will know that the problem is indeed with the DHCP server, but it’s not guilty of giving the client its odd IP address. Here’s where your knowledge base comes in. If you’ve done your homework, you know that the 169.254.0.1 through 169.254.255.254 range is reserved for the use of APIPA, the Automatic Private IP Addressing feature of Windows 98 and 2000. APIPA kicks in when a DHCP client attempts to contact a DHCP server for an address and doesn’t find one. If the client can’t reach a DHCP server, it will assign itself an address from the APIPA range, as a temporary measure until a DHCP server is located. Now we know what’s really going on here: when the computer rebooted, it tried to obtain an address via DHCP and came up empty, so it assigned itself an APIPA address. The real problem then is that the DHCP server is down – or was at the time the client rebooted.
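That piece of knowledge base is easy to capture in a script. The sketch below flags an APIPA address and suggests where to look next. The function names and the wording of the hints are my own, and for simplicity it tests against the whole 169.254.0.0/16 block, which is slightly wider than the usable range quoted above.

```python
import ipaddress

# The link-local block that APIPA draws its addresses from.
APIPA_NET = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(address):
    """Return True if the address lies in the APIPA block."""
    return ipaddress.ip_address(address) in APIPA_NET


def first_suspect(address, dhcp_client):
    """Suggest a starting point for the investigation (a hint, not a verdict)."""
    if is_apipa(address):
        if dhcp_client:
            return "Self-assigned address: check whether the DHCP server is reachable."
        return "APIPA-range address on a static client: check the manual TCP/IP settings."
    return "Address is not in the APIPA range: look elsewhere."


if __name__ == "__main__":
    print(first_suspect("169.254.0.2", dhcp_client=True))
```

Note that the script only points at a suspect. You still have to go and check the DHCP server yourself.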
How do we respond to this problem? Well, first we need to check out that DHCP server and get it back online. Once we have the server up and running again, there are three ways to “fix” the client computer with the APIPA address. The simplest is to wait: a computer that has self-assigned an APIPA address will try every five minutes to contact a DHCP server and get itself a “real” address. If for some reason we don’t want to wait that long (instant gratification is the order of the new millennium, after all), we can run ipconfig /renew to force the client to send out a DHCP Discover message, seeking a new address. Finally, rebooting the computer will cause it to look for the DHCP server again when Windows loads.
We’re almost done – but don’t forget the assessment step. It’s important to make sure our actions had the desired effects, so be sure to check the TCP/IP configuration again to see that a proper IP address has been assigned, and ensure that the computer can now connect to other servers on the subnet.
I used the troubleshooting language that I learned years ago in policing. But the same procedure for solving problems is used in other professions; only the terminology is different. A physician might call the first, scanning step examination, the second analytical phase diagnosis, the third response stage treatment, and the final assessment step follow-up.
The point is: it doesn’t matter what you call the steps. What matters is that you follow them. This methodical process works – for cops, doctors, and network administrators.
Knowing the process is the first step in learning to troubleshoot network problems. But there’s more to it than that. Many rookie administrators are very good at diligently performing the first step, gathering all the information they can find. It’s in the analysis stage that their system breaks down – not because they haven’t properly questioned their witnesses and used their observation skills and gathered their evidence, but because they don’t have enough experience yet to know which clues are important.
Any seasoned detective will tell you it’s often the seemingly inconsequential bit of information that holds the key to the puzzle. Here’s an example. One of my readers (I’ll call him Scott – mostly because his name is Scott) recently wrote about some “strange” behavior on his network. He had set up a Windows 2000 server as a domain controller on his home network. He implemented DHCP on the server and gave it an address range of 192.168.0.3 through 192.168.0.25. He set up the other computers on the LAN (running Windows 98 and RedHat Linux) as DHCP clients.
The clients received IP addresses, but the addresses were not within the scope he had set on the DHCP server. Instead they were 192.168.0.26 and above. He tried deleting the scope and recreating it, to no avail. He searched TechNet for articles about DHCP address assignments outside the assigned scope, with no luck. He pored over the DHCP troubleshooting guide and found nothing useful.
When he described the problem, though, he mentioned casually – as an afterthought, really – that the W2K server was also sharing its Internet connection as an ICS host. This didn’t seem very important to him – but as soon as I heard that, I knew what his problem was. Internet Connection Sharing in Windows 2000 is not designed to be used on a network that has a DHCP server, and Scott had just experienced firsthand the reason why. ICS is a “lite” version of NAT (Network Address Translation) that is very easy to set up, but it uses a set of predefined parameters that can’t be changed. One of these is the fact that the ICS computer functions as a DHCP “allocator” (in class, I like to call it the DHCP alligator; the image of a fierce croc spitting out IP addresses seems to help my students remember this).
The range of IP addresses distributed by the ICS allocator is on the 192.168.0.0 network. There is no way to change this; if you need to use a DHCP server, you’ll have to forgo ICS and use its “big brother,” NAT. NAT is configured through the RRAS console (ICS is enabled in Network and Dial-up Connections). NAT does allow you to change the range of IP addresses handed out by its allocator, or you can configure NAT to disable its own allocation service and use the network’s DHCP server.
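One quick way to spot a second allocator like the one in Scott’s case is to compare the addresses your clients actually hold against the scope you configured. Here is a minimal sketch using Scott’s scope; the function names and the sample lease list are hypothetical, made up for illustration.

```python
import ipaddress

# The scope Scott configured on his DHCP server.
SCOPE_FIRST = "192.168.0.3"
SCOPE_LAST = "192.168.0.25"


def in_scope(address, first=SCOPE_FIRST, last=SCOPE_LAST):
    """Return True if the address lies within the inclusive scope range."""
    ip = ipaddress.ip_address(address)
    return ipaddress.ip_address(first) <= ip <= ipaddress.ip_address(last)


def out_of_scope(leases):
    """Return the leased addresses that the configured scope cannot explain."""
    return [addr for addr in leases if not in_scope(addr)]


if __name__ == "__main__":
    # Hypothetical addresses read off the clients with ipconfig or ifconfig.
    observed = ["192.168.0.26", "192.168.0.27"]
    print("Unexplained leases:", out_of_scope(observed))
```

Any address this turns up was handed out by something other than the scope you configured, which is your cue to ask what else on the network is answering DHCP requests.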
The “smoking gun” that led me to solve Scott’s mystery was his offhand comment about using ICS. You’ll be surprised how often it’s one of those “oh, by the way, I don’t know if this matters, but …” pieces of information that will make all the pieces fall into place.
Summing up: being a good network administrator requires that you be a good troubleshooter. And that means following the proven systematic process that good detectives, diagnosticians, and problem solvers of all kinds have used and refined over the years.
Good problem solvers first lay a solid foundation by acquiring a knowledge base – learning “just the facts,” studying the cases others have solved before them – and then by exercising their creativity to look beyond the obvious. A great problem solver, like anyone who does anything else extraordinarily well, loves what he/she does. If you enjoy the challenge of a good mystery, the thrill of sorting through all the data to come up with a logical explanation, and then “proving” your hypothesis right or wrong, you just might have what it takes to be a great network engineer.
Acknowledgments: A big “thank you” to Scott Robbins for providing both the outside-the-box “fox and chicken” solution and the ICS scenario I used in this article.