Taming Technology: The Case of the Vanishing Problem
Author's Note
I am trying to compile a body of work on an overall theme that I call Taming Technology. (I have wrestled with several different names, but that is my title du jour.) The theme deals with troubleshooting, problem solving, problem avoidance, and analysis of technology failures.
Case studies are an important part of this ambitious project. "Installing Fedora Core 10" is an example. It deals with my attempts to perform a network kickstart install of Fedora Core 10.
Part 1, Lessons from Mistakes, outlines my plans, briefly compares CD installation with network installation, and proceeds to early success and unexpected debacle.
In Part 2, Bootable Linux, I discuss some of the uses of the Knoppix Live Linux CD, before explaining how I used Knoppix to gather information needed for a network installation. I also use Knoppix to diagnose the problems encountered in Part 1. This leads to the discovery that I have created even more problems. Finally, I present an excellent solution to the major problem of Part 1.
In Part 3 and 4, I finally get back on-track, and present detailed instructions for installing Fedora Core 10 over a network, using PXE boot, kickstart, and NFS. Part 3 details PXE boot including troubleshooting; Part 4 takes us the rest of the way.
I think it's important for people to realise that they are not alone when they make mistakes, that even so-called experts are fallible.
Isaac Asimov wrote a series of robot stories. (I don't imagine for one moment that I'm Isaac Asimov.) To me, the most captivating facet of these stories was the unanticipated consequences of interactions between the Three Laws of Robotics. I like to think that I write about the unanticipated consequences of our fallible minds: we want X, we think we ask for X, but find we've got Y. Why?
[ Highly amusing coincidence: a possible answer to Henry's question is contained in the "XY Problem" - except for the terms X and Y being reversed. -- Ben ]
In my real life, I embark on a project, something goes wrong, there is
often the discovery that the problems multiply. After several detours,
I finally get a happy ending.
-- Henry Grebler
The Cowboy materialised at the side of my desk. I wondered uneasily how long he had been standing there.
"When you've got some time," he began, "can I get you to look at a problem?"
"Tell me about it," I replied.
He looked around, and pulled up a chair. This was going to be interesting. His opening gambit is usually, "Quick question," almost invariably followed by a very long and complicated discussion that gives the lie both to the idea that the question will be short and the implication that it won't take long to answer.
To me, a quick question is something like, "Is today Monday?" The Cowboy's "quick" questions are about the equivalent of, "What's the meaning of life?" or "Explain the causes of terrorism."
If he had decided to sit down, how momentous was this problem?
By way of preamble, he conceded that The Russian had been trying to install Oracle on the machine in question. He wasn't sure if there was a connection, but now The Russian couldn't run VNC. It turned out that, when he tried to run a vncserver, he got a message like:
no free display number on suseq
Sure enough, when I tried to run a vncserver on suseq I got the same message. I used ps to tell me how many vnc sessions were actually running on this machine (ps auxw | grep vnc); a small number (3 or 4).
When I tried picking a relatively high display number, I was told it was in use:
vncserver :13 A VNC server is already running as :13
Similarly for a really really high number:
vncserver :90013 A VNC server is already running as :90013
This perhaps was enough information to diagnose the problem (it is, in hindsight), but seemed too startling to be believable.
To get a better understanding of why vncserver was behaving so strangely, I decided to trace it[1]. I used a bash function called 'truss'[2], so the session looked a bit like this:
truss vncserver New 'X' desktop is suseq:6 Starting applications specified in /home/henryg/.vnc/xstartup Log file is /home/henryg/.vnc/suseq:6.log
This was even more unbelievable - if I traced vncserver, it seemed to work! (It doesn't really matter whether it actually worked, or just seemed to work; tracing the application changed its behaviour.) This may be a workaround, but it explained nothing and raised even more question than were on the table going in.
At this point, I suggested to The Cowboy that it did not look like the problem would be solved any time soon, and he might as well leave it with me. I also thought a few minutes away from my desk wouldn't hurt.
After making myself a coffee, I went over to Jeremy and told him about the strange behaviour with trace. I wasn't really looking for help; just wanting to share what a weird day I was having.
I went back to my desk to do something that would succeed. There are times during the task of problem-solving when you find yourself picking losers. Whatever you try backfires. Several losses in a row tend to send your radar out of kilter. At such times, it's a good idea to get back onto solid ground, to do anything that is guaranteed to succeed. Then, with batteries recharged after a couple of wins, your frame of mind for tackling the problem is immensely improved.
A few minutes later, Jeremy came to see me. He had an idea. At first, I could not make any sense of it. When I understood, I thought it was brilliant - and certainly worth a try.
He suggested that I trace the bash command-line process from which I was invoking vncserver.
So, in the xterm window where I had tried to run vncserver, I did
echo $$ 5629
The number is the process id (pid) of the bash process.
In another xterm window, I started a trace:
truss -p 5629
Now, back in the first window, I invoked vncserver again:
vncserver no free display number on suseq
So, Jeremy's idea had worked! (Trussing bash rather than vncserver had allowed me to capture a trace of vncserver failing.)
I killed the truss and examined the output. What it showed was that the vncserver process had tried many sockets and failed with "Address already in use", e.g.:
bind(3, {sa_family=AF_INET, sin_port=htons(6099), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
The example shows an attempt to bind to port 6099. This was near the bottom of the truss output. Looking backwards, I saw that it had tried ports 6098, 6097, ... prior to port 6099. Clearly vncserver had cycled through a fairly large number of ports, and had failed on all of them.
Why were all these ports busy? I tried to see which ports were in use:
netstat -na
Big mistake! (I should have known.) I was flooded with output. How much output? Good question:
netstat -na | wc
After a very long time (10 or 15 seconds - enough time to issue the same command on another machine), it came back with not much less than 50000! That's astronomical! On the other machine with the problem, there were more than 40000 responses - still huge. A typical machine runs about 500 sockets.
After that, the rest was easy. The suspicion was that Oracle was hogging all the ports. That's probably not how it's meant to behave, but I don't have a lot of experience with Oracle.
I suggested to The Russian that he try to shut down Oracle and see if that brought the number of sockets in use down to a reasonable number; If not, he should reboot.
In either case, he should then start a trivial monitor in an xterm:
while true do netstat -na | wc sleep 1 done
and then do whatever he was doing to get Oracle going. As soon as the answer from the monitor rose sharply, he would know that whatever he had been doing at that time was probably responsible. If he was doing something wrong, he might then have an indication where to look. If not, he had something to report to the Oracle people.
Lessons
1. Not all exercises in problem-solving result in a solution with all the loose ends tidied up. In this case, all that had been achieved is that a path forward had been found. It might lead to a solution, it might not. If not, they'll be back.
2. Never underestimate the power of involving someone else. In this case, the someone else came back and provided me with a path forward.
Often, in other cases (and I've seen this dozens of times, both as the presenter and as the listener), the person presenting the problem discovers the solution during the process of explaining the problem to the listener. I have been thanked profusely for helping to solve a problem I did not understand. It is, however, crucial that the listener give the impression of understanding the presenter. For some reason, explaining the problem to the cat is just not as effective.
Notes
[1] "Tracing" is an activity for investigating what a process (program) is doing. Think of it as a stethoscope for computer doctors. The Linux command is strace(1) - "trace system calls and signals."
[2] After years of working in mixed environments, I now tend to operate in my own environment. In a classic metaphor for "ontogeny mimics phylogeny", my abbreviations and functions contain markers for the various operating systems I've worked with: "treecopy" from my Prime days in the early '80s, "dsd" from VMS back in the mid '80s, "truss" from Solaris in 1998, etc.
In this case, I used my bash function, "truss", which has all the options I usually want adjusted depending on the platform on which the command is running. Under Solaris, my function "truss" will invoke the Sun command "truss". Since this is a SUSE Linux machine, it will invoke the Linux command "strace" (with different options). It's about concentrating on what I want do rather than how to do it.
Talkback: Discuss this article with The Answer Gang
Henry was born in Germany in 1946, migrating to Australia in 1950. In his childhood, he taught himself to take apart the family radio and put it back together again - with very few parts left over.
After ignominiously flunking out of Medicine (best result: a sup in Biochemistry - which he flunked), he switched to Computation, the name given to the nascent field which would become Computer Science. His early computer experience includes relics such as punch cards, paper tape and mag tape.
He has spent his days working with computers, mostly for computer manufacturers or software developers. It is his darkest secret that he has been paid to do the sorts of things he would have paid money to be allowed to do. Just don't tell any of his employers.
He has used Linux as his personal home desktop since the family got its first PC in 1996. Back then, when the family shared the one PC, it was a dual-boot Windows/Slackware setup. Now that each member has his/her own computer, Henry somehow survives in a purely Linux world.
He lives in a suburb of Melbourne, Australia.