By Rupert Goodwins, 20 September 2004 10:00
NEWS A bug in a Microsoft system compounded by human error was ultimately responsible for a three-hour radio breakdown that left hundreds of aircraft aloft without guidance on Tuesday, according to a report in the LA Times.
Nearly all of Southern California's airports were shut down and five incidents where aircraft broke separation guidelines were reported. In one case, a pilot had to take evasive action.
The newspaper said that a Microsoft-based replacement for an older Unix system needed to be reset every thirty days 'to prevent data overload', as a result of problems found when the system was first rolled out. However, a technician failed to perform the reset at the right time and an internal clock within the system subsequently shut it down. A back-up system also failed.
Richard Riggs, an advisor to the technicians' union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.
To prevent a recurrence of the problem before the software glitch is fixed, Laura Brown, an FAA spokeswoman, said the agency plans to install a system that would issue a warning well before shutdown.
Microsoft UK was not immediately available for comment.
Rupert Goodwins writes for ZDNet UK
Comments
There are 33 comments.
1. Marcin
Yes, a Microsoft ad "Make a name for yourself" is very appropriate here. What would it be? A "killer application" maybe?
2. anonymous
So now anytime somebody writes a bad program that happens to run on Windows it is automatically Microsoft's fault?
Exactly how is any of this Microsoft's fault? AFAIK, Microsoft doesn't make air traffic control software, another company does. Bad programmers exist in the Unix, Linux, and Mac worlds too. Why was the Unix system replaced then if it was already perfect? Why wasn't the problem just fixed in the first place? Just another example of a so-called news reporter twisting the facts around just so they can mention Microsoft and get people to read an otherwise insignificant news article. Rupert Goodwins just proves he lacks any credibility by writing things like this.
3. anonymous
Your report is very misleading, and pretty vague, and almost assuredly untrue. For example, you stated:
". . . a Microsoft-based replacement for an older Unix system needed to be reset every thirty days 'to prevent data overload', as a result of problems found when the system was first rolled out."
Assuming that the Windows server in question was fully updated with the latest Service Packs and Critical Updates, then the ONLY reason this would occur is if there was an application running on it that had a memory leak! A "memory leak" would indicate that the program WAS NOT CODED PROPERLY!
Furthermore, you said, " . . . However, a technician failed to perform the reset at the right time and an internal clock within the system subsequently shut it down. A back-up system also failed."
My, I wonder what OS that "Backup System" was running . . .
So--let me get this straight:
1.) A Microsoft Windows server replaced a UNIX machine, and presumably there was SOME application running on it that was necessary for the Radio system to work properly.
2.) Since the only reason that a Windows server needs to be rebooted is a memory leak, which can only be caused by an improperly-coded application, it would appear that the application in this case is none other than the one that is necessary for the radio system to work properly.
That would indicate that the fault and blame lies DIRECTLY in the lap of whoever CREATED that application!!
Oh, but then it's SO EASY to shift the blame to SOMEONE ELSE, ESPECIALLY if you HATE Microsoft, and of course, Microsoft makes a HANDY "Whipping Boy", doesn't it?
Maybe you should be a little more responsible in your reporting, and gather all the FACTS before you start "reporting" things like this.
Also, MAYBE whoever coded that application should take responsibility for their poor code writing--don't you think?
4. vaibhav
and people are going to blame this on microsoft, instead of the guy who is supposed to reboot the system on time. and i guess people will also blame microsoft for the backup system failing...
5. anonymous
Nice spin to try and make this look like Microsoft's fault when it clearly was not.
The LA Times headline reads:
THE NATION; Human Errors Silenced Airports; A controllers union official describes 'harrowing' incidents in the sky, but the FAA insists the radio system failure posed no threat.
Excerpt from article: "But they said the quirk in the system, known as Voice Switching and Control System, is a "design anomaly" that should have been corrected after it was discovered last year in Atlanta."
Seems the problem was with the Control System and not Microsoft Software.
Nice try to throw mud in the face of Microsoft. Seems spin articles do work on the weak-minded.
6. anonymous
someone writes a crappy app and runs it on windows and it's m$ft's fault. yeah, windows might have problems, but this is sleazy reporting at best.
7. anonymous
Application developer of that system writes a buggy piece of software that happens to run on MS's OS and you guys scream "Microsoft software caused" it? Might as well call it an act of God, 'cause he's ultimately responsible for it all. What a joke.
8. John Watkins
Article doesn't make sense. Title blames Microsoft (grab those page clicks, eh?) and the article says the FAA should have fixed their software before deployment.
If it is the FAA application stack with a memory/handle leak etc., the article title is VERY misleading and should be corrected.
9. Sean Rosenthal
For all of you MS fans, Windows has its flaws, which is why MS constantly has updates, patches, etc., which are, in some cases, worse than the original issue... anyone remember SP2 for NT4? Hmmm? Or SP2 for XP? LOL.
The bottom line is, you should always use some form of Unix when dealing with a truly enterprise app, esp. if there are SLAs associated with it.
Windows Servers are just not equipped for high-volume work; the HAL coupled with a bloated Shell is only the beginning of the issues. They should stick with a Desktop OS, and apps, and let the Unix world handle Servers and Enterprise Services.
Anyone remember when they bought hotmail and tried to put 5 million email addresses and all that traffic on Windows? Seems to me that it was running on a Sun Sparc... but maybe I made that up too.
10. anonymous
This is one of the worst spin articles I've read, aside from all the pro-Kerry BS...
First, to Mr. Rosenthal: The HAL is not a problem. If it was, my system would have problems if I installed windows on it and just played solitaire all day without being connected to a network of any kind - not going to happen... Please understand what the root of the problem is before you improperly place blame. The improperly-written application is the obvious culprit here, as that's the only thing that causes a memory leak on any released copy of an MS OS.
To the editor:
Get your facts straight and watch what your writers are saying, or you may just end up like CBS - scrambling for credibility.
11. anonymous
To echo some of the other comments here, this article is a biased anti-Microsoft rant. Read the article and pull out the facts and you see that Microsoft isn't at fault (for this problem at least). Microsoft bugs keep me in a job, but this isn't one of them. It's obvious that a faulty application, human error, as well as failure of the backup system (OS unspecified) is what caused the radio problems.
12. Sean Rosenthal
Ummm, does your Solitaire try to access any hardware? You must have the Enterprise Version of Solitaire SP3 with Hotfixes.
I have my facts straight.
And it could have been the problem Mr. Anonymous, if it were a DOS app.
13. anonymous
This is advertising, not journalism
Check out the banner ad next to this article--Sun Microsystems.
14. anonymous
Disingenuous article.
It's an application, developed by the FAA and running on a Microsoft OS. The application bombs due to memory leaks, and so the workaround is to restart it every 30 days. Someone forgot to restart it.
It's not a "Unix vs. Microsoft" argument, it's a "completed code vs. uncompleted code" argument, and the FAA acknowledges that in other articles.
Feh.
15. Joe Thompson
This *is* in fact Microsoft's fault (sorry guys).
As reported elsewhere (http://www.techworld.com/opsys/news/index.cfm?NewsID=2275) the flaw was such that the system shut down every 49.7 days. This is a time-counter overflow bug from all the way back in Windows 95. Assuming other reports of the 49.7-day interval are correct, it's a Windows bug, not a memory leak or a bug in the application.
Others have speculated that the bug still exists in Windows 2000, but my guess is that somewhere a Win95 system is controlling something critical (we're talking about a federal contract here, where any deviation from the approved configuration often has to undergo approval all over again).
16. Boston Class.
Quite frankly, as it stands, there isn't enough information to say whether the blame is all MS's fault, the app's fault, or somewhere in between. Software nowadays is a complex mix of custom code and prebuilt parts. Some from MS, some from other vendors. We may find when it comes down to it, that everyone's to blame. Doesn't make for good press, but may be closer to the truth.
17. Andrew Edmonds
What's with all the Microsoft apologists here? From everything I read it was at least partly Microsoft's fault due to an old bug in the OS (going back to Win95) and not just some poorly written application. Everyone thinks it's spin trashing Microsoft like Microsoft has gotten some raw deal and is a scapegoat. And anyone who thinks Microsoft is to blame is just gullible. Did Microsoft hire all you people to post in its defense? I would never trust a mission-critical system like this to Microsoft (or Linux -- for those who assume I'm biased the other way). All I can say is what were they thinking? Buy me a bus ticket.
18. ms sucks
Great, you people always know what you are writing about! Let us see... Open the link if you dare:
http://support.microsoft.com/default.aspx?scid=http://support.microsoft.com:80/support/kb/articles/q216/6/41.asp&NoWebContent=1
Oh please. Stupid programmer. He should not have used a system function that is likely to hang the comp. The programmer is to blame, not Microsoft.
The bug was well known and it even has index 216641 in MS knowledge base! That was an easy find! I've already read about 5000000 articles there myself. I guess all of you guys knew it and _never_ used the buggy system call. How entirely lame.
19. GeneralFault
I know that older windows systems (win95) had a bug that causes the system to halt after ~48 days. Could this be related? Were all of the systems running win2K? It seems to me that if the problem were due to a memory leak that the time to failure would vary with the use of the application. In other words it is unlikely that it would always be 48 days.
Also, I am not sure how Linux handles memory leaks that overflow the system limits, but it does not seem like an impossible problem to render harmless. I would expect that a server level OS would try to handle the problem gracefully. One way may be to timestamp the last read on every so many blocks of memory. The smaller the block the larger the performance strike. When the memory management layer in the OS detects that the memory is going to overflow and lock up the system, it could free the memory that was read (note, read not written) the longest time ago. I'm sure that this would cause problems, but it may mitigate larger bugs like a complete system halt. Does anyone know how UNIX or Linux handles this?
Also, does anyone know why LAX switched to Windows?
Aside from all of that, I believe that it is much safer to use open source on large critical single function systems. This is simply because the controlling entity (the FAA in this case) can open the source and find out exactly why LAX was out of commission for 2+ hours. With a Windows-based system it is much more difficult. Yah yah, the radio system was probably custom and therefore they most likely have access to the source for that. However, the Windows API and other third party libraries used in the system are a closed black magical box. They have to simply assume that it works as expected. This is the same reason (or more precisely one of the reasons) that closed source electronic voting systems are a bad idea.
20. GeneralFault
I would like to refer all of you to
http://support.microsoft.com/default.aspx?scid=kb;en-us;823273
This is a known issue in Microsoft Windows 2000.
The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.
21. anonymous
This fault is in Windows itself, not the air traffic control software. It's related to an old Windows 95 bug, where after 49.7 days, a 32-bit unsigned int counting milliseconds since start up overflows, and this causes a crash. Search the Microsoft knowledge base for 49.7 and you'll see it's a Microsoft bug. They had previously been working around it by rebooting every 30 days, not exactly an elegant solution. The UNIX solution didn't need this workaround.
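The 49.7-day figure quoted in several comments can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the FAA system or from any Windows source: the function names are hypothetical, and it assumes nothing more than a counter of milliseconds since start-up held in 32 bits. Under that assumption, 49.7 days matches an unsigned counter (a signed one would run out after about 24.9 days). The second function shows the usual modular subtraction for measuring an interval across a wrap, which holds only while the true interval is shorter than one full wrap.

```python
# Hypothetical illustration: a tick counter counting milliseconds since start-up.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def days_until_wrap(bits: int, signed: bool) -> float:
    """Days until a millisecond counter of the given width exceeds its range."""
    usable_bits = bits - 1 if signed else bits
    return (2 ** usable_bits) / MS_PER_DAY


def elapsed_ms(start: int, now: int, bits: int = 32) -> int:
    """Elapsed ticks between two readings of a wrapping unsigned counter.

    Valid only if the real interval is shorter than one full wrap.
    """
    return (now - start) % (2 ** bits)


unsigned_days = days_until_wrap(32, signed=False)
signed_days = days_until_wrap(32, signed=True)
print(round(unsigned_days, 2))  # 49.71
print(round(signed_days, 2))    # 24.86

# A reading taken just before the wrap and one taken just after it:
print(elapsed_ms(4294967290, 10))  # 16
```

A 30-day reset cycle, as described in the article, stays inside the 49.7-day window, which is consistent with the counter explanation but does not by itself confirm it.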
22. John McNair
Hmmm, I can hardly believe what I'm reading. The bug that caused this was the infamous 49.7 days bug in Windows 95. That version maintained an internal counter of milliseconds since the last reboot that overflowed after 49.7 days. No amount of skill in either software engineering or system administration can keep a Windows 95 machine running for more than that length of time.
The foolhardiness is in placing such a notoriously broken operating system in control of any vital systems. This is even more poignant in an environment where upgrades need to happen slowly. We do not want air traffic control running the most bleeding edge forced upgrade from Redmond.
So how is this OS bug "obviously" the fault of poor application coding?
23. Andy
Don't blame the writer. Here's a quote from the LA times article about the same thing:
"When the system was upgraded about a year ago, the original computers were replaced by Dell computers using Microsoft software. Baggett [vice president, Air Traffic Controllers' union] said the Microsoft software contained an internal clock designed to shut the system down after 49.7 days to prevent it from becoming overloaded with data."
The report's pretty much vague BS, but I'm guessing that very little information about the system itself was forthcoming from official sources.
24. anonymous
The Tandem computers that used to run the voice switching and control system became unsupportable and were replaced by a dual Dell server running Win2k. Seems kind of scary to me, since the NAS runs on AIX, and other serious systems run on Solaris.
25. Ron Martell
The closest thing I can find to what is being reported is Knowledge Base article 823273 http://support.microsoft.com?kbid=823273 which indicates that degraded performance but not a total shutdown would result. A hotfix for this has been available since January 2004.
There is another 49.7 day issue with Server 2000 but that one only affects printing.
26. anonymous
My God, you have to be insane to trust people's lives to Microsoft software! Stunning indeed!
27. anonymous
I am not sure about Windows, but Mac OS X comes with a disclaimer that states it shouldn't be used for critical applications that could result in injury or loss of life, and so on. For that kind of application, a proprietary system should be designed and scratch tested extensively before being implemented.
28. MikeW
"Windows Rejuvenation"
From ZDnet.com:
>>
IBM also has been working for more than a year on a feature called software rejuvenation for Windows servers. Unfortunately, servers using Windows must be restarted periodically because of problems such as memory "leaks"--when computing processes claim memory but don't return it when done.
<<
29. Jay Carney
not that bill cares, he's got his money from the system.
30. Johnny Marr
EXECUTIVE SUMMARY:
I was unfortunate enough to read most of the crap in the above comments [and, yes, it was the purple-faced, spittle-flecked rubbish that you'd expect whenever you mention Microsoft or Apple or UNIX].
Here's the dope: A life-critical Windows server failed because it hadn't been patched. Loads of techno geeks saw this as an opportunity to argue on the internet.
31. Frank Thynne
Regardless of whether the cause of the problem is a memory leak in an application, a problem remains that Windows appears to be vulnerable to faulty applications. An OS that is vulnerable in this way has no place in a safety-critical system, and I don't expect that Microsoft would claim that Windows is suitable.
The fault lies largely with the system designers who thought that Windows might be a suitable platform for the system.
32. valdir leite
I'm sure this is a Microsoft FAULT. There's no mention of a bug in the application.
Why must technicians perform resets every 30 days?
Is it an application or Operating System concern?
Windows is a plague. Not for use in critical situations.
33. anonymous
Did anyone consider blaming the Dell computers? Or maybe the combination of the two?