Peter Cochrane's Uncommon Sense: Why does technology fail?

It's the machines' fault... and ours

By Peter Cochrane, 21 January 2005 08:00

COMMENT Everyone's had their day ruined by a computer crash at one time or another - but why does this happen in the first place? Peter Cochrane examines the causes and offers a solution to flaky tech.

How come technology seems to fail at the most critical times? There you are making good progress toward an important deadline, overcoming all obstacles and having a winning day, when - crunch - the printer stops functioning. Better still, your PC crashes. And at that very instant the boss appears, asking how things are going and whether he will be getting his report in less than an hour.

What gives? Is technology baiting us or is this really the norm?

The answer to this sometimes frustrating conundrum comes in two parts - the understandable and the distinctly quirky.

First, the understandable. When we get married, start a company or make any major life changes, we tend to re-equip. That is, we buy the TV, oven, fridge, washing machine, dryer and vacuum cleaner all at the same time.

Strange as it might seem, all of these white and brown goods are designed to broadly the same lifetime specification. The Mean Time To Failure (MTTF) is around five years and the Mean Time To Death (MTTD) is around eight years. Ergo multiple and near-simultaneous failures are to be expected - they have actually been built in. In other words, if we buy 10 items at once, there is a pretty good chance that two or more will fail at a similar time.
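To put a rough number on that claim, here is a minimal simulation sketch in Python. It is not taken from any manufacturer's data: the normal lifetime distribution, the one-year spread around the five-year MTTF and the one-month window standing in for 'a similar time' are all assumptions chosen for illustration, as is the function name.

```python
import random


def prob_clustered_failures(n_items, mean_years=5.0, sd_years=1.0,
                            window_years=1 / 12, trials=20000, seed=0):
    """Estimate the chance that at least two of n_items, all bought on the
    same day, fail within window_years of each other.

    Assumption (not from the column): lifetimes are independent draws from
    a normal distribution with the given mean and standard deviation.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = 0
    for _ in range(trials):
        # Draw one lifetime per item; clamp at zero to avoid negative ages.
        lifetimes = sorted(
            max(0.0, rng.gauss(mean_years, sd_years)) for _ in range(n_items)
        )
        # After sorting, the closest pair is always an adjacent pair.
        if any(later - earlier <= window_years
               for earlier, later in zip(lifetimes, lifetimes[1:])):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, "items:", prob_clustered_failures(n))
```

Under these assumptions the estimate climbs steeply as more items are bought together, which is the column's point. Widen the spread or narrow the window and the figure drops, so treat the output as a feel for the effect rather than a prediction.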

This mechanism also applies to our automobiles, computers and other IT equipment - and everything else that is mass-produced. So buying a PC, printer, scanner and back-up drive all at once puts us in the same vulnerable position. Of course, the amount of use and abuse also influences the actual outturn of the MTTF and MTTD. Add to this the variability between manufacturers, suppliers and maintainers as well as that unpredictable commodity - software - and the stage is set.

Another factor: When is your car most likely to fail? The day after it has been in the repair shop. Once the repairman has been inside the box it is far more likely to become unreliable. This is true of everything you own. It is certainly true of any and all software upgrades and installs for computers. Hence the old adages - 'leave well enough alone' or 'if it ain't broke don't fix it'.

Some generally unseen mechanisms can often lead to a cascade of technology failures as well. In all complex systems a single point mechanism can lead to multiple failures and, conversely, multiple small failures can add up to a single dominant failure. For example, a network hub failure may see the loss of an internet connection, printer and scanner. At the same time, opening one more application when almost all the RAM is full, the hard drive is severely fragmented and the mouse is dirty can cause a total system freeze.

And now for the quirky mechanisms - us. One of our first problems is recognising what is going wrong when tech failures occur and then diagnosing the cause. Very often we are not all that good at it. We tend to jump to the wrong conclusion and, especially when tired and stressed, make mistakes and compound problems through bad decisions and actions.

Add to all of this the fact there are a lot of us networked together, with different competence levels, all trying to achieve different objectives and you have a disaster in the making.

We also have run up our load of company, domestic and leisure activities to a point where almost everything is on a critical path. There is no slack - no room for error or failure. In a way we assume our technology will not let us down.

Why? Because much of our technology - heat, light, power, communication, transport - is reliable. Sure, IT is still flaky but it's an awful lot better than it was 20 years ago and continues to improve. With our current mindset, any and all failures come at a critical time because everything we do is critical. We have no back-up, no standbys and no extra members of staff who can pick up the ball.

Is there a solution? Yes. But it means becoming less efficient and building in some slack. I don't want to brag because what follows is a bit extravagant. But because a lot of people live in my home, effectively two families under one roof, I now have two washing machines, dryers, irons and kitchens - and four vacuum cleaners. Domestically I am in good shape - but I am not advocating this as a solution. A sharing agreement with a close friend or neighbour makes for a far more economic solution if it can be arranged.

On the IT front, if you want reliability you have to spend money on dual machines, hard drives, printers, scanners and everything else. And never upgrade software or install a new OS or application on all of your machines simultaneously. Do it sequentially, establishing stability a stage at a time.
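The case for duplicates can be sketched with the standard formula for redundant units in parallel. This is a minimal illustration, assuming each unit fails independently of the others. That assumption is optimistic given the earlier point about kit bought together failing together, and the function name and the 90 per cent figure in the usage line are purely illustrative.

```python
def parallel_availability(unit_availability, copies):
    """Chance that at least one of `copies` identical units is working.

    Assumption: units fail independently, so the chance that all of them
    are down at once is (1 - unit_availability) ** copies.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** copies


if __name__ == "__main__":
    # Illustrative figure only: a printer that works 90 per cent of the time.
    print(parallel_availability(0.9, 1))
    print(parallel_availability(0.9, 2))
```

With these illustrative numbers a second printer lifts the chance of having a working one from 0.9 to about 0.99. The gain shrinks if both units share a cause of failure, such as the same power glitch or the same software upgrade, which is why upgrading machines one at a time matters.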

Overall my most effective investments have been in back-up hard drives, both internal and external, plus several no-break power supplies. If a power glitch or outage occurs, my systems keep running. This single measure has saved me much grief and paid for itself many times over. And it was the least expensive of all my precautions - just $100 or so for battery backup for my server, router, hubs, drives, PC and peripherals.

On one level I stand in awe that modern society and technology work at all, but on another I can see all the inefficiencies. In the end failure is endemic and part of the learning process. We just need to continuously minimise the overall impact. And believe me, IT is getting better.

Whoops, there goes another light bulb.

Written after my printer ran out of black ink, a light failed in my office and my ISP went down for half an hour. All extremely rare events but grouped in the same hour. Column completed within the next hour and despatched to silicon.com via my Wi-Fi link.

Comments

There are 6 comments.

  1. John Hewett

    "IT is still flaky but it's an awful lot better than it was 20 years ago".

    What rubbish. 20 years ago I worked on systems with a MTBF (software or hardware) of 18 months. None of today's systems ever approach this. The reason is simple, new bells and whistles sell software, reliability doesn't.

  2. Dick Winchester

    I have to say I'm still amazed at how easy it was to set up my home wireless network and how incredibly reliable it's been. We have 3 PCs and two laptops at home plus printers/scanners etc. Took me no more than 10 mins per machine and about 15 mins playing with the router and bingo - it all worked. Even more surprising when the router and network cards are from different manufacturers. I anticipate failures though so have a UPS for the PCs and we back up everything onto a 2TB RAID 5 NAS box from Ecobyte. Overkill? No way... best investment I ever made especially when one of the laptop HDDs committed hara-kiri shortly afterwards. Darn thing was only five years old!!

  3. Fergal O'Leary

    The essence of IT failures is that which we cannot find in the domain of the real world. In life as in death there is nothing which one can take for ones self to be granted. The next IT disaster is always around the corner at the end of the street on the left

  4. anonymous

    What was that last comment about death and the street?

  5. Niall Connell

    What the hell is that O'Leary fella talking about??

  6. Rudiger O'Toole

    O'LEARY. That doesn't make any sense at all. Save it for a meta-physics lecture.
