By Tony Lock, 4 December 2008 09:30
COMMENT
IT systems are notorious for failing just when they're needed most. Freeform Dynamics' Tony Lock asks: is there anything organisations can do to fix this?
A recent report by Freeform Dynamics shows that IT systems fail. What's more, a quick glance at the chart below shows they fail far more frequently than one might expect in today's 'high availability' environments.
Such lack of resilience might be surprising if one just went by the content of press releases from many of the leading IT vendors, especially those in the virtualisation markets.
So what's going wrong and who needs to step up and take responsibility? Are IT pros and business analysts getting something wrong or are the IT vendors selling lots of pups?
As we can see from the chart above, all components essential for application delivery are prone to failure. Most frequently cited are software failures followed in second place by network failures or performance degradation, with the failure of physical components trailing back in third place.
Despite much public beating of chests, power outages and brownouts have yet to cause application failure as often as any of the other three areas addressed. So the inference is that software, hardware and network failures account for the vast majority of systems interruptions.
It is also clear that service disruptions occur when human intervention interrupts application availability. The figure below highlights that while hardware, software and networking considerations are important in ensuring service availability, operational management processes and practices must also match the quality of service desired for each application.
It must be said frankly that if the people and process side of system availability is not addressed, the chances are that systems will fall over, probably time after time.
This all brings up the question: in which area should an organisation seek to add additional resilience to its application delivery?
Is it a question of spicing up the software side of things, acquiring better hardware or getting more resilient networking in place?
Well, naturally enough, in an ideal world all three factors would receive more than adequate attention long before an application is due to enter live service, thereby allowing its requirements for performance, availability and recovery/protection to be more than adequately met.
Alas, as we know all too well, this is the real world where, as the figure below shows, such attention to service availability is not considered early enough in projects.
If we consider the area of employing software-based solutions to enhance application availability, a lot of attention, especially from the vendor and channel community, is focused on applying a number of virtualisation technologies to the problem. It is fair to say the many flavours of virtualisation currently on offer, especially in the x86 server space, are being promoted as an answer to delivering greater service availability.
A closer look at such offerings, essentially any of the operating system/hypervisor virtualisation technologies or one of the different approaches of application virtualisation solutions available, highlights that the simple solutions mostly deliver faster recovery after failure rather than preventing failure in the first place.
A good first step perhaps but it is only now that the effective management of virtualised solutions is beginning to offer the high levels of pro-active availability that such established platforms as the mainframe manage with ease. It is also worthwhile noting that these virtualisation solutions should not overshadow the need to ensure that the application itself is written in a way that enhances availability and can interoperate with new virtualisation solutions to raise service availability.
As has been mentioned, over the course of the last few years software systems, especially virtualisation offerings, have rather stolen the limelight when it comes to adding resilience to applications. However it is still a fact that utilising hardware platforms that are designed with application availability in mind can also deliver great value.
These solutions are frequently overlooked, as today there is a tremendous volume of marketing claiming that industry-standard systems are good enough. For many applications this is true, but if you need to go that extra step along the road of application resilience, fault-tolerant or fault-resistant server and storage hardware deserves consideration, especially if what matters is keeping applications running rather than simply being able to restart them quickly after a failure.
A serious question needs to be asked of the vendors: do they really understand what is important for organisations when it comes to application availability? Or do many of them actually believe that 'virtualisation' is the marketing storm that solves all problems at once?
At the moment it appears the latter is so. And beware - it appears cloud computing may soon be proffered as the next answer to application availability, life, the universe and everything.
Tony Lock is programme director at Freeform Dynamics.


Comments
There are 3 comments.
1. Roger Huffadine
Resilience is truly a design issue - the reason why people don't have sufficient resilience in their systems is simple:
Lack of Vision - the inability to imagine a bad outage.
Lack of money - the inability to pay for resilience.
Lack of competition - the "we are no worse than our nearest rival" attitude.
The only systems that have resilience built in are those involving 'safety of life', and these are often plagued by Lack of Money / Lack of Vision.
Even when people know that a disaster caused by inadequate built-in resilience is certain to result in a public inquiry, there are still accountants and senior executives who will risk people's lives rather than spend money - after all, so very few of them go to prison for killing innocent people that it is worth the risk.
2. Radical Meldrew
I've been involved with building various networks for more than 20 years and have seen some horrible abominations in the computer rooms of several blue chip companies - those elegant designs go out of the window when it comes to implementation. A network can only be as good as its design AND build. Another worrying factor is the tendency to go for higher spec upgrades whilst ignoring the basics. I've seen massive networks, poorly designed and riddled with single points of failure, and the answer is - a hardware upgrade! A network is only as good as the underlying infrastructure; it's after all just a series of integrated, clocked interfaces sending out information. Adding to this, modern equipment has developed an amorphous capability to reconfigure itself to automatically accommodate and fix network failures - those 'temporary' fixes are often ignored and become a permanent network feature - definitely not a good long-term idea.
Financial constraints are another factor; networks tend to grow by necessity rather than design and the budget is always carefully scrutinised for potential savings. Most of these savings are made by sacrificing resilience. The obvious answer is a scalable solid core that is up to the job, well built and not as specified by the board or company accountant.
I would also like to add that the face of modern IT seems to have changed over the last decade; it's acquired an elitist attitude which is certainly to its detriment. Experience counts for little and you're not allowed to play unless you exude confidence and spin - even better - you've got a degree in something or other. I must confess that I also have a degree but, unfortunately for me, mine wasn't in techno bull***t.
3. Ian Heron
Today's software designs are far too complex to ever be resilient. Most failures can be traced to human error, either in design, build, test, deployment or execution. There are now new platforms emerging that go back to basics in order to offer higher levels of availability. Their true horizontal scalability model combined with single object test and deploy capabilities offer the first hope for sensible computing since the eighties.