High-Availability

Introduction to Reliability

Whatever service a computer system provides, users must be able to trust how the system operates in order to use it properly. The term "reliability" characterises how trustworthy a computer system is.

A failure occurs when a service does not function properly, i.e. when it is in a state of operation that is abnormal or, more precisely, not in accordance with specifications. From the user's point of view, a service has two states:

  • appropriate service, i.e. in accordance with expectations
  • inappropriate service, i.e. not in accordance with expectations

A failure is attributable to an error, i.e. a local dysfunction. Not all errors lead to service failure.

There are several ways to limit service failure:

  • Error prevention, which consists of avoiding errors by anticipating them
  • Fault tolerance, the goal of which is to provide a service that is in accordance with specification despite errors by introducing redundancy
  • Error elimination, aiming to reduce the number of errors through corrective actions
  • Error prediction, by anticipating errors and their impact on service

Introduction to High-Availability

"High-availability" refers to all the measures that aim to guarantee service availability, i.e. to ensure around-the-clock operation of a service.

The term "availability" refers to the probability that a service is operating properly at a given time.

The term "reliability", which is also sometimes used, refers to the probability that a system is operating normally over a given period of time. This is called "continuity of service".

Availability is most often expressed as an availability rate (a percentage), calculated by dividing the time the service is available by the total time.

Availability rate    Downtime per year
97%                  11 days
98%                  7 days
99%                  3 days and 15 hours
99.9%                8 hours and 46 minutes
99.99%               53 minutes
99.999%              5 minutes
99.9999%             32 seconds
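
The downtime figures above follow directly from the availability rate. The short Python sketch below shows the arithmetic, assuming a 365-day year; the function names are illustrative and do not come from the original article.

```python
def availability_rate(available_time, total_time):
    """Availability rate as a percentage: available time divided by total time."""
    return 100 * available_time / total_time


def downtime_per_year(availability_percent, days_per_year=365):
    """Yearly downtime, in seconds, implied by an availability rate.

    Assumption: a 365-day year unless another value is passed.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days_per_year * 24 * 3600


def format_duration(seconds):
    """Render a number of seconds as days, hours, minutes and seconds."""
    seconds = round(seconds)
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


# Print the downtime for each rate listed in the table.
for rate in (97, 98, 99, 99.9, 99.99, 99.999, 99.9999):
    print(f"{rate}% -> {format_duration(downtime_per_year(rate))}")
```

For example, a 99% rate leaves 1% of 365 days unavailable, i.e. 3.65 days, which the sketch renders as 3d 15h 36m 0s.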

Risk Evaluation

The failure of a computer system can cause losses in productivity and money and, in certain critical cases, even material and human losses. It is therefore necessary to evaluate the risks tied to the failure of each component of a computer system and to plan the means and measures needed to avoid incidents or to restore service in an acceptable amount of time.

There are numerous ways in which a networked computer system can fail. The causes of failures can be broken down as follows:

  • Physical causes (these can be natural or criminal in nature):
    • Natural disaster (flood, earthquake, fire)
    • Environment (bad weather, humidity, temperature)
    • Hardware failure
    • Network failure
    • Power cut
  • Human causes (these can be intentional or accidental):
    • Design error (software bug, poor network provisioning)
  • Operational causes (these are linked to system status at a given moment):
    • Software bug
    • Software failure

Any of these risks can itself have different origins, including intentional malice.

Fault Tolerance

Since it is impossible to prevent breakdowns entirely, one solution is to set up redundancy mechanisms by duplicating critical resources.
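
The benefit of duplicating a resource can be estimated with a simple calculation. The sketch below assumes that the copies fail independently of one another and that the service stays up as long as at least one copy works; both are simplifying assumptions that real systems only approximate, and the function name is illustrative.

```python
def combined_availability(component_availability, replicas):
    """Availability of a group of redundant copies of one resource.

    Assumptions (not from the original article): failures are independent,
    and a single working copy is enough to provide the service.
    `component_availability` is a fraction between 0 and 1.
    """
    if replicas < 1:
        raise ValueError("at least one replica is required")
    # The group is unavailable only when every copy is down at the same time.
    return 1 - (1 - component_availability) ** replicas


for n in (1, 2, 3):
    print(n, "replica(s):", combined_availability(0.99, n))
```

Under these assumptions, two copies that are each available 99% of the time give roughly 99.99% combined availability.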

The ability of a system to operate despite the failure of one of its components is called fault tolerance.

When one of the resources breaks down, the other resources take over, giving system administrators time to find a solution to the problem. This is called "Fail-Over Service" (FOS).
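
The takeover idea can be sketched in a few lines of Python. This is a minimal illustration, not a production fail-over mechanism: each resource is modelled as a plain function, and `primary`, `standby` and `call_with_failover` are hypothetical names introduced for the example.

```python
def call_with_failover(resources, request):
    """Send the request to each redundant resource in turn.

    Returns the answer of the first resource that works; raises only
    when every resource has failed.
    """
    errors = []
    for resource in resources:
        try:
            return resource(request)
        except Exception as exc:
            # This resource is down: remember why and let the next one take over.
            errors.append(exc)
    raise RuntimeError(f"all {len(resources)} resources failed: {errors}")


def primary(request):
    """Hypothetical primary resource, simulated as broken."""
    raise ConnectionError("primary is down")


def standby(request):
    """Hypothetical standby resource that takes over."""
    return f"handled {request!r} on standby"


print(call_with_failover([primary, standby], "ping"))
```

The recorded errors are what an administrator would inspect while the standby keeps the service running.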

Ideally, in the case of hardware failures, the faulty components should be hot swappable, i.e. capable of being removed and replaced without service interruption.

Backup

Setting up a redundant architecture ensures that system data will be available but does not protect the data against user-introduced errors or against natural disasters such as fires, floods or even earthquakes.

It is therefore necessary to set up backup mechanisms (ideally remote) in order to guarantee that data is preserved over the long term.

Moreover, a backup mechanism can also be used for archival storage, i.e. saving data in a state that corresponds to a given date.
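
As a rough illustration of archival storage, the sketch below copies a file into an archive directory under a name that carries the date of the copy. The paths, the naming scheme and the `archive_copy` function are assumptions made for the example; a real backup would normally target a remote location.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_copy(source, archive_dir, day=None):
    """Copy `source` into `archive_dir`, tagging the copy with a date.

    The name pattern (stem, ISO date, suffix) is an illustrative choice.
    """
    day = day or date.today()
    source = Path(source)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{source.stem}-{day.isoformat()}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Calling `archive_copy("data.txt", "archive", day=date(2008, 10, 16))` would produce `archive/data-2008-10-16.txt`, so each dated copy keeps the data as it was on that day.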

Last update on Thursday October 16, 2008 02:43:18 PM. This document entitled « High-Availability » from Kioskea (en.kioskea.net) is made available under the Creative Commons license. You can copy and modify copies of this page, under the conditions stipulated by the license, as long as this note appears clearly.
