High availability cluster

High Availability Cluster (HA). Today, organizations are increasingly dependent on their information systems, and they want those systems to be secure and to remain available for as long as possible.

For any company, an interruption of its information systems is a serious problem.

Summary

  • 1 Effects due to the interruption of an information system
  • 2 Types of cluster
  • 3 Availability
  • 4 Calculation of availability
  • 5 The reasons for implementing a high availability cluster are
  • 6 High availability settings
  • 7 Operation of a high availability cluster
  • 8 Basic elements and concepts in the operation of the cluster
    • 1 Resource and Resource Groups
    • 2 Intercommunication
    • 3 Heartbeat
    • 4 Split-Brain Scenario
    • 5 Resource Monitoring
    • 6 Restart Resources
    • 7 Migration of Resources
    • 8 Dependency between resources
    • 9 Node preference
    • 10 Communication with other systems
    • 11 Fencing
    • 12 Quorum
  • 9 Sources

Effects of interruption of an information system

  • Direct costs associated with the repair of the information system (parts to repair or replace, freight, technical services, etc.).
  • Additional working hours for the systems department that has to repair the fault.
  • Productivity losses or even lost working hours by employees who depend on the system.
  • Loss of income, from sales or services that have been stopped.
  • Indirect costs: customer satisfaction, loss of reputation, bad publicity, mistrust of employees, etc.

Availability is a measure of how ready a computer system is for use, while reliability is a measure of its ability to stay operational over time without failure.

Potential system failures include hardware component errors, operating system errors or crashes, and application errors.

A High Availability Cluster is a set of two or more machines that are characterized by maintaining a series of shared services and constantly monitoring each other.

Cluster types

  1. Infrastructure high availability: If a hardware failure occurs on any of the machines in the cluster, the high availability software can automatically start the affected services on any of the other machines in the cluster (failover). When the failed machine recovers, the services are migrated back to the original machine (failback). This automatic service recovery capacity guarantees the high availability of the services offered by the cluster, minimizing the perception of failure by users.
  2. Application high availability: If a hardware or application failure occurs on any of the machines in the cluster, the high availability software can automatically start the services that have failed on any of the other machines in the cluster. When the failed machine recovers, the services are migrated back to the original machine. This automatic service recovery capacity guarantees the integrity of the information, since there is no loss of data, and also avoids inconvenience to users, who do not need to notice that a problem has occurred.

Availability

Availability is the degree to which an application or service is available when and how users expect. Availability is measured by an end user's perception of an application. End users experience frustration when their data is not available, and they neither understand nor are able to differentiate the complex components of a global solution. Reliability, recovery, error detection and continuous operations are characteristics of a high availability solution.

  1. Reliability: Reliable hardware components and reliable software, including the database, web servers and applications, are the critical part of a high availability solution implementation.
  2. Recovery: There can be many options to recover from failure if one occurs. It is important to determine what types of failures may occur in your high availability environment and how to recover from these failures in the time that meets business needs. For example, if an important table is removed from the database, what steps would you take to recover it? Does your architecture offer the ability to recover in the time specified in a service level agreement (SLA)?
  3. Error detection: If a component in your architecture fails, rapid detection of that failure is essential to recovering from it. While you may be able to recover quickly from a power outage, if it takes another 90 minutes to figure out the problem, then you cannot satisfy your SLA. Monitoring the state of the work environment requires reliable software that quickly detects a problem and notifies the Database Administrator (DBA).
  4. Continuous operations: Continuous access to your data is essential, so maintenance tasks must require little or no system downtime. Activities such as moving a table from one place to another within the database, or even adding new CPUs to your hardware, must be transparent to the end user in an HA architecture.

Availability calculation

In a real system, if one of the components fails, it is repaired or replaced by a new component. If this new component fails, it is replaced by another one, and so on. A repaired component is considered to be in the same state as a new one. During its useful life, a component is in one of two states: Running or In Repair. Running indicates that the component is operational; In Repair means that it has failed and has not yet been repaired or replaced.

When a defect occurs, the system enters the repair state, and when the replacement is made it returns to the running state. Therefore, over its life the system has a mean time to failure (MTTF) and a mean time to repair (MTTR). Its lifetime is a succession of MTTF and MTTR periods as it fails and is repaired. The useful life of the system is the sum of the MTTF periods in the MTTF + MTTR cycles it has already lived.

In simplified form, the availability of a system is the ratio between its useful life and its total lifetime. This can be represented by the formula below:

Availability = MTTF / (MTTF + MTTR)
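As a minimal sketch of the formula above, the following Python function computes availability from MTTF and MTTR. The function name and the figures in the example (1000 hours between failures, 10 hours to repair) are illustrative assumptions, not values from any real system.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction between 0 and 1.

    Implements Availability = MTTF / (MTTF + MTTR).
    """
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: one failure every 1000 hours on average,
# and 10 hours on average to repair it.
example = availability(1000.0, 10.0)
print(f"Availability: {example:.2%}")  # roughly 99%
```

Note that halving the repair time in this example improves availability about as much as doubling the time between failures, which is why fast detection and recovery matter as much as reliable components.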

In evaluating a High Availability solution , it is important to consider whether possible planned stops are seen as failures in the MTTF measurement.

Today, with the right choice of hardware and software, it is relatively easy to design a system with 98% availability. But moving from 98% to 99%, and from there to 99.9999%, is a complex task that also entails an exponential increase in the total cost of the system. In practice, a compromise is reached between the intended availability and the affordable cost.
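To make these percentages concrete, the sketch below converts an availability level into the downtime it allows per year. It assumes a 365-day year (8760 hours) and a hypothetical function name; it is only an arithmetic illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def yearly_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed by an availability level."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


for level in (98.0, 99.0, 99.9, 99.9999):
    hours = yearly_downtime_hours(level)
    print(f"{level}% availability allows about {hours:.4f} hours of downtime per year")
```

At 98% the system may be down for roughly 175 hours a year, while at 99% that budget is halved to about 88 hours, and each further "nine" shrinks it by another factor of ten.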

The reasons for implementing a high availability cluster are

  • Increase availability
  • Improve performance
  • Scalability
  • Fault tolerance
  • Recovery from failures in acceptable time
  • Reduce costs
  • Consolidate servers
  • Consolidate storage

High availability settings

The most common configurations in high availability cluster environments are active / active and active / passive.

  • Active / Active Configuration

In an active / active configuration, all servers in the cluster can run the same resources simultaneously. In other words, the servers have the same resources and can access them independently from the other servers in the cluster. If a system node fails and becomes unavailable, its resources remain accessible through the other servers in the cluster.

The main advantage of this configuration is that the servers in the cluster are more efficient since they can all work at the same time. However, when one of the servers is no longer accessible, its workload passes to the remaining nodes, leading to a degradation of the overall level of service offered to users.

The following figure shows how both servers are active, providing the same service to different users. Clients access the service or resources transparently and are unaware of the existence of several servers forming a cluster.

  • Active / Passive Configuration

A high availability cluster in an active / passive configuration consists of a server that owns the cluster resources and other servers that are able to access those resources but do not activate them until the owner of the resources is no longer available.

The advantages of the active / passive configuration are that there is no service degradation and that the services are only restarted when the active server stops responding. However, a disadvantage of this configuration is that the passive servers do not provide any service while they are waiting, making the solution less efficient than an active / active cluster. Another disadvantage is that it takes time to migrate the resources (failover) to the standby node.

High availability cluster operation

In a high availability cluster, the cluster software performs two fundamental functions. On the one hand, it keeps all the nodes in communication with each other, continuously monitoring their status and detecting faults. On the other hand, it manages the services offered by the cluster and is able to migrate these services between different physical servers in response to a failure.

Basic elements and concepts in the operation of the cluster

Resource and Resource Groups

Traditionally, a service is understood as a set of processes that run at a given time on a server and operating system. The latter provides the processes with the resources they need to perform their task: file system, network interfaces, CPU time, memory, etc.

In a high availability cluster, the cluster software abstracts the services and makes them independent of any specific host, enabling them to move between different servers transparently for the application or users.

The cluster software allows resource groups to be defined, each containing all the resources required by a service. These resources are typically the service startup scripts, a file system, an IP address, etc.

Intercommunication

The cluster software manages services and resources on the nodes. In addition, it must continuously maintain a global view of the configuration and state of the cluster. In this way, when a node fails, the remaining nodes know which services should be restored.

Since communication between cluster nodes is crucial for cluster operation, it is common to use a dedicated channel, such as a separate IP network or a serial connection, so that it is not affected by security or performance issues on other networks.

Heartbeat

The cluster software knows the availability of the physical machines at all times thanks to the heartbeat technique. The operation is simple: each node periodically announces its existence by sending the other nodes a "signal of life".
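The heartbeat idea can be sketched as a table of "last seen" timestamps with a timeout. The class name, the 3-second timeout and the node names below are assumptions made for illustration; real cluster software sends these signals over the network and uses its own timing parameters.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Track the last heartbeat of each node and report silent nodes."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node: str, now: Optional[float] = None) -> None:
        # Store when the "signal of life" from this node arrived.
        self._last_seen[node] = time.monotonic() if now is None else now

    def failed_nodes(self, now: Optional[float] = None) -> List[str]:
        # A node is considered failed once it has been silent longer than the timeout.
        current = time.monotonic() if now is None else now
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if current - seen > self.timeout_seconds
        )


# Timestamps are passed explicitly so the example is deterministic.
monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=2.0)  # node-b stays silent
print(monitor.failed_nodes(now=4.0))  # ['node-b']
```

The timeout is a trade-off: a short value detects failures quickly but risks declaring a slow node dead, while a long value delays failover.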

Split-Brain scenario

In a split-brain scenario, more than one server or application belonging to the same cluster tries to access the same resources, which can damage those resources. This scenario occurs when each server in the cluster believes that the other servers have failed and tries to activate and use those resources itself.

Resource monitoring

(Resource Monitoring)

Certain HA clustering solutions not only monitor whether a physical host is available; they can also track resources or services and detect when these fail.

The administrator can configure the periodicity of these monitors as well as the actions to be carried out in case of failure.

Restart Resources

When a resource fails, the first measure that cluster solutions take is to try to restart that resource on the same node. This involves stopping an application or releasing a resource and then activating it again later.

Some implementations do not allow a single resource to be restarted and instead restart the entire group of resources (the service). This can take a long time for services such as databases.

Resource Migration

(Failover)

When a node is no longer available, or when a failed resource cannot be successfully restarted on a node, the cluster software reacts by migrating the resource or group of resources to another available node in the cluster.

In this way, the downtime for the possible failure is minimal, and the cluster will continue to provide the corresponding service.
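The two recovery steps described above (restart on the same node first, then migrate) can be sketched as follows. The function `recover_resource`, its parameters and the `start_on` helper are hypothetical names for illustration; they do not correspond to the API of any particular cluster product.

```python
from typing import Callable, List, Optional


def recover_resource(
    current_node: str,
    available_nodes: List[str],
    try_start: Callable[[str], bool],
    max_restarts: int = 1,
) -> Optional[str]:
    """Return the node now running the resource, or None if recovery failed."""
    # First measure: try to restart the resource on the node where it failed.
    if current_node in available_nodes:
        for _ in range(max_restarts):
            if try_start(current_node):
                return current_node
    # Second measure: migrate (failover) to another available node.
    for node in available_nodes:
        if node != current_node and try_start(node):
            return node
    return None


# Hypothetical start function: the resource can no longer start on "node-a".
def start_on(node: str) -> bool:
    return node != "node-a"


print(recover_resource("node-a", ["node-a", "node-b"], start_on))  # node-b
```

If the current node is missing from the list of available nodes, the restart step is skipped and the sketch goes straight to failover, which matches the case of a node that is no longer available.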

Dependency between resources

Usually, for the cluster to provide a service, not one resource but several are needed (virtual IP, file system, process); together they are known as a resource group. When a service is started or stopped, its resources have to be activated or deactivated in the proper order, as some depend on others. Cluster software must allow these dependencies to be defined between resources as well as between groups.
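One way to derive a valid start and stop order from such dependencies is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the resource names and the dependency map are assumptions chosen to mirror the example of a virtual IP, a file system and a process.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical resource group: each resource maps to the resources it depends on.
dependencies: Dict[str, Set[str]] = {
    "virtual_ip": set(),
    "filesystem": set(),
    "database": {"virtual_ip", "filesystem"},
}


def start_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every resource starts after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def stop_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Stopping is the reverse: dependents stop before what they depend on."""
    return list(reversed(start_order(deps)))


print("start:", start_order(dependencies))
print("stop: ", stop_order(dependencies))
```

The relative order of `virtual_ip` and `filesystem` is not fixed because neither depends on the other; only `database` is guaranteed to start last and stop first. A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is a useful check on a misconfigured group.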

Node preference

(Resource Stickiness)

In cluster configurations with multiple nodes, it is common to distribute the services to be provided among the different servers. In addition, the servers may have different hardware characteristics (CPU, RAM), so in the ideal state of the cluster we may want certain services to always run on a certain server.

This behavior is defined by node preference in the definition of each resource.
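A minimal sketch of node preference, assuming it is expressed as a numeric score per node (higher is preferred). The scoring scheme, the function name and the node names are illustrative assumptions; actual cluster software may express preferences differently.

```python
from typing import Dict, Optional, Set


def choose_node(preferences: Dict[str, int], available: Set[str]) -> Optional[str]:
    """Pick the available node with the highest preference score."""
    candidates = {node: score for node, score in preferences.items() if node in available}
    if not candidates:
        return None
    # Highest score wins; ties are broken by node name so the result is deterministic.
    return min(candidates, key=lambda node: (-candidates[node], node))


# Hypothetical preferences for one resource: node-a is the preferred server.
prefs = {"node-a": 100, "node-b": 50}
print(choose_node(prefs, {"node-a", "node-b"}))  # node-a
print(choose_node(prefs, {"node-b"}))            # node-b, because node-a is down
```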

Communication with other systems

The cluster has to monitor not only that a server and its services are active; it must also verify that, from the users' point of view, the server has not been disconnected from the network by the failure of a cable, switch, etc.

Therefore the cluster software must verify that the nodes are reachable. A simple method is to verify that each node can reach the router or gateway of the user network.
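The gateway check can be sketched as a probe that is retried a few times before the node is declared disconnected. Everything here is an assumption for illustration: the retry count, the function names, and the use of a TCP connection as the probe (many real deployments use an ICMP ping instead, and a gateway may not accept TCP connections at all).

```python
import socket
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def gateway_reachable(probe: Callable[[], bool], attempts: int = 3) -> bool:
    """Return True as soon as one probe succeeds, False if every attempt fails."""
    return any(probe() for _ in range(attempts))


# Example wiring (not executed here): 192.0.2.1 is a documentation-only address.
# reachable = gateway_reachable(lambda: tcp_probe("192.0.2.1", 80))
```

Retrying before declaring the gateway unreachable avoids triggering a failover because of a single lost packet.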

Fencing

In HA clusters a node can stop working correctly while still being up, accessing certain resources and answering requests. To prevent the node from corrupting resources or responding to requests, clusters isolate it using a technique called Fencing.

The main function of Fencing is to let this node know that it is operating in a bad state, remove its assigned resources so that they can be served by other nodes, and leave the node in an inactive state.

Quorum

To prevent a Split-Brain scenario from occurring, some HA cluster implementations introduce an additional communication channel that is used to determine exactly which nodes are available in the cluster and which are not. Traditionally this is implemented using so-called quorum devices, which are usually a dedicated shared storage volume (disk heartbeating). There are also implementations that use an additional network connection or a serial connection. The latter has distance limitations and is now considered obsolete.
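One common way to use a quorum device is as a tie-breaking vote. The sketch below assumes a voting model that the text above does not spell out: each node has one vote, the quorum device adds one more, and a partition may run resources only if it holds a strict majority. The function and node names are hypothetical.

```python
from typing import Set


def has_quorum(partition: Set[str], all_nodes: Set[str], holds_quorum_device: bool) -> bool:
    """Return True if this partition holds a strict majority of the votes.

    Assumed model: one vote per node plus one vote for the quorum device.
    """
    total_votes = len(all_nodes) + 1
    votes = len(partition & all_nodes) + (1 if holds_quorum_device else 0)
    return votes > total_votes / 2


# Two-node cluster whose nodes can no longer see each other.
nodes = {"node-a", "node-b"}
print(has_quorum({"node-a"}, nodes, holds_quorum_device=True))   # True: 2 of 3 votes
print(has_quorum({"node-b"}, nodes, holds_quorum_device=False))  # False: 1 of 3 votes
```

Because only one partition can hold the device at a time, at most one side reaches a majority, so both nodes cannot activate the same resources simultaneously.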
