Agglomeration methods

Agglomeration methods . In statistics, an agglomerative classification initially starts with the objects, which are progressively merged to form successive partitions that contain all the original objects. In a divisive classification, the total set Ω is started, which is progressively subdivided into small clusters.

Summary

[ hide ]

• 1 General
• 2 Agglomerative hierarchical clustering methods
• 1 Simple link grouping method
• 2 Full link grouping method
• 3 Average linkage grouping method
• 4 Ward’s method
• 5 Graphic representation
• 3 Divisive hierarchical clustering methods
• 4 Partitional agglomeration methods
• 5 Sources

General

In a non-hierarchical classification, homogeneous groups are formed without establishing relationships between the groups; In a hierarchical classification, the groups gradually merge, while the homogeneity between the groups decreases, each time more and more they are formed. Measuring this homogeneity by means of an index (phenetic distance) is a characteristic of numerical taxonomy . A hierarchical classification is generally agglomerative.
Cluster analysis is applied on a distance matrix and not on an association matrix. For qualitative descriptors, the latter must be transformed into one of distance. These distances, in turn, can be plotted in different ways, with dendrograms and point scattering on a Cartesian plane being the easiest to interpret.
Given this diversity of grouping methods or algorithms, different strategies can be followed. The simplest is exhaustive, consisting of blindly testing several and comparing results, but the computational cost can be high and the reason for the optimal result does not always have a clear interpretation.

Agglomerative hierarchical clustering methods

Hierarchical cluster analysis begins with the calculation of the distance matrix between the elements in the sample, which contains the distances between each element in the sample and all the others. The two closest elements (similar in terms of distances) are then searched for and grouped into a cluster. The resulting conglomerate is indivisible thereafter. In this way, the elements are grouped into ever larger and more heterogeneous conglomerates until reaching the last step, in which all the elements of the sample are grouped into a single group. Hierarchical methods create a decomposition of objects into hierarchical groups, in the style of “taxonomies” (superfamilies, families, species …).
The versatility of hierarchical cluster analysis lies in the possibility of using different types of measures to estimate the distance between the cases and the possibility of selecting one among a wide variety of methods. But there is no combination of these possibilities that optimizes the solution obtained. In general, it will be convenient to evaluate different solutions to choose the most consistent one.
For the representation of the individuals, once the distances have been calculated, a grouping method must be chosen. These methods refer to the iterative processes that agglomerate the individuals and that define the neighbors in the represented branches. Criteria such as closest distance, mean distance (UPGMA), or weighted mean (WPGMA) are some examples.

The single linkage clustering or closest neighbor clustering method (Gower, 1967) begins by selecting and joining the two elements of the closest distance matrix. The distance of this new cluster from the other elements of the matrix is ​​calculated as the smallest of the distances between each element of the cluster and the rest of the elements of the matrix. In successive steps, the distance between two clusters is calculated as the distance between their two closest elements.

In English Complete linkage clustering or furthest neighbor clustering method (Sorensen, 1948), behaves in the opposite way to the previous one. The distance between two clusters is calculated as the distance between their two most distant elements.

In English Average linkage clustering ( unweighted Pair-group arithmetic averages (UPGMA)) (Sneath and Sokal, 1973), has the advantage, over the previous two, of taking advantage of the information of all the members of the two conglomerates that are compared. The distance between two clusters is calculated as the average distance between all the pairs of elements in both clusters.

Ward’s method

Ward’s (Ward, 1963) or minimum variance ( Minimum variant clustering) method), for which its author argued that conglomerates should be constituted in such a way that, when two elements merge, the loss of information resulting from the merger was minimal, quantifies the amount of information as the sum of the squared distances of each element with respect to the centroid of the conglomerate it belongs to (SCE = Sum of Squares Error). To do this, we start by calculating, in each cluster, the vector of means of all the variables, that is, the multivariate centroid. The squared Euclidean distances between each element and the centroids (mean vector) of all the clusters are then calculated. Finally, the distances corresponding to all the elements are added.
At each step, those clusters are joined that give rise to a smaller increase in the SCE, that is, in the sum of squares of the intra-cluster distances.
The method of medium linkage within groups, or intra-group linkage, as in the previous case, takes advantage of the information of all the members of the clusters that are compared by joining them previously. The distance between two clusters is calculated as the average distance between all the members of the conglomerate joining both.

Graphic representation

Hierarchical cluster methods provide a dendrogram that is the graphical representation that allows you to visualize the relationships between the different OTUs. Obviously, as in any multifactorial analysis, some information is lost during grouping, but it is extremely useful to summarize a large amount of information. It should be noted that the similarity matrix sometimes involves hundreds of characteristics that would be impossible to analyze together.
Like the procedure above, the distance matrix at each stage for calculations is the matrix from the previous step.

These methods, once a group has been mixed, or an object has been removed from a group, the process cannot be undone. This rigidity is advantageous in relation to computational time, but at a cost of not analyzing all possible combinations of decisions or choices to be made.

Divisive hierarchical clustering methods

The algorithms start with all the observations joined in a single cluster. In successive steps the conglomerate is divided, and the result of the division forms two new conglomerates, which are divided again.

Partitional agglomeration methods

The algorithms begin by dividing the n observations into K groups. This first assignment can be done randomly. In each of the groups, the mean vector (center of the group) is obtained and each observation is assigned sequentially to the group whose center is closest. At each stage, the center of the group to which an observation is added and the center of the group from which that observation is removed are recalculated.
The stats package is part of the basic R library that is installed by default, it contains the hclust () function that allows obtaining several hierarchical conglomerate analyzes.
As an additional package for cluster analysis is the cluster package which broadens the range of cluster analysis, as it also includes divisional and partition hierarchical methods.