\(K\)-means and hierarchical clustering discussed in Chapter 25 have in common that an observation is assigned to exactly one cluster; this is referred to as hard clustering. Also, the methods are nonparametric in the sense that no distributional assumptions are made. The clusters are based solely on the distance (dissimilarity) between data points. The goal of clustering is to find regions that contain modes of the distribution \(p(\textbf{X})\). The implicit assumption of \(K\)-means and hierarchical clustering is that regions where many data points are close according to the chosen distance metric are regions where the density \(p(\textbf{X})\) is high.
In model-based clustering, this is made explicit through assumptions about the form of \(p(\textbf{X})\), the joint distribution of the variables. An important side effect of its probabilistic formulation is that observations no longer belong to just one cluster. They have a non-zero probability to belong to any cluster, a property referred to as soft clustering. An advantage of this approach lies in the possibility to assign a measure of uncertainty to cluster assignments.
A common assumption is that the distribution of \(\textbf{X}\) follows a finite mixture model (FMM), that is, it comprises a finite number \(k\) of component distributions.
In general, a finite mixture distribution is a weighted sum of \(k\) probability distributions where the weights sum to one: \[ p(Y) = \sum_{j=1}^k \pi_j \, p_j(Y) \quad \quad \sum_{j=1}^k \pi_j = 1 \] \(p_j(Y)\) is called the \(j\)th component of the mixture and \(\pi_j\) is the \(j\)th mixing probability, also referred to as the \(j\)th mixture weight. The finite mixture is called homogeneous when the \(p_j(Y)\) are of the same family, and heterogeneous when they are from different distributional families. An important example of a heterogeneous mixture, discussed in Section 10.6, is a zero-inflated model for counts, a two-component mixture of a constant distribution and a classical distribution for counts such as the Poisson or Negative Binomial.
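As a quick illustration (not part of the text's analysis), the density of a homogeneous two-component Gaussian mixture can be computed in R as the weighted sum of its component densities; the means, standard deviations, and weights below are made up for the example.
# Illustrative two-component Gaussian mixture; parameters are arbitrary
pi_mix <- c(0.3, 0.7)                       # mixing probabilities, sum to one
p_mix <- function(y) {
    pi_mix[1]*dnorm(y, mean=0, sd=1) + pi_mix[2]*dnorm(y, mean=3, sd=1.5)
}
curve(p_mix(x), from=-4, to=8, ylab="p(y)") # bimodal when the components are well separated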
The class of finite mixture models (FMM) is very broad and the models can get quite complicated. For example, each \(p_j(Y)\) could be a generalized linear model and its mean can depend on inputs and parameters. The mixing probabilities can also be modeled as functions of inputs and parameters using logistic (2-component models) or multinomial model types.
How do mixtures of distributions come about? An intuitive motivation is through a latent variable process. A discrete random variable \(S\) takes on states \(j=1,\cdots,k\) with probabilities \(\pi_j\). It is called a latent variable because we cannot observe it directly. We only observe its influence on the distribution of \(Y\), the variable we are interested in, and that depends on \(S\). This dependence results in a different conditional distribution of \(Y\) for each of the possible states of \(S\), \(p(Y| S=j)\). We can then derive the marginal distribution of \(Y\), the distribution we are interested in, by integrating out (summing over) \(S\) in the joint distribution
\[\begin{align*} p(Y) &= \sum_{j=1}^k p(Y, S=j) \\ &= \sum_{j=1}^k \Pr(S=j) \, p(Y|S=j) = \sum_{j=1}^k \pi_j \, p_j(Y) \end{align*}\]
This is the same as computing the weighted sum over the conditional distributions, which is exactly the finite mixture formulation. We cannot observe \(S\) directly; it exerts its influence on \(Y\) through the probabilities of its states (\(\pi_j\)) and the conditional distributions \(p(Y|S=j)\).
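To make the latent-variable view concrete, the following sketch (not from the text) simulates from a two-component mixture by first drawing the latent state \(S\) and then drawing \(Y\) conditional on that state; the parameters are arbitrary.
set.seed(87)
n <- 1000
pi_mix <- c(0.3, 0.7)                          # Pr(S = j)
mu <- c(0, 3); sdev <- c(1, 1.5)               # parameters of p(Y | S = j)
S <- sample(1:2, n, replace=TRUE, prob=pi_mix) # latent state, not observed in practice
Y <- rnorm(n, mean=mu[S], sd=sdev[S])          # observed variable
hist(Y, breaks=40, freq=FALSE)                 # the marginal of Y is the mixture density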
A special case of FMMs are Gaussian mixtures where the component distributions are multivariate Gaussian. (See Section 3.7 for a review of the multivariate Gaussian distribution.) The mixture distribution in model-based clustering is \[ p(\textbf{X}) = \sum_{j=1}^k \pi_j \, G(\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j) \]
Each \(G(\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)\) is a \(p\)-variate Gaussian, \(\boldsymbol{\mu}_j\) and \(\boldsymbol{\Sigma}_j\) are the \((p \times 1)\) mean vector and the \((p \times p)\) covariance matrix of the \(j\)th component. These need to be estimated. In terms of clusters, \(\boldsymbol{\mu}_j\) is the center of the \(j\)th cluster and \(\boldsymbol{\Sigma}_j\) determines the volume, shape, and orientation of the \(j\)th cluster.
Figure 26.1 shows contours of the density for two bivariate (\(p=2\)) Gaussian distributions. The distribution on the right has \[\boldsymbol{\mu}_1 = \left [ \begin{array}{c}1 \\ 0\end{array} \right ] \qquad \boldsymbol{\Sigma}_1 = \left [ \begin{array}{cc} 1 & 0.5 \\ 0.5 & 2\\ \end{array} \right] \] and the distribution on the left has
\[\boldsymbol{\mu}_2 = \left [ \begin{array}{c}-1.6 \\ 1\end{array} \right ] \qquad \boldsymbol{\Sigma}_2 = \left [ \begin{array}{cc} 1.5 & 0 \\ 0 & 1\\ \end{array} \right] \] When the variance-covariance matrix is diagonal, the contours align with the axes of the coordinate system, stretching in the direction of greater variance. The covariance between \(X_1\) and \(X_2\) introduces a rotation of the contours in the graphic on the right.
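A sketch of how contours such as those in Figure 26.1 could be drawn, assuming the mvtnorm package is available for the multivariate Gaussian density:
library(mvtnorm)
x1 <- seq(-5, 4, length.out=100)
x2 <- seq(-4, 5, length.out=100)
grid <- as.matrix(expand.grid(x1, x2))
dens1 <- matrix(dmvnorm(grid, mean=c(1, 0),
                        sigma=matrix(c(1, 0.5, 0.5, 2), 2, 2)), 100, 100)
dens2 <- matrix(dmvnorm(grid, mean=c(-1.6, 1),
                        sigma=matrix(c(1.5, 0, 0, 1), 2, 2)), 100, 100)
contour(x1, x2, dens1)                   # rotated contours (non-zero covariance)
contour(x1, x2, dens2, add=TRUE, lty=2)  # axis-aligned contours (diagonal covariance)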
The Gaussian mixture model (GMM) can be viewed as a soft version of \(K\)-means clustering. The latter assigns an observation to exactly one of \(K\) clusters—a hard assignment. The GMM assigns probabilities of cluster membership to each observation based on \(G(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1), \cdots, G(\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\). GMM allows for a “gray area”; an observation has a non-zero probability to belong to any of the \(k\) clusters (Figure 26.2).
To train a Gaussian mixture model on data we need to estimate the following parameters: \(k\), the number of components, \(\boldsymbol{\mu}_1, \cdots, \boldsymbol{\mu}_k\), the means of the distributions, and \(\boldsymbol{\Sigma}_1, \cdots, \boldsymbol{\Sigma}_k\), the variance-covariance matrices of the distributions (covariance matrix for short). Each \(\boldsymbol{\mu}_j\) is a \(p \times 1\) vector and each \(\boldsymbol{\Sigma}_j\) is a \(p \times p\) matrix with up to \(p(p+1)/2\) unique entries. The total number of potential unknowns in a GMM is thus \[ k \times (p + p(p+1)/2) = k\frac{3p + p^2}{2} \]
A mixture with \(k=5\) components and \(p=10\) variables has 325 unknowns. For \(p=20\) this grows to 1,150 unknowns. That is not a large number compared to, say, an artificial neural network, but it is a substantial number for estimation by maximum likelihood or Bayesian methods. To reduce the number of parameters and to add interpretability, the class of mixtures considered is constrained by imposing structure on the covariance matrices of the Gaussian distributions.
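A one-line helper (a sketch, not from the text) reproduces these counts:
# Means plus unique covariance entries: k * (p + p*(p+1)/2)
n_gmm_unknowns <- function(k, p) k * (p + p*(p+1)/2)
n_gmm_unknowns(5, 10)   # 325
n_gmm_unknowns(5, 20)   # 1150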
The most constrained covariance model is to assume that the variables are uncorrelated and have the same variance across the \(k\) clusters (an equal variance model): \[ \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_k = \sigma^2 \textbf{I} \]
The next most constrained model allows for a different variance in each cluster but maintains uncorrelated inputs: \[ \boldsymbol{\Sigma}_j = \sigma_j^2\textbf{I} \] To capture the relevant covariance structures in model-based clustering, Fraley and Raftery (2002) start from the eigendecomposition of \(\boldsymbol{\Sigma}\) and use the general parameterization \[ \boldsymbol{\Sigma}_j = \lambda_j \textbf{D}_j \textbf{A}_j \textbf{D}_j^\prime \]
\(\textbf{D}_j\) is an orthogonal matrix of eigenvectors, \(\textbf{A}_j\) is a diagonal matrix whose elements are proportional to the eigenvalues, and \(\lambda_j\) is a proportionality constant. The highly constrained equal variance model corresponds to the special case \(\boldsymbol{\Sigma}_j = \lambda \textbf{I}\) and the heterogeneous variance model corresponds to \(\boldsymbol{\Sigma}_j = \lambda_j \textbf{I}\).
The idea behind this parameterization is to hold elements constant across clusters or to vary them, and because \(\lambda\), \(\textbf{D}\), and \(\textbf{A}\) represent different aspects of the shape of the covariance matrix, one arrives at a reasonable set of possible models to consider and compare. The geometric aspects of the component distributions captured by the three elements of the decomposition are the volume of the \(j\)th cluster (\(\lambda_j\)), its shape (\(\textbf{A}_j\)), and its orientation (\(\textbf{D}_j\)).
For example, the model \[ \boldsymbol{\Sigma}_j = \lambda \textbf{D}_j \textbf{A}\textbf{D}_j^\prime \] imposes equal volume and equal shape across the clusters, but allows for cluster-specific orientation of the distributions.
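The decomposition can be verified numerically. The following sketch builds a bivariate covariance matrix from an arbitrary volume \(\lambda\), shape matrix \(\textbf{A}\), and rotation \(\textbf{D}\), and recovers the pieces with eigen(); the numbers are illustrative only.
theta <- pi/6
D <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)  # orientation (orthogonal)
A <- diag(c(2, 0.5))                 # shape, proportional to the eigenvalues
lambda <- 1.5                        # volume
Sigma <- lambda * D %*% A %*% t(D)
eigen(Sigma)$values                  # equals lambda * diag(A): 3.00 0.75
eigen(Sigma)$vectors                 # columns span the same directions as D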
One could use other parameterizations of covariance matrices, for example, expressing covariances as functions of distance between values; the parameterization shown here is implemented in the Mclust function of the popular mclust package in R. mclust uses a three-letter code to identify a specific covariance model (Table 26.1).
mclust Code | \(\boldsymbol{\Sigma}_j\) | Volume | Shape | Orientation | Notes |
---|---|---|---|---|---|
EII | \(\lambda \textbf{I}\) | Equal | Identity | NA | Spherical |
VII | \(\lambda_j \textbf{I}\) | Variable | Identity | NA | Spherical |
EEI | \(\lambda \textbf{A}\) | Equal | Equal | Coord. axes | Diagonal |
VEI | \(\lambda_j \textbf{A}\) | Variable | Equal | Coord. axes | Diagonal |
EVI | \(\lambda \textbf{A}_j\) | Equal | Variable | Coord. axes | Diagonal |
VVI | \(\lambda_j \textbf{A}_j\) | Variable | Variable | Coord. axes | Diagonal |
EEE | \(\lambda\textbf{D}\textbf{A}\textbf{D}^\prime\) | Equal | Equal | Equal | Elliptical |
EEV | \(\lambda\textbf{D}_j \textbf{A}\textbf{D}_j^\prime\) | Equal | Equal | Variable | Elliptical |
VEV | \(\lambda_j\textbf{D}_j \textbf{A}\textbf{D}_j^\prime\) | Variable | Equal | Variable | Elliptical |
VVV | \(\lambda_j\textbf{D}_j \textbf{A}_j\textbf{D}_j^\prime\) | Variable | Variable | Variable | Elliptical |
The first six models in Table 26.1 have diagonal covariance matrices; the inputs are independent (for Gaussian random variables, lack of correlation implies independence). Models EEE, EEV, VEV, and VVV capture correlations among the \(X\)s and vary different aspects of \(\boldsymbol{\Sigma}_j\) across the clusters. The EEE model applies the same covariance matrix in all clusters, EEV varies orientation, VEV varies orientation and volume, and VVV varies all aspects across clusters.
The choice of covariance model should not be taken lightly; it has considerable impact on the clustering analysis. Assuming independent inputs is typically not reasonable. However, if model-based clustering is applied to the scores of a PCA, the inputs are uncorrelated, and for Gaussian components uncorrelated equals independent, so this assumption is met. Some of the models in Table 26.1 are nested and can be compared through a hypothesis test. But in addition to choosing the covariance model, we also need to choose \(k\), the number of clusters. One approach is to compare the model-\(k\) combinations by way of BIC, the Bayesian information criterion, introduced in Section 22.2.1.
BIC is based on the log-likelihood, which is accessible because we have a full distributional specification, and a penalty term that protects against overfitting. Before applying model-based clustering in a full analysis to select \(k\) and the covariance models, let’s examine how to interpret the results of model-based clustering and some of the covariance structures for the Iris data.
Mclust in R
Model-based clustering based on a finite mixture of Gaussian distributions is implemented in the Mclust function of the mclust package in R. Maximum likelihood estimates are derived by the EM (Expectation-Maximization) algorithm and models are compared via BIC. An iteration of the EM algorithm comprises two steps. The E-step computes, at the current parameter estimates, an \((n \times k)\) matrix \(\textbf{Z}\) of conditional probabilities that observation \(i=1,\cdots,n\) belongs to component \(j=1,\cdots,k\). The M-step updates the parameter estimates given the matrix \(\textbf{Z}\).
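As a rough sketch of the E-step (not mclust's internal implementation), the matrix \(\textbf{Z}\) can be computed from the current parameter estimates by weighting each component density with its mixing probability and normalizing row-wise; the multivariate Gaussian density is taken from the mvtnorm package here, and the parameter layout mirrors what Mclust returns.
library(mvtnorm)
# X: (n x p) data matrix; pro: mixing probabilities; mu: (p x k) matrix of means;
# sigma: (p x p x k) array of covariance matrices
e_step <- function(X, pro, mu, sigma) {
  k <- length(pro)
  w <- sapply(1:k, function(j) pro[j] * dmvnorm(X, mean=mu[, j], sigma=sigma[, , j]))
  w / rowSums(w)    # (n x k) matrix Z of conditional membership probabilities
}
After fitting a model, the output of such a function could be compared to the z component of the returned Mclust object.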
Example: Equal Variance Structure for Iris Data
The G= and modelNames= options in Mclust are used to specify the number of components in the mixture (the number of clusters) and the covariance model. The defaults are G=1:9 and all available covariance models, which depend on whether \(p=1\) or \(p>1\) and whether \(n > p\) or \(n \le p\).
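The list of model codes mclust knows about can be inspected, assuming the installed version exposes it through mclust.options():
mclust.options()$emModelNames   # covariance models considered by default (includes univariate codes)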
The following statements fit a 3-component mixture with the EII covariance model to the numerical variables in the Iris data. The EII model is the most restrictive multivariate model, assuming diagonal covariance matrices and equal variances across variables and clusters (see Table 26.1). This model is probably not appropriate for the Iris data, but it is simple enough to make the results easy to interpret.
library(mclust)
mb_eii <- Mclust(iris[,-5],G=3,verbose=FALSE, modelNames="EII")
mb_eii$parameters$pro
[1] 0.3333956 0.4135728 0.2530316
The three Gaussian distributions mix with probabilities \(\widehat{\pi}_1 =\) 0.3334, \(\widehat{\pi}_2 =\) 0.4136, and \(\widehat{\pi}_3 =\) 0.253.
mb_eii$parameters$mean
[,1] [,2] [,3]
Sepal.Length 5.0060172 5.903268 6.848623
Sepal.Width 3.4278263 2.748010 3.074750
Petal.Length 1.4622880 4.400526 5.732651
Petal.Width 0.2461594 1.432036 2.074895
The first component has a mean of \(\widehat{\boldsymbol{\mu}}_1\) = [5.006, 3.4278, 1.4623, 0.2462]. The following output lists the covariance matrices in the three clusters. The model forces them to be diagonal, with the same variance for all variables and across the clusters, \(\widehat{\lambda}\) = 0.1331.
mb_eii$parameters$variance
$modelName
[1] "EII"
$d
[1] 4
$G
[1] 3
$sigma
, , 1
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.1330939 0.0000000 0.0000000 0.0000000
Sepal.Width 0.0000000 0.1330939 0.0000000 0.0000000
Petal.Length 0.0000000 0.0000000 0.1330939 0.0000000
Petal.Width 0.0000000 0.0000000 0.0000000 0.1330939
, , 2
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.1330939 0.0000000 0.0000000 0.0000000
Sepal.Width 0.0000000 0.1330939 0.0000000 0.0000000
Petal.Length 0.0000000 0.0000000 0.1330939 0.0000000
Petal.Width 0.0000000 0.0000000 0.0000000 0.1330939
, , 3
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.1330939 0.0000000 0.0000000 0.0000000
Sepal.Width 0.0000000 0.1330939 0.0000000 0.0000000
Petal.Length 0.0000000 0.0000000 0.1330939 0.0000000
Petal.Width 0.0000000 0.0000000 0.0000000 0.1330939
$Sigma
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.1330939 0.0000000 0.0000000 0.0000000
Sepal.Width 0.0000000 0.1330939 0.0000000 0.0000000
Petal.Length 0.0000000 0.0000000 0.1330939 0.0000000
Petal.Width 0.0000000 0.0000000 0.0000000 0.1330939
$sigmasq
[1] 0.1330939
$scale
[1] 0.1330939
Based on the underlying \(\textbf{Z}\) matrix of the EM algorithm at convergence, each observation can be classified into the component with the highest probability. For observation #53, for example, \(\textbf{z}_{53}\) = [0, 0.3068, 0.6932]. The probability is highest that the observation belongs to the third component, hence its cluster membership is classified as 3.
round(mb_eii$z[53,],4)
[1] 0.0000 0.3068 0.6932
mb_eii$classification[53]
[1] 3
One of the big advantages of model-based clustering is the ability to quantify the uncertainty associated with a cluster assignment. How confident are we in assigning observation #53 to cluster 3? We are certain that it does not belong to cluster 1, but there is a non-zero probability that the observation belongs to cluster 2. You can retrieve the uncertainty associated with the classification of each observation in the uncertainty vector. For observations 50 through 55, for example:
round(mb_eii$uncertainty[50:55],4)
[1] 0.0000 0.3289 0.0015 0.3068 0.0000 0.0031
There is no uncertainty associated with classifying #50. An examination of \(\textbf{z}_{50}\) shows that \(z_{50,1} = 1.0\). There is uncertainty associated with the next three observations, and it is higher for #51 and #53 than for #52.
The uncertainty is simply the complement of the highest component probability. For #53, this is 1-0.6932 = 0.3068. The following code validates this for all observations.
c <- apply(mb_eii$z,1,which.max)
probs <- rep(0,length(c))
for (i in seq_along(c)) {probs[i] <- mb_eii$z[i,c[i]]}
sum(mb_eii$uncertainty - (1-probs))
[1] 0
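A vectorized equivalent gives the same check in one line:
sum(mb_eii$uncertainty - (1 - apply(mb_eii$z, 1, max)))   # should also be zero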
Finally, you can plot the classification, uncertainty, and density contours for the analysis by choosing the what= parameter of the plot method.
plot(mb_eii, what="classification")
plot(mb_eii, what="uncertainty",dimens=c(1,2))
plot(mb_eii, what="density")
The classification and uncertainty plots are overlaid with the densities of the components. The parallel alignment of the densities with the coordinate axes reflects the absence of correlations among the variables. The volume of the densities does not capture the variability of most attributes; many points fall outside of the density contours. The equal variance assumption across variables and clusters is not justified.
Example: Other Covariance Structures for Iris Data
Now let’s look at how the choice of covariance structure affects the results of the cluster analysis. The following statements fit 3-component models with four additional covariance structures:
<- Mclust(iris[,-5],G=3,verbose=FALSE, modelNames="VII")
mb_vii <- Mclust(iris[,-5],G=3,verbose=FALSE, modelNames="VEI")
mb_vei <- Mclust(iris[,-5],G=3,verbose=FALSE, modelNames="EEE")
mb_eee <- Mclust(iris[,-5],G=3,verbose=FALSE, modelNames="VEV")
mb_vev
mb_vii$parameters$variance$sigma
, , 1
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.075755 0.000000 0.000000 0.000000
Sepal.Width 0.000000 0.075755 0.000000 0.000000
Petal.Length 0.000000 0.000000 0.075755 0.000000
Petal.Width 0.000000 0.000000 0.000000 0.075755
, , 2
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.1629047 0.0000000 0.0000000 0.0000000
Sepal.Width 0.0000000 0.1629047 0.0000000 0.0000000
Petal.Length 0.0000000 0.0000000 0.1629047 0.0000000
Petal.Width 0.0000000 0.0000000 0.0000000 0.1629047
, , 3
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.1635866 0.0000000 0.0000000 0.0000000
Sepal.Width 0.0000000 0.1635866 0.0000000 0.0000000
Petal.Length 0.0000000 0.0000000 0.1635866 0.0000000
Petal.Width 0.0000000 0.0000000 0.0000000 0.1635866
mb_vii$bic
[1] -853.8144
For the VII model, the covariance matrices are still diagonal (the inputs are uncorrelated) and the variances of the inputs are the same within a cluster. Compared to the EII model of the previous run, the variances now differ across the clusters.
Note that mclust computes the BIC statistic in a “larger-is-better” form; see the note in Section 22.2.1 on the two versions of the BIC criterion.
mb_vei$parameters$variance$sigma
, , 1
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.1190231 0.00000000 0.0000000 0.0000000
Sepal.Width 0.0000000 0.07153324 0.0000000 0.0000000
Petal.Length 0.0000000 0.00000000 0.0803479 0.0000000
Petal.Width 0.0000000 0.00000000 0.0000000 0.0169908
, , 2
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.2655657 0.0000000 0.0000000 0.00000000
Sepal.Width 0.0000000 0.1596058 0.0000000 0.00000000
Petal.Length 0.0000000 0.0000000 0.1792731 0.00000000
Petal.Width 0.0000000 0.0000000 0.0000000 0.03791006
, , 3
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.3380558 0.0000000 0.0000000 0.00000000
Sepal.Width 0.0000000 0.2031725 0.0000000 0.00000000
Petal.Length 0.0000000 0.0000000 0.2282084 0.00000000
Petal.Width 0.0000000 0.0000000 0.0000000 0.04825817
mb_vei$bic
[1] -779.1566
The VEI model introduces different variances for the attributes and has a larger (better) BIC than the VII model.
mb_eee$parameters$variance$sigma
, , 1
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.26392812 0.08980761 0.16970269 0.03931039
Sepal.Width 0.08980761 0.11191917 0.05105566 0.02992418
Petal.Length 0.16970269 0.05105566 0.18674400 0.04198254
Petal.Width 0.03931039 0.02992418 0.04198254 0.03965846
, , 2
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.26392812 0.08980761 0.16970269 0.03931039
Sepal.Width 0.08980761 0.11191917 0.05105566 0.02992418
Petal.Length 0.16970269 0.05105566 0.18674400 0.04198254
Petal.Width 0.03931039 0.02992418 0.04198254 0.03965846
, , 3
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.26392812 0.08980761 0.16970269 0.03931039
Sepal.Width 0.08980761 0.11191917 0.05105566 0.02992418
Petal.Length 0.16970269 0.05105566 0.18674400 0.04198254
Petal.Width 0.03931039 0.02992418 0.04198254 0.03965846
mb_eee$bic
[1] -632.9647
The EEE model introduces covariances between the inputs but keeps the covariance matrices constant across clusters. Its BIC value of -632.965 is a further improvement over that of the VEI model.
mb_vev$parameters$variance$sigma
, , 1
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.13320850 0.10938369 0.019191764 0.011585649
Sepal.Width 0.10938369 0.15495369 0.012096999 0.010010130
Petal.Length 0.01919176 0.01209700 0.028275400 0.005818274
Petal.Width 0.01158565 0.01001013 0.005818274 0.010695632
, , 2
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.22572159 0.07613348 0.14689934 0.04335826
Sepal.Width 0.07613348 0.08024338 0.07372331 0.03435893
Petal.Length 0.14689934 0.07372331 0.16613979 0.04953078
Petal.Width 0.04335826 0.03435893 0.04953078 0.03338619
, , 3
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.42943106 0.10784274 0.33452389 0.06538369
Sepal.Width 0.10784274 0.11596343 0.08905176 0.06134034
Petal.Length 0.33452389 0.08905176 0.36422115 0.08706895
Petal.Width 0.06538369 0.06134034 0.08706895 0.08663823
mb_vev$bic
[1] -562.5522
The VEV model allows for correlated attributes and varies volume and orientation across clusters, keeping the shape of the densities the same. Its BIC value of -562.552 indicates a further improvement over the previous models.
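For a side-by-side view, the BIC values of the five models fitted in this and the previous example can be collected into a named vector (a small convenience, not part of the original analysis):
round(c(EII=mb_eii$bic, VII=mb_vii$bic, VEI=mb_vei$bic,
        EEE=mb_eee$bic, VEV=mb_vev$bic), 3)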
The choice of covariance structures and the number of mixture components can be combined into a single analysis.
Example: Selecting Structure and Number of Components
To obtain a full analysis of relevant covariance structures for \(1 \le k \le 9\), simply do not specify the G= or modelNames= arguments. Alternatively, you can limit the number of combinations considered by specifying vectors for these options.
mbc <- Mclust(iris[,-5],verbose=FALSE)
summary(mbc)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEV (ellipsoidal, equal shape) model with 2 components:
log-likelihood n df BIC ICL
-215.726 150 26 -561.7285 -561.7289
Clustering table:
1 2
50 100
Based on the \(9 \times 14 = 126\) model-\(k\) combinations considered (14 covariance structures for \(k=1,\cdots,9\); see the BIC matrix below), Mclust chose the VEV structure with \(k=2\) mixture components. Its BIC value is -561.7285. You can see the BIC values of all combinations and a list of the top three models in the BIC object:
mbc$BIC
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE
1 -1804.0854 -1804.0854 -1522.1202 -1522.1202 -1522.1202 -1522.1202 -829.9782
2 -1123.4117 -1012.2352 -1042.9679 -956.2823 -1007.3082 -857.5515 -688.0972
3 -878.7650 -853.8144 -813.0504 -779.1566 -797.8342 -744.6382 -632.9647
4 -893.6140 -812.6048 -827.4036 -748.4529 -837.5452 -751.0198 -646.0258
5 -782.6441 -742.6083 -741.9185 -688.3463 -766.8158 -711.4502 -604.8131
6 -715.7136 -705.7811 -693.7908 -676.1697 -774.0673 -707.2901 -609.8543
7 -731.8821 -698.5413 -713.1823 -680.7377 -813.5220 -766.6500 -632.4947
8 -725.0805 -701.4806 -691.4133 -679.4640 -740.4068 -764.1969 -639.2640
9 -694.5205 -700.0276 -696.2607 -702.0143 -767.8044 -755.8290 -653.0878
VEE EVE VVE EEV VEV EVV VVV
1 -829.9782 -829.9782 -829.9782 -829.9782 -829.9782 -829.9782 -829.9782
2 -656.3270 -657.2263 -605.1841 -644.5997 -561.7285 -658.3306 -574.0178
3 -605.3982 -666.5491 -636.4259 -644.7810 -562.5522 -656.0359 -580.8396
4 -604.8371 -705.5435 -639.7078 -699.8684 -602.0104 -725.2925 -630.6000
5 NA -723.7199 -632.2056 -652.2959 -634.2890 NA -676.6061
6 -609.5584 -661.9497 -664.8224 -664.4537 -679.5116 NA -754.7938
7 NA -699.5102 -690.6108 -709.9530 -704.7699 -809.8276 -806.9277
8 -654.8237 -700.4277 -709.9392 -735.4463 -712.8788 -831.7520 -830.6373
9 NA -729.6651 -734.2997 -758.9348 -748.8237 -882.4391 -883.6931
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
Model-\(k\) combinations shown as NA failed to converge based on the default settings of the EM algorithm; these can be tweaked with the control= option of Mclust. The BIC of the winning combination is
max(na.omit(apply(mbc$BIC,1,max)))
[1] -561.7285
which.max(na.omit(apply(mbc$BIC,1,max)))
2
2
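To recover the covariance model label along with \(k\), the BIC matrix can be indexed directly (a sketch that treats mbc$BIC as a plain matrix with \(k\) as row names and the model codes as column names, which is how it prints):
best <- which(mbc$BIC == max(mbc$BIC, na.rm=TRUE), arr.ind=TRUE)
rownames(mbc$BIC)[best[1, "row"]]   # number of components, "2"
colnames(mbc$BIC)[best[1, "col"]]   # covariance model, "VEV"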
The plot of the BIC values shows two distinct groups of models (Figure 26.3). The models with diagonal covariance structure generally have lower (worse) BIC values if \(k < 6\). The difference between \(k=2\) and \(k=3\) is generally small for the models with correlations. [VEV, \(k=3\)] is a close competitor to the overall “best” model, [VEV, \(k=2\)]. Since we know that there are three species in the data set, one might be tempted to go with \(k=3\). In fact, the \(k=2\) model seems to separate I. setosa in one cluster from the other two species (Figure 26.4).
plot(mbc, what="BIC")
plot(mbc, what="classification",cex=0.75)