# Image Segmentation Based On Gaussian Mixture Models Psychology Essay

Clustering-based image segmentation methods rely on arranging data into groups having common characteristics [19]. During the last decade, the main research directions in the relevant literature are focused on mixture model [16, 14], graph theoretic approaches [17, 8, 21], methods based on the mean shift algorithm [6, 4] and rate distortion theory techniques [20]. Modeling the probability density function (pdf) of pixel attributes (e.g. intensity, texture) with finite mixture models (FMM) [1] is also a natural way to cluster data because it automatically provides a grouping based on the components of the mixture that generated them. Furthermore, the likelihood of a FMM is a rigorous metric for clustering performance [1]. The parameters of the FMM model with Gaussian components can be estimated very efficiently through maximum likelihood (ML) estimation using the Expectation-Maximization (EM) algorithm [7]. Furthermore, it can be shown that Gaussian components allow efficient representation of a large variety of pdf. Thus, Gaussian mixture models (GMM), are commonly employed in image segmentation tasks.

A drawback of the standard ML approach for GMM image segmentation is that commonality of location is not taken into account when grouping the data. In other words, the prior knowledge that adjacent pixels most likely belong to the same cluster is not used. To overcome this shortcoming, spatial smoothness constraints have been imposed. A common approach is the use of an MRF. Many MRF variants have been proposed; see for example [11]. However, determination of the amount of the imposed smoothness automatically requires knowledge of the normalization constant of the MRF. Since this is not known analytically, learning strategies were proposed [22, 9]. Research efforts in imposing spatial smoothness for image segmentation can be grouped into two categories. In the methods of the first category, spatial smoothness is imposed on the discrete hidden variables of the FMM that represent class labels [12, 21].

These approaches may be categorized in a more general area involving simultaneous image recovery and segmentation which is better known as image modeling [15]. More specifically, spatial regularization is achieved by imposing a discrete MRF on the classification labels of neighboring pixels that penalizes solutions where neighboring pixels belong to different classes. Inference in this category of models is non trivial and generally performed through EM-like alternating optimization, Markov Chain Monte Carlo (MCMC) or inexact variational inference.

In the second category of methods, the MRF-based smoothness constraint is not imposed on the labels but on the contextual mixing proportions, that is on the probabilities of the pixel labels. This model is called spatially variant finite mixture model (SVFMM) [16] and avoids the inference problems of DMRFs. For instance, in [14], a new family of smoothness priors was proposed for the contextual mixing proportions based on the Gauss-Markov random fields that take into account cluster statistics, thus enforcing different smoothness strength for each cluster. The model was also refined to capture information in different spatial directions.

In these models, maximum a posteriori (MAP) estimation of the contextual mixing proportions via the MAP-EM algorithm is possible. However, the main disadvantage is that smoothness is imposed in the neighbourhood of each pixel without taking into account that the respective pixel may be an edge pixel or its neighbourhood consists of edge pixels.

In this paper, we present a new hierarchical Bayesian model for mixture model-based image segmentation with spatial constraints. This model assumes that the local differences of the contextual mixing proportions follow a Student’s t-distribution. The generative model of Student’s t-distribution contains two levels. The lower level is a Gaussian pdf with precision (inverse variance) varying with each pixel and the higher level a Gamma pdf. The varying precisions with each pixel of the Gaussians of this model capture the local image variations and thus allow the smoothness constraints to incorporate the image edge structure. A MAP-EM algorithm was used for Bayesian inference with this model. An important feature of this algorithm is that all the necessary parameters are estimated from the data.

Thus, the proposed segmentation algorithm is automatic in the sense that it does not require empirical selection of parameters like other state-of-the-art methods (ncuts, mean-shift). The model was extensively evaluated on the 300 images of the Berkeley image data base and was compared with other GMM based methods that do not require parameter selection. More specifically, it compared favorably to standard GMM and to GMM with “standard” spatial smoothness constraints [14].

## 2. Edge Preserving Spatially Varying GMM

The K-kernel spatially varying GMM [16, 14] differs from the standard GMM [1] in the definition of the mixing proportions. More precisely, in the SVGMM, each pixel xn, n = 1,…, N has a distinct vector of mixing proportions denoted by πnj , j = 1, …, K, with K being the number of Gaussian kernels. We call these parameters contextual mixing proportions to distinguish them from the mixing proportions of a standard GMM. Hence, the probability of a distinct pixel is expressed by:

where 0 ≤ πnj ≤ 1,∑K j =1 πnj = 1 for j = 1,2, …, K and n = 1,2, …, N, μj are the Gaussian kernel means and ∑j are the Gaussian kernel covariance matrices.

Generally, in image processing and computer vision, we assume that, conditioned on a hidden variable Z, pixels X = {x1, x2, …., xN }are independent and Gaussian distributed:

where the set of N x K latent variables Z = {znj}n=1…N,k=1…K is introduced to make inference tractable for the model. The Z variables are distributed multinomially:

where each zn is a binary vector, with znj = 1 if datum n is generated by the jth kernel and znj = 0 otherwise.

Considering the set of contextual mixing proportions Π as random variables and assuming a proper prior, we can incorporate the intuitive fact that neighbouring pixels are more likely to share the same class label. We assume a Markov random field on Π, which equivalently means that Π is governed by a Gibbs distribution [11] generally expressed by:

where ψc is a function on clique c, called clique potential function in the literature, and the product is over all minimal cliques of the Markov random field. In the herein proposed model, we consider clique potential functions imposing local differences of contextual mixing proportions to follow a univariate Student’s t distribution.

A d-dimensional random variable X follows a multivariate t-distribution, X ~ St( μ, ∑, ν ), with mean μ, positive definite, symmetric and real d x d covariance matrix ∑ and has ν ε (0,∞) degrees of freedom when [1], given the weight u, the variable X has the multivariate normal distribution with mean μ and covariance ∑/μ;

and the weight u follows a Gamma distribution parameterized by ν :

Integrating out the weights from the joint density leads to the density function of the marginal distribution:

where δ(x,μ;∑) = (x-μ)T ∑-1 (x-μ) is the Mahalanobis squared distance and Г is the Gamma function [1]. It can be shown that for ν→∞ the Student’s t-distribution tends to a Gaussian distribution with covariance ∑. Also, if ν > 1, μ is the mean of X and if º > 2, º(º ¡2)¡1§ is the covariance matrix of X. Therefore, the family of t-distributions provides a heavy-tailed alternative to the normal family with mean μ and covariance matrix that is equal to a scalar multiple of ∑, if ν > 2 (fig. 1)[1].

Figure 1. The Student’s t-distribution for various degrees of freedom. As ν→∞ the distribution tends to a Gaussian. For small values of ν the distribution has heavier tails than a Gaussian.

Therefore, the clique potential functions are properly defined in order to impose:

As it an be observed in eq. (7), we introduce KxD different t-distributions, amounting to an equal number of parameter sets, {βjd ,νjd}j=1..K,d=1..D. In eq. (7), D stands for the number of a pixel’s neighbourhood adjacency types, and γd(n) is the set of neighbours of pixel indexed n, with respect to the dth adjacency type. In our model, we assume 4 neighbours for each pixel, and partition the corresponding adjacency types into horizontal and vertical, thus, setting D = 2. This variability of parameter sets aims to capture the fact that smoothness statistics may vary along clusters and spatial directions [14]. Therefore, the joint distribution on Π is given by:

Figure 2. Graphical model for the edge preserving spatially variant Gaussian mixture model. Superscript n ε [1,N] denotes pixel index, subscript j ε [1,K] denotes kernel (segment) index and d ε [1,D] describes the neighbourhood direction type.

Following the definition of the t-distribution in eq. (4) and eq. (5) we introduce the latent variables U = {unj}n=1..N,j=1..K,d=1..D and the distribution of the differences of local contextual mixing proportions becomes:

This generative model (fig. 2), apart from being tractable using the EM algorithm, as it will be demonstrated in the next section, allows better insight on our assumption of Student-t cliques. Since unkj depends on datum indexed by n, each weight difference in the MRF can be described by a different instance of a Gaussian distribution. Therefore, as unkj → +∞ the distribution tightens around zero, and forces neighboring contextual mixing proportions to be smooth. On the other hand, when unk unkj → 0the distribution tends to be uninformative, and forces no smoothness. This is a desirable property when there exists an edge between neighboring pixels. Thus, the U-variable maps provide a very detailed description of the image edge structure. Furthermore, they may be considered as a continuous generalization of the binary line-process variable idea in [2, 11].

## 3. Bayesian inference using MAP-EM

As shown in the graphical model in figure 2, the unknowns ψ={μ,∑,β,ν}are considered as parameters and will be estimated in the M-step of the EM algorithm that follows. The Z,U are hidden random variables and will be inferred in the E-step of the same algorithm. The unknown quantities Π, although being random variables, they are treated as parameters and are estimated in the M-step. This is the reason we refer to this algorithm as MAP-EM [1]. To perform model inference, the evidence ln p(X;Π;ψ) with respect to the model parameters ψ={μ,∑,β,ν}and the contextual mixing proportions Π has to be optimized. In EM terminology [7] this is the incomplete data log-likelihood while the complete log-likelihood is expressed by ln p(X, Π,Z,U; ψ). The conditional expectation of the complete likelihood is an important quantity in EM - it is defined as

The E-step consists in computing the joint expectation of the hidden variables Z and U, with respect to current iteration parameters ψ(t) where t denotes the number of current iteration. Observing the graphical model in fig. 2, we can see that given X and c, Z and U are conditionally independent; therefore εz,u׀X,Π(.) = εz׀x,Π{(εu׀x,Π(.)} and we can compute these expectations separately. So we have the updates Λn,j,d, Λk ε γd(n):

where ψ(.) stands for the digamma function, and parameters ζ, n being:

Maximization of the current complete likelihood (10) must be driven with respect to the model parameters ψ and Π. With some manipulation, we can rewrite it equivalently as

In this form, parameter optimization is straightforward. The resulting update equations make up the M-step:

Moreover, the contextual mixing proportions πnj are computed as the roots of a quadratic equation:

with coefficients:

## 4. Experimental results

In our model, parameters U play a very important role in the preservation of the boundaries between image regions. The U-variable maps for the jth kernel can be considered as the edges that separate the jth segment of the image from the remaining segments. To demonstrate this point we show an example in figure 3. In this example, an image is segmented into K = 3 segments thus 6 U-variable maps are shown.

The first row of this figure shows the original and the segmented images. Then, moving from top to bottom, the U-variable maps for the three image segments, namely sky, roof and shadows, building are shown, respectively. The left column highlights vertical edges and the right column underpins horizontal edges. Notice that in the second row of figure 3, where the U-variable maps for segment sky are shown, the edges between the segment sky and the rest (roof and shadows, building) are mainly highlighted. The edges between the other segments, (roof and shadows and building) are mainly highlighted in the remaining two maps. Similarly, the edges between the segments sky and building are not highlighted in the third row of images as the U- variable maps for roof and shadows are underpinned.

Figure 3. U-variable maps: The first row shows the original image and the segmentation for K = 3 clusters; the rows below show U- variable maps (expected values of unkj variables). Brighter values represent lower values of u. In each row, the U-variable maps for kernel indexed by j = 1 (sky), j = 2 (roof and shadows) and j = 3 (building), are shown respectively. The left column corresponds to u values computed for horizontal adjacencies, and the right column for vertical adjacencies.

In our implementation, we have used a 4-dimensional feature vector to describe the image data. It is comprised by the Lab color space features and the Blobworld contrast texture descriptor as described in [5]. Prior to segmentation, each variate has been separately normalized in order not to have dominating features. We have evaluated the proposed Student’s t-based SVGMM (St-SVGMM) segmentation scheme on the 300 images of the Berkeley image database [13]. We have applied our algorithm for different values of the number of segments (K ={3,5,7,10,15,20}). For comparison purposes, we have also experimented with the standard GMM [1] and the GMM based segmentation with ”standard” smoothness constraints [14] with the same number of components.

The obtained segmentations were quantitatively evaluated with two performance measures: the Rand index (RI) [18] and the boundary displacement error (BDE) [10]. The RI measures the consistency between the ground truth and the computed segmentation map while the BDE measures error in terms of boundary displacement with respect to the ground truth. The statistics for these measures are presented in tables 1 and 2. Based on the theoretical properties of the Student’s t-model one might have expected that the St-SVGMM introduced erroneous boundaries that did not agree with human segmentation. Therefore it would provide a worse RI as compared to the ”classical” non preserving algorithm (SVGMM) [14]. However, as observed in the statistics of the RI (table 1), the St-SVGMM outperforms the standard GMM in all cases and the SVGMM in the overwhelming majority of the different number of components.

Also, in terms of correct region boundary estimation, expressed by the BDE (table 2), the St-SVGMM outperforms the SVGMM, as it is theoretically expected. However, it also outperforms standard GMM and the difference in performance increases with the number of segments. The explanation for this behavior is that the standard GMM since it does not integrate a smoothing step it generally computes correctly the boundaries between segments (it also outperforms the SVGMM in the same median values). However, as the number of segments increases, the complexity of the image cannot be captured by a simple GMM and smoothness constraints that model the image edge structure become increasingly beneficial.

Overall, the St-SVGMM not only preserves region boundaries but also improves the correct classification rates with respect to the standard methods. Some representative segmentation examples are shown in figure 4.

Figure 4. Segmentation examples using the proposed edge preserving spatially variant mixture. From left to right, the columns show: the original image, segmentation with K = 5, K = 10 and K = 15 kernels.

## Table 1. Statistics on the Rand Index (RI) over the 300 images of the Berkeley image data base for the compared methods. Higher values represent better segmentations.

## Table 2. Statistics on boundary displacement error (BDE) over the 300 images of the Berkeley image data base for the compared methods. Lower values represent better segmentations.

## 5. Conclusion

In this paper a segmentation algorithm based on clustering with GMM is proposed. The main novelty of this work is a smoothness prior which apart from constraining adjacent pixel to belong in the same cluster captures the image edge structure. Thus, it does not enforce smoothness across segment boundaries. Another important feature of the herein proposed segmentation algorithm is that all required parameters are estimated from the data. Thus, this algorithm is automatic and does not require empirical parameter selection like many recent state-of-the-art segmentation algorithms. An important perspective of this study is to automatically estimate the number of components K. To this end, criteria appropriate to constrained mixtures could be conceived.

### Request Removal

If you are the original writer of this essay and no longer wish to have the essay published on the UK Essays website then please click on the link below to request removal:

Request the removal of this essay