Methods For Classification Of Water Chemistry Data Biology Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

A robust classification scheme for partitioning water chemistry samples into homogeneous groups is an important tool for the characterization of hydrologic systems. In this paper we test the performance of the many available graphical and statistical methodologies used to classify water samples

including: Collins bar diagram, Pie diagram, Stiff pattern diagram, Schoeller plot, Piper diagram, Q-mode hierarchical cluster analysis, K-means clustering, Principal components analysis, and Fuzzy k-means clustering. All the methods are discussed and compared as to their ability to cluster, ease of use, and ease of interpretation. In addition, several issues related to data preparation, database editing, data-gap filling, data screening, and data quality assurance are discussed and a database construction methodology is presented.

The use of graphical techniques proved to have limitations compared to the multivariate methods for large data sets. Principal components analysis is useful for data reduction and to assess the continuity/overlap of clusters or clustering/similarities in the data. The most efficient grouping was achieved by statistical clustering techniques. However, these techniques do not provide information on the chemistry of the statistical groups. The combination of graphical and statistical techniques provides a consistent and objective means to classify large numbers of samples while retaining the ease of classic graphical presentations.

204 Words Abstract

Keywords: Water chemistry, Database construction, Classification techniques, Cluster analysis, Fuzzy k-means clustering, California.


The chemical composition of surface and groundwater is controlled by many factors that include composition of precipitation, mineralogy of the watershed and aquifers, climate, and topography. These factors combine to create diverse water types that change spatially and temporally. In our study area, which lies within the south Lahontan hydrologic region of southeastern California (Figure 1), there is a wide variety of climatic conditions (high alpine to desert), hydrologic regimes (alluvial basin-fill aquifers, fractured rock aquifers, and playas) and geologic environments (igneous rocks, volcanic rocks, metamorphic rocks, sedimentary deposits, evaporites, and mineralized zones). Thus, the samples from the area could potentially represent a variety of water types providing an opportunity to test the performance of many of the available graphical and statistical methodologies used to classify water samples.

The use of major ions as natural tracers (Back 1966) has become a very common method to delineate flowpaths in aquifers. Generally, the approach is to divide the samples into hydrochemical facies (a.k.a. water types), that is groups of samples with similar chemical characteristics that can then be correlated with location. The spatial variability observed in the composition of these natural tracers can provide insight into aquifer heterogeneity and connectivity, as well as the physical and chemical processes controlling water chemistry. Thus, a robust classification scheme for partitioning water chemistry samples into homogeneous groups can be an important tool for the successful characterization of hydrogeologic systems. A variety of graphical and multivariate statistical techniques have been devised since the early 1920's in order to facilitate the classification of waters, with the ultimate goal of dividing a group of samples into similar homogeneous groups (each representing a hydrochemical facies). Several commonly used graphical methods and multivariate statistical techniques are available including: Collins bar diagram, Pie diagram, Stiff pattern diagram, Schoeller semi-logarithmic diagram, Piper diagram, Q-mode hierarchical cluster analysis (HCA), K-means clustering (KMC), Principal components analysis (PCA), and Fuzzy k-means clustering (FKM). This paper utilizes a relatively large data set to review these techniques and compare their ease of use and ability to sort water chemistry samples into groups.

Hydrogeologic Setting

The study area is part of the Basin and Range Province of the southwestern USA and extends from 35 to 37 degrees of latitude north and from 117 to 118.5 degrees of longitude west (Figure 1). The area comprises a portion of the Sierra Nevada mountain range, which is the recharge area, and adjoining alluvial basins, which are arid. Due to modern arid climate, surface water is scarce in the area and groundwater is the only source of drinking and household use water (Berenbrock and Schroeder 1994). Thus, effective management of the groundwater resources requires an accurate model for the aquifer characteristics, groundwater flow directions, recharge mechanisms, discharge mechanisms, and water chemistry processes.

In the Basin and Range groundwater system, water flows from recharge areas in the mountains to discharge areas in adjacent valleys (Maxey 1968). This local flow system is often modified by local geologic, physiographic, and climatic factors. During the Pleistocene and Holocene epochs, the valley floors in the study area were periodically occupied by a chain of lakes stretching from Mono Lake in the north to Lake Manley in Death Valley (Duffield and Smith 1978; Lipinski and Knochenmus 1981). Present-day valley floors are occupied by playas, known in different localities as "salt lakes," "soda lakes," "alkali marshes," "dry lakes" or "borax lakes", where the majority of groundwater discharges by evapotranspiration (Lee 1912; Fenneman 1931; Dutcher and Moyle 1973). Minor discharge also occurs by other ways including discharge from springs, seeps, and pumping from wells. The groundwater in the area occurs in two porosity regimes: (1) intergranular porosity found mostly in alluvial basin-fill aquifers and (2) fracture porosity found in the mountain watersheds. The alluvial basin-fill aquifers can be further divided into two components: a shallow saline aquifer (<150 m depth) and a deep (610 m), locally confined aquifer that extends throughout the area (Dutcher and Moyle 1973).


The available major solute data (spring, surface, and well water) for the area was compiled for this study in order to create a comprehensive database, called SLHDATA (South Lahontan Hydrochemical Database), for the classification of waters into hydrochemical facies representing "water types". The data were arranged in rows (for sampling locations) and in columns (for chemical parameters). The entire database consists of chemical analyses of 152 spring samples, 153 surface samples, and 1063 well (groundwater) samples, including temporal samples (samples collected over a period of time at the same location). Sources of the data are presented in Table 1. In the case of multiple samples from the same location, the more recent and/or the more complete sample data were included in the statistical analysis unless evaluation of temporal effects was desired. Database construction procedures and comparison of the results from the various statistical and graphical techniques is discussed in detail in the following sections. Detailed analysis of the graphical and statistical water groups in terms of the physical and chemical factors that control water chemistry is not the focus of this paper. Instead, we are interested here in the ability of available techniques to classify a diverse set of samples into distinct groups.

Of the 39 hydrochemical variables (consisting of major ions, minor ions, trace elements and isotope data) in the compiled database, 11 variables (Specific Conductance, pH, Ca, Mg, Na, K, Cl, SO4, HCO3, SiO2, and F) occur most often and were thus used in our evaluation. It is usually assumed that adequate quality assurance (QA) and quality control (QC) measures were performed at the time of original data collection and analysis, however, we screened the data to verify that they were usable (see below for further discussion). The data collection methods, which are similar are described in detail for most of the data sources, or documented in U.S. Geological Survey's "Techniques of Water-Resources Investigations" manuals (e.g. Brown et al. 1970; Wood 1981).

The StatisticaÒ Release 5.0 (StatSoft, Inc. 1995) commercial software package was utilized for the basic statistical analyses performed. MicrosoftÒ Excel 97 (Microsoft Corporation 1985) and RockWorksÒ (RockWare, Inc. 1999) were used for the graphical analyses. Classification of the data was also performed using FuzME® (Fuzzy k-Means with Extragrades) (Minasny and McBratney 1999). The techniques used include cluster analysis (HCA and KMC), Principal components analysis (PCA), Fuzzy k-means clustering (FKM), and a variety of graphical methods. Detailed technical descriptions of HCA, KMC and PCA techniques and description of FKM technique are provided in StatSoft, Inc. (1997) and Bezdek et al. (1984), respectively.

Database Editing

Figure 2 is a flow chart that summarizes the methodology used for compiling the hydrochemical database. If reported, field measurements of alkalinity and pH were used for the construction of the database. Otherwise, laboratory measurements of these variables were used. Some of the individual data sets contained the same sample, or apparent near-duplicate analyses for minor elements.

In general, there were more discrepancies for iron than for any other minor element. This was probably due to the different ways in which iron concentrations were expressed or the convention being used was not clearly stated. We have chosen not to use any minor element data for our study due to these sort of problems.

Samples with uncertain locations were located using reports and maps or eliminated from the database when such information was not available. Locations of sites that had only the name of the well or spring were determined as accurately as possible, usually to several hundred meters and always within one kilometer. Well-water samples without sampling depths were retained in the database, but eliminated from the multivariate statistical analysis.

Units of measurement were sometimes inconsistent between different data sets. All values were converted to an internally consistent format (all units are in mg L-1 in the SLHDATA database). Common reporting units for data sets were weight-per-volume units (milligrams per liter (mg L-1) and micrograms per liter (mg L-1)), equivalent-weight units (milliequivalents per liter (meq L-1) and microequivalents per liter (meq L-1)), or weight-per-weight units (parts per million (ppm), parts per billion (ppb), and parts per trillion (ppt)). Conversion factors for calculation of a unit from the other units are given by Hem (1989).

Censored Values

Water chemistry data are frequently censored, that is, concentrations of some elements are reported as non-detected, less-than or greater-than. These values are created by the lower or upper detection limit of the instrument or method used. Censored data are not appropriate for many multivariate statistical techniques. Therefore, the non-detected, less-than and greater-than values must be replaced with unqualified values (Farnham et al. 2002). In our database there were no censored values for the 11 variables used in this study, however, since this is often not the case we briefly discuss methods to deal with this situation.

A number of techniques have been suggested for replacement of a censored value including replacement of the less-than values by 3/4 times the lower detection limit and the greater-than values by 4/3 times upper detection limit (VanTrump and Miesch 1977). An alternative is replacement of less-than values by 0.55 times the lower detection limit and the greater-than values by 1.7 times upper detection limit (Sanford et al. 1993). For data where the proportion of the censored values is >10%, another method that was devised by Sanford et al. (1993) can be used. This method estimates the mean of the normal distribution using a maximum likelihood estimation method. Then, this estimated mean is used to derive an estimated replacement value.

Data-Gap Filling Procedures - Estimation of the Missing Values

Usually the effective use of many of the methods requires complete water analyses (no missing data values). Missing data values may make the use of graphical water chemistry techniques impossible, or limit the quality of the statistical analysis. During the statistical analysis, most statistical software packages replace those missing values with means of the variables or prompt the user for casewise deletion of analytical data, both of which are not desirable. This can bias statistical analyses if these values represent a significant number of the data being analyzed.

There are statistical methods and chemical relationships that can be employed to estimate missing data values. For instance, missing conductance data can be calculated from total dissolved solids (TDS) data by using a simple linear regression method. In our database, a significantly (p<0.001) high correlation coefficient (r = 0.984) was found to exist between these two variables. The p-value is the significance probability for testing the null hypothesis that true correlation in the population is zero. A small value of p (e.g. p<0.001) indicates that there is a significant correlation. Thus, missing potassium (K) values were estimated by utilizing the linear relationship between potassium and sodium (Na), which had a significantly (p<0.001) high correlation coefficient of 0.904.

Missing bicarbonate (HCO3) data can be calculated from alkalinity values and pH. Inverting the problem, missing pH values can be calculated by using Equation (1) if the CO3 and HCO3 values were reported:


The same relationship can also be used to calculate the missing carbonate (CO3) values if pH was reported. Finally, if there were no means of establishing a value, a value of "-9999" was entered for the missing value, indicating that no data were available for that entry. In our data set there were very few (3%) samples with censored or missing values.

Charge Balance Error

The edited chemical analyses in the SLHDATA database were tested for charge balance (Freeze and Cherry 1979):


Where z is the absolute value of the ionic valence, mc the molality of cationic species and ma the molality of the anionic species.

Conventions and assumptions used in balancing the analyses included:

when bicarbonate and carbonate data were not given, alkalinity, if available, was used to estimate a bicarbonate concentration,

in the few cases where the calcium and magnesium data were missing, hardness is used to estimate the sum of calcium and magnesium concentrations.

Calculated charge balance errors are less than or equal to ±10.4% for SLHDATA database, which is an acceptable error for the purpose of this study (Table 2). Samples with errors greater than ±10.4% were not used. For the spring water and well water (groundwater) data, errors are evenly distributed between positive and negative values, and thus are not systematic (Figure 3a, 3b, 3c, and 3d). The charge balance errors of the surface water data showed a bimodal distribution and had a skewed distribution (Figure 3e and 3f). Accordingly, the surface water samples were further studied. The samples were split into those from the Sierra Nevada mountain block and the Indian Wells-Owens Valley area (see Figure 7 for locations). The Indian Wells-Owens Valley area samples have charge balance errors that approach a normal distribution and range from -6% to +10%, while the Sierra Nevadan samples had a strongly skewed distribution. The Sierra Nevada data included 78 samples collected for the Domeland Wilderness study (McHugh et al. 1981), which were identified as the source of the skewed distribution and indicates a systematic error in that particular set of analyses. However, the error is not sufficient to remove the data set from the database.

Data Screening

The purpose of data screening is to evaluate the distribution characteristics of each variable in the database. We used univariate and bivariate statistical methods to assess each variable independently, and the relationship between variable pairs. The physical and chemical properties were evaluated using central tendency (mean, median, mode) and dispersion (standard deviation, skewness), and by graphical displays such as histograms, scatterplots, probability plots, and box plots. Based on these analyses, decisions were made concerning the need for, and selection of, appropriate transformations to achieve a better approximation of the normal distribution. This is important because most of the statistical analyses assume that data are normally distributed.

The data screening showed that the data used in this study were universally skewed positively; the data contained a small number of high values. Most naturally occurring element distributions follow this pattern (Miesch 1976). The data were log-transformed (except for pH) so that they more closely corresponded to normally distributed data. Then, all the 11 variables were standardized by calculating their standard scores (z-scores) as follows:


Standardization scales the log-transformed data to a range of approximately -3 to +3 standard deviations, centered about a mean of zero. In this way, each variable has equal weight in the statistical analyses. Otherwise, the Euclidean distances will be influenced most strongly by the variable that has the greatest magnitude (Judd 1980; Berry 1995). Besides normalizing and reducing outliers, these transformations also tend to homogenize the variance of the distribution (Rummel 1970). The raw data (with data-gaps filled) were used for the graphical analyses, while the transformed (log-transformed and standardized) data were used for the hierarchical cluster analysis (HCA), K-means cluster analysis (KMC), Principal components analysis (PCA), and Fuzzy k-means clustering (FKM).


The fundamental aim of the techniques compared here is to identify chemical relationships between water samples. Samples with similar chemical characteristics often have similar hydrologic histories, similar recharge areas, infiltration pathways and flowpaths in terms of climate, mineralogy and residence time. Table 3 shows the various techniques and the required input data. For brevity, only the 152 spring water samples are discussed in the following text. The other subsets of the complete database produced similar results. A preliminary analysis of temporal effects, based on examination of individual analyses, suggested that relatively little change occurred in the water quality of samples with time. This indicates that the spatial variability is the most important source of variation in the data, rather than the temporal factor. This conclusion was later tested and confirmed as discussed in the statistical methods section. For that reason, we did not include samples from temporal series to statistical analysis. This reduced the total number of samples to 118.

Graphical Methods

Most of the graphical methods are designed to simultaneously represent the total dissolved solid concentration and the relative proportions of certain major ionic species (Hem 1989). All the graphical methods use a limited number of parameters, usually a subset of the available data, unlike the statistical methods that can utilize all the available parameters. The Piper diagram (Piper 1944) (Figure 4) is the most widely used graphical form and it is quite similar to the diagram proposed by Hill (1940; 1942). The diagram displays the relative concentrations of the major cations and anions on two separate trilinear plots, together with a central diamond plot where the points from the two trilinear plots are projected. The central diamond-shaped field (quadrilateral field) is used to show overall chemical character of the water (Hill 1940; Piper 1944). Back (1961) and Back and Hanshaw (1965) defined subdivisions of the diamond field that represent water type categories that form the basis for one common classification scheme for natural waters. The mixing of water from different sources or evolution pathways can also be illustrated by this diagram (Freeze and Cherry 1979). Symbol sizes can be scaled to TDS on the diamond-shaped field to add even more information (Domenico and Schwartz 1997).

Figure 4 shows the results of plotting the 118 spring samples on the Piper diagram. The data are broadly distributed rather than forming distinct clusters. Employing the water classification scheme of Back and Hanshaw (1965), the samples are classified into a variety of water types including Ca-HCO3, Ca-Mg-HCO3, Ca-Na-HCO3, Na-HCO3, Na-Ca-HCO3, Na-Cl and Ca-Mg-SO4 types, with no dominant type. This diagram provides little information that allows us to discriminate between separate clusters (groups) of samples.

The Collins bar diagram (Collins 1923) and the Pie diagram (Figure 5) are easy to construct and present relative major ion composition in percent milliequivalents per liter (relative %meq L-1). The constituents can also be plotted in meq L-1 with an appropriate scaling. For the Collins bar diagram major cations are plotted on the left and major anions are plotted on the right. For the Pie diagram the cations are plotted in the upper half and anions are plotted in the lower half of the circle. The Pie diagram is usually drawn with a radius proportional to TDS.

The Stiff pattern (Figure 5) is a polygon that is created from three (or four) parallel horizontal axes extending on either side of a vertical zero axis (Stiff 1951). In this diagram, cations are plotted on the left of the axes and anions are plotted on the right, in units of milliequivalents per liter (meq L-1). The Stiff diagram is usually plotted without the labeled axis and is useful making visual comparison of waters with different characteristics. The patterns tend to maintain its shape upon concentration or dilution, thus visually allowing us to trace the flowpaths on maps (Stiff 1951).

The Schoeller semi-logarithmic diagram (Schoeller 1955; 1962) (Figure 5) allows the major ions of many samples to be represented on a single graph, in which samples with similar patterns can be easily discriminated. The Schoeller diagram shows the total concentration of major ions in log-scale.

As we can see from Figure 5, the Collins, Pie, and Stiff methods produce a single diagram for each sample. Clearly, it is not practical to produce and manually sort 118 separate figures (e.g. Stiff diagrams), one for each sample, in order to sort and classify large data sets. The choice of similarity would be based on the evaluation of the analyst, which is highly subjective. We suggest therefore that using purely graphical methods to group the samples is not efficient and can produce biased results. However, these methods are useful for presentation of maps showing hydrochemical facies, and software is available (e.g. RockWorks®) to automatically and rapidly prepare such maps.

Multivariate Statistical Techniques

Another approach to understanding the chemistry of water samples is to investigate statistical relationships among their dissolved constituents and environmental parameters, such as lithology, using multivariate statistics (Drever 1997). Statistical associations do not necessarily establish cause-and-effect relationships, but do present the information in a compact format as the first step in the complete analysis of the data and can assist in generating hypothesis for the interpretation of hydrochemical processes.

Statistical techniques, such as cluster analysis, can provide a powerful tool for analyzing water chemistry data. These methods can be used to test water quality data and determine if samples can be grouped into distinct populations (hydrochemical groups) that may be significant in the geologic context, as well as from a statistical point of view. Cluster analysis was successfully used, for instance, to classify lake samples into geochemical facies (Jaquet et al. 1975). Alther (1979), Williams (1982), and Farnham et al. (2000) also applied cluster analysis to classify water chemistry data.

The assumptions of cluster analysis techniques include homoscedasticity (equal variance) and normal distribution of the variables (Alther 1979). Equal weighing of all variables requires the log-transformation and standardization (z-scores) of the data, as discussed above. Comparisons based on multiple parameters from different samples are made and the samples grouped according to their "similarity" to each other. The classification of samples according to their parameters is termed Q-mode classification. This approach is commonly applied to water chemistry investigations in order to define groups of samples that have similar chemical and physical characteristics, because rarely is a single parameter sufficient to distinguish between different water types.

Both the Hierarchical cluster analysis (HCA) and K-means clustering (KMC) were used to classify the samples into distinct hydrochemical groups based on their similarity. In order to determine the relation between groups, the r ´ c data matrix (r samples with c variables) is imported into a statistics package. The StatisticaÒ (StatSoft, Inc. 1995) has seven similarity/dissimilarity measurements and seven linkage methods and supports up to 300 cases for the amalgamation process in the cluster analysis. Individual samples are compared with the specified similarity/dissimilarity and linkage methods and then grouped into clusters.

The linkage rule used here is Ward's method (Ward 1963). Linkage rules iteratively link nearby points (samples) by using the similarity matrix. The initial cluster is formed by linkage of the two samples with the greatest similarity. Ward's method is distinct from all other methods because it uses an analysis of variance (ANOVA) approach to evaluate the distances between clusters. Ward's method calculates the error sum of squares, which is the sum of the distances from each individual to the center of its parent group (Judd 1980) and forms smaller distinct clusters than those formed by other methods (StatSoft, Inc. 1995).

Similarity/dissimilarity measurements and linkage methods used for clustering greatly affects the outcome of the HCA results. After careful examinations of available combinations of similarity/dissimilarity measurements, it was found that using Euclidean distance (straight line distance between two points in c-dimensional space defined by c variables) as similarity measurement, together with Ward's method for linkage, produced the most distinctive groups where each member within the group is more similar to its fellow members than to any member from outside the group. The HCA technique does not provide a statistical test of group differences, however there are tests that can be applied externally for this purpose (e.g. Student's t-test). It is also possible in HCA results that one single sample that does not belong to any of the groups is placed in a group by itself. This unusual sample is considered as residue.

HCA classifies the data in a relatively simple and direct manner, with the results being presented as a dendogram, an easily understood and familiar diagram (Davis 1986). In the present case, we selected the number of groups based on visual examination of the dendogram (Figure 6). The resulting dendogram was interpreted to have classified the 118 spring water samples into three major groups (I to III) and nine subgroups (1 to 9) using 11 variables; this, however, is a subjective evaluation. Greater or fewer groups could be defined by moving the dashed horizontal line (phenon line) up or down. In addition, the dendogram does not give information about the distribution of the chemical constituents that form each group, a distinct limitation when compared to the graphical techniques. The differences among subgroups defined by the HCA (Figure 6) were determined to be statistically significant (p<0.001), except the subgroups 2 and 3 of Group I, which were significant only at p<0.05.

Table 4 shows the means for each of the parameters produced by the HCA analysis. These values reveal some trends between the major groups. Group I samples all have significantly higher TDS than group II or III samples. Subgroup 6 has only one member (Table 4 and Figure 6), a sample which is distinguished by an abnormally low SiO2 value. This value is probably an analytical or typographic error, and was removed from the database. Groups II and III also appear to be separated based on TDS. The basis for the division into subgroups is not so apparent. For instance, subgroups 1 and 2 appear to be distinguished from subgroup 3 by the lower pH values and higher Ca and Mg values. However, the differences between subgroups 5 and 7 are subtle.

At this point, it is fair to ask if these clusters of samples have any physical significance/meaning, or are just a statistical result. The relationship of the statistically defined clusters of samples to geographic location was tested by plotting the subgroup value for each sample on a site map (Figure 7). The figure shows that there is a good correspondence between spatial locations and the statistical groups as determined by the HCA. For instance, the spring samples composing group I are usually found close to playa or nearby discharge areas on the basin floors and have the highest TDS concentration in the area (Table 4). Group II samples are mostly located below the 2000 m contour line in the Sierra Nevada and also found at the ranges surrounding the valleys (Figure 7). Group III samples plot above the 2000 m contour line in the high Sierra Nevada and they are characterized by low TDS concentrations (Table 4). Majority of recharge to the basin-fill aquifers occurs from areas where group II and III samples are located. It appears that the technique can provide valuable information to help define the hydrologic system. For instance, the high degree of spatial and statistical coherence in this data set could be used to support a model of hydrochemical evolution where the changes in water chemistry are a result of increasing rock-water interactions along hydrological flowpaths.

K-means clustering (KMC) has also been used to classify water samples into distinct hydrochemical groups (Johnson and Wichern 1992). This method of clustering is different from the HCA as the number of clusters is pre-selected at the start of the analysis, producing a subjective bias. The KMC method will produce exactly K different clusters with the greatest possible distinction. Computationally, this method can be thought of as an analysis of variance (ANOVA) in reverse. The clustering starts with K random clusters, and then moves objects between those clusters with the goal to (1) minimize variability within clusters and (2) maximize variability between clusters (StatSoft, Inc. 1995). Unlike HCA, the results from KMC cannot be presented in a dendogram for a quick visual assessment of the results. Instead the results are presented in a large table that shows members of clusters and their distances from respective cluster centers.

As discussed previously, we did not include samples from temporal series because the preliminary analysis showed that there was little temporal variation. To verify that analysis, we included the entire 152 spring samples in an hierarchical cluster analysis (HCA) and examined the resulting dendogram. The dendogram had the temporal series samples placed together, suggesting that little change occurs in the water quality with time period of sampling. This agrees with the preliminary analysis that spatial variability is the most important source of variation in the data.

Another type of data analysis sometimes used is Principal components analysis (PCA). This technique reduces the number of dimensions present in data (reducing 11 variables to 2 variables in our study). The PCA-defined new variables can then be displayed in a scatter diagram, presenting the individual water samples as points in a lower-dimensional (generally 2-D) space. This technique, strictly speaking, is not a multivariate statistical technique but a mathematical manipulation that may provide a certain amount of insight into the structure of the data matrix (Davis 1986) by reducing the dimensions of the data matrix. Figure 8 shows the results of the Principal components analysis of the 118 samples. The first principal component (PC1) contains 54.5% of the total variance and the second component (PC2) represents 14.5% of the total variance. Although there appears to be reasonable statistical discrimination between the three major groups as defined by HCA, there is no objective means to clearly distinguish boundaries between the groups or subgroups, nor does this type of analysis provide any information about chemical composition. This method was used to investigate the degree of continuity or clustering of the samples and to determine if overlapping water types exist within the data. The scatter of points in Figure 8 suggests that there is continuous variation of the chemical and physical properties of the samples.

The HCA clustering scheme was also repeated using just the two Principal components scores (reduced two-dimensional data). The resulting classification differs very little from the first HCA classification, suggesting that employing PCA has not improved the clustering results here. However, other data sets may benefit since using lower dimensional data (defined by PCA) may improve the clustering results by reducing the redundancy in the data. The use of variables that have specific relationships can cause undesirable redundancies in cluster analysis. For instance, TDS (related to total ions present and also specific conductance), alkalinity (related to bicarbonate), and hardness (related to calcium and magnesium) were not used in our cluster analyses (HCA and KMC) since they are directly related.

Fuzzy k-means Clustering

Geological and hydrochemical systems are sometimes too complex to analyze easily using conventional graphical or statistical methods. Often the chemical and physical properties of the natural system vary continuously, rather than abruptly. In other words, these underlying physical and chemical processes do not always produce discrete outcomes. Because of this continuity, statistical clusters may not be well separated and instead may form a sequence of overlapping clusters. Therefore, methods related to "Fuzzy logic" may be useful for modeling and classification purposes.

Application of Fuzzy logic in earth sciences is still in its early stages. On this topic, there are only a small number of papers published in the areas of geophysics, geology, petroleum and geotechnical engineering. For example, McBratney and Moore (1985) applied fuzzy sets to climatic classification and later McBratney and deGruijter (1992) and Odeh et al. (1992) used the Fuzzy k-means approach for classification of soils. Nordlund (1996) applied a rule-based Fuzzy logic to model deposition and erosion processes.

Traditional Aristotelian logic (binary logic) imposes sharp boundaries (Sibigtroth 1998), however, Fuzzy logic has no sharp boundaries (Fang and Chen 1990). Fuzzy logic is basically a multi-valued logic that allows intermediate values to be defined between conventional evaluations like yes/no, 0/1, true/false, black/white, and so on (Zadeh 1965; McNeill and Freiberger 1993). Fuzzy logic also allows for formalization of qualitative statements, which are widely used in earth sciences. Both fuzziness and probability describe uncertainty numerically; however, probability treats yes/no occurrences and is inherently a statistical method. Fuzziness deals with degrees and is a non-statistical method (Zadeh 1965).

One approach to fuzzy classification, and probably the best and most commonly used, is Fuzzy c-means (Bezdek 1981), later renamed to Fuzzy k-means (FKM) by deGruijter and McBratney (1988). This method minimizes the within-class sum of square errors. In this technique, samples may not be a 100% member of a group, instead the membership of samples are graded (partitioned) between groups. For example, a water sample may be mostly a member of a certain group, but it may be also a partial member of other groups. The analysis produces membership grades for each sample between 0 and 1. The higher the membership value for a group, the more closely the sample resembles the other members of this group. The FKM method does not impose any limitations on the number of samples or objects that can be clustered in one batch. Some clustering programs limit the amount of samples that can be clustered in one batch (e.g. MVSP (Kovach 1990), 100 samples and StatisticaÒ (Statsoft, Inc. 1995), 300 samples). Others use a two-step approach (pre-clustering and clustering) to cluster samples (SAS Institute Inc. 1988). In this respect, FKM may provide a better tool for clustering a larger data set (e.g. combination of spring, surface and well water data) with overlapping or continuous clusters.

We employed the program FuzME® (Minasny and McBratney 1999), which uses Brent's algorithm (Press et al. 1992), when searching for an optimal value (deGruijter and McBratney 1988). In this method, a parameter called "fuzziness exponent" (f) is selected before application of the method. It determines the degree of fuzziness of the final solution, which is the degree of overlap between groups. With the minimum meaningful value of f=1, the solution is a hard partition, that is, the result obtained is not fuzzy at all. As f approaches infinity (¥) the solution approaches its highest degree of fuzziness (Bezdek 1981). For most data, 1.5 £ f £ 3.0 produces satisfactory results (Bezdek et al. 1984). The Fuzzy k-means algorithm is applied as follows (Minasny and McBratney 1999):

Choose the number of classes K (which is equivalent to HCA subgroups), with 1 < K < n,

Choose a value for the fuzziness exponent f, with f > 1,

Choose a definition of distance in the variable-space (Euclidean, Diagonal or Mahalanobis distance),

Choose a value for the stopping criterion e (e.g. e=0.001 gives reasonable convergence),

5. Initialize with random memberships or with memberships from a hard K-means partition (e.g. HCA or KMC).

Odeh et al. (1992) suggested methods for choosing fuzziness exponent and number of classes. For our study, a value of 1.5 was used for the fuzziness exponent (f) and Euclidean distance was chosen as the distance measure. Like the KMC, the selection of the optimal number of groups was based on the results of the HCA technique (9 subgroups).

In the FKM method, the results are strongly influenced by those variables having large variances. Therefore, log-transformed and standardized data matrix was used as input data for the FKM analysis. The FKM analysis reduced the original 118 by 11 data matrix to a 9 by 11 matrix of class centers. Table 5 shows the membership matrix for the first 20 samples (the complete table is too large to present).


A direct comparison of the results of the three types of cluster analysis (HCA, KMC, and FKM) is difficult since only the HCA technique produces a graphical output. Therefore, we plotted the HCA-, KMC- and FKM-defined means for each subgroup on a Piper diagram. Figure 9 shows that the HCA-, KMC- and FKM-defined means for each group overlap for most of the subgroups, showing the similar results obtained for all three methods. For instance, the FKM analysis placed 97% of the samples within the same three major groups defined by HCA method, while 79% of the samples are placed exactly into same subgroups. However, in both the KMC and FKM analysis we had pre-selected the number of groups (in our case that number was based on the nine defined by the HCA results). The similarity of the results for all three techniques suggests that the pre-selection of the number of groups strongly influences the outcome. This is a serious limitation that means the investigator is required to have performed some type of preliminary analysis when employing the KMC and FKM techniques, which could then bias the results of the statistical analysis.

The efficiency and semi-objective nature of the statistical techniques makes these techniques superior to the graphical methods in order to group samples based on water chemistry data. However, the graphical methods provide valuable information about the chemical nature of the groups. By combining the two techniques we can gain additional information that neither technique by itself can offer.

Figure 10a, 10b, and 10c show the mean values for each of the nine subgroups (defined by HCA) on Collins bar, Pie and Stiff diagrams, respectively. Each graphical technique shows distinct visual differences between the subgroups, while providing information about the chemical composition of each group. In Figure 11, all the samples are plotted on Schoeller semi-logarithmic diagrams for each subgroup. This plot illustrates the difficulty in using purely graphical means to cluster samples. The patterns of subgroups 1 and 2 are distinctive, but subgroup 3 does not appear related. However, while subgroups 4, 5, 6 and 7 show a distinct pattern that differs from the other subgroups, it would be difficult to discriminate between samples belonging to subgroups 4, 5, 6 and 7.

Although previously not utilized in the classification of water samples, we have included an example of icon plots that can be used to represent and visually discern similarities between water samples. The basic idea of icon plots is to represent individual water samples as graphical objects where values of variables are assigned to specific features or dimensions of the objects. The assignment is such that the overall appearance of the objects changes as a function of the configuration of values. Thus, the objects are given visual "identities" that are unique for configurations of values. One of the most elaborate type of icon plot is Chernoff faces (Chernoff 1973), which can be used to plot up to 20 parameters for one water sample. Chernoff faces were plotted for the subgroup means from the cluster analysis (Figure 12). This technique also provides an effective visualization of a small number of water samples having different characteristics. The different physical and chemical characteristics of the samples are shown by the changes in facial features. Parameter values are represented in schematic humanlike faces such that the values for each variable are represented by the variations of specific facial features (StatSoft, Inc. 1995). Examining such plots may help to discover specific clusters of both simple relations and interactions between variables.


Each technique that was discussed in this paper had advantages and disadvantages in clustering and displaying water samples using typical chemical and physical parameters. The graphical techniques can provide valuable and rapidly accessible information about the chemical composition of water samples such as the relative proportion of the major ions, however, these techniques have some serious limitations when used alone. All the graphical techniques use only a portion of the available data. Minor constituents (0.01-10 mg L-1) (e.g. boron, fluoride, iron, nitrate, strontium) and trace constituents (<0.1 mg L-1) (e.g. aluminum, arsenic, barium, bromide, chromium, lead, lithium) are not used. From a water quality standpoint, the presence of one of these minor or trace elements may be important because small amounts can pose threats to human health. Some of these minor and trace constituents behave more conservatively in the groundwater, thus can be used more efficiently to classify waters (Farnham et al. 2000). Some graphical techniques can display only one sample (or a mean) per diagram (e.g. Collins bar, Pie, Stiff), while others can display multiple samples (e.g. Piper, Schoeller). Neither type is particularly useful to produce distinct grouping of samples because there is no objective means to discriminate the groups or to test the degree of similarity between samples in a group. Collins bar, Pie, and Stiff diagrams are probably the best to help distinguish between small numbers of samples that have distinct chemical differences. For a large number of samples these diagrams are unwieldy. In this study, neither the Piper nor Schoeller diagrams could group all the similar water samples (based on statistical measures) from our data set.

Unlike the graphical classification techniques, multivariate statistical techniques can use any combination of chemical and physical parameters (e.g. temperature) to classify water samples. The HCA technique was judged more efficient than the KMC and FKM techniques since it offers a semi-objective graphical clustering procedure (dendogram), which does not require pre-determining the final number of clusters. However, none of the statistical techniques offered easily accessible information about the chemical composition of the samples in the clusters (groups). That is, these methods are very efficient at grouping water sample by physical and chemical similarities, but the results are not immediately useful for identifying trends or processes relevant to hydrogeochemical problems.

Combining the two approaches appears to offer a methodology that retains the advantages while minimizing the limitations of either approach. Using the HCA analysis to initially cluster the samples (into groups and subgroups) provides an efficient means to recognize groups of samples that have similar chemical and physical characteristics. The technique also allows discrimination of samples that have extreme values for closer evaluation. These statistical groups have distinct spatial patterns in the study area, providing the spatial discrimination desired when determining hydrochemical facies. The mean for each of the required chemical parameters are then plotted on diagrams, e.g. a Piper diagram, offering easily accessible information on the chemical differences between the groups and potential information about the physical and chemical processes in the watershed. The use of the hierarchical cluster analysis (HCA) in conjunction with a multi-sample graphical technique such as the Piper plot offers a robust methodology with consistent and objective criteria to efficiently classify large numbers of water samples based on common chemical and physical parameters.