Texture classification

T. Ojala and M. Pietikäinen
Machine Vision and Media Processing Unit
University of Oulu
Finland
http://www.ee.oulu.fi/research/imag/texture

Classification refers to assigning a physical object or incident to one of a set of predefined categories. In texture classification the goal is to assign an unknown sample image to one of a set of known texture classes. Texture classification is one of the four problem domains in the field of texture analysis. The other three are texture segmentation (partitioning of an image into regions that are homogeneous with respect to texture; supervised texture segmentation with a priori knowledge of the textures to be separated simplifies to texture classification), texture synthesis (building a model of image texture that can then be used for generating the texture) and shape from texture (a 2D image is considered to be a projection of a 3D scene, and apparent texture distortions in the 2D image are used to estimate surface orientations in the 3D scene).

Texture analysis is important in many applications of computer image analysis for classification or segmentation of images based on local spatial variations of intensity or color. Successful classification or segmentation requires an efficient description of image texture. Important applications include industrial and biomedical surface inspection, for example for defects and disease, ground classification and segmentation of satellite or aerial imagery, segmentation of textured regions in document analysis, and content-based access to image databases. However, despite the many potential areas of application for texture analysis in industry, there are only a limited number of successful examples. A major problem is that textures in the real world are often not uniform, due to changes in orientation, scale or other aspects of visual appearance. In addition, the computational complexity of many of the proposed texture measures is very high.

The texture classification process involves two phases: the learning phase and the recognition phase. In the learning phase, the target is to build a model for the texture content of each texture class present in the training data, which generally comprises images with known class labels. The texture content of the training images is captured with the chosen texture analysis method, which yields a set of textural features for each image. These features, which can be scalar values, discrete histograms or empirical distributions, characterize textural properties of the images, such as spatial structure, contrast, roughness and orientation. In the recognition phase the texture content of the unknown sample is first described with the same texture analysis method. Then the textural features of the sample are compared to those of the training images with a classification algorithm, and the sample is assigned to the category with the best match. Optionally, if the best match is not sufficiently good according to some predefined criterion, the unknown sample can be rejected instead.
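
The two phases can be summarized with a minimal sketch, given here under illustrative assumptions: the function names are hypothetical, features are assumed to be fixed-length NumPy arrays, and a simple nearest-match rule with squared-difference dissimilarity stands in for the classification algorithm. Any of the texture analysis methods discussed below could play the role of extract_features.

    import numpy as np

    def learn(training_images, labels, extract_features):
        # Learning phase: describe each training image with the chosen
        # texture analysis method and store the features with the labels.
        return [(extract_features(img), lab)
                for img, lab in zip(training_images, labels)]

    def recognize(sample, model, extract_features, reject_threshold=None):
        # Recognition phase: describe the unknown sample with the same
        # method and assign it to the best-matching category.
        f = extract_features(sample)
        best, label = min((np.sum((f - g) ** 2), lab) for g, lab in model)
        if reject_threshold is not None and best > reject_threshold:
            return None  # optional rejection of an insufficient match
        return label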

A wide variety of techniques for describing image texture have been proposed. Tuceryan and Jain (1993) divided texture analysis methods into four categories: statistical, geometrical, model-based and signal processing. Due to the extensive research on texture analysis over the past 30 years it is impossible to list all published methods, but the following paragraphs provide a short introduction to each of the four categories, together with some key references. For surveys on texture analysis methods see Haralick (1979), Van Gool et al. (1985), Haralick and Shapiro (1992), Chellappa et al. (1993), Reed and Du Buf (1993), and Tuceryan and Jain (1993).

Statistical methods analyze the spatial distribution of gray values by computing local features at each point in the image and deriving a set of statistics from the distributions of the local features. Depending on the number of pixels defining the local feature, statistical methods can be further classified into first-order (one pixel), second-order (two pixels) and higher-order (three or more pixels) statistics. The basic difference is that first-order statistics estimate properties (e.g. average and variance) of individual pixel values, ignoring the spatial interaction between image pixels, whereas second- and higher-order statistics estimate properties of two or more pixel values occurring at specific locations relative to each other. The most widely used statistical methods are cooccurrence features (Haralick et al. 1973) and gray level differences (Weszka et al. 1976), which have later inspired a variety of modifications. These include signed differences (Ojala et al. 2001) and the LBP (Local Binary Pattern) operator (Ojala et al. 1996), which incorporate occurrence statistics of simple local microstructures, thus combining statistical and structural approaches to texture analysis. Other statistical approaches include the autocorrelation function, which has been used for analyzing the regularity and coarseness of texture (Kaizer 1955), and gray level run lengths (Galloway 1975), but their performance has been found to be relatively poor (Conners & Harlow 1980).
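
As a concrete example, the following is a rough sketch of the basic 3x3 LBP operator (Ojala et al. 1996); details such as the neighbor ordering and the use of NumPy are assumptions made here for illustration.

    import numpy as np

    def lbp_histogram(image):
        # Threshold the eight neighbors of each pixel at the value of the
        # center pixel and encode the binary results as an 8-bit code.
        img = np.asarray(image, dtype=np.int32)
        center = img[1:-1, 1:-1]
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        codes = np.zeros_like(center)
        for bit, (dy, dx) in enumerate(offsets):
            neighbor = img[1 + dy:img.shape[0] - 1 + dy,
                           1 + dx:img.shape[1] - 1 + dx]
            codes += (neighbor >= center) * (1 << bit)
        # The occurrence histogram of the 256 possible codes serves as
        # the texture feature.
        return np.bincount(codes.ravel(), minlength=256)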

Geometrical methods consider texture to be composed of texture primitives, attempting to describe the primitives and the rules governing their spatial organization. The primitives may be extracted by edge detection with a Laplacian-of-Gaussian or difference-of-Gaussian filter (Marr 1982, Voorhees & Poggio 1987, Tuceryan & Jain 1990), by adaptive region extraction (Tomita & Tsuji 1990), or by mathematical morphology (Matheron 1967, Serra 1982). Once the primitives have been identified, the analysis is completed either by computing statistics of the primitives (e.g. intensity, area, elongation, and orientation) or by deciphering the placement rule of the elements (Zucker 1976, Fu 1982). The structure and organization of the primitives can also be represented using Voronoi tessellations (Ahuja 1982, Tuceryan & Jain 1990). Image edges are an often used primitive element. Davis et al. (1979) and Davis et al. (1981) defined generalized cooccurrence matrices, which describe second-order statistics of edges. Dyer et al. (1980) extended the approach by including the gray levels of the pixels near the edges in the analysis. An alternative to generalized cooccurrence matrices is to look for pairs of edge pixels that fulfill certain conditions regarding edge magnitude and direction. Hong et al. (1980) assumed that edge pixels form a closed contour, and primitives were extracted by searching for edge pixels with opposite directions (i.e. assumed to be on opposite sides of the primitive), followed by a region growing operation. Properties of the primitives (e.g. area and average intensity) were used as texture features. Pietikäinen and Rosenfeld (1982) did not require edges to form closed contours; instead, statistics computed for pairs of edge pixels that had opposite directions and were within a predetermined distance of each other were used as texture features (e.g. distance between the edge pixels, average gray level on the line between the edge pixels).
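
A minimal sketch of the primitive-based idea is given below; it assumes SciPy is available, and extracting primitives as connected regions of negative Laplacian-of-Gaussian response is a deliberate simplification of the methods cited above.

    import numpy as np
    from scipy import ndimage

    def primitive_statistics(image, sigma=2.0):
        # Extract blob-like primitives as connected regions of negative
        # Laplacian-of-Gaussian response (bright blobs at scale sigma).
        img = np.asarray(image, dtype=float)
        mask = ndimage.gaussian_laplace(img, sigma) < 0
        primitives, n = ndimage.label(mask)
        index = np.arange(1, n + 1)
        areas = ndimage.sum(mask, primitives, index)        # primitive areas
        intensities = ndimage.mean(img, primitives, index)  # mean gray levels
        # Statistics of the primitive attributes serve as texture features.
        return np.array([areas.mean(), areas.std(),
                         intensities.mean(), intensities.std()])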

Model-based methods hypothesize the underlying texture process, constructing a parametric generative model that could have created the observed intensity distribution. The intensity function is considered to be a combination of a function representing the known structural information on the image surface and an additive random noise sequence. For detailed discussions of image models see Ahuja and Schachter (1983), Kashyap (1986), and Chellappa et al. (1993). Pixel-based models view an image as a collection of pixels, whereas region-based models regard an image as a set of subpatterns placed according to given rules. An example of region-based models is the random mosaic model, which tessellates the image into regions and assigns gray levels to the regions according to a specified probability density function (Schachter et al. 1978). The facet model is a pixel-based model, which assumes no spatial interaction between neighboring pixels; the observed intensity function is assumed to be the sum of a deterministic polynomial and additive noise (Haralick & Watson 1981).

Stochastic spatial interaction models treat the intensity process as a stochastic process. The observed intensity function is regarded as the output of a transfer function whose input is a sequence of independent random variables, i.e. the observed intensity is a linear combination of intensities in a specific neighborhood plus an additive noise term. Various types of models can be obtained with different neighborhood systems and noise sources. One-dimensional time-series models (autoregressive (AR), moving-average (MA) and autoregressive-moving-average (ARMA)) model statistical relationships of intensities along a raster scan, assuming an independent noise source (McCormick & Jayaramamurthy 1974, Box & Jenkins 1976). Random field models analyze spatial variations in two dimensions. Global random field models treat the entire image as a realization of a random field (Frieden 1980, Hunt 1980), whereas local random field models assume relationships of intensities in small neighborhoods. A Gibbs random field is a global model, in which a probability density function is assigned to the entire image using cliques of neighboring pixels as the neighborhood system (Besag 1974, Hassner & Sklansky 1980, Geman & Geman 1984). A widely used class of local random field models is the Markov random field model, in which the conditional probability of the intensity of a given pixel depends only on the intensities of the pixels in its neighborhood (so-called Markov neighbors) (Dobrushin 1968, Woods 1972). In a Gaussian Markov random field model the intensity of a pixel is a linear combination of the values in its neighborhood plus a correlated noise term. A relatively new random field model is the so-called Wold decomposition, in which the texture field is decomposed into a sum of mutually orthogonal components: a purely indeterministic component and a deterministic component, which is further orthogonally decomposed into a harmonic component and an evanescent component (Francos 1990, Francos et al. 1993). Describing texture with random field models is an optimization problem: the chosen model is fitted to the image, and an estimation algorithm is used to set the parameters of the model to yield the best fit. The obtained parameter values are then used in further processing, e.g. for segmenting the image.
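
As an illustration of such model fitting, the sketch below estimates the parameters of a simple causal autoregressive model by least squares; the four-pixel causal neighborhood is an assumption chosen for brevity, and the estimated coefficients would serve as texture features.

    import numpy as np

    def ar_parameters(image):
        # Fit a causal autoregressive model: each pixel is modeled as a
        # linear combination of four previously scanned neighbors plus noise.
        img = np.asarray(image, dtype=float)
        y = img[1:, 1:-1].ravel()                    # pixel being predicted
        X = np.column_stack([img[1:, :-2].ravel(),   # left neighbor
                             img[:-1, :-2].ravel(),  # upper-left neighbor
                             img[:-1, 1:-1].ravel(), # upper neighbor
                             img[:-1, 2:].ravel()])  # upper-right neighbor
        # Least-squares estimate of the model parameters.
        theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        return theta
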
Mandelbrot (1983) proposed describing images with fractals, a set of self-similar functions characterized by the so-called fractal dimension, which correlates with the perceived roughness of image texture (Pentland 1984). In contrast to autoregressive and Markov models, fractals have high power in low frequencies, which enables them to model processes with long periodicities. An interesting property of this model is that the fractal dimension is scale invariant. Several methods have been proposed for estimating the fractal dimension of an image (Keller et al. 1989, Rao 1990).
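
One of the simplest estimators is box counting on a binary (e.g. thresholded edge) image; the sketch below is a minimal version of this idea and assumes the pattern is nonempty at every box size (cf. Keller et al. 1989).

    import numpy as np

    def box_counting_dimension(binary):
        # Estimate the fractal dimension as the slope of log(box count)
        # versus log(1 / box size): N(s) ~ s**(-D).
        binary = np.asarray(binary, dtype=bool)
        sizes = [2, 4, 8, 16, 32]
        counts = []
        for s in sizes:
            h, w = binary.shape[0] // s * s, binary.shape[1] // s * s
            blocks = binary[:h, :w].reshape(h // s, s, w // s, s)
            counts.append(np.count_nonzero(blocks.any(axis=(1, 3))))
        slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)),
                              np.log(counts), 1)
        return slope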

Signal processing methods analyze the frequency content of the image. Spatial domain filters, such as the Laws (1980) masks, the local linear transforms proposed by Unser and Eden (1989), and various masks designed for edge detection (e.g. Roberts' and Sobel's operators (Rosenfeld & Kak 1976)), are the most direct approach for capturing frequency information. Rosenfeld and Thurston (1970) introduced the concept of edge density per unit area: fine textures tend to have a higher density of edges than coarse textures. Another class of spatial filters are moments (Laws 1980), which correspond to filtering the image with a set of spatial masks. The resulting images are then used as texture features. Tuceryan (1992) used moment-based features successfully in texture segmentation. 'True' frequency analysis is done in the Fourier domain. The Fourier transform describes the global frequency content of an image, without any reference to localization in the spatial domain, which results in poor performance. Spatial dependency is incorporated into the representation with a window function, resulting in the so-called short-time Fourier transform. The squared magnitude of the two-dimensional version of the short-time Fourier transform is called the spectrogram, which Bajcsy and Lieberman (1976) used in analyzing shape from texture. Multiresolution analysis, the so-called wavelet transform, is achieved by using a window function whose width changes as the frequency changes (Mallat 1989). If the window function is Gaussian, the obtained transform is called the Gabor transform (Turner 1986, Clark & Bovik 1987). A two-dimensional Gabor filter is sensitive to a particular frequency and orientation. Other spatial/spatial-frequency methods include the difference-of-Gaussians (Marr 1982) and the pseudo-Wigner distribution (Jacobson & Wechsler 1982). Texture description with these methods is done by filtering the image with a bank of filters, each filter having a specific frequency (and orientation). Texture features are then extracted from the filtered images. Often many scales and orientations are needed, which results in texture features of very high dimensionality. Dimensionality can be reduced by considering only those bands that have high energy (Reed & Wechsler 1988, Jain & Farrokhnia 1991). Alternatively, redundancy can be reduced by optimizing the filter design so that the frequency space is covered in a desired manner (Manjunath & Ma 1996). For a detailed discussion on spatial/spatial-frequency methods see Reed and Wechsler (1990) and Reed and Wechsler (1991).
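
The filter bank idea can be sketched as follows; the real-valued Gabor kernel, the particular frequencies and orientations, and the use of mean energy as the feature are illustrative assumptions.

    import numpy as np
    from scipy import ndimage

    def gabor_energy_features(image, frequencies=(0.1, 0.2, 0.4),
                              orientations=(0, 45, 90, 135), size=15):
        # Filter the image with a bank of Gabor filters, each sensitive to
        # a particular frequency and orientation, and use the energy of
        # each filtered image as a texture feature.
        img = np.asarray(image, dtype=float)
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        features = []
        for f in frequencies:
            sigma = 0.5 / f  # envelope width tied to the tuned frequency
            for deg in orientations:
                t = np.deg2rad(deg)
                u = x * np.cos(t) + y * np.sin(t)  # rotated coordinate
                kernel = (np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
                          * np.cos(2 * np.pi * f * u))
                response = ndimage.convolve(img, kernel)
                features.append(np.mean(response ** 2))  # band energy
        return np.array(features)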

When choosing a texture analysis algorithm, a number of aspects should be considered (note that the need for a given property is dictated by the problem at hand):

  1. illumination (gray scale) invariance: how sensitive the algorithm is to changes in gray scale. This is particularly important, for example, in industrial machine vision, where lighting conditions may be unstable.
  2. spatial scale invariance: can the algorithm cope if the spatial scale of the unknown samples to be classified differs from that of the training data.
  3. rotation invariance: does the algorithm cope if the orientation of the images changes with respect to the viewpoint.
  4. projection invariance (3-D texture analysis): in addition to invariance with respect to spatial scale and rotation, the algorithm may have to cope with changes in tilt and slant angles.
  5. robustness with respect to noise: how well the algorithm tolerates noise in the input images.
  6. robustness with respect to parameters: the algorithm may have several built-in parameters; is it difficult to find the right values for them, and does a given set of values apply to a large range of textures.
  7. computational complexity: many algorithms are so computationally intensive that they cannot be considered for applications with high throughput requirements, e.g. real-time visual inspection and retrieval from large databases.
  8. generativity: does the algorithm facilitate texture synthesis, i.e. regenerating the texture that was captured with the algorithm.
  9. window/sample size: how large a sample the algorithm requires to produce a useful description of the texture content.

A number of classification algorithms exist. Among the most widely used are parametric statistical classifiers derived from Bayesian decision theory, the nonparametric k-nearest neighbor classifier, and various neural networks such as multilayer perceptrons. See Fukunaga (1993) for an introduction to statistical pattern recognition methods and Pao (1993) for an introduction to neural network methods. For texture classification surveys see Conners and Harlow (1980), Du Buf et al. (1990), Ohanian and Dubes (1992), Ojala et al. (1996), and Weszka et al. (1976).
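
For example, the k-nearest neighbor rule can be sketched in a few lines; the squared Euclidean metric and majority voting used here are common but by no means the only choices.

    import numpy as np
    from collections import Counter

    def knn_classify(sample, train_features, train_labels, k=3):
        # Nonparametric k-nearest neighbor rule: the sample receives the
        # majority label among its k closest training samples.
        d = np.sum((np.asarray(train_features) - sample) ** 2, axis=1)
        nearest = np.argsort(d)[:k]
        votes = [train_labels[i] for i in nearest]
        return Counter(votes).most_common(1)[0][0]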

Given a texture description method, its performance is often demonstrated with a texture classification experiment, which typically comprises the following steps (note that not all steps may always be needed and the order of the steps may vary):

  1. selection of image data: the image data and textures may be artificial or natural, possibly obtained in a real-world application. The so-called Brodatz (1966) textures are probably the most widely used image data in the texture analysis literature. Other well-known data sets are the VisTex and MeasTex textures. An important part of the selection of image data is the availability and quality of the ground truth associated with the images: do we really know that each image indeed represents the texture category it is supposed to represent according to the ground truth?
  2. partitioning of the image data into subimages: the available image data are often limited in terms of the number of original source images; hence, to increase the amount of data, the images are divided into subimages, either overlapping or disjoint, of a particular window size (steps 2-4 are illustrated in the sketch after this list).
  3. preprocessing of the (sub)images: the (sub)images may have different gray scale properties. In texture analysis the goal is to discriminate (sub)images based on texture, not on first or second order gray scale properties. Therefore the (sub)images are often preprocessed to have a uniform gray scale distribution, or equal first and second order statistics, by histogram equalization, for example.
  4. partitioning of the (sub)image data into training and testing sets: in order to obtain an unbiased estimate of the performance of the texture classification procedure, the training and testing sets should be independent. Different approaches can be used, including N-fold (the collection of (sub)images is divided into N disjoint sets, each of which is used for testing in turn while the remaining N-1 sets serve as training data), leave-one-out (each (sub)image is classified one by one so that all other (sub)images serve as the training data) and holdout (the data are, preferably randomly, divided into separate training and testing sets; this can be repeated for a number of iterations for a more reliable estimate of performance).
  5. selection of the classification algorithm: in addition to the classification algorithm this may involve other selections, such as metrics or (dis)similarity measures. The selection of the classification algorithm can have a great impact on the final performance of the texture classification procedure - no classifier can survive with poor features, but good features can be wasted with a poor classifier design.
  6. definition of the performance criterion: two basic alternatives are available, analysis of feature values and analysis of class assignments, of which the latter is used much more often. In the former, the similarity of feature values between the training and testing sets, or the separation of the class clusters provided by the feature values, provides the basis for quantitative performance analysis. In the case of class assignments, the items in the testing set are classified, and the proportion of correctly (classification accuracy) or erroneously (classification error) classified items is used as the performance criterion.
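
A minimal sketch of steps 2-4 is given below, assuming 8-bit gray scale images stored as NumPy arrays; the window size, the number of gray levels and the holdout fraction are illustrative parameters.

    import numpy as np

    def subimages(image, window=64):
        # Step 2: divide a source image into disjoint subimages.
        h, w = image.shape
        return [image[i:i + window, j:j + window]
                for i in range(0, h - window + 1, window)
                for j in range(0, w - window + 1, window)]

    def equalize(image, levels=256):
        # Step 3: histogram equalization removes first-order gray scale
        # differences between (sub)images.
        hist = np.bincount(image.ravel(), minlength=levels)
        cdf = np.cumsum(hist) / image.size
        return np.floor((levels - 1) * cdf[image]).astype(np.uint8)

    def holdout_split(samples, labels, fraction=0.5, seed=0):
        # Step 4: random holdout partitioning into independent training
        # and testing sets; vary the seed for repeated iterations.
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(samples))
        cut = int(fraction * len(samples))
        train, test = order[:cut], order[cut:]
        return ([samples[i] for i in train], [labels[i] for i in train],
                [samples[i] for i in test], [labels[i] for i in test])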

It is obvious that the final outcome of a texture classification experiment depends on numerous factors, both in terms of the possible built-in parameters of the texture description algorithm and the various choices in the experimental setup. Results of texture classification experiments have always been susceptible to dependence on individual choices in image acquisition, preprocessing, sampling etc., since no performance characterization has been established in the texture analysis literature. Haralick (1994) criticized this questionable status quo from the perspective of computer vision, and the criticism applies to texture analysis as well: "This is an awful state of affairs for the engineers whose job is to design and build image analysis or machine vision systems." Therefore, all experimental results should be considered applicable only to the reported setup. Fortunately, there is some recent work aimed at improving the situation with standardized test benches, for example the MeasTex framework for benchmarking texture classification algorithms (Smith & Burns 1997). Additionally, an increasing number of researchers are making the imagery and algorithms used in their work publicly available on the web.