Theoretical aspects of pattern analysis
A simple introduction to principal component and cluster analysis
Van Ooyen, A. (2001). In: Dijkshoorn, L., Towner, K. J., and Struelens, M., eds. New Approaches for the Generation and Analysis of Microbial Typing Data. Amsterdam: Elsevier, pp. 31-45.
Abstract
The purpose of most pattern detection methods is to reduce the variation in a data set to a more manageable form by recognizing classes or groups. The data typically consist of a set of objects, each described by a number of characters. An object could be, for example, a strain of bacteria, and a character could be how well that strain grows on a particular carbon source, or whether it contains a particular protein. In microarray data, an object could be a gene, and a character could be the level of expression of that gene under a particular condition.
If the objects were always described by only two or three characters, there would not be much need for pattern detection methods. Just plotting the data in two or three dimensions, respectively, would be sufficient to distinguish groups. (The number of dimensions is the number of axes needed to plot the data, with one axis for each character.) Typically, however, objects are characterised by more than three characters, so that simply plotting the data is not possible and other ways of representing the data need to be found. Two basic approaches have been taken to manage large data sets: principal component analysis and cluster analysis.
In this chapter, simple examples of both principal component analysis and cluster analysis are given to explain the ideas behind the methods.
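As a rough illustration of the two approaches (not the chapter's own examples), the following minimal Python sketch applies both to a small made-up table of objects and characters. The data values, the function names `pca_scores` and `single_linkage`, and the choice of single-linkage clustering with Euclidean distance are all assumptions made for this sketch; other distance measures and clustering rules are equally possible.

```python
import numpy as np

# Hypothetical data: 6 objects (rows, e.g. strains) x 4 characters (columns).
X = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.1],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.1, 1.0, 1.0],
])


def pca_scores(data, n_components=2):
    """Project objects onto the first principal components (via SVD)."""
    centered = data - data.mean(axis=0)  # centre each character at zero
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    # Fraction of the total variance carried by each component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


def single_linkage(data, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(data))]
    # Pairwise Euclidean distances between all objects.
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


scores, explained = pca_scores(X)          # 4 characters reduced to 2 axes
groups = single_linkage(X, n_clusters=2)   # objects grouped by similarity
print(scores.round(2))
print(explained.round(3))
print(groups)
```

With data like these, the four characters collapse onto essentially one axis that separates the first three objects from the last three, and the clustering recovers the same two groups. Plotting the two columns of `scores` would give the kind of two-dimensional picture that is not directly available for the original four-dimensional data.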