Unsupervised Neural Networks
Self organizing maps are generally used in many fields like bio-informatics, neural networksÂ¦.It is basically used for the grouping of data. In decision making we reach a conclusion by making certain tests to the input given to the system. We will study the outputs obtained by the tests and reach a solution. Consider a decision making system which checks whether a cloth is wet or not. If we give the parameters of cloth we need to know whether it is wet or not. Simply we can say if moisture is there, the cloth is wet. But that is not the case always. The wet nature can vary. We cannot take specific boundary to distinguish between the wet cloth dry one.
In SOM we are making some sort of decision-making. Here we will input one data and we will take a set of random points in the space where the data is put. Usually that will be two-dimensional plane. All the points will be having own weight vectors. Those points are called the neurons. Then another algorithm called winner-take over used. Here the distance between the input and each of the neurons are calculated and one is selected according to the minimum distance. Then the input is moved towards the neuron along with its neighborhood. Neighborhood is the set of neurons with in a certain distance of the input. Again the process is continued till the similarity reaches a certain limit, that allows the input to accept the weight of one neuron as its own. This automatic mapping of input to one of the neuron is called the SOM. This technology is used to find the weight of unknown thing.
These notes provide an introduction to unsupervised neural networks, in particular Kohonen self-organizing maps; together with some fundamental background material on statistical pattern recognition.
One question which seems to puzzle many of those who encounter unsupervised learning for the first time is how can anything useful be achieved when input information is simply poured into a black box with no provision of any rules as to how this information should be stored, or examples of the various groups into which this information can be placed. If the information is sorted on the basis of how similar one input is with another, then we will have accomplished an important step in condensing the available information by developing a more compact representation. We can represent this information, and any subsequent information, in a much reduced fashion. We will know which information is more likely. This black box will certainly have learned. It may permit us to perceive some order in what otherwise was a mass of unrelated information to see the wood for the trees.
In any learning system, we need to make full use of the all the available data and to impose any constrains that we feel are justified. If we know that what groups the information must fall into, that certain combinations of inputs preclude others, or that certain rules underlie the production of the information then we must use them. Often, we do not possess such additional information. Consider two examples of experiments. One designed to test a particular hypothesis, say, to determine the effects of alcohol on driving; the second to investigate any possible connection between car accidents and the driverâ„¢s lifestyle. In the first experiment, we could arrange a laboratory-based experiment where volunteers took measured amounts of alcohol and then attempted some motor-skill activity (e.g., following a moving light on a computer screen by moving the mouse). We could collect the data (i.e., amount of alcohol vs. error rate on the computer test), conduct the customary statistical test and, finally, draw our conclusions. Our hypothesis may that the more alcohol consumed the greater the error rate we can confirm this on the basis of this experiment. Note, that we cannot prove the relationship only state that we are 99% certain (or whatever level we set ourselves) that the result is not due purely to chance.
The second experiment is much more open-ended (indeed, it could be argued that it is not really an experiment).Data is collected from a large number of drives those that have been involved in accidents and those that have not. This data could include the driver's age, occupation, health details, drinking habits, etc. From this mass of information, we can attempt to discover any possible connections. A number of conventional statistical tools exist to support this (e.g., factor analysis). We may discover possible relationships including one between accidents and drinking but perhaps many others as well. There could be a number of leads that need following up. Both approaches are valid in searching for causes underlying road accidents. This second experiment can be considered as an example of unsupervised learning.
The next section provides some introductory background material on statistical pattern recognition. The terms and concepts will be useful in understanding the later material on unsupervised neural networks. As the approach underlying unsupervised networks is the measurement of how similar (or different) various inputs are, we need to consider how the distances between these inputs are measured. This forms the basis Section Three, together with a brief description of non-neural approaches to unsupervised learning. Section Four discusses the background to and basic algorithm of Kohonen self-organizing maps. The next section details some of the properties of these maps and introduces several useful practical points. The final section provides pointers to further information on unsupervised neural networks.
2. Statistical Pattern Recognition
2.1 Elements of Pattern Recognition
The goal of pattern is to classify objects into one of a number of categories or classes. The objects of interest are generically called patterns and may be images, speech signals or entries in a database. It is these patterns, which we can loosely define as the natural structure, that gives meaning to events in the external world. Patterns are manifestations of rules, and through observing patterns we are able to propose rules and so form an Understanding of the underlying processes.
Supervised learning is where its correct class label accompanies each training input pattern. The difference, or error, between the recognition systemâ„¢s current response and the desired one is used to modify the system so as to reduce this error.
2.2 Pattern space and vector
Patterns are sets of measurements. For e.g., we could take measurements of height and weight for members of different rowing crews and produce a graph as shown in figure 2.2.1(a). This graph represents our patterns space that is the two dimensional space within which must lay all possible rowing crew members. This space is also called the measurement of observation space (as it is the space in which measurements or observations are made) There are two groupings of our experimental data or measurements, which we can (as an external teacher with our additional wider knowledge of rowing) identity as a group of rowers and a group of coxes. Given just the information of height and weight about an individual (i.e. the points A or B), we need to make a decision whether this individual is a Cox or a rower that is; we wish to classify the information into one of the two classes.
(a) Two- dimensional pattern space for rowing crews.
(b) Three-dimensional pattern space by augmenting with the additional measurement of age.
One approach would be to construct a line, based on the original data on which we trained the system (the training data) that separated the two groups. This is termed a decision line as data points lying to one side of the line are determined to be, say, rowers, and to the other side to be coxes. So individual ËœAâ„¢, represented by the data point A, is assumed to be a rower. While individual ËœBâ„¢ is assumed to be a Cox. An alternative approach is to consider each grouping as a cluster, and measure the distance from the points A and B to each of the cluster centers. The shortest distance will determine which class the points are a member of. If we take another independent measurement, for example Ëœageâ„¢, then the pattern space becomes three dimensional (fig. 2.2.1(b)), the clusters become spheres and the decision surface become a plane.
(An idealized Optical Character Reader)
We may take many more measurements than three. For example, the idealized optical character (OCR) illustrated in figure 2.2.2, possesses an array of 7*5 photodiodes, which yield 35 independent output voltages or measurements. This set of measurements is the measurement vector, X.
An idealised Optical Character Reader. The simple camera produces, through the 5*7 array of photodiodes, a set of 35 voltages (which are zero volts when no light falls on a diode to one volt for the maximum intensity of light falls on it). This set is the measurement vector, X. The individual elements of X are random variable that is scalar quantities lying between 0 and 1. The measurement vectors for, say, all ËœBâ„¢s will be similar, so each ËœBâ„¢ will produce a point in the 35 dimensional pattern space that is close to all other points produce by letter ËœBâ„¢s. Similar cluster will be formed for the other letters.
Where N is the number of measurements (i.e., the dimensionality of the patterns space). It is often more compact to use the transpose of this vector, i.e.
xt = [ x1 x2 Â¦Â¦Â¦xk Â¦Â¦Â¦ xN ]
Each element of this vector, Xk, is a random variable this means that we cannot predict the value of Xk before we take the measurement. If a photodiode voltage lies between 0 and 1, then Xk will lie somewhere in this range.
Delineation of pattern classes in pattern space.
The pattern space, in this example, possesses 35 dimensions &each dimension is orthogonal to all the others. Clearly, we now refer to hyperspace, hyper spheres and hyper surfaces .All the normal rules of geometry apply to these high dimensional spaces (we just cannot visualize them!).We may hope that there are 26 separate clusters each representing a single letter of the alphabet. In this situation, it should be relatively easy to construct hyper surfaces that separate each cluster from the remainder. This may not be so.
Consider the two-dimensional pattern space of Figure2.2.4, in which there are three pattern classes. The first case (a) is relatively simple as only two decision surfaces are needed to delineate the pattern space into the three class regions.
The second case (b) looks more complicated as many more decision surfaces are required to form uniquely identifiable individual case regions. However, if we transformed our two measurements in some way, we could form the two thicker curved decision surfaces. The last case © is different. Here, radically different measurements (patterns) are associated to the same class. There is no way to transform our two measurements to recover the situation where only two decision surfaces are sufficient. We are now associating patterns, which bear no similarity to each other, to the same pattern class. This is an example for hard learning for which we must use supervised learning.
2.3 Features and decision spaces
As the previous section noted, it can sometimes be beneficial to transform the measurements that we make as this can greatly simplify the task of classification. It can also be unwise to use the raw data in the case of the OCR system, the 35 individual photodiode voltages. Working in such high dimensional pattern space is not wasteful in terms of computing resources but can easily leads to difficulties in accurate pattern classification. Consider the situation illustrated in Figure2.3.1, where an OCR system has inadvertently been taught letter ËœOâ„¢s that are always circular and letter ËœQâ„¢s that are always elliptical. After training, the system is exposed to elliptical ËœOâ„¢s and circular ËœQâ„¢s. As more picture elements (pixels) match for the main body of the letters, rather than for the small cross-line, the system will make the incorrect classification. However, we know that the small cross-line is the one feature that distinguishes ËœQâ„¢s from ËœOâ„¢s.
Problems of using raw measurement data are illustrated here for a pattern recognition system trained on the unprocessed image data (image pixel intensity values). The system subsequently makes incorrect decisions because the measurement vectors are dominated by the round bodies of the letter and not by crucial, presence or absence of, the small cross-line.
A more reliable classifier could be made if we emphasized the significance of the cross-line by ensuring that it was an essential feature in differentiating between ËœOâ„¢s and ËœQâ„¢s.
Possible feature table for the OCR system.
Pattern class / line \ line | line --line Partial circle Closed circle End-points
A 1 1 0 1 0 0 2
B 0 0 1 3 2 0 0
O 0 0 0 0 0 1 0
P 0 0 1 2 1 0 1
Q 0 1 0 0 0 1 2
Z 1 0 0 2 0 0 2
Functional block diagram of a generalized pattern recognition system
3. Unsupervised Pattern Classification
We need to measure the similarity of the various input pattern vectors so that we can determine which cluster or pattern class each vector should be associated with. This chapter discusses a number of the common metrics employed to measure this similarity. It also mentions briefly non-neural methods for unsupervised pattern classification.
3.2 Distance metrics
To record the similarity, or difference, between vectors we need to measure the distance between these vectors. In conventional Euclidean space, we can use Pythagorasâ„¢s theorem (Figure 4.2.1(a)). Normally, the distance between the two-dimensional vectors x and y is given by
d(x, y) = | x - y | = [ (x1- y1)2 + (x2 â€œ y2)2] 1/2
This can be extended to N dimensions, to yield the general expression
for Euclidean distance,
We can use many different measures which all define different kinds of metric space that is a space where distance has meaning. For a space to be metric, it must satisfy the following three conditions:
* If the vectors x and y are different, the distance between them must be positive. If the vectors are identical, then the distance must be zero.
* d(x, y) =d(y, x) the distance between x to y is the same as the distance between y to x.
* d(x, y) +d(y, z)>= d(x, z) The distance between x and z must be equal to or greater than the distance between x to y and the distance between y to z. This is the triangle inequality (figure4.2.1 (b))
(a) The Euclidean distance between vectors x and y.
(b) The triangle inequality d(x,y)+d(y,z)>=d(x,z).
3.3 Dimensionality Reduction
One important function of statistical pattern recognition is dimensionality reduction. The set of measurements (i.e., pattern space) originally taken may be too large and not the most appropriate. What we require is means of reducing the number of dimensions but at the same minimizing any error resulting from discarding measurements. If the dimensionality of the pattern space is Ëœpâ„¢, then we cannot simply keep Ëœmâ„¢ of these (where m<p).We require some kind of transformation that ranks its dimensions in terms of the variance of the data. Why should this be so
The vectors v1 and v2 are most approximate for representing the data clusters 1 and 2 respectively, as these are the directions which account for the greatest variance in each of the clusters. But are they the best for discriminating the two clusters from each other
Consider the scatter of data points in figure 4.3.1. The vector along the direction of greatest scatter (i.e., variance) captures the most important aspects of a particular data class. There are several important statistical methods based on transforming the measurement space based a different set of axes which may be more appropriate indiscriminating the various data clusters (hopefully in fewer dimensions). The general approach is called multivariant analysis with principle component analysis being the oldest and most common. There are many pitfalls for the novice in applying such techniques. The alternative approach, based on clustering methods, is perhaps easier to comprehend. It should be noted that there is no universal technique that is guaranteed to yield the best solution for all cases. So much in statistical analysis (and neural computing) is data dependent as illustrated in figure 4.3.2.
(a) Classes best separated using transform methods (e.g., principal component analysis).
(b)Classes best separated using clustering methods.
4. Unsupervised Neural Networks
At the start of Section One, we mentioned supervised learning where the desired output response of a network is determined by a set of targets. The general form of the relationship or mapping between the input and output domains are established by the training data. We say that an external teacher is needed to specify these input/output pairings. Networks that are trained without such a teacher learn by evaluating the similarity between the input patterns presented to the network. For example, if the scalar product is used as a measure of the similarity between input vectors and network's weight vectors, then an unsupervised network could adapt its weights to become more like the frequently occurring input vectors. So such networks can be used to perform cluster analysis. They make use of the statistical properties of the input data as frequently occurring patterns will have a greater influence than infrequent ones. The final trained response of the network must resemble in some way the underlying probability density function of the data.
Unsupervised learning makes use of the redundancy present in the input data in order to produce a more compact representation. There is a sense of discovery with unsupervised learning. An unsupervised network can discover the existence of unlabelled data clusters, but it cannot give them meaningful names to these clusters nor can it associate different clusters as representatives of the same class (Figure 5.1.1)
(a) Cluster (class) discrimination in an unsupervised system.
(b) Class (cluster) discrimination in supervised system
4.2 Winner-take-all Network
The basic form of learning employed in unsupervised networks is called competitive learning, and as the name suggests the neurons compete among each other to be the one that fires. All the neurons in the network are identical except that they initialized to have randomly distributed weights. As only one neuron will fire, it can be declared the winner. A simple winner-take-all network is shown in Figure 5.2.1
This network of p neurons will classify the input data x, into one of p clusters. The output of the network, y= [y1 y2 Â¦Â¦Â¦Â¦ yp], is given by
Y= W x
where W is the weight matrix
and the individual weight vector of the iâ„¢th neuron is given by
The first task is to discover the winning neuron that is the neuron with a weight vector most like the current input vector (Remember that the weight vectors were all originally set to random values). This is usually measured in terms of the Euclidean distance. Hence the winning neuron ,m, is the neuron for which | x-w^i | is a minimum. Instead of measuring each of these Euclidean distances and then identifying the smallest, an equivalent operation is to use the scalar product. That is the winning neuron is the one for which is a maximum.
This simple network can perform single-layer clustering analysis the clusters must be linearly separable by hyper planes passing through origin of the encompassing hyper sphere. We also need to specify a priori the number of neurons that is the number of identifiable data clusters.
5. Kohonen Self-Organizing Map
The SOM consists of a (usually) one or two-dimensional array of identical neurons. The input vector is broadcast in parallel to all these neurons. For each input vector, the most responsive neuron is located (using a similar method to that of the winner-take-all network). The weights of this neuron and those within a neighborhood around it are adapted as in the winner-take-all network to reduce the distance between its weight vector and the current input vector. The training algorithm will be described in detail before discussing the operation of the SOM.
5.2 SOM Algorithm
Assume an output array (Figure 5.2.1) of two dimensions with
k x k neurons, that the input samples, x, have dimensionality N, and that index n represents the nth presentation of an input sample.
. All weight vectors are set to small random values. The only strict condition is that all weights are different.
. Select a sample input vector x and locate the most responsive (winning) neuron using some distance metric usually the Euclidean distance
That is | x(n)-wi| is a minimum (and j=1,2,Â¦Â¦. M, where M=K.K)
. Adapt all weight vectors, including those of the winning neuron, within the current neighborhood region. Those outside this neighborhood are left unchanged.
Where is the current adaptation constant and is the current neighborhood size centered on the winning neuron.
. Modify, as necessary and, until no further change in the output feature map is observable (or some other termination condition) otherwise go to step two.
Two- dimensional self organizing map.
Further details on parameter selection and map format:
Â¢ Adaptation constant
Â¢ Neighborhood function
Â¢ Adaptation processes
1. Adaptation constant, ( )
For the first few 1,000 or so iterations, should be close to (but less than) unity and then gradually decrease to about 0.1 or less. The first phase of the learning is sometimes referred to as the ordering phase. During this phase, the neighborhood is relatively large and the weight vectors are changed by large amounts. The output space undergoes global topological ordering. During the second stage, where is much smaller, the space is merely refined at a local level. This second stage is sometimes referred to as the convergence stage.
2 Neighborhood function, ( )
The neighborhood usually assumes a size of between a third and a half of the full array at the start of training, and falls gradually during training to a size where only one layer of direct neighbors or none at all, lie within the neighborhood. This reduction in can be stepped. In the above description of the SOM algorithm, it was assumed that the neighborhood was square and that all neurons within this region where adapted by the same amount. The neighborhood could be hexagonal (figure 6.2.2), or the effective change in the weight vectors within the neighborhood could be weighted so that neurons close to the centre of the neighborhood are proportionally changed more than those at its boundary. A suitable profile for this neighborhood weighting profile would be Gaussian. There may be some advantages, depending on the application and the quality and quantity of input data, for using a Gaussian profile.
3 Adaptation processes
A typical learning sequence could be as follows:
Phase Neighbourhood radius Adaptation
Ordering Half map dimension (K/2) to 3 in
discrete steps. 0.9 to 0.1 5,000
Convergence 3 to 1 0.1 to 0 50,000
The convergence of the SOM is not critical on the settings of these parameters, though the speed of convergence and the final state of the topological ordering do depend on the choice of the adaptation constant, and the neighborhood size and rate of shrinking. It is important that consecutive input samples are uncorrelated.
Devised by Kohonen in the early 80's, the SOM is now one of the most popular and widely used types of unsupervised artificial neural network.
It is built in a one-or two-dimensional lattice of neurons for capturing the important features contained in an input (data) space of interest. In so doing, it provides a structural representation of the input data by the neurons weight vectors as prototypes. The SOM algorithm is neurobiologically inspired, incorporation all the mechanisms that are basic to self-organization: competition, cooperation, and self-amplification.
The development of self-organizing map as a neural model is motivated by distinct feature of the human brain. What is astonishing about Kohonenâ„¢s SOM algorithm is that it is simple to implement, yet mathematically so difficult to analyze its properties in a general setting.
I express my sincere thanks to Prof. M.N Agnisarman Namboothiri (Head of the Department, Computer Science and Engineering, MESCE), Mr. Zainul Abid (Staff incharge) for their kind co-operation for presenting the seminars.
I also extend my sincere thanks to all other members of the faculty of Computer Science and Engineering Department and my friends for their co-operation and encouragement.