In this section we present genetic programming, being the fourth member of the evolutionary algorithm family. Besides the particular representation (using trees as chromosomes) it differs from other EA strands in its application area. While the EAs are typically applied to optimization problems, GP could be rather positioned in machine learning. In terms of nature of this deferent problem types, most other EAs are used for finding some input realizing maximum payoff, whereas GP is used to seek models with maximum fit. Clearly, once maximization is introduced, modelling problems can be seen as special cases of optimization. This, in fact, is the basis of using evolution for such tasks: models are treated as individuals, their fitness being the model quality to be maximized.
The one glance summary of GP is given in Table 1.
2. INTRODUCTORY EXAMPLE
We will consider a credit scoring problem within a bank that lends money and keeps a track on how its customers are paying back their loan. This information about the clients can be used to develop a model describing good, respectively bad customers. Later on, this model can be used to predict customerâ„¢s behavior and thereby assist in evaluating future loan applicants. Technically, the classification model will be developed based on (historical) data holding personal information along with a credibility index (good or bad) of customers. The model will have personal data as input and produce a binary output. For instance, the annual salary, the marriage status, and the number of children can be used as input. In Table 2 a small data set is shown. A possible classification model using these data might be the following:
IF (Number of children = 2) AND (Salary > 80000) THEN good ELSE bad In general, the model will look like this:
IF formula THEN good ELSE bad.
Notice, that formula is the only unknown in this rule, all other elements are fixed. Our goal is thus to find the optimal formula, forming an optimal rule that classifies a maximum number of known clients correctly.
At this point we have formulated our problem as a search problem in the space of possible formulas 1 , where the quality of a formula _ can be defined as the percentage of customers correctly classified by the model IF _ THEN good ELSE bad. In evolutionary terms we have defined the phenotypes (formulas) and the fitness (classification accuracy). In accordance with the typical GP approach we use parse trees as genotypes representing formulas. Figure 1 shows the parse tree of the formula above. This representation differs from the ones used in GAs or ES in two important
Â¢ The chromosmes are non-linear structures, while in GAs and ES they are typically linear vectors of the type _ v element of D1 ,Â¦, _ Dn , Di being the domain of vi .
Â¢ The chromosmes can differ in size, measured by the number of nodes of the tree, while in GAs and ES their size, the chromosme length n, is usually fixed.
This new type of chromosomes necessitates new, suitable variation operators acting on trees. Such crossover and mutation operators will be discussed in the sections 4 and 5. As for selection, notice that it only relies on fitness information and therefore it is independent from the chromosme structure. Hence, any selection scheme known in other EAs, e.g., fitness-proportional with generational replacement, can be simply applied.
* Notice, that we have not defined the syntax, thus the space of possible formulas, exactly. For the present treatment this is not needed and Section 3 will treat this issue in general.
As the introductory example has shown, the general idea in GP is to use parse trees as chromosomes. Such parse trees are to capture expressions in a given formal syntax. Depending on the given problem and the users perception on what the solutions must look like, this can be the syntax of arithmetic expressions, formulas in first-order predicate logic, or code written in a programming language. To illustrate the matter, let us consider one of each of these types of expressions: an
These examples illustrate how generally parse trees can be used and interpreted. Depending on the interpretation GP can be positioned in different ways. From a strict technical point of view, GP is simply an variant of GAs working with a different data structure: the chromosomes are trees. This view disregards the application dependent interpretation issues. Nevertheless, such parse trees are often given an interpretation. In particular, they can be envisioned as executable codes, that is, programs. The syntax of functional programming, e.g, the language LISP, very closely matches the so-called Polish notation of expressions. For instance, the formula in equation 1 can be rewritten in this Polish notation as
Adopting this perception GP can be positioned as the \programming of computers by means of natural selection" , or the \automatic evolution of computer programs" . In the following we describe genetic programming with a rather technical avor, that is, we emphasize the mechanisms, rather than the interpretation and context specific issues. Technically speaking the specification of how to represent individuals in GA boils down to defining the syntax of the trees, or equivalently the syntax of the symbolic expressions (s-expressions) they represent. This is commonly done by defining a function set and a terminal set. Elements of the terminal set are allowed as leaves, while symbols of function set are internal nodes. For example, a suitable function and terminal set that allows the expression in equation 1 as syntactically correct is given the following table.
Table 3: Function and terminal set that allow the expression in equation 1 as syntactically correct Strictly speaking a function symbol from the function set also must be given an arity, that is the number of attributes it takes must be specified. For standard artithmetic or logical functions this is often omitted. Furthermore, for the complete specification of the syntax a definition of correct expressions (thus trees) based on the function and terminal set must be given. This definition follows the general way of defining terms in formal languages and therefore is also often omitted.
For the sake of completeness we provide it below:
* All elements of the terminal set T are correct expressions.
* If f element of F is a function symbol with arity n and e1 ,Â¦., en are correct expressions, then so is
f(e1 ,Â¦Â¦, en ).
*There are no other forms of correct expressions.
Note that in this definition we do not distinguish different types of expressions, each function symbol can take any expression as argument. This feature is known as the closure property in GP. In general, it can happen that function symbols and terminal symbols are typed and there are syntactic requirements excluding wrongly typed expressions. For instance, one might need both arithmetic and logical function symbols, e.g., to allow (N = 2) ^ (S > 80:000)) as a correct expression. In this case it must be enforced that an arithmetic (logical) function symbol only has arithmetic (logical) arguments, e.g., to exclude N ^ 80:000 as a correct expression. This issue is addressed in strongly typed Genetic Programming .
Before we go into the details of variation operators in GP, let us note that very often GP uses mutation OR crossover in one step. This is in contrast to GA and ES, where crossover (recombination) is performed, followed by mutation. That is, in those EA branches one uses crossover AND mutation in a (combined) step. This subtle difference is visible on the GP flowchart given in Figure 4 after Koza . This chart compares the loop for filling the next generation in a generational GA with that of GP
(Download Full Report And Abstract)