In the Police Artist task for Clavier, every possible binary string output is evaluated with respect to its (Hamming) distance from the target string. The percentile reinforcement schedule requires only that every possible output of the system be able to be ordinally ranked with respect to some criterion. We can treat the function that evaluates the outputs separately from the neural network or the reinforcement schedule. Different tasks for Clavier can be thought of as different evaluation functions over the space of binary strings of a fixed length.
Because the output of Clavier is intended to model animal behavior under control of shaping, it is important to select a task with analogies to the complex, sequential, motor tasks commonly shaped in animals. By choosing an appropriate evaluation function, the binary strings can be interpreted as such a motor task. Kozar (1994) recommends the Lawnmower task for a number of reasons. From the perspective of the current status into the investigations of Clavier, two features should be noted.
The interpretation of the symbol string output for the Lawnmower problem is a sequence of actions, such as TURN-LEFT, MOW-FORWARD, and JUMP-TO-position. As such, it resembles the types of tasks shaped using the percentile reinforcement schedules, both in humans and other animals. Further, unlike the Hamming distance, slight differences in output strings may have radically different values. This corresponds to a very unsmooth fitness surface with many local minima. Such fitness surfaces tend to be handled far better by genetic programming algorithms and simulated annealing than by neural networks. Since the Clavier system has certain features similar to neural networks and other features similar to simulated annealing, performance on the Lawnmower problem should prove informative as to the search characteristics of Clavier.
The Lawnmower task. Kozar (1994) uses several variants of the Lawnmower task. A simplified version of these was used to conduct preliminary tests of Clavier. In general, the Lawnmower task consists of a rectangular "lawn," divided into squares, each of which may be in one of two states, mown or unmown (see Figure E-1). The lawn begins entirely unmown. A "lawnmower" is placed on one square of the lawn facing either north, east, south or west. A series of commands is given to the lawnmower in the form of a symbol string. This symbol string is the lawnmower's "program." The lawnmower executes these programmed commands in sequence, turning, moving and mowing. The value or fitness of the symbolic program is the amount of lawn mowed by the lawn mower under control of that program.
A good learning algorithm will generate a series of programs for the lawnmower and use the amount of lawn mowed in each case to improve the fitness of later programs. Thus, a good learning algorithm will search the space of all possible programs of lawnmower commands and find one that directs the lawnmower to mow the lawn efficiently. Because of sequential dependencies in the programs, a small change in a program can radically alter its performance and thus its fitness. This makes for a very ugly fitness landscape and a very challenging problem for a learning algorithm.
There are a number of variations on the lawnmower problem. First, what happens when the lawnmower reaches the edge of the lawn? The lawn may be bounded, in which case, commands that instruct the lawnmower to move off the lawn are ignored. The lawn may "wrap-around," in which case, moving past the last column on the right causes the lawnmower to reappear in the first column on the left. Moving up too far causes the lawnmower to reappear at the bottom. Etc. Such a lawn is called toroidal, because the lawn behaves topologically as if it were stretched over a donut, with no edges. Second, in the case of bounded lawns, the starting position (and orientation) of the lawn mower may differ.
Third, there are variations in terms of what commands are permitted in the command program sequence. Usually, there are three commands, TURN-LEFT, MOW-FORWARD, and JUMP-TO-position, where position is a relative address, such as west-by-two-squares and north-by-three-squares. The TURN-LEFT command changes only the direction in which the lawnmower is pointed by 90 degrees and does not affect the mown/unmown status of the lawn. The MOW-FORWARD command does not change the direction of the lawnmower, but does change its position by one square and, if the new location square was unmown, changes that square's status to mown. The JUMP-TO-position command changes only the location of the lawnmower by the relative amount indicated.
The purpose of the TURN-LEFT and MOW-FORWARD commands is relatively obvious. And lawns can be mowed without the JUMP-TO command. The purpose of the JUMP-TO-position command is to allow the learning algorithm to generate a sub-program of TURN-LEFT and MOW-FORWARD commands that efficiently mows a patch the lawn. The JUMP-TO command can be used to link invocations of the sub-program to mow the entire lawn one patch at a time. Thus, learning algorithms can be tested for their ability to generate hierarchically structured programs.
The subprogramming feature of the Lawnmower problem is not of especial interest to us in testing Clavier, so only the TURN-LEFT and MOW-FORWARD commands were used. Since Clavier had already been tested with 64-bit strings, a lawn was constructed which would be difficult, but not impossible to mow with a sequence of 64 TURN-LEFT and MOW-FORWARD commands. A five-by-five toroidal lawn was used. An illustration of a random program executing in this lawn is illustrated in Figure E-1. A much more efficient program is provided to the reader as an exercise.
Clavier's performance on the Lawnmower task. The variant of the lawnmower task was chosen so as to use the same version of Clavier that was used in the Police Artist search task. The difference between the Lawnmower simulation program and the Police Artist simulation program was confined to the evaluation of the fitness of the output on each cycle. In the Police Artist simulation, fitness was the Hamming similarity to the target. In the Lawnmower simulation, fitness was the amount of lawn mowed by a program with zero interpreted as TURN-LEFT and one interpreted as MOW-FORWARD. In the event that the entire lawn was mowed by a program, the fitness was calculated as the size of the lawn plus any "unused" program symbols.
Over a wide variety of parameter values, Clavier performed poorly on the Lawnmower task. Significantly, thresholds on all units rose steadily and reinforcement was well below the programmed value of 30%. This indicates that the system was not learning, but responding randomly. Increasing the length of the output string from 64 to 100 (by increasing the number of emitter units) improved performance, despite the fact that many effective programs are substantially shorter than 64 bits long. Presumably, the fitness surface for the 100-bit space is more amenable to Clavier's search.
With thresholds set initially to .5, ones are output as likely as zeros. Since efficient programs for solving the Lawnmower problem involve substantially more MOW-FORWARDs than TURN-LEFTs, it may be that the shorter programs have too few ones (standing for MOW-FORWARD). As the thresholds rise due to insufficient reinforcement, the number of ones in the output string decrease. This decrease may lower the fitness due to the need for MOW-FORWARDs. A vicious cycle probably ensues.
If this scenario is accurate, then lowering the initial thresholds should improve performance as the thresholds rise through the optimum range. This hypothesis was confirmed. Reversing the encoding of the bits so that one stands for TURN-LEFT and zero stands for MOVE-FORWARD should also improve performance and perhaps break the vicious cycle. Performance did improve, but the thresholds still rose as learning did not occur.
Another untested possibility is that further increasing the size of the network (and thus the length of the output string) will improve the fitness landscape to the point where Clavier can effective search the space, perhaps finding short programs at the head of long strings.
Kauffman, S. (1995). At home in the universe: The search for the laws of self-organization and complexity. New York: Oxford University Press.
Kozar, J. R. (1994). Genetic Programming II: Automatic discovery of reusable progams. Cambridge, MA: MIT Press.
Narendra, K. S. & Thathachar, M. A. L. (1989). Learning Automata: An introduction. Englewood Cliffs, NJ: Prentice-Hall.
Platt, J. R. (1973). Percentile reinforcement: Paradigms for experimental analysis of response shaping. In G. H. Bower (Ed.), The Psychology of Learning and Motivation, Volume VII: Advances in research and theory (pp. 271-296). New York: Academic Press.