Clavier is a neural network that generates a binary string on each cycle. Simulations have been conducted with Clavier emitting a 64-bit string on each cycle, reinforced on a percentile reinforcement schedule. The goal was to get Clavier to emit a pre-designated target string. The space of strings was defined using the Hamming distance as the distance function. To date, simulations have been used to evaluate search and convergence with respect to arbitrary targets. These simulations were referred to as Police Artist simulations. Technical specifications for these simulations are given in Appendix C. This appendix reports preliminary results of those simulations.
Search. On every cycle, the probability of emitting each of the 264 possible binary strings is determined by the parameter values on that cycle. As reinforcement alters the parameter values, the probability distribution over the space of output strings changes. Thus the sequence of states of the neural network can be modeled as a sequence of random variables, one for each cycle (Narendra & Thathachar, 1989, ch. 5). Broadly speaking, shaping using the percentile reinforcement schedule involves reinforcing any response from that portion of the current distribution of responses most similar to the target response (Figure D-1).
The most recent responses emitted by the organism are used to estimate the underlying distribution for the current cycle. Real organisms alter their pattern of responding due to this selective reinforcement. The distribution of responses moves in the direction of the target until an organism that initially did not emit the desired target response at all now emits it regularly. The Clavier neural network has also demonstrated this ability. In simulations, under control of a percentile reinforcement schedule, Clavier, beginning with parameter values making it vanishingly unlikely that it will output the one target binary string out of the over 18 x 1018 possible strings, emits that target string in fewer than 350,000 cycles in every case thus far examined.
Typically, over a wide variety of measures, animal learning proceeds asymptotically, rapidly at first and then slowly as the target state is approached. In order to summarize the highly variable performance of Clavier, the measure of the Hamming similarity to the target for the best output so far was graphed for a typical simulation in Figure D-2. The typical decelerating "learning curve" is seen. (Note that, over the course of the 27,000 cycles, the output was the best-so-far, that is, closer to the target than any previous output only 16 times. By the standards of typical neural networks, Clavier is very slow. However, it must be remembered that Clavier works with only one bit of information per cycle, far less than neural networks using supervised learning or even reinforcement learning.)
A series of Monte Carlo simulations was performed in order to answer three questions: First, what proportion of Police Artist simulations of Clavier match the target exactly and how long does it take? Second, since the learning algorithm is asymmetric with respect to bit values of one and zero, it may be that the proportion of zeros to ones in the target string will affect search time. Third, how will adding the damping component suggested by Luce, affect performance?
Every emitter unit in Clavier uses a pseudo-random value to determine the activation level on each trial. Early simulations determined that small changes either in the target or in the random seed produced substantial changes in the number of cycles to successful search. Thus, in order to determine the effect of other conditions, such as proportion of zeros in the target or the use of damping, a number of simulations had to be run in each condition. For each set of conditions, 200 simulations were conducted. Ten targets were constructed stochastically using one random seed. For each target, twenty simulations were run using twenty different random seeds (selected manually from a random number table) for the stochastic component of the Clavier algorithm.
There were six conditions in all. The proportion of zeros in the targets were set to low (mean = .1), middling (mean = .5), or high (mean = .9). The damping was either used or not.
Each target was constructed by generating 64 pseudo-random numbers between zero and one and comparing them to the fixed bias for the target, which was set either to .1, .5, or .9. If the pseudo-random number exceeded the bias, that bit in the target was set to one. Otherwise, it was set to zero. Thus, the bias determined the mean proportion of zeros in the targets. Since, in the original demo simulation, zeros corresponded to black pixels and ones corresponded to white pixels, low bias (=.1) was termed the "light" condition, middling bias (=.5) was termed the "medium" condition, and high bias (=.9) was termed the "dark" condition.
In answer to the first question, successful search was achieved for all 1200 simulations. The slowest simulation took 345,006 cycles to complete. The target for this simulation was not randomly selected. It was the demonstration target, which contained 19 zeros out of 64 bits. Among the 1200 simulations with pseudo-random targets, the slowest took 278,901 cycles to complete. For twenty simulations using the demonstration target with the twenty different random seeds, the mean number of cycles to completion was 83,411.85.
The results of the 1200 simulations testing the differences between light and dark targets (different bias) with damped and undamped learning are summarized in Figure D-3. For undamped learning, dark targets were consistently matched faster. For damped learning, the differences in time to match targets generated with different biases were far smaller. Medium targets took slightly longer. Damped learning was also less variable within condition, except that undamped learning for dark targets had a very low variance. Interestingly, the mean time to match medium and dark targets was slightly faster using undamped learning.
Thus, in general, damped learning was far more consistent in almost all respects than undamped. Undamped simulations took substantially longer for targets with a low proportion of zeros.
Convergence. The percentile reinforcement schedule requires the use of some distance function (Platt, 1973). The Hamming distance was selected. Since evaluating convergence also requires a distance function, it seems reasonable to use the Hamming distance there as well. In general, convergence is defined in terms of the fact that, as the simulation goes to infinity, the distance between system's output and the target goes to zero. Since the neural network system is stochastic, there are several different definitions as to how the convergence to the limit at infinity is to be defined (Narendra & Thathachar, 1989).
Convergence is usually determined either by mathematical analysis or by Monte Carlo simulations. At present, no analysis has been conducted, though we are interested in collaborating with anyone interested in performing such an analysis. The Monte Carlo simulations reported above all were stopped at the successful completion of the search. As such, determination of convergence was not established. Still, the consistent successful search demonstrated by the Monte Carlo simulations suggests that convergence is generally obtainable.
At present, the only direct evidence that Clavier trained by percentile reinforcement using the Hamming distance will converge is due to a test simulation run past the the point of the first exact match to the target. Parameters were set to maximize speed with the Luce damping. The percentile reinforcement schedule was modified so that all outputs from the network that matched the target exactly were always reinforced (continuous reinforcement).
Initial match occurred at 6378 cycles. Immediately after the initial match, the outputs varied widely from the target. Two more exact matches occurred at 6499 and 6501 cycles. A continuous series of 7 exact matches occurred from cycles 6504 to 6510. From cycle 6512 on, every output was an exact match. The initial variability appears to be the result of an error catastrophe (Kauffman, 1995, p.184-185). The eventual sequence of exact matches suggests that convergence was achieved.
The next step will be to replicate the 1200 Monte Carlo simulations past the point of successful search in order to determine if and when convergence occurs.
Kauffman, S. (1995). At home in the universe: The search for the laws of self-organization and complexity. New York: Oxford University Press.
Narendra, K. S. & Thathachar, M. A. L. (1989). Learning Automata: An introduction. Englewood Cliffs, NJ: Prentice-Hall.
Platt, J. R. (1973). Percentile reinforcement: Paradigms for experimental analysis of response shaping. In G. H. Bower (Ed.), The Psychology of Learning and Motivation, Volume VII: Advances in research and theory (pp. 271-296). New York: Academic Press.