Validation of the training set for developing the four-body
statistical potentials
A random subset that is 70% of the size of the original training set (1167
chains) is selected, and the four-body potentials are developed using this
subset alone. The experiment is repeated ten times. The following table gives
the correlation between the log-likelihoods developed from each 70% subset and
the original four-body potentials (developed using the full training set).
Every subset yields scores that are highly correlated with the original
potentials (correlation coefficients of 0.87 to 0.89).
| SET | correlation |
|---|---|
| 70_set1 | 0.88 |
| 70_set2 | 0.88 |
| 70_set3 | 0.88 |
| 70_set4 | 0.88 |
| 70_set5 | 0.88 |
| 70_set6 | 0.87 |
| 70_set7 | 0.89 |
| 70_set8 | 0.87 |
| 70_set9 | 0.88 |
| 70_set10 | 0.89 |
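
The subsampling step described above can be sketched as follows. This is a minimal illustration, not the original code: the function name `draw_subsets`, the placeholder chain identifiers, the fixed seed, and the rounding of 70% of 1167 to a whole number of chains are all assumptions made for the example.

```python
import random


def draw_subsets(chain_ids, fraction=0.7, repeats=10, seed=0):
    """Draw `repeats` independent random subsets, each `fraction` of the input size.

    Sampling is without replacement within a subset; different subsets may overlap.
    """
    rng = random.Random(seed)                  # fixed seed (assumed) for reproducibility
    size = round(fraction * len(chain_ids))    # rounding rule is an assumption
    return [rng.sample(chain_ids, size) for _ in range(repeats)]


# Placeholder identifiers standing in for the 1167 training chains.
chains = [f"chain_{i}" for i in range(1167)]
subsets = draw_subsets(chains)
```

Each entry of `subsets` would then be passed to whatever procedure builds the four-body potentials, and the resulting log-likelihoods compared with those from the full set.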
The
correlation coefficients given above are calculated over the whole 8855 x 5
table of log-likelihoods (8855 possible quadruplet compositions in
each of the five classes defined by
backbone chain connectivity). On the other hand, if the calculation is
restricted to those log-likelihoods for which quadruplets were observed in both
the 70% subset and the whole training set, much higher correlation
coefficients are obtained.
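
The two ways of computing the correlation can be illustrated with a short sketch. Several things here are assumptions rather than statements from the text: the figure 8855 is reproduced as the number of unordered selections with repetition of 4 items from a 20-letter amino acid alphabet, the correlation is taken to be Pearson's, the tables are stored as dictionaries keyed by (composition, class), and the toy values (including the placeholder 0.0 for an unobserved quadruplet) are invented purely for demonstration.

```python
from math import comb, sqrt

# Multisets of size 4 drawn from 20 residue types: C(20 + 4 - 1, 4).
N_COMPOSITIONS = comb(20 + 4 - 1, 4)
N_CLASSES = 5
TABLE_SIZE = N_COMPOSITIONS * N_CLASSES


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def table_correlation(full, subset, both_observed=None):
    """Correlate two log-likelihood tables keyed by (composition, class).

    With `both_observed` given, only those keys enter the calculation.
    """
    keys = sorted(full) if both_observed is None else sorted(both_observed)
    return pearson([full[k] for k in keys], [subset[k] for k in keys])


# Toy tables (invented values); ("ACCC", 0) is treated as unobserved in the subset.
full = {("AAAA", 0): 0.5, ("AAAC", 0): -0.2, ("AACC", 0): 0.1, ("ACCC", 0): 0.3}
subset = {("AAAA", 0): 0.4, ("AAAC", 0): -0.1, ("AACC", 0): 0.2, ("ACCC", 0): 0.0}
both = {("AAAA", 0), ("AAAC", 0), ("AACC", 0)}

r_all = table_correlation(full, subset)          # every entry in the table
r_both = table_correlation(full, subset, both)   # entries observed in both sets
```

How unobserved quadruplets are scored in the real tables is not specified here, so the placeholder value only serves to show why restricting the comparison to jointly observed entries can change the coefficient.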