Implications for inverted structure prediction

We have evaluated the proposition that the log-likelihood factor accurately reflects the compatibility of a native sequence with its native structure; this is an essential step towards inverted structure prediction. We chose three proteins of different lengths from the dataset: hemoglobin (1eca), flavodoxin (4fxn), and papain (9pap). For each of these proteins, 100,000 random sequences of the same length as the native protein were generated as follows. An amino acid was assigned randomly to each position in a sequence according to its observed frequency of occurrence in the training dataset. This procedure generates sequences whose individual compositions differ, but ensures that, averaged over the ensemble, each amino acid occurs with its observed frequency. Once a random sequence had been generated and assigned (threaded) to a template, the new amino acid composition and the corresponding agglomeration factor were determined for each Delaunay simplex of the template. For Delaunay simplices whose composition was not observed in the training dataset, the agglomeration factor was set to zero. The total sequence/structure compatibility score was calculated as the sum of the agglomeration factors over all Delaunay simplices of the template threaded with the random sequence. The results of these experiments are presented in Figure 5. As can be seen in the figure, in all cases the native sequence has the highest score. One may hypothesize that sequences scoring close to the native sequence may in fact adopt a similar fold.

Figure 5: Scores for native and random sequences fitted to the Delaunay-based templates of hemoglobin (a), flavodoxin (b), and papain (c).
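
The random-sequence experiment described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work. All names (`sample_sequence`, `compatibility_score`, `random_scores`) and the toy frequencies, simplices, and factor values are hypothetical and chosen only for illustration. The sketch assumes that a template is given as a list of simplices, each a tuple of four residue positions, and that the agglomeration factors are available as a lookup table keyed by the unordered amino acid composition of a simplex. Neither the Delaunay tessellation nor the derivation of the factors from the training dataset is reproduced here.

```python
import random

# Hypothetical toy inputs; real values would come from the training dataset
# and from the Delaunay tessellation of the template structure.
TOY_FREQUENCIES = {"A": 0.5, "C": 0.3, "D": 0.2}   # background residue frequencies
TOY_SIMPLICES = [(0, 1, 2, 3), (1, 2, 3, 4)]       # residue positions per simplex
TOY_FACTORS = {"AACD": 0.8, "ACDD": 0.5}           # factor per sorted composition


def sample_sequence(length, frequencies, rng):
    """Assign a residue to each position independently, weighted by frequency."""
    residues = list(frequencies)
    weights = [frequencies[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


def compatibility_score(sequence, simplices, factors):
    """Sum the factors over all simplices of the threaded template.

    The composition of a simplex is treated as order-independent (an
    assumption), so the residues are sorted before the lookup.
    Compositions missing from the table contribute zero.
    """
    total = 0.0
    for simplex in simplices:
        composition = "".join(sorted(sequence[i] for i in simplex))
        total += factors.get(composition, 0.0)
    return total


def random_scores(length, simplices, factors, frequencies, n, seed=0):
    """Score n random sequences of the given length threaded onto one template."""
    rng = random.Random(seed)
    return [
        compatibility_score(
            sample_sequence(length, frequencies, rng), simplices, factors
        )
        for _ in range(n)
    ]


if __name__ == "__main__":
    native = "ACADD"  # hypothetical stand-in for a native sequence
    native_score = compatibility_score(native, TOY_SIMPLICES, TOY_FACTORS)
    scores = random_scores(
        len(native), TOY_SIMPLICES, TOY_FACTORS, TOY_FREQUENCIES, n=1000
    )
    n_higher = sum(s > native_score for s in scores)
    print(f"native score: {native_score:.2f}")
    print(f"random sequences scoring above native: {n_higher} of {len(scores)}")
```

With toy data of this size, random sequences can easily match or exceed the stand-in native score; the sketch shows only the mechanics of sampling, threading, and summation, not the separation between native and random sequences reported in Figure 5.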