SE SOFTWARE
RATIONAL DIVISION OF A DATASET INTO TRAINING AND TEST SETS USING A SPHERE-EXCLUSION APPROACH
The method is described in:
Golbraikh, A.; Shen, M.; Xiao, Z.; Xiao, Y.D.; Lee, K.H.; Tropsha, A. Rational Selection of Training and Test Sets for the Development of Validated QSAR Models. J. Comput.-Aided Mol. Des. 2003, 2-4, 241-253.
Golbraikh, A.; Tropsha, A. Predictive QSAR Modeling Based on Diversity Sampling of Experimental Datasets for the Training and Test Set Selection. J. Comput.-Aided Mol. Des. 2002, 5-6, 357-369.
See also:
Wootton, R.; Cranfield, R.; Sheppey, G.C.; Goodford, P.J. J. Med. Chem. 1975, 18, 607-613.
To obtain a reliable (validated) QSAR model, the dataset under study should be divided into training and test sets. The sphere-exclusion method ensures that the training set is distributed across the whole region of descriptor space occupied by the entire dataset, and that each point of the test set is close to at least one point of the training set.
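The second property, that every test point lies close to some training point, can be checked numerically for any given split. The sketch below is only an illustration: the function name nearest_training_distances is hypothetical, and Euclidean distance in descriptor space is an assumption, since the text above does not fix a metric.

```python
import numpy as np

def nearest_training_distances(X, train_idx, test_idx):
    """For each test point, return the distance to its nearest training point.

    X is an (n_compounds, n_descriptors) array; train_idx and test_idx are
    sequences of row indices. Euclidean distance is an ASSUMPTION.
    """
    X = np.asarray(X, dtype=float)
    train = X[list(train_idx)]
    test = X[list(test_idx)]
    # Pairwise distances, shape (n_test, n_train).
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return d.min(axis=1)
```

If the largest returned value does not exceed the probe sphere radius, every test point sits inside at least one sphere centered on a training point.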
The method consists of the following steps.
1. Select compounds with the highest and lowest activity and include them in the training set.
2. Construct probe spheres of a certain radius R centered at the representative points of these compounds. (Radius R can be defined in several different ways, described in our publications.)
3. Include the compounds whose representative points lie within these spheres, except for the spheres' centers, in the training and test sets. The order in which compounds are assigned to the training and test sets is defined by the user.
4. Exclude all points within each sphere from the initial set of compounds.
5. If there are no more points in the dataset, stop. Otherwise, select the next point (see our publications for details), include it in the training set, build a sphere around this point, and assign all other points within this sphere to the training and test sets. Go to step 4.
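The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SE software itself. The function name sphere_exclusion_split is hypothetical, Euclidean distance is assumed, sphere members are assumed to alternate between test and training set in order of distance from the center (the method leaves this order to the user), and the next center is assumed to be simply the first remaining point (the publications describe the actual selection rules).

```python
import numpy as np

def sphere_exclusion_split(X, y, radius):
    """Return (train_indices, test_indices) for one probe sphere radius."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    remaining = np.ones(len(X), dtype=bool)  # points not yet assigned
    train, test = [], []

    def process_sphere(center):
        # Steps 3 and 4: assign the unassigned points inside the sphere,
        # then exclude them from the remaining set.
        dist = np.linalg.norm(X - X[center], axis=1)
        members = np.flatnonzero(remaining & (dist <= radius))
        members = members[np.argsort(dist[members], kind="stable")]
        # ASSUMED user-defined order: nearest to test, next to training, ...
        for rank, idx in enumerate(members):
            (test if rank % 2 == 0 else train).append(int(idx))
        remaining[members] = False

    # Step 1: most and least active compounds go to the training set.
    seeds = list(dict.fromkeys([int(np.argmax(y)), int(np.argmin(y))]))
    for s in seeds:
        train.append(s)
        remaining[s] = False
    # Step 2: spheres around the seed compounds.
    for s in seeds:
        process_sphere(s)
    # Step 5: repeat until no points are left.
    while remaining.any():
        center = int(np.flatnonzero(remaining)[0])  # ASSUMED selection rule
        train.append(center)
        remaining[center] = False
        process_sphere(center)
    return sorted(train), sorted(test)

# Toy example: six compounds described by a single descriptor.
X_demo = [[0.0], [0.1], [0.2], [1.0], [1.1], [5.0]]
y_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]
train_demo, test_demo = sphere_exclusion_split(X_demo, y_demo, radius=0.25)
```

Because every sphere center is placed in the training set, each test compound is by construction within R of at least one training compound.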
The procedure is repeated automatically with different probe sphere radii, so multiple training and test sets are obtained (typically about 50).
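A loop over radii might look like the sketch below. The names candidate_radii and multiple_splits are hypothetical, and the evenly spaced grid between the smallest and largest pairwise distance is purely an assumption made for illustration; the radii actually used are defined in the publications. The argument split_fn stands for any function that takes (X, y, radius) and returns a (train, test) pair of index lists.

```python
import numpy as np

def candidate_radii(X, n_radii=50):
    """Return n_radii evenly spaced radii (ASSUMED grid, for illustration)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Keep each pair of distinct points once (upper triangle, no diagonal).
    pairwise = dist[np.triu_indices(len(X), k=1)]
    return np.linspace(pairwise.min(), pairwise.max(), n_radii)

def multiple_splits(X, y, radii, split_fn):
    """Run split_fn once per radius and collect (radius, train, test)."""
    splits = []
    for r in radii:
        train, test = split_fn(X, y, float(r))
        splits.append((float(r), train, test))
    return splits
```

Each resulting split can then be used to build and validate a separate QSAR model.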
If you are interested in obtaining SE software, please send a message to Dr. Alexander Golbraikh.