Methods to Test Hypotheses
For anesthesiologists, a basic knowledge of statistics is necessary
for rational interpretation of the literature. For those doing research,
statistical concepts are critical in the planning, execution, presentation,
and publication of studies. Few nonstatisticians will develop sufficient
knowledge to resolve complex statistical issues; however, clinicians and
investigators can avoid obvious errors and can consult with trained statisticians
when additional expertise is needed. This synopsis will provide some basic
vocabulary and help with the answers to several commonly asked statistical questions.
Descriptive statistics: summarize a
group of individual data points. A group constitutes a sample of an entire
population. Categories of data are called variables (not parameters). Variables
are classified as continuous, including both ratio scales and interval
scales (e.g., cardiac output), or discontinuous, taking only discrete values
(e.g., number of fingers). Ranked variables cannot be measured but can be ordered
by magnitude (e.g., Glasgow Coma Scale). Categorical variables may be nominal or
ordinal but have unmeasurable attributes (e.g., alive or dead). Common definitions
of descriptive statistics include:
(In the formulas below, Σ = sum, X = the value of an individual observation,
X̄ = the mean of all observations, and n = the number of observations.)
Frequency: the number of occurrences of a value in a group of measurements.
Mean: the average of a group of measurements (sensitive to outliers).
Median: the middle value of a group of measurements, i.e., half of the
values are above and half are below the median. The median is insensitive
to outliers; therefore it is preferable to the mean when data are skewed,
i.e., not normally distributed.
Range: the minimum and maximum values in a sample.
Mode: The most common value in a group of measurements.
Standard Deviation (SD): estimates the variability in the population from
which a sample has been obtained.
The calculation is as follows: SD = [Σ(X − X̄)² / (n − 1)]^(1/2)
If the data are normally distributed, 95% of all population members
fall within about 2 standard deviations of the mean, i.e., if the mean
± SD for systolic blood pressure is 110 ± 10 mmHg, then 95% of systolic blood
pressures are between 90 and 130 mmHg.
Standard error of the mean (SEM): quantifies uncertainty in the estimate
of the mean.
The calculation is as follows: SEM= SD / n ^(1/2)
The sample mean ± 2 SEMs describes the range which, with about 95%
confidence, contains the actual mean of an entire population. As such,
the mean ± 2 SEMs is a rough approximation of the 95% confidence interval.
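The descriptive statistics above can be sketched with the Python standard library; the blood pressure values below are hypothetical, chosen only to illustrate the definitions:

```python
# Descriptive statistics for a small sample of systolic blood pressures (mmHg).
from statistics import mean, median, stdev
from math import sqrt

sbp = [104, 118, 110, 95, 122, 110, 108, 115, 99, 119]  # hypothetical sample

n = len(sbp)
sample_mean = mean(sbp)        # average of all observations
sample_median = median(sbp)    # middle value; insensitive to outliers
sample_sd = stdev(sbp)         # SD = [sum((X - mean)^2) / (n - 1)]^(1/2)
sem = sample_sd / sqrt(n)      # SEM = SD / n^(1/2)

# Rough 95% confidence interval for the population mean: mean +/- 2 SEM
ci_low, ci_high = sample_mean - 2 * sem, sample_mean + 2 * sem
print(f"mean={sample_mean:.1f}, median={sample_median:.1f}, "
      f"SD={sample_sd:.1f}, SEM={sem:.2f}, 95% CI≈({ci_low:.1f}, {ci_high:.1f})")
```

Note that the SEM shrinks as n grows, while the SD estimates a fixed property of the population.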
Inferential statistics: allow generalizations
to a population, based upon a sample; used to test hypotheses and evaluate
estimates. The hypothesis of "no difference" is often called the null hypothesis.
Parametric tests: based on the assumptions that the populations are normally
distributed (bell curve) and that the variances are equal.
Nonparametric tests: utilized when the conditions above do not apply, i.e.,
the data are not normally distributed or variances are markedly unequal.
Power analysis: Determination, ideally before beginning a study, of the
approximate number of subjects that will be (or would have been) required
to detect a meaningful difference. Necessary assumptions include the means
and variances of the control group and the expected treatment effect.
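A common sample-size sketch for comparing two means uses the normal approximation; the function and all input values below are illustrative assumptions, not a substitute for consulting a statistician:

```python
# A priori sample-size estimate for comparing two means (normal approximation).
# Inputs: sigma = expected SD, delta = smallest meaningful difference,
# alpha = type I error rate, power = 1 - beta (type II error rate).
from math import ceil
from statistics import NormalDist

def n_per_group(sigma: float, delta: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g., 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g., 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# e.g., expected SD of 10 mmHg, meaningful difference of 5 mmHg:
print(n_per_group(sigma=10, delta=5))  # -> 63 per group
```

Halving the detectable difference roughly quadruples the required sample size, which is why power analysis belongs at the planning stage.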
p value: Commonly overinterpreted and misused, the p value is a statement
of the probability that an apparent difference between values could have
occurred by chance when there is no true difference in the entire population.
The statement that p < 0.05 means that there is less than a 5% probability
that a difference as large as the one observed would have arisen by chance
if there were no true difference (but see caveats below about multiple testing).
Student's t test: first described in 1908 (by a statistician named Student),
this is a basic test for comparing two sets of normally distributed data.
Paired t test: compares data acquired at two intervals in the same individuals,
e.g., before and after drug administration.
Unpaired t test: compares data between groups of individuals, e.g., data
after drug administration in treatment and placebo groups.
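The unpaired t statistic can be computed from first principles with only the standard library; the treatment and placebo values below are hypothetical:

```python
# Unpaired (two-sample, pooled-variance) t statistic from first principles.
from math import sqrt
from statistics import mean, variance

def unpaired_t(a: list[float], b: list[float]) -> float:
    na, nb = len(a), len(b)
    # Pooled variance assumes the two populations have equal variances.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

treatment = [88, 92, 85, 91, 89]   # hypothetical values
placebo = [95, 99, 94, 97, 96]
t = unpaired_t(treatment, placebo)
# The p value then comes from the t distribution with na + nb - 2 degrees
# of freedom (e.g., scipy.stats.t.sf(abs(t), df) * 2, if SciPy is available).
print(round(t, 2))
```

In practice a library routine would be used; writing out the pooled variance makes the equal-variance assumption explicit.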
Chi-Square: used to compare proportions of samples; looks at "cells" of
categorical data (e.g., alive or dead) and evaluates the observed values
in comparison to the expected values in the cells.
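The observed-versus-expected comparison for a 2×2 table can be sketched as follows; the counts are hypothetical:

```python
# Chi-square statistic for a 2x2 table of categorical outcomes
# (alive/dead by treatment group).
def chi_square_2x2(table: list[list[int]]) -> float:
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = row[i] * col[j] / total      # expected cell count
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

#            alive  dead
observed = [[30,    10],   # treatment (hypothetical counts)
            [20,    20]]   # placebo
print(round(chi_square_2x2(observed), 2))
```

The statistic is then compared against the chi-square distribution with 1 degree of freedom for a 2×2 table.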
Type I (α) error: identification of a difference that would not have been
found if the entire population had been studied. (You found a difference
but there wasn't one.)
Type II (β) error: failure to identify a difference that would have been
found if the entire population had been studied. (You didn't find any difference
but there was one.)
Analysis of variance (ANOVA): a test for comparing two or more treatment
groups (it can be used in place of the t test in two groups) consisting
of different individuals. When comparing only one post-treatment interval
and only one treatment per group, one way ANOVA is used.
Example: Measuring cerebral blood flow in patients who have received intravenous
injections of either 0.9% saline, sodium thiopental or etomidate.
Repeated measures ANOVA: used to compare multiple treatments or multiple
intervals in the same individuals.
Example: Measuring hematocrit in cardiac surgical patients (1) pre-cardiopulmonary
bypass (CPB), (2) during normothermic CPB, (3) during hypothermic CPB,
and (4) after rewarming.
When multiple treatments or multiple intervals are measured in two or more
groups, it is called two way ANOVA.
Example: Measuring hematocrit in two groups of cardiac surgical patients
(hemodiluted and nonhemodiluted) (1) pre-CPB, (2) during normothermic CPB,
(3) during hypothermic CPB, and (4) after rewarming.
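The one-way case above (saline vs. thiopental vs. etomidate) can be sketched from first principles; the cerebral blood flow values below are hypothetical:

```python
# One-way ANOVA F statistic from first principles.
from statistics import mean

def one_way_anova_f(groups: list[list[float]]) -> float:
    k = len(groups)                        # number of treatment groups
    n_total = sum(len(g) for g in groups)
    grand = mean([x for g in groups for x in g])
    # Between-group sum of squares (treatment effect)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares (residual variability)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical cerebral blood flow values (mL/100 g/min):
saline     = [52, 48, 50, 55]
thiopental = [38, 41, 36, 40]
etomidate  = [37, 39, 35, 38]
f = one_way_anova_f([saline, thiopental, etomidate])
print(round(f, 1))
```

A large F (between-group variability much greater than within-group variability) leads to a small p value; which specific groups differ then requires a post hoc test (see below).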
Multiple testing: a common, major problem in statistical management. The
basic requirement is to account for the increasing probability of type
I error that accompanies increasing numbers of tests. (Imagine doing twenty
tests with the assumption that you will accept a one-in-twenty chance of
Type I error (p < 0.05) per test.)
Post hoc test: used after ANOVA or repeated measures ANOVA to determine
specific differences between groups (or time intervals, doses, etc.).
Bonferroni adjustment: divides the significance threshold by the number of
tests to determine the appropriate per-test p value, e.g., if ten tests are
done, the threshold p value is 0.005 (0.05 ÷ 10).
Other multiple comparison procedures: Student-Newman-Keuls; Scheffé (very
conservative); Dunnett's (compares all groups to control); Tukey's.
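The Bonferroni adjustment is a one-line calculation; the p values below are hypothetical results of five tests:

```python
# Bonferroni adjustment: with k tests, each test is judged against a
# threshold of alpha / k to keep the overall type I error rate near alpha.
alpha = 0.05
p_values = [0.001, 0.020, 0.004, 0.300, 0.045]  # hypothetical results of 5 tests
k = len(p_values)
threshold = alpha / k                            # 0.05 / 5 = 0.01
significant = [p for p in p_values if p < threshold]
print(threshold, significant)
```

Note that 0.020 and 0.045, which would each be "significant" in isolation, fail the adjusted threshold.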
Linear regression: estimates the linear relationship of an outcome variable
with an explanatory variable in terms of slope and intercept. The associated
p value is the probability that the calculated slope would have occurred
by chance if the true slope is "0."
Linear correlation: measures the strength (in terms of "r" ranging from
-1 to +1) of the linear relationship between two variables, but not the
agreement between them. The associated p value is the probability that
the calculated correlation coefficient would have occurred by chance when
there is no correlation between the two variables. R-square (r^2): literally
r times r; explains the proportion of the variability in y explained by
the variability in x.
Multivariate analysis: describes a variety of techniques (Hotelling's T2,
discriminant analysis, and logistic regression) that permit looking at
all the response variables together rather than just one at a time to evaluate
differences between groups.
Difference Plot: displays a comparison between an old, "gold-standard"
measurement and a new one. Determines bias (the average difference between
the two measurements) and precision (the standard deviation of the difference
between the measurements).
Bland-Altman Difference Plot: displays a comparison between a conventional
(but not "gold standard") measurement and a new measurement. The new measurement
is compared to the average of the old and new measurements obtained at the
same time.
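The bias and precision described above reduce to the mean and SD of the paired differences; the cardiac-output measurements below are hypothetical:

```python
# Bland-Altman style summary: bias (mean difference) and precision (SD of
# the differences) for two measurement methods applied to the same subjects.
from statistics import mean, stdev

method_a = [4.2, 5.1, 3.8, 6.0, 4.9]  # conventional method (hypothetical, L/min)
method_b = [4.5, 5.0, 4.1, 6.4, 4.7]  # new method (hypothetical, L/min)

diffs = [a - b for a, b in zip(method_a, method_b)]
bias = mean(diffs)        # average difference between the two methods
precision = stdev(diffs)  # SD of the differences
# "Limits of agreement" are commonly reported as bias +/- 2 SD.
print(round(bias, 2), round(precision, 2))
```

The full Bland-Altman plot graphs each difference against the pairwise average, which this numeric summary does not show.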
Are two or more numbers different?
(See table inside front cover of Glantz, SA. Primer of Bio-Statistics
(4th Edition). McGraw-Hill, Inc., NY, NY, 1997.)
Parametric data (normally distributed, continuous data): unpaired or
paired t tests; ANOVA. Discontinuous data or non-normally distributed continuous
data: Mann-Whitney rank-sum test (used in unpaired data); Wilcoxon signed-rank
test (used with paired data); Kruskal-Wallis statistic (used similarly to
ANOVA). Categorical data: Chi-square analysis-of-contingency table for
unpaired data or three or more groups of different individuals; McNemar's
test for paired data.
Are two or more responses different?
Parametric data (normally distributed, continuous data): repeated measures
ANOVA. Discontinuous data or non-normally distributed continuous data:
Friedman statistic. Categorical data: Cochrane's Q for three or more treatments
in the same individuals.
Are statistically significant differences clinically important?
Dependent on judgement and experience. For instance, if two anesthetics
are associated with a statistically significant 3.0 mmHg difference in
intracranial pressure in patients with brain tumors, that might be of little
clinical importance.
What is the meaning of a "zero numerator?"
A common statistical question implicit in clinical practice: What is
the implication of not observing a complication or an effect in a given
population? ("What does it mean if I have never induced a pneumothorax
in a series of subclavian central venous catheterizations?") The basic
rule is that, with 95% confidence, the actual incidence of that occurrence
if the series were continued would be
0 to 3/n
where n is the number of events currently in the series.
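This "rule of three" is a one-line calculation:

```python
# "Rule of three": if no events were observed in n trials, the upper 95%
# confidence bound for the true incidence is approximately 3/n.
def zero_numerator_upper_bound(n: int) -> float:
    return 3 / n

# e.g., no pneumothorax in 300 subclavian catheterizations:
print(zero_numerator_upper_bound(300))  # -> 0.01, i.e., up to about 1%
```

So a complication never seen in a series may still occur at an appreciable rate; the bound only tightens as the series grows.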
Are two or more numbers equivalent?
The same approaches are used as in question 1 above. However, power
analysis is essential in determining the confidence with which the evidence
should be accepted (i.e., is the sample size sufficient to safely conclude
that there is no difference?) A good conceptual comparison is the criminal
justice system which provides a verdict of "not guilty" rather than "innocent."
A special case is the determination of whether two measurement techniques
are equivalent, in which the Bland-Altman approach (see above) is the preferable
method.
Are two or more responses the same?
The same approach is used as in question 2 above. However, power analysis
is essential in determining the confidence with which the evidence should
be accepted (i.e., is the sample size sufficient to safely conclude that
there is no difference?)
METHODS TO TEST HYPOTHESES
Glantz, SA. Primer of Bio-Statistics (4th Edition). McGraw-Hill, NY,
NY, 1997. Chapter references are in parentheses.
* If the assumption of normally distributed populations is not met, rank
the observations and use the methods of data measured on an ordinal scale.
| Scale of Measurement | 2 treatment groups, different individuals | 3+ groups, different individuals | Before and after a single treatment in same individuals | Multiple treatments, same individuals | Association between 2 variables |
|---|---|---|---|---|---|
| Interval (and drawn from normally distributed populations)* | Unpaired t-test (4) | Analysis of variance (ANOVA) (3) | Paired t-test (9) | Repeated-measures ANOVA (9) | Linear regression and Pearson product-moment correlation; Bland-Altman |
| Nominal | Chi-square analysis-of-contingency table (5) | Chi-square analysis-of-contingency table (5) | McNemar's test (9) | Cochrane Q ** | Contingency coefficients ** |
| Ordinal | Mann-Whitney rank-sum test (10) | Kruskal-Wallis statistic (10) | Wilcoxon signed-rank test (10) | Friedman statistic (10) | Spearman rank correlation (8) |
| Survival time | Log-rank test or Gehan's test (11) | | | | |
** Not included in this text.
Glantz, SA. Primer of Bio-Statistics (4th Edition). McGraw-Hill, NY, NY, 1997.
An excellent general reference for a nonstatistician.
Moses LE. Statistical concepts fundamental to investigations. N Engl J Med.
A good overview of how to use statistics in interpreting (and planning)
studies.
Cupples LA, Heeren T, Schatzkin A, Colton T. Multiple testing of hypotheses
in comparing two groups. Ann Intern Med 1984;100:122-129.
A good overview of multivariate testing (the kind of methodology that
is used to answer questions such as "What factors correlate with postoperative
myocardial infarction? "
Bland JM, Altman DG. Statistical methods for assessing agreement between
two methods of clinical measurement. The Lancet 1986;1:307.
A widely cited reference for a fundamental type of question such as
"What is the comparison between measurements of hemoglobin saturation done
with arterial blood samples or pulse oximetry? "
Williamson DF, Parker RA, Kendrick JS. The box plot: a simple visual method
to interpret data. Ann Intern Med 1989;110:916.
Advocates a new approach to the presentation of data that probably will
become widely accepted (or required) over time.
Hanley JA, Lippman-Hand A. If nothing goes wrong, is everything all right?
JAMA 1983;249: 1743-1745.
Well worth reading, even if you plan to interpret nothing other than
your own clinical experience.
Steel RGC, Torrie JH. Multiple comparisons in Steel RGC, Torrie JH (Eds.)
Principles and Procedures of Statistics: a Biometrical Approach, 2nd Ed.
McGraw-Hill Inc., NY, NY, 1980. Ch. 8, p 173-194.
Math is a little heavy, but the commonly used tests are presented.
Student. The probable error of a mean. Biometrika 1908;6:1-25.
A classic. No, I don't know why no first or middle name is given.
Derish PA. Biostatistics for Editors. CBE Views 1994;17:3-6.
Bailar JC, Mosteller F. Guidelines for statistical reporting in articles
for medical journals. Annals of Internal Medicine 1988;108:266-273.
Mills JL. Data Torturing. N Engl J Med 1993;329:1196-1199.
Kubinski JA, Rudy TE, Boston JR. Research design and analysis: the many
faces of validity. J Crit Care 1991;6:143-151.
This reference and the two that follow are a series of readable papers
that address important basic issues.
Boston JR, Rudy TE, Kubinski JA. Multiple statistical comparisons: fishing
with the right bait. J Crit Care 1991;6:211-220.
Rudy TE, Kubinski JA, Boston JR. Multivariate analysis and repeated measurements:
a primer. J Crit Care 1992;7:30-41.