Our methodological research is focused on the statistical
analysis of high-dimensional data. The research is
driven in part by biomedical problems arising in the
study of cancer and toxicology. Broadly
speaking, our goal is to develop statistical methods that
can help provide insight into disease etiology and
treatment, with the aim of understanding and extending
long-term survival. This work has been, and
continues to be, part of a cross-disciplinary effort
involving researchers at UNC from the Lineberger
Comprehensive Cancer Center, the Department of
Biostatistics, and the Department of Environmental and
Health Sciences.
Of primary interest to us are data sets containing gene
(and microRNA) expression, copy number variation, and
genotype (principally SNP) information. Although
motivated by biomedical problems, our methods have
application to other high-dimensional statistical
problems, including the analysis of data concerning social
and political networks. Research in these related
areas is ongoing. Our research has two principal
directions.
Data Integration

A frequent
problem arising in biomedical analyses is how
to integrate information from multiple data
sets. In one version of this problem,
which we refer to as horizontal integration,
the goal is to combine multiple data sets of a
common type (for example, gene expression)
into a single data set to which standard
analysis methods can be applied.
Alternatively, it may be of interest to
combine the conclusions of analyses performed
separately on multiple data sets of a common
type, a problem closely related to
meta-analysis. Two model-based
approaches to horizontal integration can be
found at the websites for XDE
and XPN.
Another, more challenging, version of this
problem is vertical integration, also known as
data fusion. Here the goal is to combine
information from data sets of different types
(for example, measurements of gene expression,
copy number, and genotype) on a common set of
samples. Natural questions revolve around
the presence (or absence) of associations
between variables in different data types, and
how these associations may change across
different treatments or conditions. Our
most recent work aims to express the variation
in the given data matrices as a sum of variation
due to joint (common) behavior, variation
specific to each data type, and noise.
Details can be found at the website for JIVE.
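
The decomposition described above can be written compactly. The notation below is illustrative and assumed here, not taken from the JIVE site: X_i denotes the data matrix for data type i, with all matrices sharing the same samples as columns, J_i the joint (common) variation, A_i the variation specific to data type i, and \varepsilon_i the noise.

```latex
% Illustrative notation (an assumption for this sketch):
% X_i is the p_i x n data matrix for data type i, i = 1, ..., k,
% where all k matrices are measured on the same n samples (columns).
X_i = J_i + A_i + \varepsilon_i, \qquad i = 1, \dots, k
% J_i           : joint (common) variation shared across data types
% A_i           : variation specific to data type i
% \varepsilon_i : noise
```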

Data Mining

Data mining
describes a host of exploratory methods that
seek to find distinguished patterns or
regularities in large, usually
high-dimensional, data sets. Data mining is part of
a trend away from traditional,
hypothesis-driven scientific research towards
what has been termed data-driven
research. In the latter, hypotheses of
potential interest are generated from
the exploration or "mining" of large data
sets. While promising, mining large data
sets for interesting patterns is often
computationally prohibitive, and may yield
spurious findings, i.e., patterns that have
occurred simply by chance when no true
structure is present.
With these
caveats in mind, we are developing data mining
methods that are based primarily on
statistical, rather than algorithmic,
principles. Our algorithms rely on approximate
search procedures that are computationally
efficient. They directly address the
problem of spurious findings within the
traditional statistical framework of
hypothesis testing and multiple
comparisons. In particular, a natural
measure of statistical significance enables us
to rank discovered patterns that might
otherwise be difficult to compare. Our
initial work considers the problem of
identifying significant sample-variable
associations that correspond to large average
submatrices of the data matrix. (This is a
special case of what is known as biclustering,
or subspace clustering.) The associated
computational problem is addressed by a search
procedure that operates in a simple, iterative
fashion. More details, and an extensive
validation study, can be found at the software
link for the LAS
(large average submatrix) algorithm. We
are currently investigating supervised data
mining procedures that combine biclustering
and classification.
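
To make the idea of a simple, iterative search for a large average submatrix concrete, here is a minimal Python sketch. It is not the LAS software: it assumes a fixed submatrix size (k rows by l columns), scores candidates by their plain average rather than by a statistical significance measure, and uses random restarts. The function names `search_submatrix` and `best_of_restarts` and all parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def search_submatrix(X, k, l, rng, max_iter=100):
    """Alternating search for a k-by-l submatrix of X with a large average.

    Illustrative only: the size (k, l) is fixed in advance and the score
    is the plain average of the selected entries.
    """
    n_rows, n_cols = X.shape
    # Start from a random set of l columns.
    cols = np.sort(rng.choice(n_cols, size=l, replace=False))
    rows = None
    for _ in range(max_iter):
        # Given the columns, keep the k rows with the largest sums over them.
        new_rows = np.sort(np.argsort(X[:, cols].sum(axis=1))[-k:])
        # Given those rows, keep the l columns with the largest sums over them.
        new_cols = np.sort(np.argsort(X[new_rows, :].sum(axis=0))[-l:])
        converged = (
            rows is not None
            and np.array_equal(new_rows, rows)
            and np.array_equal(new_cols, cols)
        )
        rows, cols = new_rows, new_cols
        if converged:
            break
    return rows, cols, float(X[np.ix_(rows, cols)].mean())

def best_of_restarts(X, k, l, n_restarts=10, seed=0):
    """Run the search from several random starts and keep the best result."""
    rng = np.random.default_rng(seed)
    results = [search_submatrix(X, k, l, rng) for _ in range(n_restarts)]
    return max(results, key=lambda r: r[2])
```

For example, on a matrix of standard normal noise with an elevated block planted in one corner, `best_of_restarts(X, k, l)` returns the selected row indices, the selected column indices, and the average of the selected submatrix. Because each alternating step can only keep or increase the sum of the selected entries, the search settles at a local optimum, which is why several restarts are used.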

