Research Overview: Andrew Nobel and students



Methodological and Applied Research

Our methodological research focuses on the statistical analysis of high-dimensional data.  The research is driven in part by biomedical problems arising in the study of cancer and toxicology.  Broadly speaking, our goal is to develop statistical methods that can help provide insight into disease etiology and treatment, with the aim of understanding and extending long-term survival.  This work has been, and continues to be, part of a cross-disciplinary effort involving researchers at UNC from the Lineberger Comprehensive Cancer Center, the Department of Biostatistics, and the Department of Environmental and Health Sciences.

Of primary interest to us are data sets containing gene (and microRNA) expression, copy number variation, and genotype (principally SNP) information.  Although motivated by biomedical problems, our methods have application to other high-dimensional statistical problems, including the analysis of data concerning social and political networks.  Research in these related areas is ongoing.  Our research has two principal directions.


Data Integration

A frequent problem arising in biomedical analyses is how to integrate information from multiple data sets.  In one version of this problem, which we refer to as horizontal integration, the goal is to combine multiple data sets of a common type (for example, gene expression) into a single data set to which standard analysis methods can be applied.  Alternatively, it may be of interest to combine the conclusions of analyses performed separately on multiple data sets of a common type, a problem closely related to meta-analysis.  Two model-based approaches to horizontal integration can be found at the websites for XDE and XPN.
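
As a concrete illustration, here is a minimal sketch of one naive form of horizontal integration: standardizing each gene within each study before concatenating samples across studies.  The function names and the simple row standardization are our own illustration, not the XPN or XDE methodology, which is considerably more sophisticated and model-based.

    # A toy illustration of horizontal integration (not the XPN or XDE
    # method): remove per-study location and scale for each gene, then
    # concatenate samples across studies.
    import numpy as np

    def standardize_rows(X):
        """Center and scale each row (gene) of a genes-by-samples matrix."""
        mu = X.mean(axis=1, keepdims=True)
        sd = X.std(axis=1, keepdims=True)
        return (X - mu) / np.where(sd > 0, sd, 1.0)

    def naive_horizontal_integration(studies):
        """Combine genes-by-samples matrices measured on the same genes."""
        return np.hstack([standardize_rows(X) for X in studies])

    rng = np.random.default_rng(0)
    study1 = rng.normal(5.0, 2.0, size=(100, 30))  # 100 genes, 30 samples
    study2 = rng.normal(8.0, 0.5, size=(100, 45))  # same genes, 45 samples
    combined = naive_horizontal_integration([study1, study2])
    print(combined.shape)  # (100, 75)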

Another, more challenging, version of this problem is vertical integration, also known as data fusion.  Here the goal is to combine information from data sets of different types (for example, measurements of gene expression, copy number, and genotype) on a common set of samples.  Natural questions concern the presence (or absence) of associations between variables in different data types, and how these associations may change across treatments or conditions.  Our most recent work aims to express the variation in the given data matrices as a sum of variation due to joint (common) behavior, variation specific to each data type, and noise.  Details can be found at the website for JIVE.
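
In symbols, the decomposition described above can be sketched as follows (a schematic of the model; the precise formulation, including the rank and orthogonality constraints, is given at the JIVE website).  For data matrices $X_1, \dots, X_k$ of different types, observed on a common set of $n$ samples,

    \[
      X_i \;=\; J_i \;+\; A_i \;+\; E_i, \qquad i = 1, \dots, k,
    \]

where each $J_i$ captures joint variation shared across data types (the stacked matrix $[J_1^T, \dots, J_k^T]^T$ has low rank), each $A_i$ is a low-rank matrix of variation specific to data type $i$, and $E_i$ is residual noise.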


Data Mining

Data mining describes a host of exploratory methods that seek to find distinguished patterns or regularities in large, usually high-dimensional, data sets.  Data mining is part of a trend away from traditional, hypothesis-driven scientific research towards what has been termed data-driven research.  In the latter, hypotheses of potential interest are generated from the exploration or ``mining'' of large data sets.  While promising, mining large data sets for interesting patterns is often computationally prohibitive, and may yield spurious findings, i.e., patterns that occur simply by chance when no true structure is present.

With these caveats in mind, we are developing data mining methods that are based primarily on statistical, rather than algorithmic, principles.  Our algorithms rely on approximate search procedures that are computationally efficient.  They directly address the problem of spurious findings within the traditional statistical framework of hypothesis testing and multiple comparisons.  In particular, a natural measure of statistical significance enables us to rank discovered patterns that might otherwise be difficult to compare.  Our initial work considers the problem of identifying significant sample-variable associations that correspond to large-average submatrices of the data matrix.  (This is a special case of what is known as biclustering, or subspace clustering.)  The associated computational problem is addressed by a search procedure that operates in a simple, iterative fashion; a sketch is given below.  More details, and an extensive validation study, can be found at the software link for the LAS (large average submatrix) algorithm.  We are currently investigating supervised data mining procedures that combine biclustering and classification.
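
The following toy sketch shows the flavor of such an alternating search; it is our simplified illustration, not the LAS implementation.  For fixed submatrix dimensions k and l, it alternates between choosing the k rows with the largest average over the current columns and the l columns with the largest average over the current rows.  The actual LAS algorithm also selects the submatrix dimensions adaptively and ranks discoveries by a significance score, both of which are omitted here.

    # A toy alternating search for a large-average submatrix
    # (a simplified sketch, not the LAS software).
    import numpy as np

    def alternating_search(X, k, l, n_iter=50, seed=0):
        """Greedily search for a k-by-l submatrix of X with large average."""
        rng = np.random.default_rng(seed)
        cols = rng.choice(X.shape[1], size=l, replace=False)
        # Best k rows given the initial columns.
        rows = np.argsort(X[:, cols].mean(axis=1))[-k:]
        for _ in range(n_iter):
            # Best l columns given the current rows.
            new_cols = np.argsort(X[rows, :].mean(axis=0))[-l:]
            if set(new_cols) == set(cols):
                break  # selection has stabilized
            cols = new_cols
            # Best k rows given the updated columns.
            rows = np.argsort(X[:, cols].mean(axis=1))[-k:]
        return rows, cols, X[np.ix_(rows, cols)].mean()

    X = np.random.default_rng(1).normal(size=(200, 100))
    X[:10, :8] += 2.0  # plant a large-average block
    rows, cols, avg = alternating_search(X, k=10, l=8)
    print(sorted(rows), sorted(cols), round(avg, 2))

Since a single run may stop at a local optimum, in practice such a search would be restarted from many random initializations, keeping the best result.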




Theoretical Research

Our theoretical research is motivated by our methodological research, and by fundamental questions arising in inference and machine learning.


Assessing Patterns in Gaussian Random Matrices

As a natural complement to our work on data mining, we have undertaken a theoretical analysis of the occurrence of numerical patterns in Gaussian random matrices.  The patterns of interest correspond to distinguished submatrices.  Gaussian random matrices have independent standard normal entries (mean zero and variance one).  They are a natural setting (null hypothesis) in which to investigate the possibility of patterns occurring ``by chance'', that is, when no real structure is present.  We are currently investigating the maximal size of distinguished submatrices of a Gaussian random matrix.  Of interest are submatrices whose entries have average greater than or equal to a positive constant, and submatrices whose entries are well fit by a two-way ANOVA model.  Our goal is to obtain size thresholds and associated (asymptotic) probability bounds for both large-average and ANOVA-fit submatrices.  Of particular interest are interval concentration results for the size of large-average submatrices in the square case.  Our work shows, for example, that in an n by n Gaussian random matrix, the largest square submatrix with average 2 or greater has side length equal to log n minus log log n, up to an additive constant, where log is the natural logarithm.  More details can be found here.  Earlier work exploring related questions in binary matrices can be found here.  We are currently investigating applications and extensions of these results to the analysis of social and biological networks.
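
In symbols, writing $K_2(n)$ for the side length of the largest square submatrix of an $n \times n$ Gaussian random matrix whose entries average at least 2, the result quoted above reads (schematically; the paper gives the precise interval concentration statement)

    \[
      K_2(n) \;=\; \log n \;-\; \log \log n \;+\; O(1) \qquad \text{as } n \to \infty,
    \]

with $\log$ the natural logarithm.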


Approximation of Set Families with Finite Combinatorial Complexity

The Vapnik-Chervonenkis (VC) dimension measures the degree to which a family of sets can separate finite sets of points.  Specifically, the VC dimension of a family is the largest k such that sets in the family can select every subset of some k-element set.  Families of sets with finite VC dimension, often called VC classes, have finite combinatorial complexity.  They play an important role in machine learning, empirical process theory, and combinatorial geometry.  There is a large literature on VC classes, extending over 30 years, that addresses their combinatorial properties, as well as related exponential probability inequalities for uniform laws of large numbers.
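
In symbols, for a family $\mathcal{C}$ of subsets of a set $\mathcal{X}$, the definition above reads

    \[
      \mathrm{VC}(\mathcal{C})
      \;=\;
      \max \bigl\{ k \,:\, \exists\, x_1, \dots, x_k \in \mathcal{X}
      \ \text{with}\
      |\{ C \cap \{x_1, \dots, x_k\} : C \in \mathcal{C} \}| = 2^k \bigr\}.
    \]

For example, the half-lines $(-\infty, t]$ on the real line have VC dimension 1, while the closed intervals $[a, b]$ have VC dimension 2.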

Recently we have established a new, and somewhat unexpected, property of VC classes that has interesting probabilistic consequences.  The property is this: for any positive epsilon, the sets in a VC class can be uniformly approximated, up to error epsilon, by the cells of a fixed, finite partition.  More precisely, for every set in the class, the cells of the partition that intersect both the set and its complement have total measure at most epsilon.  The result is established by means of a recursive construction that makes use of the dual VC dimension and basic compactness results from functional analysis.
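
In symbols (our paraphrase of the statement above): for a VC class $\mathcal{C}$, a probability measure $\mu$, and any $\varepsilon > 0$, there is a finite measurable partition $\pi = \{A_1, \dots, A_m\}$ such that

    \[
      \sum_{j \,:\, A_j \cap C \neq \emptyset,\ A_j \cap C^c \neq \emptyset} \mu(A_j)
      \;\le\; \varepsilon
      \qquad \text{for every } C \in \mathcal{C}.
    \]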
 
Immediate corollaries of the approximation property include the facts that VC classes have finite bracketing numbers, satisfy uniform laws of large numbers under ergodic sampling, and exhibit uniform mixing; these facts were previously unknown.  In subsequent work we have extended these results to families of real-valued functions with finite VC-graph or VC-major dimension, and to families with finite gap (fat-shattering) dimension.  See here and here for more details.  We are currently investigating the implications of these results for functional central limit theorems.