SiZer: which features are "really there"?
by P. Chaudhuri, J. S. Marron, J. C. Jiang, C. S. Kim, R. Z. Li, V. Rondonotti, J. de Uña Alvarez

(This page is under construction.)

Click here for a Romanian translation of this page, kindly provided by Aleksandra Seremina of http://www.azoft.com/


What is SiZer all about, and how will it help me analyze data?

SiZer enables meaningful statistical inference during exploratory data analysis with statistical smoothing methods (e.g. histograms or scatterplot smoothers).  It is a new visualization that brings clear and immediate insight into a central scientific issue in exploratory data analysis:

Which features observed in a smooth of data are "really there"?

A rephrasing is:

What is "important underlying structure", as opposed to "noise artifacts" or features "attributable to sampling variability"?

This central question is critical in real data analysis, because discovery of a new feature, such as an unexpected "bump" or surprising "regions of decrease/increase", might lead to important new scientific insight (see Section B for several examples of this).  The word "might" is very important, because (as shown in Section A below) while smoothing is a powerful method for finding such features, it is also capable of highlighting many spurious features.  Newly discovered genuine structure leads to scientific breakthroughs, and guides research in important new directions, e.g. towards explaining the phenomenon, often with an appropriate new model.  But such new research efforts require serious investment of time and resources, which are wasted should the deeper inquiry reveal that the newly discovered features were mere noise artifacts.

This point is illustrated by the following data set, which consists of Family Incomes in the United Kingdom during 1975.  The histogram suggests that there might be two modes in the income distribution.  From a classical viewpoint this would be surprising, as the several parametric families for modelling income distributions are all unimodal.  Investigation, detailed validation (including eventual fitting of a parametric mixture model), and explanation of the bumps became the PhD dissertation of Heinz-Peter Schmitz (University of Bonn), part of which is published in Econometric Theory (1992) 8, 476-488.  Had the bimodal structure proven to be a mere artifact of sampling variability, considerable effort would have been wasted.  SiZer gives even non-experts a quick and effective means of making this important type of research decision.  This point is further illustrated in the context of this data set in Section B.

This issue is not simple to handle, because (as illustrated in Section A) it is confounded with the problem of "amount of smoothing".  Experienced data analysts (who know enough to view several levels of smoothing, and to understand what they are looking at!) are usually very effective in determining which structures are "signal" and which are "noise".  SiZer allows major strides in this decision process in two different contexts:

(i)    It makes this type of inference readily doable by the non-expert.

(ii)    It speeds the decision for the smoothing expert.
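
The way apparent features depend on the amount of smoothing can be seen in a small numerical sketch.  The Python code below is not SiZer and is not taken from the SiZer software.  It is only a minimal illustration, assuming simulated data from two populations and a Gaussian kernel density estimate, that counts the bumps in the smooth at several bandwidths.  The names gaussian_kde and count_modes, the bandwidth values, and the evaluation grid are all choices made here for illustration.

```python
import math
import random


def gaussian_kde(data, grid, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        for x in grid
    ]


def count_modes(values):
    """Count strict interior local maxima of a curve sampled on a grid."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# Simulated data (an assumption for illustration, not the income data):
# an equal mixture of two normal populations centred at -2 and 2.
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(100)]
data += [rng.gauss(2.0, 1.0) for _ in range(100)]

# Evaluation grid of 201 equally spaced points on [-6, 6].
grid = [-6.0 + 12.0 * i / 200 for i in range(201)]

# The same data viewed at four levels of smoothing.
bandwidths = [0.05, 0.2, 0.8, 3.0]
modes_by_bandwidth = {
    h: count_modes(gaussian_kde(data, grid, h)) for h in bandwidths
}

for h, m in modes_by_bandwidth.items():
    print(f"bandwidth {h:4.2f}: {m} mode(s)")
```

Very small bandwidths typically show many bumps, most of them sampling noise, while very large bandwidths smooth the two populations into a single bump.  Counting bumps does not by itself say which of them are "really there"; that is the inference SiZer adds.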

An index to pages with a detailed look at various aspects of SiZer, and some interesting asides, follows.




Table of Contents

(Links to Pages with a detailed introduction to SiZer, some motivating analyses, some insightful simulated examples, connections to previous work in statistical smoothing, some interesting asides, and some ideas for extension to more complicated smoothing settings)



A.    An Introduction to the Basics:

This section gives background material in statistical smoothing and an introduction to SiZer.

1.  Histograms are "smoothers",
but here is why you shouldn't use them.

2.  Kernel Density Estimation,
a "smoothed histogram", and the importance of the bandwidth.

3.  An introduction to scatterplot smoothing, i.e. nonparametric regression,
a useful way to find structure in data, and again the importance of the bandwidth.

4.  The family approach to smoothing,
look at all members of the family of smooths, i.e. all the bandwidths, instead of attempting to choose a "best" one.
by J. S. Marron and S. S. Chung

5.  SiZer,
Introduction to the basic ideas.
 



B.    A Set of Examples:
 
 
 
 
 



C.    Connections to the history of statistical smoothing:

This section is intended for experts in statistical smoothing.  It connects scale space ideas, including SiZer, to earlier approaches to these problems.  Here are two intentionally provocative personal opinions that are backed up inside:

1.  Bandwidth selection is nowhere near as important as I once thought.

2.  Confidence bands are the wrong way to understand the variability of a smoother (i.e. curve estimator).
 



D.    Fun with Scale Space:
 
 



E.    Extensions and Enhancements of SiZer:

Here are some of the ways that the SiZer idea has been extended to date:

1.  SiZer for finding jumps
by C. S. Kim and J. S. Marron

2.  SiZer for dependent data
by V. Rondonotti and J. S. Marron

3.  SiZer for local likelihood
by R. Z. Li and J. S. Marron

4.  SiZer for censored and uncensored density and hazard rate estimation
by J. C. Jiang and J. S. Marron

5.  SiZer for length biased density and hazard estimation
by J. de Uña Alvarez and J. S. Marron

6.  High Dimensional Versions
by F. Godtliebsen, J. S. Marron and P. Chaudhuri
    SiZer and its higher dimensional extensions are called "SSS", or "S cubed", for "Significance in Scale Space".  Extension to more than one dimension requires a really different visual paradigm, although the statistical backbone is the same.  So far only dimension 2 has been implemented.  The statistical end is straightforward in higher dimensions, but the visualization appears to require yet another new set of ideas.  For details in the 2-d case, go to  http://www.unc.edu/~marron/Movies/SSS_movies.html.
 
 


Downloadable SiZer Software:
 

Matlab 7 Functions for SiZer and SSS (ascii)
 

For a Java version of SiZer (thus no Matlab required), go to Daniel H. Wagner Associates, and follow the "Download SiZer software" link.
 


SiZer References: 

“Scale space view of curve estimation”, Chaudhuri, P. and Marron, J. S. (2000) Annals of Statistics, 28, 408-428.

“Zooming statistics: Inference across scales”, Hannig, J., Marron, J. S. and Riedi, R. H. (2001) Journal of the Korean Statistical Society, 30, 327-345.

“Dependent SiZer: Goodness of Fit Tests for Time Series Models”, Park, C., Marron, J. S. and Rondonotti, V. (2004) Applied Statistics, 31, 999-1017.

“SiZer for length biased, censored density and hazard estimation”, de Uña Álvarez, J. and Marron, J. S. (2004) Journal of Statistical Planning and Inference, 121, 149-161.

“SiZer for smoothing splines”, Zhang, J. T. and Marron, J. S. (2005) Computational Statistics, 20, 481-502.

“Local Likelihood SiZer map”, Li, R. and Marron, J. S. (2005) Sankhya, 67, 476-498.

“Advanced distribution theory for SiZer”, Hannig, J. and Marron, J. S. (2006) Journal of the American Statistical Association, 101, 484-499.

“SiZer for jump detection”, Kim, C. S. and Marron, J. S. (2006) Journal of Nonparametric Statistics, 18, 13-20.

“SiZer for time series: a new approach to the analysis of trends”, Rondonotti, V., Marron, J. S. and Park, C. (2007) Electronic Journal of Statistics, 1, 268-289 (http://dx.doi.org/10.1214/07-EJS006).


 


For more about SiZer, or if you have other questions, send email to marron@email.unc.edu.

