During my career I have developed rather diverse research interests in both theoretical statistics and probability. In particular, there are four major areas in which I have made multiple contributions: generalized fiducial inference, methodology and theory for SiZer, analytical probability, and applications to engineering and finance. Currently I am also developing a new research interest in biological applications. I will now describe my contributions and future plans by area.
Generalized Fiducial Inference and related topics
(funded by NSF grants DMS 1007520, DMS 0707037 and DMS 1512945)
A large percentage of my current effort is concerned with studying theoretical properties of generalized fiducial inference. R. A. Fisher's fiducial inference has been the subject of many discussions and controversies ever since he introduced the idea during the 1930s. The idea experienced a bumpy ride, to say the least, during its early years and one can safely say that it eventually fell into disfavor among mainstream statisticians. However, it appears to have made a resurgence recently under various labels such as generalized inference, confidence distributions, Dempster-Shafer calculus and its derivatives. In these new guises fiducial inference has proved to be a useful tool for deriving statistical procedures for problems where frequentist methods with good properties were previously unavailable.
The aim of my work is to revisit Fisher's fiducial idea from a fresh angle. I do not attempt to derive a new paradox-free theory of fiducial inference, as I do not believe this is possible. Instead, with minimal assumptions, I present a simple new fiducial recipe that can be applied to conduct statistical inference via the construction of generalized fiducial distributions. This recipe is designed to be easy to implement in various practical applications and can be applied regardless of the dimension of the parameter space (i.e., including nonparametric problems). I term the resulting inference generalized fiducial inference.
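To give a flavor of the fiducial argument, consider the textbook case of a normal mean with known variance. This is only an illustrative special case, not the general recipe; the symbols X_i, Z_i, \mu and n are introduced here purely for illustration.

```latex
% Data-generating equation for an i.i.d. sample from N(\mu, 1):
\[
  X_i = \mu + Z_i, \qquad Z_i \overset{\text{i.i.d.}}{\sim} N(0,1), \quad i = 1,\dots,n .
\]
% Averaging gives \bar{X} = \mu + \bar{Z} with \bar{Z} \sim N(0, 1/n).
% Fix the observed mean \bar{x} and invert the equation in \mu,
% replacing \bar{Z} by an independent copy \bar{Z}^{*}:
\[
  \mu = \bar{x} - \bar{Z}^{*}, \qquad \bar{Z}^{*} \sim N(0, 1/n)
  \quad\Longrightarrow\quad
  \mu \mid \bar{x} \;\sim\; N\!\left(\bar{x}, \tfrac{1}{n}\right).
\]
% Central (1-\alpha) fiducial interval: \bar{x} \pm z_{1-\alpha/2} / \sqrt{n}.
```

In this simple case the fiducial interval coincides with the classical confidence interval, so its coverage is exact.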
From the very beginning our work has been motivated by important applications in areas such as pharmaceutical statistics and metrology. Jointly with several of my students and collaborators at other institutions, I have applied the generalized fiducial methodology to important applied problems with great success. In addition to methodological research, I have also analyzed mathematical properties of the generalized fiducial distribution in several setups, proving that it often gives rise to statistical procedures that are asymptotically exact. Fiducial-based statistical procedures also possess very competitive small-sample properties, as shown by mounting evidence from several simulation studies, giving practitioners an exciting new data analysis tool. More specifically, my contributions naturally cluster into three areas:
The first area is the definition and theoretical properties of generalized fiducial inference. The first significant contribution came in 2006, when we connected fiducial inference to the new field of generalized inference, sparking a number of subsequent publications. In a series of papers I give a formal definition of generalized fiducial inference, some useful computational formulas, and a proof of asymptotic correctness of generalized fiducial inference in the one-parameter situation. I also use a discretization to provide a rigorous derivation of workable formulas and a mathematical proof of a Bernstein-von Mises-type theorem establishing asymptotic correctness of generalized fiducial inference for a large class of i.i.d. parametric models. My coauthors and I show how the generalized fiducial distribution can be used for prediction, connect it with another growing field, confidence distributions, and address some computational issues.
Second, my students, collaborators at other institutions, and I provide methodological applications of generalized fiducial inference to statistical models of practical interest. For example, we provide new statistical tools for inference in linear mixed models whose properties compare very favorably with other methodologies available in the literature. We also provide a tool for statistical inference about extremes and address specific applications in measurement science. Some of these ideas have a direct impact on government policy-making with regard to international inter-laboratory experiments and assessment of measurement capabilities by the U.S. National Institute of Standards and Technology (NIST).
The third area is the use of the generalized fiducial distribution for model selection. My students and I cover fiducial model selection in a more classical setup. However, the flexibility of generalized fiducial inference also allows us to move beyond parametric problems, e.g., to large sparse linear systems. A distinctive characteristic of such a system is that the number of parameters is large compared to the sample size, but most of these parameters are statistically nonsignificant. Therefore the issue of model selection is inherently built into this class of problems. As a first step in this direction, my coauthors and I investigated the use of generalized fiducial inference for constructing wavelet regression confidence intervals and for regression with high-dimensional data.
Another contribution to the large dimensional model selection problem is forthcoming. Indeed, further study of the interaction between model selection and fiducial inference is one of my immediate future plans. I also plan to study various optimality properties of generalized fiducial distribution. In the long term, I plan to continue working on applications of fiducial inference to important current problems of statistical inference.
A reader interested in learning more about generalized fiducial inference can consult our review article.
Big Data, Data Integration, Non-parametric smoothing and SiZer
(funded by NSF grant IIS-1633074)
Big Data has become a popular fad among statisticians, computer scientists and mathematicians. However, not enough attention has yet been paid to large-scale Big Data challenges such as data heterogeneity. Heterogeneity arises frequently in Big Data sets because it is a natural consequence of combining data sets. So far there has been rather little thought or discussion within the quantitative communities (in statistics or elsewhere) about the impact of data set combination, yet that is a crucial issue. This may be partly because appropriate terminology does not yet exist. This problem is addressed by introducing the new terminology of "robustness against heterogeneity".
In joint work with Prof. Marron we have improved the data integration tool JIVE by using tools from perturbation theory for matrices. JIVE allows us to decompose the information in several datasets into joint and individual parts. This can be used to better understand underlying common driving forces or, at the opposite end of the spectrum, to eliminate batch effects. Currently we plan to apply this tool to various datasets and to develop a supervised version of JIVE.
Another basic question in statistics is finding a functional relationship between predictors and response variables, known under the technical term regression. When the relationship cannot be described by a simple function, e.g., a line, a more flexible, non-parametric method is sought. One such non-parametric method is called local polynomial smoothing. The idea of local polynomial smoothing is to fit a simple function, e.g., a polynomial, to the data in a narrow sliding window. The size of the window is often called the bandwidth, and the practical performance of local polynomial smoothing is critically influenced by its choice. There has been a lot of competing literature, mainly in the 1990s, on how best to select the bandwidth in various situations.
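The sliding-window idea can be sketched in a few lines. The following is a minimal illustration of a local linear smoother (local polynomial of degree one), assuming a Gaussian kernel; the function names `local_linear_fit` and `local_linear_smooth` and the simulated data are hypothetical and chosen only for this example.

```python
import numpy as np

def local_linear_fit(x, y, x0, bandwidth):
    """Fit a kernel-weighted least-squares line around x0.

    Returns (level, slope): the fitted value and fitted derivative at x0.
    """
    # Gaussian kernel weights (an assumption of this sketch):
    # points far from x0, relative to the bandwidth, count less.
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    # Design matrix centred at x0, so the intercept is the fit at x0.
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta[0], beta[1]

def local_linear_smooth(x, y, grid, bandwidth):
    """Evaluate the local linear smoother at every point of grid."""
    return np.array([local_linear_fit(x, y, g, bandwidth)[0] for g in grid])

# Simulated data: a sine curve observed with noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)
grid = np.linspace(0.05, 0.95, 50)

# The same data smoothed with a small, a moderate and a large bandwidth.
for h in (0.02, 0.1, 0.5):
    fit = local_linear_smooth(x, y, grid, h)
    err = np.mean((fit - np.sin(2 * np.pi * grid)) ** 2)
    print(f"bandwidth={h}: mean squared error vs. truth = {err:.4f}")
```

Running the loop with different bandwidths shows how strongly the quality of the fit depends on the window size, which is the reason bandwidth selection received so much attention.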
At the turn of the millennium Chaudhuri and Marron argued that instead of selecting a single "best" bandwidth one should work with a number of possible bandwidths, identifying features at a number of different scales. These features were distinguished using a large number of statistical tests summarized in a special graphic termed the SiZer (Significant Zero crossing) map. Since then SiZer maps have found use in many exploratory data analysis situations.
The fact that a SiZer map is based on a large number of statistical tests calls for a multiple testing adjustment to make the map less susceptible to false positives. The original SiZer of Chaudhuri and Marron used an ad hoc adjustment that left it prone to false positive results. My first contribution to this area was to provide a rigorous multiple testing adjustment based on extreme value theory, substantially improving the validity of SiZer maps. I worked out a "second order" approximation to the extreme value distribution of the Gaussian random process implied by SiZer.
In order to make the SiZer idea practical beyond the original i.i.d. setup one needs to extend it to other models. My next contributions have concentrated on such extensions. For example, we use quantile regression and M-estimation to provide a robust SiZer map capable of dealing with outliers, and we propose a version of SiZer for dependent data. We also provide a tool for rigorous comparison of SiZer maps. This is important because there is currently no rigorous way of deciding under what conditions one of the many versions of SiZer in the literature will outperform the others.
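A rough sketch of how such a map can be assembled is given below. It is a simplified illustration and not the published SiZer procedure: it assumes a Gaussian kernel, a crude local variance estimate, and a plain pointwise normal quantile of 1.96 with no multiple testing adjustment. The names `slope_with_se` and `sizer_map` are hypothetical.

```python
import numpy as np

def slope_with_se(x, y, x0, h):
    """Local linear slope at x0 and a rough standard error for it."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])  # design centred at x0
    XtW = X.T * w
    A_inv = np.linalg.inv(XtW @ X)
    beta = A_inv @ (XtW @ y)
    resid = y - X @ beta
    # Crude kernel-weighted noise variance estimate (assumption of this sketch).
    sigma2 = np.sum(w * resid**2) / np.sum(w)
    # Sandwich covariance of the weighted least-squares coefficients.
    cov = sigma2 * A_inv @ ((X.T * w**2) @ X) @ A_inv
    return beta[1], np.sqrt(cov[1, 1])

def sizer_map(x, y, grid, bandwidths, quantile=1.96):
    """Return an integer matrix: +1 significantly increasing,
    -1 significantly decreasing, 0 no significant slope."""
    out = np.zeros((len(bandwidths), len(grid)), dtype=int)
    for i, h in enumerate(bandwidths):
        for j, x0 in enumerate(grid):
            slope, se = slope_with_se(x, y, x0, h)
            if slope - quantile * se > 0:
                out[i, j] = 1
            elif slope + quantile * se < 0:
                out[i, j] = -1
    return out

# Simulated data and a text rendering of the map, one row per bandwidth.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 300))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)
grid = np.linspace(0.05, 0.95, 40)
bandwidths = np.geomspace(0.03, 0.5, 6)
symbols = {1: "+", -1: "-", 0: "."}
for h, row in zip(bandwidths, sizer_map(x, y, grid, bandwidths)):
    print(f"h={h:.3f} " + "".join(symbols[int(v)] for v in row))
```

Each row of the output corresponds to one bandwidth (scale) and each column to one location, so features that are significant only at some scales become visible at a glance.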
Applications to Engineering and Finance
(funded by NSF grants DMS 1016441 and ECCS 0700559)
Another important active area of my interest is the application of statistics and probability to engineering and finance. Here I have worked on several interesting applications.
The first application I am part of is concerned with the modeling and simulation of extremely large networks using time-dependent partial differential equations (PDEs). In many applications, numerical simulation is the tool of choice for the design and evaluation of large networks. However, the computational overhead associated with direct simulation severely limits the size and complexity of networks that can be studied in this fashion. Performing numerical simulations of large stochastic networks has been widely recognized as a major hurdle to future progress in understanding and evaluating large networks. Our modeling approach is based on asymptotic analysis of a stochastic system that provides a probabilistic description of the network dynamics. This approach appears particularly promising for networks like a wireless ad hoc network.
In this kind of network, nodes send to and receive from other nodes that are within transmission range. Transmission success is affected by interference; e.g., nodes are often so simple that they can receive only one message at a time, and propagation losses are often modeled by a power law dependence on distance. In this situation, we believe that it is possible to formulate the flow of information through the network using hydrodynamic scaling limits for the behavior of the individual packets or particles. The technical details involve defining a probability structure that describes the likelihood of information from one node passing to nearby nodes and then passing from this local probability structure to a diffusion limit description of the motion. The team working on this project consists of an electrical engineer, a probabilist and a PDE specialist. In a series of papers we develop technical tools, provide a rigorous mathematical proof of convergence of the random process modeling a class of communication networks to the limiting PDE, and apply the ideas to various network protocols.
The second application is simultaneous target tracking. The main idea here is to provide an algorithm that, based on limited information from sensors or images, provides with high fidelity a location (or a sequence of locations, called a track) for each of the targets. My student, a collaborator at another institution and I provide a model-based algorithm for tracking multiple moving objects extracted from an image sequence, allowing for birth, death, splitting and merging of targets. This is an important problem which finds numerous applications in science and engineering. We also establish the almost sure convergence of the estimators based on our model to the truth. In other words, we prove that this model should work well under the conditions it was designed for, provided we have enough data. This consistency property of the tracking estimates was verified empirically by numerical experiments. From a somewhat different angle, another group of collaborators and I study the tracking of targets by sensors with limited communication capacity, using information theoretic tools.
The third application is financial data. The presence or absence of jumps in financial time series data, such as stock prices, has been of interest to researchers and practitioners because of the effect that jumps have on the pricing of various financial instruments. We provide a new test for detecting jumps in financial time series. In the future I plan to use the ideas of generalized fiducial inference and non-parametric smoothing to provide new statistical inference procedures for volatility in financial data.
Next, we deal with some aspects of statistical analysis of internet traffic data. In particular, we address a controversy about the distribution of sizes of files transferred over the internet. We also look at the relationship between statistical summaries (such as the sample covariance) of internet traffic data computed by aggregating the data at different resolutions.
Lastly, my students, collaborators at EPA and I provide a statistical algorithm for detecting anthrax from laser induced spectroscopy data. This manuscript is the first of several forthcoming manuscripts dealing with applications to chemical statistics and pharmacology. One of them will develop an algorithm for finding misclassified compounds in chemical libraries.
Analytical Probability
(funded by NSF grant DMS 0504737)
A large portion of my early career was spent working on problems from analytical probability. The main area of my interest was small deviations for Gaussian processes, i.e., understanding the behavior of the probability that a stochastic process X(t) stays during a time interval [0,T] in a small ball of radius ε around the origin. As ε tends to 0, this probability clearly tends to zero, and the question is at what rate. Answers to small deviation questions are used in other fields of mathematics, such as the analysis of non-parametric Bayes estimators, quantization and metric entropy. My collaborators and I made contributions to the theory of small deviations under the L2 norm. We characterize the precise L2 small deviations for a large class of continuous Gaussian processes. We also provide a comparison theorem for the lower tail of sums of positive random variables.
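In symbols, writing \varepsilon for the radius of the small ball, the quantities in question can be expressed as follows. The second display recalls the classical rates for standard Brownian motion W on [0,1], given here only as a well-known reference example and not as one of the results described above.

```latex
% Small deviation probabilities of a process X on [0,T]
% under the sup norm and under the L_2 norm:
\[
  \mathbb{P}\Bigl(\sup_{0\le t\le T} |X(t)| \le \varepsilon\Bigr),
  \qquad
  \mathbb{P}\Bigl(\int_0^T X(t)^2\,dt \le \varepsilon^2\Bigr),
  \qquad \varepsilon \to 0 .
\]
% Classical example: standard Brownian motion W on [0,1].
\[
  \log \mathbb{P}\Bigl(\sup_{0\le t\le 1} |W(t)| \le \varepsilon\Bigr)
  \sim -\frac{\pi^2}{8}\,\varepsilon^{-2},
  \qquad
  \log \mathbb{P}\Bigl(\int_0^1 W(t)^2\,dt \le \varepsilon^2\Bigr)
  \sim -\frac{1}{8}\,\varepsilon^{-2}.
\]
```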
Another area of interest was the analysis of several stochastic search algorithms related to simulated annealing. The convergence of simulated annealing was established by Hajek in the 1980s. Our work uses a simple alternative approach based on the relative frequency of time that simulated annealing spends in the various states of the system. We also provide a particular type of convergence rate not available before.
Next, we provide a definition of a continuous-time ARMA(p,q) process in the case q >= p, in which the process does not exist in the classical sense.
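The notion of relative frequency can be illustrated with a toy simulation. The sketch below is a generic Metropolis-type simulated annealing on a small ring of states, assuming a logarithmic cooling schedule; the energy values, the constant `c` and the function name `anneal_frequencies` are hypothetical choices for illustration and do not reproduce the algorithms analyzed in the papers.

```python
import math
import random

def anneal_frequencies(energy, steps=20000, c=2.0, seed=0):
    """Run simulated annealing on a ring of states.

    Returns the relative frequency of time spent in each state.
    """
    rng = random.Random(seed)
    n = len(energy)
    state = 0
    visits = [0] * n
    for k in range(steps):
        # Logarithmic cooling schedule (an assumption of this sketch).
        temperature = c / math.log(k + 2)
        # Propose a move to one of the two neighbours on the ring.
        proposal = (state + rng.choice((-1, 1))) % n
        delta = energy[proposal] - energy[state]
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state = proposal
        visits[state] += 1
    return [v / steps for v in visits]

# Toy energy landscape: global minimum at state 3, local minimum at state 1.
energy = [3.0, 1.0, 2.0, 0.0, 2.0]
freq = anneal_frequencies(energy)
for s, (e, f) in enumerate(zip(energy, freq)):
    print(f"state {s}: energy={e}, relative frequency={f:.3f}")
```

As the temperature decreases, the chain spends an increasing share of its time in the state with the lowest energy, which is the behavior that a frequency-based convergence analysis quantifies.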
Finally, my dissertation studied the properties of filtrations supporting only purely discontinuous martingales. The main result could be paraphrased as follows: if all martingales have at least one jump then all the information available in the system is included in the timing and sizes of the jumps.
In addition to continuing my work on generalized fiducial inference and applications as indicated above, I have started working on statistical analysis of high-throughput DNA sequencing data. The particular problem of interest is detecting changes in venom in response to environmental pressure. The techniques my student and I are currently applying combine SiZer-type ideas with generalized fiducial and objective Bayesian methods.
A second new direction I am actively pursuing explores a connection between generalized fiducial inference and inverse problems for PDEs. Many physical phenomena are modeled by partial differential equations, e.g., the dissipation of heat in a plate. It is of practical interest to infer, from the distribution of observable quantities, e.g., temperature measurements at several locations, the distribution of the parameters of the differential equation, e.g., the thickness of the plate. The mathematics involved in such inverse problems is very similar to the computation of a generalized fiducial distribution.
copyright © Jan Hannig, designed by JanAltonDesign, 2008