Invited talks

Title: A Criterion for Protecting Privacy in Surveys and Its Attainment via Randomized Response

Author: Tapan Nayak, George Washington University, USA

Abstract: Randomized response (RR) methods have long been suggested for protecting respondents' privacy in statistical surveys. However, how to set and achieve privacy protection goals has received inadequate attention. We consider the view that a privacy mechanism should ensure that no intruder will gain much new information about any respondent from his or her response. We give a general development and analysis of this view. To formalize the idea, we say that a privacy breach occurs when an intruder's prior and posterior probabilities about a property of a respondent, denoted p and p*, respectively, satisfy p* < l(p) or p* > u(p), where l and u are two given functions. An RR procedure protects privacy if it does not permit any privacy breach. We explore effects of (l, u) on the resultant privacy demand, and prove that it is precisely attainable only for certain (l, u). This result is used to define a canonical strict privacy protection criterion, and give practical guidance on the choice of (l, u). Then, we characterize all RR procedures that satisfy any specified privacy requirement. We compare data utility of all privacy preserving RR procedures using the sufficiency of experiments concept and identify the class of all admissible procedures. In practice, these results should be helpful in choosing an appropriate RR procedure. Finally, we establish an optimality property of a commonly used RR method.
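As a toy numerical illustration of the breach criterion (not the canonical criterion constructed in the talk), consider a Warner-type design in which a respondent answers truthfully with probability q; the intruder's posterior follows from Bayes' rule, and the breach condition can be checked against illustrative bounds (l, u):

```python
def posterior(p, q):
    # Intruder's posterior P(A | "yes") under a Warner-type design in which a
    # respondent answers truthfully about the sensitive trait A with
    # probability q and lies with probability 1 - q.
    return q * p / (q * p + (1 - q) * (1 - p))

def breach(p, q, l, u):
    # A privacy breach occurs when the posterior p* escapes the band [l(p), u(p)].
    p_star = posterior(p, q)
    return p_star < l(p) or p_star > u(p)

# Illustrative bounds (an arbitrary choice for this sketch): the posterior
# may not move by more than a factor of 2 in either direction.
l = lambda p: 0.5 * p
u = lambda p: min(1.0, 2.0 * p)

print(breach(0.3, 0.9, l, u))  # True: weak randomization leaks information
print(breach(0.3, 0.6, l, u))  # False: heavier randomization keeps p* in the band
```

The trade-off visible even in this sketch is the one the abstract formalizes: tighter (l, u) bands force heavier randomization, which in turn reduces data utility.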

Title: Bayesian Variable Selection with Application to High Dimensional EEG Data by Local Modeling

Author: Dipak K. Dey (Joint with Shariq Mohammed), Department of Statistics, University of Connecticut

Abstract: Owing to immense technological advances, we very often encounter high-dimensional data. Any set of measurements taken at multiple time points for multiple subjects leads to data of more than two dimensions (a matrix of covariates for each subject). In this talk, we present a Bayesian method for binary classification of subject-level responses by building binary regression models using latent variables along with the well-known spike-and-slab priors. We also study scaled normal priors on the parameters, as they cover a large family of distributions. Due to the computational complexity, we build many local (at different time points) models and make predictions using the temporal structure between the local models. We perform variable selection for each of these local models. If the variables are locations, then the variable selection can be interpreted as spatial clustering. We show the results of a simulation study and also present the performance of these models on multi-subject neuroimaging (EEG) data.

Title: Causal trees and Wigner’s semicircle law

Author: Ian W. McKeague, Department of Biostatistics, Mailman School of Public Health, Columbia University

Abstract: Numerous aspects of standard model particle physics can be explained by a suitably rich algebra acting on itself. Such an approach has been proposed by Furey (2015) using the tensor product of the four normed division algebras over the real numbers. I will discuss some statistical aspects of large causal tree diagrams that combine freely independent elements in such an algebra. Wigner’s semicircle law (as arises in random matrix theory) will be shown to emerge as the limit of a normalized sum-over-paths of positive elements assigned to the edges of trees. This result is established in the setting of non-commutative (quantum) probability. Trees with classically independent positive edge weights (random multiplicative cascades) were originally proposed by Mandelbrot as a model displaying the fractal features of turbulence. The novelty of the present work is the use of non-commutative (free) probability in order to allow the edge weights to take values in an algebra.
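The semicircle limit itself is easy to see numerically in the classical random-matrix setting (a generic illustration of Wigner's law, not the free-probability tree construction of the talk):

```python
import numpy as np

# Eigenvalues of a large symmetric Gaussian (GOE-type) matrix, scaled so that
# off-diagonal entries have variance 1/n, concentrate on [-2, 2] with the
# semicircle density f(x) = sqrt(4 - x^2) / (2 * pi).
rng = np.random.default_rng(0)
n = 1000
a = rng.standard_normal((n, n))
h = (a + a.T) / np.sqrt(2 * n)   # symmetrize and normalize
eig = np.linalg.eigvalsh(h)

print(eig.min(), eig.max())      # both close to the spectral edges -2 and 2
```

A histogram of `eig` against the density above makes the convergence visually apparent already at n = 1000.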

Title: Accelerating Monte Carlo Markov Processes

Author: Chii-Ruey Hwang, Institute of Mathematics, Academia Sinica, Taipei, Taiwan

Abstract: Monte Carlo Markov processes have been widely used to approximate an underlying probability distribution or the expectation of a statistic under that distribution. The evaluation of the approximation depends on various criteria, e.g., the asymptotic variance, the spectral gap, the convergence exponent in the variational norm, etc. Worst-case analysis, average-case analysis, uniform comparison, and antisymmetric perturbations are considered. Related problems will be discussed.

Title: Geometric Statistics for High-Dimensional Data Analysis

Author: Snigdhansu Chatterjee, Director, Inst for Research in Statistics and its Applications (IRSA), School of Statistics, University of Minnesota

Abstract: We present a scheme of studying the geometry of high-dimensional data to discover patterns in it, using minimal parametric distributional assumptions. Our approach is to define multivariate quantiles and extremes, and develop a method of center-outward partial ordering of observations. We formulate methods for quantifying relationships among observed variables, thus generalizing the notions of regression and principal components. We propose tests for linear relations between variables in many dimensions using the geometric properties of the data, thus paving a way for checking whether recent developments involving Gaussian assumptions or sparsity of relations are applicable. We devise geometric algorithms for detection of outliers in high dimensions, classification and supervised learning. Examples of the use of the proposed methods will be provided. This is joint work with several students.

Title: Individualized Multi-directional Variable Selection

Author: Annie Qu, Department of Statistics, University of Illinois Urbana-Champaign

Abstract: In this talk, we propose an individualized variable selection approach to select different relevant variables for different individuals. In contrast to conventional model selection approaches, the key component of the new approach is to construct a separation penalty with multi-directional shrinkages including zero, which facilitates individualized modeling to distinguish strong signals from noisy ones. As a byproduct, the proposed model identifies subgroups among which individuals share similar effects, and thus improves estimation efficiency and personalized prediction accuracy. Another advantage of the proposed model is that it can incorporate within-subject correlation for longitudinal data. We provide a general theoretical foundation under a double-divergence modeling framework where the number of subjects and the number of repeated measurements both go to infinity, and which therefore involves high-dimensional individual parameters. In addition, we present the oracle property for the proposed estimator to ensure its optimal large sample property. Simulation studies and an application to HIV longitudinal data are presented to compare the new approach to existing penalization methods. This is joint work with Xiwei Tang.

Title: Invertibility and condition number of sparse random matrices

Author: Anirban Basak

Abstract: I will describe our work that establishes an analogue of von Neumann's conjecture on the condition number, the ratio of the largest to the smallest singular value, for sparse random matrices. Non-asymptotic bounds on the extreme singular values of large matrices have numerous uses in geometric functional analysis, compressed sensing, and numerical linear algebra. The condition number often serves as a measure of stability for matrix algorithms. Based on simulations, von Neumann and his collaborators conjectured that the condition number of a random square matrix of dimension \(n\) is \(O(n)\). During the last decade, this conjecture was proved for dense random matrices.

Sparse matrices are abundant in statistics, neural networks, financial modeling, electrical engineering, and wireless communications. Results for sparse random matrices had been unknown and require completely new ideas due to the presence of a large number of zeros. We consider a sparse random matrix with entries of the form \(\xi_{i,j} \delta_{i,j}, \, i,j=1,\ldots,n\), such that \(\xi_{i,j}\) are i.i.d.~with zero mean and unit variance and \(\delta_{i,j}\) are i.i.d.~Ber\((p_n)\), where \(p_n \downarrow 0\) as \(n \to \infty\). For \(p_n < \frac{\log n}{n}\), this matrix becomes non-invertible, and hence its condition number equals infinity, with probability tending to one. In this talk, I will describe our work showing that the condition number of such sparse matrices (under certain assumptions on the moments of \(\{\xi_{i,j}\}\)) is \(O(n^{1+o(1)})\) for all \(p_n > \frac{\log n}{n}\), with probability tending to one, thereby establishing the optimal analogous version of von Neumann's conjecture on the condition number for sparse random matrices.
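The matrix model in the abstract is simple to simulate; the sketch below (an illustration, not part of the proof) generates the sparse matrix \(\xi_{i,j}\delta_{i,j}\) with \(p_n\) just above the \(\frac{\log n}{n}\) threshold and computes its condition number from the singular values:

```python
import numpy as np

rng = np.random.default_rng(1)

def condition_number(n, p):
    # Entries xi_{ij} * delta_{ij}: i.i.d. standard normal times Bernoulli(p).
    m = rng.standard_normal((n, n)) * (rng.random((n, n)) < p)
    s = np.linalg.svd(m, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

n = 400
p = 2 * np.log(n) / n   # sparsity just above the log(n)/n invertibility threshold
print(condition_number(n, p))  # finite, and of polynomial size in n
```

Below the threshold (e.g. `p = 0.5 * np.log(n) / n`) an all-zero row becomes likely, the smallest singular value vanishes, and the ratio blows up, matching the dichotomy described above.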

This talk is based on a sequence of joint works with Mark Rudelson.

Title: Adjustments of Rao’s Score Test for Distributional and Local Parametric Misspecifications

Author: Anil Kumar Bera, Professor of Economics, College of LAS, Adjunct Professor of Finance, College of Business, Adjunct Professor of Agricultural and Consumer Economics, College of ACES, University of Illinois

Abstract: Rao's (1948) seminal paper introduced a fundamental principle of testing based on the score function, and the score test has locally optimal properties. When the assumed model is misspecified, it is well known that Rao's score (RS) test loses its optimality. A model could be misspecified in a variety of ways. In this paper, we consider two kinds of misspecification: distributional and parametric. In the first case, the assumed probability density function differs from the data generating process. Kent (1982) and White (1982) analyzed this case and suggested a modified version of the RS test that involves an adjustment of the variance. In our parametric misspecification, the dimension of the parameter space of the assumed model does not match that of the true one. Using the distribution of the RS test under this situation, Bera and Yoon (1993) developed a modified RS test that is valid under local parametric misspecification. This involves adjustment of both the mean and variance of the standard RS test. This paper considers the joint presence of distributional and parametric misspecification and develops a modified RS test that is valid under both types of misspecification. Earlier modified tests under misspecification can be obtained as special cases of the proposed test. We provide three examples to illustrate the usefulness of the suggested test procedure. In a Monte Carlo study, we demonstrate that the modified test statistics have good finite sample properties.

Title: Asymptotic behavior of common connections in social networks

Author: Bikramjit Das, Pillar of Engineering Systems and Design, Singapore University of Technology and Design

Abstract: Studies regarding the generation and structure of social network connections have been ubiquitous. Understanding relationships in the web informs how we can optimally disseminate important information, be it the occurrence of an extreme event (flood, disease, etc.), the release of a new drug, or the availability of a new job, as exemplified by Twitter, Facebook and LinkedIn in the past decade.

Power-law tail behavior of the degree distribution has been observed both empirically and under broad scale-free model assumptions on dynamic networks. Does the power-law phenomenon also appear when we look at the number of common nodes (friends) shared by two specific individuals in the network? We observe that, under a linear preferential attachment growth assumption, a variety of growth behaviors arises. We exhibit our findings using simulated data.

Title: Graph limits under respondent-driven sampling

Author: Adrian Roellin, Department of Statistics and Applied Probability, Faculty of Science, National University of Singapore

Abstract: Consider a big, unknown network of individuals, and consider the following sampling procedure. An initial individual is asked for a few “referrals” in the network, that is, for other individuals the initial individual is connected to. This procedure is iteratively repeated with the referrals in the same way. After many such referrals, how does the sampled network compare to the underlying network? We investigate this sampling procedure in the context of the theory of so-called “graphons”, which appear as limiting objects in graph limit theory. This is joint work with Siva Athreya.

Title: Penalized Estimating Function Approach for Analyzing Durations in Financial Data

Author: Nalini Ravishanker, Department of Statistics, University of Connecticut

Abstract: Accurate modeling of patterns in inter-event durations is of interest in several applications, such as high-frequency financial data, and is important for capturing valuable information that facilitates decision-making. We describe fast and accurate methods for fitting models to long time series of durations under least restrictive assumptions using penalized martingale estimating functions. We also discuss an online approach for detecting breaks in stochastic properties of these time series.

Title: An Integrated framework for analyzing Spatially Correlated Functional Data

Author: Surajit Ray

Abstract: Datasets observed over space and time have become increasingly important due to their many applications in different fields such as medicine, public health, the biological sciences, environmental science and imaging. Both spatiotemporal methods and functional data analysis techniques are used to model and analyse these types of data, taking the spatial and temporal aspects into account. In this talk we will present an integrated framework for modelling and analysing functional data which are spatially correlated. In particular, we wish to integrate existing approaches, identify gaps in methods for analyzing a wide variety of spatially correlated functional data, and provide the practitioner with objective choices to identify the best method to analyze their data.

Title: Hierarchical hidden Markov models for detecting aberrant methylation

Author: Mayetri Gupta

Abstract: DNA methylation is an important epigenetic mechanism for controlling gene expression, silencing and genomic imprinting in living cells. Aberrant methylation has been associated with a variety of diseases, including cancer, and with critical biological phenomena, such as aging. Recent developments in ultra-high-throughput sequencing technologies allow for the rapid and inexpensive sequencing of billions of bases in human and other genomes, promoting in-depth understanding of the workings of biological systems. At the same time, however, the fast accumulation of massive amounts of sequencing data poses significant challenges in modelling and analysis. High-throughput sequencing methods to study DNA methylation include direct sequencing of sodium bisulfite-treated DNA (BS-Seq). Several software tools for pre-processing and alignment of BS-Seq data have recently been published; however, most methods for analysing the resulting methylation profiles and detecting differentially methylated regions (DMRs) are relatively primitive, relying on smoothing techniques that tend to have high false discovery rates or may be biased by experimental artefacts. In this work, we develop a Bayesian statistical framework and methodology for the identification of differential patterns of DNA methylation between different groups of cells, focusing on a study of human ageing. Our approach develops and extends a class of Bayesian hierarchical hidden Markov models (HMMs) that can accommodate various degrees of dependence among the sequence-level measurements, and can be adapted to both single base-pair level and lower-resolution data. Strong observed correlations between normal and senescent (ageing) methylation profiles are modelled through an extra layer of auxiliary latent variables that are assumed to be generated from a bivariate normal distribution.
We demonstrate how our proposed HMM-based methods significantly improve correct prediction rates over several existing methods for DMR detection. Finally, we present genome-wide results obtained from all human chromosomes, illustrating how our findings can help in understanding phenotypic changes associated with human ageing.

This is joint work with Tushar Ghosh, Neil Robertson, John Cole, and Peter Adams.

Title: CV, ECV, and Robust CV designs for replications under a class of linear models in factorial experiments

Author: Subir Ghosh, University of California, Riverside, USA

Abstract: A class of linear models is considered for describing the data collected from an experiment. Any two models have some common as well as uncommon parameters. To discriminate between any two models, the uncommon parameters play a major role. A common variance (CV) design is proposed for collecting the data so that all the uncommon parameters are estimated with as similar variances as possible in all models. The variance equality for a CV design is attained exactly when there is one uncommon parameter for any two models within the class. A new concept, ‘‘Robust CV designs for replications,’’ allowing the possibility of replicated observations, is introduced. Conditions are presented for a CV design having no replicated observations to be robust for general replicated observations. A CV design having no replicated observations is always robust for any equally replicated observations. In the class of linear models considered for factorial experiments, the common parameters for all models correspond to the general mean and main effects, and the other parameters correspond to two-factor interactions. Two general CV designs are presented for three-level factorial experiments. Examples of Efficient CV (ECV) designs as well as Robust CV designs for general replicated observations are also presented. A simple illustrative example shows that the complete 2 × 3 factorial design is not a CV design; the condition on the replications of each run that turns it into one is then obtained.

Title: Fast sampling with Gaussian scale-mixture priors in high-dimensional regression

Author: Bani K. Mallick with Anirban Bhattacharya and Antik Chakraborty, Texas A&M University

Abstract: We propose an efficient way to sample from a class of structured multivariate Gaussian distributions. The proposed algorithm only requires matrix multiplications and linear system solutions. Its computational complexity grows linearly with the dimension, unlike existing algorithms that rely on Cholesky factorizations with cubic complexity. The algorithm is broadly applicable in settings where Gaussian scale mixture priors are used on high-dimensional parameters. Its effectiveness will be illustrated through a high-dimensional regression problem with a horseshoe prior on the regression coefficients. Other potential applications will be discussed.
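For concreteness, here is a sketch of the sampler along the lines of the authors' algorithm (Bhattacharya, Chakraborty and Mallick, 2016) as we understand it: to draw theta ~ N(Sigma Phi' alpha, Sigma) with Sigma = (Phi'Phi + D^{-1})^{-1} and diagonal D, only an n x n linear system is solved, so the cost is O(n^2 p) rather than the O(p^3) of a Cholesky factorization of Sigma^{-1}:

```python
import numpy as np

def sample_structured_gaussian(phi, d, alpha, rng):
    """Draw theta ~ N(mu, Sigma) with Sigma = (phi' phi + diag(d)^{-1})^{-1}
    and mu = Sigma phi' alpha, where phi is n x p and d holds the diagonal
    of D. Only an n x n system is solved, so the cost is linear in p."""
    n, p = phi.shape
    u = rng.standard_normal(p) * np.sqrt(d)   # u ~ N(0, D)
    v = phi @ u + rng.standard_normal(n)      # v ~ N(0, phi D phi' + I_n)
    w = np.linalg.solve(phi * d @ phi.T + np.eye(n), alpha - v)
    return u + d * (phi.T @ w)
```

In the horseshoe regression setting of the abstract, `d` would hold the local-times-global scale parameters and `alpha` the (scaled) response, but those identifications are our reading of the setup, not code from the talk.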

Title: Optimal Bayesian classification for high dimensional data

Author: Subhashis Ghoshal, North Carolina State University

Abstract: Classification of items into one of two or more given classes based on auxiliary measurements is a fundamental problem of statistical decision making in the face of uncertainty. Professor Mahalanobis made a pioneering contribution in this field by introducing his famous D^2-distance between populations, quantifying the difficulty involved in a classification problem. Linear and quadratic discriminant analysis provide optimal model-based classification rules, which require estimation of the precision matrix of a multivariate normal population. Modern data often involve high-dimensional measurements, making accurate estimation of the precision matrix difficult, and hence compromising the accuracy of classification rules. However, accurate estimation of a precision matrix in high dimension is possible under a sparsity assumption that many off-diagonal entries of the precision matrix are zero, which corresponds to conditional independence between the corresponding variables given the others. We consider a Bayesian approach to classification by inducing sparsity through a shrinkage prior on the Cholesky decomposition of the precision matrix. We show that the posterior for the precision matrix contracts at the optimal rate and that the resulting misclassification error of the Bayes classifier converges to that of the oracle Bayes classifier. In simulation studies we demonstrate good performance of the proposed Bayesian method. We apply the method to several real data sets. The talk is based on joint work with Xingqi Maggie Du.

Title: Estimation and inference for the causal mediation proportion

Author: Donna Spiegelman, Departments of Epidemiology and Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA

Abstract: In epidemiology, public health and social science, mediation analysis is often undertaken to investigate the extent to which the effect of a risk factor on an outcome of interest is mediated by other covariates. The identification of one or more plausible mediators can strengthen causal inference by confirming hypothesized mechanisms of action. A pivotal quantity of interest in such an analysis is the mediation proportion. A common method for estimating it, termed the difference method, compares estimates from models with and without the hypothesized mediator. However, rigorous methodology for estimation and statistical inference for this quantity has not previously been available. We formulate the problem for the Cox model and generalized linear models, and utilize a data duplication algorithm together with a generalized estimating equations approach for estimating the mediation proportion and its variance. We further consider the assumption that the same link function holds for the marginal and conditional models, a property which we term g-linkability. We show that our approach is valid whenever g-linkability holds, exactly or approximately, and present results from an extensive simulation study to explore the finite sample properties. The methodology is illustrated by an analysis of pre-menopausal breast cancer incidence in the Nurses' Health Study. User-friendly, publicly available software implementing these methods can be downloaded from my website (SAS) or from CRAN (R).

Title: Astronomical datacubes: Analysis based on Mahalanobis distance

Author: G. Jogesh Babu, The Pennsylvania State University, USA.

Abstract: Major telescopes, both current and planned ones, produce datacubes as their primary data products. In addition to RA and Dec (location in the sky) for the sources, the third dimension provides Velocity. In order to exploit the full informational content of the data (i.e., find fainter sources), one needs to dig deeper. This is relevant for the construction of past time histories for any detected event (e.g., a source may have been flickering just under the detection threshold, but multiple weak detections add up to a statistically significant one). A straight co-addition washes out the transients in the data. To retain signal that exists in just a subset of images, better algorithms are needed. In this presentation the issue of faint signal detection in astronomical datacubes using Mahalanobis distance at pixel level will be addressed.
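As a hypothetical illustration of a pixel-level Mahalanobis statistic (the detection statistic used in the talk may differ), one can score each sky pixel by the Mahalanobis distance of its velocity spectrum from the mean background spectrum, so that multiple weak channel-level detections accumulate into one significant score instead of being washed out by straight co-addition:

```python
import numpy as np

def pixel_mahalanobis(cube):
    """cube: array of shape (n_v, n_x, n_y), i.e. velocity x RA x Dec.
    Score each sky pixel by the squared Mahalanobis distance of its velocity
    spectrum from the mean spectrum, using the covariance estimated across
    pixels (an illustrative faint-source detection statistic)."""
    nv, nx, ny = cube.shape
    spectra = cube.reshape(nv, -1).T                  # one row per sky pixel
    mu = spectra.mean(axis=0)
    cov = np.cov(spectra, rowvar=False) + 1e-6 * np.eye(nv)  # ridge for stability
    diff = spectra - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return d2.reshape(nx, ny)                         # large values: candidates
```

A source flickering just under the per-channel threshold produces small deviations in several velocity channels at the same (RA, Dec); the quadratic form sums their evidence, which is the "dig deeper" idea described above.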

Title: Multi-Sample Adjusted U-Statistics that Account for Confounding Covariates

Author: Somnath Datta, University of Florida, USA

Abstract: Multi-sample U-statistics encompass a wide class of test statistics that allow the comparison of two or more distributions. U-statistics are especially powerful because they can be applied to both numeric and non-numeric data. However, when comparing the distribution of a variable across two or more groups, observed differences may be due to confounding covariates. For example, in a case-control study, the distribution of exposure in cases may differ from that in controls entirely because of variables that are related to both exposure and case status and are distributed differently among case and control participants. We propose to use individually-reweighted data (i.e., using the stratification score for retrospective data or the propensity score for prospective data) to construct adjusted U-statistics that can test the equality of distributions across two (or more) groups in the presence of confounding covariates. Asymptotic normality of our adjusted U-statistics is established and a closed form expression of their asymptotic variance is presented. The utility of our approach is demonstrated through simulation studies, as well as in an analysis of data from a case-control study conducted among African-Americans, comparing whether the similarity in haplotypes (i.e., sets of adjacent genetic loci inherited from the same parent) occurring in a case and a control participant differs from the similarity in haplotypes occurring in two control participants.
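The reweighting idea can be sketched with a Mann-Whitney-type kernel (an illustrative kernel and weighting, not the specific adjusted U-statistic of the talk): each pair of observations is weighted by the product of individual weights derived from the stratification or propensity score.

```python
import numpy as np

def weighted_mann_whitney(x, y, wx, wy):
    """Weighted two-sample U-statistic with kernel 1{x_i < y_j}:
    U = sum_{i,j} wx_i * wy_j * 1{x_i < y_j} / (sum_{i,j} wx_i * wy_j).
    With inverse-propensity-type weights this adjusts the comparison of the
    two groups for confounding covariates (an illustrative construction)."""
    ind = (np.asarray(x)[:, None] < np.asarray(y)[None, :]).astype(float)
    w = np.outer(wx, wy)
    return (w * ind).sum() / w.sum()
```

With all weights equal this reduces to the classical Mann-Whitney proportion; unequal weights downweight pairs that are over-represented because of the confounders.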

Title: Renewal for Hawkes processes with self-excitation and inhibition

Author: Viet Chi Tran, Université des Sciences et Technologies de Lille, France.

Abstract: We consider Hawkes processes on the positive real line exhibiting both self-excitation and inhibition. Each point of the Hawkes process impacts the intensity of the random point process by the addition of a signed reproduction function. The case of a non-negative reproduction function corresponds to self-excitation; it has been largely investigated in the literature and is well understood. In particular, there then exists a cluster representation of self-excited Hawkes processes which allows one to apply results known for continuous-time age-structured Galton-Watson trees to these random point processes. In the case we study, the cluster representation is no longer valid, and we use renewal techniques. We establish limit results for Hawkes processes with signed reproduction functions, notably generalizing exponential concentration inequalities proved by Reynaud-Bouret and Roy (2007) for non-negative reproduction functions. An important step is to establish the existence of exponential moments for the distribution of renewal times of M/G/1 queues that appear naturally in our problem. This is work in progress with M. Costa, C. Graham, P. Reynaud-Bouret and L. Marsalle.
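For readers unfamiliar with these processes, the well-understood self-exciting case is easy to simulate by Ogata's thinning algorithm; the sketch below uses an exponential reproduction function (the signed kernels of the talk, which allow inhibition, need a more careful treatment than this):

```python
import numpy as np

def hawkes_exp(mu, alpha, beta, t_max, rng):
    """Simulate a Hawkes process on [0, t_max] with baseline rate mu and
    reproduction function h(t) = alpha * exp(-beta * t), alpha >= 0,
    via Ogata's thinning algorithm."""
    events, t = [], 0.0
    while True:
        # Between events the intensity is non-increasing, so its current
        # value bounds it from above until the next candidate point.
        lam_bar = mu + alpha * sum(np.exp(-beta * (t - s)) for s in events)
        t += rng.exponential(1.0 / lam_bar)       # candidate point
        if t > t_max:
            return np.array(events)
        lam = mu + alpha * sum(np.exp(-beta * (t - s)) for s in events)
        if rng.random() <= lam / lam_bar:         # accept with prob lam/lam_bar
            events.append(t)
```

For alpha/beta < 1 the process is subcritical and the long-run event rate is mu / (1 - alpha/beta); the cluster representation mentioned above views each event's offspring as a Galton-Watson branching process.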

Title: Bayesian Delensing of the Cosmic Microwave Background

Author: Ethan Anderes, University of California, Davis

Abstract: In this talk we develop the first algorithm able to jointly compute the maximum a posteriori estimate of the Cosmic Microwave Background (CMB) temperature and polarization fields, the gravitational potential by which they are lensed, and cosmological parameters such as the tensor-to-scalar ratio, \(r\). This is an important step towards sampling from the joint posterior probability function of these quantities, which, assuming Gaussianity of the CMB fields and lensing potential, contains all available cosmological information and would yield theoretically optimal constraints. Attaining such optimal constraints will be crucial for next-generation CMB surveys like CMB-S4, where limits on \(r\) could be improved by factors of a few over currently used sub-optimal quadratic estimators. The maximization procedure described here depends on a newly developed lensing algorithm, which we term LenseFlow, and which lenses a map by solving a system of ordinary differential equations. This description has conceptual advantages, such as allowing us to give a simple non-perturbative proof that the lensing determinant is equal to unity in the weak-lensing regime. The algorithm itself maintains this property even on pixelized maps, which is crucial for our purposes and unique to LenseFlow as compared to other lensing algorithms we have tested. It also has other useful properties, such as that it can be trivially inverted (i.e. delensing) for the same computational cost as the forward operation, and can be used to compute the lensing Jacobian.

Title: An adaptable generalization of Hotelling's \(T^2\) test in high dimension

Author: Debashis Paul, University of California, Davis.

Abstract: We propose a two-sample test for detecting the difference between mean vectors in a high-dimensional regime based on a ridge-regularized Hotelling's \(T^2\). To choose the regularization parameter, a method is derived that aims at maximizing local power within a class of local alternatives. We also propose a composite test that combines the optimal tests corresponding to a specific collection of local alternatives. Weak convergence of the stochastic process corresponding to the ridge-regularized Hotelling's \(T^2\) is established under the assumption of sub-Gaussianity of the observations, and it is used to derive the cut-off values of the proposed test. The performance of the proposed test procedure is illustrated through an application to a breast cancer data set where the goal is to detect the pathways with different DNA copy number alterations across breast cancer subtypes.
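The test statistic itself is straightforward to compute; the sketch below gives the ridge-regularized \(T^2\) for a fixed regularization parameter (the data-driven, power-maximizing choice of the parameter and the composite test described in the abstract are omitted):

```python
import numpy as np

def ridge_hotelling_t2(x, y, lam):
    """Ridge-regularized two-sample Hotelling T^2:
    T^2(lam) = (n1*n2/(n1+n2)) * dbar' (S + lam*I)^{-1} dbar,
    where dbar is the difference of sample means and S is the pooled
    sample covariance. x, y: data matrices with one row per observation."""
    n1, p = x.shape
    n2, _ = y.shape
    dbar = x.mean(axis=0) - y.mean(axis=0)
    s = ((n1 - 1) * np.cov(x, rowvar=False) +
         (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    return (n1 * n2 / (n1 + n2)) * dbar @ np.linalg.solve(s + lam * np.eye(p), dbar)
```

The ridge term keeps the statistic well defined when p exceeds the sample size, where the pooled covariance is singular and the classical \(T^2\) breaks down.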

Title: Extremes of log-correlated Gaussian fields

Author: Rishideep Roy, IIM Bangalore, India.

Abstract: Extreme values and entropic repulsion for two-dimensional discrete Gaussian free fields are of significant interest and have been the subject of many recent works. Our work is on a general class of Gaussian fields with logarithmic correlations, of which the discrete Gaussian free field in dimension 2 is a particular example. We will first establish tightness for this field from the correlation structure. The work also involves defining a general class of models with assumptions on the covariance structure at microscopic and macroscopic levels which are sufficient to ensure convergence in distribution of the maximum, after appropriate centering.