Statistical Sciences Unit, ISI Delhi

Upcoming Seminars

Bayesian Multi-Kernel Gaussian Process Modeling for Nonlinear Multi-Omics Integration

Bhargob Kakoty, Dept of Biostatistics and Health Data Science, University of Minnesota

Date : 20th January, 2026

Abstract: Linear models often may not capture complex, nonlinear associations between outcomes and features. Additionally, data from multiple platforms are now collected on the same individuals, requiring methods to integrate multi-platform data. To address these issues, we propose Np-BIVP, a Bayesian nonparametric framework that uses Gaussian Processes to integrate multi-platform data with multiple kernels. It allows for simultaneous variable selection for each data platform and accurate prediction of the outcome. We also establish a theoretical result on the identifiability of the kernel weights, ensuring that our method can measure the importance of each data platform in predicting the outcome. We further present an extension of our method, called Np-BIVPR, that integrates prior biological information and cross-platform regulatory networks between two omics data platforms. We conduct a realistic simulation study to show that our method outperforms others in variable selection and achieves accurate prediction for both continuous and survival outcomes. Finally, we apply our method to a Glioblastoma Multiforme (GBM) cancer study from The Cancer Genome Atlas (TCGA) and discover biologically relevant biomarkers associated with cancer survival time.

Past Seminars

Nonparametric regression of spatio-temporal data using infinite-dimensional covariates

Rishideep Roy , School of Mathematics, Statistics and Actuarial Sciences, University of Essex

6th January, 2026

Abstract: In spatio-temporal analysis, we often record data at specific time intervals but with varying spatial locations between these timepoints. We propose a conditional model to analyze such spatio-temporal data that accommodates the dependencies alongside stationary temporal-level explanatory variables, which may be infinite-dimensional. Because of the absence of a mixing-type dependence condition in this case, which is typically required by the existing studies, we consider a weaker geometric moment contraction (GMC) condition on the covariates. In this work, we obtain nonparametric point estimates of the mean and covariate functions of such a regression model which we then show to be statistically consistent. We also obtain a simultaneous confidence interval of the mean function using a central limit theorem for the proposed estimator.


Some probability inequalities with applications to statistical inference

Arun Kuchibhotla, Department of Statistics and Data Science, Carnegie Mellon University.

6th January, 2026

Abstract: In this talk, I will present a few probabilities inequalities related to order statistics of independent data and provide applications to statistical inference in some regular and irregular problems. The first probability inequality concerns the probability that an arbitrary random variable is bounded by the order statistics of an independent sample with a potentially different distribution. This has some interesting applications to the performance of bootstrap and subsampling with a fixed number of resamples. The second probability inequality concerns the probability of order statistics of a unimodal distribution covering the population mode. Some interesting implications of this methodology for inference in irregular problems will be discussed. Finally, some extensions to dependent data will be presented. This represents a combination of works with Manit Paul, Larry Wasserman, and Sivaraman Balakrishnan.


Regularizing Extrapolation in Causal Inference

Harsh Parikh, Dept of Biostatistics at Yale University

5th January, 2026

Abstract: Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this work, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel “bias-bias-variance” tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest.


Wasserstein-Cramer-Rao Theory of Unbiased Estimation

Bodhisattva Sen, Columbia University

22nd December, 2025

Abstract: The quantity of interest in the classical Cramer-Rao theory of unbiased estimation (i.e., the Cramer-Rao lower bound, exact efficiency in exponential families, and asymptotic efficiency of maximum likelihood estimation) is the variance, which represents the instability of an estimator when its value is compared to the value for an independently-sampled data set from the same distribution. In this paper we are interested in a quantity which represents the instability of an estimator when its value is compared to the value for an infinitesimal additive perturbation of the original data set; we refer to this as the “sensitivity” of an estimator. The resulting theory of sensitivity is based on the Wasserstein geometry in the same way that the classical theory of variance is based on the Fisher-Rao (equivalently, Hellinger) geometry, and this insight allows us to determine a collection of results which are analogous to the classical case: a Wasserstein-Cramer-Rao lower bound for the sensitivity of any unbiased estimator, a characterization of models in which there exist unbiased estimators achieving the lower bound exactly, and a guarantee that Wasserstein projection estimators achieve the lower bound asymptotically. We use these results to treat many statistical examples, sometimes revealing new optimality properties for existing estimators and other times revealing new estimators. This is joint work with Nicolas Garcia Trillos (U Wisconsin) and Adam Jaffe (Columbia) and is based on the paper: https://arxiv.org/pdf/2511.07414.


A Casual Introduction to Causal Inference

Ambarish Chattopadhyay, ISI, Kolkata

24th and 26th November, 2025

Causal inference is a fundamental area of scientific investigation focused on understanding cause-and-effect relationships from data. Causal questions arise in numerous fields of study, including medical, social, behavioral, political, computer, and environmental sciences, as well as in education, law, engineering, marketing, history, and language studies. For example: Does this new drug cure a particular fatal disease? Does increasing the minimum wage reduce employment? Does exposure to political advertising on social media influence voting behaviour in elections? Does long-term exposure to air pollution increase the risk of COVID-19 deaths? Does this new feature on Netflix attract more subscribers? These are all causal questions, as they all pertain to the effect caused by some intervention (e.g., a new drug) on some outcome (e.g., mortality). Answering these causal questions, however, is not straightforward. In traditional Statistics courses, we learn the mantra: “Correlation is not causation.” This is one of the core methodological challenges of causal inference: causal relationships cannot be established from association alone. But when is an association causation? When is it not causation? What does causation even mean formally? These talks will provide an introduction to the fundamental concepts and methods used to formalize the notion of causality and draw causal inferences from data. We will outline how to formulate causal questions, review key approaches for estimating and testing causal effects, and discuss essential theoretical foundations. Along the way, we will also highlight some relevant research directions in causal inference.


Prabrisha Rakshit, IIM Udaipur

Adaptive Proximal Causal Inference with Some Invalid Proxies

30th October 30, 2025

Abstract: Proximal causal inference (PCI) is a recently proposed framework to identify and estimate the causal effect of an exposure on a given outcome, in the presence of hidden confounders for which proxies are available. Specifically, PCI relies on having observed two valid types of proxies; a treatment confounding proxy related to the outcome only to the extent that it is associated with an unmeasured confounder conditional on the primary treatment and mea- sured covariates, and an outcome confounding proxy related to the treatment only through its association with an unmeasured confounder conditional on measured covariates. Therefore, valid proxies must satisfy stringent exclusion restrictions; mainly, a treatment confounding proxy must not cause the outcome, while an outcome confounding proxy must not be caused by the treatment. In order to improve the prospects for identification and possibly the effi- ciency of the approach, multiple proxies will often be used, raising concerns about bias due to a possible violation of the required exclusion restrictions. To address this concern, we introduce necessary and sufficient conditions for identifying causal effects in the presence of many confounding proxies, some of which may be invalid. Specifically, under a canonical prox- imal linear structural equations model, we propose a LASSO-based median estimator of the causal effect of primary interest, which simultaneously selects valid proxies and estimates the causal effect with corresponding theoretical performance guarantees. Despite its strengths, the LASSO-based approach can under certain conditions lead to inconsistent treatment proxy selection. To overcome this limitation, we introduce an adaptive LASSO-based proximal esti- mator, which incorporates adaptive weights to differentially penalize separate treatment proxy coefficients with respect to the `1 penalty. We formally establish that the adaptive estima- tor is √n-consistent for the causal effect, and when a valid outcome-confounding proxy is available, we construct corresponding asymptotically valid confidence intervals for the causal effect. We also extend the approach to the many outcome-confounding proxies setting, some of which may be invalid. All theoretical results are supported by extensive simulation studies. We apply the proposed methods to assess the impact of right heart catheterization on 30-day survival outcomes for critically ill ICU patients, utilizing data from the SUPPORT study.


Statistical inference for some network summary statistics in a sparse SBM framework using dense and sparse network sampling schemes

Anirban Mandal

1st September, 2025

Abstract: In this talk we present methods for statistical inference for three different network summary statistics: subgraph count, clustering coefficient and degree-degree correlation, for a large population network on $N$ nodes, using data collected through a Bernoulli node sampling scheme. Our theoretical results provide valid asymptotic prediction intervals for these population level summary statistics on the basis of their network sampling based counterparts. We consider a model based framework, where the population network is assumed to be generated from a sparse Stochastic Block Model (SBM) with edge probabilities of the order of $N^{-\beta}$, for some $\beta\geq 0$. Nodes are selected independently from the population network with probability $p_N$, where $p_N\sim N^{-\alpha}$, for some $\alpha\in[0,1]$. For $\alpha = 0$, we obtain a dense sampling scheme, and a sparse sampling scheme is obtained for $\alpha\in (0,1]$. Following the initial selection of nodes we study the effects of both induced and ego-centric network formation.

For any fixed and connected graph $H$, we provide sufficient conditions on the sampling and model sparsity parameters $\alpha$ and $\beta$ and the structure of the graph $H$, to ensure asymptotic normality of the sample based subgraph count for both induced and ego-centric cases. When $(\alpha,\beta)$ are on the boundary of the Gaussian limit law domain, we establish a Poisson limit law for the estimated subgraph count in the induced case. We provide a precise description for the range of $(\alpha,\beta)$ values, where the Gaussian and Poisson limit laws arise. In the ego-centric case, we find that a wider range of sparsity values $(\alpha,\beta)$ allow a Gaussian limit law, but on the boundary values of $(\alpha,\beta)$, we find that the limit law can be either Poisson, or a linear combination of independent Poissons. Similarly we establish limiting Gaussianity results for the sample based clustering coefficient and the degree-degree correlation statistics, for both induced and ego-centric approaches. The results are illustrated with a simulation study and a real data analysis.


Adaptive Wavelet Quantile Density Estimation Based on Biased Samples

Hassan Doosti, Macquarie University, Sydney, Australia.

24th April 2025

Abstract: Quantile density functions offer critical insights in fields ranging from economics to biomedical studies. However, their accurate estimation becomes challenging when data is biased or irregularly sampled. This talk presents a comparative evaluation of threshold selection methods—specifically hard and block thresholding—in adaptive wavelet-based quantile density estimation under such conditions. By integrating theoretical analysis grounded in minimax risk over Besov spaces and extensive simulation studies, we demonstrate the effectiveness of block thresholding in achieving optimal convergence rates, even in the presence of bias. Applications to real-world data further highlight the practical advantages of the proposed methodology. The findings provide actionable guidance for statisticians and data scientists dealing with biased data and looking for robust, adaptive estimation strategies.


Three-dimensional statistical inference using z-stacks of two-dimensional images of oral microbiome samples

Suman Majumder, ISRU, ISI Kolkata

16th April 2025

Abstract: The intra- and inter-taxon relationships present between different bacteria in human oral samples has been of interest to researchers for quite some time. The problem is similar to determining interspecies relationships between different animals in a region with some key challenges and restrictions. Spatial analysis of images of saliva or dental plaque samples allows us to quantify these relationships and the attached uncertainties. Such analyses are few in numbers and primarily utilize two-dimensional methods and infer about these relationships in a two-dimensional plane, followed by some sort of meta-analysis to infer over a collection of “slices” from a sample. In this talk, I will talk about challenges of using spatial analysis in microbiome image data, discuss how we can perform three-dimensional joint analysis of these “slices” using two-dimensional spatial isotropic models, and present more challenges we would need to deal with to make the model statistically prudent.


High-dimensional Inference using Random Projections

Minerva Mukhopadhyay, ISRU, ISI Kolkata

3rd April 2025

Abstract: With increasing availability of data, nowadays, we often encounter high-dimensional data from various fields of study. Almost all standard multivariate approaches fail in effectively analysing such data, both theoretically as well as computationally. Dimension reduction techniques are an essential pre-processing step towards dealing with such data. Among existing methods, the random projection ensemble approach has proven to be a promising approach. In this talk, we plan to discuss the fundamentals of random projections and some of its improvements that have been proposed in the literature. We will also explore implementations of random projections in some specific methods of classification and density estimation.


Spectra of random hypergraphs and contractions of random tensors

Soumendu Sundar Mukherjee, SMU, ISI Kolkata

5th December 2024

Abstract: We will discuss some results on the limit spectra of adjacency and Laplacian matrices of r-uniform Erdős-Rényi hypergraphs on n vertices, in an asymptotic regime where r/n → c ∈ [0,1). Unlike the case of Erdős-Rényi random graphs, the entries of hypergraph adjacency matrices exhibit long range correlations, which make spectral analysis more challenging. Using an ANOVA-style representation, we are able to obtain many results, including a Baik-Ben Arous-Péché phase transition for the largest eigenvalue at r = 3: An appropriately scaled largest eigenvalue converges in probability to 2 if r ∈ {2,3}, and to √(r-2) + 1/√(r-2), if r ≥ 4. Time permitting, we will discuss some related results on contractions of random tensors, which show up in the context of tensor PCA and when studying the so-called p-spin spherical spin glass.

This talk will be based on several joint works with Dipranjan Pal and Himasish Talukdar, both PhD students at ISI Kolkata.


MCMC Importance Sampling via Moreau-Yosida Envelopes

Dootika Vats, Department of Mathematics and Statistics, IIT Kanpur

18th November 2024

Abstract: Markov chain Monte Carlo (MCMC) is the workhorse computational algorithm employed for inference in Bayesian statistics. Gradient-based MCMC algorithms are known to yield faster converging Markov chains. In modern parsimonious models, the use of non-differentiable priors is fairly standard, yielding non-differentiable posteriors. Without differentiability, gradient-based MCMC algorithms cannot be employed effectively. Recently proposed proximal MCMC approaches, however, can partially remedy this limitation. These approaches employ the Moreau-Yosida (MY) envelope to smooth the nondifferentiable prior enabling sampling from an approximation to the target posterior. In this work, we leverage properties of the MY envelope to construct an importance sampling paradigm to correct for this approximation error. We establish asymptotic normality of the importance sampling estimators with an explicit expression for the asymptotic variance which we use to derive a practical metric of sampling efficiency. Numerical studies show that the proposed scheme can yield lower variance estimators compared to existing proximal MCMC alternatives.