Talks

Chun-houh Chen, ISSAS

Exploratory Data Analysis of Interval-valued Symbolic Data with Matrix Visualization

Symbolic data analysis (SDA) has gained popularity over the past few years because of its potential for handling data having a dependent and hierarchical nature. Amongst many methods for analyzing symbolic data, exploratory data analysis (EDA; Tukey, 1977) with graphical presentation is an important one. Recent developments of graphical and visualization tools for SDA include zoom star, closed shapes, and parallel-coordinate plots. Other studies project high-dimensional symbolic data into lower-dimensional spaces using symbolic data versions of principal component analysis, multidimensional scaling, and self-organizing maps. Most graphical and visualization approaches for exploring symbolic data structure inherit the advantages of their counterparts for conventional (non-symbolic) data, but also their disadvantages. Here we introduce matrix visualization (MV) for visualizing and clustering symbolic data using interval-valued symbolic data as an example; it is by far the most popular symbolic data type in the literature and the most commonly encountered one in practice. Many MV techniques for visualizing and clustering conventional data are adapted to symbolic data, and several techniques are newly developed for symbolic data. Various examples of data with simple to complex structures are used to illustrate the proposed methods.

Kei Kobayashi, ISM

Hypothesis Testing for the Difference of Dendrograms

In this talk, we propose a novel type of permutation test for dendrogram data with respect to two metrics for measuring the difference between dendrograms. First, the Frobenius norm is used, and consistency and efficiency of the permutation tests are proved. Next, the geodesic distance on a dendrogram space is used; here we rely on the uniqueness of geodesics on a dendrogram space. The proposed permutation tests are applied to the analysis of mental lexicons of English words. The difference in mental lexicons between native and non-native English speakers is tested for each word class.

This work is a collaboration with Mitsuru Orita of Kumamoto University.
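To make the Frobenius-norm variant concrete, here is a minimal sketch of a two-sample Monte Carlo permutation test. It is an illustration only, not the speakers' procedure: it assumes each dendrogram is encoded as a symmetric matrix (for example, of cophenetic distances) and that the test statistic is the Frobenius norm of the difference between the two group-mean matrices. The function names are hypothetical.

```python
import math
import random


def frobenius(a, b):
    """Frobenius norm of the difference between two equal-sized matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def mean_matrix(mats):
    """Entrywise mean of a non-empty list of equal-sized matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]


def permutation_test(group_a, group_b, n_perm=999, seed=0):
    """Return (observed statistic, Monte Carlo permutation p-value).

    Assumed statistic: Frobenius distance between the group-mean matrices.
    """
    rng = random.Random(seed)
    observed = frobenius(mean_matrix(group_a), mean_matrix(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        stat = frobenius(mean_matrix(pooled[:n_a]), mean_matrix(pooled[n_a:]))
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

The geodesic-distance variant would replace `frobenius` with a distance computed on the dendrogram space, which this sketch does not attempt.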

Ayanendranath Basu, ISI

Some Recent Advances in Density-Based Minimum Distance Inference

Density-based minimum distance methods have a natural resistance against model mis-specifications and outliers, and are popular tools in parametric inference. The density power divergence proposal (Basu et al. 1998, Jones et al. 2001) presented a class of useful robust alternatives to the minimum disparity estimation approach. A comprehensive description of this method is provided in Basu et al. (2011). In the present talk we will describe some recent developments in the area of minimum divergence estimation based on the spirit of density power downweighting, discuss how they extend the scope of inference beyond the density power divergence, and in the process demonstrate the limitation of the influence function as a measure of local robustness.

Hsien-Kuei Hwang, ISSAS

Riccati differential equations in applied probability

Riccati equations of the form \[ y'(z) = a(z) y^2(z) + b(z) y(z) + g(z) \] have been encountered sporadically in the applied probability literature.

In this talk, I will give a brief review and then propose a general approach to the asymptotics of their coefficients, which will then be helpful in establishing the limiting properties of the random variables in question. New applications to variants of seating arrangement problems will also be indicated.
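As background to the Riccati equation above, and not as a summary of the speaker's approach, recall the classical linearization: wherever \(a(z)\) is differentiable and non-zero, writing \(y\) in terms of an auxiliary function \(u\) (a symbol introduced here for illustration) turns the Riccati equation into a second-order linear equation.

\[ y(z) = -\frac{u'(z)}{a(z)\,u(z)} \quad\Longrightarrow\quad u''(z) - \left( b(z) + \frac{a'(z)}{a(z)} \right) u'(z) + a(z)\,g(z)\,u(z) = 0. \]

Substituting the expression for \(y\) into both sides of the Riccati equation, the terms in \(u'(z)^2\) cancel and the linear equation remains.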

Debleena Thacker, ISI

Pólya Urn Schemes with Infinitely Many Colors

In this talk we introduce a new type of urn model with countably infinitely many colors indexed by an appropriate infinite set. We mainly consider the indexing set of colors to be the \(d\)-dimensional integer lattice and consider balanced replacement schemes associated with bounded increment random walks on it. We prove central and local limit theorems for the expected configuration of the urn and show that irrespective of the null recurrent or transient behavior of the underlying random walks, the configurations have asymptotic Gaussian distribution after appropriate centering and scaling. We show that the order of any non-zero centering is always \({\mathcal O}\left(\log n\right)\) and the scaling is \({\mathcal O}\left(\sqrt{\log n}\right)\). The rate of convergence for the central limit theorem at time \(n\) will be shown to be of the order \({\mathcal O}\left(\frac{1}{\sqrt{\log n}}\right)\) and bounds similar to the classical Berry-Esseen bound will be derived. Further, we show that for the expected configuration a large deviation principle (LDP) holds with a good rate function and speed \(\log n\).

Joint work with Antar Bandyopadhyay.
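For intuition, the following is a small simulation sketch of one such urn. It rests on assumptions not stated in the abstract: \(d = 1\), a single starting ball of color 0, simple symmetric random walk increments, and a randomized replacement rule in which each draw adds exactly one new ball whose color is the drawn color plus an independent increment. The function name `simulate_urn` is hypothetical.

```python
import random


def simulate_urn(n_steps, seed=0):
    """Simulate an urn with colors indexed by the integers (d = 1).

    Assumed rule: draw a ball uniformly at random, return it, and add one
    new ball whose color is the drawn color plus a +1/-1 step.
    Returns the list of ball colors after n_steps draws.
    """
    rng = random.Random(seed)
    balls = [0]  # the urn starts with a single ball of color 0
    for _ in range(n_steps):
        drawn = rng.choice(balls)
        balls.append(drawn + rng.choice((-1, 1)))
    return balls
```

Under these assumptions the increments have mean zero, so no centering is needed, and one could compare the empirical variance of the colors in `simulate_urn(n)` with \(\log n\) for large \(n\) as an informal check of the \(\sqrt{\log n}\) scaling.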

Siva Athreya, ISI

Dense graph limits under respondent-driven sampling

We consider certain respondent-driven sampling procedures on dense graphs. We show that if the sequence of the vertex-sets is ergodic then the limiting graph can be expressed in terms of the original dense graph via a transformation related to the invariant measure of the ergodic sequence. For specific sampling procedures we describe the transformation explicitly.

Joint work with Adrian Röllin.

Koji Tsukuda, ISM

On \(L^2\) Space Approach to Detect a Parameter Change in an Ergodic Diffusion Process Model

In this presentation, testing for a change in the drift parameters of an ergodic diffusion process model is discussed. For this problem, past studies chose the \( \ell_\infty \) space as the framework for weak convergence of the proposed \(\sup\)-type test statistics, that is, Kolmogorov-Smirnov type statistics. In contrast, we develop an approach based on limit theorems in an \(L^2\) space and propose a weighted integral-type test statistic, that is, an Anderson-Darling type statistic, which is expected to have better power in many cases.

This work is a collaboration with Prof. Y. Nishiyama (Institute of Statistical Mathematics).
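Schematically, the two families of statistics contrasted in this abstract can be written as follows. This is a generic illustration rather than the statistics proposed in the talk: \(U_n(s)\), \(s \in [0,1]\), stands for some change-point process built from the data (a symbol introduced here), and the weight \(1/\{s(1-s)\}\) is the classical Anderson-Darling choice; the weight used by the speakers may differ.

\[ T_n^{\mathrm{KS}} = \sup_{s \in [0,1]} \left| U_n(s) \right|, \qquad T_n^{\mathrm{AD}} = \int_0^1 \frac{U_n(s)^2}{s(1-s)} \, ds. \]

The first is a continuous functional on a space of bounded functions with the supremum norm, while the second is naturally studied through convergence in a (weighted) \(L^2\) space.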

Arvind Ayyer, IISc Bangalore

Connections between Exclusion Processes and Multiclass Queues

Motivated by problems in nonequilibrium statistical physics, we consider a totally asymmetric multispecies exclusion process on a finite one-dimensional lattice with periodic boundary conditions. Physicists in the 90s had found a way to obtain the stationary distribution by a technique known as the "matrix ansatz". In 2006, P. Ferrari and J. Martin explicitly constructed the stationary distribution by using ideas from queueing theory. I will review both approaches to the proof and describe a generalization to the partially asymmetric version.

This is joint work with C. Arita, K. Mallick and S. Prolhac.

Frederick K. H. Phoa, ISSAS

Construction of 2-level and 3-level Definitive Screening Designs

Definitive screening (DS) designs have drawn considerable attention from researchers in the design of experiments because of their good design properties and run-size economy. This paper investigates the structure of both 2-level and 3-level DS designs and suggests theoretically driven approaches to construct these DS designs for any run size. These constructions are generally applicable for any number of factors. The constructed 3-level DS designs are T-optimal and many of them are D-optimal as well, and the rest have high D-efficiencies. A similar situation holds for 2-level DS designs when D-, A- and T-optimalities are considered. The part on 3-level DS designs is joint work with Professor Dennis Lin of Pennsylvania State University. The part on 2-level DS designs is joint work with Professor William Li of the University of Minnesota.

Satoshi Kuriki, ISM

Optimal experimental designs for Fourier and polynomial regressions that minimize volume of tube

Simultaneous confidence bands of a nonlinear regression are constructed by evaluating the volume of a tube about a curve or manifold defined as the trajectory of the regression basis vector (Naiman, 1986). In this talk, we consider optimal experimental designs that minimize the volume of the tube, that is, that attain the narrowest confidence band. In the cases of Fourier and polynomial regressions, the problems are formalized as a minimization problem over the cone of Hankel positive definite matrices, where the objective function to minimize is the volume of the tube expressed in terms of elliptic functions. We show that there exists a group that leaves our problem invariant, and demonstrate that the minimization can be achieved by choosing a cross-section of orbits.

This is joint work with Henry Wynn of the London School of Economics, UK.

Siuli Mukhopadhyay, IIT Bombay

Generalized Multinomial Models

In this talk a family of link functions for the multinomial response model is proposed. The link family includes the multicategorical logistic link as one of its members. Conditions for the local orthogonality of the link and the regression parameters are given. It is shown that local orthogonality of the parameters in a neighbourhood makes the link family location and scale invariant. Simulation studies and a numerical example based on a combination drug study are used to illustrate the proposed parametric link family.

Chen-Hung Kao, ISSAS

Mapping quantitative trait loci under selective genotyping

The selective genotyping approach is known as a cost-effective strategy that reduces genotyping work while maintaining efficiency in detecting quantitative trait loci (QTL). The approach selects individuals with extreme (high and low) phenotypic values for genotyping and leaves the remaining individuals in the sample ungenotyped. In this talk, current methods and our proposed statistical methods for mapping QTL using data from selective genotyping experiments are presented and discussed. The issues in determining critical thresholds for claiming QTL detection under selective genotyping are also discussed. Simulated examples are used for illustration.

Xiaoling Dou, ISM

Functional Clustering of Mouse Ultrasonic Vocalization Data

Mouse ultrasonic vocalizations (USVs) are studied in various fields of science. However, background noise and varied USV patterns in observed signals make complete automatic analysis difficult. We propose a series of methods to cluster nonharmonic mouse USV data automatically. The procedure includes noise reduction, detection of USV calls, transformation of USV calls into functions, and functional clustering. The proposed methods are shown to be useful with two data sets taken from laboratory mice.

Saurabh Ghosh, ISI

Integrating Multiple Phenotypes For Association Mapping

Most clinical end-point traits are governed by a set of quantitative and qualitative precursors and a single precursor is unlikely to explain the variation in the end-point trait completely. Thus, it may be a prudent strategy to analyze a multivariate phenotype vector possibly comprising both quantitative as well as qualitative precursors for association mapping of a clinical end-point trait. The major statistical challenge in the analyses of multivariate phenotypes lies in the modelling of the vector of phenotypes, particularly in the presence of both quantitative and binary traits in the multivariate phenotype vector.

For population-based data, we propose a novel Binomial regression approach that models the likelihood of the number of minor alleles at a SNP conditional on the multivariate phenotype vector using a logistic link function. For family-based data comprising informative trios, we propose a logistic regression method that models the transmission probability of a marker allele from a heterozygous parent conditioned on the multivariate phenotype vector and the allele transmitted by the other parent. In both approaches, the test for association is based on all the regression coefficients. We carry out extensive simulations under a wide spectrum of genetic models and probability distributions of the multivariate phenotype vector to evaluate the powers of our test procedures. We apply the proposed population-based method to analyze a multivariate phenotype comprising homocysteine levels, Vitamin B12 levels and affection status in a study on Coronary Artery Disease and the family-based method to analyze a vector of four endophenotypes associated with alcoholism: the maximum number of drinks in a 24-hour period, Beta 2 EEG Waves, externalizing symptoms and the COGA diagnosis trait in the Collaborative Study on the Genetics of Alcoholism (COGA) project.
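One natural way to write down the population-based model described in this abstract is sketched below. The notation is introduced here for illustration and is an assumption, not taken from the talk: \(X \in \{0, 1, 2\}\) is the number of minor alleles at the SNP, \(Y = (Y_1, \dots, Y_k)\) is the multivariate phenotype vector, and \(\beta_0, \beta\) are regression coefficients.

\[ X \mid Y \sim \mathrm{Binomial}\bigl(2, \, p(Y)\bigr), \qquad \log \frac{p(Y)}{1 - p(Y)} = \beta_0 + \beta^{\top} Y, \qquad H_0 : \beta = 0. \]

Under this reading, a test "based on all the regression coefficients" corresponds to testing \(H_0\) jointly over the \(k\) components of \(\beta\), which accommodates quantitative and binary phenotypes alike because they enter only as covariates.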

Jing-Shiang Hwang, ISSAS

A stepwise regression algorithm for high-dimensional variable selection

We propose a new stepwise regression algorithm with a simple stopping rule for the identification of influential predictors and interactions among a huge number of variables in various statistical models. As in conventional stepwise regression, at each forward selection step a variable is included in the current model if the test statistic of the enlarged model with the predictor against the current model has the minimum p-value among all the candidates and that p-value is smaller than a predetermined threshold. Instead of using conventional information-type criteria, the threshold is determined by a lower percentile of the beta distribution. We conducted extensive simulation studies to evaluate the performance of the proposed algorithm for various statistical models and found it very competitive and robust compared to several popular high-dimensional variable selection methods.
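The abstract does not state which beta distribution is used, so the following is only a sketch of one way such a threshold can arise. It assumes that, under the null, the candidate p-values behave like independent Uniform(0, 1) draws; the smallest of \(m\) such p-values then follows a Beta(1, \(m\)) distribution, whose lower quantile has a closed form. The function names are hypothetical.

```python
def min_pvalue_threshold(n_candidates, level):
    """Lower `level` quantile of Beta(1, n_candidates).

    Beta(1, m) is the distribution of the minimum of m independent
    Uniform(0, 1) p-values; its CDF is 1 - (1 - x) ** m.
    """
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    return 1.0 - (1.0 - level) ** (1.0 / n_candidates)


def forward_step(pvalues, level):
    """One forward-selection step over a dict {candidate name: p-value}.

    Returns the candidate with the smallest p-value if that p-value is
    below the assumed beta-based threshold, otherwise None (stop).
    """
    if not pvalues:
        return None
    best = min(pvalues, key=pvalues.get)
    threshold = min_pvalue_threshold(len(pvalues), level)
    return best if pvalues[best] < threshold else None
```

Because the threshold shrinks as the number of candidates grows, this rule becomes stricter in higher dimensions without any separate multiplicity correction.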

Arindam Chatterjee, ISI

Inference using Adaptive Lasso based residuals

We study a linear model with a large number of covariates. It is shown that under suitable sparsity assumptions, the residuals based on the Adaptive Lasso estimator can provide asymptotically valid inference procedures for the underlying unknown error distribution function.

This is in contrast to existing procedures based on the least squares estimator, which is known to fail when the number of covariates is large compared to the sample size.

(Joint work with S. Gupta and S. N. Lahiri.)

Shota Katayama, ISM

Lasso Penalized Model Selection Criteria for High-Dimensional Multivariate Linear Regression Analysis

Model selection criteria for multivariate linear regression analysis that identify relevant predictors play an important role in biometrics, marketing research, engineering, econometrics and many other related research fields. Recently, high-dimensional data in which the sample size is comparable with, or larger than, the dimension of the multiple responses often appear in these applications, and classical model selection criteria are not applicable to such data. In this talk, we provide two model selection criteria that accommodate high dimensionality by using a Lasso penalized likelihood function. The consistency property is also shown under the framework in which the dimension of the multiple responses goes to infinity while the maximum size of candidate models is of smaller order than the sample size.

Hironori Fujisawa, ISM

Affine Invariant Divergence With Empirical Estimability And Its Applications

In statistical inference, divergences play an important role. An estimator of a parameter can be obtained as the minimizer of a divergence. In this talk, we focus on divergences that are invariant under affine transformations of the data, and obtain an explicit class of such divergences with empirical estimability. It is proved that this class is uniquely determined under some conditions, including affine invariance and empirical estimability. A definition of cross entropy is extended to deal with a broader class of divergences. We also investigate the relation to the Bregman divergence.

This is joint work with Takafumi Kanamori of Nagoya University.

Hsin-Cheng Huang, ISSAS

Regularized Principal Component Analysis for Spatial Data

We consider nonstationary spatial modeling using empirical orthogonal functions (EOFs) based on data observed at p spatial locations with n repeated measurements. Traditionally, EOFs are obtained using approaches related to principal component analysis. However, when data are noisy or n is small, the leading eigenfunctions produced by these methods may lack any spatial structure and have poor physical interpretation. To obtain more precise estimates of eigenfunctions and the spatial covariance function, we propose a regularization approach incorporating smoothness and sparseness of eigenfunctions, which can be applied even when data are observed at irregularly spaced locations. The resulting optimization problem is solved using the alternating direction method of multipliers. Some numerical examples are provided to demonstrate the effectiveness of the proposed method.

Shinsuke Koyama, ISM

Information Gain on Variable Neuronal Firing

The question of how much information can be theoretically gained from variable neuronal firing rate with respect to constant mean firing rate is investigated. For this purpose, we employ the Kullback-Leibler divergence as a measure of information gain. We first give a statistical interpretation of this information in terms of detectability of rate variation: the lower bound of detectable rate variation, below which the temporal variation of firing rate is undetectable with a Bayesian decoder, is entirely determined by this information.

We show that the information depends not only on the variation of firing rates (i.e., signals), but also significantly on the dispersion properties of neuronal firing described by the shape of the interspike interval (ISI) distribution (i.e., noise properties). It is shown that under certain conditions, the gamma distribution attains the theoretical lower bound of the information among all ISI distributions when the coefficient of variation of ISIs is given.

On the basis of these theoretical investigations, we propose a practical method for estimating the information from spike trains, and apply this method to biological spike data recorded from a cortical area.
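As a point of reference for the information measure discussed in this abstract, consider the special case of Poisson firing, which is an assumption made here for illustration; the talk itself treats general ISI distributions. For an inhomogeneous Poisson process with rate \(\lambda(t)\) on \([0, T]\) compared against a homogeneous Poisson process with constant rate \(\mu\) (symbols introduced here), the Kullback-Leibler divergence is

\[ D\bigl(\lambda \,\|\, \mu\bigr) = \int_0^T \left[ \lambda(t) \log \frac{\lambda(t)}{\mu} - \lambda(t) + \mu \right] dt . \]

When \(\mu\) is taken to be the mean rate, \(\mu = \frac{1}{T} \int_0^T \lambda(t)\,dt\), the last two terms integrate to zero and only \(\int_0^T \lambda(t) \log\{\lambda(t)/\mu\}\,dt\) remains, which vanishes exactly when the rate is constant.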

Chen-Hsiang Yeang, ISSAS

Development of nonstandard personalized medicine strategies for cancers with heterogeneous subclones

Cancers are heterogeneous and genetically unstable. Current practice of personalized medicine tailors therapy to heterogeneity between cancers of the same organ type. However, it does not yet systematically address heterogeneity at the single-cell level within a single individual’s cancer or the dynamic nature of cancer due to genetic and epigenetic change as well as transient functional changes. We have developed a mathematical model of personalized cancer therapy incorporating genetic evolutionary dynamics and single-cell heterogeneity, and have examined simulated clinical outcomes. Analyses of an illustrative case and a virtual clinical trial of over 3 million evaluable “patients” demonstrate that augmented (and sometimes counterintuitive) nonstandard personalized medicine strategies may lead to superior patient outcomes compared with the current personalized medicine approach. Current personalized medicine matches therapy to a tumor molecular profile at diagnosis and at tumor relapse or progression, generally focusing on the average, static, and current properties of the sample. Nonstandard strategies also consider minor subclones, dynamics, and predicted future tumor states. Our methods allow systematic study and evaluation of nonstandard personalized medicine strategies. These findings may, in turn, suggest global adjustments and enhancements to translational oncology research paradigms.

Masaya Saito, ISM

Estimation of outer-regional effect on 2009/2010 epidemic in Japan

The influenza epidemic in the 2009/2010 season in Japan was dominated by a single strain, the 2009pdm strain. According to the sentinel observation data of Japan, a single epidemic wave formed the global trend, but multiple small waves superposed on the global trend are also identified. In addition, synchronized abrupt changes in cases are observed in several prefectures. In this talk, the influence of outside areas on the epidemic in each prefecture is evaluated by comparing the data with solutions of the SIR model with a stochastic term.

This is joint work with Seiya Imoto, Rui Yamaguchi, and Satoru Miyano from the Institute of Medical Science, University of Tokyo, and Tomoyuki Higuchi from the Institute of Statistical Mathematics, Japan.
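The abstract does not specify the form of the stochastic term, so the sketch below shows just one possibility: an SIR model in population fractions whose transmission rate is perturbed by white noise, integrated with the Euler-Maruyama scheme. The function name, the parameter names (`beta`, `gamma`, `sigma`) and the placement of the noise are all assumptions made for illustration.

```python
import math
import random


def simulate_sir(beta, gamma, sigma, i0, days, dt=0.1, seed=0):
    """Euler-Maruyama path of an SIR model with a noisy transmission rate.

    State variables are population fractions, so S + I + R stays equal
    to 1 up to rounding. Returns a list of (time, S, I, R) tuples.
    """
    rng = random.Random(seed)
    s, i, r = 1.0 - i0, i0, 0.0
    path = [(0.0, s, i, r)]
    n_steps = int(round(days / dt))
    for k in range(1, n_steps + 1):
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Brownian increment
        # New infections: drift plus the assumed stochastic term, clipped
        # to [0, S] so that no compartment becomes negative.
        infections = min(max(s * i * (beta * dt + sigma * dw), 0.0), s)
        recoveries = min(gamma * i * dt, i)
        s -= infections
        i += infections - recoveries
        r += recoveries
        path.append((k * dt, s, i, r))
    return path
```

Setting `sigma=0.0` recovers a plain Euler discretization of the deterministic SIR model, which gives a baseline against which the effect of the stochastic term on a simulated epidemic curve can be compared.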