Statistics Seminar at Georgia State University

Fall 2010-Spring 2011, Fridays 3:00-4:00pm, Paul Erdős Conference Room (796 COE)

Organizer: Yichuan Zhao

If you would like to give a talk in Statistics Seminar, please send an email to Yichuan Zhao at

April 22, 3:00-4:00pm, 796 COE, Professor Yuanjia Wang, Department of Biostatistics, Columbia University
Analysis of multilevel genetic and medical studies by penalized splines: marginal versus conditional approach

Abstract: Motivated by a longitudinal genetic study on risk factors of cardiovascular disease and a treatment study to improve clinical outcomes after subarachnoid hemorrhage, we propose flexible models to estimate and test genetic or treatment effects by penalized splines (Eilers and Marx 1996; Ruppert, Wand and Carroll 2003). Both data examples have a hierarchical structure, for example, repeated measures nested within subjects and subjects nested within a family, that needs to be accounted for in the modeling. We propose estimation procedures under the conditional and marginal semiparametric regression frameworks using penalized splines. In addition, we embed the test of a nonparametric function with multilevel data into testing fixed effects and a variance component in a linear mixed effects model with nuisance variance components. Through a spectral decomposition of the residual sum of squares, we provide a fast algorithm to compute the null distribution of the test statistic, which significantly improves computational efficiency compared to the bootstrap. We apply the methods to compute the genome-wide critical value and p-value of a genetic association test in a genome-wide association study (GWAS), where the usual chi-square mixture approximation is conservative and the bootstrap is computationally prohibitive (up to 10^8 simulations). Lastly, we examine asymptotic properties of the penalized spline estimator with clustered data in the small-knots and large-knots scenarios.
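The multilevel machinery is specific to the talk, but the penalized spline (P-spline) building block it rests on is standard. Below is a minimal univariate sketch of the Eilers-Marx approach, assuming a cubic B-spline basis on equally spaced knots and a second-order difference penalty; the knot layout, function name, and smoothing parameter are illustrative choices, not taken from the talk.

```python
import numpy as np
from scipy.interpolate import BSpline

def pspline_fit(x, y, n_knots=20, degree=3, lam=1.0):
    """Fit a P-spline (Eilers & Marx style): B-spline basis + difference penalty."""
    # Equally spaced knots over the range of x, padded by `degree` on each side.
    xl, xr = x.min(), x.max()
    dx = (xr - xl) / n_knots
    knots = xl + dx * np.arange(-degree, n_knots + degree + 1)
    n_basis = len(knots) - degree - 1
    # Design matrix: each column is one B-spline basis function evaluated at x.
    B = np.column_stack([
        BSpline.basis_element(knots[j:j + degree + 2], extrapolate=False)(x)
        for j in range(n_basis)
    ])
    B = np.nan_to_num(B)  # basis_element returns nan outside its support
    # Second-order difference penalty matrix.
    D = np.diff(np.eye(n_basis), n=2, axis=0)
    # Penalized least squares: (B'B + lam * D'D) beta = B'y.
    beta = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
    return B @ beta

# Noisy sine curve: the penalized fit should track the smooth signal.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=200)
fitted = pspline_fit(x, y, lam=1.0)
```

The difference penalty is what distinguishes P-splines from plain regression splines: the knot count can be generous because `lam` controls smoothness.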

April 21, 2:00-3:00pm, 796 COE, Professor Robert Lund, Department of Mathematical Sciences, Clemson University


April 15, 2:00-3:00pm, 796 COE, Professor Amita Manatunga, Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University
Assessing Correspondence Between Scales

Abstract: A fundamental and foremost objective in biomedical research is to establish valid measurements of the clinical disease of interest. Accuracy or validity of such disease instruments is commonly established by assessing similarity between measurements made on a subject at multiple time points or by comparing with some gold standard or best available measurement. Although the foundation of the methodology for addressing accuracy of measurements (agreement methodology) has been laid out, most methods are applicable only when both measurements are made on the same scale. In this presentation, we first discuss existing measures of agreement and their applicability in practice. Next, we introduce a new concept, called "broad sense agreement", which extends the classical framework of agreement to evaluate the capability of interpreting a continuous measurement on an ordinal scale. We present a natural measure for broad sense agreement. Nonparametric estimation and inference procedures are developed for the proposed measure. We also consider longitudinal settings which involve agreement assessments at multiple time points. Simulation studies have demonstrated good performance of the proposed method with small samples.

April 8, 2:00-3:00pm, 796 COE, Professor Getachew Dagne, Department of Epidemiology & Biostatistics, University of South Florida
Multivariate, Multilevel Modeling of Behavioral Interaction Data

Abstract: Preventive intervention programs often target patterns of family interaction as a means of effecting change. This presentation discusses new methods for specifying and modeling theoretically meaningful patterns of interaction based on indicators of the strength of contingency among behaviors in a behavioral sequence, with the long-term goal of providing methods that allow us to characterize key aspects of interaction and study how they mediate the effects of intervention on outcome. Our prior work has established the utility of using univariate multilevel modeling methods to characterize contingency strength between any two individual behavior categories, and has extended this work to model patterns of contingency based on all instances of a single transition during an interaction, such as the transition from a wife's action to a husband's reaction. In this paper we take up the multivariate extension of this model, which is necessary when modeling interaction patterns involving transitions to and from both actors. This occurs, for example, in reciprocal interactions where the behavior of each partner is hypothesized to affect the behavior of the other, leading to cycles of reciprocated behavior. We begin by formulating two bracketing models. The baseline model includes a single random effect for the total number of behaviors in a sequence, while the full association model includes random effects for the full set of behavioral contingencies across both types of transitions (actor to partner, and partner to actor). We then present a series of theory-based models reflecting different ways of characterizing reciprocal interaction, which lie between the baseline and full association models; the bracketing models serve as boundary conditions for testing how well these theory-based models capture important variation in interaction patterns.
We demonstrate this strategy through an analysis of a dataset based on observation and microcoding of the sequential interactions of 254 couples experiencing substantial stress occasioned by loss of employment. The results of these analyses suggest that a construct we label reciprocated valence accounts for a substantial proportion of the variance in the complete set of bidirectional contingencies. In addition, results indicate that more complex models that separate negative and positive reciprocity provide only minimal improvements in accounting for this variation. We conclude by discussing how future extensions will allow us to embed this model within a more comprehensive mediation framework.

April 1, 2:00-3:00pm, 796 COE, Professor Shiferaw Gurmu, Department of Economics, Georgia State University
Semiparametric Estimation of Bivariate Count Data Regression Models with Flexible Correlation Structure

Abstract: This paper develops semiparametric estimation methods for bivariate count data regression models. We develop a series expansion approach in which dependence between count variables is introduced by means of stochastically related unobserved heterogeneity components, and in which, unlike in existing commonly used models, positive as well as negative correlations are allowed. In implementation, we use bivariate expansions based on the generalized Laguerre polynomials. Extensions that accommodate excess zeros, truncated and censored data, and multivariate generalizations are also given. The first application examines the socio-economic and demographic determinants of tobacco use in the context of jointly modeling the daily numbers of smoking tobacco and chewing tobacco uses based on household survey data. We also jointly analyze two health utilization measures, the number of consultations with a doctor and the number of non-doctor consultations. One of the key contributions is in obtaining a computationally tractable closed form of the model with a flexible correlation structure. Monte Carlo experiments and empirical applications confirm that the model performs well relative to existing bivariate models in terms of various statistical criteria and in capturing the range of correlation among dependent variables. This is joint work with John Elder.
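The generalized Laguerre polynomials behind the series expansion form an orthogonal basis with respect to a gamma-type weight on [0, ∞). A minimal numerical check of that orthonormality for the standard α = 0 family (weight e^{-x}), using Gauss-Laguerre quadrature; this illustrates only the basis, not the paper's bivariate construction:

```python
import numpy as np
from scipy.special import eval_laguerre

# Gauss-Laguerre quadrature: nodes x and weights w such that sum(w * f(x))
# approximates the integral of e^{-x} * f(x) over [0, infinity).
x, w = np.polynomial.laguerre.laggauss(50)

# Orthonormality of the Laguerre basis under the exponential weight:
# <L_m, L_n> = delta_{mn} for the alpha = 0 family.
inner_23 = np.sum(w * eval_laguerre(2, x) * eval_laguerre(3, x))  # should be ~0
inner_33 = np.sum(w * eval_laguerre(3, x) * eval_laguerre(3, x))  # should be ~1
```

With 50 quadrature nodes the rule is exact for polynomial integrands up to degree 99, so both inner products are exact up to floating-point roundoff.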

March 25, 2:00-3:00pm, 796 COE, Professor Yichuan Zhao, Department of Mathematics and Statistics, Georgia State University
Empirical likelihood inference for the Cox model with time-dependent coefficients via local partial likelihood

Abstract: The Cox model with time-dependent coefficients has been studied by a number of authors recently. In this talk, we develop empirical likelihood (EL) point-wise confidence regions for the time-dependent regression coefficients via local partial likelihood smoothing. The EL simultaneous confidence bands for a linear combination of the coefficients are also derived based on the strong approximation methods. The EL ratio is formulated through the local partial log-likelihood for the regression coefficient functions. Our numerical studies indicate that the EL point-wise/simultaneous confidence regions/bands have satisfactory finite-sample performance. Compared with the confidence regions derived directly based on the asymptotic normal distribution of the local constant estimator, the EL confidence regions are overall tighter and can better capture the curvature of the underlying regression coefficient functions. Two data sets, the gastric cancer data and the Mayo Clinic primary biliary cirrhosis data, are analysed using the proposed method. This is based on joint work with Yanqing Sun and Rajeshwari Sundaram.
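The EL construction for the Cox setting is involved, but the core profiling step is the same one used for a simple mean (Owen's construction). The sketch below, assuming i.i.d. univariate data rather than survival data, shows how the -2 log EL ratio is profiled via a Lagrange multiplier; the function name and tolerances are illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def el_log_ratio(x, mu):
    """-2 log empirical likelihood ratio for the mean (Owen's construction).

    Profiles the weights p_i maximizing prod(n * p_i) subject to
    sum(p_i * (x_i - mu)) = 0; the maximizer is
    p_i = 1 / (n * (1 + lam * (x_i - mu))) with lam solving the score equation.
    """
    z = x - mu
    if z.min() >= 0 or z.max() <= 0:
        return np.inf  # mu outside the convex hull of the data

    def score(lam):
        return np.sum(z / (1.0 + lam * z))

    # Feasibility (all 1 + lam*z_i > 0) restricts lam to an open interval;
    # shrink it slightly so brentq sees a sign change at finite values.
    eps = 1e-10
    lam = brentq(score, -1.0 / z.max() + eps, -1.0 / z.min() - eps)
    return 2.0 * np.sum(np.log(1.0 + lam * z))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100)
r0 = el_log_ratio(x, x.mean())        # essentially zero at the sample mean
r1 = el_log_ratio(x, x.mean() + 0.5)  # grows as mu moves away from the mean
```

A point-wise confidence region is then the set of mu with `el_log_ratio(x, mu)` below a chi-square quantile; the talk's local partial likelihood version replaces the mean constraint with smoothed score equations.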

March 18, 3:00-4:00pm, 796 COE, Professor Steve Qin, Department of Biostatistics and Bioinformatics, Emory University
Using model-based methods to quantify exon-level gene expression in RNA-seq

Abstract: RNA sequencing (RNA-seq) is a powerful new technology for mapping and quantifying transcriptomes using ultra high-throughput next generation sequencing technologies. Using deep sequencing, gene expression levels of all transcripts including novel ones can be quantified digitally. Although extremely promising, the massive amounts of data generated by RNA-seq, substantial biases, and uncertainty in short read alignment pose daunting challenges for data analysis. In particular, large base-specific variations and between-base correlations make simple approaches, such as those that use averaging to normalize RNA-seq data and quantify gene expressions, ineffective. In this study, we propose a model-based method to characterize base-level read coverage within each exon. The underlying expression level is included as a key parameter in this model. Since our method is capable of capturing local genomic features that affect read coverage profile throughout the exon, we are able to obtain improved quantification of the true underlying expression levels.

March 11, 2:00-3:00pm, 796 COE, Professor Hao Wu, Department of Biostatistics and Bioinformatics, Emory University
Genomic "bump finding"

Abstract: Exploring genomic landscapes of different biological endpoints is an important approach for understanding biological processes and disease etiologies. Examples of these endpoints are sequence composition, DNA methylation, histone modifications, and binding sites for different transcription factors. With the completion of the Human Genome Project and advances in high-throughput technologies, tightly spaced measurements have been collected from linear chromosomes to create unbiased maps at the whole-genome scale. Detecting regions of interest from these data can be categorized as a general "bump finding" problem, where a bump is defined as a genomic location for which the data behave differently from the majority of the genome. In this talk I will present several examples with the general theme of bump finding. In the first example we propose using Hidden Markov Models to search for CpG islands (CGIs) from DNA sequence. The main advantage of our approach over others is that it summarizes the evidence for CGI status as probability scores, which provides flexibility in the definition of a CGI and facilitates the creation of CGI lists for many species. In the second example we construct a hierarchical model to detect transcription factor binding sites (TFBSs) by jointly analyzing multiple related ChIP-chip datasets. This model captures the locational correlation among datasets, which provides a basis for sharing information across experiments. Simulation and real data tests illustrate the advantage of the joint model over strategies that analyze each dataset separately.
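The probability scores in the first example come from standard HMM posterior decoding. A toy two-state sketch with made-up transition and emission parameters (the talk's actual model and parameter values are not given here) shows the mechanics:

```python
import numpy as np

# Toy two-state HMM (hypothetical parameters): state 0 = background,
# state 1 = CpG island. The island state emits C/G with higher probability;
# real CGI models typically condition on dinucleotides.
trans = np.array([[0.99, 0.01],
                  [0.05, 0.95]])            # P(state_t | state_{t-1})
emis = np.array([[0.30, 0.20, 0.20, 0.30],  # background emission: A C G T
                 [0.15, 0.35, 0.35, 0.15]]) # island: enriched in C and G
start = np.array([0.9, 0.1])

def cgi_posterior(seq):
    """Posterior P(island | sequence) at each position, via scaled forward-backward."""
    idx = np.array(['ACGT'.index(b) for b in seq])
    T = len(idx)
    alpha = np.zeros((T, 2)); beta = np.ones((T, 2)); c = np.zeros(T)
    # Forward pass with per-step normalization to avoid underflow.
    alpha[0] = start * emis[:, idx[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emis[:, idx[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    # Backward pass reusing the same scaling constants.
    for t in range(T - 2, -1, -1):
        beta[t] = (trans @ (emis[:, idx[t + 1]] * beta[t + 1])) / c[t + 1]
    post = alpha * beta
    return post[:, 1] / post.sum(axis=1)

# A GC-rich middle stretch should get a higher island posterior than AT-rich ends.
seq = 'ATATTATAAT' + 'GCGCGGCGCC' + 'TTATATAATA'
p = cgi_posterior(seq)
```

The per-position posterior `p` is exactly the kind of probability score the abstract describes: thresholding it at different levels yields different CGI definitions.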

February 7, 10:00-11:00am, 796 COE, Dr. Xin Qi, Department of Epidemiology and Public Health, Yale University
Sparse principal component analysis by choice of norm


January 28, 3:00-4:00pm, 796 COE, Professor Eric Ulm, Department of Risk Management and Insurance, Georgia State University
Optimal Consumption and Allocation in Variable Annuities with Guaranteed Minimum Death Benefits

Abstract: We determine the optimal allocation of funds between the fixed and variable subaccounts in a variable annuity with a GMDB (Guaranteed Minimum Death Benefit) clause featuring partial withdrawals, using a utility-based approach. We apply the Merton method, assuming that individuals allocate funds in order to maximize the expected utility of lifetime consumption, and include the effect on asset allocation of both savings (accumulation) and dissavings (consumption). We also reflect bequest motives by including the utility of the recipient of the policyholder's guaranteed death benefits. We derive the optimal transfer choice by the insured, and furthermore price the GMDB by maximizing the discounted expected utility of the policyholders and beneficiaries through investing dynamically in the fixed account and variable fund and withdrawing optimally.

January 21, 3:00-4:00pm, 796 COE, Jason Ding, Department of Computer Science, Georgia State University
Imbalanced Data Learning and Diversified Ensemble Classifiers

Abstract: Imbalanced data learning is one of the most important problems in the machine learning and data mining area, and it has attracted continuous attention in both academia and industry over the last decade. In this talk, I will introduce the binary version of the imbalanced data learning problem and present an effective ensemble learning framework. First, a formal definition of the imbalanced binary classification problem is introduced, and several real-world examples are provided to show its significance. Then, we thoroughly investigate current research trends in handling the imbalanced learning problem to provide a comprehensive overview of representative studies in this area. After discussing the advantages and weaknesses of existing learning methods, we propose a new effective ensemble framework: Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL). Our strategy combines three popular learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by oppositional data re-labeling. As a meta-learner, DECIDL can utilize general supervised learning algorithms, such as support vector machines, decision trees, and neural networks, as the base learner to build effective ensemble committees. We compare the DECIDL ensemble framework with several existing ensemble imbalanced learning frameworks, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost, on our newly developed benchmark data pool consisting of 30 highly skewed data sets. Extensive experiments with various base learners suggest that our DECIDL framework is comparable with other ensemble methods.
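Of the baseline frameworks mentioned, under-bagging is the simplest to sketch: each committee member sees all minority examples plus an equal-sized random subsample of the majority class, and the committee votes. The toy implementation below uses a decision stump standing in for the stronger base learners named in the abstract; the data, function names, and committee size are all illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(42)

def stump_fit(X, y):
    """Tiny base learner: single feature/threshold split minimizing training error."""
    best = (0, 0.0, 1, np.inf)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, f] - thr) > 0, 1, 0)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (f, thr, sign, err)
    return best[:3]

def stump_predict(model, X):
    f, thr, sign = model
    return np.where(sign * (X[:, f] - thr) > 0, 1, 0)

def under_bagging(X, y, n_members=11):
    """Each member trains on all minority points + an equal-sized majority subsample."""
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    members = []
    for _ in range(n_members):
        sub = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sub])
        members.append(stump_fit(X[idx], y[idx]))
    return members

def vote(members, X):
    """Majority vote over the committee."""
    frac = np.mean([stump_predict(m, X) for m in members], axis=0)
    return (frac >= 0.5).astype(int)

# Skewed toy data: 200 majority points around 0, 20 minority points around 2.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
ensemble = under_bagging(X, y)
```

Because each member is trained on a balanced sample, the committee avoids the trivial majority-class classifier that plain training on the skewed data tends to produce.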

December 2, 2:00-3:00pm, 796 COE, Professor Hulin Wu, Department of Biostatistics and Computational Biology, University of Rochester School of Medicine and Dentistry
Two Cultures of Statistical Research: Statistical Inference for Empirical Models vs. Mechanism Models

Abstract: Traditional statistical inference is usually based on the assumption of empirical models for the data, such as linear, nonlinear, nonparametric, and semiparametric models for continuous data, generalized linear models for binary or discrete data, and proportional hazards regression models for survival data. Another class of statistical inference is purely based on algorithmic models such as neural nets and decision trees to solve the black box problem in the real world. Statistical research in this mainstream culture tries to perform inference while minimizing the use of knowledge about the mechanism behind the data. However, the research system and the data-generation mechanism, which can be described by mathematical models, in particular dynamic models such as differential equations, are usually known or partially known in the real world. Statistical inference and research for mechanism-based models are very sparse, but badly needed. Thus, a new culture of statistical research for mechanism models needs to be established. I'll illustrate and outline the statistical research, and its importance, for mechanism-based differential equation models by our group and others. Statistical methods and theories for differential equation models are illustrated via experimental data from infectious disease research, such as HIV and influenza research.

November 19, 2:00-3:00pm, 796 COE, Professor Mengling Liu, Division of Biostatistics, School of Medicine, New York University
Cox Regression Model with Time-Varying Coefficients in Nested Case-Control Studies

Abstract: The nested case-control (NCC) design is a cost-effective sampling method to study the relationship between a disease and its risk factors in epidemiologic studies. NCC data are commonly analyzed using the Thomas partial likelihood approach under Cox's proportional hazards model with constant covariate effects. In this talk, I will present an extension, the Cox regression with time-varying coefficients, in NCC studies and an estimation approach based on a kernel-weighted Thomas partial likelihood. Both simulation studies and an application to the NCC study of breast cancer in the New York University Women's Health Study are used to illustrate the usefulness of the proposed methods. Furthermore, I will discuss another extension, the Cox regression with nonlinear covariate effects, and issues regarding different techniques to handle these two different models in NCC studies.

November 12, 3:00-4:00pm, 796 COE, Professor Yi Zhao, Department of Marketing, Georgia State University
Consumer Learning in a Turbulent Market Environment: Modeling Consumer Choice Dynamics in the Wake of a Product-Harm Crisis

Abstract: This paper empirically studies consumer choice behavior in the wake of a product-harm crisis. A product-harm crisis creates consumer uncertainty about product quality. In this paper, the authors develop a model that explicitly incorporates the impact of such uncertainty on consumer behavior. The authors assume that consumers are uncertain about the mean product quality level and learn about product quality through the signals contained in use experience and the product-harm crisis. The authors also assume that consumers are uncertain about the precision of the signals in conveying product quality and update their perception of the precision of such signals over time upon their arrival. To study the possible impact of a product-harm crisis on consumers' sensitivities to price, quality, and risk, the authors also allow these model parameters to be different before, during, and after the product-harm crisis. The model is estimated by Bayesian methods for a scanner panel dataset that includes consumer purchase history before, during, and after a product-harm crisis that hit the peanut butter division of Kraft Foods Australia in June 1996. The proposed model fits the data better than the standard consumer learning model in marketing, which assumes consumers are uncertain about the product quality level but that the precision of information in conveying product quality is known to consumers. This study also provides substantive insights on consumers' behavioral choice responses to a product-harm crisis. Finally, the authors conduct counterfactual experiments based on the estimation results and provide insights to managers on crisis management.

November 5, 2:00-3:00pm, 796 COE, Professor Wei Wu, Department of Statistics, Florida State University
Towards Summary Statistics in the Function Space of Neural Spike Trains

Abstract: Statistical inference is essential in analyzing neural spike trains in computational neuroscience. Current approaches have followed a general inference paradigm in which a parametric probability model is often used to characterize the temporal evolution of the underlying stochastic processes. To capture the overall variability and distribution in the space of spike trains directly, we focus on a data-driven approach where statistics are defined and computed in the function space in which individual spike trains are viewed as points. To this end, we first develop a parametrized family of metrics that takes into account different warpings in the time domain and generalizes several currently used spike train distances. These new metrics are essentially penalized L^p norms, involving appropriate functions of spike trains, with penalties associated with time-warping. In particular, when p = 2, we present an efficient recursive algorithm, termed the Matching-Minimization algorithm, to compute the sample mean of a set of spike trains with arbitrary numbers of spikes. The proposed metrics as well as the mean spike train ideas are demonstrated using simulations as well as an experimental recording from the motor cortex. It is found that all these methods achieve desirable performance, and the results support the success of this novel framework.
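Among the "currently used spike train distances" the new metrics generalize, the Victor-Purpura cost-based distance is a standard example with a short dynamic-programming implementation. A sketch (the function name and the cost parameter `q` follow the usual conventions for that distance, not notation from the talk):

```python
import numpy as np

def vp_distance(s1, s2, q=1.0):
    """Victor-Purpura spike-train distance: minimal cost of turning one train
    into the other, where deleting or inserting a spike costs 1 and moving a
    spike by dt costs q*|dt|. Computed by dynamic programming, like edit distance."""
    n, m = len(s1), len(s2)
    G = np.zeros((n + 1, m + 1))
    G[:, 0] = np.arange(n + 1)  # delete every spike of s1
    G[0, :] = np.arange(m + 1)  # insert every spike of s2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i, j] = min(
                G[i - 1, j] + 1,                                  # delete s1[i-1]
                G[i, j - 1] + 1,                                  # insert s2[j-1]
                G[i - 1, j - 1] + q * abs(s1[i - 1] - s2[j - 1]), # shift a spike
            )
    return G[n, m]

# Identical trains are at distance 0; one extra spike costs 1; a small time
# shift of a matched spike costs q times the shift.
a = [0.1, 0.5, 0.9]
b = [0.1, 0.5, 0.9, 1.3]
d_same = vp_distance(a, a)
d_extra = vp_distance(a, b)
```

The parameter `q` sets the time scale: as `q` grows the metric approaches a pure spike-count comparison plus matching, which is one reason a parametrized family of time-warping-penalized metrics, as in the talk, is attractive.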

October 29, 3:00-4:00pm, 796 COE, Professor Jing Wang, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago
On Determination of Linear Components in Additive Models

Abstract: Additive models have been widely used in nonparametric regression, mainly due to their ability to avoid the "curse of dimensionality". When some of the additive components are linear, the model can be further simplified and higher convergence rates can be achieved for the estimation of these linear components. In this paper, we propose a testing procedure for the determination of linear components in nonparametric additive models. We adopt the penalized spline approach for modelling the nonparametric functions, and the test is a chi-square-type test based on finite-order penalized spline estimators. The limiting behavior of the test statistic is investigated. To obtain critical values for finite sample problems, we use resampling techniques to establish a bootstrap test. The performance of the proposed tests is studied through simulation experiments and a real-data example.

October 22, 3:00-4:00pm, 796 COE, Professor Rusty Tchernis, Department of Economics, Georgia State University
On the Estimation of Selection Models when Participation is Endogenous and Misclassified

Abstract: This paper presents a Bayesian analysis of the endogenous treatment model with misclassified treatment participation. Our estimation procedure utilizes a combination of data augmentation, Gibbs sampling, and Metropolis-Hastings to obtain estimates of the misclassification probabilities and the treatment effect. Simulations demonstrate that the proposed Bayesian estimator accurately estimates the treatment effect in light of misclassification and endogeneity.

October 15, 3:00-4:00pm, 796 COE, Professor Nelson Chen, Department of Biostatistics and Bioinformatics & Biostatistics Shared Core of Winship Cancer Institute, Emory University
A Novel Toxicity Scoring System Treating Toxicity Response as a Quasi-Continuous Variable in Phase I Clinical Trials

Abstract: In most current Phase I designs, including the standard 3+3 design, the Continual Reassessment Method (CRM), and Escalation With Overdose Control (EWOC), the toxicity response of a patient is treated coarsely as a binary indicator (yes vs. no) of dose limiting toxicity (DLT), although a patient usually has multiple toxicities and a lot of useful toxicity information is discarded. For the first time in the literature, we establish a novel toxicity scoring system that treats toxicity response as a quasi-continuous variable and utilizes all toxicities of a patient. Our toxicity scoring system consists of generally accepted and objective components (a logistic function, grade and type of toxicity, and whether the toxicity is a DLT), so that it is relatively objective. Our system can transform current Phase I designs that treat toxicity response as a binary indicator of DLT into new designs treating toxicity response as a quasi-continuous variable, by replacing the binary indicator of DLT and the Target Toxicity Level (TTL) of current designs with a Normalized Equivalent Toxicity Score (NETS) and a Target NETS (TNETS), respectively. The transformed designs will improve the accuracy of the Maximum Tolerated Dose (MTD) estimate and the efficiency of the trial. As an example, we couple our system with EWOC to develop a new design called Escalation With Overdose Control using Normalized Equivalent Toxicity Score (EWOC-NETS). Simulation studies and an application to real trial data demonstrate that EWOC-NETS can treat toxicity response as a quasi-continuous variable, fully utilize all toxicity information, and improve the accuracy of the MTD and the efficiency of Phase I trials. User-friendly EWOC-NETS software is under development and will be available in the future.

October 8, 2:00-3:00pm, 796 COE, Professor Wenbin Lu, Department of Statistics, North Carolina State University
Variable Selection for Linear Transformation Models

Abstract: Semiparametric linear transformation models have received much attention due to their high flexibility in modeling survival data. However, the problem of variable selection for linear transformation models has been less studied, partially because a convenient loss function is not readily available in this context. In this talk, we propose a simple yet powerful approach to achieve both sparse and consistent estimation for linear transformation models. The main idea is to derive a profiled score from the martingale-based estimating equations of Chen et al. (2001), construct a loss function based on the profiled score and its variance, and then minimize the loss subject to a shrinkage penalty. Under regularity conditions, we show that the resulting estimator is consistent for both model estimation and variable selection. Furthermore, the estimated parametric terms are asymptotically normal and can achieve higher efficiency than that yielded by the estimating equations. For computation, we suggest a one-step approximation algorithm which can take advantage of LARS and build the entire solution path efficiently. Performance of the new procedure is illustrated through numerous simulations and real data applications.

September 24, 1:30-3:00pm, 796 COE, Professor Jun Han, Department of Mathematics and Statistics, Georgia State University
Distribution-free Estimators of Variance Components for Multivariate Linear Mixed Model

Abstract: Non-iterative, distribution-free, unbiased estimators of variance components, including the minimum norm quadratic unbiased estimator and the method of moments estimator, are derived for the multivariate mixed model. A general cluster-wise covariance and a same-member-only response-wise covariance are assumed. Some properties of the proposed estimators, such as unbiasedness and existence, are discussed, and related computational issues are addressed. A simulation study is conducted to compare the proposed estimators with the Gaussian (restricted) maximum likelihood estimator in terms of bias and mean square error. An application to gene expression family data is presented to illustrate the proposed estimators.

September 17, 2:00-3:00pm, 654 COE, Professor Ying Guo, Department of Biostatistics and Bioinformatics, Emory University
A general probabilistic model for group independent component analysis and its estimation methods

Abstract: Independent component analysis (ICA) has become an important tool for analyzing data from functional magnetic resonance imaging (fMRI) studies. ICA has been successfully applied to single-subject fMRI data. The extension of ICA to group inferences in neuroimaging studies, however, is challenging due to the unavailability of a pre-specified group design matrix and the uncertainty in between-subjects variability in fMRI data. We present a general probabilistic ICA (PICA) model that can accommodate varying group structures of multi-subject spatio-temporal processes. An advantage of the proposed model is that it can flexibly model various types of group structures in different underlying neural source signals and under different experimental conditions in fMRI studies. A maximum likelihood method is used for estimating this general group ICA model, and we propose two EM algorithms to obtain the ML estimates. The first is an exact EM algorithm, which provides an exact E-step and an explicit noniterative M-step. The second is a variational approximation EM algorithm, which is computationally more efficient than the exact EM. We conduct simulation studies to evaluate the performance of the proposed methods, and an fMRI data example is used to illustrate their application.

September 3, 2:00-3:00pm, 796 COE, Professor Ruiyan Luo, Department of Mathematics and Statistics, Georgia State University
Bayesian Hierarchical Models in Proteomics Studies

Abstract: Data produced from complex biological processes are often not amenable to simple statistical methods. Bayesian approaches provide a natural framework for untangling such problems through incorporation of our understanding of biological processes and the data generation process. In this talk, I will describe the development of Bayesian hierarchical models for the following two proteomics problems.

iTRAQ data. iTRAQ (isobaric Tags for Relative and Absolute Quantitation) is a technique that allows simultaneous quantitation of proteins in multiple samples. However, ignoring the common nonrandom missingness will lead to biased estimation of protein expression levels. To reduce such bias, we construct a Bayesian hierarchical model-based method and model the nonrandom missingness of peptide data with a logistic regression, which relates the missingness probability for a peptide to the expression level of the protein that produces the peptide. We assume that the measured peptide intensities are affected by both protein expression levels and peptide-specific effects. The values of these two effects across experiments are modeled as random effects. Simulation results suggest that such estimates have smaller bias than those estimated from ANOVA models or fold changes.

Pathway inference. Simultaneous measurements of multiple protein activities at the single cell level provide much richer information on signaling networks. With measurements of protein activities under different experimental conditions, we propose a Bayesian hierarchical modeling framework for signaling network reconstruction. We model the existence of an association between two proteins both at the overall level across all experiments and at each individual experimental level, from which we infer the pairs of proteins that are associated and their causal relations. This approach can effectively pool information from different interventional experiments. Simulation results demonstrate the superiority of the hierarchical approach.