Statistics Seminar at Georgia State University

Fall 2009-Spring 2010, Fridays 3:00-4:00pm, Paul Erdos Conference room (796) COE

Organizer: Yichuan Zhao

If you would like to give a talk in Statistics Seminar, please send an email to Yichuan Zhao at

July 2, 2:00-3:00pm, 796 COE, Dr. Carolina B. Baguio, MSU-Iligan Institute of Technology, Iligan City, Philippines

Abstract: Time series modeling frequently involves economic data, which tend to exhibit abrupt changes in trend. These changes may occur due to economic recession, epidemic outbreak, people power revolution, or other political, social, economic, or natural events. An important tool for evaluating these changes is Spline Regression Analysis, a regression method in which the independent variable is partitioned into intervals and separate regression lines are fit within the intervals (segments), joining at the knots. The knots (breakpoints) can be interpreted as critical, safe, or threshold values beyond or below which undesired effects occur, and can therefore be important in decision making. This study investigates the efficiency of Spline Regression as a method for time series modeling when there is an abrupt change in the trend of the data; it is very useful to researchers, especially those dealing with economic analysis and forecasting. The study shows that the RMSE and AIC values of the Linear Spline Regression methods for the original Peso per U.S. Dollar rate data decrease as the number of knots increases, and that Linear Spline models with more knots fit the data better than models with fewer knots, although each added knot makes the model more complicated through additional terms. On the other hand, the RMSE and AIC values of the Quadratic Spline Regression for the original data decrease up to three knots and increase thereafter, meaning that the Quadratic Spline model with three knots has the best fit to the data. Spline Regression methods applied to smoothed data yield smaller RMSE and AIC values than those applied to the original data. The asymptotic efficiency of the Spline Regression models was also investigated using the Block Bootstrapping method.
From the findings of the study, it is recommended that Spline Regression modeling be investigated for time series data with seasonal variations. Moreover, the efficiency of higher-order Polynomial Spline Regression is also worth studying for time series data exhibiting a trend. A Monte Carlo simulation of the Peso per U.S. Dollar rate is also recommended for analyzing the asymptotic behavior of the efficiency of the estimates.
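As a rough illustration of the fitting step described above (my own sketch, not the authors' implementation), the following Python snippet builds a truncated-power basis for a linear spline, fits it by least squares, and shows how a knot placed at a breakpoint captures an abrupt trend change that a straight line cannot; the data and knot location are invented for illustration:

```python
import numpy as np

def linear_spline_design(x, knots):
    """Truncated-power basis for a linear spline: [1, x, (x - k)_+ per knot]."""
    cols = [np.ones_like(x), x]
    cols += [np.clip(x - k, 0.0, None) for k in knots]
    return np.column_stack(cols)

def fit_linear_spline(x, y, knots):
    """Ordinary least-squares fit of the spline; returns (coefficients, RMSE)."""
    X = linear_spline_design(x, knots)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(np.sqrt(np.mean(resid ** 2)))

# toy exchange-rate-like series with an abrupt trend change at t = 50
t = np.arange(100, dtype=float)
y = np.where(t < 50, 40.0 + 0.1 * t, 45.0 - 0.2 * (t - 50))

_, rmse_line = fit_linear_spline(t, y, knots=[])       # straight line, no knot
_, rmse_knot = fit_linear_spline(t, y, knots=[50.0])   # knot at the breakpoint
# the knot lets the fitted slope change at t = 50, so rmse_knot << rmse_line
```

Adding knots always lowers the in-sample RMSE, which is why the abstract weighs fit against model complexity via AIC.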

April 23, 2:00-3:00pm, 796 COE, Professor Hua Liang, Department of Biostatistics and Computational Biology, University of Rochester

Abstract: We explore variable selection for partially linear models when the covariates are measured with additive errors. We propose two classes of variable selection procedures, penalized least squares and penalized quantile regression, using the nonconvex penalized principle. The first procedure corrects the bias in the loss function caused by the measurement error by applying the so-called correction-for-attenuation approach, whereas the second procedure corrects the bias by using orthogonal residuals. The sampling properties for the two procedures are investigated. The rate of convergence and the asymptotic normality of the resulting estimates are established. We further demonstrate that, with proper choices of the penalty functions and the regularization parameter, the resulting estimates perform asymptotically as well as an oracle procedure. Choice of smoothing parameters is also discussed. Finite sample performance of the proposed variable selection procedures is assessed by Monte Carlo simulation studies. We further illustrate the proposed procedures by an application.

April 16, 2:00-3:00pm, 796 COE, Professor Edsel Pena, Department of Statistics, University of South Carolina
Power-Enhanced Multiple Decision Functions Controlling Family-Wise Error and False Discovery Rates

Abstract: Improved procedures, in terms of smaller missed discovery rates (MDR), for performing multiple hypothesis testing with weak and strong control of the family-wise error rate (FWER) or the false discovery rate (FDR) will be presented in this talk. Improvement over existing procedures, such as the Sidak procedure for FWER control and the Benjamini-Hochberg (BH) procedure for FDR control, is achieved by exploiting differences in the powers of the individual tests. The results signal the need to take into account the powers of the individual tests and to have multiple hypothesis decision functions that are not limited to simply using the individual p-values, as is the case, for example, with the Sidak, Bonferroni, or BH procedures. A decision-theoretic framework is utilized, and through auxiliary randomizers the procedures can be used with discrete or mixed-type data or with rank-based nonparametric tests. This is in contrast to existing p-value based procedures, whose theoretical validity is contingent on the uniformity of the p-value statistic under the null hypothesis. The proposed procedures are relevant in the analysis of high-dimensional "large M, small n" data sets arising in the natural, physical, medical, economic, and social sciences, whose generation is accelerated by advances in high-throughput technology, notably, but not limited to, microarray technology. (This is joint work with Joshua Habiger and Wensong Wu.)
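For readers unfamiliar with the baseline being improved upon, here is a minimal sketch of the standard BH step-up procedure (the comparison point, not the speaker's power-enhanced method); the p-values are made up for illustration:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Step-up BH procedure: find the largest rank k with p_(k) <= k*q/m and
    reject the hypotheses with the k smallest p-values; controls the FDR at
    level q for independent test statistics."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

# ten invented p-values; only the two smallest survive the step-up comparison
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(pvals)
```

Note that the decision depends only on the p-values, which is exactly the limitation the talk addresses by bringing individual test powers into the decision functions.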

April 2, 2:00-3:00pm, 796 COE, Professor Daowen Zhang, Department of Statistics, North Carolina State University
Variable Selection in Partial Additive Mixed Models for Longitudinal Data

Abstract: In longitudinal studies with a potentially large number of covariates, investigators are often interested in identifying important variables that are predictive of the response. Suppose we can a priori divide the covariates into two groups: one for which parametric effects are adequate and the other for which nonparametric modeling is required. In this research, we propose a new method to simultaneously select important parametric covariate effects and nonparametric covariate effects in partial additive mixed models for longitudinal data. The proposed method takes advantage of the mixed-effects representation of the smoothing spline estimates of the nonparametric covariate effects and treats the inverse of a smoothing parameter as a variance component in an induced working linear mixed model. The selection of fixed parametric effects and nonparametric effects is achieved by shrinking negligible fixed effects and induced variance components to zero. Simulation studies are conducted to evaluate the performance of the new method, and a real data analysis is used to illustrate its application.

March 31, 2:00-3:00pm, 796 COE, Professor Yuehua Cui, Department of Statistics, Michigan State University


March 26, 3:00-4:00pm, 796 COE, Professor Baozhong Yang, Department of Finance, Georgia State University
A Dynamic Model of Corporate Financing with Market Timing

Abstract: In this paper, I build a dynamic trade-off model of financing with differences in beliefs between the manager and investors. In the model, investors update more readily on earnings announcements than the manager does. The model offers a parsimonious treatment of endogenous financing, payout, and cash policies. It generates a broad set of well-documented empirical facts that are difficult to explain using standard theories. In particular, the model predicts: 1) high stock returns predicting equity issuance, 2) the low debt ratios of firms in the cross-section, 3) the substantial presence of firms with no debt or negative net debt and the fact that zero-debt firms are more profitable, pay larger dividends, and keep higher cash balances than other firms, and 4) the negative relationship between profitability and both book and market leverage ratios. If investors overextrapolate trends in earnings growth, the model also predicts negative/positive long-run abnormal returns following stock issuances/repurchases.

March 25, 10:00-11:00am, Commerce Club, Brown Room, 18th Floor (MBD Distinguished Lecture), Professor Jun Liu, Department of Statistics, Harvard University
Auxiliary Variable MCMC with Applications in Protein Structure Modeling


March 24, 3:00-4:00pm, 106 Classroom South (Distinguished Lecture in Statistics), Professor Jun Liu, Department of Statistics, Harvard University
Bayesian Partition Models for Detecting Interactions

Abstract: Suppose we have N individuals, and for each individual we observe a response vector (Yi1,..., Yiq) and p-dimensional categorical-valued covariates (Xi1,..., Xip). Our goal is to discover which subset of the response variables is influenced by which subset of the covariates. Although the problem is similar to multiple-response regression, our goal is much more ambitious than just finding certain linear relationships. I will present a novel Bayesian partition model based on a set of latent indicator vectors, together with a Markov chain Monte Carlo algorithm, to tackle the problem. I will illustrate the power of the method mainly using examples in genome-wide genetic association studies and in studies of expression quantitative trait loci.

March 19, 2:00-3:00pm, 796 COE (Colloquium), Professor Meijie Zhang, Department of Biostatistics, Medical College of Wisconsin
The additive risk model for estimation of effect of haplotype match in BMT studies

Abstract: In this talk we consider a problem from bone marrow transplant (BMT) studies where there is interest in assessing the effect of haplotype match between donor and patient on overall survival. The BMT study we consider is based on donors and patients that are genotype matched, which therefore leads to a missing data problem. We show how Aalen's additive risk model can be applied in this setting, with the benefit that the time-varying haplo-match effect can be easily studied. This problem has not been considered before, and the standard approach, in which one would use the EM algorithm, cannot be applied for this model because the likelihood is hard to evaluate without additional assumptions. We suggest an approach based on multivariate estimating equations that are solved using a recursive structure. This approach leads to an estimator whose large sample properties can be developed using product-integration theory. Small sample properties are investigated using simulations in a setting that mimics the motivating haplo-match problem.

March 19, 11:00-12:00pm, 796 COE (Colloquium), Professor Lan Xue, Department of Statistics, Oregon State University
Consistent variable selection in additive models

Abstract: A penalized polynomial spline method will be introduced for simultaneous model estimation and variable selection in additive models. The proposed method approximates the nonparametric functions by polynomial splines and minimizes the sum of squared errors subject to an additive penalty on the norms of the spline functions. This approach sets the estimators of certain function components to exactly zero, thus performing variable selection. Under mild regularity conditions, I show that the proposed method estimates the non-zero function components in the model with the same optimal mean square convergence rate as the standard polynomial spline estimators, and correctly sets the zero function components to zero with probability approaching one as the sample size goes to infinity. The theoretical results are well supported by simulation studies. The proposed method is also applied to two real data examples for illustration.

February 19, 3:00-4:00pm, 796 COE, Professor Xu Zhang, Department of Mathematics and Statistics, Georgia State University
A Proportional Hazards Regression Model for the Subdistribution with Right Censored and Left Truncated Competing Risks Data

Abstract: With competing risks failure time data, one often needs to assess the covariate effects on the cumulative incidence probabilities. Fine and Gray proposed a proportional regression model to directly model the subdistribution of a competing risk and developed estimating procedures, based on inverse probability of censoring weighting, for competing risks data subject only to right censoring. Right censored and left truncated competing risks data sometimes occur in biomedical research. In this paper, we study the proportional hazards regression model for the subdistribution of a competing risk with right censored and left truncated data. We adopt a new weighting technique to estimate the parameters in this model and derive the large sample properties of the proposed estimators. To illustrate the application of the new method, we analyze failure time data for children with acute leukemia; in this example, the failure times for the children who had bone marrow transplants were left truncated.

February 5, 3:00-4:00pm, 796 COE, Dr. Man Jin, Associate Program Director, Statistics and Evaluation Center, American Cancer Society
Variable Selection in Canonical Discriminant Analysis for Family Studies

Abstract: In family studies, canonical discriminant analysis can be used to find linear combinations of phenotypes that exhibit high ratios of between-family to within-family variability. But with large numbers of phenotypes, canonical discriminant analysis may over-fit. To estimate the predicted ratios associated with the coefficients obtained from canonical discriminant analysis, two methods are developed: one based on bias correction and the other based on cross-validation. Because cross-validation is computationally intensive, an approximation to it is also developed. Furthermore, these methods can be applied to perform variable selection in canonical discriminant analysis.
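To make the objective concrete, the following sketch (my own illustration, not the speaker's software) computes the first canonical coefficient vector as the leading eigenvector of W^{-1}B, where B and W are the between-family and within-family scatter matrices; the phenotype data and family labels are fabricated:

```python
import numpy as np

def canonical_discriminant(X, labels):
    """First canonical coefficient vector: maximizes the ratio of
    between-family to within-family variability, a'B a / a'W a,
    via the leading eigenvector of W^{-1} B."""
    X = np.asarray(X, dtype=float)
    grand = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))   # between-family scatter
    W = np.zeros((p, p))   # within-family scatter
    for g in set(labels):
        Xg = X[[i for i, l in enumerate(labels) if l == g]]
        d = (Xg.mean(axis=0) - grand)[:, None]
        B += len(Xg) * d @ d.T
        C = Xg - Xg.mean(axis=0)
        W += C.T @ C
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))
    lead = np.argmax(np.real(vals))
    return np.real(vecs[:, lead]), float(np.real(vals[lead]))

# two families: phenotype 1 separates families, phenotype 2 is pure noise
X = [[0.0, 3.0], [0.2, -3.0], [5.0, 2.0], [5.2, -2.0]]
labels = ['a', 'a', 'b', 'b']
a, ratio = canonical_discriminant(X, labels)
# the coefficient vector loads almost entirely on the informative phenotype
```

With many phenotypes and few families this ratio is badly over-optimistic, which is the over-fitting problem the talk's bias-corrected and cross-validated estimates address.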

January 29, 3:00-4:00pm, 796 COE, Professor Daniel Bauer, Department of Risk Management and Insurance, Georgia State University
Modeling the Forward Surface of Mortality

Abstract: Longevity risk constitutes an important risk factor for insurance companies and pension plans. For its analysis, but also for evaluating mortality-contingent structured financial products, modeling approaches that allow for uncertainties in mortality projections are needed. One model class that has attracted interest in applied research as well as among practitioners is forward mortality models, which are defined based on forecasts of survival probabilities, as can be found in generation life tables, and infer dynamics on the entire age/term-structure (or forward surface) of mortality. However, thus far, there has been little guidance on identifying suitable specifications and their properties. The current paper provides a detailed analysis of forward mortality models driven by a finite-dimensional Brownian motion. In particular, after discussing basic properties, we present an infinite-dimensional formulation, and we examine the existence of finite-dimensional realizations for time-homogeneous Gaussian forward models, which are shown to possess important advantages for practical applications.

January 22, 3:00-4:00pm, 796 COE, Professor Jiawei Liu, Georgia State University
On creating a model assessment tool independent of data size and estimating the U statistic variance

Abstract: If viewed realistically, models under consideration are always false. A consequence of model falseness is that for every data generating mechanism, there exists a sample size at which the model failure will become obvious. There are occasions when one will still want to use a false model, provided that it gives a parsimonious and powerful description of the generating mechanism. We introduce a model credibility index from the point of view that the model is false, defined as the maximum sample size at which samples from the model and those from the true data generating mechanism are nearly indistinguishable. The model credibility index is estimated within a subsampling framework, in which a large data set is treated as the population, and subsamples generated from that population are compared with the model at various sample sizes. Exploring the asymptotic properties of the model credibility index is associated with the problem of estimating the variance of U-statistics. An unbiased estimator and a simple fix-up are proposed to estimate the U-statistic variance.

November 20, 2:00-3:00pm, 796 COE (Colloquium), Professor Lijian Yang, Department of Statistics and Probability, Michigan State University
Simultaneous Confidence Band for Sparse Longitudinal Regression Curve

Abstract: Functional data analysis has recently received considerable attention in statistics research, and a number of successful applications have been reported, but there have been no results on inference for the global shape of the mean regression curve. In this paper, an asymptotically simultaneous confidence band is obtained for the mean trajectory curve based on sparse longitudinal data, using piecewise constant spline estimation. Simulation experiments corroborate the asymptotic theory.

November 13, 2:00-3:00pm, 796 COE (Colloquium), Professor Junhui Wang, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago
On Margin Based Semisupervised Learning

Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available along with only a small amount of labeled data. This imposes a great challenge in that it is difficult to achieve good classification performance through the labeled data alone. To leverage unlabeled data for enhancing classification, we introduce a margin based semi-supervised learning method within the framework of regularization, based on an efficient margin loss for unlabeled data, which seeks efficient extraction of the information in unlabeled data for estimating the Bayes rule for classification. In particular, I will discuss three aspects: (1) the idea and methodology development; (2) computational tools; and (3) a statistical learning theory. Numerical examples will be provided to demonstrate the advantage of the proposed methodology against existing competitors. An application to gene function prediction will be discussed.

November 6, 3:00-4:00pm, 796 COE, Professor Yuanhui Xiao, Georgia State University
On Intraclass Correlation Coefficients

Abstract: The intraclass correlation coefficient (ICC) rho is widely used to measure the degree of family resemblance with respect to characteristics such as blood pressure, weight, and height. In this talk the author will discuss several statistical problems regarding ICCs. In particular, the author will present several resampling methods for computing confidence intervals for the common ICC and for testing the homogeneity of ICCs across several populations. The author will also propose a few research topics regarding ICCs.
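As background for the quantity under discussion, a minimal one-way ANOVA estimator of the ICC for balanced family data can be sketched as follows (this is the classical point estimator, not the resampling methods of the talk); the measurements are invented:

```python
def icc_oneway(families):
    """One-way ANOVA estimator of the intraclass correlation for balanced
    family data: rho_hat = (MSB - MSW) / (MSB + (n - 1) * MSW), where n is
    the common family size and MSB/MSW are the between/within mean squares."""
    k = len(families)                 # number of families
    n = len(families[0])              # members per family (balanced design)
    grand = sum(sum(f) for f in families) / (k * n)
    means = [sum(f) / n for f in families]
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((x - m) ** 2
              for f, m in zip(families, means) for x in f) / (k * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)

# two tight, well-separated families -> estimated rho close to 1
rho = icc_oneway([[10.0, 10.1, 9.9], [20.0, 19.9, 20.1]])
```

The point estimate is straightforward; the harder problems of interval estimation and homogeneity testing are where the resampling methods of the talk come in.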

October 30, 2:00-3:00pm, 796 COE (Colloquium), Professor Hongtu Zhu, University of North Carolina at Chapel Hill
Intrinsic Regression Models for Medial Representation and Diffusion Tensor Data

Abstract: In medical imaging analysis and computer vision, there is a growing interest in analyzing various manifold-valued data including 3D rotations, planar shapes, oriented or directed directions, the Grassmann manifold, deformation field, symmetric positive definite (SPD) matrices and medial shape representations (m-rep) of subcortical structures. Particularly, the scientific interests of most population studies focus on establishing the associations between a set of covariates (e.g., diagnostic status, age, and gender) and manifold-valued data for characterizing brain structure and shape differences, thus requiring a regression modeling framework for manifold-valued data. The aim of this talk is to develop an intrinsic regression model for the analysis of manifold-valued data as responses in a Riemannian manifold and their associations with a set of covariates, such as age and gender, in Euclidean space. Because manifold-valued data do not form a vector space, directly applying classical multivariate regression may be inadequate in establishing the relationship between manifold-valued data and covariates of interest, such as age and gender, in real applications. Our intrinsic regression model, which is a semiparametric model, uses a link function to map from the Euclidean space of covariates to the Riemannian manifold of manifold data. We develop an estimation procedure to calculate an intrinsic least square estimator and establish its limiting distribution. We develop score statistics to test linear hypotheses on unknown parameters. We apply our methods to the detection of the difference in the morphological changes of the left and right hippocampi between schizophrenia patients and healthy controls using medial shape description.

October 16, 3:00-4:00pm, 796 COE, Zhouping Li, School of Mathematics, Georgia Institute of Technology
Empirical Likelihood Method For Conditional Value-at-Risk

Abstract: Value-at-Risk is a simple but useful measure in risk management. When a volatility model is employed, the conditional Value-at-Risk is of importance. As ARCH/GARCH models are widely used in modeling volatilities, in this talk we first propose empirical likelihood methods to construct confidence intervals for the conditional Value-at-Risk when the volatility follows an ARCH/GARCH model. We further consider an empirical likelihood-based estimation of the conditional Value-at-Risk in the nonparametric regression model.
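For orientation, the unconditional (historical) Value-at-Risk is just an empirical quantile of the loss distribution; the sketch below shows only this baseline notion, not the talk's empirical likelihood intervals or GARCH conditioning, and the returns and order-statistic convention are my own choices:

```python
from math import ceil

def historical_var(returns, alpha=0.95):
    """Historical (nonparametric) Value-at-Risk: the empirical alpha-quantile
    of losses, i.e. a loss level exceeded with probability roughly 1 - alpha.
    Uses a simple ceil-based order-statistic convention."""
    losses = sorted(-r for r in returns)
    k = min(ceil(alpha * len(losses)), len(losses)) - 1
    return losses[k]

# ten made-up daily returns; the 90% VaR is the 9th smallest loss, 0.04
rets = [-0.05, -0.02, 0.01, 0.0, 0.03, -0.01, 0.02, -0.03, 0.04, -0.04]
var90 = historical_var(rets, alpha=0.9)
```

Conditional VaR replaces this single unconditional quantile with one that scales with the current volatility state, which is why an ARCH/GARCH model is needed.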

October 9, 3:00-4:00pm, 796 COE, Dr. Zhipeng Cai, Mississippi State University
Association Study on Pedigree SNP Data

Abstract: Most association study methods become either ineffective or inefficient when dealing with increasing numbers of SNPs. Suggested by the block-like structure of the human genome, a popular strategy is to use haplotypes to try to capture the correlation structure of SNPs in regions of little recombination. Such a haplotype based association study has significantly reduced degrees of freedom and is able to capture the combined effects of tightly linked causal variants. An efficient rule-based algorithm is presented for haplotype inference from pedigree genotype data under the assumption of no recombination. This zero-recombination haplotyping algorithm is extended to a maximum-parsimony haplotyping algorithm over one whole genome scan that minimizes the total number of breakpoint sites. We show that such a whole genome scan haplotyping algorithm can be implemented in O(m^3 n^3) time in a novel incremental fashion, where m denotes the total number of SNP loci on the chromosome. Extensive simulation experiments using eight pedigree structures that were used previously for association studies showed that the haplotype allele sharing status among the members can be determined deterministically, efficiently, and accurately, even for very small pedigrees.

October 2, 3:00-4:00pm, 796 COE, Professor Yixin Fang, Georgia State University
Some discussion on variable selection in mixed-effects models

Abstract: For model selection in mixed-effects models, Vaida and Blanchard (2005) demonstrated that the marginal Akaike information criterion is appropriate for questions regarding the population, while the conditional Akaike information criterion is appropriate for questions regarding the particular clusters in the data. This paper shows that the marginal Akaike information criterion is asymptotically equivalent to leave-one-cluster-out cross-validation, and that the conditional Akaike information criterion is asymptotically equivalent to leave-one-observation-out cross-validation.

September 25, 2:00-3:00pm, 796 COE (Colloquium), James L. Kepner, PhD, Vice President, Statistics and Evaluation, American Cancer Society, and Adjunct Professor, Department of Biostatistics, Rollins School of Public Health, Emory University
Survey of Exact Methods in Sample Size Determination

Abstract: Discussed are exact one-stage and group-sequential sample size determination methods for one- and two-sample binomial proportion testing problems, methods for the corresponding finite population tests, and simultaneous tests for correlated binomial proportions. Design properties are discussed and new/unpublished results are described. The exact group sequential methods allow early stops only for efficacy, only for futility, or for either efficacy or futility. Sample sizes, levels of significance, and power at fixed points in the research hypothesis parameter space are compared among competing designs, including those derived using asymptotic normal theory methods. Documents provided will include a description of how sample points are placed in the rejection region, simple proofs for each of the three one-sample theorems, tables demonstrating the efficiency of the two-sample designs, a table showing how close the one-sample designs can get to the one-stage uniformly most powerful test in terms of significance and power, and a table demonstrating the remarkable sample size savings when two or more binomial endpoints are tested simultaneously.
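To give a flavor of what "exact" means here, the sketch below (a generic illustration, not the speaker's designs) searches for the smallest one-stage, one-sample design whose exact binomial test has size at most alpha and power at least the target; the rates p0 = 0.2 and p1 = 0.4 and the search cap are arbitrary choices of mine:

```python
from math import comb

def binom_sf(n, p, k):
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def exact_one_sample_design(p0, p1, alpha=0.05, power=0.80, n_max=200):
    """Smallest n, with rejection region {X >= k}, such that the exact
    one-sided test of H0: p = p0 vs H1: p = p1 (> p0) has size <= alpha
    and power >= the target.  One-stage design; no early stopping."""
    for n in range(1, n_max + 1):
        # smallest critical value keeping the exact size at or below alpha
        k = next(c for c in range(n + 2) if binom_sf(n, p0, c) <= alpha)
        if binom_sf(n, p1, k) >= power:
            return n, k
    raise ValueError("no design found up to n_max")

# hypothetical response rates: null 20%, alternative 40%
n, k = exact_one_sample_design(0.2, 0.4)
```

Because the binomial is discrete, the attained size typically falls strictly below alpha, which is one source of the gaps between exact designs and asymptotic normal theory designs compared in the talk.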

September 23, 2:30-3:30pm, 796 COE (Colloquium), Professor Yufeng Liu, Department of Statistics & Operations Research, Carolina Center for Genome Sciences, University of North Carolina at Chapel Hill
Estimation of Multiple Noncrossing Quantile Regression Functions

Abstract: Quantile regression is a very useful statistical tool for learning the relationship between the response variable and covariates. For many applications, one often needs to estimate multiple conditional quantile functions of the response variable given the covariates. Although one can estimate multiple quantiles separately, it is of great interest to estimate them simultaneously. One advantage of simultaneous estimation is that the multiple quantiles can share strength among themselves to achieve better estimation accuracy than individually estimated quantile functions. Another important advantage of joint estimation is the feasibility of incorporating noncrossing constraints on the quantile regression functions. In this talk, I will present a new multiple noncrossing quantile regression estimation technique. Both asymptotic properties and finite sample performance will be presented to illustrate the usefulness of the proposed method.
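As background, the check (or "pinball") loss underlying quantile estimation can be sketched as follows (a generic illustration, not the speaker's joint estimator): minimizing it over a constant recovers the sample quantile. In this toy example the separately estimated quantiles happen to be ordered, but fitted quantile regression functions can cross, which motivates the noncrossing constraints:

```python
def pinball_loss(tau, y, q):
    """Check loss sum of rho_tau(y_i - q), where rho_tau(u) = u*(tau - 1{u<0});
    its minimizer over constants q is the tau-th sample quantile."""
    return sum((tau if v >= q else tau - 1) * (v - q) for v in y)

# grid-search the minimizer over the data points themselves (toy data)
y = list(range(1, 11))
q25 = min(y, key=lambda q: pinball_loss(0.25, y, q))
q75 = min(y, key=lambda q: pinball_loss(0.75, y, q))
# separately estimated quantiles are ordered here, but need not be in general
```

Joint estimation replaces these independent minimizations with one problem over all quantile levels, allowing ordering constraints and shared strength across levels.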

August 28, 2:00-3:00pm, 654 COE (Colloquium), Professor Dabao Zhang, Department of Statistics, Purdue University
Penalized orthogonal-components regression for large p small n data

Abstract: We propose a penalized orthogonal-components regression (POCRE) for large p small n data. Orthogonal components are sequentially constructed to maximize, upon standardization, their correlation to the response residuals. A new penalization framework, implemented via empirical Bayes thresholding, is presented to effectively identify sparse predictors of each component. POCRE is computationally efficient owing to its sequential construction of leading sparse principal components. In addition, such construction offers other properties such as grouping highly correlated predictors and allowing for collinear or nearly collinear predictors. With multivariate responses, POCRE can construct common components and thus build up latent-variable models for large p small n data. This is joint work with Yanzhu Lin and Min Zhang.

May 8, 2:00-3:00pm, 796 COE, Professor Xiaoli Gao, Department of Mathematics and Statistics at Oakland University
On the study of penalized LAD methods

Abstract: Penalized regression has been widely used in high-dimensional data analysis. Much recent work has focused on penalized least squares methods. In this talk, I will first introduce the application of penalized LAD (least absolute deviation) methods to detecting copy number variations. I will then discuss some theoretical properties of penalized LAD methods in high-dimensional settings. The finite sample performance of the proposed methods is demonstrated by simulation studies.