Abstract:
Motivated by a longitudinal genetic study on risk factors of
cardiovascular disease and a treatment study to improve clinical
outcomes after subarachnoid hemorrhage, we propose flexible models to
estimate and test genetic or treatment effects using penalized
splines (Eilers and Marx 1996; Ruppert, Wand and Carroll 2003). Both
data examples have a hierarchical structure, for example, repeated
measures nested within subjects and subjects nested within families,
which must be accounted for in the modeling. We propose estimation
procedures under the conditional and marginal semiparametric
regression framework using penalized splines. In addition, we embed
the test of a nonparametric function with multilevel data into testing
fixed effects and a variance component in a linear mixed effects model
with nuisance variance components. Through a spectral decomposition of
the residual sum of squares, we provide a fast algorithm to compute
the null distribution of the test statistic, which improves computational
efficiency significantly compared to the bootstrap. We apply the methods
to compute the genome-wide critical value and p-value of a genetic
association test in a genome-wide association study (GWAS) where the
usual chi-square mixture approximation is conservative and the bootstrap
is computationally prohibitive (requiring up to 10^8 simulations). Lastly, we
examine asymptotic properties of the penalized spline estimator with
clustered data in the small-knots and large-knots scenarios.
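As a rough illustration of the penalized spline machinery underlying these procedures (for a single smooth function, without the multilevel structure), one can fit a truncated power basis with a ridge penalty on the knot coefficients, in the spirit of Ruppert, Wand and Carroll (2003); the function names, basis degree, and smoothing parameter below are illustrative choices, not taken from the paper:

```python
import numpy as np

def pspline_fit(x, y, num_knots=20, degree=1, lam=1.0):
    """Penalized spline fit with a truncated power basis:
    polynomial terms are unpenalized, knot coefficients get a ridge penalty."""
    knots = np.quantile(x, np.linspace(0, 1, num_knots + 2)[1:-1])
    X = np.column_stack([x**j for j in range(degree + 1)] +
                        [np.clip(x - k, 0, None)**degree for k in knots])
    # Penalize only the truncated-power (knot) coefficients.
    D = np.diag([0.0] * (degree + 1) + [1.0] * num_knots)
    beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    return knots, beta

def pspline_predict(x, knots, beta, degree=1):
    X = np.column_stack([x**j for j in range(degree + 1)] +
                        [np.clip(x - k, 0, None)**degree for k in knots])
    return X @ beta
```

The ridge form makes the link to mixed models transparent: treating the knot coefficients as random effects turns the smoothing parameter into a ratio of variance components, which is what the testing framework in the abstract exploits.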

Abstract:
A fundamental and foremost objective in biomedical research is
to establish valid measurements of the clinical disease of interest.
The accuracy or validity of such disease instruments is commonly established by assessing the similarity between measurements made on a subject at multiple time points or by comparison with a gold standard or the best available measurement. Although the foundation of the methodology for addressing accuracy of measurements (agreement methodology) has been laid out, most methods are applicable only when both measurements are made on the same scale.
In this presentation, we first discuss existing measures of agreement
and their applicability in practice. Next, we introduce a new concept,
called “broad sense agreement”, which extends the classical framework of agreement to evaluate the capability of interpreting a continuous measurement on an ordinal scale. We present a natural measure for broad sense agreement. Nonparametric estimation and inference procedures are developed for the proposed measure. We also consider longitudinal settings that involve agreement assessments at multiple time points. Simulation studies have demonstrated good performance of the proposed method with small samples.

Abstract:
Preventive intervention programs often target patterns of family interaction as a means of effecting change.
This presentation discusses new methods for specifying and modeling theoretically meaningful patterns of interaction based on
indicators of the strength of contingency among behaviors in a behavioral sequence, with the long-term goal of providing methods
that allow us to characterize key aspects of interaction and study how they mediate the effects of intervention on outcome.
Our prior work has established the utility of using univariate multilevel modeling methods to characterize contingency strength between any two individual behavior categories, and has extended this work to model patterns of contingency based on all instances of a single transition during an interaction, such as the transition from a wife's action to a husband's reaction. In this paper we take up the multivariate extension of this model, which is necessary when modeling interaction patterns involving transitions to and from both actors. This occurs, for example, in reciprocal interactions where the behavior of each partner is hypothesized to affect the behavior of the other, leading to cycles of reciprocated behavior.
We begin by formulating two bracketing models. The baseline model includes a single random effect for the total number of behaviors in a sequence, while the full association model includes random effects for the full set of behavioral contingencies across both types of transitions (actor to partner, and partner to actor). We then present a series of theory-based models, lying between the baseline and full association models, that reflect different ways of characterizing reciprocal interaction; the bracketing models serve as boundary conditions for testing how well the theory-based models capture important variation in interaction patterns.
We demonstrate this strategy by analyzing a dataset based on observation and microcoding of the sequential interactions of 254 couples experiencing substantial stress occasioned by loss of employment. The results of these analyses suggest that a construct we label reciprocated valence accounts for a substantial proportion of the variance in the complete set of bidirectional contingencies. In addition, results indicate that more complex models that separate negative and positive reciprocity provide only minimal improvements in accounting for this variation. We conclude by discussing how future extensions will allow us to embed this model within a more comprehensive mediation framework.

Abstract:
The paper develops semiparametric estimation methods for bivariate count data regression models. We develop a series expansion approach in which dependence between count variables is introduced by means of stochastically related unobserved heterogeneity components,
and in which, unlike existing commonly used models, both positive and negative correlations are allowed. In implementation, we use bivariate expansions based on generalized Laguerre polynomials. Extensions that accommodate excess zeros, truncated and censored data, and multivariate generalizations are also given. The first application examines the socio-economic and demographic determinants of tobacco use in the context of jointly modeling the daily counts of smoking tobacco and chewing tobacco use based on household survey data. We also jointly analyze two health utilization measures, the numbers of doctor and non-doctor consultations. One of the key contributions is obtaining a computationally tractable closed form of the model with a flexible correlation structure. Monte Carlo experiments and empirical applications confirm that the model performs well relative to existing bivariate models in terms of various statistical criteria and in capturing the range of correlation among dependent variables. This is joint work with John Elder.

Abstract:
The Cox model with time-dependent coefficients has been studied by a number of authors recently.
In this talk, we develop empirical likelihood (EL) point-wise confidence regions for the time-dependent
regression coefficients via local partial likelihood smoothing. The EL simultaneous confidence bands for a
linear combination of the coefficients are also derived based on the strong approximation methods. The EL ratio
is formulated through the local partial log-likelihood for the regression coefficient functions. Our numerical studies
indicate that the EL point-wise/simultaneous confidence regions/bands have satisfactory finite sample performances.
Compared with confidence regions derived directly from the asymptotic normal distribution of the local constant estimator,
the EL confidence regions are overall tighter and can better capture the curvature of the underlying regression coefficient functions.
Two data sets, the gastric cancer data and the Mayo Clinic primary biliary cirrhosis data, are analysed using the proposed method.
This is based on joint work with Yanqing Sun and Rajeshwari Sundaram.

Abstract:
RNA sequencing (RNA-seq) is a powerful new technology for mapping and quantifying
transcriptomes using ultra high-throughput next generation sequencing technologies.
Using deep sequencing, the expression levels of all transcripts, including novel ones, can be quantified digitally. Although extremely promising, the massive amounts of data generated by RNA-seq, substantial biases, and uncertainty in short-read alignment pose daunting challenges for data analysis. In particular, large base-specific variations and between-base correlations make simple approaches, such as those that use averaging to normalize RNA-seq data and quantify gene expression, ineffective. In this study, we propose a model-based method to characterize base-level read coverage within each exon. The underlying expression level is included as a key parameter in this model. Since our method is capable of capturing local genomic features that affect the read coverage profile throughout the exon, we are able to obtain improved quantification of the true underlying expression levels.

Abstract:
Exploring genomic landscapes of different biological endpoints is an important approach for understanding
biological processes and disease etiologies. Examples of these endpoints are sequence composition,
DNA methylation, histone modifications, and binding sites for different transcription factors.
With the completion of the human genome project and advances in high-throughput technologies, tightly
spaced measurements have been collected from linear chromosomes to create unbiased maps at the whole-genome scale.
Detecting regions of interest from these data can be categorized as a general “bump finding” problem, where a bump
is defined as a genomic location at which the data behave differently from the majority of the genome.
In this talk I will present several examples with the general theme of bump finding. In the first example
we propose using Hidden Markov Models to search for CpG islands (CGI) from DNA sequence. The main advantage
of our approach over others is that it summarizes the evidence for CGI status as probability scores, which provides
flexibility in the definition of a CGI and facilitates the creation of CGI lists for many species.
In the second example we construct a hierarchical model to detect transcription factor binding sites (TFBS) by
jointly analyzing multiple related ChIP-chip datasets. This model captures the locational correlation among datasets,
which provides a basis for sharing information across experiments. Simulation and real data tests
illustrate the advantage of the joint model over strategies that analyze each dataset separately.
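As a toy sketch of the first example's idea (not the paper's actual model or parameters), a two-state HMM can be run over a 0/1 indicator of C/G bases, with the posterior probability of the island state serving as the probability score; the transition and emission values below are invented for illustration:

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Posterior state probabilities for a discrete HMM via the
    standard scaled forward-backward recursions."""
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S)); beta = np.zeros((n, S)); c = np.zeros(n)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy setup: state 0 = background, state 1 = CpG-island-like (GC-rich).
# obs[t] is 1 if base t is C or G, 0 otherwise.
seq = "ATATATGCGCGCGCGCATATAT"
obs = np.array([1 if b in "CG" else 0 for b in seq])
pi = np.array([0.9, 0.1])
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[0.6, 0.4],    # background: P(A/T), P(C/G)
              [0.2, 0.8]])   # island:     P(A/T), P(C/G)
post = forward_backward(obs, pi, A, B)  # column 1 = P(island) per base
```

The posterior column for the island state plays the role of the probability score described in the abstract: thresholding it at different levels yields different CGI definitions.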

Abstract:
We determine the optimal allocation of funds between the fixed and variable subaccounts
in a variable annuity with a GMDB (Guaranteed Minimum Death Benefit) clause featuring
partial withdrawals by using a utility-based approach. The Merton method is applied in this
paper by assuming that individuals allocate funds in order to maximize the expected utility of
lifetime consumption, and include the effect on asset allocation from both savings (accumulation)
and dissavings (consumption). We also reflect bequest motives by including the utility of the
recipient of the policyholder's guaranteed death benefits. We derive the optimal transfer choice
by the insured, and furthermore price the GMDB through maximizing the discounted expected
utility of the policyholders and beneficiaries by investing dynamically in the fixed account and
variable fund and withdrawing optimally.

Abstract:
Imbalanced data learning is one of the most important problems in machine learning and data mining, attracting continuing attention in both academia and industry over the last decade. In this talk, I will introduce the binary version of the imbalanced data learning problem and present an effective ensemble learning framework. First, a formal definition of the imbalanced binary classification problem is introduced, and several real-world examples are provided to show its significance. Then, we thoroughly investigate current research trends in handling the imbalanced learning problem to provide a comprehensive overview of representative studies in this area. After discussing the advantages and weaknesses of existing learning methods, we propose a new effective ensemble framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL). Our strategy combines three popular learning techniques: a) ensemble learning; b) artificial example generation; and c) diversity construction by oppositional data re-labeling. As a meta-learner, DECIDL can use general supervised learning algorithms, such as support vector machines, decision trees, and neural networks, as the base learner to build effective ensemble committees. We compare the DECIDL framework with several existing ensemble imbalanced learning frameworks, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost, on our newly developed benchmark data pool consisting of 30 highly skewed data sets. Extensive experiments with various base learners suggest that our DECIDL framework is comparable with other ensemble methods.
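To make one of the comparison baselines concrete, here is a minimal sketch of under-bagging (not DECIDL itself): each committee member sees all minority examples plus a balanced random subsample of the majority class, and members vote. The nearest-centroid base learner is a stand-in for the SVMs, trees, or neural nets a real implementation would plug in:

```python
import numpy as np

class UnderBaggingEnsemble:
    """Under-bagging sketch for binary labels y in {0 (majority), 1 (minority)}."""

    def __init__(self, n_members=11, seed=0):
        self.n_members = n_members
        self.rng = np.random.default_rng(seed)
        self.members = []

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        minority, majority = X[y == 1], X[y == 0]
        for _ in range(self.n_members):
            # Balanced subsample: all minority points + equally many majority points.
            idx = self.rng.choice(len(majority), size=len(minority), replace=False)
            Xb = np.vstack([minority, majority[idx]])
            yb = np.r_[np.ones(len(minority)), np.zeros(len(minority))]
            # Base learner: nearest-centroid classifier (store the two centroids).
            self.members.append((Xb[yb == 1].mean(axis=0), Xb[yb == 0].mean(axis=0)))
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        votes = np.zeros(len(X))
        for c1, c0 in self.members:
            d1 = np.linalg.norm(X - c1, axis=1)
            d0 = np.linalg.norm(X - c0, axis=1)
            votes += (d1 < d0)  # member votes for the minority class
        return (votes > self.n_members / 2).astype(int)
```

The balanced subsampling keeps each member from being swamped by the majority class, which is the core idea DECIDL's diversity-construction step builds on.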

Abstract:
Traditional statistical inference is usually based on empirical models for the data, such as linear, nonlinear, nonparametric, and semiparametric models for continuous data, generalized linear models for binary or discrete data, and proportional hazards regression models for survival data.
Another class of statistical inference is based purely on algorithmic models, such as neural
nets and decision trees, to solve black-box problems in the real world. Statistical research in
this mainstream culture tries to perform inference while minimizing the use of knowledge about the mechanism behind the data.
However, the research system and the data-generation mechanism, which can be described by mathematical models,
in particular dynamic models such as differential equations, are usually known or partially known in the real world. Statistical inference and research for mechanism-based models are very sparse but badly needed. Thus, a new culture of statistical research for mechanism models needs to be established. I'll illustrate and outline the statistical research on mechanism-based differential equation models by our group and others, and its importance. Statistical methods and theories for differential equation models are illustrated via experimental data from infectious disease research, such as HIV and influenza.

Abstract:
The nested case-control (NCC) design is a cost-effective sampling method to study the relationship between a
disease and its risk factors in epidemiologic studies. NCC data are commonly
analyzed using Thomas' partial likelihood approach under Cox's proportional hazards model with
constant covariate effects. In this talk, I will present an extension, the Cox regression with time-varying coefficients,
in NCC studies and an estimation approach based on a kernel-weighted Thomas' partial likelihood.
Both simulation studies and an application to the NCC study of breast cancer in the New York University
Women's Health Study are used to illustrate the usefulness of the proposed methods. Furthermore,
I will discuss another extension, the Cox regression with nonlinear covariate effects, and issues regarding different
techniques to handle these two different models in NCC studies.

Abstract:
This paper empirically studies consumer choice behavior in the wake of a product-harm crisis. A product-harm crisis creates consumer uncertainty about product quality. In this paper, the authors develop a model that explicitly incorporates the impact of such uncertainty on consumer behavior. The authors assume that consumers are uncertain about the mean product quality level and learn about product quality through the signals contained in use experience and the product-harm crisis. The authors also assume that consumers are uncertain about the precision of the signals in conveying product quality and update their perception of the precision of such signals over time upon their arrival. To study the possible impact of a product-harm crisis on consumers'
sensitivities to price, quality, and risk, the authors also allow these model parameters to differ before, during, and after the product-harm crisis. The model is estimated by Bayesian methods for a scanner panel dataset that includes consumer purchase history before, during, and after a product-harm crisis that hit the peanut butter division of Kraft Foods Australia in June 1996. The proposed model fits the data better than the standard consumer learning model in marketing, which assumes consumers are uncertain about the product quality level but know the precision of information in conveying product quality. This study also provides substantive insights on consumers' behavioral choice responses to a product-harm crisis. Finally, the authors conduct counterfactual experiments based on the estimation results and provide insights to managers on crisis management.

Abstract:
Statistical inference is essential in analyzing neural
spike trains in computational neuroscience. Current approaches have
followed a general inference paradigm where a parametric probability
model is often used to characterize the temporal evolution of the
underlying stochastic processes. To capture the overall variability and
distribution in the space of the spike trains directly, we focus on a
data-driven approach where statistics are defined and computed in the
function space in which individual spike trains are viewed as points. To
this end, we first develop a parametrized family of metrics that
takes into account different warpings in the time domain and generalizes
several currently used spike train distances. These new metrics are
essentially penalized L^p norms, involving appropriate functions of
spike trains, with penalties associated with time-warping. In
particular, when p = 2, we present an efficient recursive algorithm,
termed Matching-Minimization algorithm, to compute the sample mean of a
set of spike trains with arbitrary numbers of spikes. The proposed
metrics as well as the mean spike trains ideas are demonstrated using
simulations as well as an experimental recording from the motor cortex.
It is found that all these methods achieve desirable performance and the
results support the success of this novel framework.
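For readers unfamiliar with the distances being generalized, one widely used spike train metric of this time-warping type is the Victor-Purpura distance, computable by a simple dynamic program. This is a standard textbook construction, not the paper's new family of penalized L^p metrics:

```python
import numpy as np

def victor_purpura(s1, s2, q=1.0):
    """Victor-Purpura spike train distance: the minimal cost of transforming
    one spike train into the other, where inserting or deleting a spike costs
    1 and shifting a spike by dt costs q*|dt|. Computed by edit-distance-style
    dynamic programming over the sorted spike times s1 and s2."""
    n, m = len(s1), len(s2)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1)   # delete all spikes of s1
    D[0, :] = np.arange(m + 1)   # insert all spikes of s2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + 1,                          # delete
                          D[i, j - 1] + 1,                          # insert
                          D[i - 1, j - 1] + q * abs(s1[i - 1] - s2[j - 1]))  # shift
    return D[n, m]
```

The cost parameter q sets the time scale of the comparison: q = 0 reduces the distance to a pure spike-count difference, while large q makes any time shift as expensive as deletion plus insertion.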

Abstract:
Additive models have been widely used in nonparametric regression, mainly
due to their ability to avoid the problem of the "curse of dimensionality".
When some of the additive components are linear, the model can be further
simplified and higher convergence rates can be achieved for the estimation
of these linear components. In this paper, we propose a testing procedure
for the determination of linear components in nonparametric additive models.
We adopt the penalized spline approach for modelling the nonparametric
functions, and the test is a chi-square-type test based on finite-order
penalized spline estimators. The limiting behavior of the test statistic is
investigated. To obtain critical values for finite sample problems, we use
resampling techniques to establish a bootstrap test. The performance of the
proposed tests is studied through simulation experiments and a real-data example.

Abstract:
This paper presents a Bayesian analysis of the endogenous treatment model
with misclassified treatment participation. Our estimation procedure utilizes a
combination of data augmentation, Gibbs
sampling, and Metropolis-Hastings to obtain estimates of the
misclassification probabilities and the treatment effect. Simulations demonstrate that the proposed Bayesian
estimator accurately estimates the treatment effect in light of misclassification and
endogeneity.

Abstract:
In most current Phase I designs, including the standard 3+3 design, the
Continual Reassessment Method (CRM), and Escalation With Overdose Control
(EWOC), a patient's toxicity response is treated coarsely as a binary
indicator (yes vs. no) of dose-limiting toxicity (DLT), although patients
usually have multiple toxicities, and much useful toxicity information is discarded. For the
first time in the literature, we establish a novel toxicity scoring system
to treat toxicity response as a quasi-continuous variable and utilize all
toxicities of patients. Our toxicity scoring system consists of generally
accepted and objective components (a logistic function, grade and type of
toxicity, and whether the toxicity is DLT) so that it is relatively
objective. Our system can transform current Phase I designs that treat the
toxicity response as a binary indicator of DLT into new designs that treat
the toxicity response as a quasi-continuous variable, by replacing the
binary indicator of DLT and the target toxicity level (TTL) of current
designs with a Normalized Equivalent Toxicity Score (NETS) and a Target NETS
(TNETS), respectively. The transformed designs improve the accuracy of the
maximum tolerated dose (MTD) estimate and the efficiency of the trial. As an example, we
couple our system with EWOC to develop a new design called Escalation With
Overdose Control using Normalized Equivalent Toxicity Score (EWOC-NETS).
Simulation studies and an application to real trial data demonstrate that
EWOC-NETS can treat the toxicity response as a quasi-continuous variable,
fully utilize all toxicity information, and improve the accuracy of the MTD
and the efficiency of Phase I trials. User-friendly EWOC-NETS software is
under development and will be made available.
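The abstract does not give the scoring formula, but the ingredients it names (a logistic function, toxicity grades and types, and DLT status) suggest a mapping like the following purely hypothetical sketch, in which every weight and constant is invented for illustration and none comes from the published EWOC-NETS system:

```python
import numpy as np

def toy_nets(grades, is_dlt, w_dlt=2.0, center=5.0, slope=1.0):
    """Hypothetical normalized toxicity score: sum the adverse-event grades,
    add extra (made-up) weight for events that are DLTs, then squash the total
    burden through a logistic function so the score is a quasi-continuous
    value in (0, 1). The real EWOC-NETS scoring system uses its own published
    components and weights, which differ from these illustrative choices."""
    burden = (np.sum(np.asarray(grades, float))
              + w_dlt * np.sum(np.asarray(is_dlt, float)))
    return 1.0 / (1.0 + np.exp(-slope * (burden - center)))
```

The point of the squashing step is that the score stays bounded like a toxicity probability, so it can replace the binary DLT indicator and be targeted the way a TTL is.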

Abstract:
Semi-parametric linear transformation models have received much attention due to their high flexibility in modeling survival data.
However, the problem of variable selection for linear transformation models has been less studied,
partially because a convenient loss function is not readily available under this context.
In this talk, we propose a simple yet powerful approach to achieve both sparse and consistent estimation
for linear transformation models. The main idea is to derive a profiled score from the martingale-based estimating equations
of Chen et al. (2001), construct a loss function based on the profiled score and its variance, and then minimize the loss subject
to some shrinkage penalty. Under regularity conditions, we have shown that the resulting estimator is consistent for
both model estimation and variable selection. Furthermore, the estimated parametric terms are asymptotically normal and
can achieve higher efficiency than that yielded by the estimating equations. For computation, we suggest a one-step
approximation algorithm that takes advantage of LARS to build the entire solution path efficiently.
Performance of the new procedure is illustrated through numerous simulations and real data applications.

Abstract:
Non-iterative, distribution-free, unbiased estimators of variance components, including the minimum norm quadratic
unbiased estimator and the method of moments estimator, are derived for the multivariate mixed model.
A general cluster-wise covariance and a same-member-only response-wise covariance are assumed.
Some properties of the proposed estimators such as unbiasedness and existence are discussed, and related
computational issues are addressed. A simulation study is conducted to compare the proposed estimators with the Gaussian
(restricted) maximum likelihood estimator in terms of bias and mean squared error. An application to gene expression family
data is presented to illustrate the proposed estimators.
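As background on the non-iterative moment estimators being generalized, the univariate balanced one-way random effects model already admits closed-form ANOVA (method of moments) estimators; this sketch shows that classical special case, not the multivariate estimators proposed in the talk:

```python
import numpy as np

def mom_variance_components(groups):
    """ANOVA (method of moments) estimators for the balanced one-way random
    effects model y_ij = mu + b_i + e_ij with k clusters of size n:
    sigma_e^2 is estimated by MSW, and sigma_b^2 by (MSB - MSW) / n."""
    groups = [np.asarray(g, float) for g in groups]
    k, n = len(groups), len(groups[0])          # balanced: n obs per cluster
    grand = np.mean(np.concatenate(groups))
    means = np.array([g.mean() for g in groups])
    msb = n * np.sum((means - grand) ** 2) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (n - 1))
    return msw, (msb - msw) / n                 # (sigma_e^2 hat, sigma_b^2 hat)
```

Both estimators are unbiased and require no iteration or distributional assumptions, which is the property the multivariate extensions in the abstract aim to preserve; the multivariate versions replace these scalar mean squares with matrix-valued quadratic forms.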

Abstract: Independent component analysis (ICA) has become an important tool for
analyzing data from functional magnetic resonance imaging (fMRI) studies. ICA has been successfully applied to
single-subject fMRI data. The extension of ICA to group inferences in neuroimaging studies, however, is challenging due to the
unavailability of a pre-specified group design matrix and the uncertainty in between-subjects variability in fMRI data.
We present a general probabilistic ICA (PICA) model that can accommodate varying group structures of multi-subject spatio-temporal processes.
An advantage of the proposed model is that it can flexibly model various types of group structures in different underlying neural source
signals and under different experimental conditions in fMRI studies. A maximum likelihood method is used for estimating this general group ICA model.
We propose two EM algorithms to obtain the ML estimates. The first method is an exact EM algorithm which provides an exact E-step and an explicit noniterative M-step.
The second method is a variational approximation EM algorithm, which is computationally more efficient than the exact EM.
We conduct simulation studies to evaluate the performance of the proposed methods. An fMRI data example is used to illustrate the application of the proposed methods.

Abstract:
Data produced by complex biological processes are often not amenable to simple statistical methods.
Bayesian approaches provide a natural framework to untangle such problems through incorporation of
our understanding of biological processes and the data generation process. In this talk, I will describe
the development of Bayesian hierarchical models in addressing the following two Proteomics problems.
iTRAQ data. iTRAQ (isobaric Tags for Relative and Absolute Quantitation) is a technique
that allows simultaneous quantitation of proteins in multiple samples. However, ignoring the common
nonrandom missingness in these data will lead to biased estimation of protein expression levels. To reduce such bias,
we construct a Bayesian hierarchical model-based method and model the nonrandom missingness of
peptide data with a logistic regression, which relates the missingness probability for a peptide with
the expression level of the protein that produces this peptide. We assume that the measured peptide
intensities are affected by both protein expression levels and peptide-specific effects. The values of these
two effects across experiments are modeled as random effects. Simulation results suggest that such
estimates have smaller bias than those estimated from ANOVA models or fold changes.
Pathway inference. Simultaneous measurements of multiple protein activities at the single cell
level provide much richer information on signaling networks. With the measurements of protein activities
at different experimental conditions, we propose a Bayesian hierarchical modeling framework for
signaling network reconstruction. We model the existence of an association between two proteins both
at the overall level across all experiments and at each individual experimental level, from which we infer
the pairs of proteins that are associated and their causal relations. This approach can effectively pool
information from different interventional experiments. Simulation results demonstrate the superiority
of the hierarchical approach.