Practice questions for Multivariate Statistics


Discuss the notion of p-value or prob-value as it is used in statistics. Give an example.

Describe how you could use a box and whisker plot to locate potential outliers.
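
As a starting point for the box and whisker question, here is a minimal sketch of the usual fence rule: points beyond 1.5 interquartile ranges from the quartiles are flagged as potential outliers. The function names, the multiplier k = 1.5, the quartile method, and the data are illustrative assumptions, not part of the course materials.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # 'inclusive' treats the data as the whole population of interest;
    # other quartile conventions give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical data: one value sits far to the right of the rest.
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 48]
print(tukey_fences(data))
print(flag_outliers(data))
```

A flagged point is only a candidate outlier; whether it is an error or a legitimate extreme value still has to be judged from the context of the data.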

What does validation mean, and how can it be accomplished in multiple regression analysis? In discriminant analysis? In cluster analysis? In factor analysis?

What graphic tool can an analyst best use to examine the shape of the distribution of a metric variable? Sketch a simple example of the graphic tool and label the key features.

Describe and explain the correct use of each of the following four devices for determining whether a variable is normally distributed.

frequency histogram
normal probability plot
Shapiro-Wilk (W) test
skewness test

How does the sample size affect your application of the following four devices for determining whether a variable is normally distributed?

frequency histogram
normal probability plot
Shapiro-Wilk (W) test
skewness test
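
To explore the sample-size question for the skewness test, the sketch below uses one common large-sample form of the statistic, z = skewness / sqrt(6/n). This particular formula, the function names, and the data are assumptions made for illustration; your textbook or SAS output may use a slightly different estimator.

```python
import math

def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def skewness_z(values):
    """Approximate z statistic: skewness divided by sqrt(6/n)."""
    return sample_skewness(values) / math.sqrt(6 / len(values))

# Hypothetical right-skewed data, then the same shape with 10x the cases.
small = [1, 2, 2, 3, 3, 3, 4, 4, 10]
large = small * 10

print(sample_skewness(small), skewness_z(small))
print(sample_skewness(large), skewness_z(large))
```

The two samples have the same skewness, but the z statistic grows with the square root of n, so an identical departure from symmetry is more likely to be declared significant in the larger sample.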

Sketch a hypothetical histogram and a boxplot of a data set which is skewed to the right but has no outliers. Label the axes with hypothetical but concrete names.

Why is it important to identify the process which produced missing values?

Argue for or against: "Outliers negatively affect data analysis and should be removed from the results."


The next few questions refer to the situation involving MBA students at XYZ University.

Discuss the meaning of the index of multiple determination R^2 = 0.5664 in this problem.

Using observation number 27 as a randomly chosen example in this problem, discuss the practical significance of the predicted y-value from the regression analysis. What does practical significance mean in this context?

Which variable is most important in predicting a y-value in this problem? Why?

Which applicants' information seems to be strongly influencing the estimated regression equation in this problem? Why do you say so?

How would you assess the linearity between the dependent variable and the independent variables in the regression analysis in this problem? Explain.

Assess the degree of collinearity in this problem and how you would advise that the model be used in light of this assessment.

a) Which of the statistical readouts of SAS could be used to assess the normality of the variable X2 in this problem?

b) What is your conclusion regarding the normality of the variable X2 in light of the SAS information in this problem? Explain your answer.

c) Give an example from the readouts and explain your use of the p-value in this problem.

d) Use the attached table taken from page 104 of your textbook to discuss the adequacy of the sample size in relation to the statistical power and effect size in this problem.

Minimum R^2 that can be found statistically significant with a power of 0.80 for varying numbers of independent variables and sample sizes (all R^2 values in the table are multiplied by 100: 49 means 0.49)

               Significance level = 0.01      Significance level = 0.05
               Number of variables            Number of variables
Sample size n    2     5    10    20            2     5    10    20
      20        45    56    71   N/A           39    48    64   N/A
      50        23    29    36    49           19    23    29    42
     100        13    16    20    26           10    12    15    21
     250         5     7     8    11            4     5     6     8
     500         3     3     4     6            3     4     5     7
    1000         1     2     2     3            1     1     2     2
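
One way to work with the table is to treat it as a lookup: given a significance level, a sample size, and a number of independent variables, read off the smallest R^2 that can be detected with power 0.80. The sketch below only transcribes the cells above; the function name is hypothetical, and the sample size and variable count in the usage lines are placeholders, not the values from the MBA problem.

```python
# Cells transcribed from the table above (R^2 x 100); None marks N/A.
VARIABLE_COUNTS = (2, 5, 10, 20)
MIN_R2_X100 = {
    0.01: {
        20: (45, 56, 71, None),
        50: (23, 29, 36, 49),
        100: (13, 16, 20, 26),
        250: (5, 7, 8, 11),
        500: (3, 3, 4, 6),
        1000: (1, 2, 2, 3),
    },
    0.05: {
        20: (39, 48, 64, None),
        50: (19, 23, 29, 42),
        100: (10, 12, 15, 21),
        250: (4, 5, 6, 8),
        500: (3, 4, 5, 7),
        1000: (1, 1, 2, 2),
    },
}

def minimum_detectable_r2(alpha, n, k):
    """Smallest R^2 detectable with power 0.80, or None if not tabulated."""
    cell = MIN_R2_X100[alpha][n][VARIABLE_COUNTS.index(k)]
    return None if cell is None else cell / 100

# Placeholder inputs: alpha = 0.05, n = 100, 5 independent variables.
threshold = minimum_detectable_r2(0.05, 100, 5)
print(threshold)
print(0.5664 >= threshold)  # compare an observed R^2 with the threshold
```

Only the tabulated sample sizes and variable counts are supported; anything in between has to be judged from the neighboring cells.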

What does validation mean and how might it be accomplished in this problem's multivariate analysis study?

What is the estimated prediction equation for predicting the dependent variable in this problem? What would the prediction equation predict for a person with the following information?

What is the meaning of parameter estimate for the coefficient X4 in this problem? Answer using the context and units of measurement of the study.

Describe standardized (partial) regression coefficients (beta coefficients) in this problem. Which variable is most important in predicting a person's dependent variable value? Why?

Discuss technical issues regarding the use of the qualitative variables X5 or X8 as a predictor variable in this problem.

Using observation number 89 as a randomly chosen example, explain and discuss the 95% prediction interval and the 95% confidence interval for this person's predicted value in this problem.

a) What are the uses of a partial regression residual plot in multiple regression analysis?

b) Demonstrate these uses in this problem.

What is collinearity in multiple regression analysis? Assess the degree of collinearity and how you would advise that the model in this problem be used in light of this assessment.
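
When assessing collinearity, two closely related diagnostics are tolerance and the variance inflation factor (VIF). For predictor j, tolerance is 1 - R_j^2 and VIF is its reciprocal, where R_j^2 comes from regressing predictor j on the other predictors. The sketch below assumes R_j^2 has already been obtained (for example, from SAS output); the function names and input values are illustrative.

```python
def tolerance(r_squared_j):
    """Share of predictor j's variance NOT explained by the other predictors."""
    return 1.0 - r_squared_j

def vif(r_squared_j):
    """Variance inflation factor: 1 / tolerance."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("R_j^2 must lie in [0, 1)")
    return 1.0 / tolerance(r_squared_j)

# Hypothetical R_j^2 values for three predictors.
for name, r2 in [("X1", 0.10), ("X2", 0.50), ("X3", 0.90)]:
    print(name, round(tolerance(r2), 2), round(vif(r2), 2))
```

Cutoffs such as a VIF near 10 are conventions rather than fixed rules, so the assessment should also consider how the model will be used (prediction versus interpretation of individual coefficients).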

Management specifies a minimum X1 of 430 and a minimum X3 of 2.1. People are not considered unless they meet both minima. How would you advise management regarding this kind of criterion in light of the statistical analysis in this problem?

Why might one wish to perform a factor analysis and a cluster analysis in the same study?

Clearly explain the meaning of eigenvalues and factor loadings in the context of a principal components analysis.

"If a principal components analysis is performed on an uncorrelated data set, the eigenvalues would all equal one." Please argue for or against this statement.
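
A small case that may help in thinking about this statement is the two-variable correlation matrix [[1, r], [r, 1]], whose eigenvalues are 1 + |r| and 1 - |r|. The sketch below assumes the analysis is run on the correlation matrix (standardized variables) and uses made-up values of r.

```python
def eigenvalues_2x2_correlation(r):
    """Eigenvalues of [[1, r], [r, 1]], largest first."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie in [-1, 1]")
    return (1 + abs(r), 1 - abs(r))

# Hypothetical correlations between two standardized variables.
for r in (0.0, 0.6, -0.6):
    first, second = eigenvalues_2x2_correlation(r)
    print(r, first, second, first + second)
```

In each case the eigenvalues sum to the number of variables. A sample from an uncorrelated population will still show small nonzero sample correlations, and an analysis of the covariance matrix instead of the correlation matrix behaves differently; both points are worth addressing in an answer.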

Argue for the best way to measure similarities among observations in a cluster analysis context.

Give a brief description of how a cluster analysis might be used in a functional area of business of interest to you.

"In a principal components analysis, the first component is generally a good overall measure of variance in the data set." Please comment.

Compare and contrast collinearity and correlation in multivariate analysis.

"A major purpose of principal components/factor analysis is to develop a set of new variables to be used in subsequent analyses." Please comment on this statement.

What are the ideals of simple structure and why would you wish to achieve them?

Distinguish between principal components factor analysis and common factor analysis (principal factoring) and discuss why one might be preferred over the other.

"In a decision-making, industrial-application context, linear statistical models are used more often for prediction than for explanation." Develop an argument to support or refute this statement.

Why is a separate holdout sample so important in discriminant analysis?

Compare and contrast discriminant analysis and regression analysis.

Calculate the squared distance between the following two centroids.

How would you determine whether or not the classification accuracy of your discriminant function is sufficiently high relative to chance classification?
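
Two chance benchmarks often used for this comparison are the maximum chance criterion (the share of the largest group) and the proportional chance criterion (the sum of squared group shares). The sketch below computes both for hypothetical group sizes; the function names, the group sizes, the hit ratio, and the "25% better than chance" margin are illustrative assumptions rather than values from this course.

```python
def maximum_chance(group_sizes):
    """Hit rate from assigning every case to the largest group."""
    return max(group_sizes) / sum(group_sizes)

def proportional_chance(group_sizes):
    """Expected hit rate from random assignment in proportion to group size."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)

# Hypothetical holdout sample: 60 cases in one group, 40 in the other.
sizes = [60, 40]
hit_ratio = 0.75  # hypothetical observed classification accuracy

c_max = maximum_chance(sizes)
c_pro = proportional_chance(sizes)
print(c_max, c_pro)
# Assumed rule of thumb: beat the chance benchmark by at least 25%.
print(hit_ratio >= 1.25 * c_pro)
```

Which benchmark is appropriate depends on whether the goal is to classify all groups well or simply to maximize the overall hit rate, and the comparison is most credible when the hit ratio comes from a holdout sample.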

Compare and contrast discriminant analysis and cluster analysis.

Describe an effective quantitative way to help choose the number of clusters in a hierarchical cluster analysis context.

How are quantitative and qualitative factors involved in "naming" hierarchically derived clusters?

Describe the purpose and use of Kaiser's measure of sampling adequacy (MSA).

Using the concept of variance, argue for a good way to determine the number of factors to retain in an exploratory factor analysis.

Explain the meaning and use of factor loadings in the context of a principal components analysis.

What is an orthogonal rotation of factor loadings and what is its purpose?

Provide a possible interpretation of the two retained factors in this problem (and explain your answer).

Why might one wish to perform a factor analysis in a regression analysis study? Describe the steps to such an approach.

In what situations might you consider transforming a variable?

How can SAS be used to check for multicollinearity in discriminant analysis?

What does the adjective "partial" mean when it is used in statistics?
