# The Problem of Identification

## Introduction to Identification

A statistical model is "identified" if the known information available implies that there is one best value for each parameter in the model whose value is not known. An example from the algebra of simultaneous equations:

x + 2y = 7
An infinite number of pairs of values will serve for x and y. These values are "not identified" or "underidentified." There are fewer "knowns" than "unknowns." Here is a different situation:

x + 2y = 7
3x - y = 7
Now there are just as many knowns as unknowns, and there is one best pair of values (x = 3, y = 2). The system of equations is now "just identified."
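The two-equation system can be checked numerically; here is a minimal sketch using NumPy, where the matrix A and vector b simply restate the two equations:

```python
import numpy as np

# Two independent equations in two unknowns: "just identified".
A = np.array([[1.0, 2.0],
              [3.0, -1.0]])
b = np.array([7.0, 7.0])

# Full rank means the system pins down exactly one solution.
print(np.linalg.matrix_rank(A))   # 2

x, y = np.linalg.solve(A, b)
print(x, y)                       # x = 3, y = 2 (up to floating-point rounding)
```

Dropping either row of A reproduces the underidentified case: one equation, two unknowns, infinitely many solutions.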

In structural equation modeling, the knowns consist chiefly of the variances and covariances of the measured variables (but may include other elements as well), while the unknowns are the model parameters. Identification is an important concern for SEM researchers because the methodology gives users the freedom to specify models that are not identified. Here, for example, is a model that is not identified: a single latent construct measured by two congeneric indicators, X1 and X2.

Why is this model not identified? Compare knowns and unknowns. If we specify (for convenience) that the latent construct has unit variance, then this model has four parameters with unknown values: one loading and one error variance for each X. Now, the variance/covariance matrix of the X's has only three distinct elements: the variance of each X, and their covariance. So there are four unknowns but only three knowns, and the model will not be identified unless additional constraints are imposed.
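The counting argument generalizes: a covariance matrix of p measured variables supplies p(p + 1)/2 distinct knowns. A small sketch of this counting check (the function names are illustrative, not from any SEM package):

```python
def distinct_moments(p: int) -> int:
    """Distinct elements of a p x p covariance matrix: p variances plus p(p-1)/2 covariances."""
    return p * (p + 1) // 2

def degrees_of_freedom(p: int, free_params: int) -> int:
    """Knowns minus unknowns; a negative value means the model cannot be identified."""
    return distinct_moments(p) - free_params

# The two-indicator model above: 2 loadings + 2 error variances = 4 unknowns.
print(distinct_moments(2))        # 3 knowns
print(degrees_of_freedom(2, 4))   # -1: underidentified
```

With a third indicator the count becomes 6 knowns against 6 unknowns, which is consistent with the "Three Measure Rule" discussed below.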

## Overidentification

Beyond mere identification, SEM users prefer to work with models that are "overidentified"--models where there are more knowns than unknowns. Models that are just identified yield a trivially perfect fit, making the test of fit uninteresting. Models that are overidentified--that have positive degrees of freedom--may not fit well, so the fact that such a model does fit well amounts to meaningful evidence in favor of the proposition that the model is indeed a reasonable representation of the phenomena in question.

## Approaches to Testing Identification

There are four general approaches to assessing identification:

## Algebraic Solution

As Bollen (1989) notes, the parameters of a structural equation model are generally considered identified if the researcher can solve the covariance structure equations for the unknown parameters. That is, the researcher must express the parameters as independent functions of the elements of the covariance matrix. Unfortunately, the covariance structure equations quickly become complex as the model grows, making algebraic solution "tedious and error-prone," to use Bollen's words. Researchers who adopt this approach must also beware of dependencies concealed within the solution.
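For a very small model the algebra is still tractable, and a computer algebra system can do it. A sketch, assuming a single factor with unit variance and three congeneric measures, so that each covariance structure equation has the form Cov(Xi, Xj) = lambda_i * lambda_j (the symbol names are illustrative):

```python
import sympy as sp

# Loadings (assumed positive to pick one root) and observed covariances.
l1, l2, l3 = sp.symbols('lambda1:4', positive=True)
s12, s13, s23 = sp.symbols('s12 s13 s23', positive=True)

# Covariance structure equations: Cov(Xi, Xj) = lambda_i * lambda_j.
sol = sp.solve([l1*l2 - s12, l1*l3 - s13, l2*l3 - s23],
               [l1, l2, l3], dict=True)

# Each loading is expressed as a function of observable covariances,
# e.g. lambda1 = sqrt(s12 * s13 / s23): the parameter is identified.
print(sol[0][l1])
```

This is exactly the kind of derivation that becomes unmanageable by hand as the number of equations grows.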

## Heuristics (Rules of Thumb)

Because of this difficulty, researchers have developed a number of rules of thumb, most of which are summarized in Bollen (1989). The simplest of these restates the requirement that there must be more knowns than unknowns--more distinct elements of the moment matrix being analyzed than there are parameters to be estimated. But this is merely a necessary condition, not a sufficient one.

For measurement models, for example, the "Three Measure Rule" states that a congeneric measurement model will be identified if every latent construct is associated with at least three measures. The "Two Measure Rule" states that a congeneric measurement model will be identified if every latent construct is associated with at least two measures AND every construct is correlated with at least one other construct. More recent contributions on the identification of measurement models have come from Davis (1993) and Reilly (1995).

The best-known rules of thumb for the structural model are the "Rank and Order Conditions." These conditions are necessary and sufficient for identification of the structural model when all of the disturbance terms are allowed to correlate. There is also the "Recursive Rule," which says that recursive models are always identified. A recent contribution in this area has come from Rigdon (1995).

Researchers who rely on these heuristics must realize, however, that separately assessing the identification of the measurement model and the structural model can lead to errors. The identification status of the two can be intertwined, so that restrictions in one can aid identification of the other.

## Information Matrix Techniques

In structural equation modeling, the "information matrix" (this version of the definition is taken from Jöreskog & Sörbom, 1989) is the matrix of second-order derivatives of the fit or discrepancy function with respect to all the free parameters of the model. If the model's parameters are all identified, then the rank of the information matrix will be equal to the number of free parameters in the model (equivalently, the matrix will be positive definite); if the model is not identified, the matrix will be rank deficient. This is analogous to checking for multicollinearity in a regression by evaluating the rank of the covariance matrix of the predictors. In fact, some SEM programs, such as EQS, report identification problems detected in this way by saying that one parameter in the model "is linearly dependent on" some other parameter(s).
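A toy numerical illustration of the rank check, using the two-indicator model from the introduction. A simple least-squares discrepancy stands in for the fitting functions real SEM programs use, and the Hessian is approximated by finite differences:

```python
import numpy as np

def implied_moments(theta):
    """Model-implied distinct moments for one factor (unit variance), two indicators."""
    l1, l2, e1, e2 = theta  # two loadings, two error variances
    return np.array([l1**2 + e1, l1*l2, l2**2 + e2])

# "Observed" moments generated from known values, so the model fits exactly.
true_theta = np.array([0.8, 0.7, 0.36, 0.51])
s = implied_moments(true_theta)

def discrepancy(theta):
    r = s - implied_moments(theta)
    return r @ r

def numerical_hessian(f, theta, h=1e-4):
    """Central finite-difference approximation to the matrix of second-order derivatives."""
    k = len(theta)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            tpp = theta.copy(); tpp[i] += h; tpp[j] += h
            tpm = theta.copy(); tpm[i] += h; tpm[j] -= h
            tmp = theta.copy(); tmp[i] -= h; tmp[j] += h
            tmm = theta.copy(); tmm[i] -= h; tmm[j] -= h
            H[i, j] = (f(tpp) - f(tpm) - f(tmp) + f(tmm)) / (4 * h**2)
    return H

H = numerical_hessian(discrepancy, true_theta)
rank = np.linalg.matrix_rank(H, tol=1e-6)
print(rank)   # 3, but there are 4 free parameters: rank deficient, not identified
```

Note that the check is performed at one particular point in parameter space, which is precisely the local-versus-global limitation discussed next.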

This approach has two shortcomings. First, the rank of the information matrix is only evaluated after the parameters have been estimated, and the evaluation only applies at that point in parameter space. In the words of McDonald (1982), who wrote many fundamental papers in this area, the model is identified locally, rather than globally over the whole parameter space. This suggests that a model might be identified at one point in this space but not identified at others.

The second shortcoming is a result of the way this technique is typically implemented in SEM programs. Typically, a program evaluates the rank of the information matrix sequentially, beginning with one row and column (representing one parameter), then the first two rows and columns (representing two parameters), and so on, until either the entire matrix is evaluated or a rank deficiency occurs. If a deficiency occurs, the program sends a message saying that the corresponding parameter is problematic. The problem is that the researcher is not informed of how many other parameters may also be involved in the identification problem. Thus, identification error messages produced in this way may mislead researchers about the true nature of the problem.

Besides examining the information matrix itself, researchers may also check identification by looking at some of its by-products. Large standard errors and very high correlations between parameter estimates may signal identification problems, although it can be hard to tell, based only on these values, whether there is an identification problem, a model fit problem, or no problem at all.

## Evaluation of the Augmented Jacobian Matrix

Recently, Bekker, Merckens and Wansbeek (1994) presented an approach that involves evaluating an augmented version of the Jacobian matrix--the matrix of first-order derivatives of the discrepancy function with respect to the free parameters. Their Jacobian is augmented because it also includes equations representing restrictions, such as equality constraints, imposed on the values of the parameters. It is also modified in ways that reduce the computational burden without affecting the conclusions obtained.

Using modern computer algebra techniques, Bekker, Merckens and Wansbeek (1994) show that the identification of the model can be assessed by evaluating the rank of a subset of this augmented Jacobian matrix, and that this evaluation can be conducted symbolically, before the parameters are estimated, and thus independently of any particular set of parameter values. Still, the assumptions upon which this method is constructed make this a test of local, rather than global, identification. However, the output of this procedure is a report on the identification status of every model parameter. This means that the researcher has a complete list of all problem parameters, which makes it more likely that the problem will be properly understood.
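The symbolic flavor of this check can be sketched with a computer algebra system. This illustrates only the underlying idea--rank evaluation of a symbolic Jacobian before any estimation--not the Bekker-Merckens-Wansbeek algorithm itself, and it uses the plain (unaugmented) Jacobian of the two-indicator model:

```python
import sympy as sp

# Parameters of the one-factor, two-indicator model (factor variance fixed at 1).
l1, l2, e1, e2 = sp.symbols('lambda1 lambda2 theta1 theta2')

# Model-implied distinct moments: Var(X1), Cov(X1, X2), Var(X2).
sigma = sp.Matrix([l1**2 + e1, l1*l2, l2**2 + e2])
params = sp.Matrix([l1, l2, e1, e2])

# First-order derivatives of the implied moments with respect to the parameters.
J = sigma.jacobian(params)

print(J.rank())   # 3, but there are 4 parameters: not identified
```

Because the rank is computed symbolically, the conclusion does not depend on any particular set of parameter estimates.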

This approach has not yet been implemented in a mainstream SEM program. However, the authors have implemented their technique via a set of programs which are provided on a disk that accompanies their book.

## Empirical Underidentification

Kenny (1979) introduced this term for situations where a model should be identified based on its structure, but is not identified given the particular sample data being analyzed. For example, a measurement model with two correlated constructs and two congeneric measures loading on each construct should be identified, under the "Two Measure Rule." But suppose that, in a given sample of data, the correlation between the constructs is equal to 0. Then, in that sample, the model would not be identified. As Kenny (1979) noted, the threat of "empirical underidentification" means that researchers must always be alert for signs of identification problems, even when a model is nominally identified based on its structure.
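This example can be made concrete symbolically. With factor variances fixed at 1, the Jacobian of the two-factor, two-indicator model generically has full column rank, but substituting a factor correlation of zero makes it rank deficient (an illustration, not a general diagnostic procedure):

```python
import sympy as sp

# Four loadings, four error variances, and the factor correlation phi.
l1, l2, l3, l4 = sp.symbols('lambda1:5')
e1, e2, e3, e4 = sp.symbols('theta1:5')
phi = sp.symbols('phi')

# Distinct implied moments: 4 variances, 2 within-factor covariances,
# and 4 between-factor covariances (each proportional to phi).
sigma = sp.Matrix([
    l1**2 + e1, l2**2 + e2, l3**2 + e3, l4**2 + e4,
    l1*l2, l3*l4,
    phi*l1*l3, phi*l1*l4, phi*l2*l3, phi*l2*l4,
])
params = sp.Matrix([l1, l2, l3, l4, e1, e2, e3, e4, phi])
J = sigma.jacobian(params)

print(J.rank())                 # 9: equals the number of parameters, identified
print(J.subs(phi, 0).rank())    # 7: rank deficient in a sample where phi = 0
```

The model's structure is unchanged; only the particular parameter value phi = 0 destroys identification, which is exactly Kenny's point.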

## References

Bekker, P. A., Merckens, A., & Wansbeek, T. J. (1994). Identification, equivalent models, and computer algebra. Boston: Academic Press.

Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Davis, W. R. (1993). The FC1 rule of identification for confirmatory factor analysis: A general sufficient condition. Sociological Methods & Research, 21(4), 403-437.

Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7: A guide to the program and applications. Chicago: SPSS.

Kenny, D. A. (1979). Correlation and causality. New York: Wiley.

McDonald, R. P. (1982). A note on the investigation of local and global identifiability. Psychometrika, 47(1), 101-103.

Reilly, T. (1995). A necessary and sufficient condition for identification of confirmatory factor analysis models of complexity one. Sociological Methods & Research, 23(4), 421-441.

Rigdon, E. E. (1995). A necessary and sufficient identification rule for structural models estimated in practice. Multivariate Behavioral Research, 30(3), 359-383.
