An Introduction to Structural Equation Modeling (SEM)
SEM is a combination of factor analysis and multiple regression. It also goes by the aliases “causal modeling” and “analysis of covariance structure”. Special cases of SEM include confirmatory factor analysis and path analysis. You are already familiar with path analysis, which is SEM with no latent variables.
The variables in SEM are measured (observed, manifest) variables (indicators) and factors (latent variables). I think of factors as weighted linear combinations that we have created/invented. Those who are fond of SEM tend to think of them as underlying constructs that we have discovered.
Even though no variables may have been manipulated, variables and factors in SEM may be classified as “independent variables” or “dependent variables.” Such classification is made on the basis of a theoretical causal model, formal or informal. The causal model is presented in a diagram in which the names of measured variables are enclosed in rectangles and the names of factors in ellipses. Rectangles and ellipses are connected by lines with an arrowhead at one end (unidirectional causation) or at both ends (direction of causality unspecified).
Dependent variables are those to which one-way arrows point; independent variables are those to which no one-way arrows point. Dependent variables have residuals (they are not perfectly related to the other variables in the model), indicated by e’s (errors) pointing to measured variables and d’s (disturbances) pointing to latent variables.
The SEM can be divided into two parts. The measurement model is the part which relates measured variables to latent variables. The structural model is the part that relates latent variables to one another.
Statistically, the model is evaluated by comparing two variance/covariance matrices. From the data a sample variance/covariance matrix is calculated. From this matrix and the model an estimated population variance/covariance matrix is computed. If the estimated population variance/covariance matrix is very similar to the known sample variance/covariance matrix, then the model is said to fit the data well. A Chi-square statistic is computed to test the null hypothesis that the model fits the data well. There are also numerous goodness-of-fit indices designed to quantify how well the model fits the data.
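The comparison of the two matrices can be made concrete with the maximum likelihood discrepancy function, F = ln|Σ| − ln|S| + tr(SΣ⁻¹) − p, where S is the sample matrix, Σ the model-implied matrix, and p the number of measured variables; the fit Chi-square is (N − 1)·F. A minimal numpy sketch (the matrices here are made-up toy values, not from any real data set):

```python
import numpy as np

def ml_fit_function(S, Sigma):
    """ML discrepancy F = ln|Sigma| - ln|S| + tr(S @ inv(Sigma)) - p,
    where p is the number of measured variables. F = 0 when Sigma = S."""
    p = S.shape[0]
    return (np.log(np.linalg.det(Sigma)) - np.log(np.linalg.det(S))
            + np.trace(S @ np.linalg.inv(Sigma)) - p)

# Perfect fit: the implied matrix equals the sample matrix.
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(ml_fit_function(S, S))          # 0 (up to rounding)

# A misfitting implied matrix (covariance forced to 0) yields F > 0.
Sigma = np.array([[2.0, 0.0],
                  [0.0, 1.0]])
print(ml_fit_function(S, Sigma) > 0)  # True
```

The estimation routine iteratively adjusts the free parameters so as to make F as small as possible.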
Sample Size. As with factor analysis, you should have lots of data when evaluating a SEM. As usual, there are several rules of thumb. For a simple model, 200 cases might be adequate. When relationships among components of the model are strong, 10 cases per estimated parameter may be adequate.
Assumptions. Multivariate normality is generally assumed. It is also assumed that relationships between variables are linear, but powers of variables may be included in the model to test polynomial relationships.
Problems. If one of the variables is a perfect linear combination of the other variables, a singular matrix (which cannot be inverted) will cause the analysis to crash. Multicollinearity can also be a problem.
An Example. Consider the model presented in Figure 14.4 of Tabachnick and Fidell [Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon.]. There are five measurement variables (in rectangles) and two latent variables (in ellipses). Two of the variables are considered “independent” (and shaded), the others are considered “dependent.” From each latent variable there is a path pointing to two indicators. From one measured variable (SenSeek) there is a path pointing to a latent variable (SkiSat). Each measured variable has an error path leading to it. Each latent variable has a disturbance path leading to it.
Parameters. The parameters of the model are regression coefficients for paths between variables and variances/covariances of independent variables. Parameters may be “fixed” to a certain value (usually “0” or “1”) or may be estimated. In the diagram, an asterisk (“*”) represents a parameter to be estimated. A “1” indicates that the parameter has been “fixed” to value “1.” When two variables are not connected by a path the coefficient for that path is fixed at “0.”
Tabachnick and Fidell used EQS to arrive at the final model displayed in their Figure 14.5.
Model Identification. An “identified” model is one for which each of the estimated parameters has a unique solution. To determine whether the model is identified or not, compare the number of data points to the number of parameters to be estimated. Since the input data set is the sample variance/covariance matrix, the number of data points is the number of variances and covariances in that matrix, which can be calculated as m(m + 1)/2, where m is the number of measured variables. For T&F’s example the number of data points is 5(6)/2 = 15.
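This count is just the number of distinct entries in the lower triangle (including the diagonal) of an m × m matrix:

```python
def n_data_points(m):
    """Number of distinct variances and covariances among m measured
    variables: m * (m + 1) / 2."""
    return m * (m + 1) // 2

print(n_data_points(5))  # 15, matching T&F's five-indicator example
```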
If the number of data points equals the number of parameters to be estimated, then the model is “just identified” or “saturated.” Such a model will fit the data perfectly, and thus is of little use, although it can be used to estimate the values of the coefficients for the paths.
If there are fewer data points than parameters to be estimated then the model is “under identified.” In this case the parameters cannot be estimated, and the researcher needs to reduce the number of parameters to be estimated by deleting or fixing some of them.
When the number of data points is greater than the number of parameters to be estimated then the model is “over identified,” and the analysis can proceed.
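The three cases above amount to a simple comparison, sketched here as a small helper (the function name is my own, used only for illustration):

```python
def identification_status(n_data_points, n_parameters):
    """Classify a model by comparing the number of data points
    (distinct variances/covariances) to the number of free parameters."""
    if n_parameters > n_data_points:
        return "under identified"
    if n_parameters == n_data_points:
        return "just identified"
    return "over identified"

# T&F's example: 15 data points; with, say, 13 free parameters
# the model is over identified and the analysis can proceed.
print(identification_status(15, 13))  # over identified
```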
Identification of the Measurement Model. The scale of each latent variable must be fixed to a constant (typically to 1, as in z scores) or to that of one of the measured variables (a “marker variable,” one that is thought to be exceptionally well related to this latent variable and not to other latent variables in the model). To fix the scale to that of a measured variable one simply fixes to 1 the regression coefficient for the path from the latent variable to the measured variable. Most often the scale of dependent latent variables is set to that of a measured variable. The scale of independent latent variables may be set to 1 or to the variance of a measured variable.
The measurement portion of the model will probably be identified if:
There is only one latent variable, it has at least three indicators that load on it, and the errors of these indicators are not correlated with one another.
There are two or more latent variables, each has at least three indicators that load on it, and the errors of these indicators are not correlated, each indicator loads on only one factor, and the factors are allowed to covary.
There are two or more latent variables, but there is a latent variable on which only two indicators load, the errors of the indicators are not correlated, each indicator loads on only one factor, and none of the variances or covariances between factors is zero.
Identification of the Structural Model. This portion of the model may be identified if:
None of the latent dependent variables predicts another latent dependent variable.
When a latent dependent variable does predict another latent dependent variable, the relationship is recursive, and the disturbances are not correlated. A relationship is recursive if the causal relationship is unidirectional (one line pointing from one latent variable to the other). In a nonrecursive relationship there are two lines between a pair of variables, one pointing from A to B and the other from B to A. Correlated disturbances are indicated by their being connected with a single line with an arrowhead at each end.
When there is a nonrecursive relationship between latent dependent variables or disturbances, spend some time with: Bollen, K.A. (1989). Structural equations with latent variables. New York: John Wiley & Sons -- or hire an expert in SEM.
If your model is not identified, the SEM program will throw an error and then you must tinker with the model until it is identified.
Estimation. The analysis uses an iterative procedure to minimize the differences between the sample variance/covariance matrix and the estimated population variance matrix. Maximum Likelihood (ML) estimation is that most frequently employed. Among the techniques available in the software used in this course (SAS and AMOS), the ML and Generalized Least Squares (GLS) techniques have fared well in Monte Carlo comparisons of techniques.
Fit. With large sample sizes, the Chi-square testing the null that the model fits the data well may be significant even when the fit is good. Accordingly there has been great interest in developing estimates of fit that do not rely on tests of significance. In fact, there has been so much interest that there are dozens of such indices of fit. Tabachnick and Fidell discuss many of these fit indices. You can also find some discussion of them in my document Conducting a Path Analysis With SPSS/AMOS.
Model Modification and Comparison. You may wish to evaluate two nested models. Model R is nested within Model F if Model R can be created by deleting one or more of the parameters from Model F. The significance of the difference in fit can be tested with a simple Chi-square statistic. The value of this Chi-square equals the Chi-square fit statistic for Model R minus the Chi-square statistic for Model F. The degrees of freedom equal the degrees of freedom for Model R minus the degrees of freedom for Model F. A nonsignificant Chi-square indicates that removal of the parameters that are estimated in Model F but not in Model R did not significantly reduce the fit of the model to the data.
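The nested-model comparison is a one-line calculation once the two fit statistics are in hand. A sketch using scipy, with hypothetical fit statistics (the numbers below are invented for illustration, not from T&F’s example):

```python
from scipy.stats import chi2

def nested_model_test(chi2_R, df_R, chi2_F, df_F):
    """Chi-square difference test for Model R nested within Model F.
    Returns the difference Chi-square, its df, and the p-value.
    The restricted model R has the larger Chi-square and more df."""
    d_chi2 = chi2_R - chi2_F
    d_df = df_R - df_F
    return d_chi2, d_df, chi2.sf(d_chi2, d_df)

# Hypothetical values: dropping one parameter raised Chi-square
# from 9.5 (df = 5) to 18.2 (df = 6).
d, ddf, p = nested_model_test(chi2_R=18.2, df_R=6, chi2_F=9.5, df_F=5)
print(d, ddf)  # 8.7 on 1 df
print(p < .05)  # True: the deleted parameter mattered
```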
The Lagrange Multiplier Test (LM) can be used to determine whether or not the model fit would be significantly improved by estimating (rather than fixing) an additional parameter. The Wald Test can be used to determine whether or not deleting a parameter would significantly reduce the fit of the model. The Wald test is available in SAS Calis, but not in AMOS. One should keep in mind that adding or deleting a parameter will likely change the effect of adding or deleting other parameters, so parameters should be added or deleted one at a time. It is recommended that one add parameters before deleting parameters.
Reliability of Measured Variables. The variance in each measured variable is assumed to stem in part from variance in the underlying latent variable. Classically, the variance of a measured variable can be partitioned into true variance (that related to the true score) and (random) error variance. The reliability of a measured variable is the ratio of true variance to total (true + error) variance. In SEM the reliability of a measured variable is estimated by a squared correlation coefficient, which is the proportion of variance in the measured variable that is “explained” by variance in the latent variable(s).
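A quick simulation makes the true-variance/total-variance partition tangible. Here I generate artificial true scores with variance 1 and add error with variance 0.25, so the reliability should be 1 / 1.25 = .80, and the squared correlation between true and measured scores recovers it (simulated data, assumed model):

```python
import numpy as np

rng = np.random.default_rng(42)
true_scores = rng.normal(size=10_000)        # latent (true) scores, var = 1
error = rng.normal(scale=0.5, size=10_000)   # random measurement error, var = .25
measured = true_scores + error               # observed indicator

# Reliability = true variance / total variance = 1 / (1 + .25) = .80.
r = np.corrcoef(true_scores, measured)[0, 1]
print(r ** 2)  # approximately 0.80
```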