Bayesian Comparison of Latent Variable Models: Conditional Versus Marginal Likelihoods

Merkle, E. C.; Furr, D.; et al.
Grantee Submission, Vol. 84 (2019), p. 31

Typical Bayesian methods for models with latent variables (or random effects) involve directly sampling the latent variables along with the model parameters. In high-level software code for model definitions (using, e.g., BUGS, JAGS, Stan), the likelihood is therefore specified as conditional on the latent variables. This can lead researchers to perform model comparisons via conditional likelihoods, where the latent variables are considered model parameters. In other settings, however, typical model comparisons involve marginal likelihoods where the latent variables are integrated out. This distinction is often overlooked despite the fact that it can have a large impact on the comparisons of interest. In this paper, we clarify and illustrate these issues, focusing on the comparison of conditional and marginal Deviance Information Criteria (DICs) and Watanabe–Akaike Information Criteria (WAICs) in psychometric modeling. The conditional/marginal distinction corresponds to whether the model should be predictive for the clusters that are in the data or for new clusters (where "clusters" typically correspond to higher-level units like people or schools). Correspondingly, we show that marginal WAIC corresponds to leave-one-cluster out cross-validation, whereas conditional WAIC corresponds to leave-one-unit out. These results lead to recommendations on the general application of the criteria to models with latent variables.

Keywords: Bayesian information criteria; conditional likelihood; cross-validation; DIC; IRT; leave-one-cluster out; marginal likelihood; MCMC; SEM; WAIC

The research reported here was supported by NSF Grant 1460719 and by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D140059. The authors thank Frédéric Gosselin and three anonymous reviewers for comments that improved the paper. Code to replicate the results from this paper can be found at http://semtools.r-forge.r-project.org/.

Introduction

Psychometric models typically include latent variables (called "random effects" in some cases), classic examples being item response theory (IRT) models, structural equation models (SEMs), and multilevel/hierarchical regression models (MLMs). In IRT and SEM, the latent variables represent latent traits or characteristics of persons, and in MLMs, they represent random intercepts and possibly random slopes associated with clusters, such as schools, countries or, in the case of longitudinal data, persons. We will use the generic term "clusters" for the entities represented by latent variables in the model. In a Bayesian setting, latent variable models are hierarchical Bayesian models with latent variables as "direct" parameters, whose priors depend on hyperparameters. Bayesian approaches have recently received increased interest due to improvements in the ease of coding and estimating complex models by Markov chain Monte Carlo (MCMC) in various software packages. When it is computationally demanding to evaluate the marginal likelihood (over the latent variables) for likelihood-based inference, MCMC has the great advantage of working with the conditional likelihood (given the latent variables) by sampling the latent variables in tandem with the other model parameters.

Comparison of Bayesian psychometric models is often facilitated by predictive information criteria, including the Deviance Information Criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, [46]) and the Watanabe–Akaike Information Criterion (WAIC; Watanabe, [55]). These are readily computed by popular Bayesian software packages, and their computations are heavily based on the model likelihoods. The sampling of latent variables can lead to a decision point related to the computation of the information criteria. The decision point specifically involves the question of whether or not the latent variables should be treated as model parameters. If so, then we would compute information criteria via conditional likelihoods and expect "effective number of parameter" metrics to be large (each cluster has one or more unique latent variable values, and there are many clusters). If not, then we would compute information criteria via marginal likelihoods that are not generally available from the MCMC output. In traditional model estimation methods that optimize a likelihood or discrepancy function, the marginal likelihood (or some approximation of it) is used a large majority of the time.

Because DIC and WAIC are measures of models' abilities to make predictions for new units, the "conditional vs. marginal" issue can be framed as a question of the type of new data we wish to predict. If we want our models to make predictions for new units from the same clusters as in our original data, then the conditional likelihood (conditioning on the latent variables of the specific clusters observed) is appropriate. Conversely, if we want our models to make predictions for new units from clusters not in the original data, then the marginal likelihood is appropriate. We can also think of this distinction in terms of how we would define the folds in k-fold cross-validation and leave-one-out approaches, or in other words, what we would leave out during estimation in order to evaluate predictive accuracy. To generalize to new units from existing clusters, we would leave out individual units from these clusters, and the predictive distribution for these units would exploit information from other units in the same clusters by conditioning on the cluster-specific latent variable values. To generalize to new units from new clusters, we would leave out intact clusters, and the predictive distribution would be marginal over the latent variables associated with the clusters. From this standpoint, the marginal likelihood may generally be preferred in psychometric modeling; we typically wish to generalize inferences beyond the specific clusters that were observed in our data. There are additional reasons to generally prefer the marginal likelihood, including the incidental parameter problem (Neyman & Scott, [35]; Lancaster, [18]), of which a Bayesian version exists when Bayes modal inference is used (also see Mislevy, [31]; O'Hagan, [36]).

Spiegelhalter et al. ([46]) discuss the conditional/marginal distinction in their original paper on the DIC, where they refer to the issue as "model focus" (see also Celeux, Forbes, Robert, Titterington, et al., [1]; Millar, [29]; Trevisani & Gelfand, [49]). If the parameters "in focus" include the latent variables, the likelihood becomes what we call the conditional likelihood; otherwise, the likelihood is what we call the marginal likelihood. However, the distinction is hardly ever discussed in applications. Bayesian books targeted at applied psychometricians (e.g., Song & Lee, [45]; Kaplan, [17]; Levy & Mislevy, [19]) and papers on Bayesian model comparison in psychometrics (e.g., Kang, Cohen, & Sung, [16]; F. Li, Cohen, Kim, & Cho, [20]; Zhu & Stone, [61]) define either the conditional DIC or a generic DIC for non-hierarchical models and then apply the conditional DIC without discussing the implications of this choice or mentioning the marginal DIC. Fox ([6]) is an exception here, where the generic DIC is first presented and various types of DICs (conditional and marginal) are later utilized in applications. Zhang, Tao, Wang, and Shi ([59]) also discuss various forms of DIC in the context of multilevel IRT models.

The WAIC has not yet been used much in psychometrics, exceptions being Luo and Al-Harbi ([25]) and daSilva, Bazán, and Huggins-Manley ([2]), who define the conditional version of WAIC for IRT without mentioning the marginal alternative, and Lu, Chow, and Loken ([22]) who use the marginal version in factor analysis, without discussing the issue or mentioning the conditional alternative. Moving beyond psychometrics, Gelman, Hwang, and Vehtari ([10]) point out that there is a choice to be made when defining the likelihood for Bayesian information criteria (including WAIC) between what we call conditional and marginal versions, but they use only the conditional version of WAIC in their "8-schools data" application and comparison to leave-one-out approaches. We are aware of a small number of contributions that discuss the merits of the marginal version of WAIC (Li, Qui, & Feng, [21]; Millar, [30]), but these treatments are confined to the special case of one response per latent variable (i.e., one unit per cluster). Further, Vehtari, Mononen, Tolvanen, Sivula, and Winther ([53]) define a marginal version of WAIC for Gaussian latent variable models without contrasting it to the conditional version.

From an applied perspective, researchers estimating Bayesian models are often unaware of the distinction between marginal and conditional versions of information criteria, relying on default software behavior. Users of BUGS (Lunn, Jackson, Best, Thomas, & Spiegelhalter, [23]; Lunn, Thomas, Best, & Spiegelhalter, [24]), JAGS (Plummer, [39]), and Stan (Stan Development Team, [48]) typically obtain information criteria based on a conditional likelihood, whereas users of blavaan (Merkle & Rosseel, [28]) and Mplus (e.g., Muthén & Asparouhov, [32]) obtain information criteria based on a marginal likelihood. Therefore, a first contribution of our paper is to emphasize the importance of the conditional/marginal distinction for information criteria and, subsequently, to recommend marginal criteria as a default. We relatedly clarify other differences in the definition of the DIC implemented in various pieces of software. A second contribution is to propose a marginal version of WAIC for the prevalent case of multiple units per cluster and to contrast it with the conditional version. We specifically show that the marginal WAIC corresponds to leave-one-cluster out (LOcO) cross-validation, whereas the conditional WAIC corresponds to leave-one-unit out (LOuO) cross-validation and has no rationale in the case of one unit per cluster. A third contribution is to propose an efficient version of adaptive quadrature for the case where the marginal likelihood does not have a closed form. The method differs from traditional adaptive quadrature because it is performed as a postprocessing step using MCMC samples of the model parameters and latent variables as input. Our idea of using the marginal (with respect to the model parameters) posterior first and second moments of the latent variables in the adaptive quadrature approximation can easily be applied in Monte Carlo integration by using the corresponding normal density as importance distribution, and this would be an improvement over the method proposed by Li et al. ([21]). A fourth contribution is to provide two real examples, one using SEM and the other using IRT, where conditional and marginal information criteria can lead to very different substantive conclusions. For these examples, a large number of Monte Carlo draws is often needed to obtain acceptable Monte Carlo errors for the information criteria, especially in the conditional case. We make some practical recommendations for assessing convergence of the information criteria there.

In the next section, we define the class of models considered and in Section 3 we define and contrast information criteria based on conditional and marginal likelihoods. Two examples based on real data are presented in Section 4, one using SEM (Section 4.1) and the other IRT (Section 4.2). We end with a discussion in Section 5. The supplementary material includes R code to estimate the SEM and IRT models presented in this paper and obtain the information criteria—by performing adaptive quadrature integration for IRT and by using the recently extended R package blavaan for SEM.

Models and Conditional Versus Marginal Likelihoods

We consider models with continuous latent variables, such as IRT models, SEMs, and MLMs. Responses yij are observed for units i belonging to clusters j, such as students i in schools j or items/indicators i for persons j, whereas the latent variables ζj are cluster specific. The responses follow a generalized linear (or similar) model in which the conditional expectation of the responses is a function of the latent variables ζj and possibly observed covariates. The conditional likelihood is

$$f_c(y\mid\omega,\zeta)=\prod_{j=1}^{J}f_c(y_j\mid\omega,\zeta_j),\qquad(1)$$

where yj is the vector of responses for the units in cluster j (j = 1, ..., J), y is the vector of responses for all units (the yj stacked on one another), ζj is the vector of latent variables for cluster j, ζ is the vector of latent variables for all clusters (the ζj stacked on one another), and ω is the vector of model parameters. Under conditional independence, the joint conditional density (or probability) of the responses for cluster j factorizes as

$$f_c(y_j\mid\omega,\zeta_j)=\prod_{i=1}^{n_j}f_c(y_{ij}\mid\omega,\zeta_j),$$

where fc(yij|ω,ζj) is the conditional density of the response for unit i in cluster j (i = 1, ..., nj), given ω, ζj, and possibly covariates (not shown).

For example, in a confirmatory factor analysis, yij is a continuous response for indicator i and person j, ζj is a vector of common factors, fc(yij|ω,ζj) is a normal density, and ω includes intercepts, factor loadings, and unique factor variances. In an item response model with binary items, fc(yij|ω,ζj) is $\pi_{ij}^{y_{ij}}(1-\pi_{ij})^{1-y_{ij}}$, where πij is the conditional probability that yij equals 1 (often an inverse logit or inverse probit of a linear predictor), ζj are dimensions of ability, and ω includes difficulty parameters and possibly discrimination and guessing parameters. We call fc(y|ω,ζ) the conditional likelihood because it is conditional on the latent variables. Note that, in likelihood-based inference, "conditional likelihood" usually refers to the likelihood in which the latent variables have been eliminated by conditioning on sufficient statistics, whereas inference based on our conditional likelihood above is sometimes referred to as joint maximum likelihood estimation.
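To make the notation concrete, the following minimal R sketch (our illustration, not code from the paper; the function and object names are ours) evaluates the conditional log-likelihood in (1) for a binary one-parameter IRT model:

```r
## Minimal sketch (ours): conditional log-likelihood, log f_c(y | omega, zeta),
## for a binary one-parameter IRT model. y is a J x I matrix of 0/1 responses,
## delta a length-I vector of item difficulties (the parameters omega), and
## zeta a length-J vector of person latent variables.
cond_loglik <- function(y, delta, zeta) {
  eta <- outer(zeta, delta, "-")  # J x I linear predictors: zeta_j - delta_i
  ## pi_ij = inverse-logit(eta_ij); sum Bernoulli log densities over i and j
  sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
}
```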

The latent variables are independent across clusters and have a joint prior distribution

$$g(\zeta\mid\psi)=\prod_{j=1}^{J}g(\zeta_j\mid\psi_j)$$

with hyperparameters ψ that potentially include different values ψj for different (groups of) clusters j. Specification of the Bayesian model is completed by assuming prior distributions for ω and ψ , typically p(ω,ψ)=p(ω)p(ψ) .

The marginal likelihood is

$$f_m(y\mid\omega,\psi)=\prod_{j=1}^{J}\int f_c(y_j\mid\omega,\zeta_j)\,g(\zeta_j\mid\psi_j)\,d\zeta_j.\qquad(2)$$

We call this likelihood marginal because it is integrated over the latent variables. However, this likelihood is not marginal over the model parameters and should therefore not be confused with the marginal likelihood that is used, for instance, for Bayes factor computation. The marginal likelihood defined above is the standard likelihood employed in maximum likelihood estimation of latent variable or random-effects models. Its use implies that the latent variables are not of direct inferential interest, instead being ancillary parameters that are removed from the likelihood. Celeux et al. ([1]), who regard the latent variables as missing data, refer to this likelihood as the observed likelihood because it is marginal over the missing data.
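For intuition, here is a schematic R sketch of one way to approximate (2) in the binary IRT setup above (ours; simple Monte Carlo over the prior, with all names hypothetical). For normal likelihoods, as in SEMs with continuous responses, the integral is instead available in closed form:

```r
## Sketch (ours): marginal log-likelihood of Eq. (2), approximating each
## cluster's integral by Monte Carlo draws from the prior g(zeta_j | psi),
## here N(0, tau^2). Underflow is ignored for simplicity.
marg_loglik <- function(y, delta, tau, S = 5000) {
  ll <- numeric(nrow(y))
  for (j in seq_len(nrow(y))) {
    zeta_s <- rnorm(S, 0, tau)  # draws from g(zeta | psi)
    lik_s <- vapply(zeta_s, function(z)
      prod(dbinom(y[j, ], 1, plogis(z - delta))), numeric(1))
    ll[j] <- log(mean(lik_s))   # approximate integral for cluster j
  }
  sum(ll)
}
```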

Note that there are usually some parameters that can be included either in ψ or in ω, depending on how the model is formulated. For example, a linear two-level variance-components model can be written in two alternative ways:

$$\text{A:}\quad y_{ij}\sim N(\zeta_j,\sigma^2),\quad \zeta_j\sim N(\alpha,\tau^2)$$
$$\text{B:}\quad y_{ij}\sim N(\alpha+\zeta_j,\sigma^2),\quad \zeta_j\sim N(0,\tau^2).$$

Version A corresponds to the two-stage formulation of hierarchical models (Raudenbush & Bryk, [43]), where latent variables (or random effects) appear in the model for yij and become "outcomes" in a second-stage model. In SEM, the first-stage and second-stage models are referred to as the measurement part and structural part, respectively. In the Bayesian literature, this formulation has been referred to as hierarchical centering (Gelfand, Sahu, & Carlin, [8]). If the model is formulated like this, the hyperparameters ψ include the mean α and variance τ2 . Version B above specifies the reduced form model for yij so that ζj becomes a disturbance with zero mean and there is now only one hyperparameter, ψ=τ2 .

Conditional and Marginal Information Criteria

To illustrate problems and differences associated with the use of conditional (vs marginal) likelihoods, we focus on two specific metrics: DIC and WAIC. The former is included due to its popularity (related to its inclusion in BUGS and JAGS), whereas the latter is included because it is a more recent metric that is closer to the current "state of the art" in Bayesian model assessment (e.g., Gelman et al., [9], [10]; McElreath, [27]; Vehtari, Gelman, & Gabry, [52]).

DIC

The deviance information criterion (DIC) was introduced by Spiegelhalter et al. ([46]). For a model with parameters estimated as θ~ , they define the loss in assigning the model-implied predictive density f(yr|θ~) to new, replicate (or out-of-sample) data yr as -2logf(yr|θ~) . Following Efron ([5]), they express the expectation of this out-of-sample loss over the distribution of the replicate data as the sum of the in-sample loss (evaluated for the observed data) and the "optimism" due to using the same data to estimate θ and evaluate the loss:

$$E_{y^r}\!\left[-2\log f(y^r\mid\tilde\theta)\right]=-2\log f(y\mid\tilde\theta)+\text{optimism},\qquad(3)$$

where the first term is sometimes referred to as the "plug-in deviance." Spiegelhalter et al. ([46]) use heuristic arguments to approximate the optimism by 2pD , where

$$p_D=E_{\theta\mid y}\!\left[-2\log f(y\mid\theta)\right]+2\log f(y\mid\tilde\theta),\qquad(4)$$

the posterior expectation of the deviance minus the deviance evaluated at θ~ , typically the posterior expectation of the parameters. Here, pD can be interpreted as a measure of the effective number of parameters. Combining these expressions, the DIC becomes

$$\text{DIC}=-2\log f(y\mid\tilde\theta)+2p_D=E_{\theta\mid y}\!\left[-2\log f(y\mid\theta)\right]+p_D.\qquad(5)$$

Plummer ([40]) proposes an alternative approximation for the optimism, which makes use of the undirected Kullback–Leibler divergence (see also Plummer's contribution to the discussion of Spiegelhalter et al., [46]). In the general case, this alternative is given by

$$2p_D^{P}=E_{\theta\mid y}\!\left[\log\frac{f(y^{r1}\mid\theta^{1})}{f(y^{r1}\mid\theta^{2})}+\log\frac{f(y^{r2}\mid\theta^{2})}{f(y^{r2}\mid\theta^{1})}\right],\qquad(6)$$

where θ1 and θ2 are independent posterior draws (e.g., from parallel chains), yr1 is a replicate drawn from f(y|θ1) , and yr2 is defined similarly. For exponential family models, including the multivariate normal that is used later, the Kullback–Leibler divergence has a closed form so that the posterior expectation can be evaluated directly without the need for replicates. However, instead of substituting the above expression for the optimism in (3), or in other words for 2pD in the first line of (5) as apparently recommended by Plummer ([40]) and as implemented in blavaan (the approach that will be called the "Plummer definition" in Section 4), JAGS substitutes half this expression for pD in the second line of (5). The JAGS approach is convenient as it does not require the postprocessing step of computing the plug-in deviance after the MCMC samples have been obtained. However, the JAGS approach to DIC in (6) differs from the BUGS approach in (4), and many researchers fail to realize this difference.
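As a concrete postprocessing recipe (our sketch; `theta_draws` and `dev` are hypothetical names), the Spiegelhalter et al. version in (4) and (5) needs only the deviance at each draw plus one plug-in evaluation; the Plummer version in (6) would instead average the Kullback–Leibler terms over pairs of draws from parallel chains:

```r
## Sketch (ours): DIC with the Spiegelhalter et al. p_D of Eq. (4)-(5).
## theta_draws: S x P matrix of posterior draws (one column per parameter);
## dev: function returning -2 log f(y | theta) for one parameter vector.
dic_spiegelhalter <- function(theta_draws, dev) {
  dbar <- mean(apply(theta_draws, 1, dev))  # posterior mean deviance
  dhat <- dev(colMeans(theta_draws))        # plug-in deviance at theta-tilde
  p_d  <- dbar - dhat                       # Eq. (4)
  c(DIC = dhat + 2 * p_d, p_D = p_d)        # Eq. (5)
}
```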

Spiegelhalter et al. ([46]) emphasize that, for hierarchical Bayesian models, the definition of pD and DIC depends on which parameters are "in focus," or in other words, which parameters are included in θ . If the parameters in focus include the latent variables ζ , then f(·|·) in Equations (3) to (5) is fc(y|ω,ζ) from (1), whereas it is fm(y|ω,ψ) from (2) if the parameters in focus exclude the latent variables. We therefore define the conditional DIC as

$$\text{DIC}_c=-2\log f_c(y\mid\tilde\omega,\tilde\zeta)+2p_{Dc}=E_{\omega,\zeta\mid y}\!\left[-2\log f_c(y\mid\omega,\zeta)\right]+p_{Dc}\qquad(7)$$

and the marginal DIC as

$$\text{DIC}_m=-2\log f_m(y\mid\tilde\omega,\tilde\psi)+2p_{Dm}=E_{\omega,\psi\mid y}\!\left[-2\log f_m(y\mid\omega,\psi)\right]+p_{Dm}\qquad(8)$$

with corresponding conditional and marginal variants of (4) or (6) for pDc and pDm . Note that Celeux et al.'s ([1]) DIC7 and DIC1 correspond to our conditional and marginal DIC, respectively.

Conditioning on ζ implies that the new units yr in hypothetical replications come from the same clusters as the observed data, whereas marginalizing over ζ implies that the new units are for new clusters. We show in Appendix A that Eω,ζ|y[-2 log fc(y|ω,ζ)] ≤ Eω,ψ|y[-2 log fm(y|ω,ψ)] and in Appendix B that pDc tends to be much larger than pDm. Because the magnitudes of these differences will vary between models, there will be many instances where conditional DICs and marginal DICs favor different models. Plummer ([40]) shows that the large-sample behavior of the penalty pD depends on the dimension of θ and that the penalty may not approach the optimism if the dimensionality of θ increases with the sample size, as it does for latent variable models when ζ is in focus. This finding implies that the penalty term is a poor approximation of the optimism when the conditional DIC is used.

When the individual clusters are not of intrinsic interest or when we would like to choose among models that differ in the specification of the distribution for the latent variables p(ζ|ψ) , then clearly the marginal DIC should be used. However, the conditional DIC is easier to compute because it does not require integration. For this reason, users of MCMC software (e.g., BUGS, JAGS) usually obtain the conditional DIC as output and are often unaware of the important distinction between conditional and marginal versions of the DIC.

It is sometimes possible to evaluate the marginal likelihoods by exploiting software for maximum likelihood estimation. For example, blavaan uses the related lavaan package (Rosseel, [44]) to evaluate the marginal likelihood for each MCMC iteration to obtain the posterior expectation of the deviance for pDm and once after MCMC sampling is complete to compute the plug-in deviance. In the case of normal likelihoods, as in linear mixed models or SEMs with continuous responses, the marginal likelihood has a closed form. However, closed-form solutions are generally not available for non-normal likelihoods, as in models with categorical responses. For such cases, we propose approximating the integrals using a computationally efficient form of adaptive quadrature (Naylor & Smith, [34]; Rabe-Hesketh, Skrondal, & Pickles, [41]). With posterior samples of the model parameters and latent variables as input, the marginal deviances evaluated at each draw of the model parameters and at their posterior means are obtained by using the marginal (with respect to the model parameters) posterior first and second moments of the latent variables in the adaptive quadrature approximation. This quadrature method is less computationally intensive than typical adaptive quadrature methods, because it takes place after MCMC sampling and can use all posterior samples of model parameters for the "adaptive" step. While the method will eventually become computationally prohibitive as the number of latent variables increases, its performance is improved over traditional quadrature methods. See Appendix C for details, where we also point out that the normal approximation to the marginal posterior distribution could be used as the importance distribution in Monte Carlo integration, which would be an improvement over the standard Monte Carlo integration used by Li et al. ([21]) to compute predictive information criteria. For probit models and MCMC sampling with augmented variables, Fox ([6], p. 190) and Zhang et al. ([59]) exploit the closed form of the marginal likelihood of the augmented variables to obtain different kinds of marginal DICs.
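The following R sketch conveys the idea for the Rasch-type model used later (ours, assuming the statmod package for Gauss–Hermite nodes; all object names are hypothetical). The posterior mean m_j and standard deviation s_j of ζj, computed once from all MCMC draws (marginal over the model parameters), center and scale the quadrature grid, and the marginal likelihood contribution of cluster j can then be evaluated at any parameter draw (δ, τ):

```r
## Sketch (ours): post-MCMC adaptive Gauss-Hermite approximation of
## f_m(y_j | omega, psi) in Eq. (2) for a binary Rasch-type model.
## m_j, s_j: posterior mean and SD of zeta_j across all MCMC draws;
## Q: number of quadrature points.
library(statmod)
adapted_marg_ll <- function(y_j, delta, tau, m_j, s_j, Q = 11) {
  gh <- gauss.quad(Q, kind = "hermite")  # nodes x_q and weights w_q
  z  <- m_j + sqrt(2) * s_j * gh$nodes   # adapted evaluation points
  ## integrand f_c(y_j | omega, z) g(z | psi), weighted by the ratio of the
  ## Hermite kernel to the adapted normal importance density
  f <- vapply(z, function(zq)
    prod(dbinom(y_j, 1, plogis(zq - delta))) * dnorm(zq, 0, tau), numeric(1))
  log(sum(gh$weights / sqrt(pi) * f / dnorm(z, m_j, s_j)))
}
```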

Zhao and Severini ([60]) relatedly describe and compare a variety of other methods for computing integrated likelihoods. Our approach is closest to their "direct method" of importance sampling (see their Section 3.1.2), with importance density chosen to be similar to what they call the "weighted likelihood function." However, in the weighted likelihood function, we use the marginal prior instead of the conditional prior distribution to improve computational efficiency. In addition, we use Gauss–Hermite quadrature instead of Monte Carlo integration because the importance distribution is normal. For this reason, the Naylor and Smith ([34]) approach is closest to ours in the Bayesian literature.

WAIC

Gelman et al. ([10]) and Vehtari et al. ([52]) describe WAIC (Watanabe, [55]) as an approximation to minus twice the expected log pointwise predictive density (elppd) for new data. The definition of the WAIC requires that the data points, such as responses yij for units, are independent given the parameters, so we first consider the conditional case where the parameters θ include the latent variables. In this case, minus twice elppd is

$$-2\sum_{j=1}^{J}\sum_{i=1}^{n_j}E_{y_{ij}^{r}}\!\left[\log E_{\omega,\zeta\mid y}\,f_c(y_{ij}^{r}\mid\omega,\zeta_j)\right]=-2\sum_{j=1}^{J}\sum_{i=1}^{n_j}\log E_{\omega,\zeta\mid y}\,f_c(y_{ij}\mid\omega,\zeta_j)+\text{optimism},\qquad(9)$$

where Eω,ζ|yfc(yijr|ω,ζj) on the left is the pointwise (posterior) predictive density for a new response yijr and its logarithm is averaged over the distribution of new responses. In contrast, the first term on the right is evaluated at the observed response yij and therefore represents the in-sample version of the elppd, called lppd by Gelman et al. ([10]). Unlike the plug-in deviance in the DICc , where we condition on point estimates ω~,ζ~ , here we take the expectation of fc(yijr|ω,ζj) over the posterior distribution of ω and ζ . This can be written as

$$E_{\omega,\zeta\mid y}\,f_c(y_{ij}^{r}\mid\omega,\zeta_j)=\iiint f_c(y_{ij}^{r}\mid\omega,\zeta_j)\left[p(\zeta_j\mid y_j,\omega,\psi)\,p(\omega,\psi\mid y)\right]d\psi\,d\omega\,d\zeta_j,$$

where the term in square brackets evaluates to the joint posterior p(ω,ζj|y) . It is clear from this expression that, even after conditioning on ω and ψ , whose posterior depends on all the data, the data yj for cluster j provide direct information on ζj . As discussed by Gelman, Meng, and Stern ([12]), this posterior predictive density is appropriate for the situation where hypothetical new data are responses from the existing clusters. The conditional WAIC is then given by

$$\text{WAIC}_c=-2\sum_{j=1}^{J}\sum_{i=1}^{n_j}\log E_{\omega,\zeta\mid y}\,f_c(y_{ij}\mid\omega,\zeta_j)+2p_{Wc},\qquad(10)$$

where pWc is the effective number of parameters and can be approximated as (see Gelman et al., [10], for an alternative approximation)

$$p_{Wc}=\sum_{j=1}^{J}\sum_{i=1}^{n_j}\operatorname{Var}_{\omega,\zeta\mid y}\!\left[\log f_c(y_{ij}\mid\omega,\zeta_j)\right].$$

Vehtari et al. ([52]) discuss problems with this approximation when any of the posterior variances of the log pointwise predictive densities, Varω,ζ|y[logfc(yij|ω,ζj)] , are too large, giving 0.4 as an upper limit based on simulation evidence.

If the hypothetical new data come from new clusters with new values ζjr of the latent variables, the mixed predictive density (Gelman et al., [12]; Marshall & Spiegelhalter, [26]) should be used, which is the posterior expectation of fm(yijr|ω,ψ) and can be written as

$$E_{\omega,\psi\mid y}\,f_m(y_{ij}^{r}\mid\omega,\psi)=\iint\left[\int f_c(y_{ij}^{r}\mid\omega,\zeta_j)\,p(\zeta_j\mid\psi)\,d\zeta_j\right]p(\omega,\psi\mid y)\,d\omega\,d\psi,\qquad(11)$$

where the term in brackets, which evaluates to fm(yijr|ω,ψ) , involves only the prior of ζj . When making predictions for a new cluster, the data provide information on ζj only indirectly by providing information on ψ . For clustered data, we cannot simply use (11) to define the marginal WAIC because the responses for different units from the same cluster are not conditionally independent given ω and ψ , so that we must redefine the "data points" as the vectors of responses yj for each cluster j. The marginal WAIC therefore becomes

$$\text{WAIC}_m=-2\sum_{j=1}^{J}\log E_{\omega,\psi\mid y}\,f_m(y_j\mid\omega,\psi)+2p_{Wm}$$

with

$$p_{Wm}=\sum_{j=1}^{J}\operatorname{Var}_{\omega,\psi\mid y}\!\left[\log f_m(y_j\mid\omega,\psi)\right],$$

where fm(yj|ω,ψ) is given as term j of the product in (2). Li et al. ([21]) and Millar ([30]) define WAICm for the case when there is only one unit per cluster. Examples are overdispersed Poisson or binomial regression models, where the unit-level latent variable modifies the variance function, and meta-analysis, where the data for the clusters have been aggregated into effect-size estimates.
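Given matrices of pointwise log densities evaluated at each posterior draw, both versions reduce to the same computation applied to different "data points": columns are units ij for the conditional version and clusters j for the marginal version. A minimal R sketch (ours; a careful implementation would use a log-sum-exp to avoid underflow in the lppd term):

```r
## Sketch (ours): WAIC from an S x (number of data points) matrix `ll` of
## pointwise log densities, one row per posterior draw. Pass unit-level
## log f_c(y_ij | omega, zeta_j) columns for WAIC_c, or cluster-level
## log f_m(y_j | omega, psi) columns for WAIC_m.
waic_generic <- function(ll) {
  lppd <- sum(log(colMeans(exp(ll))))  # in-sample log pointwise pred. density
  p_w  <- sum(apply(ll, 2, var))       # effective number of parameters
  c(WAIC = -2 * lppd + 2 * p_w, p_W = p_w)
}
```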

Interpretation of WAIC via connections with LOO-CV

Watanabe ([55]) showed that WAIC and Bayesian leave-one-out (LOO) cross-validation (CV) are asymptotically equivalent. Following our previous discussion, the different forms of WAIC correspond to different forms of LOO-CV. Because the predictive distribution in the conditional WAIC is for a response from a new unit in an existing cluster, it approximates leave-one-unit-out CV, given as

$$-2\,\text{LOuO-CV}=-2\sum_{j=1}^{J}\sum_{i=1}^{n_j}\log E_{\omega,\zeta\mid y_{-ij}}\,f_c(y_{ij}\mid\omega,\zeta_j),$$

where y-ij represents the full data with observation ij held out. In contrast, the predictive distribution in the marginal WAIC is the joint marginal distribution for the responses of all units in a new cluster and therefore approximates leave-one-cluster-out CV:

$$-2\,\text{LOcO-CV}=-2\sum_{j=1}^{J}\log E_{\omega,\psi\mid y_{-j}}\,f_m(y_j\mid\omega,\psi).$$

In other words, the marginal WAIC generalizes to new clusters, whereas the conditional WAIC generalizes only to new units in the clusters that are in the data. Note that the first term in WAICm , the in-sample lppd, corresponds to the "hierarchical approximation" (Vehtari, Mononen, et al., [53]) or the "full-data mixed" (Marshall & Spiegelhalter, [26]) method for approximating what we call LOcO-CV.

Li et al. ([21]) and Millar ([30]) motivate their marginal WAICm as an approximation to LOO-CV in the case of one unit per cluster. Millar ([30]) shows that conditional leave-one-out (our LOuO) becomes marginal leave-one-out (our LOcO) when there is only one unit per cluster, because there are no data for the cluster to condition on after removing the unit. Therefore, WAICc has no clear justification in this case. Vehtari et al. ([52]) nevertheless use WAICc for the "eight schools" data, a meta-analysis of SAT preparatory programs. They find that WAICc diverges from LOO-CV when the responses are multiplied by a factor S with S > 1.5 (see their Figure 1a). We replicated their results and found that WAICm, which Vehtari et al. ([52]) did not consider, continues to be a good approximation to LOO-CV for S = 4, for which we obtained LOO-CV = 86.0, WAICm = 85.5, and WAICc = 68.7. Millar ([30]) also points out that two regularity conditions required for asymptotic equivalence of WAIC and LOO-CV do not hold for the conditional versions, namely (1) the observations are not identically distributed (their distribution depends on the latent variables ζj), and (2) the number of parameters increases with the sample size. These two regularity conditions are also violated in the clustered case, so it is not clear whether WAICc is a good approximation to LOuO-CV.

Recent work describes the fact that LOO-CV (and, consequently, WAIC) does not necessarily select the true or best model. Piironen and Vehtari ([37]) explored use of WAIC (and other metrics) for selecting subsets of predictors in a regression context, finding that variability associated with its estimation can lead to poor model selection. Gronau and Wagenmakers ([14]) provide three artificial examples where LOO-CV does not asymptotically select the data-generating model. Discussants of these issues have recommended against the sole use of point estimates for model selection, also considering, e.g., qualitative patterns of data that the model can accommodate (Navarro, [33]) and uncertainty associated with the model's predictive accuracy (Vehtari, Simpson, Yao, & Gelman, [54]). Uncertainty in a model's predictive accuracy may include explicit consideration of variability in a metric like WAIC, or it may include a different procedure such as Bayesian model averaging (Hoeting, Madigan, Raftery, & Volinsky, [15]) or stacking (Yao, Vehtari, Simpson, & Gelman, [58]).

Vehtari et al. ([52]) introduce a computationally efficient approximation to LOO-CV using Pareto-smoothed importance sampling (PSIS), and they implement this method in the R package loo (Vehtari, Gelman, & Gabry, [51]). This package computes both PSIS-LOO and WAIC with MCMC draws of pointwise predictive densities as input, and it can be used to compute PSIS-LOuO, PSIS-LOcO, WAICc , and WAICm by providing either logfc(yij|ω,ζj) or logfm(yj|ω,ψ) as input. This package is used in Section 4.2.
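For instance, assuming `ll_cond` and `ll_marg` are the S × N unit-level and S × J cluster-level log-density matrices described above (a sketch of the workflow, not code from the paper):

```r
## Sketch (ours): all four criteria via the loo package, given the
## log-density matrices defined above. For MCMC output, loo() also
## accepts a relative efficiency argument (r_eff); omitted here.
library(loo)
waic(ll_cond)  # WAIC_c
waic(ll_marg)  # WAIC_m
loo(ll_cond)   # PSIS-LOuO
loo(ll_marg)   # PSIS-LOcO
```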

Examples

We now discuss two examples to highlight differences between various types of Bayesian information criteria. The first example involves a series of increasingly constrained multiple-group factor analysis models, as might be used in a measurement invariance study. This example focuses on DIC and its various definitions across software and across conditional/marginal likelihoods. The second example involves item response models, where marginal versions of the criteria do not have closed forms. In this example, we consider WAIC and PSIS-LOO in addition to DIC.

Measurement Invariance Study via Multi-Group CFA

Wicherts, Dolan, and Hessen ([57]) compared a series of four-group, one-factor models using data from a stereotype threat experiment. Here, we replicate these model comparisons via both conditional DIC and marginal DIC; the original authors' comparisons involved frequentist models. This allows us to illustrate the practical impact of using different DICs.

Method

The data consist of Differential Aptitude Test scores from 295 Dutch students of both majority and minority Dutch ethnicities. The test contained three subscales (verbal ability with 16 items, mathematical ability with 14 items, and abstract reasoning with 18 items), where each subscale score was computed as the number of items that a student answered correctly. Gaussian factor analysis is used; an item factor analysis model would perhaps be more appropriate, but our models below generally match those of the original authors.

The original authors used a "stereotype threat" manipulation during the data collection: half of the students were primed about ethnic stereotypes related to intelligence tests, while the other half received no primes. Wicherts et al. ([57]) conducted a measurement invariance study with the resulting data, to examine the impact of the stereotype threat manipulation on the students. This measurement invariance study involved four groups: two ethnicities (majority, minority) crossed with the stereotype threat manipulation (received, not received). The measurement model for subscale i ( i=1,2,3 ) and person j ( j=1,...,295 ) in group g ( g=1,2,3,4 ) is

$$y_{ijg}\mid\nu_{ig},\lambda_{ig},\sigma_{ig},\zeta_{jg}\sim N(\nu_{ig}+\lambda_{ig}\zeta_{jg},\ \sigma_{ig}^{2}),$$

where νig is a group-specific intercept, λig is a group-specific factor loading for subscale i (with the loadings for the first subscale set to 1 for identification, λ1g=1 ), ζjg is the common factor, and σig2 is the group-specific variance of the unique factor for subscale i.

The parameters νig, λig, and σig² are in ω and have the following prior distributions for g = 1, 2, 3, 4:

$$\nu_{1g},\nu_{2g},\nu_{3g}\sim N(0,1000),\qquad \lambda_{2g},\lambda_{3g}\sim N(0,100),\qquad \sigma_{1g}^{-2},\sigma_{2g}^{-2},\sigma_{3g}^{-2}\sim\text{Gamma}(1,.5).$$

The distribution of the common factor, or the structural part of the SEM, is

$$\zeta_{jg}\sim N(\alpha_g,\tau_g^{2})$$

with group-specific means αg and variances τg². These hyperparameters are in ψ and have the following priors:

$$\alpha_g\sim N(0,100),\qquad \tau_g^{-2}\sim\text{Gamma}(1,1),$$

with the exception that α1 is fixed to zero for identification. These prior distributions are similar to the defaults supplied by blavaan. They are just informative enough so that most models converge in JAGS, while being uninformative enough to not exert much influence on the parameters' posterior distributions.

Table 1. Description of estimated models.

| Model | Parameter restrictions | Parameter count |
|-------|------------------------|-----------------|
| 2  | All λig = λi, i = 1, ..., 3; g = 1, ..., 4 | 30 |
| 2a | Model 2, except λ24 unrestricted | 31 |
| 3  | Model 2a, plus all σig² = σi² | 22 |
| 3a | Model 3, except σ24² unrestricted | 23 |
| 4  | Model 3a, plus τ1² = τ2² and τ3² = τ4² | 21 |
| 5  | Model 4, plus all νig = νi except ν24 unrestricted | 16 |
| 5a | Model 5, except ν31 unrestricted | 17 |
| 5b | Model 5a, except ν32 unrestricted | 18 |
| 6  | Model 5b, plus α1 = α3 | 17 |

Note. The parameter count is the number of free parameters in the marginal likelihood. Also see Table 3 of Wicherts et al. ([57]).

The measurement invariance study involved the comparison of nine versions of the model above; specific model details are shown in Table 1. The overall strategy was to set increasingly more parameters equal across groups (i.e., across the g subscript), starting with factor loadings λig (models 2 and 2a), then unique factor variances σig² (models 3 and 3a), then common factor variances τg² (model 4; constant across conditions but allowed to differ between minority and non-minority groups), then intercepts νig (models 5, 5a and 5b), and finally factor means αg (model 6; constant across conditions but allowed to differ between minority and non-minority groups). The models without letters resulted from the previous model by constraining one type of parameter, whereas the model with the same number and a letter resulted from freeing some of these parameters again based on SEM fit criteria and modification indices. For example, model 5 (intercepts restricted across all four groups) is nested in models 5a (one intercept freed) and 5b (another intercept freed), and model 6 (factor means constrained equal across conditions) is nested in model 5b. See Wicherts et al. ([57]) for further detail on the models.

We conduct a Bayesian re-analysis of the nine Wicherts et al. ([57]) models here, focusing on model comparison via DIC. We used JAGS to estimate each of the models ten times; for each of these estimations, we used "automatic" computations of chain length and convergence proposed by Raftery and Lewis ([42]) and implemented in R package runjags (Denwood, [4]). This algorithm first runs three chains for 4,000 iterations each, checking that all parameters' Gelman–Rubin statistics (Gelman & Rubin, [13]) are below 1.05. If not, the algorithm runs for 4,000 more iterations and re-checks, continuing in this fashion until either the statistics are below 1.05 or a maximum time limit has been reached. Once the 1.05 threshold is achieved for all parameters, the algorithm performs a computation to determine the number of additional samples necessary to estimate each parameter's 2.5th quantile to an accuracy of .005 with probability .95. Then the chains are run for this number of additional samples. Thus, the specific number of samples varies for each model estimation.
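A sketch of this estimation workflow (ours; the argument names reflect our reading of the runjags documentation and should be verified; `model_string`, `pars`, and `jags_data` are hypothetical):

```r
## Hypothetical sketch (ours) of the automated run-length procedure above.
library(runjags)
fit <- autorun.jags(
  model = model_string, monitor = pars, data = jags_data, n.chains = 3,
  startburnin = 4000, startsample = 10000,
  psrf.target = 1.05,  # Gelman-Rubin threshold for all monitored parameters
  raftery.options = list(q = 0.025, r = 0.005, s = 0.95)  # 2.5th quantile
)
```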

For each estimation, we computed four DIC statistics via blavaan: the conditional and marginal versions using both the Spiegelhalter et al. ([46]) definition in (4) and the Plummer ([40]) definition in (6) for "effective number of parameters."

Results

Spiegelhalter et al.'s ([46]) conditional and marginal DICs for ten estimations of each of the nine models are displayed in Figure 1. Here, the scale of the y-axis for the marginal case is a tenfold magnification of that for the conditional case, showing that the Monte Carlo error is far greater for the conditional DICs. However, there is substantial Monte Carlo error also in the marginal case, making model comparison unreliable for some pairs of models (e.g., Models 3 and 2a). This suggests that accurate DICs require sampling for more iterations, as compared to what is required for obtaining accurate parameter estimates.

Especially because these models were part of a measurement invariance study, we also examine where we would stop as we moved from Model 2 to 6 in a sequential/stepwise fashion (as was done in the original paper, though this is not required). In the marginal case, the DIC for Model 3 is generally larger than the DIC for Model 2a, so we may stop at Model 2a. However, Monte Carlo error is large enough that we may prefer Model 3 to Model 2a in a given replication, in which case we may stop at Model 4. But the marginal DIC is lowest for the final Model 6. The results are very different in the conditional case. Here, we would stop at Model 2a across all ten replications. Further, the models generally obtain larger DIC values as we move toward the final model 6, in stark contrast to the marginal case. Models 5b and 6 have the worst values in the conditional case, whereas they are the best in the marginal case.

Fig. 1. Marginal and conditional DICs (Spiegelhalter et al. definitions) for nine models from Wicherts et al., where each model was estimated ten times.

Beyond the comparison of DICs, we further examine the four definitions of effective number of parameters: marginal versus conditional, crossed with the Spiegelhalter versus Plummer definitions. These are displayed for the Wicherts et al. ([57]) models in Figure 2, with the lines showing simple parameter counts for each model (where, in the conditional case, latent variable values for the students count as parameters). Results are similar to the previous figure in that the conditional definitions exhibit greater Monte Carlo error than the marginal definitions (note the differences in y-axis scales, both within this figure and as compared to Figure 1). In the marginal case, we are counting the unrestricted item-specific parameters in ω and the unrestricted factor means and variances in ψ (note that these parameter counts increase when parameters are freed, i.e., going from Model 2 to 2a, from 3 to 3a, and from 5 to 5a to 5b); the Plummer effective number of parameters is generally larger than the Spiegelhalter effective number of parameters here. In the conditional case, we are counting the unrestricted item-specific parameters in ω and the common factor values for all students. Here, we see greater discrepancy between the Spiegelhalter and Plummer definitions. The Spiegelhalter definition appears to work better here, as the values are nearly always smaller than the associated parameter counts (which we would expect, given that the latent variables exhibit shrinkage). However, in fairness, Plummer ([40]) explicitly states that his definition is not intended for situations where the number of parameters increases with the number of observations, as it does in the conditional case. Beyond this, we observe situations where the marginal effective number of parameters decreases while the conditional effective number of parameters increases or stays the same; the sequence from Models 4 to 6 exhibits multiple examples.

Fig. 2. Estimated effective number of parameters under marginal and conditional DIC for nine models from Wicherts et al., where each model was estimated ten times. Lines reflect simple parameter counts for each model, where latent variables count as parameters in the conditional case. The Plummer DIC computes pD via Kullback–Leibler divergence as in (6), whereas the Spiegelhalter DIC computes pD via deviances as in (4).

Finally, we point out that much of the variability in DIC is due to Monte Carlo error in the estimate of the effective number of parameters. As shown in Appendix D, this Monte Carlo error is easy to estimate (in the Plummer case by ignoring any Monte Carlo error in the plug-in deviance). Figure 3 plots the Plummer DICs across the 10 replications, with error bars representing ±2 times the estimated Monte Carlo error (from a single replication) only in the effective number of parameters. We observe that the error bars reflect the magnitude of variability in the overall DIC values. To support this claim, we computed the proportion of the time that the error bars covered the DICs across the ten replications and nine models. This coverage was 76% in the conditional case and 83% in the marginal case (the coverage for the corresponding effective numbers of parameters was 89% and 99%, respectively). While the coverage for the DIC is lower than the nominal 95%, it illustrates that the deviance evaluated at the posterior means exhibits small variability compared to the variability in expected deviance (Spiegelhalter case) or K–L distance (Plummer case).
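One simple way to approximate such a Monte Carlo error in R (our sketch, not necessarily the Appendix D estimator) divides the posterior standard deviation of the deviance draws by the square root of their effective sample size:

```r
## Sketch (ours): Monte Carlo standard error of the posterior mean deviance,
## which dominates the variability of p_D. dev_draws holds -2 log f(y | theta)
## evaluated at each retained MCMC draw.
library(coda)
mcse_dbar <- sd(dev_draws) / sqrt(effectiveSize(as.mcmc(dev_draws)))
## error bars as in Fig. 3: Dbar +/- 2 * mcse_dbar
```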

Fig. 3. Marginal and conditional DICs (Plummer definitions) for ten replications of the Wicherts et al. models, with error bars (±2 SDs) stemming from a single replication.

Prior Sensitivity

While the prior distributions in the previous section were related to blavaan default settings, other sets of priors could be considered. Two plausible sets include (i) less-informative priors, closer to Mplus defaults, and (ii) informative priors based on substantive knowledge of the data. We conducted the same analyses described in the previous section under these two new sets of priors.

The less-informative set of priors was

$$\nu_{1g},\nu_{2g},\nu_{3g}\sim N(0,10{,}000),\qquad \lambda_{2g},\lambda_{3g}\sim N(0,1000)$$
$$\sigma_{1g}^{-2},\sigma_{2g}^{-2},\sigma_{3g}^{-2}\sim\text{Gamma}(.01,.01),\qquad \alpha_g\sim N(0,1000),\qquad \tau_g^{-2}\sim\text{Gamma}(.1,.1),$$

while the informative set of priors was

$$\nu_{1g},\nu_{2g},\nu_{3g}\sim N(7,10),\qquad \lambda_{2g},\lambda_{3g}\sim N(0,1)$$
$$\sigma_{1g}^{-2},\sigma_{2g}^{-2},\sigma_{3g}^{-2}\sim\text{Gamma}(2.5,5),\qquad \alpha_g\sim N(0,1),\qquad \tau_g^{-2}\sim\text{Gamma}(2.5,5).$$

The informative priors were selected so that the posterior density was generally in the range of plausible parameter values: intercept priors are chosen to reflect the fact that the observed variables are scale scores that range between 0 and 18; loading priors are based on the fact that one loading is fixed to 1, and other loadings are expected to be similar; and the gamma distributions on precisions are selected based again on the scale scores' ranges (knowing that the ranges of the scale scores cannot produce extremely large variances).

The results for the uninformative set of priors (shown in Appendix E) are similar to the previous results, with some additional Monte Carlo error and convergence issues. The results for the informative set of priors are more interesting. Figure 4 is similar to Figure 1, except that the range of the conditional y-axis is now seven times that of the marginal y-axis (in Figure 1, the ratio was ten). We generally observe that the conditional metrics' Monte Carlo errors are drastically reduced under informative priors, though they remain larger than those of the marginal metrics. The conditional and marginal metrics' selections of the best model continue to disagree with one another. Additionally, the Plummer computation of conditional DIC prefers a different model than the Spiegelhalter computation of conditional DIC (see Appendix E).

Fig. 4. Marginal and conditional DICs (Spiegelhalter et al. definitions) under informative prior distributions for nine models from Wicherts et al.

The results in this section illustrate that the use of marginal vs conditional DIC can impact substantive conclusions, in that different models can be selected with different versions of DIC and different conclusions can be reached about the effective number of parameters. We speculate that conditional DIC will generally prefer models of greater complexity, due to the relationship between conditional/marginal DIC and cross-validation. Because conditional DIC is related to predicting a held-out unit from an existing cluster, the remaining data from that cluster are helpful for making the prediction. In contrast, when predicting data from a held-out cluster (which is related to marginal DIC), we have greater uncertainty about the specific properties of the cluster. Thus, conditional DIC may support models of greater complexity, as compared to marginal DIC.

Beyond model preference, the conditional DIC exhibits larger Monte Carlo error and depends more strongly on the approximation used (Spiegelhalter versus Plummer), as compared to the marginal version. Further, the Monte Carlo error in conditional DIC is drastically reduced through use of informative prior distributions on model parameters. While strong prior information is not always available, this result suggests that even mild prior information can aid in precise computation of DIC. In the next section, we extend these results to item response models and other information criteria.

Explaining Individual Differences in Verbal Aggression via IRT

Here, we consider item response models with different person covariates, comparing them via conditional and marginal versions of DIC (Spiegelhalter et al. definitions), WAIC, and PSIS-LOO. In the case of IRT, the marginal information criteria utilize the numerical integration methods described in Appendix C.

Method

Several latent regression Rasch models are fit to the dataset on verbal aggression (Vansteelandt, [50]) that consists of J=316 persons and I=24 items. Participants were instructed to imagine four frustrating scenarios, and for each they responded to items regarding whether they would react by cursing, scolding, and shouting. They also responded to parallel items regarding whether they would want to engage in the three behaviors, resulting in six items per scenario (cursing/scolding/shouting × doing/wanting). An example item is, "A bus fails to stop for me. I would want to curse." The response options for all items were "yes", "perhaps", and "no." The items have been dichotomized for this example by combining "yes" and "perhaps" responses. Two person-specific covariates are included: the respondent's trait anger score (Spielberger, [47]), which is a raw score from a separate measure taking values between 11 and 39, and an indicator for whether the respondent is male, which takes the values 0 and 1. This leads us to consider five possible models: a model without person covariates, two models with one person covariate, a model with both person covariates, and a model with both person covariates plus their interaction.

The model that defines the conditional likelihood fc(y|ω,ζ) is

$$y_{ij}\mid x_j,\gamma,\zeta_j,\delta_i\sim\text{Bernoulli}\!\left(\text{logit}^{-1}(x_j'\gamma+\zeta_j-\delta_i)\right),\qquad(12)$$

where yij = 1 if person j responds positively to item i and yij = 0 otherwise, xj is a vector containing a constant and K - 1 person-specific covariates, γ is a vector of regression coefficients, ζj is a person residual, and δi is an item difficulty parameter. The last item difficulty is constrained such that $\delta_I=-\sum_{i=1}^{I-1}\delta_i$. The priors for the item difficulties and regression coefficients that comprise ω are

$$\delta_1,\ldots,\delta_{I-1}\sim N(0,9),\qquad \gamma_1,\ldots,\gamma_K\sim t_1(0,1).$$

Unlike the previous example, ζj is here just a disturbance with zero mean, and the hyperparameter is ψ = τ:

$$\zeta_j\sim N(0,\tau^{2}),\qquad \tau\sim\text{Exp}(.1).$$

Weakly informative priors are specified for the regression coefficients γ , following Gelman, Jakulin, Pittau, and Su ([11]). Specifically, after standardizing continuous covariates to have zero mean and standard deviation 0.5 and binary covariates to have mean 0 and range 1, t-distributions with one degree of freedom are used as priors for the corresponding regression coefficients, γ . The MCMC draws of the coefficients are then transformed to reflect the original scales of the covariates. Next, the priors for item difficulties have about 95% of their density lying between (-6,6) , reflecting the fact that the difficulties are on the logit scale (so it would be surprising to observe values outside of this range). Finally, the prior on τ is simply chosen to be uninformative.

Focus is placed on ζj for the conditional approach, which yields a prediction inference involving new responses from the same persons (and items). The marginal approach, perhaps more realistically, places focus on τ , implying a prediction inference involving new responses from new people. Five competing models are considered, differing only in what person covariates are included: Model 1 includes no covariates ( K=1 ), Model 2 has the trait anger score ( K=2 ), Model 3 has the indicator for male ( K=2 ), Model 4 has both covariates ( K=3 ), and Model 5 has both covariates and their interaction ( K=4 ).

Results

The five models are estimated via Stan using 5 chains of 2,500 draws with the first 500 draws of each discarded, resulting in a total of 10,000 kept posterior draws. The unusually large number of posterior draws (for Hamiltonian Monte Carlo) is chosen here due to the anticipated Monte Carlo errors for the information criteria, but such a large number is not ordinarily necessary for estimating the posterior means and standard deviations of the parameters. The adaptive quadrature method described in Appendix C is used to integrate out ζ for computing the marginal versions of DIC, WAIC, and PSIS-LOO. Eleven quadrature points were found to be sufficient for one of the models (Model 4) using the method described in Appendix C, and this number of points is used throughout. On a Windows desktop with 4 cores and 16GB of memory, the quadrature method took about 90 seconds per model to complete (with parallel processing).

Figure 5 provides the estimates for the information criteria and PSIS-LOO, where the conditional and marginal versions now have the same y-axis scales. The DIC and WAIC differ from each other for any given model, though they seem to show a similar pattern between models. The WAIC is a much better approximation to the PSIS-LOO in the marginal case than the conditional case. The high degree of Monte Carlo error in the conditional versions renders differentiating the predictive performance of the models difficult. In the marginal case, the amount of Monte Carlo error is less but still poses a degree of difficulty in making comparisons. However, Models 1 and 3 now clearly provide poorer predictions in comparison to the others. These models do not include trait anger, suggesting that this variable is an important predictor of verbal aggression. There is some evidence supporting Model 4 (both covariates, no interaction) as the best among the candidates.

Fig. 5. Information criteria for the five latent regression Rasch models. Points represent the results of the 10 independent MCMC simulations per model. A small amount of horizontal jitter is added to the points.

Figure 6 provides the estimated effective number of parameters (dots) and actual number of parameters (lines) for the conditional and marginal versions of the DIC and WAIC. For conditional information criteria, the effective number of parameters is substantially less than the actual numbers of parameters, owing mainly to the fact that each ζj , with its hierarchical prior, contributes less than one to the effective number of parameters. For marginal information criteria, on the other hand, the effective number of parameters is close to the actual numbers of model parameters and, for WAIC, is somewhat larger than the number of parameters.

Fig. 6. Estimated effective number of parameters (p̂) for the five latent regression Rasch models. Points represent the results of the 10 independent MCMC simulations per model. A small amount of horizontal jitter is added to the points. The horizontal lines represent the counts of parameters associated with each model and focus.

For the WAIC, contributions to the effective number of parameters (from item-person combinations for pWc and from persons for pWm) should be less than .4, which was always satisfied for the marginal WAIC but was violated for an average of between 2.1 and 3.4 contributions for the conditional WAIC across the five models (for details, see Furr, [7]).

Discussion

In this paper, we sought to clarify and illustrate the various forms of Bayesian information criteria that are used in practice. Researchers often rely on pre-packaged software to obtain these criteria, failing to realize that multiple versions of the criteria are available depending on the software (BUGS vs. JAGS) and on whether or not the latent variables appear in the model likelihood (leading to conditional and marginal information criteria, respectively). Additionally, we showed that the criteria can suffer from large amounts of Monte Carlo error. Hence, long chains will often be necessary to obtain precise point estimates of the criteria. If ignored, these issues can lead to model comparisons that are irrelevant and/or irreproducible. While the conditional and marginal criteria will sometimes agree on the best model, the agreement is dependent on the number and types of models under consideration. In Appendices A and B, we provide theoretical results showing that the conditional and marginal information criteria will generally differ from one another.

Recommendations. Similarly to the use of marginal likelihoods in general latent variable models, we recommend use of marginal information criteria as the defaults in Bayesian analyses. As previously discussed, marginal criteria assess a model's predictive ability when applied to new clusters. This is exactly what is desired in many psychometric contexts, where clusters may be defined by, e.g., countries, schools, or students. In these cases, we wish to discern general properties of scales or items that are not specific to the clusters that were observed. As a side advantage, the marginal information criteria also tend to have less Monte Carlo error than their conditional counterparts. We would expect similar issues to arise when one uses posterior predictive estimates to study model fit, and this is a topic of future study.

Because the chain length necessary to obtain precise information criteria is generally larger than the length required to obtain stable posterior means of individual parameters, it may be useful to compute the information criteria at multiple points. First, the criteria can be computed when individual parameters' posterior means have converged and stabilized. The researcher can then continue running the chains for the same number of iterations, re-computing the information criteria and effective number of parameters, and assessing changes in results. This process can be continued until the information criteria have reached the desired level of precision. Alternatively, researchers might apply the Raftery and Lewis ([42]) automatic chain length computations directly to samples of the model deviance, as opposed to samples of individual model parameters. Regardless of strategy employed, our examples suggest that the Monte Carlo error of the effective number of parameters largely tracks the imprecision of the information criteria.

Beyond Monte Carlo error, further issues lead us not to recommend conditional information criteria. For the DIC, Plummer ([40]) showed that the penalty does not approach the optimism when the number of parameters increases with the sample size (as it does in the conditional case), and we observed in the first example that the penalty term depended heavily on the definition used (Plummer versus Spiegelhalter). In the second example, we observed that the WAIC differed much more from PSIS-LOO in the conditional case than in the marginal case. This corresponds to the work of Millar ([30]), who points out that two regularity conditions are not satisfied in the conditional case, one of them again being the number of parameters increasing with sample size. In the context of multilevel IRT models, Zhang et al. ([59]) also recommend against the conditional version of the DIC.
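This kind of WAIC versus PSIS-LOO check is straightforward with the loo package, given pointwise log-likelihood matrices assembled from the MCMC output (conditional: one column per unit; marginal: one column per cluster, computed as in Appendix C). A sketch with hypothetical matrices:

```r
library(loo)

# Conditional focus: S x (number of units) matrix of log f_c(y_ij | omega, zeta_j)
waic(log_lik_cond)
loo(log_lik_cond)   # PSIS-LOO; large gaps from WAIC signal unreliability

# Marginal focus: S x (number of clusters) matrix of log f_m(y_j | omega, psi)
waic(log_lik_marg)
loo(log_lik_marg)   # approximates leave-one-cluster-out cross-validation
```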

Concluding remarks. Researchers may also be interested in the abilities of the information criteria to select the true model. As mentioned previously, the information criteria are not designed to select the true model but instead to select the model with best out-of-sample predictive accuracy. In small datasets with complex data-generating models, the model with best out-of-sample predictive accuracy will tend to be simpler than the true model, because the true model would lead to overfitting and poor out-of-sample predictive accuracy.

Zhang et al. ([59]) conducted simulation studies using DIC and obtained related results. Their studies involved multilevel IRT models where item responses were nested in individuals, and individuals were nested in schools. For these models, one can marginalize only over person parameters or over both person and school parameters, leading to different forms of marginal DIC. Additionally, the authors considered "joint" forms of DIC that, in our notation, employed likelihoods that are roughly of the form $f(y_j, \zeta_j | \omega)$. Zhang et al. found that the conditional DIC sometimes preferred models that were more complex than the data-generating model, and they ultimately recommended a version of DIC based on the person-level joint likelihood. This joint likelihood does not appear to have an immediate interpretation via leave-one-out cross-validation, because the data are modeled jointly with an unknown parameter. More work could be done to expand on this point.

The Zhang et al. ([59]) models also illustrate that, while the dichotomous "conditional/marginal" distinction is most relevant for psychometric models, middle grounds exist in some situations. Multiple marginal versions of the information criteria exist when the model includes more than two levels and/or (partially) crossed random effects. Further, in non-multilevel IRT contexts, both the item parameters and the person parameters can be modeled as random effects (e.g., De Boeck, [3]). Here, a researcher might be interested in the model's predictive ability when new people complete the same items that originally entered the model. This would lead the researcher to compute information criteria where the person random effects are marginalized out of the likelihood but the item random effects are not. These scenarios highlight that, for latent variable models, it is insufficient to report a model information criterion without detailing how the criterion was computed. Further, the values that are automatically obtained from MCMC software are generally not suitable for further use.

A. Posterior Expectations of Marginal and Conditional Likelihoods

Following Trevisani and Gelfand ([49]), who studied DIC in the context of linear mixed models, we can use Jensen's inequality to show that the posterior expected value of the marginal log-likelihood is less than the posterior expected value of the conditional log-likelihood.

First, consider the function $h(x) = x \log(x)$. It is convex, so Jensen's inequality states that

$$h(E(x)) \le E(h(x)). \tag{13}$$

Setting $x = f_c(y|\omega,\zeta)$ and taking expected values with respect to $\zeta$, we have that

$$h(E(x)) = \left\{ \log \int f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta \right\} \int f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta \tag{14}$$

$$E(h(x)) = \int \log\{f_c(y|\omega,\zeta)\}\, f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta, \tag{15}$$

so that

$$\left\{ \log \int f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta \right\} \int f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta \le \int \log\{f_c(y|\omega,\zeta)\}\, f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta. \tag{16}$$

We now multiply both sides of this inequality by p(ω,ψ)/c , where p(ω,ψ) is a prior distribution and

$$c = \int_\psi \int_\omega \left\{ \int_\zeta f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta \right\} p(\omega,\psi)\, d\omega\, d\psi \tag{17}$$

is the posterior normalizing constant. Finally, we integrate both sides with respect to ω and ψ to obtain

$$\int_\psi \int_\omega \left\{ \log \int_\zeta f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta \right\} \left\{ \int_\zeta f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta \right\} \frac{p(\omega,\psi)}{c}\, d\omega\, d\psi \le \int_\psi \int_\omega \left\{ \int_\zeta \log\{f_c(y|\omega,\zeta)\}\, f_c(y|\omega,\zeta)\, g(\zeta|\psi)\, d\zeta \right\} \frac{p(\omega,\psi)}{c}\, d\omega\, d\psi. \tag{18}$$

We can now recognize both sides of (18) as expected values of log-likelihoods with respect to the model's posterior distribution, leading to

$$E_{\omega,\psi|y}\left\{ \log f_m(y|\omega,\psi) \right\} \le E_{\omega,\zeta|y}\left\{ \log f_c(y|\omega,\zeta) \right\}. \tag{19}$$

Note that the above results do not rely on normality, so they also apply to, e.g., the two-parameter logistic model estimated via marginal likelihood.
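Inequality (19) is also easy to verify numerically in a conjugate normal special case. The following self-contained base-R sketch holds the hyperparameters fixed at known values (a simplifying assumption made so that the posterior involves only $\zeta$):

```r
set.seed(1)
sigma <- 1; tau <- 1.5                        # known values for the demo
y <- rnorm(50, 0, sqrt(sigma^2 + tau^2))      # one observation per cluster

# Marginal log-likelihood is exact here: y_j ~ N(0, sigma^2 + tau^2)
ll_marg <- sum(dnorm(y, 0, sqrt(sigma^2 + tau^2), log = TRUE))

# Conditional posterior of zeta_j given y_j (normal-normal conjugacy)
post_m <- tau^2 / (tau^2 + sigma^2) * y
post_s <- sqrt(tau^2 * sigma^2 / (tau^2 + sigma^2))

# Posterior expectation of the conditional log-likelihood, by Monte Carlo
ll_cond <- mean(replicate(20000, {
  zeta <- rnorm(length(y), post_m, post_s)
  sum(dnorm(y, zeta, sigma, log = TRUE))
}))

c(marginal = ll_marg, conditional = ll_cond)  # marginal is smaller, as in (19)
```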

B. Effective Number of Parameters for Marginal and Conditional DIC

To consider the effective number of parameters for normal likelihoods, we rely on results from Spiegelhalter et al. ([46]). They showed that the effective number of parameters $p_D$ can be viewed as the fraction of information about model parameters in the likelihood, relative to the total information contained in both the likelihood and the prior. Under this view, a specific model parameter gets a value of "1" if all of its information is contained in the likelihood, and it gets a value below "1" if some information is contained in the prior. We sum these values across all parameters to obtain $p_D$.

Spiegelhalter et al. ([46]) relatedly showed that, for normal likelihoods, $p_D$ can be approximated by

$$p_D \approx \mathrm{tr}\left\{ I(\hat\theta)\, V \right\}, \tag{20}$$

where $\theta$ includes all model parameters (including $\zeta$ in the conditional model), $I(\hat\theta)$ is the observed Fisher information matrix, and $V$ is the posterior covariance matrix of $\theta$. When the prior distribution of $\theta$ is non-informative, $I(\hat\theta) \approx V^{-1}$, so that the trace in (20) is approximately the dimension of $\theta$. Consequently, matching the discussion in the previous paragraph, the effective number of parameters under non-informative priors will approximate the total number of model parameters.

This result implies that the conditional $p_D$ will tend to be much larger than the marginal $p_D$. In particular, in the conditional case, each individual has a unique $\zeta_j$ vector that is included as part of the total parameter count. The resulting $p_D$ will not necessarily be close to the total parameter count because the "prior distribution" of $\zeta_j$ is a hyperdistribution, whereby individuals' $\zeta_j$ estimates are shrunk toward the mean. Thus, for these parameters, the "prior" is informative. However, even when the fraction of information in the likelihood is low for these parameters, the fact that we are summing over hundreds or thousands of $\zeta_j$ vectors implies that the conditional $p_D$ will be larger than the marginal $p_D$.
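The size of this gap can be gauged with back-of-the-envelope arithmetic for a normal random-intercept model with known variance components (an assumption made only so that the information fraction has a closed form):

```r
sigma2 <- 1      # within-cluster variance, assumed known for the demo
tau2   <- 0.25   # between-cluster variance
n_j    <- 5      # observations per cluster
J      <- 200    # number of clusters

# Fraction of information about each zeta_j contained in the likelihood
frac <- n_j * tau2 / (n_j * tau2 + sigma2)   # about 0.56

# Contribution of the latent variables alone to the conditional p_D
J * frac                                      # about 111 effective parameters
```

The marginal $p_D$ for the same model counts only the handful of $\omega$ and $\psi$ parameters, so the conditional criterion carries a far heavier penalty.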

C. Adaptive Gaussian Quadrature for Marginal Likelihoods

We modify the adaptive quadrature method proposed by Rabe-Hesketh et al. ([41]) for generalized linear mixed models to a form designed to exploit MCMC draws from the joint posterior of all latent variables and model parameters. Here, we describe one-dimensional integration, but the method is straightforward to generalize to multidimensional integration as in Rabe-Hesketh et al. ([41]). We assume that $\zeta_j$ represents a disturbance with zero mean and variance $\tau^2$, so that there is only one hyperparameter, $\psi = \tau$.

In a non-Bayesian setting, standard (non-adaptive) Gauss–Hermite quadrature can be viewed as approximating the conditional prior density $g(\zeta_j|\tau)$ by a discrete distribution with masses $w_m$, $m = 1, \ldots, M$, at locations $a_m \tau$, so that the integrals in (2) are approximated by sums of $M$ terms, where $M$ is the number of quadrature points,

$$f_m(y|\omega,\tau) \approx \prod_{j=1}^J \sum_{m=1}^M w_m\, f_c(y_j | \omega, \zeta_j = a_m \tau). \tag{21}$$

To obtain information criteria, this method can easily be applied to the conditional likelihood function for each draw $\omega^s$ and $\tau^s$ ($s = 1, \ldots, S$) of the model parameters from the MCMC output; the standard locations and masses can be constructed as sketched below. This approximation can be thought of as a deterministic version of Monte Carlo integration. Adaptive quadrature is then a deterministic version of Monte Carlo integration via importance sampling.
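The locations $a_m$ and masses $w_m$ for a standard normal density can be obtained in base R via the Golub–Welsch eigenvalue method (packages such as statmod provide the same quantities); a minimal sketch:

```r
# Gauss-Hermite nodes/weights for a standard normal density (Golub-Welsch):
# eigenvalues of the Jacobi matrix give the locations a_m, and squared first
# components of the eigenvectors give the masses w_m (which sum to 1)
gh_normal <- function(M) {
  J <- matrix(0, M, M)
  k <- seq_len(M - 1)
  J[cbind(k, k + 1)] <- sqrt(k)
  J[cbind(k + 1, k)] <- sqrt(k)
  e <- eigen(J, symmetric = TRUE)
  list(a = e$values, w = e$vectors[1, ]^2)
}

q <- gh_normal(11)
sum(q$w)           # 1, as required of a probability mass function
sum(q$w * q$a^2)   # 1, matching Var(Z) for Z ~ N(0, 1)
```

With these quantities, the inner sum in (21) is simply a weighted sum of the conditional likelihood evaluated at the scaled locations $a_m \tau$.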

If applied to each MCMC draw $\omega^s$ and $\tau^s$, the importance density is a normal approximation to the conditional posterior density of $\zeta_j$, given the current draws of the model parameters. Rabe-Hesketh et al. ([41]) used a normal density with mean $E(\zeta_j|y_j,\omega^s,\tau^s)$ and variance $\mathrm{var}(\zeta_j|y_j,\omega^s,\tau^s)$ of the conditional posterior density of $\zeta_j$, whereas Pinheiro and Bates ([38]) and some software use a normal density with mean equal to the mode of the conditional posterior and variance equal to minus the reciprocal of the second derivative of the conditional log posterior. Here, we modify the method of Rabe-Hesketh et al. ([41]) for the Bayesian setting by using a normal approximation to the unconditional posterior density of $\zeta_j$ as the importance density. Specifically, we use a normal density with mean

$$\tilde\mu_j = \tilde{E}(\zeta_j|y) = \frac{1}{S} \sum_{s=1}^S \zeta_j^s, \tag{22}$$

and standard deviation

$$\tilde\phi_j = \sqrt{\widetilde{\mathrm{var}}(\zeta_j|y)} = \sqrt{\frac{1}{S-1} \sum_{s=1}^S \left( \zeta_j^s - \tilde\mu_j \right)^2}, \tag{23}$$

where $\zeta_j^s$ is the draw of $\zeta_j$ from its unconditional posterior in the $s$th MCMC iteration. The tildes indicate that these quantities are subject to Monte Carlo error.

Note that this version of adaptive quadrature is computationally more efficient than the one based on the mean and standard deviation of the conditional posterior distributions because the latter would have to be evaluated for each MCMC draw and would require numerical integration, necessitating a procedure that iterates between updating the quadrature locations and weights and updating the conditional posterior means and standard deviations. A disadvantage of our approach is that the quadrature locations and weights (and hence importance density) are not as targeted, but the computational efficiency gained also makes it more feasible to increase M.

The adaptive quadrature approximation to the marginal likelihood at posterior draw $s$ becomes

$$f_m(y|\omega^s,\tau^s) \approx \prod_{j=1}^J \sum_{m=1}^M w_{jm}^s\, f_c(y_j | \omega^s, \zeta_j = a_{jm}), \tag{24}$$

where the adapted locations are

$$a_{jm} = \tilde\mu_j + \tilde\phi_j \times a_m, \tag{25}$$

and the corresponding weights or masses are

$$w_{jm}^s = \sqrt{2\pi} \times \tilde\phi_j \times \exp\left( \frac{a_m^2}{2} \right) \times g\left( a_{jm}; 0, \tau^{2,s} \right) \times w_m, \tag{26}$$

where $g(a_{jm}; 0, \tau^{2,s})$ is the normal density function with mean zero and variance $\tau^{2,s} = (\tau^s)^2$, evaluated at $a_{jm}$.

The number of integration points $M$ required to obtain a sufficiently accurate approximation is determined by evaluating the approximation of the target quantity (DIC, WAIC) with increasing values of $M$ (7, 11, 17, etc.) and choosing the value of $M$ for which the target quantity changes by less than 0.01 from the previous value. Here, the candidate values of $M$ increase by approximately 50% while remaining odd, so that one quadrature location always falls at the posterior mean. Furr ([7]) finds this approach to be accurate in simulations for linear mixed models, where the adaptive quadrature approximation can be compared with closed-form integrals.
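Putting (22)–(26) together, the per-cluster marginal likelihood at a single posterior draw can be sketched as follows, reusing gh_normal() from the earlier sketch. A Rasch conditional likelihood is assumed for concreteness, and the draw objects (delta_s, tau_s, zeta_draws_j) are hypothetical stand-ins for quantities stored from the MCMC run:

```r
# Conditional log-likelihood of cluster (person) j under a Rasch model;
# y_j is a 0/1 response vector and delta holds the item difficulties
loglik_j <- function(y_j, zeta, delta) {
  sum(dbinom(y_j, 1, plogis(zeta - delta), log = TRUE))
}

# Adaptive-quadrature marginal likelihood for cluster j at draw s, eqs. (24)-(26)
fm_j <- function(y_j, delta_s, tau_s, zeta_draws_j, q) {
  mu_j  <- mean(zeta_draws_j)                  # eq. (22): posterior mean of zeta_j
  phi_j <- sd(zeta_draws_j)                    # eq. (23): posterior SD of zeta_j
  a_jm  <- mu_j + phi_j * q$a                  # eq. (25): adapted locations
  w_jms <- sqrt(2 * pi) * phi_j * exp(q$a^2 / 2) *
    dnorm(a_jm, 0, tau_s) * q$w                # eq. (26): adapted masses
  sum(w_jms * vapply(a_jm, function(z) exp(loglik_j(y_j, z, delta_s)), numeric(1)))
}

# q <- gh_normal(11)  # increase M (7, 11, 17, ...) until the resulting
# DIC or WAIC changes by less than 0.01, per the stopping rule above
```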

D. Monte Carlo Error for the DIC and WAIC Effective Number of Parameters

For the DIC effective number of parameters, $p_D$, we can make use of the well-known method for estimating the Monte Carlo error of the mean of a quantity across MCMC iterations. Let a quantity computed in MCMC iteration $s$ ($s = 1, \ldots, S$) be denoted $\gamma^s$, so that the point estimate of the expectation of $\gamma$ is

$$\bar\gamma = \frac{1}{S} \sum_{s=1}^S \gamma^s.$$

Then the squared Monte Carlo error (or Monte Carlo error variance) is estimated as

$$\mathrm{MCerr}^2(\bar\gamma) = \frac{1}{S_{\mathrm{eff}}} \cdot \frac{1}{S-1} \sum_{s=1}^S \left( \gamma^s - \bar\gamma \right)^2, \tag{27}$$

where $S_{\mathrm{eff}}$ is the effective sample size.

For the effective number of parameters approximation in (6) proposed by Plummer ([40]), we can obtain the Monte Carlo error variance by substituting

$$\gamma^s = \frac{1}{2} \log \frac{f(y_s^{r1}|\theta^{s1})}{f(y_s^{r1}|\theta^{s2})} + \frac{1}{2} \log \frac{f(y_s^{r2}|\theta^{s2})}{f(y_s^{r2}|\theta^{s1})}$$

in (27).

For the effective number of parameters approximation in (4) proposed by Spiegelhalter et al. ([46]), we assume that the variation due to $E_{\theta|y}\left[ -2 \log f(y|\theta) \right]$ dominates and use

$$\gamma^s = -2 \log f(y|\theta^s).$$

For the WAIC effective number of parameters, $p_W$, we use expressions for the Monte Carlo error of sample variances (see, e.g., White, [56]). Let the variance of $\gamma^s$ over MCMC iterations be denoted $v(\gamma)$,

$$v(\gamma) = \frac{1}{S-1} \sum_{s=1}^S \left( \gamma^s - \bar\gamma \right)^2 = \frac{1}{S} \sum_{s=1}^S T^s, \qquad T^s = \frac{S}{S-1} \left( \gamma^s - \bar\gamma \right)^2.$$

Then the Monte Carlo error variance is estimated as

$$\mathrm{MCerr}^2(v(\gamma)) = \frac{1}{S_{\mathrm{eff}} \times S} \sum_{s=1}^S \left( T^s - v(\gamma) \right)^2. \tag{28}$$

The conditional version of the effective number of parameters is given by the sum, over all units, of the posterior variances of the pointwise log densities,

$$p_{Wc} = \sum_{j=1}^J \sum_{i=1}^{n_j} \mathrm{Var}_{\omega,\zeta|y}\left\{ \log f_c(y_{ij}|\omega,\zeta_j) \right\}.$$

The posterior variance $\mathrm{Var}_{\omega,\zeta|y}\left\{ \log f_c(y_{ij}|\omega,\zeta_j) \right\}$ for a given unit is estimated by $v(\gamma_{ij})$ with

$$\gamma_{ij}^s = \log f_c(y_{ij}|\omega^s, \zeta_j^s),$$

where we have added subscripts $ij$ to identify the unit; this estimate has Monte Carlo error variance $\mathrm{MCerr}^2(v(\gamma_{ij}))$ given in (28). The variance of the sum of the independent contributions $v(\gamma_{ij})$ to $\hat{p}_{Wc}$ is the sum of the variances of these contributions,

$$\mathrm{MCerr}^2(\hat{p}_{Wc}) = \sum_{j=1}^J \sum_{i=1}^{n_j} \mathrm{MCerr}^2(v(\gamma_{ij})).$$

For the marginal version of the effective number of parameters, $p_{Wm}$, we define

$$\gamma_j^s = \log f_m(y_j|\omega^s, \psi^s)$$

and

$$\mathrm{MCerr}^2(\hat{p}_{Wm}) = \sum_{j=1}^J \mathrm{MCerr}^2(v(\gamma_j)).$$
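Both error formulas are easy to implement once the chains of $\gamma^s$ values are stored. A minimal sketch, with the effective sample size supplied externally (e.g., by coda::effectiveSize()):

```r
# Monte Carlo error variance of a posterior mean, eq. (27)
mcerr2_mean <- function(gamma, S_eff) var(gamma) / S_eff

# Monte Carlo error variance of a posterior variance, eq. (28)
mcerr2_var <- function(gamma, S_eff) {
  S  <- length(gamma)
  Ts <- S / (S - 1) * (gamma - mean(gamma))^2   # terms averaging to v(gamma)
  sum((Ts - var(gamma))^2) / (S_eff * S)
}

# Example: MC error variance of one unit's contribution to the conditional
# p_W, given that unit's chain of pointwise log-densities gamma_ij
# mcerr2_var(gamma_ij, coda::effectiveSize(gamma_ij))
```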

E. Additional Results

This section contains additional results from the CFA example that were not included in the main text.

Figure 7 shows Spiegelhalter DIC values for models that use the uninformative priors described in the "Prior sensitivity" subsection. Of note here is that Models 2 and 2a sometimes failed to converge, resulting in fewer than ten points in the graphs. Because we used the automatic convergence procedure described in the main text, "failure to converge" here means that the chains did not achieve Gelman–Rubin statistics below 1.05 in the five minutes allotted. When we removed the 5-minute maximum time to convergence, we encountered situations where chains ran for days without converging. In our experience, these convergence issues are often observed for CFA models in JAGS with flat priors. Chains sometimes get stuck in extreme values of the parameter space and cannot recover.

Fig. 7 Marginal and conditional DICs (Spiegelhalter et al. definitions) under uninformative prior distributions for nine models from Wicherts et al.

Figure 8 shows Plummer DIC values for models that use the informative priors described in the "Prior sensitivity" subsection. The figure also contains error bars (±2 SDs) from a single replication, similar to Figure 3 in the main text. These error bars appear to continue to track Monte Carlo error in the DIC. Comparing Figure 8 to Figure 4, we observe a different pattern in the conditional DICs across the Plummer and Spiegelhalter definitions. The Spiegelhalter conditional DICs (Figure 4) consistently prefer Model 2a, whereas the Plummer conditional DICs (Figure 8) generally decrease across models and become lowest for the final models, labeled 5b and 6 (though Models 4 and 5a are similar). On the other hand, the marginal DICs are similar across the Spiegelhalter and Plummer definitions.

Fig. 8 Marginal and conditional DICs (Plummer definitions) under informative prior distributions for nine models from Wicherts et al.

Electronic supplementary material

Supplementary material 1 (zip, 7 KB)

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Celeux, G., Forbes, F., Robert, C. P., & Titterington, D. M. (2006). Deviance information criteria for missing data models. Bayesian Analysis, 1(4), 651-673.
2. daSilva, M. A., Bazán, J. L., & Huggins-Manley, A. C. (2019). Sensitivity analysis and choosing between alternative polytomous IRT models using Bayesian model comparison criteria. Communications in Statistics - Simulation and Computation, 48(2), 601-620.
3. De Boeck, P. (2008). Random item IRT models. Psychometrika, 73, 533-559.
4. Denwood, M. J. (2016). runjags: An R package providing interface utilities, model templates, parallel computing methods and additional distributions for MCMC models in JAGS. Journal of Statistical Software, 71(9), 1-25.
5. Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461-470.
6. Fox, J. P. (2010). Bayesian item response modeling: Theory and applications. New York, NY: Springer.
7. Furr, D. C. (2017). Bayesian and frequentist cross-validation methods for explanatory item response models (Unpublished doctoral dissertation). University of California, Berkeley, CA.
8. Gelfand, A. E., Sahu, S. K., & Carlin, B. P. (1995). Efficient parametrisations for normal linear mixed models. Biometrika, 82, 479-488.
9. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). New York: Chapman & Hall/CRC.
10. Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997-1016.
11. Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2, 1360-1383.
12. Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733-807.
13. Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
14. Gronau, Q. F., & Wagenmakers, E. J. (2018). Limitations of Bayesian leave-one-out cross-validation for model selection. Computational Brain & Behavior, 2(1), 1-11.
15. Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382-417.
16. Kang, T., Cohen, A. S., & Sung, H. J. (2009). Model selection indices for polytomous items. Applied Psychological Measurement, 35, 499-518.
17. Kaplan, D. (2014). Bayesian statistics for the social sciences. New York, NY: The Guilford Press.
18. Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics, 95, 391-413.
19. Levy, R., & Mislevy, R. J. (2016). Bayesian psychometric modeling. Boca Raton, FL: Chapman & Hall.
20. Li, F., Cohen, A. S., Kim, S. H., & Cho, S. J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33, 353-373.
21. Li, L., Qiu, S., & Feng, C. X. (2016). Approximating cross-validatory predictive evaluation in Bayesian latent variable models with integrated IS and WAIC. Statistics and Computing, 26, 881-897.
22. Lu, Z. H., Chow, S. M., & Loken, E. (2017). A comparison of Bayesian and frequentist model selection methods for factor analysis models. Psychological Methods, 22(2), 361-381.
23. Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book: A practical introduction to Bayesian analysis. New York, NY: Chapman & Hall/CRC.
24. Lunn, D., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS - a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325-337.
25. Luo, U., & Al-Harbi, K. (2017). Performances of LOO and WAIC as IRT model selection methods. Psychological Test and Assessment Modeling, 59, 183-205.
26. Marshall, E. C., & Spiegelhalter, D. J. (2007). Identifying outliers in Bayesian hierarchical models: A simulation-based approach. Bayesian Analysis, 2(2), 409-444.
27. McElreath, R. (2015). Statistical rethinking: A Bayesian course with examples in R and Stan. New York, NY: Chapman & Hall/CRC.
28. Merkle, E. C., & Rosseel, Y. (2018). blavaan: Bayesian structural equation models via parameter expansion. Journal of Statistical Software, 85(4), 1-30.
29. Millar, R. B. (2009). Comparison of hierarchical Bayesian models for overdispersed count data using DIC and Bayes' factors. Biometrics, 65, 962-969.
30. Millar, R. B. (2018). Conditional vs. marginal estimation of predictive loss of hierarchical models using WAIC and cross-validation. Statistics and Computing, 28, 375-385.
31. Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
32. Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313-335.
33. Navarro, D. (2018). Between the devil and the deep blue sea: Tensions between scientific judgement and statistical model selection. Computational Brain & Behavior, 2(1), 28-34.
34. Naylor, J. C., & Smith, A. F. (1982). Applications of a method for the efficient computation of posterior distributions. Journal of the Royal Statistical Society C (Applied Statistics), 31, 214-225.
35. Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.
36. O'Hagan, A. (1976). On posterior joint and marginal modes. Biometrika, 63, 329-333.
37. Piironen, J., & Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27, 711-735.
38. Pinheiro, J. C., & Bates, D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4, 12-35.
39. Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd international workshop on distributed statistical computing.
40. Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9(3), 523-539.
41. Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics, 128(2), 301-323.
42. Raftery, A. E., & Lewis, S. M. (1995). The number of iterations, convergence diagnostics, and generic Metropolis algorithms. London: Chapman and Hall.
43. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
44. Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1-36.
45. Song, X. Y., & Lee, S. Y. (2012). Basic and advanced Bayesian structural equation modeling: With applications in the medical and behavioral sciences. Chichester, UK: Wiley.
46. Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64, 583-639.
47. Spielberger, C. (1988). State-Trait Anger Expression Inventory research edition [Computer software manual]. Odessa, FL.
48. Stan Development Team. (2014). Stan modeling language users guide and reference manual, version 2.5.0 [Computer software manual]. http://mc-stan.org/.
49. Trevisani, M., & Gelfand, A. E. (2003). Inequalities between expected marginal log-likelihoods, with implications for likelihood-based model complexity and comparison measures. The Canadian Journal of Statistics, 31, 239-250.
50. Vansteelandt, K. (2000). Formal models for contextualized personality psychology (Unpublished doctoral dissertation). University of Leuven, Leuven, Belgium.
51. Vehtari, A., Gelman, A., & Gabry, J. (2016). loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 0.1.6. https://github.com/stan-dev/loo.
52. Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413-1432.
53. Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., & Winther, O. (2016). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. Journal of Machine Learning Research, 17, 1-38.
54. Vehtari, A., Simpson, D. P., Yao, Y., & Gelman, A. (2018). Limitations of "Limitations of Bayesian leave-one-out cross-validation for model selection". Computational Brain & Behavior, 2(1), 22-27.
55. Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571-3594.
56. White, I. R. (2010). simsum: Analyses of simulation studies including Monte Carlo error. The Stata Journal, 10, 369-385.
57. Wicherts, J. M., Dolan, C. V., & Hessen, D. J. (2005). Stereotype threat and group differences in test performance: A question of measurement invariance. Journal of Personality and Social Psychology, 89(5), 696-716.
58. Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis, 13, 917-1007.
59. Zhang, X., Tao, J., Wang, C., & Shi, N. Z. (2019). Bayesian model selection methods for multilevel IRT models: A comparison of five DIC-based indices. Journal of Educational Measurement, 56, 3-27.
60. Zhao, Z., & Severini, T. A. (2017). Integrated likelihood computation methods. Computational Statistics, 32, 281-313.
61. Zhu, X., & Stone, C. A. (2012). Bayesian comparison of alternative graded response models for performance assessment applications. Educational and Psychological Measurement, 72(5), 774-799.

By Edgar C. Merkle, Daniel Furr, and Sophia Rabe-Hesketh
