CLUSTER-SAMPLE METHODS IN APPLIED ECONOMETRICS: AN EXTENDED ANALYSIS Jeffrey M. Wooldridge Department of Economics Michigan State University East Lansing, MI 48824-1038 (517) 353-5972 wooldri1@msu.edu This version: June 2006 1 ABSTRACT This is an expanded version of Wooldridge (2003), which provided an overview of cluster-sample methods in linear models. Here I include additional details for linear models and provide, as much as possible, a parallel treatment for nonlinear models, including a summary of strategies for dealing with data sets having a small number of clusters. Keywords: Cluster Correlation; Generalized Estimating Equations; Minimum Distance Estimation; Panel Data; Robust Variance Matrix; Unobserved Effect JEL Classification Codes: C13, C21, C23 2 1. INTRODUCTION In Wooldridge (2003), I provided a brief overview of econometric approaches to analyzing cluster samples in the context of a linear regression model. I considered cases with both large and small cluster sizes (relative to the number of clusters). That treatment was necessarily terse, and some subtle issues were only briefly mentioned or neglected entirely. The asymptotic theory for the case with a large number of clusters (and relatively small cluster sizes), either in linear or nonlinear models, has been pretty well worked out; for a summary, see, for example, Wooldridge (2002). Just as importantly, popular statistical packages, such as Stata ® , allow for computation of variance matrices that are robust to arbitrary cluster correlation for a variety of linear and nonlinear estimation methods. Still, while accounting for clustering in data is much more common than it was 10 years ago, inference methods robust to cluster correlation are still not used routinely in all relevant applications. I think that is partly because empirical researchers are not entirely sure when certain estimators are robust to various kinds of misspecification. I hope this expanded paper helps to fill that gap. For nonlinear models, there are some open modeling questions for cases where the group sizes vary – a common situation with true cluster samples – and one wants to allow correlation between the unobserved group effect (or heterogeneity) and the covariates that vary within group. I discuss the modeling issues, and offer some tentative solutions, in Section 3.1. This is very preliminary and hopefully generates some interest in the problem. The case of a small number of clusters has received much attention recently, particularly 3 for linear models. In Wooldridge (2003), I summarized two possible ways to estimate the effects of cluster-level covariates on individual-specific outcomes when the number of clusters is small (while the sizes of the clusters are moderately large). One approach, suggested by Donald and Lang (2001), is to effectively treat the number of groups as the number of observations, and use finite sample analysis (with individual-specific unobservables becoming unimportant – relative to the cluster effect – as the cluster sizes get large). A second approach is to view the cluster-level covariates as imposing restrictions on cluster-specific intercepts in a set of individual-specific regression models, and then imposing and testing the restrictions using minimum distance estimation. The minimum distance (MD) approach, when valid, has the benefit of allowing standard asymptotic analysis under a weak set of assumptions, at least when the group sample sizes are large. When it is valid, the MD approach can be expected to lead to stronger statistical significance than the Donald and Lang approach when there are few groups. Campolieti (2004) applies both the Donald and Lang (2001) (or DL, for short) and minimum distance approaches. Conveniently, both the DL and minimum distance approaches can be extended to a broad class of nonlinear models. As in the linear case, the conditions under which minimum distance can be justified are less restrictive than those for the DL approach (which, importantly, requires unobserved cluster effects to be drawn from a homoskedastic normal distribution). Plus, the DL approach does not allow one to obtain standard errors for the partial effects themselves in nonlinear models. I cover this case in Section 3.2. 2. THE LINEAR MODEL WITH CLUSTER 4 EFFECTS This section considers linear models estimated using cluster samples (of which a panel data set is a special case). For each group or cluster g, let y gm , x g , z gm : m 1, . . . , M g be the observable data, where M g is the number of units in cluster g, y gm is a scalar response, x g is a 1 K vector containing explanatory variables that vary only at the group level, and z gm is a 1 L vector of covariates that vary within (as well as across) groups. 2.1. Specification of the Model The linear model with an additive error is y gm x g z gm v gm , m 1, . . . , M g ; g 1, . . . , G. (1) Our approach to estimation and inference in equation (1) depends on several factors, including whether we are interested in the effects of aggregate variables or individual-specific variables . Plus, we need to make assumptions about the error terms. In the context of pure cluster sampling, an important issue is whether the v gm contain a common group effect that can be separated in an additive fashion, as in v gm c g u gm , m 1, . . . , M g , where c g is an unobserved cluster effect and u gm is the idiosyncratic error. (In the statistics literature, (1) and (2) are referred to as a “hierarchical linear model.”) One important issue is whether the explanatory variables in (1) can be taken to be appropriately exogenous. Under 5 (2) (2), exogeneity issues are usefully broken down by separately considering c g and u gm . Throughout we assume that the sampling scheme generates observations that are independent across g. This assumption can be restrictive, particularly when the clusters are large geographical units. This paper does not formally consider problems of “spatial correlation” across clusters, although below I will mention some benefits of fixed effects estimators in such settings. Appropriate sampling assumptions within cluster are more complicated. Theoretically, the simplest case also allows the most flexibility for robust inference: from a large population of relatively small clusters, we draw a large number of clusters (G), where cluster g has M g members. This setup is appropriate, for example, in randomly sampling a large number of families, classrooms, or firms from a large population. The key feature is that the number of groups is large enough relative to the group sizes so that we can allow essentially unrestricted within-cluster correlation. Randomly sampling a large number of clusters also applies to many panel data sets, where the cross-sectional population size is large (say, individuals, firms, even cities or counties) and the number of time periods is relatively small. In the panel data setting, G is the number of cross-sectional units and M g is the number of time periods for unit g. A different sampling scheme results in data sets that also can be arranged by group. We first stratify the population into G ≥ 2 nonoverlapping groups and then obtain a random sample of size M g from each group. Ideally, the strata sizes are large in the population, hopefully resulting in large M g . I adopt this perspective for the “small G” case in Section 2.3. 2.2. Large Group Asymptotics 6 In this section I review methods and estimators justified when the asymptotic approximations theory is with G → and the group sizes are fixed. The theory is well developed; see, for example, White (1984) and Wooldridge (2002, Chapters 10, 11). Here, I summarize the “large G” theory, emphasizing how one might wish to use methods robust to cluster sampling even when it is not so obvious. First suppose that the covariates satisfy Ev gm |x g , z gm 0, m 1, . . . , M g ; g 1, . . . , G. For consistency, we can, of course, get by with zero correlation assumptions, but we use (3) for convenience because it meshes well with assumptions concerning conditional second moments. Importantly, the exogeneity in (3) only requires that z gm and v gm are uncorrelated. In particular, it does not specify assumptions concerning v gm and z gp for m ≠ p. In the panel data literature, where m is a time index, (3) is called the “contemporaneous exogeneity” assumption; see Wooldridge (2002, Chapter 7). Allowing for correlation between v gm and z gp , m ≠ p is useful for some panel data applications and possibly even cluster samples (if the covariates of one unit can affect another unit’s response). Under (3) and a standard rank condition on the covariates, the pooled OLS estimator, where we regress y gm on 1, x g , z gm , m 1, . . . , M g ; g 1, . . . , G, is consistent for ≡ , ′ , ′ ′ (as G → with M g fixed) and G -asymptotically normal. See, for example, Wooldridge (2002, Sections 7.8). Without more assumptions, a robust variance matrix is needed to account for correlation within clusters or heteroskedasticity in Varv gm |x g , Z g , or both. When v gm has the form in (2), the amount of within-cluster correlation can be substantial, which means the usual OLS 7 (3) standard errors can be very misleading (and, in most cases, systematically too small). In the case of common cluster sizes, Wooldridge (2002, Section 7.8) gives the formula for a variance-matrix estimator that assumes no particular kind of within-cluster correlation nor a particular form of heteroskedasticity. Allowing for different group sizes is straightforward. Write the equation by cluster as y g W g v g , g 1, . . . , G, (4) where W g is the M g 1 K L matrix of all regressors for group g. Then the 1 K L 1 K L variance matrix estimator is Avar̂ G ∑ W ′g W g g1 −1 G ∑ W ′g v̂ g v̂ ′g W g g1 G −1 ∑ W ′g W g (5) g1 where v̂ g is the M g 1 vector of pooled OLS residuals for group g. This asymptotic variance is now computed routinely by packages such as Stata ® (see the appendix). It is important, however, to remember that the standard errors and test statistics obtained are known to be valid only as G → with each M g fixed. Probably they are valid if M g increases at a rate somewhat less than G. Practically, the important issue is how well the robust variance matrix estimator works for common cluster sample sizes, something I comment on below. In estimating the parameters in (1), the pooled OLS estimator ignores the within-cluster correlation of the v gm . If we strengthen the exogeneity assumption to Ev gm |x g , Z g 0, m 1, . . . , M g ; g 1, . . . , G, where Z g is the M g L matrix of unit-specific covariates, then we can exploit the presence of c g in (2) in a generalized least squares (GLS) analysis. [In the panel data case, (6) is known as the “strict exogeneity” assumption on z gm : m 1, . . . , M g .] The standard “random effects” 8 (6) approach makes enough assumptions so that the M g M g variance-covariance matrix of v g v g1 , v g2 , . . . , v g,M g ′ has the so-called “random effects” form, Varv g 2c j ′M g j M g 2u I M g , (7) where j M g is the M g 1 vector of ones and I M g is the M g M g identity matrix. In the standard setup, we also make the “system homoskedasticity” assumption, Varv g |x g , Z g Varv g . (8) It is important to understand the role of assumption (8): it implies that the conditional variance-covariance matrix is the same as the unconditional variance-covariance matrix, but it does not restrict Varv g . The particular random effects structure on Varv g is given by (7). Under (7) and (8), the resulting GLS estimator is the well-known random effects (RE) estimator; see, for example, Wooldridge (2002, Section 10.3). The random effects estimator is asymptotically more efficient than pooled OLS under (6), (7), and (8) (and a standard rank condition). The RE estimates and test statistics are computed routinely by popular software packages. Nevertheless, an important point is often overlooked in applications of RE: one can, and in many cases should, make inference completely robust to an unknown form of Varv g |x g , Z g . Wooldridge [2002, equation (7.49)] gives the robust formula in the balanced case (same group sizes), and the formula extends easily to the unbalanced case. One way to think about the robust formula is to characterize RE as a pooled OLS estimator on quasi-time demeaned data. For each g, define ̂ g 1 − 1/1 M g ̂ 2c /̂ 2u 1/2 , where ̂ 2c and ̂ 2u are estimators of the two variance parameters. Then the RE estimator is identical to the pooled OLS estimator of y gm − ̂ g ȳ g on 1 − ̂ g , 1 − ̂ g x g , z gm − ̂ g z̄ g , m 1, . . . , M g ; g 1, . . . , G; 9 (9) see, for example, Hsiao (1986). For fully robust inference, we can just apply the fully robust variance matrix estimator in (5) but on the transformed data. The idea in obtaining a fully robust variance matrix of RE is straightforward, and dates back at least to the “generalized estimating equation” (GEE) literature [Liang and Zeger (1986)]. Even if Varv g |x g , Z g does not have the RE form, the RE estimator is still consistent and G -asymptotically normal under (6), and it is likely to be more efficient than pooled OLS. [Except in unusual cases, incorrectly accounting for within-group correlation by assuming constant correlation within-group correlation is likely to be more efficient than entirely ignoring within-group correlation.] Yet we should recognize that the RE second moment assumptions can be violated without causing inconsistency in the RE estimator. Making inference robust to serial correlation in the idiosyncratic errors for panel data applications, particularly with more than a few time periods, can be very important. But within-group correlation in the idiosyncratic errors can arise for cluster samples, too, especially if underlying (1) is a random coefficient model, where z gm g replaces z gm . By estimating a standard random effects model that assumes common slopes , we effectively include z gm g − in the idiosyncratic error; this generally creates within-group correlation because z gm g − and z gp g − will be correlated for m ≠ p, conditional on Z g . Also, the idiosyncratic error will have heteroskedasticity that is a function of z gm . Nevertheless, if we assume E g |X g , Z g E g ≡ along with (6), the random effects estimator still consistently estimates the average slopes, . Therefore, in applying random effects to panel data or cluster samples, it is sensible (with large G) to make the variance estimator of random effects robust to arbitrary heteroskedasticity and within-group correlation. With panel data – so that there is a natural ordering of the observations within group – it 10 may make sense to estimate an unrestricted version of Varv g , especially if G is large. Even in that case, we can use Wooldridge [2002, equation (7.49)] to obtain a variance matrix robust to Varv gm |x g , Z g ≠ Varv g . In economics, the prevailing view is still that robust inference is not necessary when using GLS. Adopting the perspective of the GEE literature – that one often chooses a “working” variance-covariance matrix for convenience, but should recognize it may not represent the conditional variance-covariance matrix – is relatively straightforward these days. The appendix shows how to use GEE software to obtain fully robust inference after random effects estimation. If we are only interested in estimating , the “fixed effects” (FE) or “within” estimator is attractive. The within transformation subtracts off group averages from the dependent variable and explanatory variables: y gm − ȳ g z gm − z̄ g u gm − ū g , m 1, . . . , M g ; g 1, . . . , G, and this equation is estimated by pooled OLS. (The x g get swept away by the within-group demeaning.) Under a full set of “fixed effects” assumptions – which, unlike random effects, allows arbitrary correlation between c g and the z gm – inference is straightforward using standard software. Nevertheless, analogous to the random effects case, it is often important to allow Varu g |Z g to have an arbitrary form, including within-group correlation and heteroskedasticity. Arellano (1987) proposed a fully robust variance matrix estimator for the fixed effects estimator, and it is consistent (as G increases with the M g fixed) with cluster samples or panel data; see also Wooldridge [2002, equation (10.59)]. For panel data, the idiosyncratic errors can always have serial correlation or heteroskedasticity, and it is easy to 11 (10) guard against these problems in inference. Reasons for wanting a fully robust variance matrix estimator for FE applied to cluster samples are similar to the RE case. For example, if g replaces in (1), then z gm − z̄ g g − appears in u gm in (4). The FE estimator is still consistent if E g |z g1 − z̄ g , . . . , z g,M g − z̄ g E g – a reasonable assumption that allows g to be correlated with z̄ g provide the group slopes are mean-independent of the within-group deviations from the mean. Nevertheless, u gm , u gp will be correlated for m ≠ p. A fully robust variance matrix estimator is G Avar̂ FE ∑ Z̈ ′g Z̈ g g1 −1 G ∑ Z̈ ′g û g û ′g Z̈ g g1 G ∑ Z̈ ′g Z̈ g −1 , g1 where Z̈ g is the matrix of within-group deviations from means and û g is the M g 1 vector of fixed effects residuals. This estimator is justified with large-G asymptotics. One benefit of a fixed effects approach, especially in the standard model with constant slopes but c g in the composite error term, is that no adjustments are necessary if the c g are correlated across groups. When the groups represent different geographical units, we might expect correlation across groups close to each other. If we think such correlation is largely captured through the unobserved effect c g , then its elimination via the within transformation effectively solves the problem. If we use pooled OLS or a random effects approach, we would have to deal with spatial correlation across g, in addition to within-group correlation, and this is a difficult problem. Recently, in the context of fixed effects estimation and panel data, Kézde (2001) and Bertrand, Duflo, and Mullainathan (2002) study the finite-sample properties of robust variance matrix estimators that are theoretically justified only as G → . One common finding is that the fully robust estimator works reasonably well even when the cross-sectional sample size (G) 12 (11) is not especially large relative to the time-series dimension. When Varu g |Z g does not depend on Z g , a variance matrix that exploits system homoskedasticity can perform better than the fully robust variance matrix estimator. Kézde (2001) also finds that, without serial correlation, an estimator that adjusts the variance matrix only for heteroskedasticity works well. This finding is probably more relevant for cluster sample applications, where independence of the idiosyncratic errors is often reasonable unless slopes vary by group. Importantly, the encouraging simulaton findings for fixed effects with panel data are not in conflict with findings that the robust variance matrix for the pooled OLS estimator with a small number of clusters can behave poorly. An important difference in the two problems is that for FE estimation using panel data, the issue is serial correlation in the idiosyncratic errors u gm : m 1, . . . , M g , and in simulation studies to date this correlation has been assumed to die out as the time periods get far apart. [Typically, a stable AR(1) model is assumed.] The pooled OLS estimator that keeps c g in the error term suffers because of the constant correlation across all observations within cluster. Plus, fixed effects can only estimate , while for pooled OLS on cluster samples the focus is on typically on . If we are interested in , an interesting open question for a small cluster sample size is whether inference with the random effects estimator, whether carried out under the usual RE assumptions or made robust to arbitrary within-group correlation and heteroskedasticity, can improve on robust inference for pooled OLS. The previous discussion extends immediately to instrumental variables versions of all estimators. With large G, one can typically afford to make pooled two stage least squares (2SLS), random effects 2SLS, and fixed effects 2SLS robust to arbitrary within-cluster correlation and heteroskedasticity. Also, more efficient estimation is possible by applying 13 generalized method of moments (GMM); again, GMM is justified with large G. Unfortunately, for any of the IV methods there appears to be little simulation evidence for how “large G” standard errors work for IV methods when G is not so large. 2.3. Large Group Size Asymptotics The previous subsection discussed the mixed performance of standard “large G” variance matrix estimators when G is not so large. I now discuss estimation of the parameters in (1) under a standard stratified sampling scheme. In particular, suppose we have defined G strata in the population, and we obtain our data by randomly sampling from each stratum. As before, M g is the sample size for stratum (group) g. Without specifying how the sample was obtained, the resulting data set is essentially indistinguishable from that described in Section 2.2, where we randomly draw clusters from a population of many clusters. But the difference turns out to be important, and in this subsection I act as if the usual “large G” asymptotics do not apply. We are mostly interested in estimating in (1), but that generally requires estimating , too. Donald and Lang (2001) (hereafter, DL) study this problem, and I draw on their work while offering a complementary approach. DL’s examples are for very small G, and one of their examples is Card and Krueger (1994), who had G 2 states, New Jersey and Pennsylvania. DL’s approach can be, and has been, applied with larger numbers of groups. Moulton (1990), the clusters are states in the U.S. (G 50) and, in Loeb and Bound (1996), G 36 cohort-division groups. I explicitly consider the case where G and the M g are both (moderately) “large” in Section 2.4. 14 To illustrate the pitfalls in applying standard methods such as pooled OLS when G is “small,” consider a special case of (1), where x g is a scalar (as in DL) and z gm is not in the equation. The equation is y gm x g c g u gm , m 1, . . . , M g ; g 1, . . . , G, (12) where c g and u gm : m 1, . . . , M g are independent of x g and u gm : m 1, . . . , M g is a mean-zero, independent, identically distributed sequence for each g. Now, if the cluster effect c g is absent from the model, then we can estimate (12) using pooled ordinary least squares, and inference is straightforward. If we assume Varu gm is constant across g then, provided N ≡ M 1 . . . M G is large enough (often N as small as 30 suffices) – even if G is not large – we can use the usual t statistics from the pooled OLS regression as having an approximate standard normal distribution. Making inference robust to heteroskedasticity is straightforward, and tends to work well even for moderate N. Donald and Lang (2001) allow for the presence of c g Normal0, 2c , which they assume to be independent of u gm : m 1, . . . , M g . With a common cluster effect, there is no averaging out within cluster that allows application of the central limit theorem. One way to see the problem is to note that the pooled OLS estimator, ̂ , is identical to the “between” estimator obtained from the regression ȳ g on 1, x g , g 1, . . . , G. Conditional on the x g , ̂ inherits its distribution from v̄ g : g 1, . . . , G, the within-group averages of the composite errors v gm ≡ c g u gm . The presence of c g means new observations within group do not provide additional information for estimating beyond how they affect the group average, ȳ g . 15 (13) If we add some strong assumptions, there is an exact solution to the inference problem. In particular, assume u gm Normal0, 2u and M g M for all g (same sample size across strata). Then v̄ g Normal0, 2c 2u /M for all g. Because we assume independence across g, the equation ȳ g x g v̄ g , g 1, . . . , G satisfies the classical linear model assumptions. Therefore, we can use inference based on the t G−2 distribution to test hypotheses about , provided G 2. When G is very small, the requirements for a significant t statistic using the t G−2 distribution are much more stringent then if we use the t M 1 M 2 ...M G −2 distribution – which is what we would be doing if we use the usual pooled OLS statistics. When x g is a 1 K vector, we need G K 1 to use the t G−K−1 distribution for inference. [In Moulton (1990), x g contains 17 elements, which can be handled in this setup because G 50. ] As pointed out by DL, performing the correct inference in the presence of c g is not just a matter of correcting the pooled OLS standard errors for cluster correlation, or using the RE estimator. There is only one estimator: pooled OLS, random effects, and the between regression in (13) all lead to the same ̂ . But it is the regression in (13) that gives the appropriate standard error and reports the correct (small) degrees of freedom in the t distribution. We can apply the DL method without normality of the u gm if the common group size M is large: by the central limit theorem, ū g will be approximately normally distributed very generally. Then, because c g is normally distributed, we can treat v̄ g as approximately normal with constant variance. Further, even if the group sizes differ across g, for very large group 16 (14) sizes ū g will be a negligible part of v̄ g : Varv̄ g 2c 2u /M g . Provided c g is normally distributed and it dominates v̄ g , a classical linear model analysis on (14) should be roughly valid. The broadest applicability of DL’s setup is when the average of the idiosyncratic errors, ū g , can be ignored – either because 2u is small relative to 2c , M g is large, or both. It is important to see that applying DL with different group sizes or nonnormality of the u gm is identical to ignoring the estimation error in the sample averages, ȳ g . In other words, it is as if we are analyzing the simple regression g x g c g using the classical linear model assumptions (where ȳ g is used in place of the unknown group mean, g ). With small G, we need to assume c g is normally distributed. The DL solution to the inference problem is actually pretty common as a strategy to check robustness of results obtained from cluster samples, but typically it is implemented with large G. Often with cluster samples one estimates the parameters using the disaggregated data and also the averaged data. With covariates that vary within cluster, using averaged data is generally inefficient. But it does mean that standard errors need not be made robust to within-cluster correlation. DL essentially point out that using (14) with small G ensures that inference is conservative (at least if normality of c g holds and the group sizes are large). For small G and large M g , inference obtained from analyzing (14) as a classical linear model will be very conservative in the absense of a cluster effect. Perhaps this is desirable, but it rules out some widely-used staples of policy analysis. For example, suppose we have two populations (maybe men and women, two different cities, or a treatment and a control group) with means g , g 1, 2, and we would like to obtain a confidence interval for their difference. Under random sampling from each population, and assuming normality and equal population 17 variances, the usual comparison-of-means statistic is distributed exactly as t M 1 M 2 −2 under the null hypothesis of equal population means. (Or, we can construct an exact 95% confidence interval of the difference in population means.) With even moderate sizes for M 1 and M 2 , the t M 1 M 2 −2 distribution is close to the standard normal distribution. Plus, we can relax normality to obtain approximately valid inference, and it is easy to adjust the t statistic to allow for different population variances. The comparison-of-means setup is most student’s first exposure to policy analysis. But we cannot even study the difference-in-means estimator in the DL setup because G 2. DL criticize Card and Krueger (1994) for comparing mean wage changes of fast-food workers across two states because Card and Krueger fail to account for the state effect (New Jersery or Pennsylvania), c g , in the composite error, v gm . But the DL criticism in the G 2 case is indistinguishable from a common question raised for any difference-in-differences analyses: How can we be sure that any observed difference in means is due entirely to the policy change? To characterize the problem as failing to account for an unobserved group effect is not necessarily helpful. If the experiment is appropriately randomized, then c g should be part of the effect to be estimated. Consider a related example. Suppose that, over the summer, a school district with two high schools, say A and B, decides to provide personal computers for students at high school B who just finished their first year. The announcement is made just before the school year starts, so that students cannot switch high schools. The response variable is the change in a standardized test score given in the district from the first to the second year. If we sample students from each high school, a comparsion of means should be appropriate if the experiment is effectively randomized. Of course, if there are other confounding factors – say, the average increase in 18 test scores is already higher at school B – then the difference-in-differences estimator will not be appropriate. If z gm appears in (1), we can modify (14) by adding the group averages, z̄ g , as explanatory variables, but only if G K L 1. Under the normality and homoskedasticity assumptions with common group sizes, inference can be carried out using the t G−K−L−1 distribution. DL propose an estimation method that has an exact t G−K−1 distribution, regardless of the size of L, but they assume that z̄ g is constant across g, z̄ g z̄, in which case z̄ gets absorbed into the intercept in (14) and we are back to the case without z gm . With large group sizes M g , a different perspective on estimating sheds additional insight on problems of inference, and allows more flexibility on the kinds of questions that can be answered. Write for each goup g y gm g z gm g u gm , m 1, . . . , M g , (15) where we assume random sampling within group and independent sampling across groups. We make the standard assumptions for OLS to be consistent (as M g → ) and M g -asymptotically normal; see, for example, Wooldridge (2002, Chapter 4). The presence of group-level variables x g in a “structural” model can be viewed as putting restrictions on the intercepts, g , in the separate group models in (15). In particular, g x g , g 1, . . . , G, where we now think of x g as fixed, observed attributes of heterogeneous groups. With K attributes we must have G ≥ K 1. If M g is large enough to estimate the g precisely, a simple two-step estimation strategy suggests itself. First, obtain the ̂ g , along with ̂ g , from an OLS regression within each group. If G K 1 then, typically, we can solve for ̂ ≡ ̂ , ̂ ′ ′ 19 (16) uniquely in terms of the G 1 vector ̂ :. ̂ X −1 ̂ , where X is the K 1 K 1 matrix with g th row 1, x g . If G K 1 then, in a second step, we can use a minimum distance approach, as described in Wooldridge (2002, Section 14.6). If we use as the weighting matrix I G , the G G identity matrix, then the minimum distance estimator can be computed from the OLS regression ̂ g on 1, x g , g 1, . . . , G. Under asymptotics such that M g g M where 0 g ≤ 1 and M → , the minimum distance estimator ̂ is consistent and M -asymptotically normal. Still, this particular minimum distance estimator is asymptotically inefficient except under strong assumptions. Because the samples are assumed to be independent, it is not difficult to obtain the efficient minimum distance (MD) estimator – also called the “minimum chi-square” estimator. First consider the case where z gm does not appear in the first stage estimation, so that the ̂ g is just ȳ g , the sample mean for group g. Let ̂ 2g denote the usual sample variance for group g. Because the ȳ g are independent across g, the efficient MD estimator uses a diagonal weighting matrix. As a computational device, the minimum chi-square estimator can be computed by using the weighted least squares (WLS) version of (17), where group g is weighted by M g /̂ 2g (groups that have more data and smaller variance receive greater weight). Conveniently, the reported t statistics from the WLS regression are asymptotically standard normal as the group sizes M g get large. (With fixed G, the WLS nature of the estimation is just a computational device; the standard asymptotic analysis of the WLS estimator has G → .). The minimum distance approach works with small G provided G ≥ K 1 and each M g is large enough so that normality is a good approximation to the distribution of the (properly scaled) sample average 20 (17) within each group. If z gm is present in the first-stage estimation, we use as the minimum chi-square weights the inverses of the asymptotic variances for the g intercepts in the separate G regressions. With large M g , we might make these fully robust to heteroskedasticity in Eu 2gm |z gm using the White (1980) sandwich variance estimator, or at least allow for different 2g . In any case, once we have the Avar̂ g – which are just the squared reported standard errors for the ̂ g – we use as weights 1/Avar̂ g in the computationally simple WLS procedure. We are still using independence across g in obtaining a diagonal weighting matrix in the MD estimation. An important by-product of the WLS regression is a minimum chi-square statistic that can be used to test the G − K − 1 overidentifying restrictions. The statistic is easily obtained as the weighted sum of squared residuals, say SSR w . Under the null hypothesis in (16), a 2 SSR w G−K−1 as the group sizes, M g , get large. If we reject H 0 at a reasonably small significance level, the x g are not sufficient for characterizing the changing intercepts across groups. If we fail to reject H 0 , we can have some confidence in our specification, and perform inference using the standard normal distribution for t statistics for testing linear combinations of the population averages. We might also be interested in how one of the slopes in g depends on the group features, x g . Then, we simple replace ̂ g with, say ̂ g1 , the slope on the first element of z gm . Naturally, we would use 1/Avar̂ g1 as the weights in the MD estimation. The minimum distance approach can also be applied if we impose g for all g, as in the original model (1). Obtaining the ̂ g themselves is easy: run the pooled regression y gm on d1 g , d2 g , . . . , dG g , z gm , m 1, . . . , M g ; g 1, . . . , G 21 (18) where d1 g , d2 g , . . . , dG g are group dummy variables. Using the ̂ g from the pooled regression (18) in MD estimation is complicated by the fact that the ̂ g are no longer asymptotically independent; in fact, ̂ g ȳ g − z̄ g ̂ , where ̂ is the vector of common slopes, and the presence of ̂ induces correlation among the intercept estimators. Let V̂ be the G G estimated (asymptotic) variance matrix of the G 1 vector ̂ . Then the MD estimator is ̂ X ′ V̂ −1 X −1 X ′ V̂ −1 ̂ and its estimated asymptotic variance is X ′ V̂ −1 X −1 . If the OLS regression (17) is used, or the WLS version, the resulting standard errors will be incorrect because they ignore the across group correlation in the estimators. (With large group sizes the errors might be small; see the next subsection.) Intermediate approaches are available, too. Loeb and Bound (1996) (LB for short) allow different group intercepts and group-specific slopes on education, but impose common slopes on demographic and family background variable. The main group-level covariate is the student-teacher ratio. Thus, LB are interested in seeing how the student-teach ratio affects the relationship between test scores and education levels. LB use both the unweighted estimator and the weighted estimator and find that the results differ in unimportant ways. Because they impose common slopes on a set of regressors, the estimated slopes on education (say ̂ g1 ) are not asymptotically independent, and perhaps using a nondiagonal estimated variance matrix V̂ (which would be 36 36 in this case) is more appropriate; but see Section 2.4. As another example, suppose x g is a binary treatment indicator, equal to unity if group g is in the treatment group and zero otherwise. Then ̂ is an estimate of an average treatment effect (ATE). If G 2, there are no restrictions to test, as in the simple comparison-of-means setup. If G 2 we can test the overidentifying restrictions. If we reject them, then there are unaccounted for group-level differences, and we might want to re-specify our model by adding 22 more elements to x g , even if we think the new elements of x g are not systematically related to the original elements. Of course, if we add elements until G K 1, there are no restrictions to test. If we reject the overidentifying restrictions, we are essentially concluding that g x g c g , where c g is the error made in imposing the restrictions (16) on the g . One possibility is to apply the Donald and Lang approach, which is to analyze the OLS regression in (17) in the context of the classical linear model (CLM), where inference is based on the t G−K−1 distribution. Why is a CLM analysis justified? Since ̂ g g O p M −1/2 g , we can ingore the estimation error in ̂ g for large M g (Recall that the same “large M g ” assumption underlies the minimum distance approach.) Then, it is as if we are estimating the equation g x g c g , g 1, . . . , G by OLS. If the c g are drawn from a normal distribution, classical analysis is applicable because c g is assumed to be independent of x g . This approach is desirable when one cannot, or does not want to, find group-level observables that completely determine the g . It is predicated on the assumption that the other factors in c g are not systematically related to x g , a reasonable assumption if, say, x g is a randomly assigned treatment at the group level, a case considered by Angrist and Lavy (2002). Before settling for conservative inference, which is justified only under normality of c g when we can ignore the estimation error in ̂ g , it makes sense to choose x g carefully and test Varc g 0. This is precisely the purpose of the minimum chi-square overidentification statistic. The minimum chi-square approach is justified very generally when the group sizes, M g , are moderately large; normality is not needed anywhere, nor is homoskedasticity within or 2 across groups. If the M g are not very large but G − K − 1 is large, the G−K−1 distribution may not be a good approximation to the distribution of the overidentification statistic. 23 Even if the overidentification statistic rejects (16), it may be too rash to adopt the DL approach with small degrees of freedom. To see why, consider a treatment effect setup. Suppose there are three large groups (G 3), and groups one and two are the control groups while group three is the treatment group. So x g is a binary variable with x 1 x 2 0 and x 3 1. For simplicity, there are no control variables. Then the estimated average treatment effect using the DL regression (13) or the (unweighted) minimum distance regression (18) are identical, and simply equal to the difference in means for the treatment and control group. It is conveniently written as ̂ ȳ 3 − p 1 ȳ 1 p 2 ȳ 2 , where p 1 M 1 /M 1 M 2 and p 2 M 2 /M 1 M 2 are the proportions of observations for groups one and two, respectively, out of the total number of control units. In other words, the mean outcome for the control group is just a weighted average. The efficient minimum distance estimator would use a different weighting scheme, but that is not important for the main point here. Because G 3, we can test the single overidentifying restriction. This is tantamount to testing whether 1 2 , that is, the population means of the two control groups are the same. Here is the important point: should we really reject the standard analysis just because there is a heterogeneous response in the two control groups? I think not. If we do and apply the DL approach, we wind up using a t distribution with a single degree of freedom to conduct inference, even though we may have hundreds, or even thousands, of observations from each group or stratum. Instead, we can be satisfied with (19) as the estimate of the treatment effect without worrying whether 1 2 . In fact, if the sample proportions p 1 and p 2 converge to the 24 (19) relative population frequencies, say 1 and 2 , then the probability limit of (19) is simply 3 − 1 1 2 2 , where the term in parentheses is the average across the two control group populations. If p 1 and p 2 are not consistent estimates of the population frequencies, then the control group mean estimates a weighted average of 1 and 2 . We may or may not be interested in the particular weighted average implied by (19). Nevertheless, the DL method produces exactly the same estimate of the ATE, but DL would insist that the estimated effect be deemed statistically significant only if the t statistic, from the regression of ȳ g on 1, x g , g 1, 2, 3, is on the order of 12.8 (for a two-sided, 5% level test). Rather than either rejecting the minimum distance approach if 1 ≠ 2 , or simply accepting the estimated treatment effect as possibly estimating an unteresting weighted combination of the mean effects for different groups, we can specify ahead of time a particular linear combination of the means that is of interest. So, for example, suppose that the first G 1 groups are the controls and the remaining G 2 groups are the treated groups. Then, we might define the treatment effect in terms of two simple averages, say C 1 . . . G 1 /G 1 and T G 1 1 . . . G /G 2 ; the treatment effect would be defined as T − C . Even with only a moderate sample size for each g, we can typically get fairly precise estimates of the g using the sample average. Estimation of is then straightforward, and inference is standard because asymptotic standard errors for ̂ C and ̂ T are easily obtained. Of course, one might use weighted averages in each case, particularly if one has knowledge of population frequencies. The general point is that once we have specified an interesting population parameter, we can obtain a consistent, approximately normal estimator, and inference can be based on the standard normal distribution when the group sizes are moderately large. There is no reason to rely on the small degrees of freedom approach of DL. 25 Beyond the treatment effect case, the issue of how to define parameters of interest appears complicated, and deserves further research. 2.4. What if G and the M g are Both “Large”? In Section 2.2, I reviewed methods appropriate for a large number of groups and relatively small group sizes. In Section 2.3, I considered two approaches appropriate for large group sizes and a small number of groups. The DL and minimum distance approaches use the large group sizes assumption differently: in its most applicable setting, DL use the large M g assumption to ignore the first-stage estimation error entirely, while the asymptotics underlying the MD approach is based on applying the central limit theorem within each group. Not surprisingly, more flexibility is afforded if G and M g are both “large.” For example, suppose we adope the DL specification (with an unobserved cluster effect) and G 50 (say, states in the U.S.). Further, assume first that the group sizes are large enough (or the cluster effects are so strong) that the first-stage estimation error can be ignored. Then, it matters not whether we impose some common slopes or run separate regressions for each group (state) in the first stage: we can treat the ̂ g , g 1, . . . , G, as independent “observations” to be used in the second stage. This means we apply regression (17) and apply the usual inference procedures. The difference now is that with G 50, the usual t statistics have some robustness to nonnormality of the c g , assuming the CLT approximation works well With small G, the exact inference was based on normality of the c g . Loeb and Bound (1996) essentially use regression (17), but with estimated slopes as the 26 dependent variable in place of estimated intercepts. As mentioned in Section 2.3, LB impose some common slopes across groups, which means all estimated parameters are dependent across group. The minimum distance approach without cluster effects is one way to account for the dependence. Alternatively, one can simply adopt the DL perspective and just assume the estimation error is swamped by c g ; then standard OLS analysis is approximately justfied. LB also apply a weighted least squares procedure that is intended to account for the different sampling variations in their slope estimates (largely due to different group sample sizes). With a cluster effect, the LB weights are not the correct ones for eliminating heteroskedasticity. [Plus, we do not have the actual variances of the ̂ g of ̂ g1 ; we have only estimates of their asymptotic variances. This is fine from the MD perspective, but not really correct from a WLS perspective.] There is nothing really wrong with using the incorrect weights provided the standard errors are appropriately adjusted, but with G 36 one might be concerned about the finite-sample properties of robust standard errors. With a cluster effect, the same problem arises with OLS, as it ignores the heteroskedasticity. From the MD perspective, the LB WLS approach would be correct if each first-stage regression were estimated separately; but they were not. In summary, with G somewhat large, the DL approach gains robustness to nonnormality of the c g if the first-stage estimation error can be safely ignored. A WLS approach is justified in the MD approach when separate regressions can be run for each group. But that weighting is not the correct one when either a cluster effect is present or one must account for dependence in the first-stage estimators across groups. 27 3. NONLINEAR MODELS Many of the issues for nonlinear models are the same as for linear models. The biggest difference is that, in many cases, standard approaches require distributional assumptions about the unobserved group effects. In addition, it is more difficult in nonlinear models to allow for group effects correlated with covariates, especially when group sizes differ. For the small G case, we offer extensions of the Donald and Lang (2001) approach (with large group sizes) and the minimum distance approach. Rather than using a general, abstract setting, the issues for nonlinear models are easily illustrated with a few widely-used examples. Here I discuss binary response, count data, tobit models, and fractional response models. 3.1. Large Groups Asymptotics As in Section 2, we begin with the case where the asymptotic analysis is with fixed group sizes M g with the number of groups G getting large. 3.1.1. Binary Response Models We can illustrate many issues using an unobserved effects probit model. Let y gm be a binary response, with x g and z gm , m 1, . . . , M g , g 1, . . . , G defined as in Section 2. Assume 28 that y gm 1 x g z gm c g u gm ≥ 0 (20) u gm |x g , Z g , c g ~Normal0, 1 (21) (where 1 is the indicator function). Equations (20) and (21) imply Py gm 1|x g , z gm , c g Py gm 1|x g , Z g , c g x g z gm c g , (22) where is the standard normal cumulative distribution function (cdf). We assume throughout that only z gm affects the response probability of y gm conditional on x g and c g ; the outcomes of z gp for p ≠ m are assumed not to matter. This is captured in (22). For pooled methods we could relax this restriction (as in the linear case), but, with the presence of c g , this affords little generality in practice. The presence of c g in (22) raises several important issues. Before even considering estimation, we should know how to define the partial effects of the explanatory variables, x g , z gm . The elements of and provide the directions of effects and, for continuous elements, the relative sizes. For example, if the first element of x g is continuous, ∂Py gm 1|x g , z gm , c g 1 x g z gm c g , ∂x g1 where is the standard normal density function. From (23), the sign of 1 tells us whether x g1 has a positive or negative partial effect on the response probability. Further, if x g2 is another continuous variable, the ratio of partial effects is simply 1 / 2 . For discrete covariates, the signs of the parameters give the direction of effects, although the relative magnitudes are not free of the unobservable, c g . In any case, the magnitude of the partial effects depends directly on c g , as (23) shows in the case of a continuous covariate. Often, we need to a measure of the change in the response probability given a change in a regressor. To 29 (23) address the dependence of (23) [and its discrete analogs] on c g , we can average (23) across the population distribution of c g , for fixed values of the covariates. This results in what I have called the “average partial effect” (APE); see, for example, Wooldridge (2002, 2005). The simplest way to obtain APEs requires is to specify a distribution (or a conditional distribution) for c g . We can easily derive the APEs, and then consistently estimate them, if we assume c g is independent of x g , Z g and is normally distributed: c g |x g , Z g ~Normal0, 2c , (24) where the zero mean is without loss of generality because (20) contains an intercept, . The unconditional normality assumption for c g implies that the APEs are obtained from the function Py gm 1|x g , Z g x g z gm /1 2c 1/2 ≡ c x g c z gm c , where c /1 2c 1/2 , and so on; see, for example, Wooldridge (2002, Chapter 15). Conveniently, the scaled coefficients are exactly the coefficients estimated by using a simple pooled probit procedure. So, for estimating the average partial effects, pooled probit is perfectly acceptable. With large G and small group sizes, we can easily make the standard errors and test statistics robust to arbitarary within group correlation. This is especially convenient for panel data applications, where serial correlation in the idiosyncratic errors, u gm (where m indexes time) causes extra dependence in the responses y gm that is not accounted for by c g . In other words, the “cluster-robust” variance matrix estimator allows for any kind of within-group correlation, just as in the linear case. If we supplement (20),.(21), and (24) with 30 (25) u g1 , . . . , u g,M g are independent conditional on x g , Z g , c g then we have the so-called “random effects probit” model. Under the RE probit assumptions, , , and 2c are all identified, which means we can estimate the APEs as well as the partial effects evaluated at the everage value of c g , which is zero. We can also compute partial effects at other values of c g that we might select from the normal distribution with estimated standard deviation c . The details for random effects probit in the balanced panel data case are given in Wooldridge (2002, Chapter 15). The unbalanced case is similar. The random effects probit estimator has no known robustness properties – that is, if any of the assumptions (20),.(21), (24), and (26) fail, the parameter estimators are believed to be inconsistent. Plus, the APEs would not be consistently estimated in general. Therefore, if we insist that the parameter estimators and APEs are consistently estimated, there is no logical reason to use robust inference; the usual MLE inference is appropriate. Still, one might view the RE probit model as an approximation to the true model, and use a “sandwich” estimator for robust inference; see White (1982) and Wooldridge (2002, Chapter 12). Random effects probit relies on a full set of distributional assumptions, while pooled probit relaxes assumptions on the joint distribution at the cost of inefficient estimators (of the scaled parameters indexing the APEs). If we maintain the strict exogeneity assumption of the explanatory variables, there are estimators that are generally more efficient than pooled probit while remaining consistent under (20), (21), and (24). The “generalized estimating equation” (GEE) literature applies to grouped data (panel data as a special case); see Liang and Zeger (1986). Perhaps the best way to think about GEE is as a multivariate weighted nonlinear least squares (WNLS) estimator, with explicit recognition that the variance-covariance matrix underlying the WNLS estimation is likely misspecified. The GEE approach begins by 31 (26) specifying a conditional mean for the M g 1 vector y g . In the binary response case, Ey g |x g , Z g is simply the vector of response probabilities in (25). [In the context of GEE, the expression in (25) would be called the “population averaged model” because the unobserved effect has been averaged out.] Because y gm is binary, Vary gm |x g , Z g c x g c z gm c 1 − c x g c z gm c . The pooled weighted nonlinear least squares estimator, using as weights the inverse of the estimated variance function, produces an an estimator asymptotically equivalent to the pooled probit estimator. (Remember, this is all as G → with the M g fixed.) What GEE does is exploit the within-group correlation to obtain a more efficient estimator. It does this by specifying a “working correlation matrix” (WCM) which is, for each g, an M g M g matrix of correlations. This matrix is supposed to approximate Corry g |x g , Z g . In most implementations of GEE, the WCM does not depend on the condititioning variables. For cluster samples and panel data, a useful WCM is the “exchangeable” matrix that assumes a constant correlation, , across any two pairs m and p. Combined with (27), the WCM gives an approximation to Vary g |x g , Z g . The resulting (simple) structure is known to be wrong if the random effects probit model, or any extensions (such as modeling the within-group correlation in u g1 , . . . , u g,M g ). The idea is that accounting for the within-cluster correlation in even a crude way generally leads to more efficient estimation than ignoring the correlation in estimation, as with pooled probit. Because the M g M g variance matrices are assumed to be misspecified, a “sandwich” estimator should be used for inference, and this option is standard (and should always be used) for GEE applications. The GEE estimates are directly comparable to the pooled probit estimates – the parameters index the APEs in both cases – and can be 32 (27) compared to random effects estimates after scaling the latter by 1/1 ̂ 2c 1/2 . In many large G applications, interest centers on the causal effect of the unit specific covariates, z gm , controlling for differences in unobserved heterogeneity, c g . Then, x g (and an intercept) is absorbed into c g and correlation between c g and z g1 , z g2 , . . . , z g,M g is allowed. For linear models, we saw that the within or fixed effects estimator allows arbitrary correlation, and does not restrict the within-cluster dependence of u g1 , . . . , u g,M g . Unfortunately, allowing general correlation is much more difficult in nonlinear models. In the balanced case, where the M g are the same, a device due to Chamberlain (1980) can be used. Actually, in practice a more restrictive version is often used: the distribution of c g given z g1 , z g2 , . . . , z gM is assumed to be homoskedasticity normal with a mean linear in the group average: c g |Z g ~Normal z̄ g , 2a , (28) where 2a is the conditional variance Varc g |Z g . If we use all random effects probit assumptions but with (28) in place of (24), then we obtain a simple extension of the RE probit model: simply add the group averages, z̄ g , as a set of additional explanatory variables. The marginal distributions are Py gm 1|Z g z gm z̄ g /1 2a 1/2 ≡ a z gm a z̄ g a (29) where now the coefficients are scaled by a function of the conditional variance. As shown in Wooldridge (2002, Chapter 15), the average partial effects are consistently estimated by taking derivatives and changes of G G −1 ∑ ̂ a ẑ a z̄ g ̂ a g1 33 (30) with respect to elements of z. In other words, we average out the effects of the group averages, z̄ g . The RE probit assumptions identify the original coefficients and 2a . We can also apply pooled probit or GEE without (26) and directly estimate the scaled coefficients (and only the scaled coefficients are identified). Also, one is free to add x g to the set of covariates, but, typically, one would not interpret the partial effects in a causal way (because c g and x g might be (partially) correlated.) Unfortunately, while the Chamberlain-Mundlak device is sensible for balanced cluster samples (including balanced panels), it needs to be modified for the unbalanced case. [Of course, one can always balance a cluster sample under the assumption that the cluster sizes are exogenous, and that might be desirable if there is not much variation in the cluster sizes. But sometimes it entails throwing away a lot of data.] Here I will simply mention what might be done. At a minimum, (28) should be modified to allow the variances to depend on the cluster size, M g . Under restrictive assumptions, such as joint normality of c g , z g1 , . . . , z g,M g , with the z gm independent and identically distributed within a cluster, one can derive Varc g |Z g . But these are strong assumptions. With G large and little variation in the group sizes, M g , one might just allow a different variance for each M g . Then the marginal distributions are Py gm 1|Z g z gm z̄ g /1 2a,M g 1/2 . A simple approach, and one that allows , to be M g , M g , is to estimate a different equation by pooled probit for each group size [with covariates 1, z gm , z̄ g ]. Then, we can estimate APEs for each M g , and then average those. Obtaining standard errors via the delta method would be tricky, and this approach ignores the constancy of across groups. [In effect, we would just be obtaining ̂ M g , ̂ M g , ̂ M g for each group size.] To impose a common 34 (31) , a minimum distance approach can be used. Alternatively, for each different value of M g we can estimate a random effects probit model with the z̄ g included as covariates. In fact, we could be more general and allow , to be M g , M g . Then, after estimating the slope and variance parameters for a set of extended RE probits, we can then use a minimum distance approach to estimate the common . This might be a useful topic for future research. In estimating APEs, one can dispense with parametric assumptions entirely, provided nonparametric estimation is feasible. In particular, consider the two assumptions Py gm 1|Z g , c g Py gm 1|z gm , c g ≡ Fz gm , c g (32) Dc g |z g1 , z g2 , . . . , z g,M g Dc g |z̄ g . (33) and Assumption (32) implies strict exogeneity conditional on c g while (33) means that the distribution of the group effect, c g , given all within-group covariates, depends only on the group average; a specific distributional assumption, such as homoskedastic normality with a mean linear in z̄ g , is clearly a special case. Define H g z gm , z̄ g Py gm 1|z gm , z̄ g . Under (32) and (33), it can be show that the APEs are obtained from E z̄ g H g z, z̄ g ; see, for example, Wooldridge (2005). If the group sizes differ, H g , generally does depend on g. If there are relatively few group sizes, it makes sense to estimate the H g , separately for each group size M g . Then, the APEs can be estimated from 35 (34) G G −1 ∑ Ĥ g z, z̄ g . (35) g1 As a practical matter, we might just use flexible parametric models, such as pooled probit or random effects probit. This approach does not impose the constant function F, as specified in (32), and so more efficient methods should be available, but I will not explore that here. Other strategies are available for estimating APEs. A procedure commonly known as “fixed effects probit” treats the c g as parameters to estimate in Py gm 1|Z g , c g Py gm 1|z gm , c g z gm c g . (36) With small group sizes M g (say, siblings or short panel data sets), treating the c g as parameters to estimate creates an incidental parameters problem. In particular, the estimator of can be badly biased. Interestingly, as shown in recent work by Fernández-Val (2005) for the balanced panel case and independence across m conditional on Z g , c g , the average partial effects, obtained from G G −1 ∑ ẑ ĉ g , g1 are unbiased for the population APEs up to order M −2 when the group effects are not present. When group effects are present, Fernández-Val (2005) proposes a bias correction for the APEs. See Fernández-Val (2005) for details. Some prefer to use a logit response function rather than probit. The use of logit for any of the methods discussed previously makes estimation of parameters and average partial effects more difficult, so logit has little to offer as an alternative. Nevertheless, for allowing correlation between c g and z gm , logit has an advantage over probit: under the conditional 36 (37) independence assumption y g1 , . . . , y g,M g are independent conditional on Z g , c g , (38) the conditional maximum likelihood estimator eliminates c g and leads to consistent estimation of , without having to specify Dc g |Z g . While this is useful for obtaining directions and relative magnitudes, it does not allow us to estimate the APEs. Plus, for panel data applications it is limited by the conditional independence assumption. 3.1.2. Count Data Almost all of the discussion for binary response models extends to other nonlinear models for cluster samples. An important application is to count data (or, in fact, any nonnegative response variable). A popular model for the conditional mean is Ey gm |x g , Z g , c g Ey gm |x g , z gm , c g exp x g z gm c g . As with the probit model we assume strict exogeneity of the unit-specific covariates, z gm , conditional on x g , c g . As usual, strict exogeneity can be relaxed for pooled methods but not for random effects, fixed effects, and GLS-type methods. A convenient normalization in (39) is Eexpc g 1. Under (39), a simple way to estimate , , and is the pooled Poisson quasi-maximum likelihood estimator (QMLE). In other words, act as if y gm has a Poisson distribution given x g , z gm and ignore the within-group correlation in estimation. In many statistical packages, it is simple to make inference robust to arbitrary within-group correlation (including serial correlation in a panel data setting). The Poisson QMLE is known to be fully robust to 37 (39) distributional misspecification, as well as within-group correlation. All that is required is Ey gm |x g , Z m exp x g z gm (40) (and we can replace Z m with z gm ) along with regularity conditions. See Wooldridge (2002, Chapter 19) for references and a summary. As with the linear case, any pooled method is likely to be inefficient. A typical random effects Poisson approach adds the assumptions y gm |x g , Z g , c g ~Poissonexp x g z gm c g , y g1 , . . . , y g,M g are independent conditional on x g , Z g , c g , (41) (42) and expc g |x g , Z g ~Gamma, , (43) where the gamma distribution is parameterized to have mean equal to one. [Of course, other distributions are possible, but the gamma is very common.] This estimator is programmed in many packages, and is, of course, efficient under the maintained assumptions. Unfortunately, the estimator has no known robustness properties to violations of (41), (42), or (43). As in the binary response case, a middle ground that maintains robustness while likely being more efficient than the pooled estimator is available. A GEE approach assumes (40), uses the nominal Poisson variance assumption Vary gm |x g , Z g Ey gm |x g , Z g exp x g z gm and specifies a working correlation matrix. Typically, with a cluster sample the “exchangeable” option would be used. As described in Section 3.1.1, the GEE estimator is a multivariate weighted nonlinear least squares estimator with a (probably) misspecified 38 (44) variance matrix. Thus, inference should be made fully robust using a sandwich estimator, and this option is available (see appendix). Unlike in the probit case, for count data we can derive the correct conditional variance matrix under the nominal random effects Poisson assumption, and the resulting correlation matrix does not have the exchangeable form. As shown in Wooldridge (2002, Chapter 19), Vary gm |x g , Z g exp x g z gm 2 exp x g z gm 2 Covy gm , y mp |x g , Z g 2 exp x g z gm exp x g z gp (45) (46) where 2 1/ Varexpc g . The conditional correlation clearly depends on the covariates. Therefore, rather than using a constant working correlation, one uses the variance matrix implied by (45) and (46). Such an estimator is still fully robust but would be more efficient than the exchangeable GEE estimator under the RE Poisson assumptions. Wooldridge (2002, Chapter 19) provides details on estimating 2 as well as fully robust inference [to guard against (45) or (46) being incorrect]. As far as I know, this estimator is not programmed in standard econometrics software, although it would not be difficult to do so. It is straightforward to extend the previous models to allow correlation between c g and z gm . Is in the binary response case, we initially drop x g . Then, we might be willing to assume Eexpc g |Z g Eexpc g |z̄ g exp g z̄ g g (47) where g , g depends on g through the group size, M g . Then Ey gm |Z m exp g z gm z̄ g g and now we can apply the pooled Poisson estimator, the GEE estimator, or the WNLS estimator implied by (45) and (46), but but where we allow the slope and intercepts on z̄ g to depend on the group sizes. One might want to use a different estimate of 2 for each group 39 (48) size, but that is not necessary for consistency. One could also extend the RE Poisson estimator to allow expc g |Z g to have a gamma distribution with mean in (47), although estimation would be much more complicated than extending the other methods. As in the binary response case, one can add x g to the model, subject to the issue of how to interpret the parameters. Because the elements of are semi-elasticities (or elasticities), often estimation of is sufficient. Still, estimation of APEs can be based on G G −1 ∑ exp̂ g ẑ z̄ g ̂ g . (49) g1 Rather than restricting the nature of the relationship between c g and Z g , we can use the “fixed effects Poisson” estimator to consistently estimate . Originally, this estimator was derived as a conditional MLE by Hausman, Hall, and Griliches (1984). Blundell, Griffith, and Windmeijer (2002) showed that treating the c g as parameters to estimate is the same as the conditional MLE, so the name “fixed effects Poisson” cannot cause confusion. As shown by Wooldridge (1999), the FE Poisson estimator has very desirable robustness properties: it is consistent and G -asymptotically normal under the assumption Ey gm |Z g , c g expz gm c g . The actual distribution of y gm is practically unrestricted (except for regularity conditions), as is the conditition dependence withing group. If interest lies in , the FE Poisson estimator is very attractive. Of course, one should generally use a sandwich variance matrix estimator [see Wooldridge (1999) or Wooldridge (2002, Section 19.6.4)]; unfortunately, this is not currently a standard option in statistical packages. 40 (50) 3.1.2. Tobit Models for Corner Solutions The discussion for Tobit models is very similar for probit models. Here, I assume that Tobit is applied to a “corner solution” response, so that y gm , the nonnegative variable with probability mass at zero, is the variable we wish to explain. In particular, I will focus on estimating APEs on the mean response. The standard Tobit model with a group effect is y gm max0, x g z gm c g u gm u gm |x g , Z g , c g ~Normal0, 2u (51) (52) Among other things, (51) and (52) imply Ey gm |x g , Z g , c g x g z gm c g / u x g z gm c g u x g z gm c g / u ≡ m x g z gm c g , 2u . (53) The APEs are obtained by integrating this expression with respect to the distribution of c g . A typical “andom effects approach assumes c g |x g , Z g ~Normal0, 2c , (54) in which case the APEs are remarkably simple to obtain. They are estimated from m̂ x̂ ẑ , ̂ 2v where 2v 2c 2u . Conveniently, the pooled Tobit directly provides consistent and G -asymptotically normal estimators of , , and 2v . A sandwich estimator is easily obtained to allow arbitrary within-group correlation. The full random effects Tobit specification adds the same conditional independence 41 (55) assumption as random effects probit; see (26). Then 2c and 2u are separately identified, but consistency requires all RE Tobit assumptions. In principle, a GEE-type approach can be used in the Tobit case as well. This could be based on the scores from the Tobit log-likelihood, although, as far as I know, details remain to be worked out. As with probit, we can modify pooled Tobit and RE Tobit to allow for correlation between c g and Z g , as in (28). But, with different group sizes, we should replace 2a Varc g |z̄ g with 2a,M g . As in the probit case, we can estimate a different variance for each group size M g if the group sizes are relatively few. An even easier approach is to allow all parameters to differ by M g , estimate the APEs in each case, and average them together. Either pooled Tobit or RE Tobit can be used. The APEs would be obtained from G G −1 ∑ m̂ M g ẑ M g z̄ g ̂ M g , ̂ 2v,M g (56) g1 where the “M g ” just indicates a different set of estimates for each different group size. With larger M g , one might use “fixed effects” Tobit. While it is known that the estimator of suffers from an incidental parameters problem, one can use G G −1 ∑ mẑ ĉ g , ̂ 2u g1 to estimate APEs. I know of no simulation evidence that studies APEs based on (57); they might be reasonable estimates for moderate M g . Fernández-Val (2005) contains some relevant theory. 3.1.4. Fractional Responses 42 (57) Recently, more applied econometric work is recognizing that fractional responses can require special treatment. Naturally, one can always opt for a linear model (as in the case of binary responses, count, and corner solutions responses). But there is more interest in using functional forms that ensure expected values are in the unit interval. One possibility for responses that have probability mass at zero and one is to adopt a two-limit Tobit model. The analysis would then be very similar to Section 3.1.3. In particular, the formulas for the expected value in the two-limit case can be easily adapted to allow the presence of unobserved heterogeneity, either assumed independent of the covariates or with Dc g |z̄ g assumed normal. The drawback of the two-limit Tobit model is that it makes sense only when there is pile up a zero and one. Even in such cases, if fully specifies the distribution Dy gm |x g , Z g . If we are primarily interested in effects on the mean response, more robust methods are available. These methods produce consistent estimators with pile up at neither, both, or only one endpoint. Papke and Wooldridge (2005) recently proposed methods for fractional responses for balanced panel data sets. Here I draw on that work and make suggestions for the unbalanced case. In fact, the methods are virtually identical to the results for the probit case, but with a change in how we interpret the probit response function. With y gm a fractional response, we begin with Ey gm |x g , Z g , c g x g z gm , m 1, . . . , M g . Equation (58) is attractive as a mean response In other words, we simply assume the mean response has the probit form. [The logit form is also possible but does not lead to simple estimating equations.] If we add the independence and normality assumption 43 (58) c g |x g , Z g ~Normal0, 2c , (59) just as in (24), then we have an extension of (25): Ey gm |x g , Z g c x g c z gm c , where the “c” index denotes scaling by 1/1 2c 1/2 . Because the Bernoulli quasi-likelihood function identifies the parameters of a correctly specified conditional mean – see Gourieroux, Monfort, and Trognon (1984) or Papke and Wooldridge (1996) – we can use pooled “probit” to consistently estimate the scaled parameters and APEs. Naturally, we should use a fully robust sandwich variance matrix estimator to account for within-group correlation but also for the fact that the Bernoulli variance assumption can no longer be expected to hold. See Papke and Wooldridge (2005) for more discussion. A random effects probit analysis does not make sense here because we do not have a fully specified distribution. But the GEE methods described for probit apply immediately. In fact, using standard GEE commands, you would not even know y gm is fractional or binary (except possibly for a “warning” that you applying “probit” to a nonbinary response). The discussion about allowing correlation between c g and Z g is identical to the binary response case. Again, it is important to allow at least Varc g |z̄ g depend on the group size, and possibly the slopes on z̄ g as well. A simple, flexible, but inefficient approach is to estimate a different model for each M g , provided there is not much variation in M g . Better would be to apply a minimum distance approach that recognizes a constant vector on z gm in (58). 3.2 Large Group Size Asymptotics 44 (60) Unlike in the linear case covered by Donald and Lang (2001), for nonlinear models no exact inference is available. Nevertheless, if the group sizes M g are reasonably large, we can extend the approximate inference using the DL approach to nonlinear models. Plus, the minimum distance approach carries over essentially without change. We can apply the methods to any of the nonlinear models in Section 3.1 (as well as others). Here, I illustrate the probit response function (which means it can apply to binary or fractional responses). With small G but random sampling of y gm , z gm : m 1, . . . , M g within each g, write Py gm 1|z gm g z gm m , m 1, . . . , M g g x g , g 1, . . . , G. (61) (62) As with the linear model, we assume the intercept, g in (61), is a function of the group features x g . With the M g moderately large, we can get good estimates of the g . The ̂ g , g 1, . . . , G, are easily obtained by estimating a separate probit for each group. Under (62), we can apply the minimum distance approach just as before. Let Avar̂ g denote the estimated asymptotic variances of the ̂ g (so these shrink to zero at the rate 1/M g ). For binary response, these are just the usual MLE estimated variances. For fractional response, these would be from a sandwich estimate of the asymptotic variance. Then, we obtain the minimum distance estimates as the WLS estimates from ̂ g on 1, x g , g 1, . . . , G using weights 1/Avar̂ g are used as the weights. This is the efficient minimum distance 45 (63) estimator and, conveniently, the proper asymptotic standard errors are reported from the WLS estimation (even though we are doing large M g , not large G, asymptotics.) The overidentification test is obtained exactly as in the linear case: there are G − K − 1 degrees-of-freedom in the test. The same cautions about using the overidentification test to reject the minimum distance approach apply here as well. In particular, in the treatment effect setup, where x g is zero or one, one might reject a weighted comparision of means simply because the means within the control or within the treatment group differ, or both. It might make more sense to define, using other considerations, weighted averages of the population means, and then to estimate those means using within-group sample averages. Similar considerations may be relevant when x g is a vector with a more complicated structure. If we impose common slopes g , g 1, . . . , G, then in applying MD estimation we should account for the across group correlation in the ̂ g , as described in Section 2.3. This is also true if we impose that a subset of the slopes are common across g. If we reject the overidentification restrictions and wish to have conservative inference, then we can adapt Donald and Lang and treat ̂ g x g error g , g 1, . . . , G as approximately satisfying the classical linear model assumptions, provided G K 1, just as before. As in the linear case, this approach is justified if g x g c g with c g independent of x g and c g drawn from a homoskedastic normal distribution. It assumes that we can ignore the estimation error in ̂ g , based on ̂ g g O1/ M g . Because the DL approach ignores the estimation error in ̂ g , it is unchanged if one imposes 46 (64) some constant slopes across the groups, as with the linear model. Once one estimates and , the estimated effect on the response probability can be obtained by averaging the response probability for a given x: Mg G G −1 ∑ g1 M −1 g ∑ ̂ x̂ z gm ̂ g , m1 where derivatives or differences with respect to the elements of x can be computed. Here, the minimum distance approach has an important advantage of the DL approach: the finite sample properties of (65) are viritually impossible to obtain, whereas the large-M g asymptotics underlying minimum distance would be straightforward using the delta method. Whether bootstrapping can be applied is an interesting question. Particularly with binary response problems, the two-step methods described here are problematical when the response does not vary within group. For example, suppose that x g is a binary treatment – equal to one for receiving a voucher to attend college – and y gm is an indicator of attending college. Each group is a high school class, say. If some high schools have all students attend college, one cannot use probit (or logit) of y gm on z gm , m 1, . . . , M g . A linear regression returns zero slope coefficients and intercept equal to unity. Of course, if randomization occurs at the group level – that is, x g is independent of group attributes – then it is not necessary to control for the z gm . Instead, the within-group averages can be used in a simple minimum distance approach. In this case, as y gm is binary, the DL approximation will not be valid, as the CLM assumptions will not even approximately hold in the model ȳ g x g e g (because ȳ g is always a fraction regardless of the size of M g ). As in the linear case, more flexibility is afforded if G is somewhat large along with large M g . Then, in implementing the DL approach – with or without common slopes imposed in the 47 (65) first stage – one gains robustness to nonnormality of c g if G is large enough so that G G G −1 ∑ g1 c g and G −1 ∑ g1 x g c g are approximately normally distributed. If we estimate separate models in the first stage and do not wish to ignore the estimation error in ̂ g , then error g in (64) is heteroskedastic, and we can, perhaps, use heteroskedasticity-robust standard errors in (64). 4. CONCLUDING REMARKS My purpose is writing this expanded version of Wooldridge (2003) is to flesh out some issues only touched on in the shorter paper, while also considering more general frameworks. This version of the paper contains what are best described as conjectures, or at least suggestions that may not turn out to work very well. Future research could resolve the issue of how to allow group heterogeneity to be correlated with the unit-specific covariates when the group sizes differ. Minimum distance estimation strikes me as promising, but that remains to be seen. Less parametric approaches should also be considered. Much more remains to be learned about the “small G” case, too, particularly for nonlinear models. A similuation study comparing Donald and Lang’s (2001) conservative approach to minimum distance estimation, when neither method is entirely appropriate, would be worthwhile. For example, in a treatment effect case with G 4, two treatment groups and two control groups, with different means within each group, one might select different, modest sizes of M g (ranging from maybe 20 to 40). The underlying population distributions can be normal. Which method, DL or minimum distance, produces the best confidence intervals for 48 the ATE (defined appropriately for a fictitious population)? 49 REFERENCES Angrist, J.D. and V. Lavy (2002), “The Effect of High School Matriculation Awards: Evidence from Randomized Trials,” NBER Working Paper 9389. Arellano, M. (1987), “Computing Robust Standard Errors for Within-Groups Estimators,” Oxford Bulletin of Economics and Statistics 49, 431-434. Baker, M. and N.M. Fortin (2001), “Occupational Gender Composition and Wages in Canada, 1987-1988,” Canadian Journal of Economics 34, 345-376. Bertrand, M., E. Duflo, and S. Mullainathan (2002), “How Much Should We Trust Differences-in-Differences Estimates?” mimeo, MIT Department of Economics. Blundell, R., R. Griffith, and F. Windmeijer (2002), “Individual Effects and Dynamics in Count Data Models,” Journal of Econometrics 108, 113-131. Campolieti, M. (2004), “Disability Insurance Benefits and Labor Supply: Some Additional Evidence,” Journal of Labor Economics 22, 863-889. Card, D., and A.B. Krueger (1994), “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania,” American Economic Review 84, 772-793. Donald, S.G. and K. Lang (2001), “Inference with Difference in Differences and Other Panel Data,” mimeo, University of Texas Department of Economics. Fernández-Val, I. (2005), “Estimation of Structural Parameters and Marginal Effects in Binary Choice Panel Data Models with Fixed Effects,” mimeo, Boston University Department of Economics. 50 Hausman, J.A., B.H. Hall, and Z. Griliches (1984), “Econometric Models for Count Data with an Application to the Patents-R&D Relationship,” Econometrica 52, 909-938. Hsiao, C. (1986), Analysis of Panel Data. Cambridge: Cambridge University Press. Kézdi, G. (2001), “Robust Standard Error Estimation in Fixed-Effects Panel Models,” mimeo, University of Michigan Department of Economics. Liang, K.-Y., and S.L. Zeger (1986), “Longitudinal Data Analysis Using Generalized Linear Models,” Biometrika 73, 13-22. Loeb, S. and J. Bound (1996), “The Effect of Measured School Inputs on Academic Achievement: Evidence form the 1920s, 1930s and 1940s Birth Cohorts,” Review of Economics and Statistics 78, 653-664 Moulton, B.R. (1990), “An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units,” Review of Economics and Statistics 72, 334-338. Papke, L.E. and J.M. Wooldridge (1996), “Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates,” Journal of Applied Econometrics 11, 619-632. Papke, L.E. and J.M. Wooldridge (2005), “Panel Data Methods for Fractional Response Variables with an Application to Test Pass Rates,” mimeo, Michigan State University Department of Economics. Pepper, J.V. (2002), “Robust Inferences from Random Clustered Samples: An Application Using Data from the Panel Study of Income Dynamics,” Economics Letters 75, 341-345. White, H. (1980), “A Heterosekdasticity-Consistent Covariance Matrix Estimator and A Direct Test for Heteroskeasticity,” Econometrica 48, 817-838. White, H. (1982), “Maximum Likelihood Estimation with Misspecified Models,” 51 Econometrica 50, 1-26. White, H. (1984), Asymptotic Theory for Econometricians. Academic Press: Orlando, FL. Wooldridge, J.M. (1999) “Distribution-Free Estimation of Some Nonlinear Panel Data Models,” Journal of Econometrics 90, 77-97. Wooldridge, J.M. (2002), Econometric Analysis of Cross Section and Panel Data. MIT Press: Cambridge, MA. Wooldridge, J.M. (2003), “Cluster-Sample Methods in Applied Econometrics,” American Economic Review 93, 133-138. Wooldridge, J.M. (2005), “Unobserved Heterogeneity and Estimation of Average Partial Effects,” in Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. D.W.K. Andrews and J.H. Stock (eds.). Cambridge: Cambridge University Press, 27-55. 52 APPENDIX Below are commands in version 8.0 of Stata that can be used for robust inference with cluster or panel data sets. All commands assume that there is an identifier, “id,” that identifies the cluster or group for each observation. Section 2.2 Pooled OLS reg y x1 x2 ... xK z1 z2 ... zL, cluster(id) Random Effects: Usual Inference iis id xtreg y x1 ... xK z1 ... zL, re Random Effects: Fully Robust Inference iis id xtgee y x1 x2 ... xK z1 ... zL, corr(exch) robust Note: The xtgee and xtreg estimates will generally differ for two reasons. First, xtgee uses estimates of 2c based on an initial pooled OLS estimation, while xtreg uses a fixed-effects based estimate. Second, xtgee iterates, while xtreg does not. Section 2.3 A separate linear model is estimated for each group identifier to obtain the asymptotic 53 variances of the intercepts. In the second step, OLS can be used (as in DL), relying on the reported t statistics and critical values, or minimum distance (implemented as WLS) using the standard normal approximation. reg y z1 ... zL if id g Section 3.1.1 Pooled Probit probit y x1 x2 ... xK z1 z2 ... zL, robust cluster(id) Random Effects Probit iis id xtprobit y x1 x2 ... xK z1 z2 ... zK, re GEE with an Exchangeable Working Correlation Matrix iis id xtgee y x1 x2 ... xK z1 z2 ... zK, family(bin) link(probit) corr(exch) robust or xtprobit y x1 x2 ... xK z1 z2 ... zK, pa (The qualifier “pa” stands for “population averaged.” Stata simply uses the first command when “pa” is used with xtprobit.) Using the Chamberlain-Mundlak Device All of the previous commands can include the within-group averages of the unit-specific covariates. For example, Chamberlain’s random effects probit model can be estimated by iis id 54 xtprobit y x1 x2 ... xK z1 z2 ... zL z1bar z2bar ... zLbar, re But, as discussed in Section 3.1.1, at a minimum these coefficients are all scaled by a variance that depends on the group size, M g . One could do this estimation for each group size and average the APEs. Even better would be to use minimum distance to impose the restriction of a common in the structural model. Section 3.1.2 Pooled Poisson poisson y x1 ... xK z1 ... zL, robust cluster(id) Random Effects Poisson iis id xtpois y x1 ... xK z1 ... zL, re GEE with an Exchangeable WCM xtgee y x1 ... xK z1 ... zL, family(pois) link(log) corr(exc) robust The Fixed Effects Poisson Estimator iis id xtpois y x1 ... xK z1 ... zL, fe Note: Currently, Stata does not report robust standard errors for FE Poisson Section 3.1.3 Pooled Tobit tobit y x1... xK z1 ... zL z1bar ... zLbar, ll(0) cluster(id) 55 Random Effects Tobit xttobit y x1... xK z1 ... zL z1bar ... zLbar, re Section 3.1.4 Pooled Two-Limit Tobit tobit y x1... xK z1 ... zL z1bar ... zLbar, ll(0) ul(1) cluster(id) Note: Unfortunately, xttobit does not currently allow specifying an upper bound, so the RE version of the two-limit Tobit is not immediately available. Pooled Quasi-MLE glm y x1 ... xK z1 ... zL, family(bin) link(probit) robust cluster(id) GEE with Exchangeable WCM xtgee y x1 ... xK z1 ... zL, family(bin) link(logit) corr(exch) robust Using the Chamberlain-Mundlak Device For GEE, this would be xtgee y x1 ... xK z1 ... zL z1bar ... zLbar, family(bin) link(probit) corr(exch) robust Section 3.2 A separate model is estimated for each group identifier to obtain the asymptotic variances of the intercepts. So, the commands are carried out for each group size, g. In the second step, OLS can be used (as in DL), relying on the reported t statistics and critical values, or minimum distance (implemented as WLS) using the standard normal approximation. Binary Response 56 probit y z1 ... zL if id g Count Response glm y z1 ... zL if id g, family(poisson) link(log) robust Fractional Response glm y z1 ... zL if id g, family(bin) link(probit) robust or glm y z1 ... zL if id g, family(bin) link(logit) robust 57
© Copyright 2025