Optimal allocation algorithm for a multi-way stratification design Paolo Righi1, Piero Demetrio Falorsi2 1 Italian National Statistical Institute, e-mail: parighi@istat.it Italian National Statistical Institute, e-mail: falorsi@istat.it 2 Abstract The paper shows a sampling allocation algorithm for obtaining planned sample size for domains belonging to different partitions of the population guaranteeing the sampling errors of domain estimates be lower than given thresholds. The algorithm is innovative because may be applied when a multi-way stratified design is used. That kind of design is useful when the overall sample size is bounded and consequently the standard solution of using a stratified sample with the strata given by cross-classification of variables defining the different partitions (one-way design) is not feasible since the number of strata is larger than the overall sample size. Moreover, the one-way designs could define too detailed strata with a small number of population units, leading to a heavy response burden for such units that have to be drawn in every survey occasion. Multi-way design overcomes this critical aspect and it allows to allocate the sample in a more efficient way of one-way design, because it has not constraints on the number of sample units in the cross-classified strata. The algorithm is implemented to cover a multivariate-multidomain allocation usually considered in the large scale surveys. Keywords: Optimal allocation, Multi-way stratification, Large scale survey 1. Introduction Large scale surveys in Official Statistics usually produce estimates for a set of parameters by a huge number of highly detailed estimation domains. These domains generally define not nested partitions of the target population. When the domain indicator variables are available for each sampling unit at the sampling framework level, the survey sampling planner could try to select a sample covering each domain. Using this kind of design direct estimates can be obtained for each domain. Furthermore, it can be possible to control the sampling errors at domain level as well. A standard solution to obtain planned sample sizes for the domains of two or more partitions is to use a stratified simple random sampling in which strata are identified by cross-classification of variables defining the different partitions. In the following, this design will be denoted as crossclassification design or one-way design. However, in many practical situations the crossclassification design is unsuitable since it needs the selection of at least a number of sampling units as large as the product of the number of categories of the stratification variables. Cochran well illustrates (1977, p. 124) this problem giving a clear example in which the cross-classification design is unfeasible. In some circumstances even though it may be applied, the cross-classification design is quite expensive. For instance, in the business surveys, following the European Council Regulation on Structural Business Statistics (SBS) establishes that the parameters of interest refer to estimation domains defined by three different partition of the population of enterprises (Economic activity class NACE classification, Economic activity group NACE classification by Size class, Economic activity division NACE classification by Region). In Italy, the total number of estimation domains is about 1,800; while the number of non-empty strata of the crossclassification design is larger than 37,000. In the social context, an example regards the Italian Graduates’ Career Survey conducted by the Italian National Statistical Institute (ISTAT). This survey disseminates estimates on the employment status of graduates three years after the degree at level of several types of domains: gender, course, group of course, university center, defining an overall number of domains greater than 600. The usual sample design considers the crossclassification of all the variables defining the domains, obtaining more than 2,200 strata. There are further drawbacks in the cross-classification design such as: a greater statistical burden due to the high level of detail entailing domains with very few units that will be contacted in every survey occasion (this is the case of the SBS surveys); an ineffectiveness of the sample allocation when we have to select al least two units per stratum (for obtaining unbiased variance estimates) because in some strata, according to the used allocation method, a not integer or less than two sample units could be required. In order to overcome some problems of cross-classification designs, an easy strategy is to drop one or more stratifying variables or to group some of the categories. Nevertheless, some planned domains become unplanned and some of them can have small or null sample size. A second approach is to use a multi-way stratified sampling design. This design keeps under control the sample size in all the domains without using cross-classification approach. There are several methods implementing multi-way design, but, usually, in the large scale surveys they have problem of application (Falorsi et al., 2006). This is not the case of the cube method (Deville and Tillé, 2004) that may select a random sample of multi-way stratified design for a large population and a lot of domains. Given the inclusion probability vector, the method implemented by an algorithm selects a sample satisfying the expected sample size at domain level. In this context, the paper illustrates a sampling allocation algorithm useful for obtaining planned sample size (or equivalently the inclusion probability vector) for domains belonging to different partitions of the population and in order to guarantee the sampling errors of domain estimates be lower than given thresholds. The algorithm covers the multivariate-multidomain problem is implemented to take into account the selection is made by means of the cube method. In the following, section 2 introduces the parameter of interest, section 3 describes the sampling strategy, section 4 gives the details of the proposed algorithm, while section 4 is devoted to a Monte Carlo simulation on real survey data. 2. Parameters of interest Let us denote with U a population of N elements and with b a specific partition of U (b=1,…, B) in which b-th partition defines M b different non overlapping domains, U bd (d=1, …, M b ) , of size N bd being ∑d =1 N bd = N Mb and, finally let ∑b =1 M b = Q be the B overall number of domains. Let yr ,k and bd δ k denote respectively the value of the r-th (r =1, …, R) variable of interest in the k-th population unit and the domain membership indicator, being if k ∈ U bd and bd δ k = 0 , otherwise. Let us suppose that the each unit in the population. The parameters of interest are the M = Q × R domains totals t = bd r ∑ k ∈U y r ,k bd δk = ∑ k ∈U bd y r ,k bd δ k bd δ k =1 values are known for (r = 1,…,R ; b=1,…, B; d=1,…, M b ). (2.1) The expression (2.1.) defines a multivariate-multidomain problem since there are R variables of interest (multivariate aspect) and Q > 1 domains (multidomain aspect). 3. Sampling strategy We describe the sampling allocation algorithm for a multi-way stratification design with the Horvitz-Thompson (H-T) estimator. However, applications of generalized regression estimator is straightforward (Falorsi and Righi, 2008). Multi-way stratification designs can be treated in the context of the balanced sampling. Following the design based (or model assisted) approach a sample is balanced when the H-T estimates of the auxiliary variable totals are equal to their known population totals (Deville and Tillé, 2004). For defining the balanced sampling we introduce the general definition of sampling design as a probability distribution p(.) on the set S of all the subset s of the population U such that ∑s∈S p ( s ) = 1 , where p(s) is the probability of the sample s to be drawn. Each set s may be represented by the outcome λ ′ = (λ1 , ..., λk , ..., λ N ) of a vector of N random variables. Let π ′ = (π 1 ,..., π k ,..., π N ) be the vector of inclusion probabilities, where π = E p ( λ ) = ∑s∈S p( s )λ , being E p (⋅) the expected value over repeated sampling. Let z ′k = ( z1k ,..., z hk ,..., zQk ) be a vector of Q auxiliary variables available for each population unit. The sampling design p(s) with inclusion probabilities π is said to be balanced with respect to the Q auxiliary variables if and only if tˆz ,ht = t z (balancing equations) for all s∈S such that p(s)>0 where tˆz,ht = ∑k∈U z k λ k a k denote the H-T estimates of t z = ∑k∈U z k being a k = 1 / π k . We assume that a vector of inclusion probabilities π is such that ∑π k ∈U bd k = nbd (b=1,…, B; d=1,…, M b ), (3.1) where nbd is the sample size of the domain U bd , being nbd an integer number. Multiway stratification design represents a special case of balanced design where for unit k the auxiliary variable vector is given by b =1 b= B 647 48 647 48 z′k = (0,..., π k ,...,0,..., 0,..., π k ,...,0) = π k (11δ k ,..., bd δ k ,..., BM B δ k ) = π k δ′k . 14444244443 (3.2) Q The expression (3.2) defines the z k as vectors of (Q-B) zeros and with B entries equal to π k in the places indicating the domains which the unit k belongs to. When defining the z k vector as (3.2), if condition (3.1) holds, the selection of sample satisfying the system of balancing equations, guarantees that the nbd values are non random quantities. Deville and Tillé (2004) proposed the cube method that allows one the random selection of balanced samples involving a large set of auxiliary variables of type (3.2) and with respect to different vectors of inclusion probabilities. The cube method is implemented by an enhanced algorithm for large data sets (Chauvet and Tillé, 2006) available in a free software code that may be downloaded in the website http://www.insee.fr/fr/nom_df_met/outils_stat/cube/accueil_cube.htm . It is also available an R-package at http://cran.r-project.org/web/packages/sampling/index.html . 4. Allocation algorithm for multi-way stratification design The aim of paper is to define procedure defining the inclusion probabilities π k , and the domain sample sizes nbd , which attempts to minimize the overall sample size, n, guaranteeing that the sampling variances are lower than prefixed level of precision thresholds, bd Vr : V p ( bd tˆr ,ht | tˆz ,ht = t z ) ≤ bd Vr (b=1,…, B; d=1,…, M b ; r=1,…, R) where bd Vr is the variance threshold and V p ( bd tˆr ,ht | tˆz ,ht = t z ) is the variance of the H-T estimator bd tˆr ,ht of bd t r given tˆz ,ht = t z (balanced sample). We use an approximated expression of the variance of balanced sampling given by Deville and Tillé (2005), assuming that, through Poisson sampling, the vector (tˆr ,ht , tˆz′,ht )′ has approximately a multinormal distribution. The authors suggest the sampling variance can be approximated by V p ( bd tˆr ,ht | tˆz , ht = t z ) = V p ( bd tˆr ,ht + (t z − tˆz ,ht )′ B z , y ) = V p ( bd tˆr , ht − tˆz′,ht B z , y ) = = V p ( ∑ a k ( bd δ k y r ,k − z ′k B z , y )) ≅ k∈s where bd N ∑ N − Q k∈U 1 − 1 bdη r2,k , (4.1) πk η = ( bd δ k yr , k − z′k B z , y ) 2 with 2 r ,k B z , y = [∑k ∈U z k z ′k (1 / π k − 1)] −1 ∑ k ∈U bd δ k z k yr ,k (1 / π k − 1) . In order to define the inclusion probabilities π k a two steps procedure is executed: in the first step, denoted as optimization, the preliminary inclusion probabilities, π k′ , are determined solving a minimum constrained problem; in the second step, denoted as calibration, the final inclusion probabilities, π k , are obtained as a slight modification of the π k′ . The calibration step assures the domain sample sizes nbd are integers. As illustrated in the following, the π k values may be expressed as implicit functions of the unknown residuals bd η r2,k . But, in real survey context, the determination of the inclusion probabilities π k may be done using the predictions bd ηˆr2,k instead of bd η r2,k . This is a general problem concerning the planning of the sampling designs, because the variances are generally unknown quantities that may be suitably estimated. We describe the allocation algorithm first assuming bd η r2,k as they were known. The allocation takes into account a purely design based variance. Later we consider the uncertainty on the bd ηˆ r2,k values leading to consider the anticipated variance (Isaki and Fuller, 1982). 4.1. Sampling allocation algorithm when bd η r2,k are known First step: Optimization The inclusion probabilities π k′ can be defined as solution of the following non linear programming problem with N unknowns, π k′ , and ( N + Q × R ) constraints Min ∑ π k′ k∈U 1 (4.2) N − 1 bdη r2, k ≤ bdVη , r (b = 1,..., B; d = 1,..., M b ; r = 1,..., R) N − Q k∑ ∈U π k′ 0 < π k′ ≤ 1 (k = 1,..., N ) The problem (4.2) without the inequalities on the inclusion probabilities has been dealt with by Bethel (1989) that invokes the Kuhn-Tucker theorem to show that there exists a solution to the above problem. He describes a simple algorithm and discusses its convergence properties. Chromy (1987) develops an algorithm, suitable for automated spreadsheets but without an explicit proof that always converges. Originally the algorithm it has been for the multivariate allocation in stratified surveys. In general allows one to find the unknown values vh > 0 (h=1,2…) which represent the solution of the following non linear problem Min (∑hν h ) under the constraints ∑h Arh ν h ≤ Ar , where Arh and Ar (r=1,2,…) are known positive quantities. In our context we have to take into account two modifications. The first is needed for guaranteeing the inequalities 0 < π k′ ≤ 1 ( k = 1, ..., N ) are respected. The second modification is related to the dependence of 2 bd η r ,k on B z , y that is function of the unknown terms. For our purpose we develop the square of bd η r2, k = bd δ k yr2, k + π k2 g k2 − 2 π k bd 2 bd η r ,k δ k yr , k g k , being g k = δ′k B z , y . Then the inequality in (4.2) is equivalent to as 1 N bd δ k yk2 ≤ bdV y ,r . − 1 ∑ k ′ N −Q πk where (4.3) [ ] N ∑k 2 (1 − π k′ )bd δ k yr ,k g k − ∑k π k′ (1 − π k′ ) g k2 . (4.4) N −Q To face this problem we launch Chromy algorithm several time as follows. We fix the inclusion probabilities at initial values (for instance 0.5) in the right side of the (4.4) and we find the solution of the (4.3) by means of the Chromy algorithm. In the iterative process the bdV y ,r do not change. When the solution is found the new set of inclusion Vy , r =bdVη , r + bd probabilities are plugged in the V y ,r and the Chromy algorithm is launch again. When bd the bdV y ,r do not modify from a lunch to another the inclusion probabilities are defined. After the Initialization, the Chromy algorithm finds the inclusion probabilities, in each launch, by iterating the two actions of Calculus and Check. As far as the convergence issue is concerned, it is worthwhile to note that the Chromy’s algorithm have been mostly used for stratified sampling design, and indeed, the documentation refers to stratified samples. Let us note that the modification of the Chromy’s algorithm, herein proposed, treats the sampling units as strata and the resulting allocation, being fractional, defines the inclusion probabilities. Initialization: at initial iteration ( τ = 0 ), set τ γ k = 1 (k=1, …, N). Calculus: the generic iteration ( τ = 1,2,... ) consists of a sequence of steps denoted with u ( = 0, 1, 2, ...) . N τ ,u 2 τ − At initial step ( u = 0 ), set bd φr = 1 and calculate bdτ V0r = ∑ bdη r ,k γ k . N − Q k∈U − At subsequent steps (u=1,2,…), calculate the values of the following equations 1/ 2 Mb R B N 2 τ ,u . π k = (1−τ γ k ) + τ γ k ∑ ∑ ∑ bd φr bdη r , k N − Q b =1 d =1 r =1 N 1 τ ,u τ ,u τ τ ,u 2 τ ∑ bd Vr = bdη r ,k γ k , and bd V r′ = bd V r + bd V0 r . τ ,u N − Q k∈U πk τ ,u (4.5) (4.6) − If the following two conditions: τ ,u τ ,u τ ,u (4.7) bd Vr′ ≤ bd Vr and bd φr ( bd Vr′ − bd Vr ) = 0 , are respected (for all b=1, …, B; d=1, …, M b ; r=1, …, R) then the action of Calculus stops and the inclusion probabilities τ π k are those calculated in equation (4.5). Otherwise, the updated quantities τ ,u +1 bd τ ,u +1 bd φr are computed τ ,u φr = bd φr [τbd,u Vr /(τbd,u Vr′ − bdτ Vr )] 2 and the equations (4.5) and (4.6) are calculated at u+1, over and over again with τ , u +1 τ ,u bd φr replacing bd φr until conditions (4.7) are respected. Check: if the condition τ π k ≤ 1 is true for all k , then the algorithm stops and the π k′ values are set equal to π k′ =τ π k . Otherwise the τ +1 γ k = 1 if τ π k ≤ 1 and τ γ k values are updated as τ +1 γ k = 0 if τ π k > 1. The calculus is iterated at τ + 1 with τ +1 γk τ replacing γ k . A SAS macro that allows one to solve the problem (4.2) has been developed by the authors of this paper and may be released on demand. 4.2. Sampling allocation algorithm when 2 bd η r ,k are predicted In practice, yr ,k and bd η r2,k are unknown and we must use a prediction in the algorithm achieved by means of a statistical model. The algorithm has to take into account the model uncertainty. Let ξ be the superpopulation model yk ,r + uk yk ,r = ~ , 2 2 Eξ (u k ) = 0 ∀k ; Eξ (u k ) = σ k ; Eξ (u k , ul ) = 0 ∀k ≠ l being ~yk ,r = x′k β . We consider a generic model and do not make any reference to the previous model underlines the balanced sampling. Then, bd η r ,k = bd δ k ( ~ yr ,k + uk ) − z′k B z , y and an upward approximation of the anticipated variance is given by 1 1 AV p ( bd tˆr , ht | tˆz ,ht = t z ) =& ∑ k ∈U bd η~r2,k − 1 + ∑k ∈U bd δ k σ r2,k − 1 (4.8) πk πk The (4.8) outlines in this case the input variance has a term of variability depending on the explicative power of the prediction model. 5. Simulation We have made a Monte Carlo experiment for an evaluation of the performances of the algorithm, in particular to verify the convergence properties. We have used a subpopulation of universe of the Italian Graduates’ Career Survey already introduced in section 1. The survey is based on about 300,000 of graduates and produces estimates for 8 types of domains (8 partition of the population). In particular two very detailed partitions are not nested, while the rest of partitions are aggregation of these two domain types. Then the current stratification is based on the cross-classification given by the variables degree by sex (first partition) and university by subject area degree (second partition). More than 6,000 strata are obtained while for the two partitions we have respectively about 160 and 90 domains. The sample allocation is defined to control the sampling errors of two driving variables: the employed status (yes/no), denoted by y1 and for the not employed graduates actively seeking work (yes/no) variable, denoted by y2 . The sample allocation uses the standard Chromy algorithm according to Bethel approach (1989). The H-T estimator is currently used to estimate the totals of the two variables at domain level. In the simulation, we have selected a subpopulation of 3.427 graduates covering respectively 20 and 15 domains of the two partitions. The subpopulation distributions on the two partitions are skewed as the overall population. Table 1 shows the number of domains by the percentage of graduates of the subpopulation. Table 1: Number of domains by percentage class of graduates Class percentage of graduates in a domain Partition >50% >1% 15%⁄8% 8%⁄5% 5%⁄1% First 1 (50.6%) 2 0 10 7 Second 1(53.3%) 2 1 9 1 Total 20 15 We may see that more the 50% of graduates belong to a single domain in the first and second partition. Moreover, for the first partition 10 domains include a percentage of graduates between 5% and 1%. In the second partition the number of domains with such percentage is 9. The proposed approach needs a prediction of the binary variables y1 and y2 . We use a logistic model exploiting some auxiliary variables known at population level. We introduce briefly the models for the two variables because the aim of the paper is mainly to verify the convergence of the algorithm. Nevertheless, the choice of the prediction model is crucial for reducing the AV in (4.8). For the two models we used the following auxiliary variables: degree mark, sex, age class and an aggregation of subject area degree. The aggregation in the last variable is different in two models. Then we fixed the precision thresholds in terms of coefficient of variation (Table 2). These values have been really given for the last sample allocation of the survey. Table 2: Coefficient of variation threshold (%) for the y1 and y 2 variables by the types of domains Partition y1 y2 First 13.2 21.0 Second 12.2 20.0 Allocation when bd η r2,k are known Although we are in an ideal condition, the experiment resembles the case of a use of a highly explicative model. The algorithm converges after six launches obtaining a sample size of about 171 units rounded by calibration to 182. Table 3 gives expected coefficient of variation according to (4.1) and the values obtained by means a Monte Carlo experiment. In the simulation a 1,000 balanced sample of 182 units with the inclusion probabilities resulting by the algorithm have been selected. For the sake of brevity we show the results for the first partition (Table 3). Table 3: Sample allocation, expected and simulated Coefficient of variation (%) of the HT estimates of the totals of y1 and y2 in the first partition of domains First partition domain Population size 1 91 Sample size 10 Expected CV(%) y1 7.2 Expected CV(%) y2 9.3 Simulated CV(%) y1 7.6 Simulated CV(%) y2 9.7 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 97 52 380 136 3 65 64 1733 14 339 33 120 8 150 2 18 22 18 82 Allocation when 2 bd η r ,k When the terms 2 bd η r ,k 11 7 25 9 2 9 9 20 4 20 7 9 3 12 1 5 5 5 9 8.7 1.3 7.4 3.1 7.2 6.4 2.9 9.2 0.1 9.7 1.7 2.4 0.2 8.6 1.8 5.8 0.1 1.7 7.0 14.5 7.0 7.7 9.3 13.0 6.9 14.9 16.0 12.2 13.3 13.0 6.8 11.5 11.2 11.2 15.9 5.0 9.3 10.9 9.0 1.4 7.9 3.3 9.0 6.5 3.1 9.7 0.1 10.1 1.8 2.5 0.2 8.6 2.6 6.0 0.1 1.8 7.2 14.4 7.2 7.9 9.4 16.2 6.9 15.1 15.7 12.7 13.1 13.3 7.0 12.2 10.9 15.7 15.9 5.2 9.5 10.9 are predicted are predicted we take into account the expression (4.8). In this experiment we used an estimation of σ r2,k given by ~yr ,k (1 − ~yr ,k ) being ~yr ,k the yes probability for the yr variable. The algorithm converges after seven launches with about 699 units even though after two launches the sample size was already around 706 units. The calibration step fixes the sample size to 707 units. Table 4 shows the sample allocation and the values of the coefficient of variation (%) computed according to the (4.5) for the first partition. The empirical coefficient of variation base on 1,000 Monte Carlo samples is shown as well. The evidences of the results confirm the (4.8) gives an upward AV with respect to the empirical values of the simulation. 6. Conclusion Commonly, when the National Statistical Institute (NSI) plan a sampling survey, want to observe population units belonging to all domains of interest. Moreover, the NSI have the objective to define the minimum sample size in such a way that the estimates are reliable for each domain. That represents a sample allocation problem that several algorithms are able to solve. However, generally they are based on a one-way stratified design because it guarantees that the sample covers all the domains. Nevertheless, such design can be expensive and in some cases unfeasible. A multi-way design can be the alternative solution. In the paper, we propose an algorithm for the sample allocation for such kind of design assuming that the random selection in the domain is made by the cube method. The algorithm defines the sample size for domains belonging to different partitions of the population guaranteeing the sampling errors of domain estimates be lower than given thresholds. On the other hand the cube method overcomes the problem of the random selection for obtaining planned sample sizes in domains without using the cross-classified strata. In fact other multi-way selection methods have problem in case of large scale surveys. The innovative sampling strategy has been evaluated in a Monte Carlo simulation on real survey data, that has highlighted the following conclusion: the analytical expressions of the algorithm input variances are corrected; the proposed algorithm converges in the analyzed data. Table 4: Sample allocation, expected and simulated coefficient of variation (%) of the HT estimates of the totals of y1 and y2 in the first partition of domains. Case of bd η r2,k estimated First partition domain 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Population size 91 97 52 380 136 3 65 64 1733 14 339 33 120 8 150 2 18 22 18 82 Sample size 40 40 28 90 52 3 32 31 110 11 78 19 44 7 39 2 14 17 13 37 Expected CV(%) y1 8.9 9.6 9.1 7.0 6.3 0.0 10.4 7.5 5.6 8.1 9.6 9.5 7.5 7.2 11.3 0.0 12.1 8.9 10.5 13.1 Expected CV(%) y2 21.1 21.1 21.1 16.2 21.0 0.0 21.0 19.2 21.1 19.5 17.7 20.7 21.0 16.9 20.9 0.0 19.3 21.0 21.0 16.1 Simulated CV(%) y1 7.3 8.0 7.8 5.4 5.1 0.0 8.8 6.2 4.4 6.9 7.5 7.4 6.1 6.2 9.3 0.0 10.4 6.8 9.0 10.5 Simulated CV(%) y2 17.6 16.3 16.1 12.6 16.7 0.0 16.9 16.2 17.1 16.7 14.4 16.6 16.5 15.8 17.0 0.0 17.7 19.5 18.2 13.1 References Bethel, J. (1989) Sample Allocation in Multivariate Surveys, Survey Methodology, 15, 47-57 Chauvet G., Tillé Y. (2006) A Fast Algorithm of Balanced Sampling, Computational Statistics, 21, 53-62 Chromy J. (1987). Design Optimization with Multiple Objectives, Proceedings of the Survey Research Methods Section. American Statistical Association, 194-199 Cochran W.G. (1977) Sampling Techniques, Wiley, New York Deville J.-C., Tillé Y. (2004) Efficient Balanced Sampling: the Cube Method, Biometrika, 91, 893-912 Deville J.-C., Tillé Y. (2005) Variance approximation under balanced sampling, Journal of Statistical Planning and Inference, 128, 569-591 Falorsi P. D., Righi P. (2008) A Balanced Sampling Approach for Multi-way Stratification Designs for Small Area Estimation, Survey Methodology, 34, 223-234 Falorsi P. D., Orsini D., Righi P., (2006) Balanced and Coordinated Sampling Designs for Small Domain Estimation, Statistics in Transition, 7, 1173-1198 Isaki C.T. and Fuller W.A. (1982) Survey design under a regression superpopulation model. Journal of the American Statistical Association, 77, 89-96
© Copyright 2024