Optimal Sample Size Allocation for the German Census

Ulf Friedrich, Ralf Münnich, Sven de Vries, Matthias Wagner

1. The German Census 2011

Aims and Background
- First census after the German reunification.
- Instead of interviewing the entire population, the census is sample-based and register-assisted.
- Aim 1: Determination of the official population size.
- Aim 2: Estimation of variables that are not covered by official registers, e.g., professional profile, immigration data.

Methodology
- Partitioning of Germany into 2 391 sampling points (SMP).
- Each SMP is stratified according to the number of persons per address (8 strata).
- Drawing of up to 2 391 · 8 = 19 128 partial samples.
- Objective: Minimization of a statistical quality criterion based on stratum variances.

Data collected from [4].

The Census Problem

\[
\begin{aligned}
\min\ & \sum_{i=1}^{n} \frac{d_i^2}{x_i} \\
\text{s.t.}\ & \sum_{i=1}^{n} x_i \le \beta, \\
& m_i \le x_i \le M_i, \quad i = 1, \dots, n, \\
& x \in \mathbb{Z}^n
\end{aligned}
\]

- $(x_1, \dots, x_n)$ describes the sample sizes in the different strata.
- $d_i^2$ is defined as the product of the known stratum variance and the squared population size of stratum $i$.
- Upper bound on total sample size $\beta = 7\,900\,000$, i.e., 10% of the population.
- Upper and lower bounds $M_i > m_i > 0$ to avoid over- and underestimation of certain strata.

[Figure: The 2 391 SMPs, colored according to population size, source: [4].]

2. Using Greedy to Solve the Census Problem

Key Observation: The feasible set is a polymatroid.

Simple Greedy
- Greedy algorithm on polymatroids works for linear objective functions.
- Not directly applicable in the non-linear case.
- Reformulation as linear matroid optimization problem.
- Solution by matroid Greedy, see [2].
- Disadvantage: exponential running time.

[Figure: A polymatroid in $\mathbb{R}^3$.]

Reformulation

\[
\min \sum_{i=1}^{n} \sum_{j=1}^{M_i} r_{ij} x_{ij}
\quad \text{s.t.} \quad
\sum_{i=1}^{n} \sum_{j=1}^{M_i} x_{ij} \le \beta, \quad x_{ij} \in \{0, 1\},
\]

with $r_{i1} = d_i^2$ and $r_{ij} = \frac{d_i^2}{j} - \frac{d_i^2}{j-1}$ for $j > 1$.

Capacity Scaling
- Aim: Greedy algorithm with larger increments rather than unit increments.
- First, the problem is solved using the step size $\log_2(\max_i M_i)$, which is then successively divided by 2 until step size 1 is reached.
- Correctness and polynomial running time are known, see [3].

3. Our Approach: Binary Search

Underlying Idea:
- Simple Greedy divides the problem variables into many binary variables and chooses them one by one.
- Thus, the entire solution can be reconstructed from the coefficient $r_{ij}$ of the last variable $x_{ij}$ set to 1.
- Apply a binary search to find this coefficient.
- Use the convexity of the objective function to speed up the calculation of medians.

Pseudo-Code

    While $\sum_i x_i \ne \beta$ do
        Find the (lower) medians $\bar{x}_i = \lfloor (M_i + m_i + 1)/2 \rfloor$ and $\bar{c}_i = \frac{d_i^2}{\bar{x}_i} - \frac{d_i^2}{\bar{x}_i - 1}$;
        Compute the super-median $s = \mathrm{Median}(\bar{c}_1, \dots, \bar{c}_n)$;
        Find the inverses $y_i = 0.5 + \sqrt{0.25 - d_i^2 s^{-1}}$;
        Truncate the inverses at the box constraints if necessary;
        If $\sum_i y_i < \beta$ then update lower bounds: $m_i \leftarrow y_i$;
        Else-If $\sum_i y_i > \beta$ then update upper bounds: $M_i \leftarrow y_i$;
    End-While
    Return $y$;

4. Numerical Results

Performance
- Easy to code: Binary Search requires just 100 lines of C++.
- Code running on Intel(R) Core(TM)2 Duo CPU with 3.00 GHz and 4 GB RAM.
- Benchmark: Fixed-point iteration based on the Lagrangian relaxation and rounding, see [5].

5. Binary Search: Worst-Case Running Time

- Each iteration: super-median $O(n)$, remainder $O(n)$.
- Number of iterations bounded by $O(n \log_2(\max_i M_i))$.
- Total running time $O(n^2 \log_2(\max_i M_i))$.
- Theorem: Binary Search solves the Census Problem in a running time polynomial in the coding length of the input.

Numerical results: running time and number of iterations per algorithm

    Algorithm          Time [s]   # Iterations
    Benchmark          0.02       –
    Simple Greedy      0.67       4,567,312
    Capacity Scaling   0.06       227,492
    Binary Search      0.01       23

Comparison to Relaxed Solution
- Compared to the solution obtained by a rounding heuristic in [5], there is a slight improvement in the objective.
- 25 SMPs are missing exactly one element.
- The differences appear randomly distributed over all strata.

[Figure: SMPs with differences in partial samples computed by rounding and Binary Search, German state of Hesse.]

6. Conclusions

- Greedy strategies find the global minimum of a separable, convex objective function over the integer points of a polymatroid.
- However, the worst-case running time for Simple Greedy is exponential.
- Binary Search solves the problem in polynomial time.
- Competitive with algorithms for the continuous relaxation.
- Practically relevant and fast enough for very large instances of sampling problems and similar applications.

References:
[1] U. Friedrich, R. Münnich, S. de Vries, M. Wagner, Integer optimization in stratified sampling, Computational Statistics and Data Analysis, submitted for publication.
[2] H. Groenevelt, Two algorithms for maximizing a separable concave function over a polymatroid feasible region, European Journal of Operational Research, 54(2):227-236, 1991.
[3] D. Hochbaum, Lower and upper bounds for the allocation problem and other nonlinear optimization problems, Mathematics of Operations Research, 19(2):390-409, 1994.
[4] R. Münnich, S. Gabler, M. Ganninger, J. Burgard, J.-P. Kolb, Statistik und Wissenschaft, Vol. 21: Stichprobenoptimierung und Schätzung im Zensus 2011, Destatis, Wiesbaden, 2012.
[5] R. Münnich, E. Sachs, M. Wagner, Numerical solution of optimal allocation problems in stratified random sampling under box constraints, AStA Advances in Statistical Analysis, 96:435-450, 2012.

Contact: Ulf Friedrich, Department of Mathematics, University of Trier, 54286 Trier, Germany, friedrich@uni-trier.de
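
Supplementary sketch: Simple Greedy on a toy instance

The Simple Greedy strategy from Section 2 can be illustrated with a short Python sketch. It is not the authors' C++ code. It is a minimal illustration under the assumption that the objective is $\sum_i d_i^2 / x_i$ with box constraints and a total budget $\beta$, as in the Census Problem. The names `simple_greedy` and `objective` and the three-stratum data are hypothetical. Each step raises by one unit the stratum whose objective decreases most, namely by $d_i^2 / (x_i (x_i + 1))$.

```python
import heapq


def objective(d2, x):
    """Value of sum_i d2[i] / x[i] for an allocation x."""
    return sum(d / xi for d, xi in zip(d2, x))


def simple_greedy(d2, lower, upper, beta):
    """Unit-increment greedy for
    min sum_i d2[i] / x[i]  s.t.  sum_i x[i] <= beta,  lower[i] <= x[i] <= upper[i],
    with integer x. Assumes all lower bounds are positive."""
    x = list(lower)
    budget = beta - sum(x)
    if budget < 0:
        raise ValueError("lower bounds already exceed the budget beta")

    def gain(i):
        # Decrease of the objective when x[i] grows by one unit:
        # d2/x - d2/(x + 1) = d2 / (x * (x + 1)).
        return d2[i] / (x[i] * (x[i] + 1))

    # heapq is a min-heap, so gains are stored with a negative sign.
    heap = [(-gain(i), i) for i in range(len(x)) if x[i] < upper[i]]
    heapq.heapify(heap)

    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        x[i] += 1
        budget -= 1
        if x[i] < upper[i]:
            heapq.heappush(heap, (-gain(i), i))
    return x


# Hypothetical toy data: three strata, budget of nine sample units.
d2 = [100.0, 25.0, 4.0]
x = simple_greedy(d2, lower=[1, 1, 1], upper=[6, 6, 6], beta=9)
print(x, objective(d2, x))  # allocation is [5, 3, 1]
```

The loop performs one heap operation per sample unit, so the number of iterations grows with $\beta$ itself rather than with its coding length. This is the drawback that Capacity Scaling and Binary Search are meant to avoid.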