Optimal Sample Size Allocation
for the German Census
Ulf Friedrich, Ralf Münnich, Sven de Vries, Matthias Wagner
1. The German Census 2011
Aims and Background
I First census after the German reunification.
I Instead of interviewing the entire population, the census
is sample-based and register-assisted.
I Aim 1: Determination of the official population size.
I Aim 2: Estimation of variables that are not covered by
official registers, e.g., professional profile, immigration
data.
Methodology
I Partitioning of Germany into 2 391 sampling points
(SMP).
I Each SMP is stratified according to the number of persons per address (8 strata).
I Drawing of up to 2 391 · 8 = 19 128 partial samples.
I Objective: Minimization of a statistical quality
criterion based on stratum variances.
Data collected from [4].
The Census Problem
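\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \frac{d_i^2}{x_i} \\
\text{s.t.} \quad & \sum_{i=1}^{n} x_i \le \beta, \\
& m_i \le x_i \le M_i, \quad i = 1, \dots, n, \\
& x \in \mathbb{Z}^n
\end{aligned}
\]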
I $x = (x_1, \dots, x_n)$ describes the sample sizes in the different strata.
I $d_i^2$ is defined as the product of the known stratum variance and the squared population size of stratum $i$.
I Upper bound on total sample size $\beta = 7\,900\,000$, i.e., 10% of the population.
I Upper and lower bounds $M_i > m_i > 0$ to avoid over- and underestimation of certain strata.
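The ingredients of the Census Problem can be written down in a few lines. The following is a minimal sketch, not the authors' implementation: the function names `d_squared`, `objective`, and `is_feasible` are hypothetical, and the numbers in the usage note are made up for illustration.

```python
from typing import Sequence


def d_squared(pop_sizes: Sequence[int], variances: Sequence[float]) -> list:
    """d_i^2 = (population size of stratum i)^2 * (stratum variance)."""
    return [N * N * s2 for N, s2 in zip(pop_sizes, variances)]


def objective(d2: Sequence[float], x: Sequence[int]) -> float:
    """Value of sum_i d_i^2 / x_i for an allocation x."""
    return sum(d / xi for d, xi in zip(d2, x))


def is_feasible(x: Sequence[int], lower: Sequence[int],
                upper: Sequence[int], beta: int) -> bool:
    """Check integrality, the box constraints, and the total sample size."""
    return (all(isinstance(xi, int) for xi in x)
            and all(m <= xi <= M for xi, m, M in zip(x, lower, upper))
            and sum(x) <= beta)
```

For example, with two toy strata of sizes 10 and 20 and variances 4.0 and 1.0, both strata have $d_i^2 = 400$, and the allocation `[4, 8]` has objective value `400/4 + 400/8 = 150`.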
Figure: The 2 391 SMPs, colored according to population size; source: [4].
2. Using Greedy to Solve the Census Problem
Key Observation: The feasible set is a polymatroid.
Simple Greedy
I Greedy algorithm on polymatroids works for linear objective functions.
I Not directly applicable in the non-linear case.
I Reformulation as linear matroid optimization
problem.
I Solution by matroid Greedy, see [2].
I Disadvantage: exponential running time.
Figure: A polymatroid in $\mathbb{R}^3$.
Reformulation
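\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{M_i} r_{ij} x_{ij} \\
\text{s.t.} \quad & \sum_{i=1}^{n} \sum_{j=1}^{M_i} x_{ij} \le \beta, \\
& x_{ij} \in \{0, 1\}
\end{aligned}
\]
with $r_{i1} = d_i^2$ and $r_{ij} = \frac{d_i^2}{j} - \frac{d_i^2}{j-1}$ for $j > 1$.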
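The unit-by-unit greedy on the coefficients $r_{ij}$ can be sketched as follows. This is an illustrative version under stated assumptions, not the code used for the census: the name `simple_greedy` is hypothetical, the lower bounds $m_i \ge 1$ are enforced by starting from them, and a binary heap holds the next coefficient of each stratum.

```python
import heapq


def simple_greedy(d2, lower, upper, beta):
    """Add one unit at a time, always taking the smallest coefficient r_ij.

    Assumes lower[i] >= 1 and sum(lower) <= beta.
    """
    x = list(lower)
    budget = beta - sum(x)
    if budget < 0:
        raise ValueError("lower bounds already exceed beta")
    # Next coefficient of stratum i is r_ij for j = x_i + 1 (always negative).
    heap = [(d / (m + 1) - d / m, i)
            for i, (d, m, M) in enumerate(zip(d2, lower, upper)) if m < M]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        x[i] += 1
        budget -= 1
        if x[i] < upper[i]:
            j = x[i] + 1
            heapq.heappush(heap, (d2[i] / j - d2[i] / (j - 1), i))
    return x
```

The loop runs once per sampled unit, so the number of iterations grows with $\beta$ itself rather than with the number of digits of $\beta$, which is why this approach is not polynomial in the coding length of the input.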
Capacity Scaling
I Aim: Greedy algorithm with larger increments rather than unit increments.
I First, the problem is solved using the step size $\log_2(\max_i M_i)$, which is then successively divided by 2 until step size 1 is reached.
I Correctness and polynomial running time are known, see [3].
3. Our Approach: Binary Search
Underlying Idea:
I Simple Greedy divides the problem variables into many binary
variables and chooses one by one.
I Thus, the entire solution can be reconstructed from the coefficient $r_{ij}$ of the last variable $x_{ij}$ set to 1.
I Apply a binary search to find this coefficient.
I Use the convexity of the objective function to speed up the
calculation of medians.
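The reconstruction step can be made explicit. As a sketch in the notation of the reformulation (the threshold symbol $s < 0$ and the real-valued $y$ are introduced here for illustration), setting the coefficient of the $y$-th unit in stratum $i$ equal to $s$ and solving for $y$ gives a closed-form inverse:

```latex
\[
\frac{d_i^2}{y} - \frac{d_i^2}{y-1} = s
\;\Longleftrightarrow\;
-\frac{d_i^2}{y(y-1)} = s
\;\Longleftrightarrow\;
y^2 - y + \frac{d_i^2}{s} = 0
\;\Longrightarrow\;
y = \frac{1}{2} + \sqrt{\frac{1}{4} - \frac{d_i^2}{s}} .
\]
```

Because the coefficients $r_{ij}$ increase in $j$ for $j \ge 2$, every unit $j \le y$ of stratum $i$ has a coefficient of at most $s$. Rounding $y$ down and truncating it to $[m_i, M_i]$ therefore gives the number of units a unit-by-unit greedy run would have chosen in stratum $i$ by the time it reaches coefficient $s$.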
Pseudo-Code
While $\sum_i x_i \neq \beta$ do
    Find the (lower) medians $\bar{x}_i = \lfloor (M_i + m_i + 1)/2 \rfloor$ and $\bar{c}_i = \frac{d_i^2}{\bar{x}_i} - \frac{d_i^2}{\bar{x}_i - 1}$;
    Compute the super-median $s = \mathrm{Median}(\bar{c}_1, \dots, \bar{c}_n)$;
    Find the inverses $y_i = 0.5 + \sqrt{0.25 - d_i^2 s^{-1}}$;
    Truncate the inverses at the box constraints if necessary;
    If $\sum_i y_i < \beta$ then update lower bounds: $m_i \leftarrow y_i$;
    Else-If $\sum_i y_i > \beta$ then update upper bounds: $M_i \leftarrow y_i$;
End-While
Return $y$;
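The idea of searching for the last coefficient can be illustrated with a simplified variant. The sketch below is not the super-median scheme of the pseudo-code: it bisects directly on a real-valued gain threshold `t` (the negated coefficient $-s$), uses the closed-form inverse to count units per stratum, and distributes any units left over because of ties with a short greedy pass. The names `units_at_threshold` and `threshold_search` are hypothetical, and the code assumes $d_i^2 > 0$ and $m_i \ge 1$.

```python
import heapq
import math


def units_at_threshold(d, t, m, M):
    """Largest j whose j-th unit still gains at least t, clipped to [m, M].

    The gain of the j-th unit (j >= 2) is d/(j-1) - d/j = d / (j*(j-1)).
    """
    j = int(0.5 + math.sqrt(0.25 + d / t))  # closed-form inverse, rounded down
    while j > 1 and j * (j - 1) * t > d:    # repair floating-point overshoot
        j -= 1
    while (j + 1) * j * t <= d:             # repair floating-point undershoot
        j += 1
    return max(m, min(M, j))


def threshold_search(d2, lower, upper, beta, iters=100):
    """Bisection on the gain threshold t, followed by a greedy fill of ties."""
    if sum(lower) > beta:
        raise ValueError("lower bounds already exceed beta")
    if sum(upper) <= beta:
        return list(upper)  # the budget is not binding

    def alloc(t):
        return [units_at_threshold(d, t, m, M)
                for d, m, M in zip(d2, lower, upper)]

    lo, hi = 0.0, float(max(d2))  # alloc(hi) equals the lower bounds
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) <= beta:
            hi = mid   # still within budget: try a smaller threshold
        else:
            lo = mid   # too many units: raise the threshold

    x = alloc(hi)
    budget = beta - sum(x)
    # Units remain only when several units share the threshold gain.
    heap = [(-d / (xi * (xi + 1)), i)
            for i, (d, xi, M) in enumerate(zip(d2, x, upper)) if xi < M]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        x[i] += 1
        budget -= 1
        if x[i] < upper[i]:
            heapq.heappush(heap, (-d2[i] / (x[i] * (x[i] + 1)), i))
    return x
```

Each bisection step costs one pass over the strata, so the work depends on the number of bisection steps rather than on $\beta$. On a toy instance such as `threshold_search([100.0, 25.0, 4.0], [1, 1, 1], [10, 10, 10], 10)` the result `[6, 3, 1]` agrees with a unit-by-unit greedy run.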
4. Numerical Results
Performance
I Easy to code: Binary Search requires just 100 lines of C++.
I Code running on an Intel(R) Core(TM)2 Duo CPU at 3.00 GHz with 4 GB RAM.
I Benchmark: Fixed-point iteration based on the Lagrangian relaxation and rounding, see [5].
Algorithm          Time [s]   # Iterations
Benchmark          0.02       –
Simple Greedy      0.67       4,567,312
Capacity Scaling   0.06       227,492
Binary Search      0.01       23
Comparison to Relaxed Solution
I Compared to the solution obtained by a rounding heuristic in [5], there is a slight improvement in the objective.
I 25 SMPs are missing exactly one element.
I The differences appear randomly distributed over all strata.
5. Binary Search: Worst-Case Running Time
I Each iteration: Super-median $O(n)$, remainder $O(n)$.
I Number of iterations bounded by $O(n \log_2(\max_i M_i))$.
I Total running time $O(n^2 \log_2(\max_i M_i))$.
Theorem: Binary Search solves the Census Problem in a running time polynomial in the coding length of the input.
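The total worst-case bound for Binary Search is the product of the cost per iteration and the bound on the number of iterations. Written out as a short worked step:

```latex
\[
\underbrace{O(n)}_{\text{cost per iteration}}
\cdot
\underbrace{O\bigl(n \log_2(\max_i M_i)\bigr)}_{\text{number of iterations}}
= O\bigl(n^2 \log_2(\max_i M_i)\bigr).
\]
```

The input lists $n$ upper bounds $M_i$, and writing down the largest of them takes about $\log_2(\max_i M_i)$ bits, so both $n$ and $\log_2(\max_i M_i)$ are bounded by the coding length of the input. The product is therefore polynomial in that length, which is the content of the theorem.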
6. Conclusions
I Greedy strategies find the global minimum of a separable, convex objective function over the integer points of a polymatroid.
I However, the worst-case running time for Simple Greedy is
exponential.
I Binary Search solves the problem in polynomial time.
I Competitive with algorithms for the continuous relaxation.
I Practically relevant and fast enough for very large instances
of sampling problems and similar applications.
Figure: SMPs with differences in partial samples computed by rounding and Binary Search, German state of Hesse.
References:
[1] U. Friedrich, R. Münnich, S. de Vries, M. Wagner, Integer optimization in stratified sampling, Computational Statistics and Data Analysis, submitted for publication.
[2] H. Groenevelt, Two algorithms for maximizing a separable concave function over a polymatroid feasible region, European Journal of Operational Research, 54(2):227-236, 1991.
[3] D. Hochbaum, Lower and upper bounds for the allocation problem and other nonlinear optimization problems, Mathematics of Operations Research, 19(2):390-409, 1994.
[4] R. Münnich, S. Gabler, M. Ganninger, J. Burgard, J.-P. Kolb, Statistik und Wissenschaft, Vol. 21: Stichprobenoptimierung und Schätzung im Zensus 2011, Destatis, Wiesbaden, 2012.
[5] R. Münnich, E. Sachs, M. Wagner, Numerical solution of optimal allocation problems in stratified random sampling under box constraints, AStA Advances in Statistical Analysis, 96:435-450, 2012.
Ulf Friedrich, Department of Mathematics, University of Trier, 54286 Trier, Germany
friedrich@uni-trier.de