Document 275921

Sample selection using
Proc SURVEYSELECT
Sylvain Tremblay
SAS Canada – Education Group
Copyright © 2006, SAS Institute Inc. All rights reserved.
Stratified random sampling in SAS
In the good old times…
Copyright © 2006, SAS Institute Inc. All rights reserved.
Stratified random sampling in SAS
Today!
ONE procedure, that’s it!
Copyright © 2006, SAS Institute Inc. All rights reserved.
Agenda
ƒ Sample design
ƒ Selection methods
ƒ Proc SURVEYSELECT syntax
ƒ Demo 1 – Separate sampling
ƒ Demo 2 – Cluster sampling
ƒ Conclusion
ƒ To learn more
ƒ Questions
Copyright © 2006, SAS Institute Inc. All rights reserved.
Sample Design
Process for selecting sampling units
ƒ Equal vs non-equal probabilities
ƒ Element vs cluster sampling
ƒ Unstratified vs stratified selection
ƒ Random vs systematic selection
ƒ One phase vs multi-phase sampling
Copyright © 2006, SAS Institute Inc. All rights reserved.
Selection Methods
Supported by Proc SURVEYSELECT
ƒ simple random sampling
ƒ unrestricted random sampling (with replacement)
ƒ systematic
ƒ sequential
ƒ selection probability proportional to size (PPS)
with and without replacement
ƒ PPS systematic
ƒ PPS for two units per stratum
ƒ sequential PPS with minimum replacement
Copyright © 2006, SAS Institute Inc. All rights reserved.
Proc SURVEYSELECT - Basic syntax & options
PROC
PROCSURVEYSELECT
SURVEYSELECT
DATA=SAS-data-set
DATA=SAS-data-set
OUT=
OUT=
METHOD=
METHOD=
SEED=
SEED=
SAMPSIZE=
SAMPSIZE=
SAMPRATE=
SAMPRATE=;;
STRATA
STRATAvariables
variables;;
ID
IDvariables
variables ;;
RUN;
RUN;
Copyright © 2006, SAS Institute Inc. All rights reserved.
Demo 1 - Oversampling (separate sampling)
to construct a training dataset
Frame
0
95 %
Sample
(Training dataset)
0
X%
1
Target
1
Target
5%
Y%
X <> Y
Generally, Y=100%
Copyright © 2006, SAS Institute Inc. All rights reserved.
50 %
50 %
SII1
Demo 2 - Two-Stage Cluster Sampling
Cluster sampling is appropriate in these situations:
ƒ when a sampling frame of the elements of interest is
unavailable
ƒ when data collection is expensive because the elements
of interest are widely dispersed
Copyright © 2006, SAS Institute Inc. All rights reserved.
Slide 9
SII1
Cluster sampling can provide efficiency in frame construction and other survey operations. However, it can also result in a loss in
precision of your estimates, compared to a nonclustered sample of the same size. To minimize this effect, units within clusters should
be as heterogeneous as possible for the characteristics of interest.
SAS Institute Inc., 10/19/2007
Two-Stage Cluster Sampling
Examples:
STAGE 1
PSU
City Block
Household
Clinic
Classroom
Geographic Area
Stream
Copyright © 2006, SAS Institute Inc. All rights reserved.
STAGE 2
SSU
Household
Family Member
Patient
Student
Small Plot
3-meter Section
Two-Stage Cluster Sampling
Data for the demo:
Copyright © 2006, SAS Institute Inc. All rights reserved.
Two-Stage Cluster Sampling
Stage One: Select Primary Sampling Units (PSU)
PSU: City block
Region
Notes: 1 - PSUs selection: stratified on REGION
2 - For household (hh) samples, selection of PSU: PPS on # hh
Copyright © 2006, SAS Institute Inc. All rights reserved.
Two-Stage Cluster Sampling
Stage Two: Select Secondary Sampling Units (SSU)
SSU: Household
Notes: 1 - SSU are randomly selected ONLY from PSUs selected in stage 1
Copyright © 2006, SAS Institute Inc. All rights reserved.
In conclusion
Proc SURVEYSELECT
ƒ Can be used for simple or complex sample designs
ƒ Covers all the major selection methods
ƒ Is an easy way to construct your training and
validation datasets for predictive modeling
ƒ is your best friend is you need to use re-sampling
techniques like bootstrap, jacknife and cross
validation
Copyright © 2006, SAS Institute Inc. All rights reserved.
To learn more
ƒ SAS Training: ‘Design and Analysis of Probability Surveys’
ƒ SAS Support Website – samples on SURVEYSELECT
http://support.sas.com/kb
Copyright © 2006, SAS Institute Inc. All rights reserved.
Questions?
THANK YOU!
Copyright © 2006, SAS Institute Inc. All rights reserved.
Sylvain Tremblay
sylvain.tremblay@sas.com