Reducing Selection Bias in Quasi-Experimental Educational Studies

Christopher Brooks
School of Information, University of Michigan
brooksch@umich.edu

Omar Chavez
Department of Statistics and Data Sciences, University of Texas at Austin
ochavez@utexas.edu

Jared Tritz
School of Information, University of Michigan
jtritz@umich.edu

Stephanie Teasley
School of Information, University of Michigan
steasley@umich.edu
ABSTRACT
In this paper we examine the issue of selection bias in quasi-experimental (non-randomly controlled) educational studies. We provide background about common sources of selection bias and the issues involved in evaluating the outcomes of quasi-experimental studies. We describe two methods, matched sampling and propensity score matching, that can be used to overcome this bias, and demonstrate their application through a case study that leverages large educational datasets drawn from higher education institutional data warehouses. The contribution of this work is the recommendation of a methodology and case study that educational researchers can use to understand, measure, and reduce selection bias in real-world educational interventions.
1. INTRODUCTION
Evaluating the impact of novel educational pedagogies,
strategies, programs, and interventions in quasi-experimental
studies can be highly error-prone due to selection biases.
The effect of these errors can be significant, and can lead to harm to learners, instructors, and institutions through misinformed decision-making. Further, the lack of confidence researchers have in their analyses of real-world deployments can lead to a decrease in situated experimentation. In this paper we describe a methodology to
understand and correct for selection bias, restoring the confidence researchers and policy-makers can have in the results
of quasi-experimental studies.
A quasi-experimental study is one in which there is no randomized control population. This design admits the potential for selection bias among learners, which is a principal challenge in measuring the effectiveness of the intervention delivered.
In educational studies selection bias is often very difficult to eliminate: there are ethical considerations around equal access to new learning technologies and programs, as well as pragmatic considerations such as open recruitment of learners, where the bias is the result of self-selection. For instance, one would expect that a series of workshops aimed at helping non-traditional learners excel in their first year of university would increase the grades of those students. But students may not elect to attend workshops at random: students with existing strong study skills may be more predisposed to attend, and this latent variable may explain more of the outcome than the workshop itself.
The big data culture that has permeated academic (as
well as other) institutions offers a solution to issues of selection bias in quasi-experimental interventions. Instead of
limiting learner access to a technology or program to form
a control group a priori, a subset of the overall population
of learners is selected post hoc such that it best matches the
group of learners who received the intervention. This creates a matched sample, and allows for an “apples-to-apples”
comparison of outcomes between the two groups of learners
while contextualizing how those groups might differ with respect to selection bias.
The work presented here describes a process for identifying a matched sample of learners and contextualizing how
the matched sample differs from those learners who have
received educational interventions. This technique is especially important when communicating research results to decision makers within the higher education institution. By comparing the effect of a treatment (a learning technology or program) on a group of learners against a similarly matched sample, researchers can control for selection bias and make a more compelling argument about the impact (or lack thereof) of their intervention.
The contributions of this work are threefold:

1. A process for evaluating educational programs and interventions using subset matching, including an understanding of the important statistical tests that must be considered when contextualizing how good (i.e. how unbiased with respect to some attributes) a match is.

2. A case study demonstrating how this method can be applied to reduce selection bias.

3. A free and open source software toolkit (available at https://github.com/usaskulc/population_matching) that allows educational researchers to execute this process directly, complete with the reporting of contextual statistics about the matched populations.
2. ADDRESSING SELECTION BIAS

2.1 Selection Bias
To identify methods to deal with the problem of selection
bias, we first describe what causes an educational intervention (treatment effect) to become biased. This bias is typically due to members of the treatment group voluntarily
selecting themselves to participate in a given intervention.
Other exogenous factors may exist, such as access to technology to engage in the treatment (e.g. Internet access, access
to a smart phone) or selection that is based in part on demographic features. Consequently one might ask whether
there is some sort of important difference in the individuals
who elect to participate in a particular treatment or whether
this is a completely random phenomenon that is unrelated to
the particular individuals. Regardless, we can mitigate the
effects that various factors (the observed covariates) might
have on the outcome of interest. Specifically, we want to estimate the extent to which the measured covariates influence
our estimate of the average treatment effect. In general, our estimates of the average treatment effect involve matching members of the treatment condition to members of the comparison group (control), and rely on the "strong ignorability assumption" [2]. Stated plainly: if we observe two individuals with the same base set of covariates or the same propensity score, then the likelihood that either one would participate in the intervention is the same for both. Thus one electing to use the intervention and the second not is purely coincidental, and comparing the two students' outcomes is a valid approach. Without this assumption it is impossible to infer that all of the selection bias has been removed from the estimated treatment effect [5, 2].
In practice this inference is limited to the covariates we are able to measure, which inevitably have limitations: resource, time, or ethical constraints limit our ability to develop a "complete" set of potentially relevant factors to control for. A natural consequence of this is explained in [3]:
“It is important to realize, however, that whether
treatments are randomly assigned or not, and no
matter how large a sample size one has, a skeptical observer could always eventually find some
variable that systematically differs in the E trials and C trials (e.g., length of longest hair on
the child) and claim the average difference estimates the effect of this variable rather than the
causal effect of Treatment. Within the experiment there can be no refutation of this claim;
only a logical argument explaining that the variable cannot causally affect the dependent variable or additional data outside the study can be
used to counter it.”
This statement again points to the need for a researcher,
when attempting to establish a causal relationship between
the application of a treatment and some sort of measured
outcome, to use as complete a data set as possible. Covariates that are related to the outcome of interest (such as test scores), as well as covariates that affect the likelihood of an individual opting to participate in the treatment or intervention, are both relevant; see [1] for a discussion. Thus we would say the selection bias due to some collection of variables X is the bias that is introduced into our estimate of the average treatment effect when we fail to account or control for X.
Mathematically we can state this in the following way: suppose the true treatment effect is T_true, let T_(-X) be the estimated treatment effect when we do not use X, and let T_(+X) be the estimated treatment effect when we do use X. Then:

    bias_(X removed) = T_(-X) - T_(+X)    (1)
This equation allows us to calculate how much selection
bias is introduced as the result of failing to account for a
particular covariate or set of covariates X from the data
we have available. For example, suppose we were interested in measuring the selection bias a variable such as Math
SAT2 would introduce into our estimated treatment effect.
We would first take our two groups, treatment and control,
and match them based on some list of covariates (e.g. gender, socio-economic status, residency status) and not include
Math SAT scores. Our estimated treatment effect would be T_1. We would then repeat the analysis, this time including Math SAT scores, to get a second estimate of the treatment effect, T_2. The difference between T_1 and T_2 is our point estimate of the bias introduced by failing to include Math SAT as a control variable.
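To make this concrete, the following is a minimal Python sketch of the workflow just described. It is illustrative only: the pandas DataFrame `students`, its column names (`gender`, `ses`, `residency`, `math_sat`, `gpa`, and a 0/1 `treated` flag) are hypothetical, covariates are assumed to be numerically encoded, and a simple greedy nearest-neighbour match stands in for the optimal assignment described in Section 2.2.

```python
# Sketch of equation (1): estimate the bias removed by controlling for Math SAT.
from scipy.spatial.distance import cdist


def estimate_att(df, covariates, treat_col="treated", outcome_col="gpa"):
    """Greedy 1-nearest-neighbour match on standardized covariates, returning the
    mean outcome difference between treated learners and their matched controls."""
    X = (df[covariates] - df[covariates].mean()) / df[covariates].std()
    treated = df[df[treat_col] == 1]
    control = df[df[treat_col] == 0]
    dist = cdist(X.loc[treated.index], X.loc[control.index])
    nearest = dist.argmin(axis=1)                  # closest control for each treated learner
    matched = control.iloc[nearest]
    return treated[outcome_col].mean() - matched[outcome_col].mean()


base = ["gender", "ses", "residency"]              # covariates without Math SAT
t1 = estimate_att(students, base)                  # T_(-X)
t2 = estimate_att(students, base + ["math_sat"])   # T_(+X)
bias_removed = t1 - t2                             # equation (1)
```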
For a full discussion of how various types of covariates in education can account for selection bias, see [5]. To summarize their findings: when deciding which variables to use in one's analysis, it is important to select data that were known before the treatment was administered, or at least could have been known before the treatment was administered. Data collected after treatment assignment that can be influenced (changed in value) by the treatment itself are not useful, since they can introduce bias by causing an over- or underestimate of the treatment effect.
In educational research, variables relating to demographics, pretest information, prior academic achievement, topic
and subject matter preferences, as well as psychological and
personality predispositions, have been shown to affect either observed performance or propensity to participate in
interventions. For instance, Shadish et al. [4] found that variables on proxy-pretests and topic preference, together with demographics, reduced nearly all bias for language-related outcomes, and that variables related to demographics, pretests, and prior academic achievement reduced about 75% of selection bias in a mathematics-related intervention. It is worth noting, however, that the actual reductions in bias could also be due to the specific context of the intervention. Nonetheless, the findings provide supporting evidence that estimating treatment effects with observational data is an appropriate approach.
2 The Math SAT is a standardized test measuring the mathematics ability of entry-level college students in the United States.
2.2 Methods for Subset Matching
The question then arises as to which matching method
best deals with the problem of selection bias. Should we
match equally across all of the covariate measures that we
have available, or should we use a univariate statistic that
describes the propensity by which an individual relates to
the treatment group? Rosenbaum and Rubin [2] provide advice on this issue, and suggest that using the covariates directly and using propensity scores are both sufficient, with neither clearly better than the other. Instead, what matters is how strongly the set of covariates used for matching is related to the treatment assignment or to the outcome of interest.
With this caveat in mind, we outline two popular methods
for finding a matched population for a quasi-experimental
study. The first is a simple matching strategy, whereby
scores are calculated for each covariate and each pair of
subjects in the treatment and comparison groups. A vector
of scores for a particular pair of individuals represents the
difference between subjects, and various differencing methods (e.g. Euclidean distance, Mahalanobis distance) may be
used depending upon the form and distribution of the data.
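As an illustration of this first strategy, the sketch below builds such a pairwise distance matrix, assuming `X_treat` and `X_control` are NumPy arrays of numerically encoded covariates (one row per learner, one column per covariate); both names are hypothetical.

```python
# Pairwise covariate distances between treatment and non-treatment learners.
import numpy as np
from scipy.spatial.distance import cdist

pooled = np.vstack([X_treat, X_control])
VI = np.linalg.inv(np.cov(pooled.T))   # inverse covariance for Mahalanobis distance
dist = cdist(X_treat, X_control, metric="mahalanobis", VI=VI)
# dist[i, j] holds the dissimilarity between treated learner i and comparison
# learner j; metric="euclidean" is a simpler alternative for uncorrelated data.
```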
The second method is to collapse covariates for each individual into a propensity score using a regression approach
such as linear or logistic regression. The result is then
a single value that describes the likelihood an individual
would receive some treatment condition. Care must be taken
when forming propensity scores, especially in large matching
datasets where the number and diversity of non-treatment
individuals outweighs that of the treatment individuals. The
difference between two individual’s scores forms a metric by
which individuals can be matched.
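A comparable sketch of the propensity score approach, assuming a covariate matrix `X` and a 0/1 NumPy array `treated` covering the whole population (both hypothetical names); scikit-learn's logistic regression stands in for whichever regression model is preferred.

```python
# Collapse covariates into a single propensity score via logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" is one way to cope with many more non-treatment than
# treatment learners, the imbalance cautioned about above.
model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, treated)
propensity = model.predict_proba(X)[:, 1]      # estimated P(treatment | covariates)

# Absolute difference in propensity scores as the pairwise matching metric.
dist = np.abs(propensity[treated == 1][:, None] - propensity[treated == 0][None, :])
```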
Regardless of the method used, both approaches form a matrix of treatment versus non-treatment individuals, where each element holds the similarity between two individuals. This matrix can be solved as a linear assignment problem, with the result being globally minimal
(most similar) pairwise matches between the treatment and
non-treatment populations.
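Continuing from either distance matrix `dist` sketched above (rows indexing treatment learners, columns indexing non-treatment learners), SciPy's linear assignment solver can produce the globally optimal pairing.

```python
# Solve the matching as a linear assignment problem: the selected pairs minimise
# the total dissimilarity between the treatment and non-treatment populations.
from scipy.optimize import linear_sum_assignment

treat_idx, control_idx = linear_sum_assignment(dist)
pairs = list(zip(treat_idx, control_idx))      # (treatment row, matched control column)
```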
2.3 Reporting on the Effects of Selection Bias
While the subset matching technique attempts to minimize the overall difference between the treatment group and
a matched sample, such an approach does not guarantee that
suitable matches for a given analysis can be found. It is thus
important to verify how well matched the treatment group
is to the non-treatment group when presenting results on
the effect of the treatment. This can be done by considering
the similarity of each of the covariate distributions between
the treatment and non-treatment groups.
While there are several methods that might be used, a practical approach for continuous data is to compare distributions using a two-sample Kolmogorov-Smirnov test, which is sensitive to both the shape and location of the distributions being compared. A second useful approach is the Mann-Whitney test: it has greater efficiency than the t-test on data not sampled from a normal distribution, and it is nearly as efficient as the t-test on normally distributed data. Both are conservative tests that provide a comprehensive comparison of the distributions of two populations.
In our experience, achieving Kolmogorov-Smirnov confidence values that indicate well-matched distributions is difficult unless the non-treatment group is quite large and diverse, allowing excellent matches to be found. Less robust tests of the quality of matches include means-test methods such as the Student's paired t-test.
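For illustration, the following sketch reports per-covariate balance between the treatment group and its matched sample, assuming two pandas DataFrames `treat_df` and `matched_df` with identically named, numerically encoded columns (all names hypothetical).

```python
# Compare each covariate distribution between the treatment group and the
# matched sample using the two-sample KS and Mann-Whitney tests.
from scipy.stats import ks_2samp, mannwhitneyu

covariates = ["sex", "entrance_test", "credits_at_entry", "household_income"]
for col in covariates:
    ks = ks_2samp(treat_df[col], matched_df[col])
    mw = mannwhitneyu(treat_df[col], matched_df[col], alternative="two-sided")
    print(f"{col}: KS p = {ks.pvalue:.3f}, Mann-Whitney p = {mw.pvalue:.3f}")
```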
3. CASE STUDY: LEARNING COMMUNITIES
The purpose of this section of the paper is not to outline
a particular case study result per se, but to demonstrate
how the techniques described can be used by educational
researchers to come to conclusions about the effect of their
interventions, by reducing the possible selection bias of the
participatory sample.
Learning communities programs (see http://www.lsa.umich.edu/mlc) at our institution group
some students into residences based upon students’ interest
in pursuing a particular domain or discipline. The goal of
the learning communities programs is to provide students
with a peer group for support, as well as provide opportunities for academic development. These programs have existed for more than a decade in various forms, and there is
a strong interest in understanding the effect these programs
have on student success and achievement. Current learning
communities include women students in science and engineering programs; students who are interested in the health
sciences; students pursuing the visual arts; students who
are interested in social justice and community; and students
who are interested in research.
Students are not chosen at random for participation in
learning communities programs. There is both self-selection
bias (e.g. students who are interested in being in the learning
community) as well as a formal selection phase (e.g. application forms included essays which are judged). Students
may apply to many learning communities, but can only be
accepted into one. Learning communities programs are only
available for freshman (first year) university students.
One common question for program evaluators of learning
communities is whether participation in the program raises
the overall academic achievement of students. A naive approach to answering this question would be to conduct a t-test
between students who are in a particular learning community and those who were not in any learning community
along a particular outcome variable such as overall grade
point average. Using one year of such data, the means difference is 0.12 (on a four point scale, see Table 1) suggesting the learning community students actually perform worse
than non-learning community students; a t-test confirms significance at p ≤ 0.01.
Student Group             N      Average GPA
Learning Community        103    3.13
Non-Learning Community    6,090  3.25

Table 1: Comparison of treatment (Learning Community) and non-treatment (Non-Learning Community) groups using a naive analysis.
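The naive analysis reduces to an unmatched two-sample t-test on GPA; a sketch follows, assuming `lc_gpa` and `non_lc_gpa` are NumPy arrays of cumulative GPAs for the two groups (Welch's variant is shown, which does not assume equal variances).

```python
# Naive (unmatched) comparison of GPA, as summarised in Table 1.
from scipy.stats import ttest_ind

stat, p = ttest_ind(lc_gpa, non_lc_gpa, equal_var=False)
print(f"mean difference = {lc_gpa.mean() - non_lc_gpa.mean():.2f}, p = {p:.3f}")
```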
In determining how well matched the comparison groups are, a first step is to consider the list of variables being used and the similarity of the distributions of those variables within each group. This can be done with a two-tailed Kolmogorov-Smirnov test, and Table 2 shows the results between the treatment and non-treatment groups for a variety of
variables that are hypothesized as interacting with cumulative GPA. For the variables that are statistically significant (e.g. p < 0.01) we reject the null hypothesis that the two samples come from the same distribution. In this example, we see that only gender meets this criterion, suggesting that the distribution of gender in the two groups is different.
Variables                     KS Confidence (p)
Sex                           p < 0.001***
Ethnic Group                  p = 0.720
Citizenship Status            p = 1.000
Standardized Entrance Test    p = 0.987
Credits at Entry              p = 0.164
Parental Education            p = 0.953
Household Income              p = 0.661

Table 2: Comparison of the treatment and non-treatment groups across seven demographic and performance features before matching. In this case it was the treatment group that had a higher proportion of women than the non-treatment group.
To reduce this bias a matched set can be created. Using
the equal covariate matching method described at the beginning of Section 2.2, it is possible to minimize the bias that
may exist. Balancing across the variables listed, a paired
treatment–non-treatment dataset of 206 individuals can be
created. Application of a two-tailed Kolmogorov-Smirnov test shows no significance at the p = 0.01 level, though one variable (Household Income) is significant at the p = 0.05 level. The high confidence of all other p-values suggests this dataset is well balanced, except perhaps with respect to household income.
Variables                     KS Confidence (p)
Sex                           p = 1.000
Ethnic Group                  p = 1.000
Citizenship Status            p = 1.000
Standardized Entrance Test    p = 0.996
Credits at Entry              p = 1.000
Parental Education            p = 1.000
Household Income              p = 0.036*

Table 3: Comparison of the treatment and non-treatment groups across seven demographic and performance features after matching.
The result of the matching process is two populations of the same size, with individuals in the first directly matched to individuals in the second. Thus, a paired t-test for statistical significance can be used on outcome variables. Considering GPA, the paired t-test returns a statistically significant difference (p = 0.003), with the means difference between the groups being 0.18 points in favor of the non-treatment group. In short, the researcher can now say with greater certainty that there is a difference between the treatment and non-treatment students, and that the bias introduced by the observed variables (except perhaps Household Income, at the p = 0.036 level) has been eliminated.4
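A sketch of this final comparison, assuming a pandas DataFrame `matched` with one row per matched pair and hypothetical columns `treated_gpa` and `control_gpa` holding the outcomes of each pair.

```python
# Paired t-test on the outcome for the matched treatment/non-treatment pairs.
from scipy.stats import ttest_rel

stat, p = ttest_rel(matched["treated_gpa"], matched["control_gpa"])
diff = (matched["treated_gpa"] - matched["control_gpa"]).mean()
print(f"paired mean GPA difference = {diff:.2f}, p = {p:.3f}")
```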
4. CONCLUSIONS
Selection bias in quasi-experimental studies can undermine the confidence decision-makers have in the results of
analyses, and lead to possible misunderstandings and poor
policy decisions. Yet institutions, researchers, and practitioners are often unable to run randomized controlled experiments of learning innovations due to pragmatic or ethical concerns. This paper has introduced a methodology by which researchers can contextualize the results of their analyses and reduce selection biases.
Leveraging big educational datasets and institutional data
warehouses, researchers can often mitigate selection bias by
finding a comparison group of learners who did not undergo
a particular treatment. Learners can be compared equally
across all covariates (e.g. demographics, previous performances, or preferences), or covariates can be collapsed into
a single propensity score which can be used as the basis
for matching. The end result of the matching process is a
paired dataset of learners who have undergone a treatment
and similar learners who did not receive the treatment. The
researcher can then apply post-hoc analysis as appropriate.
In this work we have included an example of this method applied to an educational program that is particularly affected by selection bias: university learning communities. These learning communities are heavily biased with respect to the sex of participants (Table 2). After controlling for this bias, the means difference between the treatment and control groups increases by roughly 50% (from 0.12 to 0.18
in GPA units). Whether this is significant enough to change
policy or deployment of the program depends on how decision makers weight this particular outcome. There may be
alternative student outcomes such as satisfaction, time to
degree completion, or co-curricular achievements that influence policy in this area. What is important here is that the
researcher can feel confident that these results more accurately reflect the effects on the treatment population given
the kinds of learners who would opt in to the treatment.
In our experience, however, creating matched pairs of learners rarely produces a perfect match, so contextualizing the goodness of fit between the two groups of learners is important. This can be done both before and after matching, using the Kolmogorov-Smirnov statistic. This technique can identify which covariates may not be possible to match on; an insight which is essential when forming educational policy.
4 The more lenient the significance threshold (i.e. the larger the alpha), the more likely it is that variability (noise) in the data will cause a particular hypothesis to be rejected. As more variables are considered, the chance of a spurious correlation for at least one variable at a well-accepted level such as p = 0.05 or p = 0.01 increases. A more conservative threshold for a given confidence level can be obtained by dividing the alpha (e.g. 0.05) necessary for statistical significance by the number of variables being considered (here, 7), a Bonferroni correction which controls the family-wise error rate. In this example, at alpha = 0.05, one would then expect only values of p ≤ 0.05/7 ≈ 0.007 to be considered statistically significant. Thus the appearance of Household Income being statistically significantly different between the two distributions should be questioned as to whether it is a spurious result.
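A small sketch of the correction just described; the p-value is the Household Income result from Table 3, and the other figures simply restate the thresholds from the text.

```python
# Bonferroni correction: divide the family-wise alpha by the number of
# covariates tested before declaring any single difference significant.
alpha = 0.05
n_variables = 7
threshold = alpha / n_variables            # roughly 0.007
household_income_p = 0.036                 # from Table 3
print(f"threshold = {threshold:.4f}, "
      f"significant after correction: {household_income_p <= threshold}")
```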
5. ACKNOWLEDGEMENTS
Thanks to Dr. Jim Greer from the University of Saskatchewan for motivating earlier work in this area. Also, thanks to Dr. Ben Hansen at the University of Michigan for insights on using propensity and prognostic scores and their application to matching problems. Finally, thanks to Dr. Brenda Gunderson from the University of Michigan for her support in investigating these issues in the E2 Coach framework.
6. REFERENCES
[1] B. Hansen. The prognostic analogue of the propensity score. Biometrika, pages 1–17, 2008.
[2] P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[3] D. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
[4] W. R. Shadish, M. H. Clark, and P. M. Steiner. Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103(484):1334–1344, Dec. 2008.
[5] P. M. Steiner, T. D. Cook, W. R. Shadish, and M. H. Clark. The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15(3):250–267, Sept. 2010.