Advanced Adverse Impact Analysis
Why the (Uncorrected) Fisher Exact Test Should Not Be Used for Most Adverse Impact Analyses (8-26-09)
BCGi Institute for Workforce Development
© Copyright 2009 Biddle Consulting Group, Inc. All Rights Reserved

Visit BCGi Online
While you are waiting for the webinar to begin, don't forget to check out our other training opportunities through the BCGi website. Join our online learning community by signing up (it's free) and we will notify you of our upcoming free training events as well as other information of value to the HR community.
www.BCGinstitute.org

HRCI Credit
BCG is an HRCI Preferred Provider. CE credits are available for attending this webinar. Only those who remain with us for at least 80% of the webinar will be eligible to receive the HRCI training completion form for CE submission.

About Our Sponsor: BCG
• Assisted hundreds of clients with cases involving Equal Employment Opportunity (EEO) / Affirmative Action (AA) (both plaintiff and defense)
• Compensation analyses / test development and validation
• Published: Adverse Impact and Test Validation, 2nd Ed., a practical guide for HR professionals
• Editor & Publisher: EEO Insight, an industry e-Journal
• Creator and publisher of a variety of productivity software/web tools:
  – OPAC® (Administrative Skills Testing)
  – CritiCall® (9-1-1 Dispatcher Testing)
  – AutoAAP™ (Affirmative Action Software and Services)
  – C4™ (Contact Center Employee Testing)
  – Encounter™ (Video Situational Judgment Test)
  – Adverse Impact Toolkit™ (free online at www.disparateimpact.com)
  – AutoGOJA® (Automated Guidelines Oriented Job Analysis®)
  – COMPare (Compensation Analysis in Excel)
www.Biddle.com

Contact Information
Daniel Biddle, Ph.D.
Dan@biddle.com
Biddle Consulting Group, Inc.
193 Blue Ravine, Ste. 270
Folsom, CA 95630
1-800-999-0438
www.biddle.com

Questions?
Should you have any questions during the webinar, you have two options:
• Ask a question through the GoToMeeting screen console and we will try to address it at the end of the webinar.
• Should you have any questions regarding OFCCP Audits, Testing and Selection, or Statistical Analysis, visit www.BCGInstitute.org.

Presentation Overview
Disclaimer: These are complicated topics!
• Adverse impact analyses background: Is this really important?
• Issue #1: Marginal totals
• Issue #2: Conservativeness
• Data simulation results
• Implications and recommendations

What's the Big Deal?
The issues we'll be discussing are a "big deal" and not a big deal at the same time.
The big deal? Adverse impact is serious, and no one wants to calculate liability statistics inaccurately.
But it's not a big deal because, most of the time and under most circumstances, when statistically significant adverse impact is there, it's there! Most court cases and audits are typically only enforced when statistical evidence is strong.

How Did this Come About?
For decades, EEO professionals have relied on "chi-square" type analyses for the 2x2 table question:

          Pass   Fail   Totals
  Men        8      2       10
  Women      2      6        8
  Totals    10      8       18

Sometimes various corrections have been used (Yates, Cochran).
Sometimes the Fisher Exact Test (FET) has been used.

How Did this Come About?
But is that what Fisher intended? All 2x2 tables to be run through his "exact" test? Since about the 1950s, various challenges have been brought against the FET:
• What are the assumptions required for the FET results to be accurately interpreted?
• Is the FET too conservative?
• Are there other, more accurate techniques when the strict FET conditions are not met?

How Did this Come About?
Most recently, from the mid-90s to this year, a barrage of articles has been published in the biomedical field, the theoretical statistics journals, and other fields that have criticized the FET, abandoned the FET, and recommended other "less conservative" replacements that are more applicable and accurate across a greater diversity of 2x2 situations. The adverse impact field has not been neglected in these discussions. We've reviewed over 80 such articles and chapters.

How Did this Come About?
2x2 analyses can be conducted in three situations: fixed, mixed, and free margins. While there is a consensus in the current literature that the FET is inappropriate in 2 of these 3 "2x2" situations, there is not a consensus regarding whether the uncorrected FET should be used in 1 of the 3. When evaluating these 2x2 situations, it becomes clear that the FET should not be used in many AI circumstances, but it may be used in some situations. Let's take a look at the "2x2 situations."

FET Issue #1: Marginal Totals
The FET Requires Meeting Conditional Assumptions Not Always Met in Practice
• Are the margins FIXED before or after the event?
• Are the margins FIXED, CONSTRAINED, OR CORRELATED to the employer's previous decisions?

FET Issue #1: Marginal Totals
The Three Major 2x2 Models (see Collins & Morris, 2008)
(The same example 2x2 table is shown under each model, labeled FIXED, MIXED, and FREE.)
• FIXED (Model 1: Independence Trial): The marginal proportions are assumed to be fixed in advance (i.e., the proportion of each group and the selection totals are fixed). Data are not viewed as a random sample from a larger population.
• MIXED (Model 2: Comparative Trial): Applicants are viewed as random samples from two distinct populations (e.g., minority and majority). The proportion from each population is fixed (i.e., the marginal proportion on one variable is assumed to be constant across replications). The second marginal proportion (e.g., the marginal proportion of applicants who pass the selection test) is estimated from the sample data.
• FREE (Model 3: Double Dichotomy): Neither the row nor the column totals are assumed to be fixed. Applicants are viewed as a random sample from a population that is characterized by two dichotomous characteristics. No purposive sampling or assignment to groups is used, and the proportion in each group can vary across samples.
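To make the contrast concrete, here is a minimal computational sketch (in Python, assuming SciPy is available; the variable names are illustrative and are not taken from the presentation) that runs the chi-square (with and without the Yates correction) and the two-tailed FET on the example 2x2 table above.

```python
# Hedged sketch: applies the tests discussed above to the example table.
# Assumes Python with SciPy installed; nothing here comes from the slides
# beyond the 2x2 counts themselves.
from scipy.stats import chi2_contingency, fisher_exact

# Rows: men, women; columns: pass, fail (the 18-person example above)
table = [[8, 2],
         [2, 6]]

# Uncorrected chi-square (equivalent to the squared two-proportion Z test here)
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)

# Yates-corrected chi-square (one of the "corrections" mentioned above)
_, p_yates, _, _ = chi2_contingency(table, correction=True)

# Two-tailed Fisher Exact Test
_, p_fet = fisher_exact(table, alternative="two-sided")

print(f"chi-square (uncorrected): p = {p_chi2:.4f}")
print(f"chi-square (Yates):       p = {p_yates:.4f}")
print(f"Fisher exact (two-tail):  p = {p_fet:.4f}")
```

Running the three procedures side by side on the same table is a quick way to see how much the choice of test moves the p-value, which is the theme of the rest of the presentation.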
FET Issue #1: Marginal Totals
The Three Major 2x2 Models: Applied to EEO Analysis
• FIXED: Terminations / RIFs. However, the margins are shared/correlated with past practices and oftentimes are not predetermined.
• MIXED: Some promotions. However, shared "odds ratios" (with prior decisions).
• FREE: Applicants are widely recruited and show up; passing rates / hiring rates are unknown in advance.

FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions?
Reviewing the three 2x2 models against HR decisions (adapted from Collins & Morris, 2008):

• HR practice: Hiring with a fixed cutoff score
  2x2 model: Double Dichotomy
  Comments: Selection decisions use a fixed cutoff score. The passing score is typically set in advance or using normative data. MQs might be used.

• HR practice: Top-down selection
  2x2 model: None of the three models fits appropriately
  Comments: Candidates are selected top-down based on hiring criteria until a fixed number of positions is filled. The selection rate is fixed based on staffing needs. If a different sample had been used, the number passing would have been the same. However, each group's proportion is likely to vary across samples and is best treated as an estimate of an unknown population parameter. Further, because selection decisions depend on applicant rank position in a particular sample, the selected and nonselected groups are sample-specific and do not reflect two distinct populations as in the comparative trial model.

FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions? (continued)

• HR practice: Banding
  2x2 model: None of the three models fits appropriately
  Comments: Banding is a combination of "ranking" and typically also involves a minimum cutoff score, so it is a hybrid method for which none of the sampling models is a perfect fit.

• HR practice: Promotion
  2x2 model: None of the three models fits appropriately
  Comments: The candidate pool is relatively fixed; if decisions were repeated, the candidate set would be similar. In such cases, probabilities based on randomly sampling from a population, as in the comparative trial and double dichotomy models, would not apply. Similarly, probabilities based on random reassignment of participants (i.e., the independence trial model) would not be appropriate. Without a theoretical process for producing different data patterns (e.g., random ...)

FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions?
• "Because the independence trial model ('Fixed') does not represent typical personnel selection data, there is reason to question the appropriateness of the Fisher Exact Test for adverse impact analysis."
• "The tendency of these tests to be conservative under the other sampling models indicates that the Fisher Exact Test and Yates's test will be less likely than other tests to identify true cases of adverse impact" (Collins & Morris, 2008).

FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions?
• In the EEO analysis field, "The justification of conditional tests (those for 'fixed' margins) depends on the assumption that the process determining the fixed marginal counts is not dependent on the process under study…"
• For example, when considering whether to use a conditional test (the FET) for a promotional analysis, "The number of minority members hired out of a labor pool should not provide information about the odds ratio of the promotion rates, the parameter of interest."
• Gastwirth advises checking this assumption before calculating conditional tests in situations where the available sample results from a previous selection process that may be affected by the same factors involved in the process being examined (because the odds ratio of the hiring rates and the promotion rates would be related).
• For this reason, the unconditional tests may be a "more accurate" test across a greater number of AI cases (Gastwirth, J. (1997). Statistical evidence in discrimination cases. Journal of the Royal Statistical Society, Series A, 160(2), 289-303).

FET Issue #1: Marginal Totals
(Diagram: the same men/women pools are shown for HIRES, PROMS, and TERMS, illustrating that the margins of one analysis are produced by the employer's earlier decisions.)
In the EEO area, when using statistical tests, it is important to consider a crucial assumption underlying conditional tests. This assumption requires that one can condition on fixed marginal numbers that are not dependent on any factor related to the process being investigated. For example, if one examines the promotion data of a firm, the marginal sample sizes of minority and majority employees eligible for advancement clearly result from the hiring practice of that firm (ibid.).

FET Issue #1: Marginal Totals
The FET Requires "Calling Out" Marginal Totals Before the Analysis is Conducted
• When, if ever, is this really the case in AI analyses?
• "The FET assumes that both of the margins in a 2x2 table are fixed by construction—i.e., both the treatment and outcome margins are fixed a priori" (Sekhon, 2005).
• "Over decades there has been a lively debate among statisticians on the applicability of the conditional FET. The argumentation against the test mainly is that it conditions inference on both margins where only one margin is fixed by most experimental designs and the test is inherently conservative…the row and column marginal totals are fixed by the researcher prior to data collection" (Gimpel, 2007, p. 171).
• "Fisher's 2 x 2 exact test requires that the marginal frequencies in both margins are fixed a priori" (Romualdi et al., 2001).

FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions?
• For fun… it isn't "truth" today if it's not "so" on Wiki!
• "FET assumes that the row and column totals are known in advance. In cases where this assumption is not met, FET is very conservative, resulting in a Type I error rate which is below the nominal significance level. In practice, this assumption is not met in many experimental designs and almost all non-experimental ones.
An alternative exact test, Barnard's exact test, has been developed, and proponents of it suggest that this method is more powerful, particularly in 2x2 tables."

FET Issue #2: Conservativeness
The FET is "Too Conservative" Compared to Other Methods
• "The tendency of these tests to be conservative under the other sampling models indicates that the FET and Yates's test will be less likely than other tests to identify true cases of adverse impact" (Collins & Morris, 2008).
• "The exact test of Fisher…gives tests which are both extremely conservative and inappropriate" (Upton, 1982) (Upton later endorsed the mid-P).
• "The traditional FET should practically never be used" … "the FET is unnecessarily conservative with lower power than conditional mid-p tests and unconditional tests" … "We do not recommend the use of FET. FET is conservative, that is, other tests generally have higher power yet still preserve test size" (Lydersen et al., 2009).
• "FET can be conservative in the sense of its actual significance level (or size) being much less than the nominal level" (Lin & Yang, 2009).

Probability Theory Applied to 2x2 Tables
(Chart: demonstration of "discreteness" in the FET probability distribution, plotted against the asymptotic "best estimate" line used by the chi-square. The p-values shown are FET: 0.0536, mid-p: 0.0392, unconditional: 0.0338. The FET has only 4 "stopping places" below .05; chi-square theory has more.)

Probability Theory Applied to 2x2 Tables
(Charts: actual significance level versus the desired (.05) significance level, shown separately for the mid-P and for the uncorrected FET.)

Important Questions for HR Professionals…
• What is the significance level used for testing whether a test is valid?
• What is the significance level used for testing adverse impact?
Answers: Validity: .05. Adverse impact: .05.
• What statistical tests are useful for answering these statistical questions?
Validity: the Pearson correlation is common. Adverse impact: the Fisher Exact Test (under a variety of methods), chi-square, Z-test, etc.

Example
• Let's take two employers that use a physical test with a standardized mean group difference of 1.0 (d) between men and women.
  – This difference is commonly observed on written tests (minority/non-minority) and physical tests (men/women).
• Each employer tests 1,000 applicants per year.
• One employer hires only the top 10%; the other only the top 40%.
• Such a test will exhibit adverse impact; whether it is detected just depends on 2 factors:
  1. the number of applicants tested and hired, and
  2. the power of the statistical test used to detect the AI.

Example
This example constitutes one where a "substantial passing rate difference" (required by the Guidelines) has been observed between the two groups in the population.
Using a 40% hiring rate, a standardized mean group difference of 1.0 (d) between men and women equates to:
• a 58% male passing rate
• a 22% female passing rate
Using a 10% hiring rate, a d of 1.0 equates to:
• an 18% male passing rate
• a 3% female passing rate

Practical Implications of a Test with a 1.0 (d)
• How much overlap is there between the two groups based on various d values?
  .25 d = 82% overlap
  .50 d = 67% overlap
  .75 d = 55% overlap
  1.0 d = 45% overlap

Summarizing the AI Evidence on a 1.0 (d) Test
Evaluating the company that uses a 40% hiring rate:
• 58% of the men will pass and 22% of the women will pass
• Hiring ratio = 2.6 male hires for every 1 female hire
• The impact ratio is 38% (roughly half of the 80% threshold)
• Less than one-half (45%) of the male distribution overlaps with the female distribution
Evaluating the company that uses a 10% hiring rate:
• 18% of the men will pass and 3% of the women will pass
• Hiring ratio = 6 male hires for every 1 female hire
• The impact ratio is 17% (roughly one-fourth of the 80% threshold)
• Less than one-half (45%) of the male distribution overlaps with the female distribution

Finding Adverse Impact
Next, let's investigate the usefulness of three statistical tools in answering the AI question:
• Fisher Exact Test (FET)
• Fisher Exact Test (mid-P)
• Chi-square (or "Z" test)
The sample sizes for both the "40% hiring rate" and "10% hiring rate" employers will be scaled and evaluated. Sample sizes will be "matched" for both men and women.

First, Some Definitions
• Type I Error (α): rejecting the null ("no difference") hypothesis when the null hypothesis is true; in other words, finding AI when it does not exist.
• Type II Error (β): failing to reject the null hypothesis when the null hypothesis is false; in other words, missing AI when it exists.
• Type I Error Rate: the percentage of Type I errors made by a statistical test (i.e., the rate at which it falsely concludes AI).
• Type II Error Rate: the percentage of Type II errors made by a statistical test (i.e., the rate at which it misses AI that exists).
• Nominal Level: the p-value of significance, declared in advance (e.g., .05). In AI cases, the major concern is with answering the "big .05 question."

More Definitions
• Statistical power analysis evaluates the likelihood that a statistical test will find a meaningful difference at the specified level (e.g., .05, or 2 SDs).
• Adverse impact tests that are "more powerful" are more likely to find adverse impact when it exists.

Power Analysis for 40% Hiring Rate Employer
(Chart: power curve for detecting adverse impact on a 1.0 (d) test used with a 40% overall passing rate, answering the question: what percent of the time will the test find adverse impact when it exists? Curves are shown for the FET, FET (mid-P), and chi-square across per-group sample sizes of 5 to 50; the gap between the curves shows the increased likelihood of missing AI when it exists.)

Power Analysis for 10% Hiring Rate Employer
(Chart: the same power curves for a 10% overall passing rate, across per-group sample sizes of 5 to 80; again, the gap shows the increased likelihood of missing AI when it exists.)
A simulation sketch illustrating this kind of power comparison follows below.
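The following is a minimal sketch (in Python, assuming NumPy and SciPy are available) of how power curves like the ones just described can be produced. It is not the presenters' code: the normal model used to convert d into group passing rates, the function names, and the replication count are illustrative assumptions, and the uncorrected chi-square here stands in for the chi-square/Z-test family.

```python
# Hedged sketch: estimate, by simulation, how often the FET, the mid-P FET,
# and the uncorrected chi-square flag adverse impact at p < .05 on a test
# with a 1.0 standardized group difference (d). Assumes equal-size male and
# female applicant pools drawn from normal score distributions; names and
# the replication count are illustrative, not taken from the slides.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2_contingency, fisher_exact, hypergeom, norm

def group_pass_rates(d, overall_rate):
    """Passing rates for two equal-size groups whose means differ by d SDs."""
    # Find the cutoff whose pooled passing rate equals overall_rate.
    cutoff = brentq(lambda c: 0.5 * (norm.sf(c - d / 2) + norm.sf(c + d / 2))
                    - overall_rate, -5.0, 5.0)
    return norm.sf(cutoff - d / 2), norm.sf(cutoff + d / 2)

def midp_two_sided(table):
    """Two-sided FET p-value minus half the observed table's probability."""
    a, b = table[0]
    c, d = table[1]
    p_fet = fisher_exact(table, alternative="two-sided")[1]
    p_obs = hypergeom.pmf(a, a + b + c + d, a + b, a + c)
    return p_fet - 0.5 * p_obs

def detection_rates(n_per_group, p_hi, p_lo, reps=2000, alpha=0.05, seed=0):
    """Share of simulated samples in which each test reports p < alpha."""
    rng = np.random.default_rng(seed)
    hits = {"FET": 0, "mid-P": 0, "chi-square": 0}
    for _ in range(reps):
        pass_hi = rng.binomial(n_per_group, p_hi)
        pass_lo = rng.binomial(n_per_group, p_lo)
        tbl = [[pass_hi, n_per_group - pass_hi],
               [pass_lo, n_per_group - pass_lo]]
        hits["FET"] += fisher_exact(tbl)[1] < alpha
        hits["mid-P"] += midp_two_sided(tbl) < alpha
        try:
            hits["chi-square"] += chi2_contingency(tbl, correction=False)[1] < alpha
        except ValueError:  # degenerate table (e.g., nobody in either group passes)
            pass
    return {name: count / reps for name, count in hits.items()}

for overall in (0.40, 0.10):
    p_men, p_women = group_pass_rates(1.0, overall)  # roughly the 58%/22% and 18%/3% rates above
    print(f"overall rate {overall:.0%} (men ~{p_men:.0%}, women ~{p_women:.0%}):",
          detection_rates(30, p_men, p_women))
```

Sweeping n_per_group over a range of sample sizes and plotting the three detection rates should reproduce the general shape of the power curves above, with the FET line sitting lowest and the mid-P closing most of the gap to the chi-square.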
Power Comparison in Small Samples
(Chart: average statistical power in samples between 10 and 50 per group, for the FET, FET (mid-P), and chi-square at 10%, 20%, and 40% selection ratios. Approximate values shown: 10% SR: 29%, 39%, 45%; 20% SR: 56%, 64%, 66%; 40% SR: 68%, 75%, 76%, with the FET lowest and the chi-square highest in each set.)

Accuracy of Tests for Answering the "Just .05 or Less" Question
(Chart: comparison between the FET and the mid-P by sample size, based on Monte Carlo simulations from the cited articles. It plots the SD value each test actually requires for significance, i.e., the average "overage" above 1.96 needed before the test reports a 1.96-level finding, across sample-size bins from 0-20 to 126-200, with polynomial trend lines for both tests.)

How Accurately Do the Tests Answer the .05 Question?
Actual FET/mid-P significance levels (compared to the desired .05 level):

  Sample Size                          Typical Alpha   % Below Desired .05 Level   Actual SD Required for Significance
  0-20                                 0.015           70%                         2.43
  21-50                                0.025           50%                         2.24
  50-75                                0.026           47%                         2.22
  76-100                               0.032           36%                         2.15
  101-125                              0.035           30%                         2.11
  126-200                              0.043           13%                         2.02
  Typical estimate for n<50 (FET)      0.029           41%                         2.19
  Typical estimate for n<50 (mid-P)    0.046            8%                         1.99

Type I Error Rates Between Tests
(Chart: Type I error rate comparison between three 2x2 tests (Z-test, FET, mid-P) across scenarios crossing selection ratio (SRT: 10%-70%), minority representation (PMIN: 10%-50%), and sample size (N = 20, 50, 100); the vertical axis shows the p-value/Type I error rate generated by each test, on a scale of 0 to .08.)

Summary
The best AI test is one that balances three concerns:
• being able to answer the .05 question,
• missing adverse impact when it exists, and
• falsely concluding AI exists when it does not.
The FET consistently "undershoots" the .05 level of significance: drastically in smaller samples (n<50), and substantially in samples of 50-125.
The mid-P provides a "correctly" sized adjustment across various samples.
Type II error rates ("missing" AI when it exists) differ substantially by test, especially in smaller samples, where the FET is much less powerful.
All 3 common tests share similarly low Type I error rates, leaving the employer with very low odds of "incorrectly" concluding AI.

Summary
• Using the FET unilaterally in all 3 conditions is unacceptable and should be discontinued in light of the recent findings just reviewed.
• The conditional FET may be appropriate in limited conditional settings, but there will always be an argument against such use:
  – The FET is conservative regardless.
  – Does the situation analyzed truly meet the conditional requirements?
• However, the mid-P has power advantages, adheres more closely to the .05 level, and is very closely aligned with the FET (where the FET is appropriate).

Summary
• Is either position—the FET or the mid-P—aligned with a "plaintiff" or "defense" position? It depends on the question being asked…
• If an employer is interested in knowing the exact p-value in a clearly conditional situation where the margins were indeed fixed beforehand, the FET will provide this answer.
• If an employer is interested in not missing adverse impact that may exist (i.e., wants strong power to detect AI), the mid-P will better answer the question, in both conditional and unconditional situations (i.e., all 3 models).
• For a test to be useful, it should be reasonably accurate, reasonably powerful, and versatile across a wide range of situations.
• The p-value from an AI test using a discrete distribution should be reasonably "aligned" with a p-value from a comparable continuous distribution.

Summary
• The FET gives the actual conditional p-value, but it will always go below the .05 nominal level, thus not answering the "exact" ≤.05 or 2 SD question asked in Title VII situations.
• The mid-p may be thought of as "assessing the strength of evidence against the null hypothesis" (Barnard, 1989, p. 1474). This is not true of the exact p-value from the FET.
• The question being asked in Title VII situations is not necessarily "what is the p-value?" but rather "is the p-value less than .05?" The mid-p answers this question more accurately.

Summary
Advantages of Using the Mid-P (adapted from Hirji, 2006)
Hirji provides the basis for endorsing the mid-P as the preferred exact method (for either conditional or unconditional situations):
• Statisticians who hold very divergent views on statistical inference have either recommended or given justification for the mid-p method.
• A mid-p version has been or can be devised for most of the statistics used in exact conditional and unconditional analysis of discrete data.
• The confidence intervals associated with mid-p methods are often preferred by statistical programs (e.g., StatXact) because they are narrower / more accurate.
• The shape and power function of the mid-p tests are generally close to the shape of the ideal power function—an important distinction because it demonstrates that the power of the test is uniform and able to detect AI when it exists across a variety of data sets (both balanced and unbalanced).

Summary
Advantages of Using the Mid-P (adapted from Hirji, 2006)
• In a wide variety of designs and models, the mid-p rectifies the extreme conservativeness of the traditional exact conditional method without substantially compromising the Type I error.
• Empirical studies show that the performance of the mid-p method resembles that of the exact unconditional methods and the conditional randomized methods.
• With the exception of a few studies, most studies indicate that in comparison with a wide variety of exact and asymptotic methods, the mid-p methods are among the preferred, if not the preferred, ones.
• The mid-p has good comparative small- and large-sample properties.
• Hirji concludes by stating: The mid-p method is thus a widely accepted, conceptually sound, practical method and among the better of the tools of data analysis.
Especially for sparse and not-that-large samples of discrete data, we thereby echo the words of Cohen and Yang (1994) that it is among the "sensible tools for the applied statistician."

Summary

  EVALUATION FACTOR                                       FET (conditional)   FET (mid-P)   FET-Boschloo (unconditional)
  Appropriate in Independence Trial? (Model 1, Fixed)     MAYBE               YES           NO
  Appropriate in Comparative Trial? (Model 2, Mixed)      NO                  YES           YES
  Appropriate in Double Dichotomy? (Model 3, Free)        NO                  YES           YES
  Average Distance from .05 Level in Small Samples        41%                 8%            5-10%
  Actual Significance Level Required in Small Samples     2.19                1.99          1.95-2.05
  Preserves .05 Nominal Sig. Level                        YES                 NO            NO
  Average Power in Small Samples (n<50)                   54%                 62%           62%

How Do You Compute the Mid-P?
It's rather simple… many stat packages will provide the mid-p.
If you already have an AI tool or stat program, just:
• Compute the 2-tail FET p-value.
• Subtract ½ of the probability of the observed table (the value the "HYPGEOMDIST" function returns) from that p-value.
If you want to avoid the hassle, just calculate mid-p values for FETs that are "on the cusp" of significance, such as 1.80 SDs (corresponding to p-values of about .07). This can easily be done for Mantel-Haenszel style analyses.
If the exact unconditional test is preferred: http://www.stat.ncsu.edu/exact/
(A minimal worked sketch of the mid-p calculation appears after the closing slide.)

Questions?
www.BCGinstitute.org
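As a companion to the "How Do You Compute the Mid-P?" slide, here is a minimal sketch in Python (SciPy assumed; the function and variable names are illustrative) of the recipe described there: take the two-tailed FET p-value and subtract half the hypergeometric probability of the observed table.

```python
# Hedged sketch of the mid-p recipe above: two-tailed FET p-value minus half
# the probability of the observed table (the quantity Excel's HYPGEOMDIST
# would return). Function and variable names are illustrative.
from scipy.stats import fisher_exact, hypergeom

def fet_and_midp(a, b, c, d):
    """a, b = group 1 pass/fail counts; c, d = group 2 pass/fail counts."""
    table = [[a, b], [c, d]]
    p_fet = fisher_exact(table, alternative="two-sided")[1]
    # Probability of the observed table given the fixed margins
    p_observed = hypergeom.pmf(a, a + b + c + d, a + b, a + c)
    return p_fet, p_fet - 0.5 * p_observed

# The 18-person example table from earlier in the deck (men 8/2, women 2/6)
p_fet, p_mid = fet_and_midp(8, 2, 2, 6)
print(f"FET (2-tail): {p_fet:.4f}   mid-p: {p_mid:.4f}")
```

On the example table this should print values matching the FET (0.0536) and mid-p (0.0392) figures shown on the "discreteness" slide; for routine use, running it only on results "on the cusp" of significance, as the slide suggests, keeps the extra work minimal.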