EP 521 Spring, 2004, Vol I, Part 5

§3.1 Sample Size Estimation

A key part of study design is the sample size or "power" calculation, required of every grant proposal.

In this section:
(1) We begin with the theory behind power calculations and show how simple formulae for power and sample size are derived.
(2) Next, we show a unified treatment of power for RD, OR, and RR based on this theory.
(3) Then, we describe how varying the question being asked can have a substantial effect on the required sample size.
(4) We briefly explain the information needed for power calculations in matched-pair studies.
(5) We demonstrate how to use and interpret software for power calculations.

Goals: to understand what affects power, how to define the problem, and how to get the computer to give you the answer you need.

§3.1 POWER IN GENERAL

Sample Size Estimation: Terminology Review

Null hypothesis (H0): a specified value for a parameter (OR, RR, RD, IRR, IRD, for example)
Alternative hypothesis (Ha): a specified alternative value for the parameter

Type I error = Pr(reject H0 | H0 is true) = α
Type II error = Pr(fail to reject H0 | Ha is true) = Pr(fail to reject H0 | H0 is false) = β
Power = Pr(reject H0 | Ha is true) = 1 − β
1 − α = ?

("Pr" signifies probability over repetitions of the study)
(References: Woodward, chap. 8; Rothman and Greenland, pp. 184-188)

Notes:
(1) The α-level is not a p-value. The p-value is a quantity computed from, and varying with, the data; α is fixed and is specified without seeing the data.
(2) The p-value is not Pr(H0 vs Ha). It is loosely defined as Pr(test statistic as or more extreme than the one observed | H0 true).
(3) The p-value is not Pr(data | H0); that is the likelihood. The likelihood is usually much smaller than the p-value, because the p-value includes not only Pr(observed data | H0) but also Pr(all other more extreme data configurations | H0).
(4) Absence of evidence is not evidence of absence.
Failing to reject H0 ≠ accepting H0 as true.
(5) Studies with too little power to produce results with appropriately narrow confidence intervals (as defined by the purpose of the study) are not "negative studies" — they are "indeterminate".

An initial description of what we are doing will help.

[Figure: two normal densities, one centered at the H0 value (0) and one at the Ha value (between 2 and 4), with a critical value at 2.]

Type I error (α): H0 is true but you reject H0 in favor of Ha. Suppose that 2 is your threshold (critical value) for rejecting H0. Then, if H0 is true, you have only a very small chance of observing a value to the right of 2, and a large chance of observing something to the left of 2.

Type II error (β): If Ha is true, then you have some chance of observing a value to the left of 2, below the critical value, but it is not great. You have a much larger chance of observing a value to the right of 2. How big a chance you have of observing a value at or to the right of 2, if Ha is true, depends on how far Ha is from H0. If Ha is far away, then power is larger and Type II error is smaller.

Now, what happens when the sample size increases (or the variance decreases)? The distributions become narrower. (This is the distribution of the mean, for example.) Holding everything else constant, what does that do to my power to detect a difference? At 2, I have little chance of falsely rejecting H0; this would be a very high critical value for rejecting H0. But if Ha is true, you have an almost certain chance of observing a value of at least 2, meaning that power is almost 1.0 and Type II error is almost 0.

[Figure: the same two densities, now much narrower because of the larger sample size.]

I can pick a vertical line (2, for example) to correspond to a Type I error; this is usually the case. Then I can posit what Ha is (3 or 4, say), and if the sample size tells me how broad the distributions of the effect size are under H0 and Ha, then I can estimate what the Type II error and power will be.
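The two qualitative effects just described — power rises as Ha moves away from H0, and as larger samples narrow the sampling distributions — can be sketched numerically. This is an illustrative sketch, not part of the notes: it assumes a one-sided z-test of a mean against H0: μ = 0 with known σ, and all numbers are our own.

```python
# Illustrative sketch (our own numbers): power of a one-sided z-test of
# H0: mu = 0 vs Ha: mu = mu_a, with known sigma.
from statistics import NormalDist

def power(mu_a, n, sigma=10.0, alpha=0.05):
    """Pr(observe a sample mean above the critical value | Ha is true)."""
    se = sigma / n ** 0.5
    crit = NormalDist().inv_cdf(1 - alpha) * se   # critical value on the raw scale
    return 1 - NormalDist(mu=mu_a, sigma=se).cdf(crit)

# Moving Ha farther from H0 raises power:
print(power(2, n=25), power(3, n=25), power(4, n=25))
# Increasing n (narrower sampling distributions) raises power:
print(power(2, n=25), power(2, n=100))
```

Running this shows both monotone effects at once, which is exactly the picture drawn above.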
Alternatively, I can specify the Type I error and the power (and thus the Type II error) and estimate just how close Ha can be to H0 while still achieving that level of power.

[Figure: Type I and Type II error regions. From: Methods in Observational Epidemiology by J.L. Kelsey, A.S. Whittemore, A.S. Evans, and W.D. Thompson, 1996, New York, Oxford University Press, p. 328.]

Power calculations are based on the sampling distribution of the difference (in means, proportions) between the groups being compared.

d = value of the "difference" [risk difference, log OR, difference in means, etc.] when the null is true (d = 0)
dc = critical value: the value of the difference that is just significantly different from d at significance level α
d* = value of the difference when the null is false

Some key numbers to remember for sample size calculations (for purposes of this presentation):

Quantity            Interpretation                     Value
Z_{α/2}             Type I error of 0.05               1.96
Z_β                 Type II error 0.2 (80% power)      +0.84
Z_β                 Type II error 0.1 (90% power)      +1.28
(Z_{α/2} + Z_β)²    Type I = 0.05; Type II = 0.2       7.85
(Z_{α/2} + Z_β)²    Type I = 0.05; Type II = 0.1       10.5

Some texts refer to Z_β as Z_{1−β} and to Z_{α/2} as Z_{1−α/2} and thus have slightly different formulae.

SO:
1. When the null is false (Ha is true), we are sampling from the distribution on the right. Values to the left of dc occur with probability β and represent the probability of inappropriately failing to reject H0. The area to the left of dc, when d* is true, = Type II error = Pr(fail to reject H0 | Ha is true).
2. Values to the right of dc in the shaded area represent the probability, α/2, of rejecting H0 when we should fail to reject (since H0 is true).
3.
Values to the right of dc, forming part of the distribution of d*, represent the power to detect a true difference: Pr(reject H0 | H0 false) = 1 − β.

Using the standard normal:

dc = d + Z_{α/2}·se(d)     (Eq 5.1)
dc = d* − Z_β·se(d*)       (Eq 5.2)

where Z_{α/2} is the standard normal deviate corresponding to the position of dc on the distribution around d, and Z_β is the standard normal deviate corresponding to the position of dc on the distribution around d*.

e.g., β = 0.1 = Type II error, (1 − β) = 0.9, Z_{1−β} = 1.28, Z_β = −1.28.

Think in terms of flipping the Ha distribution over, so that we read z's in the Ha distribution from right to left rather than the usual left to right.

[Figure: standard normal density with −1.28, 0, and +1.28 marked, showing dc relative to the distributions around d and d*.]

Point: use +1.28 for β = 0.1.

Then, setting Eq 5.1 = Eq 5.2 and solving for Z_β, we get:

Z_β = (d* − d − Z_{α/2}·se(d)) / se(d*)     (Eq 5.3)

Usually we assume se(d) = se(d*) and simplify:

Z_β = (d* − d)/se(d*) − Z_{α/2}     (Eq 5.4)

Note: Z_β can range from −∞ to +∞.

If, as is usual, d = 0, then

Z_β = d*/se(d*) − Z_{α/2}     (Eq 5.5)

What if d* = d = 0? Then Z_β = −Z_{α/2}, and power is 0.025 in each tail (for α = 0.05). (Makes sense — we reject only falsely.)

Using the simple Eq 5.5, we can arrive at a series of simple formulae for power and sample size calculations.
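Eq 5.5 is simple enough to sketch directly in code. This is a minimal sketch, assuming Python with only the standard library; the function names are ours, not part of the notes.

```python
# A minimal sketch of Eq 5.5 and the resulting power; assumes d = 0.
# Function names (z_beta, power_from_eq55) are ours, not from the notes.
from statistics import NormalDist

def z_beta(d_star, se_d_star, alpha=0.05):
    """Eq 5.5: Z_beta = d*/se(d*) - Z_{alpha/2}."""
    z_a2 = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    return d_star / se_d_star - z_a2

def power_from_eq55(d_star, se_d_star, alpha=0.05):
    """Power = Phi(Z_beta): area below Z_beta under the standard normal."""
    return NormalDist().cdf(z_beta(d_star, se_d_star, alpha))

# Sanity check from the text: when d* = 0, Z_beta = -1.96 and power
# collapses to 0.025 (one tail) -- we reject only falsely.
print(round(power_from_eq55(0.0, 1.0), 3))   # 0.025
```

Every power formula that follows is this function with a different expression plugged in for se(d*).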
§3.2 Power and sample sizes in case-control and cohort studies

Methods of Sampling and Estimation of Sample Size

Definitions of symbols used in the equations for calculating power and required sample size:

Symbol   Definition
d*       Non-null value of the difference in proportions or means (i.e., the magnitude of difference one wishes to detect)
n        In a cohort or cross-sectional study, the number of exposed individuals studied; in a case-control study, the number of cases
r        In a cohort or cross-sectional study, the ratio of the number of unexposed individuals studied to the number of exposed individuals studied; in a case-control study, the ratio of the number of controls studied to the number of cases studied
σ        Standard deviation in the population for a continuously distributed variable
p1       In a cohort (or cross-sectional) study, the proportion of exposed individuals who develop (or have) the disease; in a case-control study, the proportion of cases who are exposed
p0       In a cohort (or cross-sectional) study, the proportion of unexposed individuals who develop (or have) the disease; in a case-control study, the proportion of controls who are exposed
p̄        p̄ = (p1 + r·p0)/(1 + r) = weighted average of p1 and p0

(Ref: Kelsey et al. 1996, Table 12-11.)

So, when n is fixed by costs, time, etc., we can use power calculations.

Initial derivation of Eq 5.6 from Eq 5.5. Recall the variance of a difference in means (assuming independence):

Var(A − B) = Var(A) + Var(B)

Assuming a common standard deviation:

var(d*) = σ²(1/n1 + 1/n2)

Here we know n2 = r·n1, so

var(d*) = σ²(1/n1 + 1/(r·n1)) = σ²·(r + 1)/(r·n1)

So se(d*) = σ·√((r + 1)/(r·n1)).

Therefore, Z_β for a difference in means:

Z_β = (d*/σ)·√(n·r/(r + 1)) − Z_{α/2}     (Eq 5.6)

Z_β for a difference in proportions:

Z_β = d*·[ n·r / ((r + 1)·p̄(1 − p̄)) ]^{1/2} − Z_{α/2}     (Eq 5.7)

or, equivalently,

Z_β = [ n·(d*)²·r / ((r + 1)·p̄(1 − p̄)) ]^{1/2} − Z_{α/2}

Equivalent!
Recall Var(p̂) = p(1 − p)/n; substituting √(p̄(1 − p̄)) for σ above gives Eq 5.7.

Note: we have defined d* as the risk difference (RD). We can express the RD in terms of either the RR or the OR together with the baseline risk p0.

For RR: RR = p1/p0, so p1 = p0·RR and

d* = p0·RR − p0 = p0(RR − 1)

For OR: OR = [p1/(1 − p1)] / [p0/(1 − p0)], so

p1 = p0·OR / (1 + p0(OR − 1))   and   d* = p0·OR / (1 + p0(OR − 1)) − p0

We may have a specific OR or RR in mind and need to know the implied value of p1.

So we have a (1) simple and (2) unified approach for (a) sample size and (b) power calculations for (i) RD, (ii) RR, or (iii) OR, as well as for differences in means.

Example #1: Cohort design. Does smoking during pregnancy show an association with increased risk of low birth weight in offspring?

Known facts:
1. The prevalence of smoking during pregnancy is about 25%, i.e., 3 non-smokers for each smoker. So r = 3 if we just pick a cohort at random and follow them.
2. The overall incidence of low birth weight (≤ 2500 gm) is ~7%.

Suppose we have the time and dollars to study 1200 births. We expect 1200/4 = 300 births exposed to smoking during gestation (n = 300). Suppose we want to measure the difference in risk (proportions of low birth weight babies) and we want to detect a difference of 4% (= d*). What is the power to detect this difference?

We must compute p0 and p1 from the overall incidence of LBW = 0.07, which is simply a weighted average of the risks among smokers and non-smokers:

0.07 = (0.25)(p0 + 0.04) + (0.75)(p0)   [smokers + nonsmokers], because p1 = p0 + 0.04

Now solve for p0:

p0 = 0.06, p1 = 0.10

p̄ = (p1 + r·p0)/(1 + r) = (0.10 + 3(0.06))/(1 + 3) = 0.07, where r = unexposed/exposed = 3/1

For α = 0.05:

Z_β = [ n(d*)²·r / ((r + 1)·p̄(1 − p̄)) ]^{1/2} − Z_{α/2}
    = [ 300(0.04)²(3) / ((3 + 1)(0.07)(0.93)) ]^{1/2} − 1.96 = 0.39

For Z_β = 0.39, power = 0.652.
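The Example #1 arithmetic can be reproduced with a short function implementing the proportions form of Eq 5.7. This is a sketch in Python (standard library only); the function name and argument names are ours, not from the notes.

```python
# Power for a difference in proportions (Eq 5.7, normal approximation).
# n = number of exposed (or cases); r = ratio of the other group's size to n.
from statistics import NormalDist

def power_two_proportions(n, r, p1, p0, alpha=0.05):
    d_star = p1 - p0
    p_bar = (p1 + r * p0) / (1 + r)          # weighted average of p1 and p0
    z_a2 = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = (n * d_star**2 * r / ((r + 1) * p_bar * (1 - p_bar))) ** 0.5 - z_a2
    return NormalDist().cdf(z_b)

# Example #1: n = 300 exposed, r = 3, p1 = 0.10, p0 = 0.06
print(round(power_two_proportions(300, 3, 0.10, 0.06), 2))   # 0.65
```

The same function applies unchanged to a case-control design, with n the number of cases and r the control:case ratio.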
This is depicted on the normal density plot on the next page: the shaded area, from left to right under the curve, represents the cumulative normal from negative infinity to +0.39. Note that this power plot is just the same as the prior plot (page 4), except that we are now depicting power from left to right instead of from right to left under the normal density. Be careful: we want the cumulative probability.

What do the power calculation programs produce?
1. STATA sampsi gives 0.592.
2. Stplan gives 0.606 (it uses the arcsine transformation).
3. nQuery Advisor gives 0.63.

R. Localio

Example #2: Case-control study of smoking during pregnancy and low birth weight in offspring. Using the same numbers as before:

Case = giving birth to a low birthweight baby
Control = giving birth to a "normal" birthweight baby (e.g., ≥ 2501 gm)

For p0, we will use the overall prevalence of smoking (the EXPOSURE) in the general population of pregnant women [because cases are a small minority], i.e.,

p0 = proportion of controls who are exposed = 0.25 (as before)

We want to detect OR = 1.8; we can study 175 cases; we plan a control:case ratio of r = 2.

Solve for p1:

p1 = p0·OR / (1 + p0(OR − 1)) = (0.25)(1.8) / (1 + (0.25)(1.8 − 1.0)) = 0.375

d* = p1 − p0 = 0.375 − 0.250 = 0.125

p̄ = (p1 + r·p0)/(1 + r) = (0.375 + 2(0.25))/(1 + 2) = 0.292

Z_β = [ (175)(0.125)²(2) / ((2 + 1)(0.29166)(0.70834)) ]^{1/2} − 1.96 = 1.01

Power = 84.4% to detect OR = 1.8. This result means that the two distributions, one for H0: OR = 1.0 and the other for Ha: OR = 1.8, do not overlap very much (see the figure on page 4).

NOTES:
1) n = 175 cases, so total sample size = 175 + 350 = 525.
2) In the cohort study, we had p1 = 0.10 and p0 = 0.06, which gives OR = 1.74. The cohort study needed 1200 births.
3) Everything is re-expressed as a difference in proportions (or means).
4)
We need to know:
a) The exposure prevalence in the population (for a case-control or cohort study)
b) The disease incidence in the population (for a cohort study)
c) The desired "effect size" (the "clinically important" difference)
d) Minor notational and other differences may be found in different texts, e.g.,

p̄(1 − p̄)(1/n1 + 1/n2) replaced by p1(1 − p1)/n1 + p2(1 − p2)/n2

Sample Size: solve for n.

For means, from Eq 5.6,

Z_β + Z_{α/2} = (d*/σ)·√(n·r/(r + 1)),

so

(Z_β + Z_{α/2})²·σ² = (d*)²·n·r/(r + 1)

and then

n = (Z_β + Z_{α/2})²·σ²·(r + 1) / ((d*)²·r)

For proportions, from Eq 5.7,

(Z_β + Z_{α/2})² = n·(d*)²·r / ((r + 1)·p̄(1 − p̄))

n = (Z_β + Z_{α/2})²·p̄(1 − p̄)·(r + 1) / ((d*)²·r)     (Eq 5.9)

There are some common values for given levels of power and Type I error.

Tables for common values of key parameters (Kelsey et al., 1996, Table 12-16, p. 333):

Values of (Z_{α/2} + Z_β)² for frequently used combinations of significance level and power

Significance level α   Power (1 − β)   (Z_{α/2} + Z_β)²
0.01                   0.80            11.679
0.01                   0.90            14.879
0.01                   0.95            17.814
0.01                   0.99            24.031
0.05                   0.80            7.849
0.05                   0.90            10.507
0.05                   0.95            12.995
0.05                   0.99            18.372
0.10                   0.80            6.183
0.10                   0.90            8.564
0.10                   0.95            10.822
0.10                   0.99            15.770

So 7.85 and 10.5 are the key values to remember.

Another example: Case-control study of smoking and low birth weight.

We want OR = 1.8 to be detectable, with power = 90% and α = 0.05.
(Recall) p̄ = 0.292, (1 − p̄) = 0.708, d* = 0.125, r = 2 (plus the other prevalence assumptions).

Thus:

n = (10.507)(0.29166)(0.70834)(3) / ((0.125)²(2)) = 208.4 → 209

Total n = 209 cases + 418 controls = 627. [Remember that 175 cases gave 84% power.]

§3.3 Special Concerns in Power (Sample Size) Calculations

Worries:
1. Measurement error
2. Those selected/invited vs. those who agree to participate.
   For example, if we need 500 subjects and expect to enroll 80% of those invited: (0.8)(x) = 500, so x = 500/0.8 = 625 must be invited.
3. Censoring
   Loss to follow-up
   Causes of death other than the cause of interest
   Many assumptions are involved in the calculations for such studies.

§3.3.1 Measurement Error: effect on power
(Refs: Armstrong et al. 1992; Kelsey et al. 1996, ch. 13)

Where errors can occur:
- Exposure variables (the most common worry)
- Disease (outcome) classification
- Confounding factors or covariates

Effect of nondifferential error (misclassification or measurement): commonly (although not always) biases or attenuates the measure (effect size) towards the null.

Effects of nondifferential error in exposure on sample sizes [in simple cases]:
- The observed effect size is smaller than the true effect size, i.e., it takes more power to demonstrate a given true effect (the observed effect will be closer to the null): the effect of bias.
- Confidence intervals for corrected measures of effect size are wider than if exposure were measured without error: the effect of variance.

Effects of nondifferential error in confounders: the effect size can be biased in either direction.

Remedies for measurement error in planning studies:
- Estimate measurement error from pre-existing data
- Use tables on attenuation bias (Kelsey, Armstrong)
- If the error is not known, plan a validation substudy (complex)
- Plan on multiple measurements of subjects

For estimating the impact of nondifferential error, estimate the sensitivity and specificity of the observed exposure:

                    True exposure
                     +       −
Observed     +       a       b
exposure     −       c       d

Then Sn = Pr(O+ | True+) = a/(a + c)
     Sp = Pr(O− | True−) = d/(b + d)
     Prevalence of exposure = (a + c)/(a + b + c + d)

Effect on the Odds Ratio of Nondifferential Error in the Measurement of a Binary Exposure Variable (Kelsey et al., 1996, page 350). The entries in the body of the table are the attenuated values of the odds ratio resulting from the effects of the nondifferential error
in measuring exposure. Classification in terms of disease status is assumed to be error-free.

§3.3.2 How many controls per case

What should the value of r be? r = the ratio of controls/cases, or unexposed/exposed.

In practice, the number of cases in a case-control study is the total number available, so we can't get any more than there are. Then we can increase power by increasing r (i.e., taking more controls), BUT!! precision does not increase much beyond r = 3 or 4 (when c = 1).

Summary: We have a unified method for computing power and sample size for different parameters (RR, OR, RD, difference in means). They all depend on the tradeoffs between Type I and Type II error, the assumed differences (or ratios) of the means (or proportions), the standard deviation of the distributions (in the case of differences in means), and the sample size. Power calculation programs do this work for us, but we need to understand what we are asking of those programs.

§3.4 The Fallacy of the Post-hoc Power Calculation
(see Berlin & Goodman, 1994)

Suppose σ = 10 (σ² = 100) and N = 50 subjects per group. We have done a study comparing the effects of two drugs on a continuous outcome measure with the above variance. The result of the study is that the difference between the means of the two groups is 4 units. (The two groups are independent.)

We do a Z-test (known variance):

var(x̄1 − x̄2) = var(x̄1) + var(x̄2) = σ²/n + σ²/n = 100/50 + 100/50

Z = (x̄1 − x̄2) / √(100/50 + 100/50) = 4/2 = 2

So the test (either Z or t) would barely reject H0 of no difference in means at the α = 0.05 level.

Now, suppose that the planned detectable difference = 3.0 with 80% power and α = 0.05, but after the experiment we observe a difference = 2.0, with CI = 0 to 4. This result means that you happened to observe an effect size in the sample that is lower than the true effect size in the hypothesized population.
We must always distinguish:
(1) The hypothesized true (but unobserved) population
(2) The actual observed sample from that population

Each sample from the true population will differ somewhat and will have a different estimated effect size. If you hypothesized a large difference and you found only a small difference, then you are "out of luck". Too bad. Your p-value will likely be > 0.05.

Q: What was the power to detect a difference of 4 units given N = 50 per group (i.e., r = 1) and σ = 10?

Z_β = (d*/σ)·√(n·r/(r + 1)) − Z_{α/2} = (4/10)·√(50·1/2) − 1.96 = (0.4)(5) − 1.96 = 0.04

So power = 0.50 or 0.51.

So if the power was so low, how did we detect a difference? Meaningless question: the d* in the formula relates to the hypothetical mean of an alternative distribution, not to an observed result. An observed result will always have (1 − β) < 0.5 if the finding is "not significant". In other words, if z < +1.96, then power is < 0.5. So if we observe an "NS" finding, we would always say the study is underpowered. But we do not know what we'll find until after the experiment. In short, d* − d = d* has been replaced by d_OBS − d = d_OBS. There is no place for power after observing d_OBS.

Ref: Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. American Statistician. 2001;55(1):19-24.

§3.5 Sample Sizes for Confidence Intervals

Sample size for a single mean or proportion. Let L = the margin of error within which you want to estimate the mean or proportion (half the width of the CI).

Then for a MEAN:

L = Z·se = Z·σ/√n   (Z = 1.96 for a 95% CI)
L² = Z²σ²/n
n = Z²σ²/L²

For a PROPORTION (e.g., sensitivity and specificity):

n = Z²·p(1 − p)/L²

where Z is the standard normal value (two-sided) corresponding to the desired proportion of the time that the estimate is to be within the desired margin of error.

Example: Suppose you want to estimate the proportion of people with high cholesterol (> 200 mg LDL) within 4 percentage points.
You guess that the proportion will be around 40%. Then:

n = (1.96)²(0.4)(0.6) / (0.04)² = 576

With this n, there is a 95% probability (before doing the study) that the estimate obtained will be within 4 percentage points of the population value. This calculation does not address the situation in which you want to "rule out" a true value above (or below) a particular hypothesized value (see later).

p̂ ± 1.96·√(p̂(1 − p̂)/n)

"Worst case" for proportions, when you have no idea what p will be: use 0.5. For the example:

n = (1.96)²(0.5)(0.5) / (0.04)² = 601 (not much bigger)

Suppose you wanted ±3%? Then, again letting p = 0.5 for the sample size calculation:

n = 1068 (much bigger)

This is how the pollsters give you their "±" numbers and compute n: for p = 0.5 and L = ±0.04, the 95% CI is 0.5 (0.46, 0.54).

Suppose you think p will be around 0.001 and you want ±0.0005. This is a small proportion!

n = (1.96)²(0.001)(0.999) / (0.0005)² = 15,352 — a big study.

(Cancer rates, etc., are this low.)

But these calculations on the width of the CI fail to consider the uncertainty of the observed point estimate (e.g., the estimated OR), even when the true OR is fixed. They assume you will be satisfied with this CI wherever it is centered. The following examples show how that assumption might not hold.

§3.5 (continued) Sample Sizes for Confidence Intervals

The same confidence interval question might be answered differently.

Question #1: Suppose you want to ensure that your estimate of sensitivity (Sn) will have a two-sided confidence interval of ±5 percentage points. Assume you think that you will observe Sn = 0.9. How many subjects with disease do you need to produce a confidence interval of (0.85 to 0.95)?
n = z²·p(1 − p)/L²

If z = 1.96, p = 0.9, and L = 0.05, then n = 3.84 × 0.9 × 0.1 / 0.0025 ≈ 138.

Question #2: Suppose you want to ensure that, whatever observed Sn you find after your experiment, you can rule out, by means of a 95% confidence interval around the estimate, a true Sn < 0.85. How many subjects with disease do you need to ensure with 80% power that the lower confidence bound is at least 0.85?

This second question is different. It can be viewed as a hypothesis test. How can we calculate this? Here is the STATA code and output for that question:

. sampsi .9 .85, power(0.8) onesample

Estimated sample size for one-sample comparison of proportion
   to hypothesized value

Test Ho: p = 0.9000, where p is the proportion in the population

Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
 alternative p =   0.8500

Estimated required sample size:

             n =      316

This number is much larger. For question #1, you assume that you will observe Sn = 0.90; all you want to know is how wide the resulting CI will be. But for question #2 you are assuming only that the true Sn = 0.9, and that the observed Sn might vary randomly around that true value. So your observed Sn might be smaller than 0.9! You must build in extra power so that, whatever you observe, the lower bound of your confidence interval will be at least 0.85. (Simulations confirm this second result.)

Correspondence between these two different questions: if in STATA one sets the alternative hypothesis (Ha) at the end of the confidence interval and stipulates power = 0.5, then the sample size is the same as for question #1, i.e., n = 138.

Question #3: Suppose you want to ensure that, whatever observed Sn you find after your experiment, you can rule out a true Sn < 0.85 and show p < 0.05. How many subjects with disease do you need to ensure with 80% power that the lower confidence bound of a one-sided 95% confidence interval is at least 0.85?
This amounts to a one-sided, one-sample test:

. sampsi .9 .85, power(0.8) onesample onesided

Estimated sample size for one-sample comparison of proportion
   to hypothesized value

Test Ho: p = 0.9000, where p is the proportion in the population

Assumptions:

         alpha =   0.0500  (one-sided)
         power =   0.8000
 alternative p =   0.8500

Estimated required sample size:

             n =      253

Question #4: A fourth type of confidence interval problem: the predicted CI when planning experiments.
(Reference: Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200-206.)

Problem: Evaluating a medical treatment with a 45% cure rate. A proposed surgical alternative must have a higher cure rate — 70%+ — to offset the higher risk of surgical morbidity.

Difference = 0.70 − 0.45 = 0.25 (25 percentage points)

Question: if we design a study with 90% power to detect a difference of this size (or larger), what is the predicted confidence interval going to be?

Assume α = 0.05, two-sided.

Step 1: Compute the sample sizes n1 and n2 for each group to achieve 90% power to detect a difference of 0.25:

n = (z_{α/2} + z_β)²·p̄(1 − p̄)·(r + 1) / ((d*)²·r)
  = (10.507)(0.575)(0.425)(2) / (0.25 × 0.25) = 82

Step 2: Compute the predicted confidence interval:

Predicted 95% CI = observed difference ± 0.6·Δ_{0.90}

where Δ_{0.90} = the true difference for which there is 90% power = 0.25. So:

Predicted CI = observed difference ± 0.6 × 0.25 = observed difference ± 0.15

So the predicted CI for this problem will be 0.30 wide. Thus, if the observed difference is 0.15, the lower bound of the CI will be exactly 0.0.

The same result holds when using the alternative formula, ±0.7·Δ_{0.80}, given the same set of facts. If there is 80% power to demonstrate a risk difference of 0.25, then one would expect the predicted confidence interval to be wider: 0.7 × 0.25, or ±0.175.
(See Goodman and Berlin 1994 for the derivation.)

Why is this so? As the power increases (50%, 80%, 90%), the resulting predicted confidence interval gets narrower, holding the observed risk difference constant. So:
(a) Compute the sample size to detect a risk difference at a given power level.
(b) Use the simple formula to predict the confidence interval.

§3.6 Relative size of (a) the standard deviation and (b) the desired effect size on power and sample sizes

Suppose, in any of these situations, you have no idea what σ² will be. E.g., comparing two means:

n = (Z_{α/2} + Z_β)²·σ²·(r + 1) / ((d*)²·r)

We can always say that we would like to detect a difference of, say, one (or 0.5, or whatever) standard deviations, e.g., d* = ±σ.

Thus, for r = 1 (for example):

n = (Z_{α/2} + Z_β)²·σ²·2 / σ² = 2(Z_{α/2} + Z_β)²   (small)
n ≈ 2 × 7.85 ≈ 15.7

For d* = 0.5σ and r = 1:

n = (Z_{α/2} + Z_β)²·σ²·2 / (0.5σ)² = 2(Z_{α/2} + Z_β)² / 0.25 = 8(Z_{α/2} + Z_β)²
n ≈ 8 × 7.85 ≈ 62.8

(The sample size gets big quickly.) Note that this all depends only on the ratio σ²/(d*)². (The same reasoning can be applied to a single mean.) So the sample size depends on the standard deviation relative to the desired difference to be demonstrated.

Formulae differ according to textbook and sample size program. The formula above for comparing groups is approximate (but is used in many texts). A "more exact" form (Fleiss, p. 41), for one control per case, is

n = [ z_{α/2}·√(2p̄q̄) + z_{1−β}·√(p0q0 + p1q1) ]² / (p1 − p0)²   (per group)

where p̄ = (p1 + p0)/2 and q̄ = 1 − p̄ (remember, r = 1).

Tables are also available for common combinations of p and power.

Fleiss JL. Statistical Methods for Rates and Proportions, 2nd Edition. New York: John Wiley & Sons, Inc.; 1981: 262.

Always note, however, when using formulae from texts, that each author might define the terms differently and therefore have slightly different formulae.
For example, the Schlesselman formula (p. 145):

n = [ Z_α·√(2p̄q̄) + Z_β·√(p1q1 + p0q0) ]² / (p1 − p0)²

NOTE: this text calls Z_α = +1.96 (i.e., the two-sided value).

where:

p̄ = ½(p1 + p0) and q̄ = 1 − p̄
p1 = p0·R / [1 + p0(R − 1)], q1 = 1 − p1, q0 = 1 − p0
R = the odds ratio

A formula that is simpler than the ones above (for r = 1, i.e., two equal-sized groups), and for practical purposes equivalent, is given by

n = 2p̄q̄·(Z_α + Z_β)² / (p1 − p0)²

Corresponding to α = 0.05 (two-sided) and β = 0.10, one has Z_α = 1.96 and Z_β = 1.28, so that the equation reduces to a particularly simple formula:

n = 21·p̄q̄ / (p1 − p0)²

Look at the huge sample sizes (for r = 1) needed to detect small differences, especially when the baseline risk is low.

[Tables from: Case-Control Studies: Design, Conduct, Analysis by James J. Schlesselman, New York, 1982, Oxford University Press, Appendix A.]

Summary of different results: What is important from the tables, such as the one from Kelsey, is that you can see just how severe the penalties are when one wants to demonstrate small effects. Consider the joint effects of (a) increasing power and (b) decreasing the size of the true OR.

§3.7 Sample Sizes for Matched Studies

§3.7.1 Frequency matching — as in a stratified design

First we consider frequency matching. The formulae for stratified (frequency-matched) and individually matched studies are in Schlesselman's book (CC Studies, p. 159).

Recall that our estimates of the MH OR are based on weighted estimates of the stratum-specific ORs. This is the corresponding method of arriving at a sample size for a stratified design — a way to incorporate strata, or a confounding factor, into the estimation of power or sample size.

We must specify the following parameters, assuming we have J strata:
1. p0j = exposure prevalence among controls in the jth stratum
2.
f_j = fraction of the total observations in stratum j, where Σ_j f_j = 1.0
3. Type I error
4. Power
5. Assumed true effect size (RR = OR, in this case)

Assume: equal numbers of cases and controls in each stratum, and a constant RR (OR) across strata (no effect modification).

Required total number of "cases":

n = (Z_{α/2} + Z_β)² / (Σ f_j·g_j)

where (using q = 1 − p)

g_j = (ln OR)² / [ 1/(p0j·q0j) + 1/(p1j·q1j) ]   and   p1j = p0j·OR / (1 + p0j(OR − 1))

The formula is essentially a weighted sum of d* and var(d*) from our general sample size/power formula.

Example: OC use and MI (Schlesselman, Table 6.5, p. 160). Hypothesized effect size R = 3, α = 0.05, β = 0.10 (power = 0.9).

Age     f_j    p0j    p1j    g_j     f_j·g_j
25-29   .03    .22    .46    .122    .0037
30-34   .09    .08    .21    .062    .0056
35-39   .16    .07    .18    .055    .0088
40-44   .30    .02    .06    .018    .0054
45-49   .42    .02    .06    .018    .0076
Total   1.00                         .0311 = Σ f_j·g_j

(f_j sums to 1.00 by definition.)

f_j = .42 is where we have the most cases (age category 45-49); p0j = .22 is where most of the exposure is (age category 25-29).

Then the required number of cases is

N = (1.96 + 1.28)² / 0.0311 ≈ 338   (Schlesselman reports 328; the difference reflects rounding in the tabulated g_j)

The reason for the frequency matching is efficiency, in the context of the case of myocardial infarction and oral contraceptive use. Most cases are in what age group? Most exposure is in what age group?

§3.7.2 Pair-Matched Studies (Schlesselman §6.6, pp. 160 ff.)

There are special methods of computing power for matched studies. We consider first the simplest situation: 1-to-1 matching. (But "matched" studies can also have multiple controls per case.)

The number of discordant pairs (= m) required to detect a relative risk (RR) is given by

m = [ Z_{α/2}/2 + Z_β·√(P(1 − P)) ]² / (P − ½)²

where

P = OR/(1 + OR) ≈ RR/(1 + RR)

So here we are assuming that OR ≈ RR. We work with the OR because it is the ratio of the frequencies of the discordant pairs. (We then make the assumption that the OR is a good estimate of the RR.)
Here P = u10 / (u10 + u01) in the paired-data table. (See the notation in Vol I, Part 4.) This P must be distinguished from p0 and p1, the risks of exposure among the controls and the cases.

Derivation of the sample size formula for McNemar's test:

Recall that McNemar's test is equivalent to a test of a binomial proportion, where the proportion is the fraction of discordant pairs that fall, for example, in the u10 cell of the 2-by-2 table of paired data. This was shown in Vol I, Part 4. We can use this relationship, and a version of the sample size formula we have seen before, to show the correspondence between the previous formulae and the ones specifically suited to matched-pair case-control studies. Details appear in Schlesselman (pp. 145, 161). (These calculations can be done by computer: PS, Power and Precision, or PASS, for example.)

Ho: p = 1/2 (OR = 1), where p̂ = U10 / (U10 + U01) and m = U10 + U01.

The standard sample size formula for a one-sample binomial test is

n = [ zα/2 √( p0(1 - p0) ) + zβ √( p(1 - p) ) ]² / ( p - p0 )²

With p0 = 1/2, this becomes

n = [ zα/2 (1/2) + zβ √( p(1 - p) ) ]² / ( p - 1/2 )²

Letting m = n, we have derived the formula for the number of discordant pairs. Note: the denominator corresponds to d* from before, because we have expressed the OR in terms of p, and we are essentially doing the calculation for the difference between the desired OR and OR = 1 (the null).

Estimating the number of discordant pairs. We do this from our estimate of the risk of exposure in the control group. Let pe = the probability of an exposure-discordant pair and M = the total number of pairs needed to yield m discordant pairs; then M = m / pe. This probability will depend in part on the baseline risk of exposure among the controls, on the odds ratio that we are trying to demonstrate, and on the skill (or lack thereof) in selecting matching criteria.
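As a numerical check on this derivation (function names are mine), substituting p0 = 1/2 into the general one-sample binomial formula reproduces the discordant-pairs formula exactly:

```python
import math

def n_binomial(p, p_null, z_a=1.96, z_b=1.28):
    """General one-sample binomial sample size formula."""
    num = z_a * math.sqrt(p_null * (1 - p_null)) + z_b * math.sqrt(p * (1 - p))
    return num ** 2 / (p - p_null) ** 2

def m_discordant(P, z_a=1.96, z_b=1.28):
    """Discordant pairs: m = [z_a*(1/2) + z_b*sqrt(P(1-P))]^2 / (P - 1/2)^2."""
    return (z_a * 0.5 + z_b * math.sqrt(P * (1 - P))) ** 2 / (P - 0.5) ** 2

# With the null value p0 = 1/2, the two formulas coincide:
P = 2 / 3  # e.g., OR = 2
assert abs(n_binomial(P, 0.5) - m_discordant(P)) < 1e-9
```

This makes the correspondence concrete: the matched-pair formula is just the binomial formula with the null proportion fixed at 1/2.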
First, consider the baseline case of estimating what fraction of the matched pairs will be informative, i.e., what fraction will be discordant pairs. Although pe depends on the matching criteria, using the notation from McNemar's test, the matched pairs can be displayed in the following table:

                 Control
                 E      Ē
CASE      E      u11    u10
          Ē      u01    u00

pe = Pr(exposure-discordant pair). By definition:

pe = Pr(U10) + Pr(U01)
   = Pr(E|case) Pr(no E|ctrl) + Pr(no E|case) Pr(E|ctrl)
   = p1 (1 - p0) + (1 - p1) p0

Note: this is an approximation, because pe depends on the matching criteria (which include factors other than E).

We can compute p1, the proportion of exposed cases, from the OR and the value of p0, the proportion of exposed controls, using the formula for the OR:

p1 = p0 (OR) / [ 1 + p0 (OR - 1) ]

Then q0 = 1 - p0 and q1 = 1 - p1, and

M = m / pe = m / ( p0 q1 + p1 q0 ) = sample size needed.

But there might be other reasons for assuming that the true percentage of usable discordant pairs is actually smaller than what we might expect.

Example: a pair (1-to-1) matched study of OC use and congenital heart disease, with α = 0.05, β = 0.1.

We think p0 = 0.03, i.e., a 3% risk of exposure among the population of controls (so, a rare exposure), and we want to detect OR = 2. From the relationship among the OR, p0, and p1 = Pr(E|case):

p1 = (.03)(2) / [ 1 + .03(1) ] = .058,  because OR - 1 = 2 - 1 = 1

and P = OR/(1 + OR) = 2/3, so (1 - P) = 1/3. Then, from the formula derived from McNemar's test,

m = [ 1.96 (1/2) + 1.28 √( (2/3)(1/3) ) ]² / ( 2/3 - 1/2 )² = 90 discordant pairs.

Then, to estimate the total number of pairs:

pe = prob(discordant pair) = p0 q1 + p1 q0 = (.03)(.942) + (.058)(.97) = .028 + .056 = .084

Then M = m / pe = 90 / .084 = 1071 matched pairs.

What happens with other combinations of parameters?
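The two steps of this example (m from the McNemar-based formula, then M = m/pe) can be sketched in Python (function name is mine). Carrying full precision gives m ≈ 90.3 and M ≈ 1065; the 1071 above comes from using the rounded intermediate values .058 and .084:

```python
import math

def matched_pair_sizes(p0, OR, z_a=1.96, z_b=1.28):
    """Return (m, M): discordant pairs and total pairs for a 1:1 matched design."""
    P = OR / (1 + OR)                    # Pr(case is the exposed one | discordant pair)
    m = (z_a * 0.5 + z_b * math.sqrt(P * (1 - P))) ** 2 / (P - 0.5) ** 2
    p1 = p0 * OR / (1 + p0 * (OR - 1))   # Pr(E | case)
    pe = p0 * (1 - p1) + p1 * (1 - p0)   # Pr(exposure-discordant pair)
    return m, m / pe

m, M = matched_pair_sizes(p0=0.03, OR=2)  # m ~ 90.3, M ~ 1065
```

Varying p0 and OR in this function generates the kind of sensitivity table shown next.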
alpha  Za/2  power  Zb    OR   P     p0=Pr(E|ctrl)  r  m      p1=Pr(E|case)  q0=1-p0  q1=1-p1  pe     M     DuPont
0.05   1.96  0.9    1.28  2    0.67  0.03           1  90.34  0.06           0.97     0.94     0.086  1046  1066
0.05   1.96  0.9    1.28  2    0.67  0.1            1  90.34  0.2            0.9      0.8      0.26   347   368
0.05   1.96  0.9    1.28  2    0.67  0.2            1  90.34  0.4            0.8      0.6      0.44   205   266
0.05   1.96  0.9    1.28  2    0.67  0.5            1  90.34  1              0.5      0        0.5    181   181
0.05   1.96  0.9    1.28  2.5  0.71  0.03           1  52.93  0.075          0.97     0.925    0.101  527   543
0.05   1.96  0.9    1.28  3    0.75  0.03           1  37.7   0.09           0.97     0.91     0.115  329   343

So one can see that M depends heavily on the probability of exposure among the controls, as well as on the OR that one assumes is present in truth. The column labeled M gives the results from this calculation; the DuPont numbers in the right-hand column are from the program "PS" written by DuPont and Plummer.

We are making assumptions about p0, p1, the OR, and the matching factors: that Pr(exposed) for the members of each pair is independent and constant (homogeneity of Pr(E) across pairs). If our matching is less than optimal, and we have overmatched to some extent, then the Pr(exposure) for the case and the control in each pair will tend to be similar, resulting in a larger number of "noninformative" pairs. The program by DuPont and Plummer allows the user to adjust for this correlation of exposure.

We can reverse this process and estimate power for a given number of discordant pairs (ref: Schlesselman, p. 162):

zβ = [ -zα/2 (1/2) + √m ( P - 1/2 ) ] / √( P(1 - P) )

where power = Pr(Z ≤ zβ) and m is the number of discordant pairs (as before). So zβ = 1.28 is equivalent to power = 0.9.

Notes:
1. Can estimate m from M by m = M pe.
2. Better to estimate pe from preliminary data, or revise it after initial data collection.
3. We have looked at case-control studies (because that is where matching is more common), but this framework can also apply to cohort studies.
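The reversed calculation is a one-liner to script. A sketch (names mine), using the error function for the standard normal CDF; for m = 90 and OR = 2 it returns a power of about 0.90, matching the design above:

```python
import math

def power_from_m(m, OR, z_a=1.96):
    """Power implied by m discordant pairs:
    z_beta = [-z_a*(1/2) + sqrt(m)*(P - 1/2)] / sqrt(P*(1-P)), power = Phi(z_beta)."""
    P = OR / (1 + OR)
    z_beta = (-z_a * 0.5 + math.sqrt(m) * (P - 0.5)) / math.sqrt(P * (1 - P))
    return 0.5 * (1 + math.erf(z_beta / math.sqrt(2)))  # standard normal CDF

power_from_m(90, 2)  # about 0.90
```

This is useful for the "indeterminate study" point made earlier: given the discordant pairs actually observed, one can report the power the study really had.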
§3.7.3 Matched studies with more than one control per case (or, in the instance of cohort studies, more than one unexposed per exposed)

The same principles apply to these more complex designs. In these instances, there are several paired tables per matched set, each table representing the cross-classification of the case with one of its controls. (So, if there are 3 controls per case, one can think of a set of 3 tables of paired comparisons.)

(1) A simple adjustment: Let c = the number of controls per case, and let n be the number of cases assuming 1-to-1 matching. Then with c-to-1 matching, one needs n1 cases, where

n1 = (c + 1) n / 2c

Thus, if one needed 1050 cases (and 1050 controls), and one then selected 2 controls per case, the new number of cases = (2 + 1)(1050)/(2·2) = 3(1050)/4 = 788, and the number of controls = 1576. This approximation is good in many cases, but falls apart when the probability of exposure of a sampled control is low.

(2) More complex methods: Better approximations are available. The programs of DuPont and Plummer (PS) use an estimate of the correlation of exposure status between a case and its matched controls. The formula we have seen (Schlesselman) assumes no correlation. DuPont and Plummer generalized this formula (for multiple controls per case AND for the possibility of some correlation).

[Aside: You can think of the correlation in terms of two columns of data:

Case    Control
1       1
1       0
0       1
0       0
1       0

where 1 indicates exposed, 0 indicates unexposed, and each row is a matched pair (or one of a set of matched pairs). Then the correlation is simple to obtain using the standard formula.]

A good starting value is corr = 0.2. As the correlation increases, the sample size (number of cases) increases.
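The simple c-to-1 adjustment above can be sketched as follows (the helper name is mine):

```python
import math

def cases_for_c_controls(n, c):
    """Cases needed with c controls per case, given n cases under 1:1 matching:
    n1 = (c + 1) * n / (2c)."""
    return math.ceil((c + 1) * n / (2 * c))

# 1050 cases under 1:1 matching -> 788 cases (and 2 * 788 = 1576 controls) with c = 2
```

Note the diminishing returns: as c grows, (c + 1)/2c approaches 1/2, so no amount of extra controls can cut the required cases below half the 1:1 number.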
Effect of additional matches and correlation on sample sizes:

What happens when one adds controls per case in a matched study: the number of cases needed drops, but the total number of patients increases.

Controls per case   Case patients   Total patients
1                   1066            2132
2                   782             2346
3                   688             2752
4                   641             3205

(Assuming the same OR, power, alpha, p0, and p1 as in our example. Calculations from the program PS.)

Effect of correlation on sample sizes (using the same example):

Corr   Case patients
0      1066
0.1    1230
0.2    1437
0.3    1705

Correlation might occur when matching is less than optimal. Available software: PS, PASS. Reference: DuPont WD. Power calculations for matched case-control studies. Biometrics 1988;44:1157-68.

§3.8 Miscellaneous Comments on Sample Size Calculations

1. More complex problems (interactions): usually simplify the problem into a 2-by-2 table or a subgroup comparison. Just think about a 2-by-2 table for one of the subgroups of interest and power the study to detect a clinically meaningful effect for that subgroup alone. But there is a program from NCI (Power.exe) that is specifically designed for computing power to detect interaction. [Ms. Holly Brown (Brownh@exchange.nih.gov).] Refs:

Lubin JH, Gail MH. On power and sample size for studying features of the relative odds of disease. Am J Epidemiol 1990;131:552-566.

Garcia-Closas M, Lubin JH. Power and sample size calculations in case-control studies of gene-environmental interactions: comments on different approaches. Am J Epidemiol 1999;149:689-693.

2. Adjustments to sample sizes obtained from programs, for:
Measurement error
Loss to follow-up
Lack of independence of observations (clustering)
Repeated measures
Covariates
Comparisons of subgroups

End of Vol 1 Part 5