Power analysis and sample size calculation Eirik Skogvoll, MD, PhD Professor, Faculty of Medicine Consultant, Dept. of Anaesthesiology and Emergency medicine 1 The hypothesis test framework Rosner tbl. 7.1 P (type II error) Truth Conclusion H0 correct 1– α α H0 accepted H0 rejected P (type I error) Power Type I error: rejecting H0 if H0 is true Type II error: accepting H0 if H0 is wrong 3 H0 wrong β 1–β Hypothesis A statement about one or more population parameters, e.g. μ ≤3 μ1 = μ2 σ2 = 14 π = 0,4 π1 =π2 ψ =1 β1 = 0 ρ >0 (Expected value, or “true mean”) (Two expected values) (variance) (probability of a “success”, in a binomial trial) (two probabilities) (odds ratio, from a contingency table) (regression coefficient) (correlation coefficient) 4 The sample size is a compromise… • Significance level • Required or optimal power • Available resources – Money – Time – Practical issues of recruiting patients or subjects • The problem under investigation 5 Power • The probability that some test will detect a difference, if it exists • With low power, even large differences may go unnoticed: Absence of evidence is not evidence of absence! • Power is usually set to 0.8 – 0.9 (80-90 %) • An infinitely small difference is impossible to detect • The researcher must specify the smallest difference of (clinical) interest • Power calculations requires knowledge of the background variability (usually the standard deviation) 6 Power or significance? • We wish to reduce the probability of both a type I error (α) and a type II error (β) , because… – A small α means that it is diffcult to reject H0 – A small β means that it is easier to reject H0 (and accept H1) – But minimizing α and β at the same time is complicated, because α increases as β decreases, and vice versa • Strategy: – Keep α at a comfortable level (0.10, 0.05, 0.01); the maximum ”acceptable” probability of making a type I error; i.e. rejecting H0 when it is true – Then choose a test that minimizes β, i.e. maximizes power ( =1 - β). Beware of H1: tests may have different properties! 7 Power (Rosner 7.5) One- sided alternative hypothesis H0: μ = μ0 vs. H1: μ = μ1 < μ0 H0 is rejected if zobs < zα and accepted if zobs ≥ zα . The decision dose not depend on μ1 as long μ1 < μ0. (Rosner Fig. 7.5) Power, Pr (reject H0|H0 wrong) = 1 – Pr(type II error) = 1 – β, however, depends on μ1. 8 Rosner Fig. 7.5 H1: μ = μ1 < μ0 9 vs. H 0: μ = μ 0 ⎛ σ2 ⎞ Under H1 , X ~ N ⎜⎜ μ1 , ⎟⎟ n ⎠ ⎝ We reject H0 if the observed statistic z is less than critical value zα , i.e. if zobs < zα . Using the corresponding value for X: ⎛ ⎞ ⎜ X −μ ⎟ σ ⎞ ⎛ 0 < zα ⎟ = Pr ⎜ X < μ0 + zα ⋅ Pr ( Z < zα ) = Pr ⎜ ⎟ σ n ⎝ ⎠ ⎜ ⎟ ⎜ ⎟ n ⎝ ⎠ (zα < 0) 10 ... remember that Z = X − μ1 σ ⇔ X =Z⋅ σ n + μ1 n σ ⎞ σ ⎞ ⎛ ⎛ σ + μ1 < μ 0 + zα thus, Power = P⎜ X < μ 0 + zα ⎟ = P⎜ Z ⎟ n⎠ n n⎠ ⎝ ⎝ 11 σ ⎞ σ ⎞ ⎛ ⎛ zα zα ⎟ ⎜ ⎟ ⎜ μ μ μ n ⎟ = P⎜ Z < 0 − μ1 + n⎟ = P⎜ Z + 1 < 0 + σ σ σ ⎟ σ σ σ ⎟ ⎜ ⎜ ⎟ ⎜ ⎟ ⎜ n ⎠ n n n ⎠ n n ⎝ ⎝ ⎞ ⎛ ⎞ ⎛ ⎟ ⎜ ⎟ ⎜ − ( μ 0 − μ1 ) μ μ μ μ − ⎛ ⎞ 0 1 0 1⎟ ⎜ ⎟ ⎜ = P⎜ Z < zα + ⋅ n⎟ =P Z< + zα = P Z < zα + σ σ ⎟ ⎜ ⎟ ⎜ σ ⎝ ⎠ ⎟ ⎜ ⎟ ⎜ n n ⎠ ⎝ ⎠ ⎝ ( μ 0 − μ1 ) ⎛ ⎞ = Φ⎜ zα + ⋅ n ⎟ ... because P(Z < z ) = Φ ( z ) ... table 3a σ ⎝ ⎠ 12 One-sided alternatives, in general: H0: μ = μ0 vs. As large as possible … H1: μ = μ1 | μ0 − μ1 | | μ0 − μ1 | ⎛ ⎞ ⎛ ⎞ Power = Φ ⎜ zα + ⋅ n ⎟ = Φ ⎜ − z1−α + ⋅ n⎟ σ σ ⎝ ⎠ ⎝ ⎠ (Rosner, Eq. 7.19) (remember that zα = – z1-α) 13 Factors influencing power: • • • • Power is reduced if α is reduced (zα becomes more negative) Power increases with increasing |μ0 – μ1| Power is reduced with increasing σ Power is increased with increasing n 14 Example (Rosner expl. 7.26) Birth weight. H0: μ0 = 120 oz, vs. H1: μ1 = 115 oz, α = 0.05, σ = 24, n = 100. Power = Pr(rejecting H0|H0 wrong) = 120 − 115 5 ⋅10 ⎞ ⎛ = Φ (0.438) Φ⎜ z0.05 + ⋅ 100 ⎟ = Φ (−1.645 + 24 24 ⎠ ⎝ = Pr( Z < 0.438) = 0.66 → Pr(type II error) = Pr(accept H0|H0 wrong) = β= 1– 0.669 = 0,331 15 Example (Rosner expl. 7.29) Birth weight. H0: μ0 = 120 oz, vs. H1: μ1 = 110 oz, α = 0.05, σ = 24, n = 100. 120 − 110 10 ⋅10 ⎞ ⎞ ⎛ ⎛ Φ⎜ z0.05 + ⋅ 100 ⎟ = Φ⎜ − 1.645 + ⎟ = Φ (2.52) 24 ⎠ 24 ⎝ ⎠ ⎝ = Pr( Z < 2.52) = 0.99 16 Power for a one sample test, two sided alternative (in general): H0: μ = μ0 vs. H1: μ = μ1 ≠ μ0 ⎛ ⎞ | μ0 − μ1 | Power ≈ Φ ⎜ − z α + ⋅ n ⎟ (Rosner Eq. 7.21) σ ⎝ 1− 2 ⎠ 17 From Power to Sample size… From Rosner Eq. 7.21 : ⎛ | μ 0 − μ1 | ⋅ n ⎞ ⎟ = 1 − β ... now we set (" fix" ) 1 − β Power = Φ⎜⎜ − z1−α / 2 + ⎟ σ ⎝ ⎠ ⎛ ⎛ | μ 0 − μ1 | ⋅ n ⎞ ⎞⎟ −1 ⎜ ⎟ = Φ −1 (1 − β ) Φ Φ⎜⎜ − z1−α / 2 + ⎟⎟ ⎜ σ ⎠⎠ ⎝ ⎝ − z1−α / 2 + | μ 0 − μ1 | ⋅ n σ = z1− β ⇔ ⇔ μ 0 − μ1 ⋅ n = σ ⋅ (z1− β | μ 0 − μ1 | ⋅ n σ = z1− β + z1−α / 2 σ ⋅ (z1− β + z1−α / 2 ) + z1−α / 2 ) ⇔ n = μ 0 − μ1 18 Sample size for one sample test, two sided alternative: H0: μ = μ0 vs. H1: μ = μ1 ≠ μ0 σ 2 (z1− β + z1−α / 2 )2 n= (Rosner Eq. 7.28) 2 (μ0 − μ1 ) 19 Two sample test: sample size (Rosner 8.10) Given X1 ∼ N(μ1, σ21) and X2 ∼ N(μ2, σ22) H0: μ1 = μ2 vs. H1: μ1 ≠ μ2 (two sided test), significance level α, and Power 1 – β. (σ 12 + σ 22 )( z n= 1− α + z1− β ) 2 2 ( μ2 − μ1 ) 2 …. in each group (Rosner Eq. 8.26) 20 Rules of thumb Lehr’s ”quick” equations for two-sample comparison: • 80 % power, α = 0.05 (two sided), Δ = standardized difference, m = no. in each group: m = 16/Δ2 • 90 % power, α = 0.05 (two sided), Δ = standardized difference, m = no. in each group m = 21/Δ2 21 The standardized difference Machin et al. ”Sample size tables for Clinical studies” 3.ed, 2009 Altman 1991 (figure 15.2) Need to specify the ”standardized difference”: Δ= | μ 2 − μ1 | σ … the least, clinically important difference expressed as standard deviations. May vary 0,1 – 1,0, where Δ= 0,1 – 0,2 constitutes a ”small” difference, while Δ = 0,8 is ”large”. 22 What is the background variability (”noise”)? • • • From other studies Pilot data From the range of possible values (”quick and dirty”): Range SD ≈ 4 • This results stems from the basic properties of the Normal (Gaussian) distribution, in which approx. 95 % of the population are found within ± 2SD of the mean 23 Example • 80 % power, α = 0.05 (two sided), Δ = standardised difference = 0.5 m = 16/Δ2 = 16/0.25 = 4*16 = 64 Total sample size = 64*2 = 128 24 Nomogram (Altman 1991) (total sample size) Standardised difference 25 Example 80 % power, α = 0.05 (two sided), Δ = 0.5 N (total) = 125 (total sample size) Standardised difference 26 Factors influencing power (1-ß) • • • • Power decreases if α is reduced (zα becomes more negative) Power increases with increasing |μ0 – μ1| Power decreases with increasing σ Power increases with increasing sample size ”n” 27 Factors leading to increased sample size (n) • • • • Large standard deviation Strict requirement for α (low level, 0.01- 0.02) Strict requirement for power (1 – β) (high level: 0.90-0.95) Low absolute difference between parameters under H1 and H0 (|μ1 – μ0|) 28 Confidence intervals • Most journals want CI’s reported along with (or instead of) p-values • The expected length of a confidence interval is determined by the sample size: l = 2·tn-1,1-α/2·s/√n …where t= relevant quantile of the t distribution with n-1 degrees of freedom s = sample standard deviation n = no. of observations 29 How to make efficient use of your subjects • Re-use subjects if possible, using a repeated measurements design. Intra-subject variance will be reduced, as each subject acts as his own control. • Enter more than one experimental factor, using a factorial design. Each subject counts twice (or thrice…) ! • Refine your measurement methods to reduce residual error • Never, never, never…! dichotomize a continuous outcome variable to a binary one (i.e. 1 or 0) ! This has the largest variance. 30 Software for sample size calculations • Machin et al. ”Sample Size Software for Clinical studies” 3.ed, 2009 -- but you need to set dot (.) as decimal separator / • Sample Power • R • StatExact (for special procedures) 31 References • • • • • • • • Eide, DM. Design av dyreforsøk (kap. 28) i A. Smith: Kompendium i forsøksdyrlære. http://oslovet.veths.no/dokument.aspx?dokument=262 (2007) Mead R. The Design of Experiments. Statistical Principles for Practical Applications. Cambridge: Cambridge Univ. Press; 1990 Altman DG. Practical statistics for medical research. London: Chapman & Hall; 1991. Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995 Aug 19;311(7003):485. Rosner B. Fundamentals of Biostatistics. Belmont, CA: Duxbury Thomson Brooks/Cole; 2005. Schulz KF, Grimes DA. Sample size calculations in randomised trials: mandatory and mystical. Lancet. 2005 Apr 9-15;365(9467):1348-53. Bland JM. The tyranny of power: is there a better way to calculate sample size? BMJ. 2009 October 6, 2009;339(oct06_3):b3985-. Machin D, Campbell M, Tan SB, Tan SH. Sample Size Tables for Clinical Studies. 3 ed. Oxford: Wiley-Blackwell; 2009. 33
© Copyright 2025