J Vet Diagn Invest 21:3–14 (2009)

Practical sample size calculations for surveillance and diagnostic investigations

Geoffrey T. Fosgate1

From the Department of Veterinary Integrative Biosciences, College of Veterinary Medicine and Biomedical Sciences, Texas A&M University, College Station, TX.
1 Corresponding Author: Geoffrey T. Fosgate, Department of Veterinary Integrative Biosciences, College of Veterinary Medicine and Biomedical Sciences, Texas A&M University, College Station, TX 77843. gfosgate@cvm.tamu.edu

Abstract. The likelihood that a study will yield statistically significant results depends on the chosen sample size. Surveillance and diagnostic situations that require sample size calculations include certification of disease freedom, estimation of diagnostic accuracy, comparison of diagnostic accuracy, and determining equivalency of test accuracy. Reasons for inadequately sized studies that do not achieve statistical significance include failure to perform sample size calculations, selecting sample size based on convenience, insufficient funding for the study, and inefficient utilization of available funding. Sample sizes are directly dependent on the assumptions used for their calculation. Investigators must first specify the likely values of the parameters that they wish to estimate as their best guess prior to study initiation. They further need to define the desired precision of the estimate and allowable error levels. Type I (alpha) and type II (beta) errors are the errors associated with rejection of the null hypothesis when it is true and the nonrejection of the null hypothesis when it is false (a specific alternative hypothesis is true), respectively. Calculated sample sizes should be increased by the number of animals that are expected to be lost over the course of the study. Free software routines are available to calculate the necessary sample sizes for many surveillance and diagnostic situations. The objectives of the present article are to briefly discuss the statistical theory behind sample size calculations and provide practical tools and instruction for their calculation.

Key words: Diagnostic testing; epidemiology; sample size; study design; surveillance.

Introduction

Calculation of sample size is important for the design of epidemiologic studies,18,62 and specifically for surveillance9 and diagnostic test evaluations.6,22,32 The probability that a completed study will yield statistically significant results depends on the choice of sample size assumptions and the statistical model used to make calculations. The statistical methodology of employed sample size calculations should parallel the proposed data analysis to the extent possible.18 The most frequently chosen sample size routines are based on frequentist statistics, and these have been reviewed previously for other fields.1,11,20,33,35,36,50,54,61,62 Issues specifically related to diagnostic test validation also have been discussed.2,28,42,48 Sample size routines related to issues of surveillance, as well as diagnostic test validation using Bayesian methodology, also have been developed.7,56 Surveillance and diagnostic situations that require sample size calculations include the detection of disease in a population to certify disease freedom, estimation of diagnostic accuracy, comparison of diagnostic accuracy among competing assays, and equivalency testing of assays.

The appropriate sample size depends on the study purpose, and no calculations can be made until study objectives have been defined clearly. Sample size calculations are important because they require investigators to clearly define the expected outcome of investigations, encourage development of recruitment goals and a budget, and discourage the implementation of small, inconclusive studies. Common sample size mistakes include not performing any calculations, making unrealistic assumptions, failing to account for potential losses during the study, and failing to investigate sample sizes over a range of assumptions. Reasons for inadequately sized studies that do not achieve statistical significance include failing to perform sample size calculations, selecting sample size based on convenience, failing to secure sufficient funding for the project, and not using available funding efficiently.

Table 1. Definition of type I and type II errors.*

| Statistical result | True state of nature: H0 is true | True state of nature: HA is true (H0 false) |
|---|---|---|
| Reject H0 | Type I error (alpha) | Agreement with truth |
| Fail to reject H0 | Agreement with truth | Type II error (beta) |

* H0 = null hypothesis; HA = alternative hypothesis.

There is no single correct sample size “answer” for any given epidemiologic study objective or biologic question. Calculated sizes depend on assumptions made during their calculation, and such assumptions cannot be known with certainty. If assumptions were known to be true with certainty, then the study that is being designed will likely not add to the scientific understanding of the problem. There are concepts that are important to consider when performing sample size calculations, despite the inability to classify certain results as correct or incorrect. A few simple formulas are generally sufficient for most sample size situations that would be encountered in the design of studies to determine disease freedom and evaluate diagnostic tests. The objectives of the present article are to briefly discuss the statistical theory behind sample size calculations and provide practical tools and instruction for their calculation.
This review will only discuss issues related to frequentist approaches to sample size calculation and will emphasize conservative methods that result in larger sample sizes.

Epidemiologic errors

The current presentation of statistical results in the medical literature tends to be a blending of significance testing attributed to the work of Fisher,21 subsequently discussed by others,29,55 and hypothesis testing as attributed to Neyman and Pearson.46,47 The P value in the Fisher significance testing approach is considered a quantitative value documenting the level of evidence for or against the null hypothesis. The P value is formally defined as the probability of observing the current data or more extreme when the null hypothesis is true. The hypothesis testing approach as introduced by Neyman and Pearson was based on rejection or acceptance of null hypotheses using specified P value cutoffs. The hypothesis testing interpretation of statistical results allows for the definition of type I and type II errors as the errors associated with rejection of the null hypothesis when it is indeed true and the acceptance of the null hypothesis when it is false (and a particular alternative hypothesis is true), respectively.47 The probabilities of making these errors are frequently referred to as alpha (α) and beta (β) for type I and type II errors, respectively.36,54 Current sample size procedures are derived from the hypothesis testing approach as put forth by Neyman and Pearson; however, current convention is to use the terminology of “failure to reject” rather than acceptance of a null hypothesis.

Figure 1. Sampling distributions presented under the null (H0, black line) and alternative (HA, gray line) hypotheses. Black shaded area corresponds to alpha (type I error), and gray shaded area corresponds to beta (type II error). Sample size calculations solve for the number so that the critical value (cv) corresponds to the location where Pr(Z ≤ z) = 1 − α/2 and Pr(Z ≤ z) = β (or Pr(Z ≤ z) = 1 − α for a 1-sided test).

A requirement for sample size calculation is the specification of alpha and beta when considering the testing of a statistical hypothesis (Table 1). Precision-based sample size methods must specify alpha, but beta is not included in the equations and, based on the typical large-sample approximation methods, is consequently assumed to be 50% for the alternative hypothesis that the true value falls outside the limits of the calculated confidence interval.17 The P value obtained after statistical analysis will equal the prespecified alpha if the assumptions of the sample size calculations are observed exactly in the collected data due to their similar probabilistic definition. However, the meaning of beta is often misunderstood as simply “the probability of accepting the null hypothesis when a true difference exists” based on presentations in tables and figures.11,33,36,54 The issue is that there are an infinite number of specific alternative hypotheses that could be true if the null hypothesis is false, and many will be less probable than the null itself. Beta can be calculated only after an explicit alternative hypothesis has been specified. The hypothesis that is chosen during sample size calculation is the expected difference between the population values. Alpha and beta correspond to areas under sampling distributions for population means (including proportions) under the null and alternative hypotheses, respectively (Fig. 1). The statistical power of a test is defined as 1 − β or the probability of rejecting the null hypothesis when the alternative hypothesis is true.
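The relationship among alpha, beta, and power described above can be illustrated numerically. The following is a minimal Python sketch under a normal sampling distribution; the function names and the `shift` parameterization (distance between the null and alternative means, in standard-error units) are assumptions made for illustration and do not come from the article.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def critical_value(alpha: float, two_sided: bool = True) -> float:
    """Z value leaving alpha (or alpha/2 in each tail) under the null."""
    tail = alpha / 2 if two_sided else alpha
    return STD_NORMAL.inv_cdf(1 - tail)


def power_two_sided(shift: float, alpha: float = 0.05) -> float:
    """Power (1 - beta) when the alternative mean lies `shift`
    standard errors away from the null mean (hypothetical parameter)."""
    cv = critical_value(alpha)
    upper = 1 - STD_NORMAL.cdf(cv - shift)   # area beyond the upper critical value
    lower = STD_NORMAL.cdf(-cv - shift)      # area beyond the lower critical value
    return upper + lower


# A shift of Z(0.975) + Z(0.80) standard errors gives about 80% power.
print(round(power_two_sided(critical_value(0.05) + STD_NORMAL.inv_cdf(0.80)), 3))
# A true value sitting exactly on the confidence limit gives about 50% power.
print(round(power_two_sided(critical_value(0.05)), 3))
```

The second call is one way to see why precision-based methods are said to imply a beta of roughly 50%: when the true value lies exactly one critical value away from the null, about half of the sampling distribution falls beyond the critical value.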
Sample size adjustment factors

Sample size calculations are often based on large-sample approximation methods.24,30,51 The quality of the approximate results depends on the specific sample size situation, and adjustment factors have been developed to improve their approximation to exact distributions. Some of the typical adjustment factors include the finite population, continuity correction, and variance inflation factors.

The finite population correction factor4,19 is typically considered when the study objective is to estimate a population proportion. Typically, sampling without replacement is performed, and if the sample size is relatively large compared with the total population, then this correction factor should be considered. A typical recommendation is to employ this factor when the sample includes 10% or more of the population.19 The need for this correction factor is derived from the fact that sampling is hypergeometric (sampling without replacement, as from a deck of cards), and sample size formulas are based on binomial (sampling with replacement) theory. The formula19 for the correction is

$$ n' = n \times \frac{N}{N + n}, $$

where n′ is the corrected sample size, and n and N are the uncorrected and population sizes, respectively. Applying this correction factor causes the sample size to be smaller than the uncorrected. Confidence interval algorithms have been developed based on hypergeometric sampling,52 but the author is not aware of their availability in statistical packages. Usual confidence interval calculation methods are based on either normal approximation or exact binomial methodology. Application of the finite population correction factor is only recommended by the author when the analysis also incorporates adjustment for hypergeometric sampling.

The continuity correction factor51 is employed when the study objective is to compare 2 population proportions (including diagnostic sensitivity or specificity).
The difference in proportions is approximated by a normal distribution in typical sample size formulas, even though binomial distributions are discrete and normal distributions are continuous. The normal approximation might not always be adequate, and continuity correction should be applied to better approximate the exact distribution (Fig. 2). The formula25 for continuity correction is

$$ n' = \frac{n}{4}\left[1 + \sqrt{1 + \frac{4}{n\,|P_1 - P_2|}}\right]^2, $$

where n′ and n are the corrected and uncorrected sample sizes, respectively, and P1 and P2 are the hypothesized proportions. Applying the continuity correction increases the sample size over the uncorrected and should typically be applied. Frequently employed methods for the comparison of proportions use continuity correction when calculating chi-square test statistics.8,58

Figure 2. Cumulative probability function for a binomial distribution (n = 12, p = 0.5) (gray shading) overlaid with the corresponding cumulative normal distribution (μ = 6, σ = 1.732) denoting the uncorrected (A; probability = 0.500) and continuity-corrected (B; probability = 0.614) probability for observing 6 successes. Continuity correction improves the approximation of the true binomial cumulative probability for observing 6 successes, which is 0.613.

Sample size calculations for estimating proportions typically involve making the assumption of independence among sampling units. Lack of independence that is introduced when a clustered sampling design is employed can be adjusted by inflating the variance estimate. The design effect (DE),10,38 or variance inflation factor, is defined as the variance of the sampling design compared with simple random sampling. The formula10,45,59 for its calculation is

$$ DE = 1 + \rho\,(m - 1), $$

where ρ is the intraclass correlation, and m is the sample size within each cluster.
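The three adjustment factors above can be sketched as small Python functions. This is an illustrative sketch only: the function names are hypothetical, and the input values in the usage lines were chosen for illustration rather than taken from a specific study.

```python
import math


def finite_population_correction(n: float, population: int) -> float:
    """n' = n * N / (N + n): reduces n when the sample is a large share of N."""
    return n * population / (population + n)


def continuity_correction(n: float, p1: float, p2: float) -> float:
    """n' = (n / 4) * (1 + sqrt(1 + 4 / (n * |P1 - P2|)))**2: increases n."""
    return (n / 4) * (1 + math.sqrt(1 + 4 / (n * abs(p1 - p2)))) ** 2


def design_effect(icc: float, cluster_size: float) -> float:
    """DE = 1 + rho * (m - 1) for intraclass correlation rho and cluster size m."""
    return 1 + icc * (cluster_size - 1)


# Illustrative inputs (assumptions, not values from a real study).
print(math.ceil(finite_population_correction(400, 1000)))  # 286
print(math.ceil(continuity_correction(199, 0.90, 0.80)))   # 219
print(design_effect(0.1, 11))                              # design effect of about 2
```

Each function returns an unrounded value so that rounding up to a whole number can be left to the final step of a calculation.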
When clustered sampling is employed, the sample size estimated by the usual methods assuming independence is multiplied by the DE to account for the expected dependence. The intraclass correlation is a relative measure of the homogeneity of sampling units within each cluster compared with a randomly selected sampling unit. This correlation is formally defined as the proportion of total variation within sampling units that can be accounted for by variation among clusters.38,45 A high correlation indicates more dependence within the data, resulting in a larger DE. The intraclass correlation is generally estimated from pilot data or based on estimates available from the literature. If the number of clusters is fixed by design and the cluster sample size is unknown, then it is not possible to simply use the previously mentioned formula for the DE. The sample size per cluster (m) must first be estimated, and it is based on the effective sample size (ESS), which is the sample size estimated assuming independence. It is also necessary to know the number of clusters (k) and the intraclass correlation (ρ). The formula27 for calculation of the cluster sample size is

$$ m = \frac{ESS - \rho\,(ESS)}{k - \rho\,(ESS)}. $$

Sample size situations

Surveillance or detection of disease

The detection of disease in a population is important for herd certification programs and for documenting freedom from disease after an outbreak. It has implications in regional and international trade of animals and animal products. The first step is to determine the prevalence of disease that is important to detect. A prevalence of disease at this level or greater is considered biologically important. Documenting a zero prevalence of disease is not typically possible because it would require testing the entire population with a perfect assay. The next step is to define the level of confidence for which it is desired to find the disease should it be present in the population at the hypothesized prevalence or higher.
Again, 100% confidence is not feasible because it would require sampling all animals and testing with a perfect assay. Alpha is calculated as 1 − confidence. The final step is to determine the statistical model to use for calculations. In small populations, sample size calculations should be based on a hypergeometric distribution (sampling without replacement). In larger populations, it is often assumed that the true hypergeometric distribution can be well approximated by the binomial (sampling with replacement). The sample size formula assuming a binomial model is based on the following relationship: (1 − p)^n = (1 − confidence). The formula19 after solving for the sample size is

$$ n = \frac{\log \alpha}{\log(1 - p)}, $$

where α is 1 − confidence, and p is the prevalence worth detecting. The corresponding formula19 based on hypergeometric sampling is

$$ n = \left[1 - \alpha^{1/D}\right]\left[N - \frac{D - 1}{2}\right], $$

where α is 1 − confidence, N is the population size, and D is the expected number of diseased animals in the population. The necessary sample size for various combinations of prevalence and confidence can be tabulated (Table 2), and software is available that will perform the necessary calculations. Survey Toolbox^a can perform these calculations and is available free for download. The software performs calculations based on both binomial and hypergeometric sampling and can also adjust for imperfect sensitivity and specificity of employed tests.

An example of this type of sample size problem is illustrated by the regulatory agency in Texas when it decided to perform active surveillance for bovine tuberculosis (Table 3). There are approximately 7,650 registered pure-bred beef seed stock producers in Texas, and it was decided that a herd-level prevalence of 0.001 (1 in 1,000 herds infected) or greater was important to detect with 95% confidence. Survey Toolbox can be used to solve this sample size problem. From the menu, choose Freedom From Disease → Sample Size.
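Independently of the software steps, the binomial and hypergeometric formulas given above can be evaluated directly. The following is a minimal Python sketch and not the Survey Toolbox implementation; the function names are hypothetical, a perfect test is assumed, and D = 8 is an assumption obtained by rounding 7,650 × 0.001 = 7.65 up to a whole herd.

```python
import math


def n_detect_binomial(prevalence: float, confidence: float) -> float:
    """n = log(alpha) / log(1 - p), with alpha = 1 - confidence."""
    alpha = 1 - confidence
    return math.log(alpha) / math.log(1 - prevalence)


def n_detect_hypergeometric(population: int, diseased: int, confidence: float) -> float:
    """n = (1 - alpha**(1/D)) * (N - (D - 1) / 2), with alpha = 1 - confidence."""
    alpha = 1 - confidence
    return (1 - alpha ** (1 / diseased)) * (population - (diseased - 1) / 2)


# Texas bovine tuberculosis example: N = 7,650 herds, p = 0.001, 95% confidence.
print(round(n_detect_binomial(0.001, 0.95), 1))              # about 2994.2
print(round(n_detect_hypergeometric(7650, 8, 0.95), 1))      # about 2388.3
```

The values are returned unrounded because rounding conventions can differ between tools; rounding up to the next whole herd is the conservative choice.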
Click on the Sample Size tab and input 100 for the sensitivity and specificity of the test. Change the population size to 7,650 and set the minimum expected prevalence to 0.1%. Click on the Options tab and be sure that the type I error is set at 0.05. Click to have the program calculate the sample size based on the simple binomial model. No other changes are necessary. Go back to the Sample Size tab and click on the Calculate button. The sample size based on the hypergeometric model can be calculated by changing to the Modified Hypergeometric Exact on the Options tab before clicking on the Calculate button. A binomial model suggested that the necessary sample size would be 2,994 of the 7,650 beef operations (39%). The interpretation is that assuming that the true prevalence is at least 0.001, then a sample consisting only of noninfected herds would occur 5% of the time or less when the sample size is 2,994 (assuming a perfect test at the herd level). The hypergeometric model might be more appropriate, because sampling would be from a finite population without replacement; using the hypergeometric formula, the sample size is 2,388 herds (31%) of the 7,650 total.

Table 2. Number needed for study to be confident that the disease will be detected if present at or above a specified prevalence based on hypergeometric sampling and assuming a perfect test. Columns give the expected prevalence of disease in the population; entries are sample sizes for 95%/99% confidence.

| Population size | 0.10 | 0.05 | 0.02 | 0.01 | 0.005 |
|---|---|---|---|---|---|
| 20 | 15/18 | 19/20 | 20/20 | 20/20 | 20/20 |
| 50 | 22/29 | 34/41 | 48/50 | 50/50 | 50/50 |
| 100 | 25/35 | 44/59 | 77/90 | 95/99 | 100/100 |
| 150 | 26/38 | 48/67 | 94/117 | 129/143 | 150/150 |
| 200 | 27/39 | 51/72 | 105/136 | 155/180 | 190/198 |
| 250 | 27/40 | 52/75 | 112/149 | 174/210 | 238/244 |
| 300 | 27/41 | 53/77 | 117/159 | 189/235 | 233/286 |
| 350 | 27/41 | 54/79 | 121/167 | 201/255 | 272/324 |
| 400 | 27/41 | 54/80 | 124/174 | 210/272 | 310/360 |
| 450 | 28/42 | 55/81 | 126/179 | 218/287 | 349/391 |
| 500 | 28/42 | 55/82 | 128/183 | 224/300 | 388/420 |
| 600 | 28/42 | 56/83 | 131/189 | 235/320 | 378/470 |
| 700 | 28/42 | 56/84 | 134/194 | 243/336 | 369/511 |
| 800 | 28/43 | 56/85 | 135/198 | 249/349 | 421/546 |
| 1,000 | 28/43 | 57/86 | 138/204 | 258/367 | 450/601 |
| 1,200 | 28/43 | 57/86 | 139/208 | 264/381 | 471/642 |
| 1,400 | 28/43 | 57/87 | 141/210 | 268/391 | 486/673 |
| 1,600 | 28/43 | 57/87 | 142/212 | 272/398 | 500/699 |
| 1,800 | 28/43 | 57/88 | 142/214 | 275/404 | 509/719 |
| 2,000 | 28/43 | 58/88 | 143/215 | 278/409 | 517/736 |
| 3,000 | 28/43 | 58/88 | 145/219 | 284/425 | 542/791 |
| 4,000 | 28/44 | 58/89 | 146/222 | 287/433 | 555/821 |
| 5,000 | 28/44 | 58/89 | 146/223 | 289/438 | 563/839 |
| 6,000 | 28/44 | 58/89 | 146/224 | 291/441 | 569/852 |
| 10,000 | 28/44 | 58/89 | 147/225 | 294/443 | 580/888 |
| 100,000 | 28/44 | 58/90 | 148/228 | 298/457 | 596/915 |
| >100,000* | 28/44 | 58/90 | 148/228 | 298/458 | 598/919 |

* Based on binomial model.

Table 3. Sample size situation for the detection of bovine tuberculosis (TB) in beef cattle herds.

| Item | Description |
|---|---|
| Biologic question | Is TB present in the 7,650 registered beef herds in Texas? |
| Statistical model | A hypergeometric model is assumed because sampling of herds will be performed without replacement. |
| Prevalence of TB | If TB is present at or above 0.1% (1 in 1,000 herds), then it is biologically important to detect. |
| Confidence level | If TB is present at this prevalence, then it will be detected with 95% confidence. |
| Results | 2,388 of the 7,650 total herds should be tested. |
| Interpretation | When the true prevalence is at least 0.1%, then a sample consisting only of TB-uninfected herds would occur 5% of the time or less when the sample size is 2,388 (assuming a perfect test at the herd level). |

Estimation of a population proportion

Calculating the sample size necessary to estimate a population proportion is important when an estimate of disease prevalence or diagnostic test validation is desired. The sensitivity and specificity of an assay should be considered population estimates in the same manner as other proportions. The sample size formulas employed for these calculations are typically considered to be precision based because they involve finding confidence intervals of a specified width rather than testing hypotheses. The typical sample size formula37,58 based on the normal approximation to the binomial is

$$ n = \frac{P\,(1 - P)\,Z_{1-\alpha/2}^{2}}{e^{2}}, $$

where P is the expected proportion (e.g., diagnostic sensitivity), e is one half the desired width of the confidence interval, and Z_{1−α/2} is the standard normal Z value corresponding to a cumulative probability of 1 − α/2.

Figure 3. The sample size is determined so that the sampling distribution of the hypothesized proportion (P̂) has an area under the curve between the specified upper (PU) and lower (PL) bounds of the confidence interval equal to the specified probability (gray shaded area); Pr(PL ≤ P̂ ≤ PU) = confidence level.

The investigator must specify a best guess for the proportion that is expected to be found after performing the study. The investigator also needs to specify the desired width of the interval around this proportion and the level of confidence.
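The normal approximation formula for estimating a proportion can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, it implements the normal approximation rather than any exact binomial routine, and no finite population correction is applied.

```python
import math
from statistics import NormalDist


def n_estimate_proportion(p: float, half_width: float, confidence: float) -> int:
    """n = P(1 - P) * Z**2 / e**2, rounded up to a whole number.

    p          -- best guess for the proportion (P)
    half_width -- one half of the desired confidence interval width (e)
    confidence -- confidence level, e.g. 0.95
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z value for cumulative probability 1 - alpha/2
    return math.ceil(p * (1 - p) * z**2 / half_width**2)


# Expected proportion 0.99, estimated to within 0.01 with 99% confidence.
print(n_estimate_proportion(0.99, 0.01, 0.99))  # 657
# Expected proportion 0.50, estimated to within 0.05 with 95% confidence.
print(n_estimate_proportion(0.50, 0.05, 0.95))  # 385
```

The second call shows the effect of the P(1 − P) term: a proportion of 0.5 maximizes the variance, so it requires the largest sample for a given precision and confidence level.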
In essence, this procedure will find the sample size that, upon statistical analysis, would result in a confidence interval with the specified probability and limits if the assumed proportion were in fact observed by the study (Fig. 3). The resulting sample size could be adjusted using the finite population correction factor, and if this is performed then the statistical analysis should be similarly adjusted at the end of the study. Sample sizes calculated using formulas should always be rounded up to the nearest whole number.

The sample size methods based on the normal approximation to the binomial might not be adequate when the expected proportion is close to the boundary values of 0 or 1. Exact binomial methods are preferred when the proportion is expected to fall outside the range of 0.2–0.8.26 The binomial probability function is the basis of exact sample size methods, and it is

$$ \binom{n}{x} P^{x} (1 - P)^{n - x}, $$

where P is the hypothesized proportion, n is the sample size, and x is the number of observed “successes.” Derivation of a sample size algorithm based on the binomial probability function has been described previously.26 It is based on the mid-P adjustment5,41 for the Clopper-Pearson method of exact confidence interval estimation.14 The investigator specifies PU and PL as the desired limits of the confidence interval around the hypothesized proportion (P) and the desired level of confidence. The calculated sample size could be adjusted using the finite population correction factor if deemed appropriate.

Software is available to calculate the necessary sample size for estimating population proportions. Epi Info^b includes software that can perform these calculations33 and is available free for download. The software performs calculations based on normal approximation methods and will apply the finite population correction factor if the population size is specified. Software to perform calculations based on binomial exact methods (Mid-P Sample Size routine) can be obtained by contacting the author.

An example of this type of sample size problem is the design of a study to estimate the diagnostic specificity of a new assay to screen healthy cattle for Foot-and-mouth disease virus (FMDV; Table 4). The number of cattle necessary to sample could be calculated for an expected specificity of 0.99 and the desire to estimate this specificity ±0.01 with 99% confidence. For this example, it will be assumed that sampling is from a large population, and a simple random sampling design will be employed. Epi Info 6 can be used to make the calculation based on normal approximation methods (newer versions of Epi Info have not retained presented sample size routines). From the menu, choose Programs → EPITABLE. From the menus in EPITABLE, choose Sample → Sample size → Single proportion. The size of the population does not need to be changed unless the application of the finite population correction is desired. The design effect should be 1.0 unless the variance is to be adjusted for clustered sampling techniques. Enter 1% for the desired precision, 99% for the expected prevalence, and check 1% for alpha. Alternatively, the modified exact sample size routine could be used. Open the Mid-P Sample Size routine and enter 0.99 for the proportion, 0.01 for the error limit, and 0.99 for the confidence level. The normal approximation method suggested that 657 cattle would need to be sampled, whereas the method based on the exact binomial distribution suggested that 974 should be sampled. Neither of these numbers incorporates finite population correction. The sample size based on the exact distribution is preferred, and it is substantially larger than the sample size based on the normal approximation because the expected proportion is very close to 1.

Table 4. Sample size situation for estimating the specificity of a test to screen cattle for Foot-and-mouth disease virus (FMDV).

| Item | Description |
|---|---|
| Biologic question | What is the specificity of a new screening test for FMDV in cattle? |
| Statistical model | An exact binomial model is assumed because the specificity is expected to be close to 1. |
| Best guess for specificity | The new assay is expected to be 99% specific. |
| Precision | Specificity is desired to be estimated ±1%. |
| Confidence level | Specificity is desired to be estimated to be within these limits with 99% confidence. |
| Results | 974 cattle should be sampled. |
| Interpretation | If 974 cattle are sampled and the true specificity of the test is 99%, then a 99% confidence interval should have the specified width (2%). |

Typically, sample size calculations for studies that will perform clustered sampling first calculate the necessary sample size assuming independence or lack of clustering. Calculated sample sizes are then multiplied by the DE to account for the lack of independence. Expert opinion can be used to account for expected correlation of sampling units when prior information concerning the intraclass correlation is not available. A sample size routine incorporating a method to estimate the DE based on expert opinion for a fixed number of clusters has been developed27 and is available from the author.

Comparison of 2 proportions

Independent proportions. Calculating the sample size necessary to compare 2 population proportions is important when a comparison of the accuracy of diagnostic tests is desired. Sensitivity and specificity are population estimates, and comparison between 2 assays should be based on this sample size situation.
The usual sample size formula13,25,53 based on the normal approximation to the binomial with equal group sizes is

$$ n = \frac{\left[Z_{1-\alpha/2}\sqrt{2\bar{P}(1 - \bar{P})} - Z_{\beta}\sqrt{P_1(1 - P_1) + P_2(1 - P_2)}\right]^{2}}{(P_2 - P_1)^{2}}, $$

where P1 and P2 are the expected proportions in each group, and P̄ is the simple average of the expected proportions. Variables Z_{1−α/2} and Z_β are the standard normal Z values corresponding to the selected alpha (2-sided test) and beta, respectively. Typical presentation of the formula11,12 above uses Z_{α/2} instead and an addition of the 2 components within the numerator. Solving these 2 formulations gives the same sample size because the numerator is squared. The specific formulation has been included here because alternative hypotheses have been presented in figures as being on the positive side of the null hypothesis, and therefore Z_β should be negative. This is also consistent with the algebraic manipulation to solve for Z_β, as presented in the section related to power calculation. The resulting sample size should be adjusted using the continuity correction factor, and all sample sizes should be rounded up to the nearest whole number.

Figure 4. Sample size estimates are affected by the standardized difference and the specified alpha (type I error) and beta (type II error).

The magnitude of the difference between the 2 proportions has a greater effect on calculated sample sizes than typical values for alpha and beta (Fig. 4). The absolute magnitude of the proportions affects the calculations, with proportions closer to 0.5 resulting in larger sample sizes11 because the variance of a proportion is greatest at this value.
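The formula for comparing 2 independent proportions, followed by the continuity correction, can be sketched in Python as follows. This is a minimal sketch assuming equal group sizes and a 2-sided test; the function name and the `corrected` flag are hypothetical, and the input proportions in the usage lines are illustrative.

```python
import math
from statistics import NormalDist


def n_two_proportions(p1: float, p2: float, alpha: float = 0.05,
                      beta: float = 0.20, corrected: bool = True) -> int:
    """Sample size per group for comparing 2 independent proportions."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # Z for 1 - alpha/2 (2-sided)
    z_beta = std_normal.inv_cdf(beta)            # negative when beta < 0.5
    p_bar = (p1 + p2) / 2                        # simple average of the proportions
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 - z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    n = numerator / (p2 - p1) ** 2
    if corrected:
        # Continuity correction: n' = (n / 4) * (1 + sqrt(1 + 4 / (n |P1 - P2|)))**2
        n = (n / 4) * (1 + math.sqrt(1 + 4 / (n * abs(p1 - p2)))) ** 2
    return math.ceil(n)  # always round up to a whole number


# Proportions of 0.90 and 0.80 with alpha = 5% and beta = 20%.
print(n_two_proportions(0.90, 0.80, corrected=False))  # 199 per group
print(n_two_proportions(0.90, 0.80))                   # 219 per group
```

Because Z_β is negative for beta below 0.5, subtracting it inside the numerator adds the second component, which is why this formulation and the additive presentation give the same result.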
The formula for the standardized difference (SDiff) in proportions3,36,61 is

$$ SDiff = \frac{|P_1 - P_2|}{\sqrt{P_1(1 - P_1) + P_2(1 - P_2)}}. $$

Software is available to calculate the necessary sample size to compare 2 independent population proportions. Epi Info^b can be used to perform these calculations. The calculations are based on normal approximation methods and will apply a continuity correction factor.

An example of this type of sample size problem is the design of a study to compare the diagnostic sensitivity of magnetic resonance imaging (MRI) for detection of intervertebral disk disease between chondrodystrophoid and nonchondrodystrophoid breeds of dogs (Table 5). The number of dogs necessary to sample could be calculated for expected sensitivities of 90% and 80% in chondrodystrophoid and nonchondrodystrophoid dogs, respectively. The statistical test could be desired to have an alpha of 5% and beta of 20% to detect this difference in proportions. The ratio of chondrodystrophoid to nonchondrodystrophoid dogs also needs to be specified, and the assumption could be made to have equal group sizes. Epi Info 6 can be used to make the calculation. From the menu, choose Programs → EPITABLE. From the menus in EPITABLE, choose Sample → Sample size → Two proportions. The ratio of group 1 to group 2 should be 1, the percentage in group 1 would be 90%, the percentage in group 2 would be 80%, alpha should be 5%, and the power should be set at 80%. Calculations suggest that 219 dogs are necessary in each group (chondrodystrophoid and nonchondrodystrophoid) for a total of 438. The reported sample size includes continuity correction.

Table 5. Sample size situation for comparing the sensitivity of magnetic resonance imaging (MRI) for detection of intervertebral disk disease (IVDD) between chondrodystrophoid and nonchondrodystrophoid breeds of dogs.

| Item | Description |
|---|---|
| Biologic question | Is the sensitivity of MRI for detecting IVDD different for chondrodystrophoid and nonchondrodystrophoid dogs? |
| Statistical model | A binomial model is assumed for sensitivity, and a normal approximation is used for the comparison of sensitivities. |
| Best guess for sensitivity 1 | MRI is expected to be 90% sensitive for detecting IVDD in chondrodystrophoid dogs. |
| Best guess for sensitivity 2 | MRI is expected to be 80% sensitive for detecting IVDD in nonchondrodystrophoid dogs. |
| Type I error | A statistical test with 5% type I error (alpha) is desired. |
| Type II error | A statistical test with 20% type II error (beta) is desired. |
| Other assumptions | Assume an equal number of chondrodystrophoid and nonchondrodystrophoid dogs will be evaluated. |
| Results | 219 are necessary in each group, for a total of 438 dogs. |
| Interpretation | If 219 dogs are available from each breed group for sensitivity estimation and the true sensitivities are 80% and 90%, then a statistical test interpreted at the 5% level of significance (alpha) will have a power of 80% for detecting the difference in sensitivities as statistically significant. |

Sample size calculations for the comparison of proportions when the group sizes are not equal are a simple modification of the presented formula.60 The formula also can be modified to allow for the estimation of odds ratios and risk ratios.20,53 All presented formulas correspond to the necessary sample sizes for 2-sided statistical tests. Variable Z_{1−α/2} is replaced with Z_{1−α} to modify the formula for a 1-sided test.

Dependent proportions. When multiple tests are performed based on specimens collected from the same animal, then the proportions (i.e., sensitivity and specificity) should be considered dependent.
There are multiple conditional and unconditional approaches to solving this sample size problem [15,16,23,39,40,43,44,49,57], and a formula is not presented in this section because of the increased complexity and the lack of consensus among competing methods.

An example of this type of sample size problem is the design of a study to compare the diagnostic specificity of 2 tests for FMDV screening in healthy cattle (Table 6). Serum samples from each animal selected for study will have both tests performed in parallel. The number of cattle necessary to sample could be calculated based on expected specificities of 99% and 95% in test 1 and test 2, respectively. The statistical test could be designed with an alpha of 1% and a beta of 10% to detect this difference in proportions.

Software is available to calculate the necessary sample size to compare 2 dependent population proportions. WinPepi^c includes software that can perform these calculations and is available free for download. From the main menu, the program PAIRSetc should be selected. Sample size should be chosen from the top menu of PAIRSetc. The correct type of sample size procedure corresponds to the McNemar test, and S1 should therefore be selected. The significance level should be set as 1% and the power as 90%. The expected percentage of "Yes" in the first set of observations should be set as 99%, and the other percentage of "Yes" should be set as 95%. Numbers without "%" should be entered, and the remainder of the input boxes should be left blank. Calculations suggest that 544 pairs of observations are required (544 cattle total). This sample size is smaller than the corresponding sample size if the proportions were considered to be independent.

Epi Info 6 could be used to make the calculation if the paired design was ignored. From the menu, choose Programs → EPITABLE. From the menus in EPITABLE, choose Sample → Sample size → Two proportions. The ratio of group 1 to group 2 should be 1, the percentage in group 1 would be 99%, the percentage in group 2 would be 95%, alpha should be 1%, and the power should be set at 90%. Calculations suggest that 588 cattle are necessary for each group, and this would be the total number of necessary cattle because of the paired design. This sample size is not much different (8% greater) than the calculation based on the paired design, and since it is larger, it would not necessarily be incorrect to use the usual unmatched method for sample size determination.

Table 6. Sample size situation for comparing the specificity of 2 tests for Foot-and-mouth disease virus (FMDV) screening in healthy cattle.

  Biologic question: Is the specificity of a new FMDV screening test in cattle different from that of the more established test?
  Statistical model: A binomial model is assumed for specificity, a multinomial model is assumed for the cross-classified proportions, and a normal approximation is used for the difference in specificities.
  Best guess for specificity 1: The new test is expected to be 99% specific.
  Best guess for specificity 2: The established test is considered 95% specific.
  Type I error: A statistical test with 1% type I error (alpha) is desired.
  Type II error: A statistical test with 10% type II error (beta) is desired.
  Other assumptions: Data from the 2 tests will be dependent because they will be performed on the same cattle.
  Results: 544 cattle are required, with both tests performed on all.
  Interpretation: If 544 cattle are available for specificity estimation and the true specificities are 99% and 95%, then a statistical test interpreted at the 1% level of significance (alpha) will have a power of 90% for detecting the difference in specificities as statistically significant.

Equivalency testing.
A study that aims to determine whether a certain test has accuracy equivalent (or noninferior) [44] to that of another, typically well-established, test is based on separately comparing sensitivity and specificity between tests. The first step is to consider the sensitivity and specificity of the well-established test and then quantify the level of difference in accuracy that would be allowable while still considering the 2 tests equivalent or the new test not inferior. It is not possible to calculate a sample size to determine zero difference, for the same reason that it is not possible to calculate a sample size to be 100% sure that a given population has no disease (zero prevalence). An example would be to determine equivalency of a new test to a well-established test that has been reported to be 90% sensitive and 95% specific. A further assumption could be that the new test would be considered equivalent as long as it is at least 85% sensitive and 90% specific. The allowable alpha and beta values could be assumed to be 5% (2-sided) and 20%, respectively. However, power values greater than 80% and larger alpha values are sometimes assumed for equivalency studies [54]. Epi Info could be used to calculate the necessary sample size as described previously for 2 independent proportions. If equal group sizes are assumed (for each test), then the necessary sample size is 726 infected animals within each group tested by the 2 tests for sensitivity comparison and 474 uninfected animals within each group for specificity comparison. If a paired design were planned, then these numbers would be a reasonably good estimate for the total number of animals necessary for the evaluation. Often for noninferiority testing a 1-sided statistical test will be employed, and therefore the sample size calculation should be adjusted accordingly. Equivalency testing in general requires large sample sizes, and the discussed example is a simplified situation.
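The sample sizes quoted for this example can be checked with a short script. The following is a minimal sketch and not the Epi Info implementation: it assumes the usual normal-approximation formula for 2 independent proportions with equal group sizes, followed by a commonly used continuity correction attributed to Fleiss and colleagues. The function name `n_per_group` and its parameters are hypothetical names chosen for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80, two_sided=True, continuity=True):
    """Approximate per-group sample size for comparing 2 independent proportions.

    Assumption: normal approximation with equal group sizes; the optional
    continuity correction is the common n/4 * (1 + sqrt(1 + 4/(n*|p1-p2|)))**2 form.
    """
    z = NormalDist()
    # Use Z(1 - alpha/2) for a 2-sided test and Z(1 - alpha) for a 1-sided test.
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)

    p_bar = (p1 + p2) / 2          # simple average of the 2 proportions
    diff = abs(p1 - p2)

    # Uncorrected sample size per group.
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / diff ** 2

    if continuity:
        n = n / 4 * (1 + math.sqrt(1 + 4 / (n * diff))) ** 2

    return math.ceil(n)            # round up to whole animals


# Equivalency example: 90% vs. 85% sensitivity, 95% vs. 90% specificity.
print(n_per_group(0.90, 0.85))     # 726 infected animals per group
print(n_per_group(0.95, 0.90))     # 474 uninfected animals per group
```

Under the same assumptions, the function also returns 219 per group for the MRI example (90% vs. 80%, alpha 5%, power 80%) and 588 for the FMDV example when the pairing is ignored (99% vs. 95%, alpha 1%, power 90%). Passing `two_sided=False` gives a 1-sided calculation of the kind often used for noninferiority testing.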
Literature related to these studies documents several methods of calculation and varies based on the determination of regions associated with rejection of the null hypothesis of no difference between tests. The simplified example has been presented to give a general idea of how studies should be designed, and interested readers should review the paper by Lu et al. [44]

Calculation of power when sample size is fixed

When the sample size is fixed by design, then it is good planning to determine the power of a statistical test to identify a biologically important difference. Estimating the power to compare 2 population proportions is important when it is desired to compare the accuracy of diagnostic tests. The usual formula for calculating the power for this comparison is an algebraic manipulation of the previously presented sample size formula and, assuming equal group sizes, is

\[ Z_\beta = \frac{Z_{1-\alpha/2}\sqrt{2\bar{P}(1-\bar{P})} - |P_1 - P_2|\sqrt{n}}{\sqrt{P_1(1-P_1) + P_2(1-P_2)}}. \]

A modification of the above formula [24], including continuity correction, is

\[ Z_\beta = \frac{Z_{1-\alpha/2}\sqrt{2\bar{P}(1-\bar{P})} - |P_1 - P_2|\sqrt{n - \dfrac{2}{|P_2 - P_1|}}}{\sqrt{P_1(1-P_1) + P_2(1-P_2)}}, \]

where $n$ is the sample size (per group), $P_1$ and $P_2$ are the expected proportions in each group, and $\bar{P}$ is the simple average of the expected proportions. Variables $Z_{1-\alpha/2}$ and $Z_\beta$ are standard normal Z values. Power is determined as 1 − cumulative probability associated with $Z_\beta$ as calculated from the formula (Table 7). Typical presentations of these formulas [24] incorporate $Z_{\alpha/2}$ and addition of the numerator components.
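The 2 power formulas translate directly into code. The sketch below is an illustration under the stated normal approximation with equal group sizes, not a reproduction of any software package; the function name `power_two_proportions` and its arguments are hypothetical, and `n` is taken to be the number of animals per group.

```python
import math
from statistics import NormalDist


def power_two_proportions(p1, p2, n, alpha=0.05, continuity=False):
    """Approximate power of a 2-sided test comparing 2 independent proportions.

    Assumption: n is the per-group sample size and group sizes are equal.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2                      # simple average of the proportions
    diff = abs(p1 - p2)

    # The continuity correction shrinks the effective sample size.
    n_eff = n - 2 / diff if continuity else n
    if n_eff <= 0:
        raise ValueError("sample size too small for the continuity correction")

    z_beta = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
              - diff * math.sqrt(n_eff)) / math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))

    # Power is 1 minus the cumulative probability associated with z_beta.
    return 1 - z.cdf(z_beta)


# Sensitivities of 85% and 90%, 100 infected animals per group, alpha of 5%.
print(round(power_two_proportions(0.85, 0.90, 100), 3))
print(round(power_two_proportions(0.85, 0.90, 100, continuity=True), 3))
```

With sensitivities of 85% and 90% and 100 animals per group, the sketch gives values close to the 18.5% and 12.8% reported in this section for the uncorrected and continuity-corrected formulas. Small differences in the last digit can arise from rounding of the Z values.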
An example would be to compare diagnostic sensitivity between 2 tests when both tests were independently performed on 100 infected animals. Assume that the tests are believed to have sensitivities of 85% and 90%, and a test with an alpha of 5% is desired. Epi Info 6 can be used to calculate the power of the test to compare these 2 proportions. From the menu, choose Programs → EPITABLE. From the menus in EPITABLE, choose Sample → Power calculation → Cohort study. The number of exposed should be set to 100, and the ratio of nonexposed to exposed as 1 (exposed and nonexposed is simply a way to distinguish the 2 groups). The relative risk worth detecting should be set to 1.06 (90%/85%; larger proportion over the smaller), the attack rate in the unexposed should be set as 85% as the lower of the 2 proportions, and alpha should be 5%. The power calculation includes continuity correction and is reported as 13.3% by Epi Info. Using the presented formulas, the powers are calculated as 18.5% and 12.8% for the uncorrected and continuity-corrected formulas, respectively.

Table 7. Common standard normal Z scores for use in sample size formulas and power estimation.*

  Standard normal Z score    Cumulative probability    1 − cumulative probability
  -2.576                     0.005                     0.995
  -2.326                     0.010                     0.990
  -1.960                     0.025                     0.975
  -1.645                     0.050                     0.950
  -1.282                     0.100                     0.900
  -1.036                     0.150                     0.850
  -0.842                     0.200                     0.800
  -0.524                     0.300                     0.700
  -0.253                     0.400                     0.600
   0.000                     0.500                     0.500
   0.253                     0.600                     0.400
   0.524                     0.700                     0.300
   0.842                     0.800                     0.200
   1.282                     0.900                     0.100
   1.645                     0.950                     0.050
   1.960                     0.975                     0.025
   2.326                     0.990                     0.010
   2.576                     0.995                     0.005

  * Power is found as 1 minus the cumulative probability associated with the Z score calculated from the power formula.

The calculation of power is dependent on the specification of an alternative hypothesis. The sampling distribution of the proportion under the null hypothesis is determined, and the critical value ($\Pr(Z \le z) = 1 - \alpha/2$) is located on this distribution.
The alternative hypothesis is set as the expected difference in the 2 population proportions, and the sampling distribution of this difference is plotted with the critical value under the null hypothesis. The area under the sampling distribution of the alternative hypothesis to the right of the critical value is the power of the statistical test (Fig. 5). The shapes of these curves depend on the hypothesized proportions and the sample size. There is only a single power value related to each possible alpha and alternative hypothesis (expected difference in proportions).

Figure 5. The sampling distribution under the null (black line) and alternative (gray line) hypotheses for the situation when P1 = 0.2 and P2 = 0.4 with equal group sizes. H0 is the null hypothesis that the true proportion is 0.3 (simple average of P1 and P2), and HA is the alternative hypothesis that P1 = 0.2 and P2 = 0.4 and is centered at P2. Alternatively, HA could have been centered at P1. The gray shaded area corresponds to the power for the statistical test with alpha of 5% when the sample size per group is 20 (A), 40 (B), 80 (C), and 160 (D). The power is 50% in panel B because the observed P value of the comparison is equal to the specified alpha (5%).

Conclusions

The calculation of the sample size is very important during the design stage of all epidemiologic studies and should match the proposed statistical analysis to the extent possible. It is important to recognize that there is no single correct sample size, and all calculations are only as good as the employed assumptions. The sample size ensures statistical significance if the subsequent data collection is perfectly consistent with the assumptions made for the sample size calculation (assuming power was set as 50% or greater). If the null hypothesis is false and the assumed alternate hypothesis is true, then the probability of observing statistical significance will be equal to the assumed power of the test.
The choice of assumptions for calculations is very important because their validity determines the likelihood of observing statistical significance. The traditional choices of 5% alpha and 20% beta can simply be used unless the investigator has specific reasons for other values. The choices of the best guesses or hypothesized values for the proportions that will be estimated by the study are more difficult. Values for these assumptions should be based on available literature or expert opinion. When there is doubt concerning their value, then proportions could be assumed to be close to 0.5. A proportion of 0.5 has the maximum variance, and therefore would result in the largest sample size.

Sample size calculations correspond to the number of animals that are required to complete the study and be available for statistical analysis. They are the minimum sample sizes required to achieve the desired statistical properties. Calculated sample sizes should be increased by the number of animals that are anticipated to be lost during the study. The study design influences the number of animals expected to be lost during implementation. Cross-sectional studies should have minimal losses, but there is always the possibility of mislabeled samples, lost records, and laboratory errors. Sample sizes for cross-sectional studies should be increased 1–5% to account for these potential losses. Prospective studies that cover long time periods could have substantial losses, but these types of study designs are unusual for diagnostic investigations.

Some published recommendations include the post-hoc calculation of power when study results fail to achieve statistical significance [34]. However, there is no statistical basis for this calculation [31]. The power of a 2-sided test with an alpha set to be equal to the observed P value is typically 50% [34], as presented in Figure 5.
Therefore, post-hoc power estimates will typically be less than 50% for observed nonsignificant results. This fact, in conjunction with the one-to-one relationship between P value and power, suggests that little information can be garnered from their calculation. Post-hoc calculations of power could be useful if performed for magnitudes of differences other than what was observed by the study. In general, however, the post-hoc calculation of power is akin to determining the probability that an event will be observed after the event has already occurred (or not).

A primary purpose of sample size calculations is to ensure that the proposed study will be of an appropriate size to find an important difference statistically significant. Therefore, calculations should be performed prior to the determination of the study size. In practice, however, sample size calculations are sometimes performed after the number of animals for study has been set, for reasons that might include cost or availability. Often, the assumptions are simply modified based on trial and error until calculations lead to the predetermined sample size, and these calculations are presented in grant applications or other proposed research plans. Also, studies are sometimes performed without performing any sample size calculations. Many journals require discussion of sample size calculations, and therefore such calculations are sometimes performed after the fact, with assumptions modified until the appropriate size is found. These are obviously not appropriate uses of sample size calculations. A better approach often would be the calculation of power based on the sample size expected to be used for the study. Though such post-hoc determinations are inappropriate or misleading, many epidemiologists and statisticians likely have been asked to perform these calculations. Unfortunately, the realities of research do not always coexist peacefully within the service of science itself.
It is hoped that the material presented in the present article will demystify sample size calculations and encourage their use during the initial design phase of surveillance and diagnostic evaluations.

Acknowledgements

This manuscript was prepared in part through financial support by the U.S. Department of Agriculture, Cooperative State Research, Education, and Extension Service, National Research Initiative Award 2005-35204-16087. The author would like to thank the anonymous reviewers for helpful suggestions, which resulted in a better overall paper.

Sources and manufacturers

a. Survey Toolbox version 1.04 by Angus Cameron et al. Available at http://www.ausvet.com.au/content.php?page=res_software.
b. Epi Info™ version 6.04d for Windows, Centers for Disease Control and Prevention, Atlanta, GA. Available at http://www.cdc.gov/epiinfo/Epi6/ei6.htm.
c. WINPEPI (PEPI-for-Windows) version 6.8 by J. H. Abramson. Available at http://www.brixtonhealth.com/pepi4windows.html.

References

1. Aitken CG: 1999, Sampling—how big a sample? J Forensic Sci 44:750–760.
2. Alonzo TA, Pepe MS, Moskowitz CS: 2002, Sample size calculations for comparative studies of medical tests for detecting presence of disease. Stat Med 21:835–852.
3. Altman DG: 1980, Statistics and ethics in medical research: III. How large a sample? Br Med J 281:1336–1338.
4. Berry DA, Lindgren BW: 1996, Statistics: theory and methods, 2nd ed. Duxbury Press, Belmont, CA.
5. Berry G, Armitage P: 1995, Mid-P confidence intervals: a brief review. Statistician 44:417–423.
6. Bochmann F, Johnson Z, Azuara-Blanco A: 2007, Sample size in studies on diagnostic accuracy in ophthalmology: a literature survey. Br J Ophthalmol 91:898–900.
7. Branscum AJ, Johnson WO, Gardner IA: 2006, Sample size calculations for disease freedom and prevalence estimation surveys. Stat Med 25:2658–2674.
8. Breslow NE, Day NE, eds.: 1980, Statistical methods in cancer research. In: The analysis of case-control studies, vol. 1. IARC Scientific Publications no. 32, pp. 131–133. International Agency for Research on Cancer, Lyon, France.
9. Cameron AR, Baldock FC: 1998, A new probability formula for surveys to substantiate freedom from disease. Prev Vet Med 34:1–17.
10. Campbell MK, Mollison J, Grimshaw JM: 2001, Cluster trials in implementation research: estimation of intracluster correlation coefficients and sample size. Stat Med 20:391–399.
11. Carlin JB, Doyle LW: 2002, Sample size. J Paediatr Child Health 38:300–304.
12. Carpenter TE: 2001, Use of sample size for estimating efficacy of a vaccine against an infectious disease. Am J Vet Res 62:1582–1584.
13. Casagrande JT, Pike MC: 1978, An improved approximate formula for calculating sample sizes for comparing two binomial distributions. Biometrics 34:483–486.
14. Clopper CJ, Pearson ES: 1934, The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26:404–413.
15. Connett JE, Smith JA, McHugh RB: 1987, Sample size and power for pair-matched case-control studies. Stat Med 6:53–59.
16. Connor RJ: 1987, Sample size for testing differences in proportions for the paired-sample design. Biometrics 43:207–211.
17. Daly LE: 1991, Confidence intervals and sample sizes: don’t throw out all your old sample size tables. BMJ 302:333–336.
18. Delucchi KL: 2004, Sample size estimation in research with dependent measures and dichotomous outcomes. Am J Public Health 94:372–377.
19. Dohoo IR, Martin W, Stryhn H: 2003, Veterinary epidemiologic research. AVC Inc., Charlottetown, PE, Canada.
20. Donner A: 1984, Approaches to sample size estimation in the design of clinical trials—a review. Stat Med 3:199–214.
21. Fisher RA: 1929, Tests of significance in harmonic analysis. Proc R Soc Lond A Math Phys Sci 125:54–59.
22. Flahault A, Cadilhac M, Thomas G: 2005, Sample size calculation should be performed for design accuracy in diagnostic test studies. J Clin Epidemiol 58:859–862.
23. Fleiss JL, Levin B: 1988, Sample size determination in studies with matched pairs. J Clin Epidemiol 41:727–730.
24. Fleiss JL, Levin BA, Paik MC: 2003, Statistical methods for rates and proportions, 3rd ed. Wiley, Hoboken, NJ.
25. Fleiss JL, Tytun A, Ury HK: 1980, A simple approximation for calculating sample sizes for comparing independent proportions. Biometrics 36:343–346.
26. Fosgate GT: 2005, Modified exact sample size for a binomial proportion with special emphasis on diagnostic test parameter estimation. Stat Med 24:2857–2866.
27. Fosgate GT: 2007, A cluster-adjusted sample size algorithm for proportions was developed using a beta-binomial model. J Clin Epidemiol 60:250–255.
28. Georgiadis MP, Johnson WO, Gardner IA: 2005, Sample size determination for estimation of the accuracy of two conditionally independent tests in the absence of a gold standard. Prev Vet Med 71:1–10.
29. Goodman SN: 1993, p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 137:485–496.
30. Greenland S: 1985, Power, sample size and smallest detectable effect determination for multivariate studies. Stat Med 4:117–127.
31. Greenland S: 1988, On sample-size and power calculations for studies using confidence intervals. Am J Epidemiol 128:231–237.
32. Greiner M, Gardner IA: 2000, Epidemiologic issues in the validation of veterinary diagnostic tests. Prev Vet Med 45:3–22.
33. Grimes DA, Schulz KF: 1996, Determining sample size and power in clinical trials: the forgotten essential. Semin Reprod Endocrinol 14:125–131.
34. Hoenig JM, Heisey DM: 2001, The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat 55:19.
35. Houle TT, Penzien DB, Houle CK: 2005, Statistical power and sample size estimation for headache research: an overview and power calculation tools. Headache 45:414–418.
36. Jones SR, Carley S, Harrison M: 2003, An introduction to power and sample size estimation. Emerg Med J 20:453–458.
37. Kelsey JL: 1996, Methods in observational epidemiology, 2nd ed. Oxford University Press, New York, NY.
38. Killip S, Mahfoud Z, Pearce K: 2004, What is an intracluster correlation coefficient? Crucial concepts for primary care researchers. Ann Fam Med 2:204–208.
39. Lachenbruch PA: 1992, On the sample size for studies based upon McNemar’s test. Stat Med 11:1521–1525.
40. Lachin JM: 1992, Power and sample size evaluation for the McNemar test with application to matched case-control studies. Stat Med 11:1239–1251.
41. Lancaster HO: 1961, Significance tests in discrete distributions. J Am Stat Assoc 56:223–234.
42. Li J, Fine J: 2004, On sample size for sensitivity and specificity in prospective diagnostic accuracy studies. Stat Med 23:2537–2550.
43. Lu Y, Bean JA: 1995, On the sample size for one-sided equivalence of sensitivities based upon McNemar’s test. Stat Med 14:1831–1839.
44. Lu Y, Jin H, Genant HK: 2003, On the non-inferiority of a diagnostic test based on paired observations. Stat Med 22:3029–3044.
45. McDermott JJ, Schukken YH: 1994, A review of methods used to adjust for cluster effects in explanatory epidemiological studies of animal populations. Prev Vet Med 18:155–173.
46. Neyman J, Pearson ES: 1928, On the use and interpretation of certain test criteria for purposes of statistical inference: part I. Biometrika 20A:175–240.
47. Neyman J, Pearson ES: 1933, On the problem of the most efficient tests of statistical hypotheses. Proc R Soc Lond A Math Phys 231:289–337.
48. Obuchowski NA: 1998, Sample size calculations in studies of test accuracy. Stat Methods Med Res 7:371–392.
49. Parker RA, Bregman DJ: 1986, Sample size for individually matched case-control studies. Biometrics 42:919–926.
50. Rigby AS, Vail A: 1998, Statistical methods in epidemiology. II: A commonsense approach to sample size estimation. Disabil Rehabil 20:405–410.
51. Rothman KJ, Greenland S: 1998, Modern epidemiology, 2nd ed. Lippincott-Raven, Philadelphia, PA.
52. Sahai H, Khurshid A: 1995, A note on confidence intervals for the hypergeometric parameter in analyzing biomedical data. Comput Biol Med 25:35–38.
53. Schlesselman JJ: 1974, Sample size requirements in cohort and case-control studies of disease. Am J Epidemiol 99:381–384.
54. Sheps S: 1993, Sample size and power. J Invest Surg 6:469–475.
55. Sterne JA, Davey Smith G: 2001, Sifting the evidence—what’s wrong with significance tests? BMJ 322:226–231.
56. Suess EA, Gardner IA, Johnson WO: 2002, Hierarchical Bayesian model for prevalence inferences and determination of a country’s status for an animal pathogen. Prev Vet Med 55:155–171.
57. Suissa S, Shuster JJ: 1991, The 2 × 2 matched-pairs trial: exact unconditional design and analysis. Biometrics 47:361–372.
58. Thrusfield MV: 2005, Veterinary epidemiology, 3rd ed. Blackwell Science, Ames, IA.
59. Ukoumunne OC: 2002, A comparison of confidence interval methods for the intraclass correlation coefficient in cluster randomized trials. Stat Med 21:3757–3774.
60. Ury HK, Fleiss JL: 1980, On approximate sample sizes for comparing two independent proportions with the use of Yates’ correction. Biometrics 36:347–351.
61. Whitley E, Ball J: 2002, Statistics review 4: sample size calculations. Crit Care 6:335–341.
62. Wickramaratne PJ: 1995, Sample size determination in epidemiologic studies. Stat Methods Med Res 4:311–337.