
J Vet Diagn Invest 21:3–14 (2009)
Practical sample size calculations for surveillance and
diagnostic investigations
Geoffrey T. Fosgate1
Abstract. The likelihood that a study will yield statistically significant results depends on the chosen
sample size. Surveillance and diagnostic situations that require sample size calculations include certification of
disease freedom, estimation of diagnostic accuracy, comparison of diagnostic accuracy, and determining
equivalency of test accuracy. Reasons for inadequately sized studies that do not achieve statistical significance
include failure to perform sample size calculations, selecting sample size based on convenience, insufficient
funding for the study, and inefficient utilization of available funding. Sample sizes are directly dependent on
the assumptions used for their calculation. Investigators must first specify the likely values of the parameters
that they wish to estimate as their best guess prior to study initiation. They further need to define the desired
precision of the estimate and allowable error levels. Type I (alpha) and type II (beta) errors are the errors
associated with rejection of the null hypothesis when it is true and the nonrejection of the null hypothesis when
it is false (a specific alternative hypothesis is true), respectively. Calculated sample sizes should be increased by
the number of animals that are expected to be lost over the course of the study. Free software routines are
available to calculate the necessary sample sizes for many surveillance and diagnostic situations. The
objectives of the present article are to briefly discuss the statistical theory behind sample size calculations and
provide practical tools and instruction for their calculation.
Key words: Diagnostic testing; epidemiology; sample size; study design; surveillance.
From the Department of Veterinary Integrative Biosciences, College of Veterinary Medicine and Biomedical Sciences, Texas A&M University, College Station, TX.
1 Corresponding Author: Geoffrey T. Fosgate, Department of Veterinary Integrative Biosciences, College of Veterinary Medicine and Biomedical Sciences, Texas A&M University, College Station, TX 77843. gfosgate@cvm.tamu.edu

Introduction

Calculation of sample size is important for the design of epidemiologic studies,18,62 and specifically for surveillance9 and diagnostic test evaluations.6,22,32 The probability that a completed study will yield statistically significant results depends on the choice of sample size assumptions and the statistical model used to make calculations. The statistical methodology employed for sample size calculations should parallel the proposed data analysis to the extent possible.18 The most frequently chosen sample size routines are based on frequentist statistics, and these have been reviewed previously for other fields.1,11,20,33,35,36,50,54,61,62 Issues specifically related to diagnostic test validation also have been discussed.2,28,42,48 Sample size routines related to issues of surveillance, as well as diagnostic test validation using Bayesian methodology, also have been developed.7,56

Surveillance and diagnostic situations that require sample size calculations include the detection of disease in a population to certify disease freedom, estimation of diagnostic accuracy, comparison of diagnostic accuracy among competing assays, and equivalency testing of assays. The appropriate sample size depends on the study purpose, and no calculations can be made until study objectives have been defined clearly. Sample size calculations are important because they require investigators to clearly define the expected outcome of investigations, encourage development of recruitment goals and a budget, and discourage the implementation of small, inconclusive studies. Common sample size mistakes include not performing any calculations, making unrealistic assumptions, failing to account for potential losses during the study, and failing to investigate sample sizes over a range of assumptions. Reasons for inadequately sized studies that do not achieve statistical significance include failing to perform sample size calculations, selecting sample size based on convenience, failing to secure sufficient funding for the project, and not using available funding efficiently.

There is no single correct sample size "answer" for any given epidemiologic study objective or biologic question. Calculated sizes depend on assumptions made during their calculation, and such assumptions cannot be known with certainty. If assumptions were known to be true with certainty, then the study being designed would likely not add to the scientific understanding of the problem. There are concepts
that are important to consider when performing
sample size calculations, despite the inability to
classify certain results as correct or incorrect. A few
simple formulas are generally sufficient for most
sample size situations that would be encountered in
the design of studies to determine disease freedom
and evaluate diagnostic tests. The objectives of the
present article are to briefly discuss the statistical
theory behind sample size calculations and provide
practical tools and instruction for their calculation.
This review will only discuss issues related to
frequentist approaches to sample size calculation
and will emphasize conservative methods that result
in larger sample sizes.
Epidemiologic errors
The current presentation of statistical results in the
medical literature tends to be a blending of
significance testing attributed to the work of
Fisher,21 subsequently discussed by others,29,55 and
hypothesis testing as attributed to Neyman and
Pearson.46,47 The P value in the Fisher significance
testing approach is considered a quantitative value
documenting the level of evidence for or against the
null hypothesis. The P value is formally defined as
the probability of observing the current data, or data more extreme, when the null hypothesis is true. The
hypothesis testing approach as introduced by Neyman and Pearson was based on rejection or
acceptance of null hypotheses using specified P value
cutoffs. The hypothesis testing interpretation of
statistical results allows for the definition of type I
and type II errors as the errors associated with
rejection of the null hypothesis when it is indeed true
and the acceptance of the null hypothesis when it is
false (and a particular alternative hypothesis is true),
respectively.47 The probabilities of making these
errors are frequently referred to as alpha (a) and
beta (b) for type I and type II errors, respectively.36,54
Current sample size procedures are derived from the
hypothesis testing approach as put forth by Neyman
and Pearson; however, current convention is to use
the terminology of ‘‘failure to reject’’ rather than
acceptance of a null hypothesis.

Table 1. Definition of type I and type II errors.*

                        True state of nature
Statistical result      HO is true              HA is true (HO false)
Reject HO               Type I error (alpha)    Agreement with truth
Fail to reject HO       Agreement with truth    Type II error (beta)

* HO = null hypothesis; HA = alternative hypothesis.
Figure 1. Sampling distributions presented under the null (HO, black line) and alternative (HA, gray line) hypotheses. The black shaded area corresponds to alpha (type I error), and the gray shaded area corresponds to beta (type II error). Sample size calculations solve for the number so that the critical value (cv) corresponds to the location where Pr(Z ≤ z) = 1 − α/2 and Pr(Z ≤ z) = β (or Pr(Z ≤ z) = 1 − α for a 1-sided test).
A requirement for sample size calculation is the
specification of alpha and beta when considering the
testing of a statistical hypothesis (Table 1). Precision-based sample size methods must specify alpha, but
beta is not included in the equations and, based on
the typical large-sample approximation methods, is
consequently assumed to be 50% for the alternative
hypothesis that the true value falls outside the limits
of the calculated confidence interval.17 The P value
obtained after statistical analysis will equal the
prespecified alpha if the assumptions of the sample
size calculations are observed exactly in the collected
data due to their similar probabilistic definition.
However, the meaning of beta is often misunderstood
as simply ‘‘the probability of accepting the null
hypothesis when a true difference exists’’ based on
presentations in tables and figures.11,33,36,54 The issue
is that there are an infinite number of specific
alternative hypotheses that could be true if the null
hypothesis is false, and many will be less probable
than the null itself. Beta can be calculated only after
an explicit alternative hypothesis has been specified.
The hypothesis that is chosen during sample size
calculation is the expected difference between the
population values. Alpha and beta correspond to
areas under sampling distributions for population
means (including proportions) under the null and
alternative hypotheses, respectively (Fig. 1). The
statistical power of a test is defined as 1 − β, or the
probability of rejecting the null hypothesis when
the alternative hypothesis is true.
Sample size adjustment factors
Sample size calculations are often based on large-sample approximation methods.24,30,51 The quality of
the approximate results depends on the specific
sample size situation, and adjustment factors have
been developed to improve their approximation to
exact distributions. Some of the typical adjustment
factors include the finite population, continuity
correction, and variance inflation factors.
The finite population correction factor4,19 is typically considered when the study objective is to estimate a population proportion. Sampling is usually performed without replacement, and if the sample size is relatively large compared with the total population, then this correction factor should be considered. A typical recommendation is to employ this factor when the sample includes 10% or more of the population.19 The need for this correction factor is
derived from the fact that sampling is hypergeometric
(sampling without replacement, as from a deck of
cards), and sample size formulas are based on
binomial (sampling with replacement) theory. The
formula19 for the correction is

n' = n \times \frac{N}{N + n},

where n' is the corrected sample size, and n and N are the uncorrected sample size and population size, respectively.
Applying this correction factor causes the sample size
to be smaller than the uncorrected. Confidence
interval algorithms have been developed based on
hypergeometric sampling,52 but the author is not
aware of their availability in statistical packages.
Usual confidence interval calculation methods are
based on either normal approximation or exact
binomial methodology. Application of the finite
population correction factor is only recommended
by the author when the analysis also incorporates
adjustment for hypergeometric sampling.
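
As a minimal sketch of this adjustment (Python; the function name and the example numbers are illustrative assumptions, not values from the text):

```python
import math

def finite_population_correction(n, N):
    """Shrink a sample size n (computed under binomial, with-replacement
    assumptions) for sampling without replacement from a population of N."""
    return math.ceil(n * N / (N + n))

# hypothetical illustration: an uncorrected requirement of 657 animals
# drawn from a herd of 2,000
print(finite_population_correction(657, 2000))  # 495
```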
The continuity correction factor51 is employed
when the study objective is to compare 2 population
proportions (including diagnostic sensitivity or specificity). The difference in proportions is approximated
by a normal distribution in typical sample size
formulas, even though binomial distributions are
discrete and normal distributions are continuous. The
normal approximation might not always be adequate,
and continuity correction should be applied to better
approximate the exact distribution (Fig. 2). The
formula25 for continuity correction is

n' = \frac{n}{4}\left[1 + \sqrt{1 + \frac{4}{n|P_1 - P_2|}}\right]^2,

where n' and n are the corrected and uncorrected sample sizes, respectively, and P1 and P2 are the hypothesized proportions. Applying the continuity correction increases the sample size over the uncorrected and should typically be applied. Frequently employed methods for the comparison of proportions use continuity correction when calculating chi-square test statistics.8,58

Figure 2. Cumulative probability function for a binomial distribution (n = 12, p = 0.5) (gray shading) overlaid with the corresponding cumulative normal distribution (μ = 6, σ = 1.732), denoting the uncorrected (A; probability = 0.500) and continuity-corrected (B; probability = 0.614) probability for observing 6 successes. Continuity correction improves the approximation of the true binomial cumulative probability for observing 6 successes, which is 0.613.
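
A minimal sketch of this correction (Python; the function name is an assumption, and the example anticipates the uncorrected group size of the MRI comparison presented later in this article):

```python
import math

def continuity_corrected_n(n, p1, p2):
    """Apply the Fleiss continuity correction to an uncorrected sample
    size n for comparing proportions p1 and p2; always increases n."""
    return math.ceil((n / 4) * (1 + math.sqrt(1 + 4 / (n * abs(p1 - p2)))) ** 2)

# 199 per group uncorrected for proportions 0.90 vs. 0.80 becomes 219,
# the figure used in the MRI example later in this article
print(continuity_corrected_n(199, 0.90, 0.80))  # 219
```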
Sample size calculations for estimating proportions
typically involve making the assumption of independence among sampling units. Lack of independence
that is introduced when a clustered sampling design is
employed can be adjusted by inflating the variance
estimate. The design effect (DE),10,38 or variance
inflation factor, is defined as the variance of the
sampling design compared with simple random
sampling. The formula10,45,59 for its calculation is
DE = 1 + \rho(m - 1),

where ρ is the intraclass correlation, and m is the
sample size within each cluster. When clustered
sampling is employed, then the sample size estimated
6
Fosgate
by the usual methods assuming independence is
multiplied by the DE to account for the expected
dependence.
The intraclass correlation is a relative measure of
the homogeneity of sampling units within each cluster
compared with a randomly selected sampling unit.
This correlation is formally defined as the proportion
of total variation within sampling units that can be accounted for by variation among clusters.38,45 A high
correlation indicates more dependence within the
data, resulting in a larger DE. The intraclass
correlation is generally estimated from pilot data or
based on estimates available from the literature. If the
number of clusters is fixed by design and the cluster
sample size is unknown, then it is not possible to
simply use the previously mentioned formula for the
DE. The sample size per cluster (m) must first be
estimated, and it is based on the effective sample size
(ESS), which is the sample size estimated assuming
independence. It is also necessary to know the
number of clusters (k) and the intraclass correlation (ρ). The formula27 for calculation of the cluster
sample size is
m = \frac{ESS - \rho(ESS)}{k - \rho(ESS)}.
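
The 2 cluster formulas can be sketched as follows (Python; the ESS, herd count, and intraclass correlation in the illustration are hypothetical):

```python
import math

def design_effect(rho, m):
    """Variance inflation factor for clustered sampling: DE = 1 + rho(m - 1)."""
    return 1 + rho * (m - 1)

def cluster_sample_size(ess, k, rho):
    """Animals per cluster when k clusters are fixed by design and ESS is
    the sample size calculated assuming independence."""
    return (ess - rho * ess) / (k - rho * ess)

# hypothetical illustration: ESS = 300, 50 herds available, rho = 0.1
m = cluster_sample_size(300, 50, 0.1)     # 13.5 animals per herd
print(math.ceil(m), math.ceil(m) * 50)    # 14 per herd, 700 animals in total
```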
Sample size situations
Surveillance or detection of disease
The detection of disease in a population is
important for herd certification programs and for
documenting freedom from disease after an outbreak.
It has implications in regional and international trade
of animals and animal products. The first step is to
determine the prevalence of disease that is important
to detect. A prevalence of disease at this level or
greater is considered biologically important. Documenting a zero prevalence of disease is not typically
possible because it would require testing the entire
population with a perfect assay. The next step is to
define the level of confidence for which it is desired to
find the disease should it be present in the population
at the hypothesized prevalence or higher. Again,
100% confidence is not feasible because it would
require sampling all animals and testing with a perfect
assay. Alpha is calculated as 1 − confidence. The final
step is to determine the statistical model to use for
calculations. In small populations, sample size calculations should be based on a hypergeometric distribution (sampling without replacement). In larger
populations, it is often assumed that the true
hypergeometric distribution can be well approxi-
mated by the binomial (sampling with replacement).
The sample size formula assuming a binomial model is based on the following relationship: (1 − p)^n = 1 − confidence. The formula19 after solving for the sample size is

n = \frac{\log \alpha}{\log(1 - p)},

where α is 1 − confidence, and p is the prevalence worth detecting. The corresponding formula19 based on hypergeometric sampling is

n = \left(1 - \alpha^{1/D}\right)\left(N - \frac{D - 1}{2}\right),

where α is 1 − confidence, N is the population size, and D is the expected number of diseased animals in the population.
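
Both detection formulas are simple enough to script. The sketch below (Python; function names are illustrative) reproduces the Texas bovine tuberculosis figures quoted below when results are rounded to the nearest whole number; strict rounding up would give 2,995 and 2,389 instead:

```python
import math

def n_detect_binomial(prev, confidence=0.95):
    """Detection sample size treating the population as effectively infinite."""
    alpha = 1 - confidence
    return math.log(alpha) / math.log(1 - prev)

def n_detect_hypergeometric(prev, N, confidence=0.95):
    """Detection sample size for a finite population of size N."""
    alpha = 1 - confidence
    d = math.ceil(prev * N)          # expected number of diseased units
    return (1 - alpha ** (1 / d)) * (N - (d - 1) / 2)

# Texas TB example: 7,650 herds, herd-level prevalence 0.001, 95% confidence
print(round(n_detect_binomial(0.001)))              # ~2,994
print(round(n_detect_hypergeometric(0.001, 7650)))  # ~2,388
```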
The necessary sample size for various combinations
of prevalence and confidence can be tabulated
(Table 2), and software is available that will perform
the necessary calculations. Survey Toolboxa can
perform these calculations and is available free for
download. The software performs calculations based
on both binomial and hypergeometric sampling and
can also adjust for imperfect sensitivity and specificity
of employed tests.
An example of this type of sample size problem is
illustrated by the regulatory agency in Texas when it
decided to perform active surveillance for bovine
tuberculosis (Table 3). There are approximately 7,650
registered pure-bred beef seed stock producers in
Texas, and it was decided that a herd-level prevalence
of 0.001 (1 in 1,000 herds infected) or greater was
important to detect with 95% confidence. Survey
Toolbox can be used to solve this sample size
problem. From the menu, choose Freedom From
Disease → Sample Size. Click on the Sample Size tab
and input 100 for the sensitivity and specificity of the
test. Change the population size to 7,650 and set
the minimum expected prevalence to 0.1%. Click on
the Options tab and be sure that the type I error is set
at 0.05. Click to have the program calculate the
sample size based on the simple binomial model. No
other changes are necessary. Go back to the Sample
Size tab and click on the Calculate button. The
sample size based on the hypergeometric model can
be calculated by changing to the Modified Hypergeometric Exact on the Options tab before clicking on the
Calculate button.
A binomial model suggested that the necessary
sample size would be 2,994 of the 7,650 beef
operations (39%). The interpretation is that assuming
that the true prevalence is at least 0.001, then a
sample consisting only of noninfected herds would
occur 5% of the time or less when the sample size is 2,994 (assuming a perfect test at the herd level). The hypergeometric model might be more appropriate, because sampling would be from a finite population without replacement; using the hypergeometric formula, the sample size is 2,388 herds (31%) of the 7,650 total.

Table 2. Number needed for study to be confident that the disease will be detected if present at or above a specified prevalence, based on hypergeometric sampling and assuming a perfect test.

                   Expected prevalence of disease in population (95%/99% confidence)
Population size    0.10     0.05     0.02       0.01       0.005
20                 15/18    19/20    20/20      20/20      20/20
50                 22/29    34/41    48/50      50/50      50/50
100                25/35    44/59    77/90      95/99      100/100
150                26/38    48/67    94/117     129/143    150/150
200                27/39    51/72    105/136    155/180    190/198
250                27/40    52/75    112/149    174/210    238/244
300                27/41    53/77    117/159    189/235    233/286
350                27/41    54/79    121/167    201/255    272/324
400                27/41    54/80    124/174    210/272    310/360
450                28/42    55/81    126/179    218/287    349/391
500                28/42    55/82    128/183    224/300    388/420
600                28/42    56/83    131/189    235/320    378/470
700                28/42    56/84    134/194    243/336    369/511
800                28/43    56/85    135/198    249/349    421/546
1,000              28/43    57/86    138/204    258/367    450/601
1,200              28/43    57/86    139/208    264/381    471/642
1,400              28/43    57/87    141/210    268/391    486/673
1,600              28/43    57/87    142/212    272/398    500/699
1,800              28/43    57/88    142/214    275/404    509/719
2,000              28/43    58/88    143/215    278/409    517/736
3,000              28/43    58/88    145/219    284/425    542/791
4,000              28/44    58/89    146/222    287/433    555/821
5,000              28/44    58/89    146/223    289/438    563/839
6,000              28/44    58/89    146/224    291/441    569/852
10,000             28/44    58/89    147/225    294/443    580/888
100,000            28/44    58/90    148/228    298/457    596/915
>100,000*          28/44    58/90    148/228    298/458    598/919

* Based on binomial model.

Table 3. Sample size situation for the detection of bovine tuberculosis (TB) in beef cattle herds.

Biologic question: Is TB present in the 7,650 registered beef herds in Texas?
Statistical model: A hypergeometric model is assumed because sampling of herds will be performed without replacement.
Prevalence of TB: If TB is present at or above 0.1% (1 in 1,000 herds), then it is biologically important to detect.
Confidence level: If TB is present at this prevalence, then it will be detected with 95% confidence.
Results: 2,388 of the 7,650 total herds should be tested.
Interpretation: When the true prevalence is at least 0.1%, then a sample consisting only of TB-uninfected herds would occur 5% of the time or less when the sample size is 2,388 (assuming a perfect test at the herd level).
Estimation of a population proportion
Calculating the sample size necessary to estimate a
population proportion is important when an estimate
of disease prevalence or diagnostic test validation is
desired. The sensitivity and specificity of an assay
should be considered population estimates in the
same manner as other proportions. The sample size
formulas employed for these calculations are typically
considered to be precision-based because they involve finding confidence intervals of a specified width rather than testing hypotheses. The typical sample size formula37,58 based on the normal approximation to the binomial is

n = \frac{P(1 - P) Z_{1-\alpha/2}^2}{e^2},

where P is the expected proportion (e.g., diagnostic sensitivity), e is one half the desired width of the confidence interval, and Z(1−α/2) is the standard normal Z value corresponding to a cumulative probability of 1 − α/2. The investigator must specify a best guess for the proportion that is expected to be found after performing the study. The investigator also needs to specify the desired width of the interval around this proportion and the level of confidence. In essence, this procedure will find the sample size that, upon statistical analysis, would result in a confidence interval with the specified probability and limits if the assumed proportion were in fact observed by the study (Fig. 3). The resulting sample size could be adjusted using the finite population correction factor, and if this is performed, then the statistical analysis should be similarly adjusted at the end of the study. Sample sizes calculated using formulas should always be rounded up to the nearest whole number.

Figure 3. The sample size is determined so that the sampling distribution of the hypothesized proportion (P̂) has an area under the curve between the specified upper (PU) and lower (PL) bounds of the confidence interval equal to the specified probability (gray shaded area); Pr(PL ≤ P̂ ≤ PU) = confidence level.
The sample size methods based on the normal
approximation to the binomial might not be adequate
when the expected proportion is close to the boundary
values of 0 or 1. Exact binomial methods are preferred
when the proportion is expected to fall outside the
range of 0.2–0.8.26 The binomial probability function is
the basis of exact sample size methods, and it is
\binom{n}{x} P^x (1 - P)^{n - x},

where P is the hypothesized proportion, n is the sample size, and x is the number of observed "successes."
Derivation of a sample size algorithm based on the
binomial probability function has been described
previously.26 It is based on the mid-P adjustment5,41
for the Clopper-Pearson method of exact confidence
interval estimation.14 The investigator specifies PU and
PL as the desired limits of the confidence interval
around the hypothesized proportion (P) and the
desired level of confidence. The calculated sample size
could be adjusted using the finite population correction factor if deemed appropriate.
Software is available to calculate the necessary
sample size for estimating population proportions.
Epi Infob includes software that can perform these
calculations33 and is available free for download. The
software performs calculations based on normal
approximation methods and will apply the finite
population correction factor if the population size is
specified. Software to perform calculations based on
binomial exact methods (Mid-P Sample Size routine)
can be obtained by contacting the author.
An example of this type of sample size problem is
the design of a study to estimate the diagnostic
specificity of a new assay to screen healthy cattle for
Foot-and-mouth disease virus (FMDV; Table 4). The
number of cattle necessary to sample could be
calculated for an expected specificity of 0.99 and the
desire to estimate this specificity ±0.01 with 99%
confidence. For this example, it will be assumed that
sampling is from a large population, and a simple
random sampling design will be employed. Epi Info 6
can be used to make the calculation based on normal
approximation methods (newer versions of Epi Info
have not retained the presented sample size routines).
From the menu, choose Programs → EPITABLE. From the menus in EPITABLE, choose Sample → Sample size → Single proportion. The size of the
population does not need to be changed unless the
application of the finite population correction is
desired. The design effect should be 1.0 unless the
variance is to be adjusted for clustered sampling
techniques. Enter 1% for the desired precision, 99%
for the expected prevalence, and check 1% for alpha.
Alternatively, the modified exact sample size routine
could be used. Open the Mid-P Sample Size routine
and enter 0.99 for the proportion, 0.01 for the error
limit, and 0.99 for the confidence level. The normal approximation method suggested that 657 cattle would need to be sampled, whereas the method based on the exact binomial distribution suggested that 974 should be sampled. Neither of these numbers incorporates finite population correction. The sample size based on the exact distribution is preferred, and it is substantially larger than the sample size based on the normal approximation because the expected proportion is very close to 1.

Table 4. Sample size situation for estimating the specificity of a test to screen cattle for Foot-and-mouth disease virus (FMDV).

Biologic question: What is the specificity of a new screening test for FMDV in cattle?
Statistical model: An exact binomial model is assumed because the specificity is expected to be close to 1.
Best guess for specificity: The new assay is expected to be 99% specific.
Precision: Specificity is desired to be estimated ±1%.
Confidence level: Specificity is desired to be estimated to be within these limits with 99% confidence.
Results: 974 cattle should be sampled.
Interpretation: If 974 cattle are sampled and the true specificity of the test is 99%, then a 99% confidence interval should have the specified width (2%).
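
The normal-approximation figure above can be reproduced with a short script (Python; the function name is an assumption); the exact mid-P figure of 974 requires the specialized routine described above and is not reproduced here:

```python
import math
from statistics import NormalDist

def n_estimate_proportion(p, e, confidence=0.99):
    """Precision-based sample size to estimate a proportion p to within
    +/- e, using the normal approximation to the binomial."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(p * (1 - p) * z ** 2 / e ** 2)

# FMDV specificity example: 0.99 +/- 0.01 with 99% confidence
print(n_estimate_proportion(0.99, 0.01))  # 657
```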
Typically, sample size calculations for studies that
will perform clustered sampling first calculate the
necessary sample size assuming independence or lack
of clustering. Calculated sample sizes are then
multiplied by the DE to account for the lack of
independence. Expert opinion can be used to account
for expected correlation of sampling units when prior
information concerning the intraclass correlation is
not available. A sample size routine incorporating a
method to estimate the DE based on expert opinion
for a fixed number of clusters has been developed27
and is available from the author.
Comparison of 2 proportions
Independent proportions. Calculating the sample size
necessary to compare 2 population proportions is
important when a comparison of the accuracy of
diagnostic tests is desired. Sensitivity and specificity are
population estimates, and comparison between 2 assays
should be based on this sample size situation. The usual
sample size formula13,25,53 based on the normal
approximation to the binomial with equal group sizes is
n = \frac{\left[Z_{1-\alpha/2}\sqrt{2\bar{P}(1 - \bar{P})} - Z_{\beta}\sqrt{P_1(1 - P_1) + P_2(1 - P_2)}\right]^2}{(P_2 - P_1)^2},
where P1 and P2 are the expected proportions in each group, and P̄ is the simple average of the expected proportions. Variables Z(1−α/2) and Zβ are the standard normal Z values corresponding to the selected alpha (2-sided test) and beta, respectively. Typical presentation of the formula11,12 above uses Zα/2 instead and an
addition of the 2 components within the numerator.
Solving these 2 formulations gives the same sample size
because the numerator is squared. The specific
formulation has been included here because
alternative hypotheses have been presented in figures
as being on the positive side of the null hypothesis, and
therefore Zβ should be negative. This is also consistent with the algebraic manipulation to solve for Zβ, as
presented in the section related to power calculation.
The resulting sample size should be adjusted using the continuity correction factor, and all sample sizes should be rounded up to the nearest whole number. The magnitude of the difference between the 2 proportions has a greater effect on calculated sample sizes than typical values for alpha and beta (Fig. 4). The absolute magnitude of the proportions affects the calculations, with proportions closer to 0.5 resulting in larger sample sizes11 because the variance of a proportion is greatest at this value. The formula for the standardized difference (SDiff) in proportions3,36,61 is

SDiff = \frac{|P_1 - P_2|}{\sqrt{P_1(1 - P_1) + P_2(1 - P_2)}}.

Figure 4. Sample size estimates are affected by the standardized difference and the specified alpha (type I error) and beta (type II error).
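
A minimal sketch of this calculation, with the continuity correction applied (Python; the function name is illustrative), reproduces the 219 dogs per group of the MRI example below:

```python
import math
from statistics import NormalDist

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to compare 2 independent proportions using the
    normal approximation, followed by the Fleiss continuity correction."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(1 - power)      # negative when power > 50%
    p_bar = (p1 + p2) / 2
    n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
         - z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    n_cc = (n / 4) * (1 + math.sqrt(1 + 4 / (n * abs(p1 - p2)))) ** 2
    return math.ceil(n_cc)

# MRI example below: sensitivities 0.90 vs. 0.80, alpha 5%, power 80%
print(n_two_proportions(0.90, 0.80))  # 219 per group
```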
Software is available to calculate the necessary sample
size to compare 2 independent population proportions.
Epi Infob can be used to perform these calculations.
The calculations are based on normal approximation
methods and will apply a continuity correction factor.
An example of this type of sample size problem is the
design of a study to compare the diagnostic sensitivity
of magnetic resonance imaging (MRI) for detection of
intervertebral disk disease between chondrodystrophoid
and nonchondrodystrophoid breeds of dogs (Table 5).
The number of dogs necessary to sample could be
calculated for expected sensitivities of 90% and 80% in
chondrodystrophoid and nonchondrodystrophoid
dogs, respectively. The statistical test could be desired
to have an alpha of 5% and beta of 20% to detect this
difference in proportions. The ratio of chondrodystrophoid to nonchondrodystrophoid dogs also needs to
be specified, and the assumption could be made to have
equal group sizes. Epi Info 6 can be used to make the
calculation. From the menu, choose Programs → EPITABLE. From the menus in EPITABLE, choose
Sample → Sample size → Two proportions. The ratio of
group 1 to group 2 should be 1, the percentage in group
1 would be 90%, the percentage in group 2 would be
80%, alpha should be 5%, and the power should be set
at 80%. Calculations suggest that 219 dogs are
necessary in each group (chondrodystrophoid and
nonchondrodystrophoid) for a total of 438. The
reported sample size includes continuity correction.

Table 5. Sample size situation for comparing the sensitivity of magnetic resonance imaging (MRI) for detection of intervertebral disk disease (IVDD) between chondrodystrophoid and nonchondrodystrophoid breeds of dogs.

Biologic question: Is the sensitivity of MRI for detecting IVDD different for chondrodystrophoid and nonchondrodystrophoid dogs?
Statistical model: A binomial model is assumed for sensitivity, and a normal approximation is used for the comparison of sensitivities.
Best guess for sensitivity 1: MRI is expected to be 90% sensitive for detecting IVDD in chondrodystrophoid dogs.
Best guess for sensitivity 2: MRI is expected to be 80% sensitive for detecting IVDD in nonchondrodystrophoid dogs.
Type I error: A statistical test with 5% type I error (alpha) is desired.
Type II error: A statistical test with 20% type II error (beta) is desired.
Other assumptions: Assume an equal number of chondrodystrophoid and nonchondrodystrophoid dogs will be evaluated.
Results: 219 dogs are necessary in each group, for a total of 438 dogs.
Interpretation: If 219 dogs are available from each breed group for sensitivity estimation and the true sensitivities are 80% and 90%, then a statistical test interpreted at the 5% level of significance (alpha) will have a power of 80% for detecting the difference in sensitivities as statistically significant.
Sample size calculations for the comparison of
proportions when the group sizes are not equal are a
simple modification of the presented formula.60 The
formula also can be modified to allow for the
estimation of odds ratios and risk ratios.20,53 All
presented formulas correspond to the necessary
sample sizes for 2-sided statistical tests. Variable
Z(1−α/2) is replaced with Z(1−α) to modify the formula for a 1-sided test.
Dependent proportions. When multiple tests are
performed based on specimens collected from the same
animal, then the proportions (i.e., sensitivity and
specificity) should be considered dependent. There
are multiple conditional and unconditional approaches
to solving this sample size problem,15,16,23,39,40,43,44,49,57
and a formula is not presented in this section due to
increased complexity and lack of consensus among
competing methods. An example of this type of sample
size problem is the design of a study to compare the
diagnostic specificity of 2 tests for FMDV screening in
healthy cattle (Table 6). Serum samples from each
selected animal for study will have both tests
performed in parallel. The number of cattle necessary
to sample could be calculated based on expected
specificities of 99% and 95% in test 1 and test 2,
respectively. The statistical test could be desired to
have alpha be 1% and beta 10% to detect this
difference in proportions. Software is available to
calculate the necessary sample size to compare 2
dependent population proportions. WinPepic includes
software that can perform these calculations and is
available free for download. From the main menu, the
program PAIRSetc should be selected. Sample size
should be chosen from the top menu of PAIRSetc. The
correct type of sample size procedure corresponds to
the McNemar test, and S1 should therefore be selected.
The significance level should be set as 1% and the
power as 90%. The expected percentage of ‘‘Yes’’ in
the first set of observations should be set as 99%, and
the other percentage of ‘‘Yes’’ should be set as 95%.
Numbers without ‘‘%’’ should be entered, and the
remainder of the input boxes should be left blank.
Table 6. Sample size situation for comparing the specificity of 2 tests for Foot-and-mouth disease virus (FMDV) screening in healthy cattle.

Biologic question: Is the specificity of a new FMDV screening test in cattle different than the more established test?
Statistical model: A binomial model is assumed for specificity, a multinomial model is assumed for the cross-classified proportions, and a normal approximation is used for the difference in specificities.
Best guess for specificity 1: The new test is expected to be 99% specific.
Best guess for specificity 2: The established test is considered 95% specific.
Type I error: A statistical test with 1% type I error (alpha) is desired.
Type II error: A statistical test with 10% type II error (beta) is desired.
Other assumptions: Data from the 2 tests will be dependent because they will be performed on the same cattle.
Results: 544 cattle are required, and both tests performed on all.
Interpretation: If 544 cattle are available for specificity estimation and the true specificities are 99% and 95%, then a statistical test interpreted at the 1% level of significance (alpha) will have a power of 90% for detecting the difference in specificities as statistically significant.
Calculations suggest that 544 pairs of observations are
required (544 cattle total). This sample size is smaller
than the corresponding sample size if the proportions
were considered to be independent.
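
Although no single formula is endorsed above, one widely used conditional approach based on the McNemar test (Connor16) can be sketched (Python; function name illustrative). Assuming the errors of the 2 tests are conditionally independent, which appears to be the assumption behind the WinPepi result when the dependence inputs are left blank, it reproduces the 544 pairs:

```python
import math
from statistics import NormalDist

def n_mcnemar(p10, p01, alpha=0.01, power=0.90):
    """Pairs needed to detect a difference between dependent proportions
    via McNemar's test (Connor's approximation); p10 and p01 are the
    expected probabilities of the 2 discordant outcomes."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    diff = abs(p10 - p01)
    psi = p10 + p01
    n = (z_a * math.sqrt(psi) + z_b * math.sqrt(psi - diff ** 2)) ** 2 / diff ** 2
    return math.ceil(n)

# FMDV example: specificities 0.99 and 0.95; assuming conditionally
# independent test errors, the discordant-cell probabilities are
p10 = 0.99 * (1 - 0.95)   # test 1 negative, test 2 falsely positive
p01 = (1 - 0.99) * 0.95   # test 1 falsely positive, test 2 negative
print(n_mcnemar(p10, p01))  # 544 pairs, matching the WinPepi result
```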
Epi Info 6 could be used to make the calculation if
the paired design was ignored. From the menu,
choose Programs → EPITABLE. From the menus in EPITABLE, choose Sample → Sample size → Two
proportions. The ratio of group 1 to group 2 should be
1, the percentage in group 1 would be 99%, the
percentage in group 2 would be 95%, alpha should be
1%, and the power should be set at 90%. Calculations
suggest that 588 cattle are necessary for each group,
and this would be the total number of necessary cattle
because of the paired design. This sample size is not
much different (8% greater) than the calculation
based on the paired design, and since it is larger, it
would not be necessarily incorrect to use the usual
unmatched method for sample size determination.
Equivalency testing. A study that aims to determine whether a certain test has accuracy equivalent (or noninferior)44 to that of another, typically well-established, test is based on separately comparing sensitivity and specificity between tests. The first step is to consider the sensitivity and specificity of the well-established test and then quantify the level of
difference in the accuracy that would be allowable
while still considering the 2 tests equivalent or the new
test not inferior. It is not possible to calculate a
sample size to determine zero difference for the
similar reason that it is not possible to calculate a
sample size to be 100% sure that a given population
has no disease (zero prevalence). An example would
be to determine equivalency of a new test to a well-established test that has been reported to be 90%
sensitive and 95% specific. Further assumptions could
be that as long as the new test is at least 85% sensitive
and 90% specific, then it would be considered
equivalent. The allowable alpha and beta values
could be assumed to be 5% (2-sided) and 20%,
respectively. However, power values greater than 80%
and larger alpha values are sometimes assumed for
equivalency studies.54 Epi Info could be used to
calculate the necessary sample size as described
previously for 2 independent proportions. If equal
group sizes are assumed (for each test), then the
necessary sample size is 726 infected animals within
each group tested by the 2 tests for sensitivity
comparison and 474 uninfected animals within each
group for specificity comparison. If a paired design
were planned, then these numbers would be a
reasonably good estimate for the total number of
animals necessary for the evaluation. Often for
noninferiority testing a 1-sided statistical test will be
employed, and therefore the sample size calculation
11
should be adjusted accordingly. Equivalency testing in
general requires large sample sizes, and the discussed
example is a simplified situation. Literature related to
these studies documents several methods of calculation
and varies based on the determination of regions
associated with rejection of the null hypothesis of no
difference between tests. The simplified example has
been presented to give a general idea of how studies
should be designed, and interested readers should
review the paper by Lu et al.44
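
Reusing the n_two_proportions() sketch from the comparison of independent proportions (with its default alpha of 0.05 and power of 0.80), the 2 equivalency figures can be reproduced:

```python
# Reusing n_two_proportions() from the earlier sketch:
print(n_two_proportions(0.90, 0.85))  # 726 infected animals per group (sensitivity)
print(n_two_proportions(0.95, 0.90))  # 474 uninfected animals per group (specificity)
```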
Calculation of power when sample size is fixed
When the sample size is fixed by design, then it is
good planning to determine the power of a statistical
test to identify a biologically important difference.
Estimating the power to compare 2 population
proportions is important when it is desired to
compare the accuracy of diagnostic tests. The usual
formula for calculating the power of this comparison is an algebraic manipulation of the previously presented sample size formula; assuming equal group sizes, it is

Z_{\beta} = \frac{Z_{1-\alpha/2}\sqrt{2\bar{P}(1 - \bar{P})} - |P_1 - P_2|\sqrt{n}}{\sqrt{P_1(1 - P_1) + P_2(1 - P_2)}}.

A modification of the above formula,24 including continuity correction, is

Z_{\beta} = \frac{Z_{1-\alpha/2}\sqrt{2\bar{P}(1 - \bar{P})} - |P_1 - P_2|\sqrt{n - 2/|P_2 - P_1|}}{\sqrt{P_1(1 - P_1) + P_2(1 - P_2)}},

where n is the sample size per group, P1 and P2 are the expected proportions in each group, and P̄ is the simple average of the expected proportions. Variables Z(1−α/2) and Zβ are standard normal Z values. Power is determined as 1 − the cumulative probability associated with Zβ as calculated from the formula (Table 7). Typical presentations of these formulas24 incorporate Zα/2 and addition of the numerator components.
An example would be to compare diagnostic
sensitivity between 2 tests when both tests were
independently performed on 100 infected animals.
Assume that the tests are believed to have sensitivities
of 85% and 90%, and a test with an alpha of 5% is
desired. Epi Info 6 can be used to calculate the power
of the test to compare these 2 proportions. From the
menu, choose Programs → EPITABLE. From the menus in EPITABLE, choose Sample → Power calculation → Cohort study. The number of exposed
should be set to 100, and the ratio of nonexposed to exposed as 1 (exposed and nonexposed is simply a
way to distinguish the 2 groups). The relative risk
worth detecting should be set to 1.06 (90%/85%; larger proportion over the smaller), the attack rate in the unexposed should be set as 85% as the lower of the 2 proportions, and alpha should be 5%. The power calculation includes continuity correction and is reported as 13.3% by Epi Info. Using the presented formulas, the powers are calculated as 18.5% and 12.8% for the uncorrected and continuity-corrected formulas, respectively.

Table 7. Common standard normal Z scores for use in sample size formulas and power estimation.*

Standard normal Z score   Cumulative probability   1 − cumulative probability
−2.576                    0.005                    0.995
−2.326                    0.010                    0.990
−1.960                    0.025                    0.975
−1.645                    0.050                    0.950
−1.282                    0.100                    0.900
−1.036                    0.150                    0.850
−0.842                    0.200                    0.800
−0.524                    0.300                    0.700
−0.253                    0.400                    0.600
0.000                     0.500                    0.500
0.253                     0.600                    0.400
0.524                     0.700                    0.300
0.842                     0.800                    0.200
1.282                     0.900                    0.100
1.645                     0.950                    0.050
1.960                     0.975                    0.025
2.326                     0.990                    0.010
2.576                     0.995                    0.005

* Power is found as 1 minus the cumulative probability associated with the Z score calculated from the power formula.
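
A minimal sketch of the power calculation (Python; the function name is illustrative) reproduces the uncorrected and continuity-corrected powers of the example above:

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05, corrected=True):
    """Power to detect a difference between 2 independent proportions with
    n animals per group (normal approximation, optional continuity correction)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    diff = abs(p1 - p2)
    n_eff = (n - 2 / diff) if corrected else n
    z_b = (z_a * math.sqrt(2 * p_bar * (1 - p_bar)) - diff * math.sqrt(n_eff)) \
          / math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return 1 - NormalDist().cdf(z_b)

# sensitivities 0.85 vs. 0.90 with 100 animals per group, alpha 5%
print(round(power_two_proportions(0.85, 0.90, 100, corrected=False), 3))  # ~0.186
print(round(power_two_proportions(0.85, 0.90, 100, corrected=True), 3))   # ~0.128
```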
The calculation of power is dependent on the
specification of an alternative hypothesis. The sampling distribution of the proportion under the null
hypothesis is determined, and the critical value (Pr(Z ≤ z) = 1 − α/2) is located on this distribution. The
alternative hypothesis is set as the expected difference
in the 2 population proportions, and the sampling
distribution of this difference is plotted with the
critical value under the null hypothesis. The area
under the sampling distribution of the alternative
hypothesis to the right of the critical value is the
power of the statistical test (Fig. 5). The shapes of
these curves depend on the hypothesized proportions
and the sample size. There is only a single power value
related to each possible alpha and alternative
hypothesis (expected difference in proportions).
Conclusions
The calculation of the sample size is very important
during the design stage of all epidemiologic studies
and should match the proposed statistical analysis to
the extent possible. It is important to recognize that
there is no single correct sample size, and all
calculations are only as good as the employed
assumptions. The sample size ensures statistical
significance if the subsequent data collection is
perfectly consistent with the assumptions made for
the sample size calculation (assuming power was set
as 50% or greater). If the null hypothesis is false and
the assumed alternate hypothesis is true, then the
probability of observing statistical significance will be
equal to the assumed power of the test. The choice of
assumptions for calculations is very important
because their validity determines the likelihood of
observing statistical significance. The traditional
choices of 5% alpha and 20% beta can simply be
used unless the investigator has specific reasons for
other values. The choices of the best guesses or
hypothesized values for the proportions that will be
estimated by the study are more difficult. Values for
these assumptions should be based on available
literature or expert opinion. When there is doubt
concerning their value, then proportions could be
assumed to be close to 0.5. A proportion of 0.5 has
the maximum variance, and therefore would result in
the largest sample size.

Figure 5. The sampling distribution under the null (black line) and alternative (gray line) hypotheses for the situation when P1 = 0.2 and P2 = 0.4 with equal group sizes. HO is the null hypothesis that the true proportion is 0.3 (simple average of P1 and P2), and HA is the alternative hypothesis that P1 = 0.2 and P2 = 0.4 and is centered at P2. Alternatively, HA could have been centered at P1. The gray shaded area corresponds to the power for the statistical test with alpha of 5% when the sample size per group is 20 (A), 40 (B), 80 (C), and 160 (D). The power is 50% in panel B because the observed P value of the comparison is equal to the specified alpha (5%).
Sample size calculations correspond to the number
of animals that are required to complete the study
and be available for statistical analysis. They are the
minimum sample sizes required to achieve the desired
Sample size for diagnostic investigations
statistical properties. Calculated sample sizes should be increased by the number of animals that are
anticipated to be lost during the study. The study
design influences the number of animals expected to
be lost during implementation. Cross-sectional studies should have minimal losses, but there is always the
possibility of mislabeled samples, lost records, and
laboratory errors. Sample sizes for cross-sectional
studies should be increased 1–5% to account for these
potential losses. Prospective studies that cover long
time periods could have substantial losses, but these
types of study designs are unusual for diagnostic
investigations.
Some published recommendations include the post-hoc calculation of power when study results fail to
achieve statistical significance.34 However, there is no
statistical basis for this calculation.31 The power of a
2-sided test with an alpha set to be equal to the
observed P value is typically 50%,34 as presented in
Figure 5. Therefore, post-hoc power calculations will
typically be less than 50% for observed nonsignificant
results. This fact, in conjunction with the one-to-one
relationship between P value and power, suggests that
little information can be garnered from their calculation. Post-hoc calculations of power could be useful
if performed for magnitudes of differences other than
what was observed by the study. In general, however,
the post-hoc calculation of power is akin to determining the probability that an event will be observed
after the event has already occurred (or not).
A primary purpose of sample size calculations is to
ensure that the proposed study will be of an
appropriate size to find an important difference
statistically significant. Therefore, calculations should
be performed prior to the determination of the study
size. In practice, however, sample size calculations are sometimes performed after the number of animals for study has
been set, for reasons that might include cost or
availability. Often, the assumptions are simply
modified based on trial and error until calculations
lead to the predetermined sample size, and these
calculations are presented in grant applications or
other proposed research plans. Also, studies are
sometimes performed without performing any sample
size calculations. Many journals require discussion of
sample size calculations, and therefore such calculations are sometimes performed after the fact, with
assumptions modified until the appropriate size is
found. These are obviously not appropriate uses of
sample size calculations. A better approach often
would be the calculation of power based on the
sample size expected to be used for the study. Though such post-hoc determinations can be inappropriate or
misleading, many epidemiologists and statisticians
likely have been asked to perform these calculations.
Unfortunately, the realities of research do not always
coexist peacefully within the service of science itself. It
is hoped that the material presented in the present
article will demystify sample size calculations and
encourage their use during the initial design phase of
surveillance and diagnostic evaluations.
Acknowledgements
This manuscript was prepared in part through financial
support by the U.S. Department of Agriculture, Cooperative State Research, Education, and Extension Service,
National Research Initiative Award 2005-35204-16087. The
author would like to thank the anonymous reviewers for
helpful suggestions, which resulted in a better overall paper.
Sources and manufacturers
a. Survey Toolbox version 1.04. by Angus Cameron et al.
Available at http://www.ausvet.com.au/content.php?page=res_software.
b. Epi Info™ version 6.04d for Windows, Centers for Disease
Control and Prevention, Atlanta, GA. Available at http://
www.cdc.gov/epiinfo/Epi6/ei6.htm.
c. WINPEPI (PEPI-for-Windows) version 6.8 by J. H. Abramson.
Available at http://www.brixtonhealth.com/pepi4windows.
html.
References
1. Aitken CG: 1999, Sampling—how big a sample? J Forensic
Sci 44:750–760.
2. Alonzo TA, Pepe MS, Moskowitz CS: 2002, Sample size
calculations for comparative studies of medical tests for
detecting presence of disease. Stat Med 21:835–852.
3. Altman DG: 1980, Statistics and ethics in medical research:
III. How large a sample? Br Med J 281:1336–1338.
4. Berry DA, Lindgren BW: 1996, Statistics: theory and methods,
2nd ed. Duxbury Press, Belmont, CA.
5. Berry G, Armitage P: 1995, Mid-P confidence intervals: a brief
review. Statistician 44:417–423.
6. Bochmann F, Johnson Z, Azuara-Blanco A: 2007, Sample size
in studies on diagnostic accuracy in ophthalmology: a
literature survey. Br J Ophthalmol 91:898–900.
7. Branscum AJ, Johnson WO, Gardner IA: 2006, Sample size
calculations for disease freedom and prevalence estimation
surveys. Stat Med 25:2658–2674.
8. Breslow NE, Day NE, eds.: 1980, Statistical methods in cancer
research. In: The analysis of case-control studies, vol.1. IARC
Scientific Publications no. 32, pp. 131–133. International
Agency for Research on Cancer, Lyon, France.
9. Cameron AR, Baldock FC: 1998, A new probability formula
for surveys to substantiate freedom from disease. Prev Vet
Med 34:1–17.
10. Campbell MK, Mollison J, Grimshaw JM: 2001, Cluster trials
in implementation research: estimation of intracluster correlation coefficients and sample size. Stat Med 20:391–399.
11. Carlin JB, Doyle LW: 2002, Sample size. J Paediatr Child
Health 38:300–304.
12. Carpenter TE: 2001, Use of sample size for estimating efficacy
of a vaccine against an infectious disease. Am J Vet Res
62:1582–1584.
13. Casagrande JT, Pike MC: 1978, An improved approximate
formula for calculating sample sizes for comparing two
binomial distributions. Biometrics 34:483–486.
14. Clopper CJ, Pearson ES: 1934, The use of confidence or
fiducial limits illustrated in the case of the binomial.
Biometrika 26:404–413.
15. Connett JE, Smith JA, McHugh RB: 1987, Sample size and
power for pair-matched case-control studies. Stat Med
6:53–59.
16. Connor RJ: 1987, Sample size for testing differences in
proportions for the paired-sample design. Biometrics
43:207–211.
17. Daly LE: 1991, Confidence intervals and sample sizes: don’t
throw out all your old sample size tables. BMJ 302:333–336.
18. Delucchi KL: 2004, Sample size estimation in research with
dependent measures and dichotomous outcomes. Am J Public
Health 94:372–377.
19. Dohoo IR, Martin W, Stryhn H: 2003, Veterinary epidemiologic research. AVC Inc., Charlottetown, PE, Canada.
20. Donner A: 1984, Approaches to sample size estimation in the
design of clinical trials—a review. Stat Med 3:199–214.
21. Fisher RA: 1929, Tests of significance in harmonic analysis.
Proc R Soc Lond A Math Phys Sci 125:54–59.
22. Flahault A, Cadilhac M, Thomas G: 2005, Sample size
calculation should be performed for design accuracy in
diagnostic test studies. J Clin Epidemiol 58:859–862.
23. Fleiss JL, Levin B: 1988, Sample size determination in studies
with matched pairs. J Clin Epidemiol 41:727–730.
24. Fleiss JL, Levin BA, Paik MC: 2003, Statistical methods for
rates and proportions, 3rd ed. Wiley, Hoboken, NJ.
25. Fleiss JL, Tytun A, Ury HK: 1980, A simple approximation
for calculating sample sizes for comparing independent
proportions. Biometrics 36:343–346.
26. Fosgate GT: 2005, Modified exact sample size for a binomial
proportion with special emphasis on diagnostic test parameter
estimation. Stat Med 24:2857–2866.
27. Fosgate GT: 2007, A cluster-adjusted sample size algorithm
for proportions was developed using a beta-binomial model. J
Clin Epidemiol 60:250–255.
28. Georgiadis MP, Johnson WO, Gardner IA: 2005, Sample size
determination for estimation of the accuracy of two conditionally independent tests in the absence of a gold standard.
Prev Vet Med 71:1–10.
29. Goodman SN: 1993, p values, hypothesis tests, and likelihood:
implications for epidemiology of a neglected historical debate.
Am J Epidemiol 137:485–496.
30. Greenland S: 1985, Power, sample size and smallest detectable
effect determination for multivariate studies. Stat Med
4:117–127.
31. Greenland S: 1988, On sample-size and power calculations for
studies using confidence intervals. Am J Epidemiol
128:231–237.
32. Greiner M, Gardner IA: 2000, Epidemiologic issues in the
validation of veterinary diagnostic tests. Prev Vet Med
45:3–22.
33. Grimes DA, Schulz KF: 1996, Determining sample size and
power in clinical trials: the forgotten essential. Semin Reprod
Endocrinol 14:125–131.
34. Hoenig JM, Heisey DM: 2001, The abuse of power: the
pervasive fallacy of power calculations for data analysis. Am
Stat 55:19–24.
35. Houle TT, Penzien DB, Houle CK: 2005, Statistical power
and sample size estimation for headache research: an overview
and power calculation tools. Headache 45:414–418.
36. Jones SR, Carley S, Harrison M: 2003, An introduction to
power and sample size estimation. Emerg Med J 20:453–458.
37. Kelsey JL: 1996, Methods in observational epidemiology, 2nd
ed. Oxford University Press, New York, NY.
38. Killip S, Mahfoud Z, Pearce K: 2004, What is an intracluster
correlation coefficient? Crucial concepts for primary care
researchers. Ann Fam Med 2:204–208.
39. Lachenbruch PA: 1992, On the sample size for studies based
upon McNemar’s test. Stat Med 11:1521–1525.
40. Lachin JM: 1992, Power and sample size evaluation for the
McNemar test with application to matched case-control
studies. Stat Med 11:1239–1251.
41. Lancaster HO: 1961, Significance tests in discrete distributions. J Am Stat Assoc 56:223–234.
42. Li J, Fine J: 2004, On sample size for sensitivity and specificity
in prospective diagnostic accuracy studies. Stat Med
23:2537–2550.
43. Lu Y, Bean JA: 1995, On the sample size for one-sided
equivalence of sensitivities based upon McNemar’s test. Stat
Med 14:1831–1839.
44. Lu Y, Jin H, Genant HK: 2003, On the non-inferiority of a
diagnostic test based on paired observations. Stat Med
22:3029–3044.
45. McDermott JJ, Schukken YH: 1994, A review of methods
used to adjust for cluster effects in explanatory epidemiological studies of animal populations. Prev Vet Med 18:155–173.
46. Neyman J, Pearson ES: 1928, On the use and interpretation of
certain test criteria for purposes of statistical inference: part I.
Biometrika, 20A:175–240.
47. Neyman J, Pearson ES: 1933, On the problem of the most
efficient tests of statistical hypotheses. Proc R Soc Lond A
Math Phys 231:289–337.
48. Obuchowski NA: 1998, Sample size calculations in studies of
test accuracy. Stat Methods Med Res 7:371–392.
49. Parker RA, Bregman DJ: 1986, Sample size for individually
matched case-control studies. Biometrics 42:919–926.
50. Rigby AS, Vail A: 1998, Statistical methods in epidemiology.
II: A commonsense approach to sample size estimation.
Disabil Rehabil 20:405–410.
51. Rothman KJ, Greenland S: 1998, Modern epidemiology, 2nd
ed. Lippincott-Raven, Philadelphia, PA.
52. Sahai H, Khurshid A: 1995, A note on confidence intervals for
the hypergeometric parameter in analyzing biomedical data.
Comput Biol Med 25:35–38.
53. Schlesselman JJ: 1974, Sample size requirements in cohort and
case-control studies of disease. Am J Epidemiol 99:381–384.
54. Sheps S: 1993, Sample size and power. J Invest Surg
6:469–475.
55. Sterne JA, Davey Smith G: 2001, Sifting the evidence—what’s
wrong with significance tests? BMJ 322:226–231.
56. Suess EA, Gardner IA, Johnson WO: 2002, Hierarchical
Bayesian model for prevalence inferences and determination
of a country’s status for an animal pathogen. Prev Vet Med
55:155–171.
57. Suissa S, Shuster JJ: 1991, The 2 × 2 matched-pairs trial: exact
unconditional design and analysis. Biometrics 47:361–372.
58. Thrusfield MV: 2005, Veterinary epidemiology, 3rd ed.
Blackwell Science, Ames, IA.
59. Ukoumunne OC: 2002, A comparison of confidence interval
methods for the intraclass correlation coefficient in cluster
randomized trials. Stat Med 21:3757–3774.
60. Ury HK, Fleiss JL: 1980, On approximate sample sizes for
comparing two independent proportions with the use of
Yates’ correction. Biometrics 36:347–351.
61. Whitley E, Ball J: 2002, Statistics review 4: sample size
calculations. Crit Care 6:335–341.
62. Wickramaratne PJ: 1995, Sample size determination in
epidemiologic studies. Stat Methods Med Res 4:311–337.