Microarray experimental design: power and sample size considerations

Microarray experimental design: power and sample
size considerations
M. C. K. Yang, J. J. Yang, R. A. McIndoe and J. X. She
Physiol. Genomics 16:24-28, 2003. First published 7 October 2003;
doi:10.1152/physiolgenomics.00037.2003
You might find this additional info useful...
This article cites 14 articles, 8 of which can be accessed free at:
http://physiolgenomics.physiology.org/content/16/1/24.full.html#ref-list-1
The Autoimmune Regulator Directly Controls the Expression of Genes Critical for Thymic
Epithelial Function
Qing-Guo Ruan, Kenneth Tung, Daniel Eisenman, Yulius Setiady, Sarah Eckenrode, Bing Yi,
Sharad Purohit, Wei-Peng Zheng, Yan Zhang, Leena Peltonen and Jin-Xiong She
J Immunol, June 1, 2007; 178 (11): 7173-7180.
[Abstract] [Full Text] [PDF]
Epistemological issues in omics and high-dimensional biology: give the people what they
want
Tapan S. Mehta, Stanislav O. Zakharkin, Gary L. Gadbury and David B. Allison
Physiol. Genomics, December 13, 2006; 28 (1): 24-32.
[Abstract] [Full Text] [PDF]
A guide to issues in microarray analysis: application to endometrial biology
Christine A White and Lois A Salamonsen
Reproduction, July , 2005; 130 (1): 1-13.
[Abstract] [Full Text] [PDF]
False discovery rate, sensitivity and sample size for microarray studies
Yudi Pawitan, Stefan Michiels, Serge Koscielny, Arief Gusnanto and Alexander Ploner
Bioinformatics, July 1, 2005; 21 (13): 3017-3024.
[Abstract] [Full Text] [PDF]
Updated information and services including high resolution figures, can be found at:
http://physiolgenomics.physiology.org/content/16/1/24.full.html
Additional material and information about Physiological Genomics can be found at:
http://www.the-aps.org/publications/pg
This infomation is current as of December 18, 2011.
Physiological Genomics publishes results of a wide variety of studies from human and from informative model systems with
techniques linking genes and pathways to physiology, from prokaryotes to eukaryotes. It is published quarterly in January, April,
July, and October by the American Physiological Society, 9650 Rockville Pike, Bethesda MD 20814-3991. Copyright © 2003 by
the American Physiological Society. ISSN: 1094-8341, ESSN: 1531-2267. Visit our website at http://www.the-aps.org/.
Downloaded from physiolgenomics.physiology.org on December 18, 2011
This article has been cited by 5 other HighWire hosted articles
Comparative analysis of innate immune responses following infection of newborn calves
with bovine rotavirus and bovine coronavirus
Palok Aich, Heather L. Wilson, Radhey S. Kaushik, Andy A. Potter, Lorne A. Babiuk and Philip
Griebel
J Gen Virol, October , 2007; 88 (10): 2749-2761.
[Abstract] [Full Text] [PDF]
Physiol Genomics 16: 24–28, 2003.
First published October 7, 2003; 10.1152/physiolgenomics.00037.2003.
Microarray experimental design: power and sample size considerations
M. C. K. Yang,1 J. J. Yang,1 R. A. McIndoe,2 and J. X. She2
1
Department of Statistics, University of Florida, Gainesville, Florida 32611; and 2Center for Biotechnology
and Genomic Medicine, Medical College of Georgia, Augusta, Georgia 30912
Submitted 20 March 2003; accepted in final form 30 September 2003
gene expression; statistical analysis; functional genomics
HIGH-THROUGHPUT MICROARRAY technologies have created unparalleled opportunities to study the mechanism of disease, monitor disease progression, and evaluate effective therapies. Because tens of thousands of genes are investigated simultaneously with a technology that has not been perfected, there is
a large uncertainty in the decision process. Therefore, one must
rely on statistical analyses of the data to draw conclusions from
microarray experiments. A number of methods have been
developed to extract useful and robust information from microarray data sets once the hybridizations are complete. However, not as much attention has been paid to the experimental
design before the investigator actually completes the experiment.
One of the principal questions while designing a microarray
experiment is: What is the sample size required for a microarray experiment to achieve a user defined statistical threshold?
To achieve this goal, one must decide how the data will be
analyzed and what conclusions can be made. Since microarrays
can be used to answer a variety of questions, different experiments will require different designs and analyses. In this
study, we will concentrate on finding expression differences
between two treatment groups. The goals of the experiments
can be classified into three categories: 1) comparing two
groups at a single time point, 2) examining the treatment effect
over time when the without treatment effect is known, and 3)
Article published online before print. See web site for date of publication
(http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: J.-X. She, Center for
Biotechnology and Genomic Medicine, Medical College of Georgia, 1120 15th
St., PV6B108, Augusta, GA 30912 (E-mail: jshe@mail.mcg.edu); or M. C. K.
Yang, Dept. of Statistics, Univ. of Florida, Gainesville, FL 32611 (E-mail:
yang@stat.ufl.edu).
24
comparing two groups at sequential time points. We will
investigate the sample size requirements necessary to satisfy a
specified false discovery rate (FDR) and power (1 ⫺ ␤) given
various biological and experimental variation.
Although there are other reports on how to compute the
sample size, none of them is as simple and straightforward as
the methods presented here. In our experience, we often observe a large number of genes that show expression differences
in a typical microarray analysis. The key is that we want to
make sure that only a small number in these selected genes are
false positives (discoveries). A brief survey of some of the
other methods is in the DISCUSSION.
METHODS
cDNA microarrays are printed on glass microscope slides using
established protocols (15). The analysis of microarray data can be
separated into several stages. The first stage is to eliminate spots with
very weak signal, which we define as the signal that is indistinguishable from the background noise. Yang et al. (17) demonstrated that
any signal less than 3 standard deviations of the background noise
should be discarded. Due to the variation in printing material on the
cDNA arrays, the expression is more accurately measured by the ratio
of sample intensity vs. intensity from a common reference (5, 12). To
derive the ratio, the intensities from both channels have to be normalized. Although there are a number of methods to normalize the two
channels, Yang et al. (17) determined that there was actually a very
small difference among the different normalization methods (10)
when spots with sufficient signal are chosen. In this study, we assume
that the analysis will be done after flagging for weak spots, logtransformation, and normalization. Given these assumptions, the useful data is now yijtk, where i is the treatment index, j is the subject
index, t is the time index, and k is the replicate index. A few
assumptions have to be made in order to obtain the analytic results for
sample size determination. The most general model in this study is
y ijtk ⫽ ␮ it ⫹ ␦ ijt ⫹ ⑀ ijtk
(1)
where ␮it is the mean expression level under the ith treatment at time
t, ␦ijt represents the subject variation, and ⑀ijtk is the experimental
variation. Throughout this paper, ␮it is a fixed constant (or fixed
effect), and ␦ijt is a random sample from subjects under treatment i, or
it is a random effect. Some of the indices can also be eliminated when
they are not relevant. For example, when the data is not measured over
time, the index t is eliminated, and when there is only one treatment,
the index i is omitted.
Adjustment of the threshold significance level. As in the usual
hypothesis testing, a significance level ␣ is assigned. When there is
only one hypothesis, this guarantees that the chance of incorrectly
rejecting the null hypothesis is ␣. When many hypotheses are tested
simultaneously, ␣ cannot be considered as the significance level for
rejecting all the true null hypotheses. The level for each individual test
has to be adjusted. One option for adjusting the test is to use the
Bonferroni inequality, i.e., to set every test at the new significant level
␣/ng, where ng is the number of hypotheses tested simultaneously. We
use the notation ng because this number is the number of genes in the
array. However, when ng is very large, the Bonferroni adjustment is
1094-8341/03 $5.00 Copyright © 2003 the American Physiological Society
Downloaded from physiolgenomics.physiology.org on December 18, 2011
Yang, M. C. K., J. J. Yang, R. A. McIndoe, and J. X. She.
Microarray experimental design: power and sample size considerations. Physiol Genomics 16: 24–28, 2003. First published October 7,
2003; 10.1152/physiolgenomics.00037.2003.—Gene expression analysis using high-throughput microarray technology has become a
powerful approach to study systems biology. The exponential growth
in microarray experiments has spawned a number of investigations
into the reliability and reproducibility of this type of data. However,
the sample size requirements necessary to obtain statistically significant results has not had as much attention. We report here statistical
methods for the determination of the sufficient number of subjects
necessary to minimize the false discovery rate while maintaining high
power to detect differentially expressed genes. Two experimental
designs were considered: 1) a comparison between two groups at a
single time point, and 2) a comparison of two experimental groups
with sequential time points. Computer programs are available for the
methods discussed in this paper and are adaptable to more complicated situations.
25
SAMPLE SIZE FOR MICROARRAY EXPERIMENTS
too conservative because it is meant to control family-wise error rate
(1), i.e., to control the probability of one type I error among thousands
of hypotheses. In microarray experiments, it is important to find the
best compromise between maximizing the power of detection and
minimizing false-positive identification. Very often, although not
always, additional follow-up research such as database search and
real-time PCR will be done to further eliminate false-positive results
(6). Thus the FDR defined as the expected proportion of wrong
discovery (1) fits this purpose better. At the sample size determination
stage, we need to know approximately the minimum number of genes
that will be accepted under a given selection rule. An experimenter
may know, from past experience, that at least n0 genes will be
accepted, then the adjusted significance level for each gene is
␣⬘ ⫽ ␣ max 共1, n0兲/ng
(2)
y ijk ⫽ ␮ i ⫹ ␦ ij ⫹ ⑀ ijk
(3)
where i ⫽ 1, 2; j ⫽ 1, . . ., ni; k ⫽ 1, . . ., m; ␦ij is the between subjects
variation, and ⑀ijk is experimental variation. When m ⬎ 1, replicates
are used to reduce the intermicroarray variation under the same
biological condition. We assume that both of ␦ij and ⑀ijk are normally
distributed with zero mean and standard deviation (SD) ␴s and ␴e,
respectively.
The hypotheses are
H 0: ␮1 ⫽ ␮2 versus H1: ␮1 ⫽ ␮2
(4)
for each gene. After the data has been collected, the mean and
variance for each gene is defined by
y i.. ⫽ ⌺ j ⌺ k y ijk /共n i m兲, y ij. ⫽ ⌺ k y ijk /m
(5)
2
⫽ ⌺ j ⌺ k 共y ij. ⫺ y i.. 兲 2 /共n 1 ⫹ n 2 ⫺ 2兲
s p..
(6)
and the decision rule is to accept H1 at significance level ␣⬘ when
t ⫽ 兩y 1.. ⫺ y 2.. 兩/关s p 共1/n 1 ⫹ 1/n 2 兲 1/2 兴 ⱖ t ␯,␣⬘
(7)
where ␣⬘ is defined by Eq. 2, ␯ ⫽ n1 ⫹ n2 ⫺ 2, and t␯,␣⬘ is the upper
␣⬘ quantile of the t distribution with ␯ degrees of freedom. Other types
of tests such as unequal variance t-test or a Mann-Whitney test can
also be used, but the results should be very similar (18). To determine
the sample size, we need to specify the power (1 ⫺ ␤) at a predetermined level of significant gene expression difference 兩␮1 ⫺ ␮2兩. In
traditional statistics, what we need is the value
⌬ ⫽ 兩␮ 1 ⫺ ␮ 2 兩/␴ s
(8)
and the variance ratio
r⫽
␴ 2e
␴ s2
(9)
It is often difficult to determine what ⌬ constitutes a significant
expression difference. We believe that the overlap concept can make
Physiol Genomics • VOL
16 •
w ⫽ the proportion of subjects with
(10)
expression apparently belonging to the other group
The term w is the proportion of subjects with expression that appears
to belong to the other treatment group. Under the model in Eq. 3, the
overlapping proportion is symmetric, i.e., the proportions that violate
the treatment assignment are the same in both treatments. The relation
between the traditional ⌬ in Eq. 8 and w in Eq. 10 is
w ⫽ 1 ⫺ ⌽共⌬/2兲
(11)
where ⌽(x) is the standard normal distribution function. Although ⌬
and w are equivalent and can be easily converted from one to another,
it is our opinion that the idea of overlapping between groups has an
easier interpretation. A historical survey and recent developments in
using overlapping for sample size determination have been published
(3, 8, 14). Hwang et al. (9) has also used the same idea in microarray
analysis. To determine the sample size based on Eqs. 7 and 11, we use
the noncentral F distribution (see APPENDIX II).
Serial time points with one treatment group. This experimental
design will measure gene changes under one treatment over time,
while the gene expression under no treatment condition is known. Let
the time points be 1, 2, . . ., T. Since there is only one treatment, the
index i can be dropped, and Eq. 1 becomes
y jtk ⫽ ␮ t ⫹ ␦ jt ⫹ ⑀ jtk
(12)
where ␦jt represents the subject variation. Under the null hypothesis,
when there is no treatment, we should have
H 0: ␮1 ⫽ ␮2 ⫽ · · · ⫽ ␮T
i.e., gene expression is not affected by the treatment. The test for
this hypothesis would be a two-way analysis of variance with time
and subject as two discrete factors with the subjects being a
random effect. The alternative is that H0 is not true, i.e., gene
expression has been changed. To determine the sample size, we
need to specify the pattern of {␮t} under H1. Moreover, we need to
know how this pattern varies among subjects. We use a simple
linear expression by letting
␮ t ⫽ ␥t,
and ␦jt ⫽ ␦jt
(13)
where ␥ ⫽ 0 under H0 (␥ ⫽ 0 under H1), and the ␦j terms represent
the variation of the subjects response, which are assumed to be
independent and identically distributed normal random variables with
0 mean and variance ␴␥2(N(0,␴␥2)) and ⑀jtk ⬃ N(0,␴e2). The reason
for this alternative model is that with a linear time effect, the slope
usually changes with the subjects. However, the linear relationship is
not crucial, because when time is considered discrete, Eq. 13 can be
a good approximation to a quadratic change. For example, the pattern
in Fig. 1, left, can be expressed linearly when time is reordered. We
have tried to use specific alternatives such as a quadratic model
(which covers the linear model) to increase the power of the test.
However, the increase in power is very limited. One could also make
the model more general by adding correlations between time points.
But during the design stage, this correlation is difficult to ascertain.
The random effect, ␦j, in the slope should be enough to rep resent the
intra-subject correlation.
Again, we will use the overlap concept to specify the alternative. In
the model in Eq. 13, suppose the alternative is ␥ ⬎ 0. (The case ␥ ⬍
0 works almost identically.) We let w ⫽ the proportion of subjects (in
the population) that show the opposite (␥ ⫹ ␦j ⬍ 0) mRNA expression. The power can be computed by simulation using SAS PROC
MIXED, i.e., we simulate the data a large number of times under the
alternative and use the proportion of the times that H1 is accepted as
www.physiolgenomics.org
Downloaded from physiolgenomics.physiology.org on December 18, 2011
to guarantee an ␣ FDR. The derivation is given in the APPENDIX I. The
value n0 has to be determined by experience or a pilot experiment. In
our experience, there are always a considerable number of genes that
will show different expressions. If there is no way to determine n0,
then the experimenter may let n0 ⫽ 0 and determine the most
conservative (largest) sample size. Sometimes it may be necessary to
do a pilot study to determine n0 and other variance components in the
model. For our laboratory, it is common practice to do a pilot study to
get an idea of these parameters and then determine the sample size for
the required power.
At the data analysis stage, other methods such as permutation tests
(4, 16) may be used, but their power are not as easy to compute at the
design stage.
Single time point comparison between two treatment groups. For
the case of a comparing between treatments at one time point, one can
drop the t index so Eq. 1 becomes
the interpretation of the alternative easier. When the alternative
hypothesis is true, we have 兩␮1 ⫺ ␮2兩 ⬎ 0. Naturally, any expression
closer to ␮1 will be classified to treatment 1 and vice versa. Let
26
SAMPLE SIZE FOR MICROARRAY EXPERIMENTS
Fig. 1. Any reaction where time is considered as a discrete variable can be
approximated by a linear function. Note the time points on the right are
reordered from the times on the left.
H 0: ␮11 ⫽ ␮21; ␮12 ⫽ ␮22; . . . ; ␮1T ⫽ ␮2T
and the general alternative that H0 is not true. This model is based on
two factors, time and treatment, and subjects are nested in the
treatments. The experiment is to test the treatment effect or the
treatment-time interaction. We will not consider the treatment-time
interaction in the design stage. To specify an alternative for the sample
size determination, more parameters are involved. We follow the
linear approximation in Eq. 13 with
␮ 2t ⫽ ␮ 1t ⫹ ␥t
and
␦ijt ⫽ ␦ijt
(14)
where the ␦ij terms share the same assumptions as those in Eq. 13.
Again we can define the alternative by using the overlap idea with w
being equal to the proportion of subjects that have the opposite slope
(␥ ⫹ ␦ij ⬍ 0) in mRNA expression when it belongs to the treatment
group with ␥ ⬎ 0. This model can be expanded to include other
alternatives such as the change of intercept. All of these modifications can be easily embedded into the SAS program mentioned in
APPENDIX III.
The details of power computation are again given in APPENDIX II.
RESULTS
For a single time point. Figure 2 presents the sample size
requirements for a microarray experiment with 5,000 genes
(ng) assayed with an FDR of 0.05 (␣) and power ⫽ 0.85 when
n0 ⫽ 0, 5, 20, and 50. The sample sizes for the two treatments
are assumed to be equal (i.e., n1 ⫽ n2 ⫽ n) and the number of
replicates equals 1 (m). The variance ratio (r) defined in Eq. 9
varies from 0 to 1. For example, if ng ⫽ 5,000, r ⫽ 0.25, and
a 15% or less overlap between genes from the two treatments
is considered significant, then 22 subjects per treatment group
is required to reach a power of 0.85 (n0 ⫽ 0), but only 15
subjects if we expect at least 50 genes will show significance.
However, Fig. 2 is used only as a demonstration of the change
of sample size under various conditions. The software given in
APPENDIX III will produce the required sample size under any
significance level, power, and other parameter values.
For serial time points. Although Fig. 2 covers some useful
experimental cases, it is more difficult to provide examples of
general interest in serial time point measurements due to the
large number of parameters. The power depends not only on
the biological/experimental variations as well as the number of
time points (T), but also the time unit of T. For example, for a
Physiol Genomics • VOL
16 •
Fig. 2. The required sample sizes under various alternatives based on the
overlap ratio w defined by Eq. 11 (or equivalently effect size ⌬) and variance
ratio r by Eq. 9 under various assumptions on the number of possible selected
genes (n0). The curves from the top to bottom are for r ⫽ 1.0, 0.75, 0.50, 0.25,
and 0.00, respectively. The overall significance is set at alpha ⫽ 0.05. The
sample size is n ⫽ n1 ⫽ n2. An example is given in the text.
www.physiolgenomics.org
Downloaded from physiolgenomics.physiology.org on December 18, 2011
the power estimate. Note that the power is not determined by w alone. It
is also a function of T. Details are given in the RESULTS and APPENDIX II.
Serial time points with two treatment groups. This experiment will
measure the gene expression changes over time from two treatment
groups or from one treatment and one control. The model is now the
full model in Eq. 1 with the null hypothesis
fixed slope ␥, the relative size of ␥T to the experimental
standard deviation plays a crucial role. In other words, seven
points in 1 day will be different from seven points in a week
with the same growth rate ␥ in Eqs. 13 and 14. Thus the
variance ratio (r) alone is not sufficient to determine the sample
size.
Figure 3 shows one example on the sample size required for
detecting mRNA expression changes over time. We used the
data from Yang et al. (17). In that study, we found a ␴e ⬇ 0.40
for log2-transformed ratios between the sample and reference.
If one assumes ng ⫽ 5,000, ␴␥ ⫽ 0.5, and 5% or less overlap
between the two conditions (effect size ⫽ 3.29), and a twofold
change in expression at the end of the time series (␥T ⫽ 1),
then the sample size curves for the one-sample and two-sample
designs can be obtained. The Bonferroni correction (n0 ⫽ 0)
and expected 1% of genes to be selected (n0 ⫽ 50) for replicate
sizes m ⫽ 1, 2, and 4 are shown. The FDR is set at 0.05, and
the power is set at 0.80.
Note that in this design, the end time point is fixed. Thus an
increase in the number of time points can merely add points
between the predefined start and end points. Since the highest
difference happens at the end point, there is only a small
increase in power by observing more than four time points. It
is also quite intuitive that T can compensate for the number of
27
SAMPLE SIZE FOR MICROARRAY EXPERIMENTS
replicates (m), because both of them increase the information
for the treatment effect when the treatment is effective.
APPENDIX I
Adjustment of the Significant Level Based on the False
Discovery Rate
In single hypothesis testing, the significance level ␣ can be also
interpreted as the false discovery rate, FDR, i.e.,
␣ ⫽ Pr兵Claim difference 共discovery兲兩No difference其
DISCUSSION
This paper provides a reasonable way to estimate the sample
size for microarray experiments. The analysis methods are
chosen to be simple in order to make the unknown parameters
as few as possible. More sophisticated methods may be used
after the data has been collected, but based on our experience,
different analysis methods usually produce similar results for
the prominent gene effects.
The computer software is very easy to comprehend. Thus a
user can easily modify the programs when other tests are more
appropriate. For example, if treatment-time interactions are a
main concern, then this term can be easily added to the SAS
procedure. The alternative under other models, if known, can
also be easily generated using the simulation program with
simple modifications.
The sample size determination discussed in this paper depends on the expected number of genes n0 to be accepted after
the experiment. As we mentioned before, if there is no way to
determine this number, then the experimenter may choose to
use the most conservative number n0 ⫽ 0. Optimal sample size
Physiol Genomics • VOL
16 •
In multiple hypotheses testing, if all the null hypotheses are true, then
the Bonferroni correction is also the same as the FDR. However, if at
least n0 ⬎ 0 genes are expected to be accepted, then
FDR ⫽ E关共GenesWronglyAccepted兲/共AlltheGenesAccepted兲兴
Moreover
冋
FDR ⫽ E
册 冋
册
GenesWronglyAccepted
GenesWronglyAccepted
ⱕE
AlltheGenesAccepted
n0
⫽
␣⬘n⬘g ␣⬘ng
ⱕ
n0
n0
where ␣⬘ is the adjusted significance level used in each test and ng⬘ is
the number of genes with no expression differences. When n0 ⫽ 0, it
means that there may be no gene accepted. If this is true, then there is
no false discovery. However, n0 is only the expected number; there
may be genes accepted after the experiment. Since the smallest
number is 1, ␣⬘ is the Bonferroni correction. This gives the max(1, n0)
in Eq. 2.
www.physiolgenomics.org
Downloaded from physiolgenomics.physiology.org on December 18, 2011
Fig. 3. Sample size required by one-sample and two-sample (n1 ⫽ n2)
sequential experiments based on a specific data set described in the text. The
array size is fixed at ng ⫽ 5,000. The alternative is a twofold increase in gene
expression at the end time. T is the number of observations between time 0 and
T. From the top to the bottom, the curves are for m ⫽ 1, 2, and 4.
depends on experience and the validity of certain assumptions.
Of course, there is a certain possibility of failure that an
experimenter must face. Even to determine the sample size for
the simplest two-sample t-test, we must assume that the two
treatments have the same variance. There is no guarantee on
this before the experiment is done. Similarly, in most microarray experiments, a large n0 can be expected. If the result shows
the assumed n0 is too large, then we must admit our misjudgment and may choose to do more experiments to increase the
power. This strategy is usually more economical than choosing
a large sample size by always assuming the worst scenario.
Zien at al. (18) have presented a method similar to ours. In
addition to the t-test, they also mentioned the Mann-Whitney
nonparametric test and a model with additive noise. However,
they did not address serial time points nor the use of the FDR
in the analysis. Their results are based on a simulation which is
not as accurate as the SAS output based on a noncentral F
distribution. Black and Doerge (2) also used the t-test for the
power calculation, but they considered only the number of
replications with a Bonferroni correction. Subject level variation and FDR were not considered. Pan et al. (13) discussed
sample size based on the mixture model. However, under the
null hypothesis when all the means are zero, the mixture model
is difficult to identify. Thus it may not be worth the effort to go
through such a complex model building procedure at the design
stage. The paper by Hwang et al. (9) tried to determine the
sample size for finding the best discriminant function for the
data similar to those given in Ref. 7. But the goal of finding the
discriminant function is different from identifying genes with
different expressions. The use of two highly correlated genes
adds little discriminant power, but both of them are important
to be identified.
28
SAMPLE SIZE FOR MICROARRAY EXPERIMENTS
APPENDIX II
Data Analysis and Power Computation
Two sample one time point per subject. The data analysis is based
on Eq. 7. For a given w, ⌬ ⫽ 2⌽⫺1(1 ⫺ w), and the power is
p ⫽ 1 ⫺ F ␯1共␰, ␭2兲
where F␯1(␰ ,␭2) represents the F distribution of 1, ␯ ⫽ n1 ⫹ n2 ⫺ 2
degrees of freedom with noncentrality parameter ␭2 evaluated at the
threshold ␰ ⫽ F1␯ ,␣⬘, which is the upper ␣⬘ quantile of the F
distribution with 1, ␯ ⫽ n1 ⫹ n2 ⫺ 2 degrees of freedom, and
␭2 ⫽
⌬2
共1 ⫹ r/m兲共1/n 1 ⫹ 1/n 2 兲
y jt ⫽ 共␥ ⫹ ␦ j 兲t ⫹ ⑀ jt , j ⫽ 1, . . . , n;
t ⫽ 1, . . . , T
(15)
where k in Eq. 12 is embedded into yjt and ⑀jt so they are averages over
m values, and ␦j is a normal random deviate with 0 mean and variance
␴␥, ⑀jt are standard normal deviates, and
␥ ⫽ ⫺␴␥⌽⫺1共w兲
The generated data is then tested by the mixed model procedure for
the significance of the time effect. The proportion of the rejected
samples are the estimated power of this test. The mixed model takes
care of the random effect from subjects. Users may refer to a SAS
example in Ref. 11 (p. 90) to see how SAS program is written.
Two sample serial time points experiment. This analysis is based on
a mixed model with time and treatment as fixed effects and the subject
as a random effect.
Without loss of generality, we may let m ⫽ 1 and ␮1t ⫽ ␮2t ⫽ 0.
For a given T and w, data are generated by
y ijt ⫽ 共␥ i ⫹ ␦ ij 兲t ⫹ ⑀ ijt , i ⫽ 1, 2;
j ⫽ 1, . . . , n i ;
t ⫽ 1, . . . , T
(16)
where ␦ij and ⑀ijt are generated by the same procedure in Eq. 15 with
␥1 ⫽ 0 and
␥ 2 ⫽ ⫺2␴␥⌽⫺1共w兲
The power is again estimated the same way as the one-treatment case.
APPENDIX III
Software
The software is located at http://web.stat.ufl.edu/⬃yang/microarray for the microarray sample size determination.
Program one_time.sas. This subroutine contains the SAS program
for one time point study based on Eq. 3.
Program sequent_1.sas. This subroutine contains the SAS program
for one sample serial time points measurements based on Eqs. 12 and 13.
Physiol Genomics • VOL
16 •
REFERENCES
1. Benjamini Y and Hochberg Y. Controlling the false alarm discovery
rate: a practical and powerful approach to multiple testing. J Royal Stat
Soc B57: 298–300, 1995.
2. Black MA and Doerge RW. Calculation of the minimum number of
replicate spots required for detection of significant gene expression fold
change in microarray experiments. Bioinformatics 18: 1609–1616, 2002.
3. Bradley EL. Overlapping coefficient. In: Encyclopedia of Statistical
Sciences. New York, Wiley, 1985, vol. 6, p. 546–547.
4. Callow MJ, Dudoit S, Gong EL, Speed TP, and Rubin EM. Microarray
expression profiling identifies gene with altered expression in HDLdeficiency mice. Genome Res 10: 2022–2029, 2000.
5. Chen Y, Dougherty ER, and Bittner ML. Ratio based decisions and
quantitative analysis of cDNA microarray images. J Chem Optics 2:
264–374, 1997.
6. Eisen MB, Spellman PT, Brown PO, and Botstein D. Cluster analysis
and display of genome-wide expression patterns. Proc Natl Acad Sci USA
95: 14863–14868, 1998.
7. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov
JP, Coller H, Loh ML, Dowing JR, Caligiuri MA, Bloomfield CD, and
Lander ES. Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science 286: 531–537, 1999.
8. Huberty CJ and Lowman LL. Group overlap as basis for effect size.
Educ Psychol Meas 60: 543–563, 2000.
9. Hwang D, Schmitt WA, Stephanopoulos G, and Stephanopoulos G.
Determination of minimum sample size and discriminatory expression
patterns in microarray data. Bioinformatics 18: 1184–1193, 2002.
10. Kerr MK, Martin M, and Churchill GA. Analysis of variance for gene
expression microarray data. J Comput Biol 7:819–837, 2000.
11. Littell RC, Milliken GA, Stroupe WW, and Wolfinger RD. SAS System
for Mixed Models. Cary, NC: SAS Institute, 1996.
12. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, and Tsui
KW. On differential variability of expression ratios: improving statistical
inference about gene expression changes form microarray data. J Comput
Biol 8: 37–52, 2001.
13. Pan W, Lin J, and Le CT. How many replicates of arrays are required to
detect gene expression changes in microarray experiments? A mixture
model approach. Genome Biol 3: research0022, 2002.
14. Rom DR and Hwang H. Testing for individual and population equivalence based on proportion of similar response. Stat Med 15: 1489–1505,
1996.
15. Shalon D, Smith SJ, and Brown PO. A DNA microarray system for
analyzing complex DNA samples using two color fluorescent probe
hybridization. Genome Res 6: 639–645, 1996.
16. Tusher VG, Tibshirani R, and Chu G. significance analysis of microarray applied to the ionizing radiation response. Proc Natl Acad Sci USA
5116–5121, 2001.
17. Yang MCK, Ruan QG, Yang JJ, Eckenrode S, Wu S, McIndoe RA,
and She JX. A statistical method for flagging weak spots improves
normalization and ratio estimates in microarray. Physiol Genomics 7:
45–53, 2001. First published August 8, 2001; 10.1152/physiolgenomics.
00020.2001.
18. Zien A, Fluck J, Zimmer R, and Lengauer T. How many do you need?
Preprint at http://cartan.gmd.de/⬃zien/HowManyArray/, 2003.
www.physiolgenomics.org
Downloaded from physiolgenomics.physiology.org on December 18, 2011
with r defined by Eq. 9. Since r and m are confounded, we may let
m ⫽ 1 in the computer program by changing r.
One sample serial time points experiment. The analysis is based on
the mixed model with time as a fixed effect and the subject as a
random effect. The SAS procedure is given in the software.
For the power simulation, data are generated by
Program sequent_2.sas. This subroutine contains the SAS program
for one sample serial time points measurements based on Eqs. 1
and 14.
All the input instructions are given by the comments in the
program. Examples corresponding to the sample sizes in Fig. 2 and
Fig. 3 are in the input (ⴱ.sas) and the output (ⴱ.lst) programs.