

The Journals of Gerontology Series A: Biological Sciences and Medical Sciences 59:B306-B315 (2004)
© 2004 The Gerontological Society of America
Reproducibility, Sources of Variability, Pooling, and Sample Size: Important Considerations for the Design of High-Density Oligonucleotide Array Experiments
Eun-Soo Han1,2,,
Yimin Wu2,
Roger McCarter2,
James F. Nelson2,
Arlan Richardson2,3 and
Susan G. Hilsenbeck4
1 Department of Biological Science, The University of Tulsa, Oklahoma.
2 Department of Physiology, The University of Texas Health Science Center, San Antonio.
3 South Texas Veterans Health Care System at San Antonio.
4 Breast Center at Baylor College of Medicine, Houston, Texas.
Address correspondence to Eun-Soo Han, Department of Biological Science, University of Tulsa, 600 S. College Ave., Tulsa, OK 74104. E-mail: eun-han{at}utulsa.edu
 |
Abstract
|
|---|
We have undertaken a series of experiments to examine several issues that directly affect design of gene expression studies using Affymetrix GeneChip arrays: probe-level analysis, need for technical replication, relative contribution of various sources of variability, and utility of pooling RNA from different samples. Probe-level data were analyzed by Affymetrix MAS 5.0, and three model-based methods, PM-MM and PM-only models by dChip, and the RMA model by Bioconductor, with the latter two providing the best performance. We found that replicate chips of the same RNA have limited value in reducing total variability, and for relatively highly expressed genes in this biologically homogeneous animal model of aging, about 11% of total variation is due to day effects and the remainder is approximately equally split between sample and residual sources. We also found that pooling samples is neither advantageous nor detrimental. Finally we suggest a strategy for sample size calculations using formulas appropriate when coefficients of variation are known, target effects are expressed as fold changes, and data can be assumed to be approximately lognormally distributed.
WITH the sequencing of the genome(s) and the advent of high-density array technology, gene expression arrays such as cDNA or oligonucleotide microarrays have emerged as a powerful tool to measure global gene expression in cells/tissues. Gene expression arrays have been used to measure gene expression profiles in mouse (14) and human (513) as well as yeast (1417).
At present, several different types of gene expression arrays are commercially available. These include printed filter arrays, printed glass slide arrays, and in situ synthesized oligonucleotide Affymetrix GeneChip arrays. Among them, the Affymetrix GeneChip array system is considered as one of the most sophisticated systems. GeneChip arrays contain multiple (n = 1120) oligonucleotide (25 mer) pairs (perfect match and homomeric mismatch at 13th nucleotide) for each gene and require special dedicated equipment to carry out the hybridization, staining of label, washing, and quantitation processes. Likely advantages of the GeneChip include specificity and reproducibility due to multiple probes for a gene and automated control of the experimental process from hybridization to quantitation. However, high costs associated with GeneChip array and its interpretive instruments are major obstacles for the broad utilization by many investigators, and efficient use of limited resources is critical. We have therefore undertaken a series of experiments in the context of mouse models of aging to examine several issues that directly affect experimental design: reproducibility, sources of variability, and the relative advantages and disadvantages of pooling samples. Based on these findings, we suggest some simple approximations for use in sample size calculations in planning gene expression experiments based on the GeneChip platform.
Reproducibility is one of the most important factors in gene expression array experiments. Therefore, we examined experimental variability and GeneChip to GeneChip variability of Affymetrix system in measuring gene expression by repeating the array hybridization with probes synthesized separately from the same source of total RNA. We also investigated the relative contributions of sources of variability including interanimal (biological) variability and day-to-day variability.
As with other assays, it is widely recognized that replication (i.e., multiple samples of each type) is required in microarray experiments to reduce the effects of random variation (18). Several strategies have been used to reduce experimental variability and increase reliability. For example, it has been suggested that samples from several individuals can be pooled in order to reduce the effects of individual variation (19,20). Changes observed in pooled samples would be more representative of population changes. This is attractive for several reasons. It might reduce the number of GeneChips needed, saving money. In some situations, available RNA is extremely limited and it might be necessary to pool in order to conduct the experiment. However, it may be that data generated from pooled samples are not representative of data from individual samples and may not be substantially less variable. Because no empirical experiments have been done comparing results from individual and pools samples, we conducted experiments to characterize in detail the effect of pooling samples.
Finally, planning microarray experiments requires estimating sample sizes. We demonstrate a procedure using sample size formulas appropriate for lognormally distributed data when variability is expressed as a coefficient of variation (CV) and the target effect size is expressed as a fold change.
 |
METHODS
|
|---|
Animals
Male C57BL/6 mice were bred in the Animal Core of the Program Project. Parent mice were purchased from Jackson Laboratories (Bar Harbor, ME). All mice were fed ad libitum Harlan Teklad LM-485 mouse/rat sterilizable diet 7912 (Madison, WI). Animals were kept on a cycle of 12 hours of darkness and 12 hours of light (lights on at 6:00 AM). Sentinel mice were tested monthly for endoparasites, ectoparasites, mouse hepatitis virus, Sendai virus, mycoplasma, pneumonia virus of mice, Theiler's mouse encephalomyelitis virus, and minute virus of mice by a veterinary pathologist in Laboratory Animal Resources at the University of Texas Health Science Center at San Antonio. Every 6 months, the presence of murine virus antibodies: CAR bacillus, ectromelia virus, epizootic diarrhea of infant mice, lymphocytic choriomeningitis virus, minute virus of mice, mouse adenovirus (M.Ad-FL), mouse adenovirus (M.Ad-K87), mouse hepatitis virus, murine cytomegalovirus, mycoplasma pulmonis, parvovirus, pneumonia virus of mice, polyoma, reovirus, Sendai virus, and Theiler's mouse encephalomyelitis virus was monitored with serum samples from sentinel animals by BioReliance Co. (Rockville, MD). All tests were negative. All procedures involving use of the mice were approved by the Institutional Animal Care and Use Committee of the University of Texas Health Science Center and the Subcommittee for Animal Studies at the Audie L. Murphy Memorial Veterans Hospital.
Tissue Collection and RNA Preparation
Livers from nine 5-month-old mice and nine 13-month-old mice were collected between 10:00 AM and 12:00 PM, and total RNA was extracted from each liver as previously described (21).
Screening of mRNA by Affymetrix GeneChip Arrays
Murine Genome U74A Version 2 GeneChip containing oligonucleotide probes for
12,000 genes were purchased from Affymetrix (Santa Clara, CA). Each gene is represented by 16 perfect match and 16 mismatch oligonucleotide probes. Each 25mer oligonucleotide probe is uniquely complimentary to a gene. We followed the vendor's protocols. First, total RNA was subjected to a cleanup process using RNeasy Total RNA Isolation Kit (Quagen, Valencia, CA). Ten µg of the cleaned total liver RNA were converted to double-stranded cDNA using SuperScript Choice System (GIBCO/BRL, Rockville, MD) and T7-(dT)24 primer (GENSET Corp., La Jolla, CA). Biotin-labeled cRNA was synthesized from the cDNA with BioArray High Efficiency RNA Transcript Labeling Kit (Affymetrix, Santa Clara, CA). After cleaned up the in vitro transcription products with RNeasy Total RNA Isolation Kit (Quagen), the purified cRNA was fragmented to size ranging from 35 to 200 bases by incubating at 94°C for 35 minutes. Fifteen µg of the fragmented cRNA were hybridized to a GeneChip at 45°C with 45 rpm for 16 hours. After hybridization, the GeneChips were washed and stained with streptavidine-phycoerythrin, and then the signals were amplified with biotinylated antibody, goat Ig G and another staining with streptavidine-phycoerythrin using Fluidics Station (Affymetrix). The GeneChips were scanned with Hewlett Packard GeneArray Scanner (Affymetrix) with the target intensity of 250.
Statistical Analysis
Details of analyses are the subject of this article and are presented as results below. Affymetrix Microarray Suite Version 5.0 (MAS 5.0; Affymetrix) was used to stain, wash, scan, and quantitate each GeneChip, computing summary intensities (70th percentile of pixels in a feature) for each probe. MAS 5.0 was also used to scale and compute expression "signals" for each gene using the Affymetrix statistical algorithm (22). DNA Chip Analyzer (dCHIP) Version 1.3 (23) was used to repeat low-level analysis, using MAS 5.0 generated *.CEL files, normalizing and modeling to generate model-based estimates of expression for each probe set. Finally, low-level analysis was also conducted using normalization and modeling algorithms developed by Irizarry and colleagues (2426) and implemented in the R (27) package Affy, which is a part of the Bioconductor project (http://www.bioconductor.org). Briefly, 48 GeneChips (Table 1) were hybridized and processed using MAS 5.0 to generate 48 CEL files and to compute MAS 5.0 signal.
Next, the same 48 CEL files were loaded into dCHIP (Version 1.3). The "invariant set normalization" method (28) was used to normalize all GeneChips to a baseline chip (B2 in Table 1), selected because it exhibited the median feature intensity. The "invariant set" method selects a set of "invariant" perfect match (PM) probes that appear in nearly the same rank order position in the baseline and target GeneChips, and fits a smooth (running median) normalization curve. This curve is then used to generate new normalized values for every probe on the chip. Expression was estimated using both the perfect match minus mismatch (PM-MM) and perfect match only (PM-only) models (23). Both models are multiplicative on the raw intensity scale with additive error and are fit iteratively for each probe set, removing outliers. Outliers are single probes of unusual intensity on individual GeneChips, probe-pair outliers, in which mismatch probe intensities consistently exceed corresponding perfect match probe intensities, and array outliers for which fitted expression for the entire probe set has an unusually high standard error on one GeneChip, compared to other GeneChips. GeneChips with more than 5% of probe sets flagged as array outliers are of suspect quality. In this set of samples, array outliers accounted for from 0.008% to 1.24% of genes (median = 0.2%) and were treated as missing values in subsequent analyses.
Finally, the rma function in the affy package, implemented in R (Version 1.7.1, http://cran.r-project.org/) as part of the Bioconductor project (http://www.bioconductor.org/) was used to perform background correction, normalization, and expression modeling with the same CEL files for young animals as above. We used the default background adjustment, which models observed PM probe intensity as the sum of exponentially distributed signal and Gaussian noise, which is presumably due to optical noise and nonspecific binding. Normalization was done using the default quantile method, which adjusts probe intensities so that all GeneChips exhibit the same averaged empirical probe intensity distribution. After log base 2 (log2) transformation, expression was estimated by modeling using robust multiarray analysis (RMA). RMA models, which are additive on the log2 intensity scale with additive error, are fit iteratively for each probe set using Tukey's median polish algorithm (29), a robust method of fitting two-way additive models (26). Like PM-only in dChip, RMA uses only PM probes.
Further, high level analyses (see below) to examine reproducibility, components of variance, and concordance of results using individual or pooled samples were conducted using S-PLUS (Version 6.1 for Windows, Professional Edition, Release 1, Insightful, Seattle, WA) and StatXact-5 (Release 5.0.3, Cytel Software Corporation, Cambridge, MA). For some analyses, RMA estimates were back-transformed to the raw intensity scale for consistency with other estimates of expression (i.e., MAS 5.0 signal, and dCHIP PM-MM, and PM-only), and in other cases, raw intensity scale estimates of expression were log-transformed, as would frequently be done during statistical comparisons of gene expression to stabilize variances and reduce heteroscedasticity.
Throughout this article, we use the terms probe set and gene interchangeably, although we recognize that some Affymetrix probe sets may not correspond to real genes, and some real genes are represented by more that one probe set on the same GeneChip. We also use the term probe to refer to material on the GeneChip and target to refer to material isolated from the biological sample.
 |
RESULTS
|
|---|
Reproducibility
Source data (i.e., *.CEL files) from 15 GeneChip assays (Table 1; A, B, C, and ABC chips) were subjected to low-level analysis by 4 methods: 1) statistical expression analysis algorithm used by MAS 5.0, 2) PM-MM modeling by DNA Chip Analyzer (dChip), 3) PM-only modeling by dChip (23,28,30), and 4) RMA modeling by functions in Bioconductor (26,27). Of 12,422 noncontrol genes, 3715 (30%) were called as present on all 15 GeneChips by MAS 5.0, and 5992 (48%) were absent or marginal on all GeneChips. The remainder were present or absent on a mixture of chips. Array outliers are probe sets on a GeneChip that have a disproportionately high model-based standard error of expression and therefore an unusually poor fit compared to the same probe set on other GeneChips. GeneChips with a large number of array outliers may have quality problems, and estimates of expression are unreliable. We have opted to treat such data as missing values. By PM-MM modeling, 864 genes were deemed array outliers and treated as missing values on at least one GeneChip, and 93% of these were called absent or marginal by MAS 5.0 on at least one GeneChip. By PM-only modeling, only 327 genes were array outliers on at least one GeneChip and 82% of these were called absent or marginal by MAS 5.0 on at least one GeneChip. A total of 11,299 genes remain after eliminating genes with array outliers by either modeling method on any GeneChip; 3588 of these were called present by MAS 5.0 on all 15 GeneChips.
Scatterplots of the expression data (RMA data back-transformed for comparability) for representative replicate GeneChips using the same RNA (Figure 1AD) and for RNA from different animals in the same age group (Figure 1EH) show the typical comet-shaped distributions. Regardless of the method of low-level analysis, replicate data agree better than data from different animals in the same age group, and genes with higher expression agree more closely than genes with lower expression (Figure 1). For the purposes of illustration, 6 genes exhibiting a range of expression levels were selected and are shown on all panels. Comparing Figures 1EH, these 6 genes show similar relative positioning regardless of the method of low-level analysis. That is, genes labeled #1, #3, and #4 have higher expression values on GeneChip B1 than for C1 for all methods. For the replicate arrays (Figure 1AD), the 6 genes are all closely positioned on the midline.

View larger version (27K):
[in this window]
[in a new window]
|
Figure 1. Scatterplots of replicate assays of the same RNA and from assays of different animals in the same group with low-level modeling by 4 different methods. The same 6 genes have been highlighted on all 8 plots. The solid lines indicate fourfold differences in expression: A, replicate data from the same RNA from animal B, with expression estimated by MAS 5.0; B, replicate data from the same RNA from animal B, with expression estimated by PM-MM; C, replicate data from the same RNA from animal B, with expression estimated by PM only; D, replicate data from the same RNA from animal B, with expression estimated by RMA; E, data from 2 different animals (B and C) in the same treatment group, with expression estimated by MAS 5.0; F, data from 2 different animals (B and C) in the same treatment group, with expression estimated by PM-MM; G, data from 2 different animals in the same treatment group (B and C), with expression estimated by PM only. H, data from 2 different animals in the same treatment group (B and C), with expression estimated by RMA. In all panels, numbered genes are as follows: 1 = (Probeset = 161993_r_at, Gene symbol = 6330403K07Rik, locuslink = 103712); 2 = (Probeset = 161403_r_at, Gene symbol = Acrv1, locuslink = 11451); 3 = (Probeset = 99404_at, Gene symbol = Cyp7a1, locuslink = 13122); 4 = (Probeset = 104213_at, Gene symbol = AI266885, locuslink = 98890); 5 = (Probeset = 95348_at, Gene symbol = Cxcl1, locuslink = 14825); 6 = (Probeset = 97420_at, Gene symbol = Lrg-pending, locuslink = 76905)
|
|
Assay reproducibility was more formally examined using 6 technical replicate assays of the same pool of RNA (Table 1; ABC1, ABC2, ABC3, ABC7, ABC8, and ABC9) by computing means, standard deviations, and CVs for each gene. The data confirm that absolute variability is larger for genes with higher levels of expression, although the CVs of those genes are not larger. For genes with approximately the same numerical value for average expression, estimated standard deviations varied by up to 10-fold, depending on the method of low-level analysis. When expression is estimated by either the MAS 5.0 statistical algorithm or the dChip PM-MM model, CVs were highest for genes with very low expression values and tended to decrease with increasing expression value, so that genes in the lowest decile of expression had median CVs of 50% or more, while genes in the highest decile of expression had median CVs of about 10% (Figure 2A and B). In contrast, when expression was estimated by the PM-only model, CVs were consistently low and did not vary with expression except for genes in the lowest decile of expression (Figure 2C). The median CV of all genes was about 7% (Table 2). For expression estimated by RMA and back-transformed to the raw intensity scale for consistency with other measures of expression, CVs were consistently low and varied little with expression (Figure 2D). In fact, CVs tended to be lower at very low expression, and the median CV was 6.2%. Expression estimates based only on PM probes show reduced technical variability, especially at low levels of expression, with RMA being slightly better than dChip PM-only on average, although the distribution of CVs in this relatively small sample size were broadly overlapping.

View larger version (27K):
[in this window]
[in a new window]
|
Figure 2. Boxplots of gene-specific coefficients of variation (CVs) with genes grouped by deciles of average expression value. CVs are estimated from n = 6 assays of the same RNA (6 ABC pooled samples). Upper and lower whiskers indicate maximum and minimum values, respectively; the top, midline, and bottom of the boxes indicate the 75th, 50th, and 25th percentiles, respectively. Data from low-level modeling by 4 different methods are shown: A, Affymetrix MAS 5.0; B, perfect match minus mismatch (PM-MM); C, perfect match-only (PM-only); D, robust multiarray analysis (RMA). SD = standard deviation
|
|
View this table:
[in this window]
[in a new window]
|
Table 2. Summary Percentiles of Coefficients of Variation (SD/Mean) (Raw Intensity Scale) for 6 Replicates of the Same Pool of RNA Using Affymetrix Mg-U74av2 GeneChips (ChipIDs, ABC1, ABC2, ABC3, ABC7, ABC8, ABC9).
|
|
A similar analysis was performed using 9 assays from individual animals in the young group, thus representing both technical and biological replicates (Table 1; Y19). Results are similar to those above but with CV distributions for all models being shifted upward due to additional variability contributed by animal-to-animal differences. Figure 3 shows the 90th percentiles of gene-specific CVs for expression estimated with PM-only and RMA models. PM-only CVs are highest at the lowest levels of expression and generally decrease with increasing expression, while RMA CVs are lower at low expression and tend to increase slightly with increasing expression. In this animal model of aging, the 90th percentiles of CVs for technical + biological replicates are about 1.25-fold and 1.5-fold higher than corresponding CVs for technical replicates for RMA and PM-only modeled expression, respectively, and 90th percentiles of CVs in technical + biological replicates are very similar for both methods of modeling.

View larger version (16K):
[in this window]
[in a new window]
|
Figure 3. Plot of 90th percentiles of gene-specific coefficients of variation with genes grouped in half-deciles (i.e., 05th percentile, 5th10th percentile, etc.) by average expression value for perfect match-only (PM-only) estimated expression and robust multiarray analysis (RMA) estimated expression from datasets with technical (tech) replicates (Table 1: ABC13, ABC79) and technical plus biological (tech + bio) replicates (Table 1: Y19). SD = standard deviation
|
|
Sources of Variability
Nine GeneChips representing RNA from 3 different animals in the young age group were assayed 3 times, on 3 different days (Table 1; A1, B1, C1, A2, B2, C2, A3, B3, and C3). This balanced 3 x 3 experimental design was used to examine the relative contributions of biologic variability (due to variation in animals and how samples were obtained), technical variability (due to assay reproducibility), and batch/day effects (due to unintended day-to-day variations in experimental conditions). For each gene, analysis of variance with random main effects for animal and day was used to estimate components of variance due to animal, day, and residual error (31). Assuming that the overall variance of measured expression of a gene, g, is the sum of the individual components of variance (
=
+
+
), each component divided by the total (
) gives an estimate of the proportion of variability attributable to each source. Ternary plots in Figure 4 show the results for raw intensity scale PM-only estimated expression values for all genes (N = 11,299) and for present genes (N = 3588). In this experiment, among all genes, the average gene had 30% of its variability attributable to animal-to-animal variation, 16% attributable to day effects, and 54% left over as residual. This is remarkably similar to what we would predict based on the observation that CVs of biological + technical replicates (presented above) were about 50% higher than CVs for purely technical replicates. For present genes, the components comprised, on average, 41%, 11%, and 48% of total variance, respectively.

View larger version (28K):
[in this window]
[in a new window]
|
Figure 4. Ternary plots of variance component proportions. The proportion of variance attributable to each component is plotted as the perpendicular distance from a baseline toward the point of the triangle labeled for that component. Dashed lines delineate 20% intervals in each direction. Large gray dots are plotted at the centroids (three-dimensional average) for each graph. A, All genes; B, present genes as defined in Table 2
|
|
Very similar results were obtained for analyses of expression data modeled by the other two methods (MAS 5.0 and PM-MM) when only present genes were considered (data not shown). However, when all genes were considered, day and residual error combined accounted, on average, for 92.1%, and 91.7% of variance for MAS 5.0 and PM-MM, respectively. This is consistent with the very high CVs seen in the analysis of reproducibility for genes with lower expression values. Log-transformed PM-only data and log-scale RMA data give results virtually identical in every respect to those obtained for raw intensity scale PM-only estimates (data not shown).
Interanimal Variability and Pooling Effect
We measured the mRNA expression of liver RNA with individual and pooled RNA from nine 5-month-old (young) and nine 13-month-old (middle aged) C57BL/6 mice using Affymetrix Murine Genome U74Av2 Genechips (Table 1; Y1 to Y9 and M1 to M9). RNAs from 3 sets of 3 animals were pooled to generate 3 pooled samples for each age group (i.e., Y1 + Y2 + Y3, Y4 + Y5 + Y6, and Y7 + Y8 + Y9). For pooled samples, array hybridization was repeated 3 times with probes being synthesized separately from the same source RNA pool. Quantitation, normalization, and expression modeling using the PM-only and RMA models were performed as described above. Genes (N = 1059) with missing values (outliers) were omitted in order to maintain balanced representation. For each age group, we used the resulting data to estimate gene-specific variances in two ways. First, we used GeneChips from individual animals (n = 9) to compute the usual estimate of sample variance. Second, for the pooled samples, we computed gene-specific variances for each set of GeneChips (i.e., Y123A, Y456A, and Y789A) and then combined the resulting estimates by pooling the sums of squares and dividing by the total degrees of freedom. For young animals, 56% of 11,363 genes had smaller variance with pooled RNA than with individual animals (Figure 5). In this experimental system, results were not improved by log-transforming the expression data, and log-scale RMA modeled expression also gave very similar results (data not shown). Results were also very similar for middle-aged animals (PM-only modeled expression, RMA not done), with gene-specific variances of pooled samples being less than that for individual samples in 58% of genes.

View larger version (31K):
[in this window]
[in a new window]
|
Figure 5. Scatterplot of within-group standard deviations, for each gene, estimated from individual and pooled samples for young animals. Points below and to the right of the solid line, representing equal standard deviations, correspond to genes for which the estimate derived from pooled samples is smaller than the estimate derived from individual samples, and for which pooling would be preferred
|
|
Overall, these results are consistent with what would be expected if interanimal variability is about the same size as other sources of variability combined, and if pooling RNA from 3 animals approximates a physical averaging of expression, and therefore reduces the biological (but not technical) variability accordingly. Although these data suggest that pooling could be advantageous in the long run, it is clear from Figure 5 that, for modest sample sizes, observed results vary considerably for individual genes, perhaps because the sources of variability vary in importance for different genes and pooling does not generate perfectly equal representation. In fact, there is a weak negative association (kappa = .02, p =.01) as to preference for pooling in the young group compared with the middle-aged group, indicating slightly less agreement than would be expected by chance. That is, genes for which pooling was advantageous in the young group were slightly less likely than chance to also be advantageous in the middle-aged group.
Sample Size
We observed that with PM-only or RMA modeling of expression, the distribution of probe set-specific CVs did not depend on level of expression. This suggests a strategy for determining sample size when an experiment will use gene-wise hypothesis testing to detect differentially expressed genes, a common experimental approach. The approximation is based on choosing a target fold change to be detected in a two-sample comparison, selecting an upper quantile (i.e., 90th percentile or 95th percentile) of the CV distribution as an estimate of variability, and applying van Belle and Martin's (32) modification of the usual sample size equation for a two-sample t test to estimate required per group sample sizes.
where f (fold change) is the ratio of the geometric means of the raw data, V is the within-group CV,
is the single gene two-tailed Type I error (typically 0.05 or 0.01), ß is the single gene Type II error (typically 0.20 or 0.10), and z1
is the 100(1
)th percentile of the standard normal distribution. For
= 0.01, z1
/2 = 2.56. Two observations are added to the computed sample size to adjust for the fact that variances are unknown and will be estimated from the data (32). The equation has been generalized to allow for unequal sample sizes, for different group-specific estimates of CVs, and for paired comparisons (33).
The equation is based on the assumption that log-transformed expression data for a particular gene are approximately normally distributed. This assumption is difficult to verify with the small sample sizes available in this study. A random sample of 1000 genes tested by using the Kolmogorov-Smirnov test for goodness of fit of log-transformed data to the normal distribution found only 4 exhibiting a significant lack of fit to the normal distribution at the p =.1 level. A similar analysis using data from a set of primary human breast cancers (n = 24), assayed using the Affymetrix HgU95av2 GeneChip (Allred, unpublished observations), found only 13 of 1000 genes exhibiting a significant lack of fit to the normal distribution at the p =.1 level. This suggests that the assumption may be a reasonable approximation for many genes. In addition, in many cases, actual testing will rely on nonparametric or permutation methods not dependent on the assumption of normality.
Figure 6 shows the relationship between required sample size per group and fold change for a range of CVs. To illustrate, total within-group variance, estimated from the experiments presented above, was used to recompute CVs that included both assay reproducibility and other sources of experimental error variability. In both the young and middle-aged mouse experiments, each involving n = 9 individual normal liver samples from genetically homogeneous mice, 95% of genes had within-group CVs of 0.22 or less for PM-only modeled expression, and we would estimate that 7 samples per group will have at least 90% power to detect 1.5-fold change in 95% of genes at the 0.01 level of significance. In a set of primary human breast cancers (n = 24), assayed using the Affymetrix HgU95av2 GeneChip (Allred, unpublished observations), the 95th percentile CV (0.67) is much higher, due at least in part to the much more heterogeneous nature of the human cancer samples, and at least 36 samples per group would be required. Increasing the level of significance to 0.001, to increase protection against false positives, will require the per-group sample sizes to increase to 8 and 50 for the mouse and human breast cancer examples, respectively.

View larger version (23K):
[in this window]
[in a new window]
|
Figure 6. Nomogram relating sample size (n per group) to the fold change in expression to be detected and the within-group coefficient of variation, expressed as a proportion, for a level of significance of 0.01 and a power of 90%. SD = standard deviation
|
|
 |
DISCUSSION
|
|---|
Using Affymetrix GeneChip arrays, we investigated reproducibility, sources of variability, interanimal variability, sample pooling effect, and suggest an approximation for sample size determination on gene expression studies with microarrays. We analyzed low-level data using the most widely used methods: MAS 5.0 signal, dChip PM-MM and PM-only models, and RMA. Model-based methods using only PM probes consistently gave less variable results, with RMA outperforming PM-only on technical replicates, and results for technical + biological replicates favoring RMA at low expression and being similar otherwise. This is entirely consistent with data from Irizarry and colleagues (25), in which they demonstrate in a reanalysis of the Affymetrix and Genelogic spike in experiments that model-based methods of expression estimation are superior than the current measure provided by Affymetrix, and further suggest that the log scale additive RMA model outperforms the raw scale multiplication PM-only model implemented by dChip. This is obviously an area of continuing research and refinement. Models using additional sequence information such as G/C content (GCMRA, Bioconductor, http://www.bioconductor.org/) or direct modeling of molecular interaction on hybridization (34,35) are also under development. Using optimal expression estimation models, technical variability is low and technical replicates add little additional precision.
Our comparison of methods of estimating expression have the advantage over previous reports of examining a real experimental dataset, rather than artificial spike-in experiments, but has the disadvantage that we cannot directly evaluate accuracy or bias in estimation of absolute mRNA levels or fold changes. In analyses of publicly available spike-in experiments, where known amounts of a very small number of control genes have been spiked-in, it was observed that quantile normalization and RMA estimation of expression resulted in a 10%20% compression of estimated fold change as compared to MAS 5.0 (25). MAS 5.0 appeared to be more accurate for moderate fold changes but also appeared to be somewhat biased toward the null (fold change = 1) at larger fold changes. Despite compression of estimated fold changes, all three methods of summarization (MAS 5.0, PM-only, and RMA) almost invariably yielded estimates that were in the correct order. Irizarry and colleagues (25) argue that loss of accuracy is outweighed by the benefits of increased precision. Accuracy and bias in experiments with many differentially expressed genes remain unknown and more work is needed, especially in understanding the relative effects of normalization and expression estimation. For example, RMA appears to have benefits in reducing variability, but the method of normalization most often used with RMA, quantile normalization, may have unintended consequences of shrinking expression toward a common value, and biasing subsequent estimates of fold change.
We examined three sources of variability in a 3 x 3 factorial experiment. The first source was interanimal or animal-to-animal variation in the same treatment group. The second source was day effects. Although our experiments were carried out by the same technician using the same equipment, under the same experimental conditions (as much as possible), there was always some variability caused by the day of experiment. Minor variations in temperature, duration, or other experimental conditions likely change the optimality of labeling or hybridization favoring some sequences over other sequences, thus slightly changing the apparent abundances. This source of variation is not unique to microarray experiments. It also has been observed in reverse transcription reactions (Dr. Kenton Miller, personal communication). Finally, residual variability is variability other than animal-to-animal variability and day effects. Our data showed that the over 59% of variability came from the day effects and residual error regardless of the method of analysis, with up to 11% being due to day-to-day effects. We have found that even tiny differences in the experimental protocol can add substantial noise to the system, and internal consistency is increased by having a highly trained technician. This also suggests that, to avoid biases due to confounding with unavoidable day effects, experimental designs should attempt to distribute any day-to-day variability equally across all samples or all types of samples. Unbalanced allocation will tend to confound day effects and experimental effects, possibly leading to spurious false positives. The number of array experiments that can be conducted at the same time is limited, and therefore using a Latin-square type of experimental design to distribute the day-to-day variation over all sample types is a good idea.
We examined the possible benefits of pooling samples to reduce variability and found that pooling was neither beneficial nor detrimental. If interanimal variability is large, then pooling could reduce the CV, but with the small interanimal variability, as in this animal model, the beneficial effect of pooling is reduced. This result is consistent with results of Kendziorski and colleagues (36), who considered the contribution of both technical and biological variation and applied their theoretical findings to reverse transcription-polymerase chain reaction data of multiple genes. They give a formula, as a function of the ratio of biological to technical variation, for the number of subjects and array required in a pooled experiment to obtain gene expression estimates and confidence intervals comparable to single-sample data. They demonstrate that, by increasing the number of subjects (in pools), the number of arrays can be reduced without reducing precision. Pooling is most advantageous when biological variation is considerably larger that technical variation. They also conclude that technical replication is not beneficial. Our results are at odds with those of Peng and colleagues (37), who performed virtual pooling by averaging data from real microarray experiments. They failed to consider technical variation as a separate component of variability that is not reduced by pooling and therefore overestimate the benefits of pooling.
Despite limited benefit in variance reduction, samples could be pooled to solve the problem of limited amounts of RNA that may be available from a single animal.
We also presented a simple, two-sample t-test-based approach to estimating sample size for microarray experiments. The method relies on the assumption that expression data for a particular gene are approximately lognormally distributed. We suggest using an upper bound for estimated variability, for example, the 95th percentile of within-group gene-specific CV to be reasonably sure of having the desired power to detect target differences in most genes. The lognormal assumption was not tested because the available sample sizes were too small to have any power to detect all but the most extreme lack of fit. However, log-transformed gene expression data are frequently compared using two sample t tests or various analogs, and the assumption seemed a reasonable approximation. We observe the method leads to relatively small suggested sample sizes for biologically homogeneous experimental systems, and considerably larger sample sizes for more biologically heterogeneous systems.
Although the results presented here are based on modest sample sizes, the sample sizes are comparable to those seen in many microarray studies, results from two separate age groups are consistent, and many of the results confirm similar findings by other investigators.
 |
Acknowledgments
|
|---|
This work is supported in part by NIH grants AG-00746, AG-14674-04S1, and P01CA30195. Animals were maintained by the Animal Core of the Program Project (AG-14674). The authors thank Ms. Vivian Diaz for excellent breeding and care of the animals, and Mr. David Peng and Dr. Joe Martinez for their technical expertise.
 |
Footnotes
|
|---|
James R. Smith,, PhD, Decision Editor
Received October 17, 2002
Accepted December 18, 2003
 |
References
|
|---|
- Nguyen C, Rocha D, Granjeaud S, et al. Differential gene expression in the murine thymus assayed by quantitative hybridization of arrayed cDNA clones. Genomics.. 1995;29:207-216.[Medline]
- Lockhart DJ, Dong H, Byrne MC, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays [see comments]. Nat Biotechnol.. 1996;14:1675-1680.[Medline]
- Jelinsky SA, Harris HA, Brown EL, et al. Global transcription profiling of estrogen activity: estrogen receptor alpha regulates gene expression in the kidney. Endocrinology.. 2003;144:701-710.[Abstract/Free Full Text]
- Ramialison M, Mohr E, Nal B, et al. Expression profiling in mouse fetal thymus reveals clusters of coordinately expressed genes that mark individual stages of T-cell ontogeny. Immunogenetics.. 2002;54:469-478.[Medline]
- Chee M, Yang R, Hubbell E, et al. Accessing genetic information with high-density DNA arrays. Science.. 1996;274:610-614.[Abstract/Free Full Text]
- DeRisi J, Penland L, Brown PO, et al. Use of a cDNA microarray to analyse gene expression patterns in human cancer [see comments]. Nat Genet.. 1996;14:457-460.[Medline]
- Pietu G, Alibert O, Guichard V, et al. Novel gene transcripts preferentially expressed in human muscles revealed by quantitative hybridization of a high density cDNA array. Genome Res.. 1996;6:492-503.[Abstract/Free Full Text]
- Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW. Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci U S A.. 1996;93:10614-10619.[Abstract/Free Full Text]
- Barczak A, Rodriguez MW, Hanspers K, et al. Spotted long oligonucleotide arrays for human gene expression analysis. Genome Res.. 2003;13:1775-1785.[Abstract/Free Full Text]
- Gunther EC, Stone DJ, Gerwien RW, Bento P, Heyes MP. Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc Natl Acad Sci U S A.. 2003;100:9608-9613.[Abstract/Free Full Text]
- Keay S, Seillier-Moiseiwitsch F, Zhang CO, Chai TC, Zhang J. Changes in human bladder epithelial cell gene expression associated with interstitial cystitis or antiproliferative factor treatment. Physiol Genomics.. 2003;14:107-115.[Abstract/Free Full Text]
- Pusztai L, Ayers M, Stec J, et al. Gene expression profiles obtained from fine-needle aspirations of breast cancer reliably identify routine prognostic markers and reveal large-scale molecular differences between estrogen-negative and estrogen-positive tumors. Clin Cancer Res.. 2003;9:2406-2415.[Abstract/Free Full Text]
- Chang JC, Wooten EC, Tsimelzon A, et al. Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. Lancet.. 2003;362:362-369.[Medline]
- DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science.. 1997;278:680-686.[Abstract/Free Full Text]
- Wodicka L, Dong H, Mittmann M, Ho MH, Lockhart DJ. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat Biotechnol.. 1997;15:1359-1367.[Medline]
- Agarwal AK, Rogers PD, Baerson SR, et al. Genome-wide expression profiling of the response to polyene, pyrimidine, azole, and echinocandin antifungal agents in Saccharomyces cerevisiae. J Biol Chem.. 2003;278:34998-35015.[Abstract/Free Full Text]
- Arava Y, Wang Y, Storey JD, Liu CL, Brown PO, Herschlag D. Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A.. 2003;100:3889-3894.[Abstract/Free Full Text]
- Lee ML, Kuo FC, Whitmore GA, Sklar J. Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci U S A.. 2000;97:9834-9839.[Abstract/Free Full Text]
- Warrington JA, Nair A, Mahadevappa M, Tsyganskaya M. Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes [In Process Citation]. Physiol Genomics.. 2000;2:143-147.[Abstract/Free Full Text]
- Vawter MP, Crook JM, Hyde TM, et al. Microarray analysis of gene expression in the prefrontal cortex in schizophrenia: a preliminary study. Schizophr Res.. 2002;58:11-20.[Medline]
- Sambrook J, Fritsch EF, Maniatis T. Molecular Cloning: A Laboratory Manual. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory; 1989.
- Statistical Algorithms Description Document. Affymetrix, Inc. Available at: http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf.
- Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A.. 2001;98:31-36.[Abstract/Free Full Text]
- Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics.. 2003;19:185-193.[Abstract/Free Full Text]
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res.. 2003;31:e15.[Abstract/Free Full Text]
- Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics.. 2003;4:249-264.[Abstract]
- Ihaka R, Gentleman R. R: A Language for Data Analysis and Graphics. J Computat Graph Stat.. 1996;5:299-314.
- Li C, Wong WH. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol.. 2001;2:research0032.0031-0032.0011.
- Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley Publishing Company; 1977.
- Naef F, Lim DA, Patil N, Magnasco M. From features to expression: high-density oligonucleotide arrays analysis revisited. Paper presented at: Proceedings of the DIMACS Workshop on Analysis of Gene Expression Data; October 2426, 2001.
- Winer BJ. Statistical Principles in Experimental Design. 2nd Ed. New York: McGraw-Hill Book Company; 1971.
- van Belle G, Martin DC. Sample size as a function of coefficient of variation and ratio of means. Am Stat.. 1993;47:165-167.
- Wolfe R, Carlin JB. Sample-size calculation for a log-transformed outcome measure. Control Clin Trials.. 1999;20:547-554.[Medline]
- Zhang L, Miles MF, Aldape KD. A model of molecular interactions on short oligonucleotide microarrays. Nat Biotechnol.. 2003;21:818-821.[Medline]
- Zhang L, Miles MF, Aldape KD. Corrigendum: a model of molecular interactions on short oligonucleotide microarrays. Nat Biotechnol.. 2003;21:941.
- Kendziorski CM, Zhang Y, Lan H, Attie AD. The efficiency of pooling mRNA in microarray experiments. Biostatistics.. 2003;4:465-477.[Abstract]
- Peng X, Wood CL, Blalock EM, Chen KC, Landfield PW, Stromberg AJ. Statistical implications of pooling RNA samples for microarray experiments. BMC Bioinformatics.. 2003;4:26.[Medline]