Home

Optimizing the design of a microarray experiment

Friday August 25, 2006

Experimental Design and Linear Models

 
Exercise 1: From Limma User's Guide 20 April 2006
ApoAI Knockout Data: A Two-Sample Experiment on spotted cDNA microarrays

In this section we consider a case study where two RNA sources are compared through a common reference RNA. The analysis of the log-ratios involves a two-sample comparison of means for each gene.
The data used for this case study can be downloaded from
http://bioinf.wehi.edu.au/limmaGUI/DataSets.html.

Background. The data is from a study of lipid metabolism by M. J. Callow, S. Dudoit, E. L. Gong, T. P. Speed, and E. M. Rubin. Microarray expression profiling identifies genes with altered expression in HDL deficient mice. Genome Research, 10:2022-2029, 2000. The apolipoprotein AI (ApoAI) gene is known to play a pivotal role in high density lipoprotein (HDL) metabolism.
Mice which have the ApoAI gene knocked out have very low HDL cholesterol levels. The purpose of this experiment is to determine how ApoAI deficiency affects the action of other genes in the liver, with the idea that this will help determine the molecular pathways through which ApoAI operates.

Hybridizations. The experiment compared 8 ApoAI knockout mice with 8 normal C57BL/6 ( "black six") mice, the control mice. For each of these 16 mice, target mRNA was obtained from liver tissue and labelled using a Cy5 dye. The RNA from each mouse was hybridized to a separate microarray. Common reference RNA was labelled with Cy3 dye and used for all the arrays. The reference RNA was obtained by pooling RNA extracted from the 8 control mice.

Questions.
How would you design such an experiment using two-color microarrays knowing the you have 8 ApoAI knockout mice and 8 normal C57BL/6 ( " black six ") mice?
Can you write a design matrix for the experiment performed by M. J. Callow et al.?
And for an alternative design?

Specify target file describing which sample will be hybridized on which slide, e.g.

FileNameCy3Cy5
File1wtmu
File2muwt

You are in an analogous situation but you can use single channel microarrays.
How would you proceed? Let's suppose that for economical reasons you could use only 8 chips, which hybridizations would you suggest performing?

In case you want to start from the downloaded data
In R

load("/mypathway/ApoAIArraysLoaded.lma")
library(limma)
genes=gal[,4:6]
targets=Targets
RG[[5]]=maLayout
RG[[6]]=genes
RG[[7]]=targets

names(RG)[5:7]=c("printer","genes","targets")
MAnorm=normalizeWithinArrays(RG)
rm(list=ls()[-c(8,11,22,23,38,45,46)])
ls()

save.image("/mypathway/ApoAI.RData")
q()

In this example we assume that the data is available as an RGList in the data file ApoAI.RData.
In R

library(limma)

load("/Users/ciggen/CIG_ED/ApoAI.RData")

names(RG)

Array print-tip normalization.

MAnorm=normalizeWithinArrays(RG)

cols=MAnorm$targets$Cy5
cols[cols=="wild type"] <- "blue"
cols[cols=="ApoAI KO"] <- "yellow"

boxplot(MAnorm$M~col(MAnorm$M),names=rownames(MAnorm$targets),col=cols,xlab="Mouse",ylab="M-values")

Define the design matrix

design <- cbind("Control-Ref"=1,"KO-Control"=MAnorm$targets$Cy5=="ApoAI KO")

Fit a linear model

fit=lmFit(MAnorm,design)

Compute empirical Bayes statistics

fite=eBayes(fit)

options(digits=3)

Show the top results for the first and second coefficient.

topTable(fite, coef=1, number=10, adjust.method="BH",sort.by="B")
topTable(fite, coef=2, number=10, adjust.method="BH",sort.by="B")

Create vulcano plots for the first and second coefficient.

par(mfrow=c(1,2))
volcanoplot(fite,coef=1,highlight=8,names=fite$genes$Name,main="Control vs Ref")
volcanoplot(fite,coef=2,highlight=8,names=fite$genes$Name,main="KO vs Control")

Compare the two volcano plots. What can you observe? Would you expect such a result?
Create MvA plot for the first coefficient.

par(mfrow=c(2,1))
plot(fite$Amean, fite$coeff[,1], pch='.',ylim=c(-1,1))
plot(fite$Amean, fite$coeff[,1], pch='.',ylim=c(-3,3))

Create MvA plots for the first and second coefficient.

plot(fite$Amean, fite$coeff[,1], pch='.',ylim=c(-3,3) )
points(fite$Amean[fite$lods[,1]>5], fite$coeff[fite$lods[,1]>5,2], pch=20, col=2)

We select the M values corresponding to the 10 clones with the highest B values for the comparison control versus reference (pool of controls).

MAnorm$M[sort(fite$lods[,1], index.return=T,decreasing = T )$ix[1:10],1:8]

Remember that M values in this table are log2(ControlMouse1/poolOfControl), log2(ControlMouse2/poolOfControl), ..., log2(ControlMouse8/poolOfControl). Can you observe something unusual?

MAnorm$M[sort(fite$lods[,1], index.return=T,decreasing = T )$ix[1:10],1:16]

plot(fite$Amean, fite$coeff[,2], pch='.',ylim=c(-3,3) )
points(fite$Amean[fite$lods[,2]>5], fite$coeff[fite$lods[,2]>5,2], pch=20, col=2)

MAnorm$M[sort(fite$lods[,2], index.return=T,decreasing = T )$ix[1:10],1:8]
MAnorm$M[sort(fite$lods[,2], index.return=T,decreasing = T )$ix[1:10],9:16]

The second coefficient measures the fold change of mutant over wild type. Genes which have positive M-values are more highly expressed in the mutant RNA while with negative M-values are more highly expressed in the wild type. The analysis is analogous to the classical single-sample t-test except that we have used empirical Bayes methods to borrow information between genes.
Notice that the top gene is ApoAI itself which is heavily down-regulated. Theoretically the M-value should be minus infinity for ApoAI because it is the knockout gene, but because of background intensity and background estimation the M-value is only strongly negative.

Now we can consider only pool versus KO.

MAnew=MAnorm
MAnew$M=MAnorm$M[,9:16]
MAnew$A=MAnorm$A[,9:16]

design <- rep(1, 8)
fitN=lmFit(MAnew, design)
fitNe=eBayes(fitN)
topTable(fitNe, coef=1)

Compare the two volcano plots from the first model using 16 slides and the second model based on 8 slides.

par(mfrow=c(1,2))
volcanoplot(fitNe,coef=1,highlight=8,names=fite$genes$Name,main="KO vs Ref")
volcanoplot(fite,coef=2,highlight=8,names=fite$genes$Name,main="KO vs Control")

Do you see disadvantages in this approach? What about dye bias?

Now can you try to repeat the first part of the first exercise using only 4 slides instead of 16. Compare the results. What happen to the B, moderated t and p values?

 

Exercise 2: From Limma User's Guide 20 April 2006
Estrogen data: 2x2 factorial experiment with Affymetrix Arrays

This data is from the estrogen package on Bioconductor. A subset of the data is also analysed in the factDesign package vignette. To repeat this case study you will need to have the R packages affy, estrogen and hgu95av2cdf installed.

The data gives results from a 2x2 factorial experiment on MCF7 breast cancer cells using Affymetrix HGU95av2 arrays. The factors in this experiment were estrogen (present or absent) and length of exposure (10 or 48 hours).

The aim of the study is the identify genes which respond to estrogen and to classify these into early and late responders. Genes which respond early are putative direct-target genes while those which respond late are probably downstream targets in the molecular pathway.

Questions.
Can you write a design matrix for this experiment?

How would you design such an experiment using two-color microarrays knowing that you have

  • 2 RNA extracted from cells characterized by no estrogen and length of exposure 10;
  • 2 RNA extracted from cells characterized by no estrogen and length of exposure 48;
  • 2 RNA extracted from cells characterized by estrogen and length of exposure 10;
  • 2 RNA extracted from cells characterized by estrogen and length of exposure 48?

Can you write a design matrix for this experiment?

First load the required packages:

library(limma)
library(affy)
library(hgu95av2cdf)

datadir <- file.path(.find.package("estrogen"),"extdata")
# OR
datadir <- c("/mypathway/library/estrogen/extdata")

dir(datadir)

targets <- readTargets("phenoData.txt",path=datadir,sep="",row.names="filename")
targets

ab <- ReadAffy(filenames=targets$filename, celfile.path=datadir)
eset <- rma(ab)

save.image("/mypathway/estro.RData")
q()

In this example we assume that the data is available as an exprSet object in the data file estro.RData.
In R

library(limma)
library(affy)
load("/Users/ciggen/CIG_ED/estro.RData")

To show which procedure can be applied to the object eset

slotNames(eset)

To extract the log2 intensities after RMA normalization

matExpr=exprs(eset)
colnames(matExpr)

#First parametrization

treatments <- factor(c(1,1,2,2,3,3,4,4),labels=c("e10","E10","e48","E48"))
data.class(treatments)

Define the design matrix

contrasts(treatments) <- cbind(Time=c(0,0,1,1),E10=c(0,1,0,0),E48=c(0,0,0,1))
design <- model.matrix(~treatments)
colnames(design) <- c("Intercept","Time","E10","E48")

Alternatively you can write your design matrix manually:
design=matrix(c(1,1,1,1,1,1,1,1,
0,0,0,0,1,1,1,1,
0,0,1,1,0,0,0,0,
0,0,0,0,0,0,1,1),nrow=8,ncol=4, byrow=F)
rownames(design)= treatments
colnames(design)= c("Intercept","Time","E10","E48")

fit <- lmFit(eset,design)

Define the contrast matrix

cont.matrix <- cbind(E10=c(0,0,1,0),E48=c(0,0,0,1))

Given the linear model fit (fit), compute estimated coefficients and standard errors for a given set of contrasts.

fit2 <- contrasts.fit(fit, cont.matrix)

Compute empirical Bayes statistics

fit2e <- eBayes(fit2)

Show the results

colnames(fit2e$coef)
options(digits=3)
topTable(fit2e, coef=1)
topTable(fit2e, coef=2)

We upload an annotation file from Affymetrix, a shorter version than the original one. (Of course shorter only in columns!)

netAffx=read.delim("/Users/ciggen/CIG_ED/HG_U95Av2_annot_short.csv", row.names=1)

dim(netAffx)
colnames(netAffx)
netAffx[1:10,]

cbind(topTable(fit2e, coef=1, n=24),
      netAffx[topTable(fit2e, coef=1, n=24)[,1],2])
cbind(topTable(fit2e, coef=2, n=24),
      netAffx[topTable(fit2e, coef=2, n=24)[,1],2])
      
cbind(topTable(fit2e, coef=1, n=24),
      netAffx[topTable(fit2e, coef=1, n=24)[,1],3])
cbind(topTable(fit2e, coef=2, n=24),
      netAffx[topTable(fit2e, coef=2, n=24)[,1],3])

You can observe quite a number of genes in common between the 2 lists.

Alternatively

fit1e <- eBayes(fit)

colnames(fit1e$coef)
options(digits=3)
topTable(fit1e, coef=1)
topTable(fit1e, coef=2)
topTable(fit1e, coef=3)
topTable(fit1e, coef=4)

cbind(topTable(fit1e, coef=1, n=24),
      netAffx[topTable(fit1e, coef=1, n=24)[,1],3])
cbind(topTable(fit1e, coef=2, n=24),
      netAffx[topTable(fit1e, coef=2, n=24)[,1],3])

topTable(fit1e, coef=3)
topTable(fit1e, coef=4)
cbind(topTable(fit1e, coef=3, n=24),
      netAffx[topTable(fit1e, coef=3, n=24)[,1],3])
cbind(topTable(fit1e, coef=4, n=24),
      netAffx[topTable(fit1e, coef=4, n=24)[,1],3])

They are of course the same tables as

topTable(fit2e, coef=1)
topTable(fit2e, coef=2)

#Second parametrization

design2=matrix(c(rep(1,8),
   0,0,0,0,1,1,1,1,
   0,0,1,1,0,0,1,1,
   0,0,0,0,0,0,1,1
      ), byrow = FALSE, 
   nrow=8, ncol=4)
colnames(design2) <- c("Intercept","Time","Estr","TimeEstr")
rownames(design2) <- treatments

fitP2 <- lmFit(eset,design2)

fitP2e <- eBayes(fitP2)

colnames(fitP2e$coef)
options(digits=3)
topTable(fitP2e, coef=1)
topTable(fitP2e, coef=2)

Compare with

topTable(fit1e, coef=2)


topTable(fitP2e, coef=3)
topTable(fitP2e, coef=4)

cbind(topTable(fitP2e, coef=3, n=24),
      netAffx[topTable(fitP2e, coef=3, n=24)[,1],3])
cbind(topTable(fitP2e, coef=4, n=24),
      netAffx[topTable(fitP2e, coef=4, n=24)[,1],3])

We do not observe a lot of genes in common between the 2 lists.

intersect(topTable(fitP2e, coef=4, n=30)[,1],topTable(fitP2e, coef=3, n=30)[,1])

What can we observe regarding their coefficients?

cbind(as.character(netAffx[topTable(fit2e, coef=1, n=24)[,1],3]),
      as.character(netAffx[topTable(fitP2e, coef=3, n=24)[,1],3]))

#One more contrast to first parametrization

cont.matrix <- cbind(E10=c(0,0,1,0),E48=c(0,0,0,1),"E48-E10"=c(0,0,1,-1) )

Do you remember the definition of interaction?

fit2m <- contrasts.fit(fit, cont.matrix)
fit2me <- eBayes(fit2m)

colnames(fit2me$coef)
options(digits=3)

topTable(fit2me, coef=1)
topTable(fit2me, coef=2)
topTable(fit2me, coef=3)
topTable(fitP2e, coef=4)

# Some more statistics: F test
We can examine which genes respond to estrogen at either time using the moderated Fstatistics on 2 degrees of freedom. The moderated F p-value is stored in the component fit2e$F.p.value.
What p-value cutoff should be used? One way to decide which changes are significant for each gene would be to use Benjamini and Hochberg’s method to control the false discovery rate across all the genes and both tests:

 results1 <- decideTests(fit2e, method="global")

Another method would be to adjust the F-test p-values rather than the t-test p-values:

 results2 <- decideTests(fit2e, method="nestedF")

Here we use a more conservative method which depends far less on distributional assumptions, which is to make use of control and spike-in probe-sets which theoretically should not be differentially-expressed. The smallest p-value amongst these controls turns out to be about 0.00014:

 i <- grep("AFFX",geneNames(eset))
 summary(fit2e$F.p.value[i])

 results <- classifyTestsF(fit2e, p.value=0.0001)
 summary(results)

 table(E10=results[,1],E48=results[,2])
 vennDiagram(results,include="up")
 vennDiagram(results,include="down")

 

 

Supplementary exercises

Replicates Arrays

The simplest possible microarray experiment is one with a series of replicate two-color arrays all comparing the same two RNA sources. For a four-array experiment comparing wild type (wt) and mutant (mu) RNA, the targets file might contain the following entries:

FileNameCy3Cy5
File1wtmu
File2wtmu
File3wtmu
File4wtmu

Can you write the design matrix?

Dye Swaps

A simple modification of the above experiment would be to swap the dyes for one of the arrays.
The targets file might now be

FileNameCy3Cy5
File1wtmu
File2muwt
File3wtmu
File4muwt

If there are at least two arrays with each dye-orientation, then it is possible to estimate and adjust for any probe-specific dye effects. If the experiment was

FileNameCy3Cy5
File1wtmu
File2muwt
File3wtmu
File4muwt

Can you propose a design matrix, given that you are interested in detecting the genes which show dye effects?

 

Several Groups

The above approaches for two groups extend easily to any number of groups. Suppose that three RNA targets to be compared using AffymetrixTM arrays. Suppose that the three targets are called RNA1, RNA2 and RNA3.

Write a design matrix. Considered that we are interested in all pair-wise comparisons between the three groups.

Now suppose that the experiment had been conducted using two-color arrays with a common reference instead of AffymetrixTM arrays. For example the targets frame might be

FileNameCy3Cy5
File1RefRNA1
File2RNA1Ref
File3RefRNA2
File4RNA2Ref
File5RefRNA3
File6RNA3Ref

Write a design matrix?

Now suppose that the experiment had been conducted using two-color arrays without the use of a common reference. How would you conduct the experiment? Write a targets file and a design matrix.

 

Time course

Time course experiments are those in which RNA is extracted at several time points after the onset of some treatment or stimulation. Simple time course experiments are similar to experiments with several groups. Here we consider a two-way experiment in which time course profiles are to be compared for two genotypes. Consider the targets frame

FileNameTarget
File1wt.0hr
File2wt.0hr
File3wt.6hr
File4wt.24hr
File5mu.0hr
File6mu.0hr
File7mu.6hr
File8mu.24hr

The targets are RNA samples collected from wild-type and mutant animals at 0, 6 and 24 hour time points. This can be viewed as a factorial experiment but a simpler approach is to use the group-mean parametrization.
Which genes respond at either the 6 hour or 24 hour times in the wild-type?
Which genes respond in the mutant?
Which genes respond differently in the mutant relative to the wild-type?

We can find these by extracting the contrasts between the wild-type times. The method of analysis is described in a paper on histone deacetylase inhibitors for a six-point time course experiment.
M. J. Peart, G. K. Smyth, R. K. van Laar, V. M. Richon, A. J. Holloway, and R. W. Johnstone. Identification and functional significance of genes regulated by structurally diverse histone deacetylase inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 102(10):3697-3702, 2005.

 

 

Latest update 2006-08-25
Valid HTML 4.01 Transitional   Valid CSS!