Experimental Design and Linear Models | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| Exercise 1: From Limma User's Guide 20 April 2006 | |||||||||
|
ApoAI Knockout Data: A Two-Sample Experiment on spotted cDNA microarrays
In this section we consider a case study where two RNA sources are compared through a common reference RNA. The analysis of the log-ratios involves a
two-sample comparison of means for each gene. Background. The data is from a study of lipid metabolism by M. J. Callow, S. Dudoit, E. L. Gong, T. P. Speed, and E. M. Rubin. Microarray expression
profiling identifies genes with altered expression in HDL deficient mice. Genome Research, 10:2022-2029, 2000. The apolipoprotein AI (ApoAI) gene is known
to play a pivotal role in high density lipoprotein (HDL) metabolism. Hybridizations. The experiment compared 8 ApoAI knockout mice with 8 normal C57BL/6 ( "black six") mice, the control mice. For each of these 16 mice, target mRNA was obtained from liver tissue and labelled using a Cy5 dye. The RNA from each mouse was hybridized to a separate microarray. Common reference RNA was labelled with Cy3 dye and used for all the arrays. The reference RNA was obtained by pooling RNA extracted from the 8 control mice. Questions. Specify target file describing which sample will be hybridized on which slide, e.g.
You are in an analogous situation but you can use single channel microarrays. In case you want to start from the downloaded data load("/mypathway/ApoAIArraysLoaded.lma")
library(limma)
genes=gal[,4:6]
targets=Targets
RG[[5]]=maLayout
RG[[6]]=genes
RG[[7]]=targets
names(RG)[5:7]=c("printer","genes","targets")
MAnorm=normalizeWithinArrays(RG)
rm(list=ls()[-c(8,11,22,23,38,45,46)])
ls()
save.image("/mypathway/ApoAI.RData")
q()
In this example we assume that the data is available as an RGList in the data file ApoAI.RData. library(limma)
load("/Users/ciggen/CIG_ED/ApoAI.RData")
names(RG)
Array print-tip normalization. MAnorm=normalizeWithinArrays(RG) cols=MAnorm$targets$Cy5 cols[cols=="wild type"] <- "blue" cols[cols=="ApoAI KO"] <- "yellow" boxplot(MAnorm$M~col(MAnorm$M),names=rownames(MAnorm$targets),col=cols,xlab="Mouse",ylab="M-values") Define the design matrix design <- cbind("Control-Ref"=1,"KO-Control"=MAnorm$targets$Cy5=="ApoAI KO")
Fit a linear model fit=lmFit(MAnorm,design) Compute empirical Bayes statistics fite=eBayes(fit) options(digits=3) Show the top results for the first and second coefficient. topTable(fite, coef=1, number=10, adjust.method="BH",sort.by="B") topTable(fite, coef=2, number=10, adjust.method="BH",sort.by="B") Create vulcano plots for the first and second coefficient. par(mfrow=c(1,2)) volcanoplot(fite,coef=1,highlight=8,names=fite$genes$Name,main="Control vs Ref") volcanoplot(fite,coef=2,highlight=8,names=fite$genes$Name,main="KO vs Control") Compare the two volcano plots. What can you observe? Would you expect such a result?
par(mfrow=c(2,1)) plot(fite$Amean, fite$coeff[,1], pch='.',ylim=c(-1,1)) plot(fite$Amean, fite$coeff[,1], pch='.',ylim=c(-3,3)) Create MvA plots for the first and second coefficient. plot(fite$Amean, fite$coeff[,1], pch='.',ylim=c(-3,3) ) points(fite$Amean[fite$lods[,1]>5], fite$coeff[fite$lods[,1]>5,2], pch=20, col=2) We select the M values corresponding to the 10 clones with the highest B values for the comparison control versus reference (pool of controls). MAnorm$M[sort(fite$lods[,1], index.return=T,decreasing = T )$ix[1:10],1:8] Remember that M values in this table are log2(ControlMouse1/poolOfControl), log2(ControlMouse2/poolOfControl), ..., log2(ControlMouse8/poolOfControl). Can you observe something unusual? MAnorm$M[sort(fite$lods[,1], index.return=T,decreasing = T )$ix[1:10],1:16] plot(fite$Amean, fite$coeff[,2], pch='.',ylim=c(-3,3) ) points(fite$Amean[fite$lods[,2]>5], fite$coeff[fite$lods[,2]>5,2], pch=20, col=2) MAnorm$M[sort(fite$lods[,2], index.return=T,decreasing = T )$ix[1:10],1:8] MAnorm$M[sort(fite$lods[,2], index.return=T,decreasing = T )$ix[1:10],9:16] The second coefficient measures the fold change of mutant over wild type. Genes which have positive M-values are more highly expressed in the mutant RNA while
with negative M-values are more highly expressed in the wild type. The analysis is analogous to the classical single-sample t-test except that we have used
empirical Bayes methods to borrow information between genes. MAnew=MAnorm MAnew$M=MAnorm$M[,9:16] MAnew$A=MAnorm$A[,9:16] design <- rep(1, 8) fitN=lmFit(MAnew, design) fitNe=eBayes(fitN) topTable(fitNe, coef=1) Compare the two volcano plots from the first model using 16 slides and the second model based on 8 slides. par(mfrow=c(1,2)) volcanoplot(fitNe,coef=1,highlight=8,names=fite$genes$Name,main="KO vs Ref") volcanoplot(fite,coef=2,highlight=8,names=fite$genes$Name,main="KO vs Control") Do you see disadvantages in this approach? What about dye bias? Now can you try to repeat the first part of the first exercise using only 4 slides instead of 16. Compare the results. What happen to the B, moderated t and p values? |
| Exercise 2: From Limma User's Guide 20 April 2006 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Estrogen data: 2x2 factorial experiment with Affymetrix Arrays
This data is from the estrogen package on Bioconductor. A subset of the data is also analysed in the factDesign package vignette. To repeat this case study you will need to have the R packages affy, estrogen and hgu95av2cdf installed. The data gives results from a 2x2 factorial experiment on MCF7 breast cancer cells using Affymetrix HGU95av2 arrays. The factors in this experiment were estrogen (present or absent) and length of exposure (10 or 48 hours). The aim of the study is the identify genes which respond to estrogen and to classify these into early and late responders. Genes which respond early are putative direct-target genes while those which respond late are probably downstream targets in the molecular pathway. Questions. How would you design such an experiment using two-color microarrays knowing that you have
Can you write a design matrix for this experiment? library(limma)
library(affy)
library(hgu95av2cdf)
datadir <- file.path(.find.package("estrogen"),"extdata")
# OR
datadir <- c("/mypathway/library/estrogen/extdata")
dir(datadir)
targets <- readTargets("phenoData.txt",path=datadir,sep="",row.names="filename")
targets
ab <- ReadAffy(filenames=targets$filename, celfile.path=datadir)
eset <- rma(ab)
save.image("/mypathway/estro.RData")
q()
In this example we assume that the data is available as an exprSet object in the data file estro.RData.
library(limma)
library(affy)
load("/Users/ciggen/CIG_ED/estro.RData")
To show which procedure can be applied to the object eset slotNames(eset) To extract the log2 intensities after RMA normalization matExpr=exprs(eset) colnames(matExpr) #First parametrization treatments <- factor(c(1,1,2,2,3,3,4,4),labels=c("e10","E10","e48","E48"))
data.class(treatments)
Define the design matrix contrasts(treatments) <- cbind(Time=c(0,0,1,1),E10=c(0,1,0,0),E48=c(0,0,0,1))
design <- model.matrix(~treatments)
colnames(design) <- c("Intercept","Time","E10","E48")
Alternatively you can write your design matrix manually: fit <- lmFit(eset,design) Define the contrast matrix cont.matrix <- cbind(E10=c(0,0,1,0),E48=c(0,0,0,1)) Given the linear model fit (fit), compute estimated coefficients and standard errors for a given set of contrasts. fit2 <- contrasts.fit(fit, cont.matrix) Compute empirical Bayes statistics fit2e <- eBayes(fit2) Show the results colnames(fit2e$coef) options(digits=3) topTable(fit2e, coef=1) topTable(fit2e, coef=2) We upload an annotation file from Affymetrix, a shorter version than the original one. (Of course shorter only in columns!) netAffx=read.delim("/Users/ciggen/CIG_ED/HG_U95Av2_annot_short.csv", row.names=1)
dim(netAffx)
colnames(netAffx)
netAffx[1:10,]
cbind(topTable(fit2e, coef=1, n=24),
netAffx[topTable(fit2e, coef=1, n=24)[,1],2])
cbind(topTable(fit2e, coef=2, n=24),
netAffx[topTable(fit2e, coef=2, n=24)[,1],2])
cbind(topTable(fit2e, coef=1, n=24),
netAffx[topTable(fit2e, coef=1, n=24)[,1],3])
cbind(topTable(fit2e, coef=2, n=24),
netAffx[topTable(fit2e, coef=2, n=24)[,1],3])
You can observe quite a number of genes in common between the 2 lists. fit1e <- eBayes(fit)
colnames(fit1e$coef)
options(digits=3)
topTable(fit1e, coef=1)
topTable(fit1e, coef=2)
topTable(fit1e, coef=3)
topTable(fit1e, coef=4)
cbind(topTable(fit1e, coef=1, n=24),
netAffx[topTable(fit1e, coef=1, n=24)[,1],3])
cbind(topTable(fit1e, coef=2, n=24),
netAffx[topTable(fit1e, coef=2, n=24)[,1],3])
topTable(fit1e, coef=3)
topTable(fit1e, coef=4)
cbind(topTable(fit1e, coef=3, n=24),
netAffx[topTable(fit1e, coef=3, n=24)[,1],3])
cbind(topTable(fit1e, coef=4, n=24),
netAffx[topTable(fit1e, coef=4, n=24)[,1],3])
They are of course the same tables as topTable(fit2e, coef=1) topTable(fit2e, coef=2) #Second parametrization design2=matrix(c(rep(1,8),
0,0,0,0,1,1,1,1,
0,0,1,1,0,0,1,1,
0,0,0,0,0,0,1,1
), byrow = FALSE,
nrow=8, ncol=4)
colnames(design2) <- c("Intercept","Time","Estr","TimeEstr")
rownames(design2) <- treatments
fitP2 <- lmFit(eset,design2)
fitP2e <- eBayes(fitP2)
colnames(fitP2e$coef)
options(digits=3)
topTable(fitP2e, coef=1)
topTable(fitP2e, coef=2)
Compare with topTable(fit1e, coef=2)
topTable(fitP2e, coef=3)
topTable(fitP2e, coef=4)
cbind(topTable(fitP2e, coef=3, n=24),
netAffx[topTable(fitP2e, coef=3, n=24)[,1],3])
cbind(topTable(fitP2e, coef=4, n=24),
netAffx[topTable(fitP2e, coef=4, n=24)[,1],3])
We do not observe a lot of genes in common between the 2 lists. intersect(topTable(fitP2e, coef=4, n=30)[,1],topTable(fitP2e, coef=3, n=30)[,1]) What can we observe regarding their coefficients?
cbind(as.character(netAffx[topTable(fit2e, coef=1, n=24)[,1],3]),
as.character(netAffx[topTable(fitP2e, coef=3, n=24)[,1],3]))
#One more contrast to first parametrization cont.matrix <- cbind(E10=c(0,0,1,0),E48=c(0,0,0,1),"E48-E10"=c(0,0,1,-1) ) Do you remember the definition of interaction? fit2m <- contrasts.fit(fit, cont.matrix) fit2me <- eBayes(fit2m) colnames(fit2me$coef) options(digits=3) topTable(fit2me, coef=1) topTable(fit2me, coef=2) topTable(fit2me, coef=3) topTable(fitP2e, coef=4) # Some more statistics: F test results1 <- decideTests(fit2e, method="global") Another method would be to adjust the F-test p-values rather than the t-test p-values: results2 <- decideTests(fit2e, method="nestedF") Here we use a more conservative method which depends far less on distributional assumptions, which is to make use of control and spike-in probe-sets which theoretically should not be differentially-expressed. The smallest p-value amongst these controls turns out to be about 0.00014: i <- grep("AFFX",geneNames(eset))
summary(fit2e$F.p.value[i])
results <- classifyTestsF(fit2e, p.value=0.0001)
summary(results)
table(E10=results[,1],E48=results[,2])
vennDiagram(results,include="up")
vennDiagram(results,include="down")
Supplementary exercises
Replicates Arrays The simplest possible microarray experiment is one with a series of replicate two-color arrays all comparing the same two RNA sources. For a four-array experiment comparing wild type (wt) and mutant (mu) RNA, the targets file might contain the following entries:
Can you write the design matrix? Dye Swaps A simple modification of the above experiment would be to swap the dyes for one of the arrays.
If there are at least two arrays with each dye-orientation, then it is possible to estimate and adjust for any probe-specific dye effects. If the experiment was
Can you propose a design matrix, given that you are interested in detecting the genes which show dye effects?
Several Groups The above approaches for two groups extend easily to any number of groups. Suppose that three RNA targets to be compared using AffymetrixTM arrays. Suppose
that the three targets are called RNA1, RNA2 and RNA3.
Now suppose that the experiment had been conducted using two-color arrays with a common reference instead of AffymetrixTM arrays. For example the targets frame might be
Write a design matrix? Now suppose that the experiment had been conducted using two-color arrays without the use of a common reference. How would you conduct the experiment? Write a targets file and a design matrix.
Time course Time course experiments are those in which RNA is extracted at several time points after the onset of some treatment or stimulation. Simple time course experiments are similar to experiments with several groups. Here we consider a two-way experiment in which time course profiles are to be compared for two genotypes. Consider the targets frame
The targets are RNA samples collected from wild-type and mutant animals at 0, 6 and 24 hour time points. This can be viewed as a factorial experiment but a
simpler approach is to use the group-mean parametrization. We can find these by extracting the contrasts between the wild-type times. The method of analysis is described in a paper on histone deacetylase inhibitors
for a six-point time course experiment. |