Module 4: Finding differentially expressed genes with RNA-seq
We are going to look at genes differentially expressed between Luminal and Basal subtypes of breast cancer.
For this we are going to download data from The Cancer Genome Atlas [ADD LINK] readily available in BioConductor.
Mini introduction to BioConductor
BioConductor is the most popular software repository for genomics data analysis, and is worth exploring in detail. This R-based ecosystem is open-source and community built. It contains software LINK packages but also DATA and ANNOTATION packages.
Examples of packages in BioConductor:
- Data package: The entire human genome sequence e.g., for the current GRCH38 build
- Data package: Multi-modal genomic data from The Cancer Genome Atlas for various tumour types
- Annotation package: Annotation for the Illumina DNA methylation array
- Software:
DEseq
to process RNAseq data,minfi
to process DNA methylation array data, CHIPseeker for Chip-Seq data, etc.,
As we will see, BioConductor also has specialized data structures for bioinformatics operations. Data structures are specialized, standardized formats for organizing, accessing, modifying and storing data.
Examples of BioConductor data structures:
GenomicRanges
: for working with genomic coordinatesSummarizedExperiment
: container for storing data and metadata about an experimentMultiAssayExperiment
: container for experiments where multiple genomic assays were run on the same samples
BioConductor uses its own package manager to install packages. Instead of install.packages()
we use, BiocManager::install()
.
if (!requireNamespace("curatedTCGAData", quietly = TRUE))
::install("curatedTCGAData") BiocManager
Fetch breast cancer data using curatedTCGAData
Load the package:
suppressMessages(library(curatedTCGAData))
Let’s take a look at the available data for breast cancer, without downloading anything (set dry.run=TRUE
).
Each row shows a data layer available. e.g., “mRNA” is transcriptomic data, “CNA” is chromosomal copy number aberration data, etc.,
curatedTCGAData(diseaseCode="BRCA", assays="*",dry.run=TRUE, version="2.0.1")
## snapshotDate(): 2021-10-19
## See '?curatedTCGAData' for 'diseaseCode' and 'assays' inputs
## ah_id title file_size rdataclass
## 1 EH4769 BRCA_CNASeq-20160128 0 Mb RaggedExperiment
## 2 EH4770 BRCA_CNASNP-20160128 9.8 Mb RaggedExperiment
## 3 EH4771 BRCA_CNVSNP-20160128 2.8 Mb RaggedExperiment
## 4 EH4773 BRCA_GISTIC_AllByGene-20160128 1.2 Mb SummarizedExperiment
## 5 EH4774 BRCA_GISTIC_Peaks-20160128 0 Mb RangedSummarizedExperiment
## 6 EH4775 BRCA_GISTIC_ThresholdedByGene-20160128 0.3 Mb SummarizedExperiment
## 7 EH4777 BRCA_Methylation_methyl27-20160128_assays 60.7 Mb SummarizedExperiment
## 8 EH4778 BRCA_Methylation_methyl27-20160128_se 0.4 Mb SummarizedExperiment
## 9 EH4779 BRCA_Methylation_methyl450-20160128_assays 2646.4 Mb RaggedExperiment
## 10 EH4780 BRCA_Methylation_methyl450-20160128_se 6 Mb SummarizedExperiment
## 11 EH4781 BRCA_miRNASeqGene-20160128 0.6 Mb SummarizedExperiment
## 12 EH4782 BRCA_mRNAArray-20160128 27.3 Mb SummarizedExperiment
## 13 EH4783 BRCA_Mutation-20160128 4.5 Mb SummarizedExperiment
## 14 EH4784 BRCA_RNASeq2Gene-20160128 43.1 Mb DFrame
## 15 EH4785 BRCA_RNASeq2GeneNorm-20160128 64.5 Mb SummarizedExperiment
## 16 EH4786 BRCA_RNASeqGene-20160128 30 Mb SummarizedExperiment
## 17 EH4787 BRCA_RPPAArray-20160128 1.5 Mb SummarizedExperiment
## rdatadateadded rdatadateremoved
## 1 2021-01-27 <NA>
## 2 2021-01-27 <NA>
## 3 2021-01-27 <NA>
## 4 2021-01-27 <NA>
## 5 2021-01-27 <NA>
## 6 2021-01-27 <NA>
## 7 2021-01-27 <NA>
## 8 2021-01-27 <NA>
## 9 2021-01-27 <NA>
## 10 2021-01-27 <NA>
## 11 2021-01-27 <NA>
## 12 2021-01-27 <NA>
## 13 2021-01-27 <NA>
## 14 2021-01-27 <NA>
## 15 2021-01-27 <NA>
## 16 2021-01-27 <NA>
## 17 2021-01-27 <NA>
We want to get unprocessed gene expression read counts, so let’s fetch the BRCA_RNASeq2Gene-20160128
layer. You can use ?curatedTCGAData
to see a description of all data layers.
<- suppressMessages(
brca curatedTCGAData(
"BRCA",
assays="RNASeqGene",
dry.run=FALSE,
version="2.0.1")
)
This call returns a MultiAssayExperiment
object. Recall that this is a container for storing multiple assays performed on the same set of samples. See this tutorial to learn more.
Let’s briefly explore the brca
MultiAssayExperiment
object.
brca
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] BRCA_RNASeqGene-20160128: SummarizedExperiment with 20502 rows and 878 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
assays()
returns a list
with all -omic data associated with this object. Here we just have the one we downloaded.
summary(assays(brca))
## [1] "List object of length 1 with 0 metadata columns"
names()
shows the datatypes in each slot of assays()
:
names(assays(brca))
## [1] "BRCA_RNASeqGene-20160128"
So gene expression in slot 1.
We can subset the data to see what it looks like. Let’s look at just the first five measures, for the first 10 samples:
<- assays(brca)[[1]]
xpr head(xpr[1:10,1:5])
## TCGA-A1-A0SB-01A-11R-A144-07 TCGA-A1-A0SD-01A-11R-A115-07 TCGA-A1-A0SE-01A-11R-A084-07
## A1BG 164 546 1341
## A1CF 0 0 0
## A2BP1 22 1 2
## A2LD1 127 331 498
## A2ML1 94 144 114
## A2M 102123 107181 101192
## TCGA-A1-A0SF-01A-11R-A144-07 TCGA-A1-A0SG-01A-11R-A144-07
## A1BG 836 512
## A1CF 1 3
## A2BP1 0 25
## A2LD1 526 451
## A2ML1 77 76
## A2M 50316 45826
How many measures do we have?
nrow(xpr)
## [1] 20502
Prepare data for differential expression analysis
Process the expression values:
<- colnames(xpr)
cnames <- gregexpr("-",cnames)
hpos <- sapply(1:length(cnames),function(i) {substr(cnames[i],1,hpos[[i]][3]-1)})
tmp <- !duplicated(tmp)
idx
# subset values
<- tmp[idx]
tmp <- xpr[,idx]
xpr colnames(xpr) <- tmp
Patient metadata is contained in the colData()
slot. Rows contain data for each patient and columns contain measures such as clinical characteristics:
<- colData(brca)
pheno colnames(pheno)[1:20]
## [1] "patientID" "years_to_birth"
## [3] "vital_status" "days_to_death"
## [5] "days_to_last_followup" "tumor_tissue_site"
## [7] "pathologic_stage" "pathology_T_stage"
## [9] "pathology_N_stage" "pathology_M_stage"
## [11] "gender" "date_of_initial_pathologic_diagnosis"
## [13] "days_to_last_known_alive" "radiation_therapy"
## [15] "histological_type" "number_of_lymph_nodes"
## [17] "race" "ethnicity"
## [19] "admin.bcr" "admin.day_of_dcc_upload"
head(pheno[,1:5])
## DataFrame with 6 rows and 5 columns
## patientID years_to_birth vital_status days_to_death days_to_last_followup
## <character> <integer> <integer> <integer> <integer>
## TCGA-A1-A0SB TCGA-A1-A0SB 70 0 NA 259
## TCGA-A1-A0SD TCGA-A1-A0SD 59 0 NA 437
## TCGA-A1-A0SE TCGA-A1-A0SE 56 0 NA 1321
## TCGA-A1-A0SF TCGA-A1-A0SF 54 0 NA 1463
## TCGA-A1-A0SG TCGA-A1-A0SG 61 0 NA 434
## TCGA-A1-A0SH TCGA-A1-A0SH 39 0 NA 1437
Let’s confirm that samples are in the same order in the xpr
matrix and in the metadata table pheno
.
<- intersect(pheno$patientID, colnames(xpr))
ids_in_both <- pheno[which(pheno$patientID %in% ids_in_both),]
pheno <- xpr[,which(colnames(xpr) %in% ids_in_both)]
xpr
if (all.equal(pheno$patientID, colnames(xpr))!=TRUE) {
<- match(pheno$patientID, colnames(xpr))
midx <- xpr[,midx]
xpr }
Always check that samples are in the same order in the tables you are comparing, and use match()
to reorder them if necessary. If they are not in the same order, you are matching data to the wrong sample, and your results will be wrong.
This table has many columns. Let’s just keep those related to tumour type according to the PAM50 classification. What tumour types do we have in this dataset?
<- pheno[,c("patientID","PAM50.mRNA")]
pheno table(pheno$PAM50.mRNA)
##
## Basal-like HER2-enriched Luminal A Luminal B Normal-like
## 90 55 216 117 7
Let’s limit the comparison to Luminal A
and Basal
type tumours.
<- which(pheno$PAM50.mRNA %in% c("Luminal A","Basal-like"))
idx <- pheno[idx,]
pheno <- xpr[,idx]
xpr
dim(pheno)
## [1] 306 2
dim(xpr)
## [1] 20502 306
Differential expression analysis with edgeR
Now that the data are prepared, let’s created a DGEList
object (DGE stands for “Differential Gene Expression”). This object is what we will use for our differential expression analysis.
Note: Make phenotype of interest categorical. In R that means converting to a factor
type with categorical levels
. You can think of levels as ordinal representations (e.g., first level = 1, second = 2, etc., )
If levels=
are not set, the default uses alphabetical order. We recommend explicitly setting levels so that there are no assumptions.
Load the edgeR
package:
suppressMessages(require(edgeR))
Let’s create a DGEList
object for the differential expression analysis. Note that group
must be a categorical variable (use factor()
to convert it to one):
<- factor(
group $PAM50.mRNA,
phenolevels=c("Luminal A","Basal-like")
)<- DGEList(
dge counts = xpr,
group = group
)
Remove low-count genes: To filter low count genes, we’re going to use a normalized count measure called cpm
(counts per million). We are going to keep genes with 100 or greater counts per million for at least two samples:
$counts[1:6,1:20]) (dge
## TCGA-A1-A0SD TCGA-A1-A0SE TCGA-A1-A0SH TCGA-A1-A0SJ TCGA-A1-A0SK TCGA-A1-A0SO TCGA-A2-A04N
## A1BG 546 1341 1126 1433 626 2502 592
## A1CF 0 0 1 0 1 4 1
## A2BP1 1 2 4 1 1 0 0
## A2LD1 331 498 838 831 2948 883 487
## A2ML1 144 114 127 119 367 335 87
## A2M 107181 101192 81857 128591 27972 32286 35748
## TCGA-A2-A04P TCGA-A2-A04Q TCGA-A2-A04T TCGA-A2-A04U TCGA-A2-A04V TCGA-A2-A04Y TCGA-A2-A0CM
## A1BG 503 660 485 1269 870 1392 297
## A1CF 4 1 3 5 2 1 0
## A2BP1 2 1 11 0 7 18 0
## A2LD1 384 694 534 200 454 932 161
## A2ML1 7435 13543 89424 3088 160 250 4770
## A2M 38122 70012 76814 178113 50336 75809 37519
## TCGA-A2-A0CP TCGA-A2-A0CQ TCGA-A2-A0CS TCGA-A2-A0CU TCGA-A2-A0CV TCGA-A2-A0CZ
## A1BG 1240 919 1422 688 532 1628
## A1CF 0 1 0 0 0 4
## A2BP1 2 1 8 1 6 3
## A2LD1 335 111 589 329 458 468
## A2ML1 76 217 65 168 162 102
## A2M 76407 45968 18353 197374 85242 85978
Look at counts per million using cpm
:
cpm(dge)[1:5,1:5]
## TCGA-A1-A0SD TCGA-A1-A0SE TCGA-A1-A0SH TCGA-A1-A0SJ TCGA-A1-A0SK
## A1BG 6.40538988 10.40014782 7.874879329 11.979028558 4.544931080
## A1CF 0.00000000 0.00000000 0.006993676 0.000000000 0.007260273
## A2BP1 0.01173148 0.01551103 0.027974705 0.008359406 0.007260273
## A2LD1 3.88312097 3.86224729 5.860700602 6.946666247 21.403285660
## A2ML1 1.68933359 0.88412890 0.888196869 0.994769294 2.664520298
This next line is a bit complex so let’s unpack it:
- We are using
cpm(dge)>100
as a logical test (“which genes have cpm > 100?”). - For each gene, we want that test to be true for at least two samples. For this we use
rowSums()
to add up how many samples meet that criteria.
dim(dge) #before
## [1] 20502 306
<- rowSums(cpm(dge)>100) >= 2
tokeep <- dge[tokeep,keep.lib.sizes = FALSE]
dge dim(dge) #after
## [1] 7699 306
Normalize the data:
<- calcNormFactors(dge) dge
Visualize the data:
plotMDS(
dge, col=as.numeric(dge$samples$group),
pch=16
)legend(
"bottomleft",
as.character(unique(dge$samples$group)),
col=c(1,2), pch=16
)
Let’s create a model design to identify genes with a group
effect:
<- dge$samples$group
group <- model.matrix(~group) mod
Estimate variation (“dispersion”) for each gene:
<- estimateDisp(dge, mod) dge
Call differentially expressed genes.
Here we:
- fit a model for each gene, using
glmFit
- we have built in an estimate of gene-wise dispersion to better identify treatment effect (or “contrast”)
- for each gene, we run a likelihood ratio test which compares which model fits the data better: a null model (treatment effect = 0) or a full model (treatment effect is non-zero)
Note that coef=2
fetches the effects for the treatment effect; coef=1
would fetch effects of the intercept term.
<- glmFit(dge,mod)
fit <- glmLRT(fit, coef = 2) # get coefficients for group term diffEx
Look at the top 10 differentially expressed genes:
<- topTags(diffEx, n=10)
tt tt
## Coefficient: groupBasal-like
## logFC logCPM LR PValue FDR
## ART3 4.437401 3.455108 795.7399 4.552774e-175 3.505180e-171
## EN1 4.493149 4.469186 568.8034 1.022382e-125 3.935658e-122
## MTHFD1L 2.072720 4.744988 548.6874 2.429549e-121 6.235032e-118
## CLDN6 5.917771 1.735389 528.0710 7.423651e-117 1.428867e-113
## CENPW 2.628365 3.806891 523.0933 8.985654e-116 1.383611e-112
## IL12RB2 3.981078 3.080989 507.2309 2.539169e-112 3.258177e-109
## TTLL4 1.857331 5.441088 504.0036 1.279029e-111 1.406749e-108
## CXorf49B 6.007052 1.678586 484.6339 2.096210e-107 2.017340e-104
## GFRA3 4.890609 1.725042 468.7621 5.958600e-104 5.097251e-101
## HAPLN3 2.737617 5.215775 460.3033 4.129466e-102 3.179276e-99
For the next steps we’re going to need stats on all the genes we’ve tested. So let’s get those:
<- as.data.frame(
tt topTags(diffEx, n=nrow(dge)
) )
A QQplot directly compares the pvalues from our statistical tests to the expected values from a random uniform distribution (p-value selected at random).
A deviation from the x=y line (diagonal) towards the top indicates an enrichment of signal.
qqplot(
$PValue,
ttrunif(nrow(tt)), # randomly sample from uniform distribution
xlab="p-values from real data",
ylab="Randomly-sampled values from Uniform distribution",
pch=16,cex=0.5
)
# x=y line as reference
abline(0,1,col="red")
Now let’s call differentially expressed genes using the decideTestDGE()
function and use summary()
to see how many genes are upregulated (value +1
), downregulated (value -1
) and not called as changed (value 0
)
<- decideTestsDGE(diffEx,
diffEx2 adjust.method="BH",
p.value=0.05
)summary(diffEx2)
## groupBasal-like
## Down 2813
## NotSig 1681
## Up 3205
A volcano plot can help visualize effect magnitude - log2 fold-change or log2FC
in the table ` against the corresponding p-value. Here we create a volcano plot, and colour-code upregulated genes in red, and downregulated genes in blue.
Note that we are combining two different tables, tt
and diffEx2
so we need to ensure the order is the same. Otherwise the colours will be in the wrong order (try it for yourself!).
<- match(rownames(tt), rownames(diffEx2))
midx <- diffEx2[midx,]
diffEx2
<- rep("black",nrow(diffEx2))
cols which(diffEx2>0 )] <- "red"
cols[which(diffEx2<0)] <- "blue"
cols[# volcano plot
plot(tt$logFC,-log10(tt$PValue),pch=16,
col=cols)
abline(v=0,lty=3)
Finally we can write our differential expression results out to file:
write.table(tt,file="diffEx.results.txt",
sep="\t",
col=TRUE,
row=TRUE,
quote=FALSE
)
Exercise
- Install the
yeastRNASeq
package from Bioconductor andlibrary
it into your environment - Import the geneLevelData using:
data("geneLevelData")
- Learn about this data and then put it through the same workflow we just did for the breast cancer:
- Create a new
DGEList
object with your gene counts - Filter genes with CPM > 25 in at least two samples
- Normalize and plot your data
- Create a model matrix for analysis
- Fit your model
- How many significantly up-regulated genes are there at the 5% FDR level? How many significantly down-regulated genes? How many in total
- Create a volcano plot
- Bonus: Create a histogram of p-values. Is there a signal?
Is there anything about the data that might make you question the results?