Module 11 Lab 2: Pathway-level features

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. This means that you are able to copy, share and modify the work, as long as the result is distributed under the same license.

Introduction

In this example, we will learn how to:

  • Use custom similarity metrics
  • Group variables into biologically-meaningful units such as pathways, for improved interpretability

Here, we will again use breast cancer samples, and discriminate between Luminal A and other tumours. We will limit ourselves to clinical data and gene expression data but will make the following design changes:

  • Clinical variables: A feature (or patient similarity network;PSN) will be defined at the level of selected variables (e.g. age); we will define similarity as normalized difference.
  • Gene expression: Features will be defined at the level of pathways; i.e. each feature groups only those genes corresponding to the pathway. Similarity is defined as pairwise Pearson correlation.
Lab 2 design: We will integrate clinical and gene expression data. Each layer will be converted into a single patient similarity network using Pearson correlation for pairwise similarity.

Figure 3: Lab 2 design: We will integrate clinical and gene expression data. Each layer will be converted into a single patient similarity network using Pearson correlation for pairwise similarity.

  1. In practice I recommend running a predictor design with 2-3 different sets of pathway definitions, and comparing the predictive pathway themes. For instance, it could be useful to compare results from using all curated pathways, to just domain-specific ones.
    Feature design is something of an art, and the choice of pathways depends on what your goals with building the predictor are. Are you looking to prioritize a known set of biological processes or interested in general discovery? These are tradeoffs. At the very least I would recommend running with all curated pathways as a baseline, because you may generate novel hypotheses.

  2. Note that while in this design we group gene expression measures into pathways, the same design can be used to group other types of data based on prior knowledge. For instance, measures from imaging data could be grouped by prior knowledge of correlated networks of regions of interest subserving specific functions.

Get and prepare data

Let’s fetch the BRCA data using curatedTCGAData again, this time only the gene expression data. Remember, the clinical data we automatically get in the colData() slot returned.

suppressMessages(library(curatedTCGAData))
brca <- suppressMessages(
   curatedTCGAData(
       "BRCA",c("mRNAArray"),
       dry.run=FALSE)
    )

Let’s look at the data. Notice that we now only have one -omic assay, gene expression measures from microarrays.

brca
## A MultiAssayExperiment object of 1 listed
##  experiment with a user-defined name and respective class.
##  Containing an ExperimentList class object of length 1:
##  [1] BRCA_mRNAArray-20160128: SummarizedExperiment with 17814 rows and 590 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save all data to files

As before, we prepare the data. I highly recommend separating the script that prepares the data from the one running the predictor for improved management, readability and debuggability.

source("prepare_data.R")
brca <- prepareDataForCBW(brca, setBinary=TRUE)
## harmonizing input:
##   removing 44 sampleMap rows with 'colname' not in colnames of experiments

Create feature design rules

Load the netDx package and initialize the groupList object, where we will store our grouping rules. Recall that groupList is a list-of-lists, with the top tier containing data layer names, and that the layer names must match names(assays(brca)) or whatever your MultiAssayExperiment object is called.

suppressWarnings(suppressMessages(require(netDx)))
groupList <- list()

RNA: Pathway features

Let’s group genes into pathway-level features, i.e. instead of one PSN for transcriptomic data, we create one PSN for each pathway. So if you had a pathway set with 2,000 curated pathways, this would generate 2,000 input PSN.

This design changes model-building time to several hours, so avoid large gene sets (e.g. the full set of ~44,000 Gene Ontology terms, or even ~29,000 GO Biological Process terms). A reasonable start is a compilation of pathways from all curated pathway databases, as in below. Whichever list you use can be pruned by constraining the min/max number of genes in a set, but the size is something to keep in mind.

There are two ways of providing pathway data to netDx: 1. You can download a compilation of pathways from curated databases using the fetchPathwayDefinitions() function in netDx, like so:

x <- fetchPathwayDefinitions("March",2021)
## Fetching http://download.baderlab.org/EM_Genesets/March_01_2021/Human/symbol/Human_AllPathways_March_01_2021_symbol.gmt
x
##                                                                     BFC2 
## "~/.cache/netDx/20e22fee64ce_Human_AllPathways_March_01_2021_symbol.gmt"

The above pathway set was downloaded from download.baderlab.org/EM_Genesets, which is a good resource for routinely-updated curated pathway definitions. Pathways are compiled from Reactome, Panther, NCI, MSigDB, etc.,,5 and data is contained in GMT format, a common format to represent gene-sets, such as pathways. You can see from the value of x that the file isn’t downloaded to our working directory, but rather is stored in a BioConductor-specific location for files (file cache), using the BiocFileCache class. This way if we rerun the script, BioConductor will only download the file again if it has changed since our last download.

Here is an example of the GMT format:

Lab 2: Example of GMT file format.

Figure 4: Lab 2: Example of GMT file format.

We then use readPathways() to read the pathways into a list format to provide the predictor with. So the full call looks like this:

pathList <- readPathways(fetchPathwayDefinitions("March",2021))
## ---------------------------------------
## Fetching http://download.baderlab.org/EM_Genesets/March_01_2021/Human/symbol/Human_AllPathways_March_01_2021_symbol.gmt
## File: 20e22fee64ce_Human_AllPathways_March_01_2021_symbol.gmt
## Warning in scan(f, what = "character", nlines = 1,
## quiet = TRUE, sep = "\t"): EOF within quoted string
## Read 1123 pathways in total, internal list has 1110 entries
##  FILTER: sets with num genes in [10, 200]
##    => 341 pathways excluded
##    => 769 left
head(pathList)
## $STEARATE_BIOSYNTHESIS_I__ANIMALS_
##  [1] "ELOVL1"  "ACOT7"   "ACSL1"   "ACSL5"   "ELOVL6" 
##  [6] "ACSL4"   "ACSL3"   "ACOT2"   "ACOT1"   "ACSBG1" 
## [11] "ACSBG2"  "SLC27A2" "ACOT4"  
## 
## $SUPERPATHWAY_OF_INOSITOL_PHOSPHATE_COMPOUNDS
##  [1] "PI4K2B"  "MTMR14"  "INPPL1"  "PTEN"    "PIK3CD" 
##  [6] "PIK3C2G" "PIK3CB"  "PIK3C2A" "PIK3CG"  "PIK3C2B"
## [11] "PIP4P2"  "PIP4P1"  "PLCZ1"   "PPIP5K1" "PPIP5K2"
## [16] "PIP5KL1" "PLCE1"   "PIP4K2A" "PIP4K2B" "PIP4K2C"
## [21] "SACM1L"  "ITPK1"   "IPMK"    "OCRL"    "ITPKB"  
## [26] "ITPKC"   "MINPP1"  "PLCB3"   "PLCB4"   "PIK3CA" 
## [31] "ITPKA"   "PIK3C3"  "PLCB1"   "PLCB2"   "PI4K2A" 
## [36] "IPPK"    "MTMR3"   "PIK3R4"  "PIK3R3"  "PIK3R2" 
## [41] "PIK3R1"  "PIK3R6"  "PIK3R5"  "INPP5B"  "INPP5A" 
## [46] "INPP5D"  "PLCG2"   "INPP5J"  "PIP5K1A" "INPP5K" 
## [51] "PIP5K1B" "PIP5K1C" "PLCG1"   "CDIPT"   "IP6K1"  
## [56] "IP6K3"   "IP6K2"   "SYNJ2"   "FIG4"    "PIKFYVE"
## [61] "SYNJ1"   "PI4KA"   "PLCH1"   "PI4KB"   "PLCH2"  
## [66] "PLCD3"   "PLCD4"   "PLCD1"  
## 
## $PUTRESCINE_DEGRADATION_III
##  [1] "ALDH3A2" "ALDH3B2" "ALDH3A1" "ALDH1B1" "MAOB"   
##  [6] "ALDH2"   "MAOA"    "ALDH3B1" "SAT2"    "SAT1"   
## 
## $TRYPTOPHAN_DEGRADATION_III__EUKARYOTIC_
##  [1] "ACAT1" "HADHB" "GCDH"  "TDO2"  "KYNU"  "HAAO" 
##  [7] "AFMID" "KMO"   "ACAA1" "ACAT2" "ACMSD"
## 
## $MEVALONATE_PATHWAY_I
##  [1] "ACAT1"  "IDI1"   "MVK"    "HMGCS1" "IDI2"  
##  [6] "HADHB"  "PMVK"   "MVD"    "HMGCR"  "HMGCS2"
## [11] "ACAA1"  "ACAT2" 
## 
## $`D-_I_MYO__I_-INOSITOL-5-PHOSPHATE_METABOLISM`
##  [1] "PLCG1"   "PLCB3"   "PLCB4"   "PLCH1"   "PLCH2"  
##  [6] "PLCB1"   "MTMR14"  "PLCD3"   "PLCB2"   "PLCD4"  
## [11] "PLCD1"   "MTMR3"   "PIP4P2"  "PIP4P1"  "PLCZ1"  
## [16] "PLCG2"   "PLCE1"   "PIP4K2A" "PIP4K2C"
  1. Alternatively, you can also provide a custom pathway set to netDx by reading in a GMT file, using the readPathways() function. In this example, I’ve downloaded a geneset of pathways often dysregulated in cancer, set C6 from MSigDB:6
gmtFile <- sprintf("%s2/cancer_pathways.gmt",tempdir())
if (!file.exists(sprintf("%s2",tempdir()))) {
    dir.create(sprintf("%s2",tempdir()))
}
download.file("https://raw.githubusercontent.com/RealPaiLab/CBW_CAN_DataIntegration_2021/master/supporting_files/c6.all.v7.4.symbols.gmt",gmtFile)
x <- readPathways(gmtFile)
## ---------------------------------------
## File: cancer_pathways.gmt
## Read 189 pathways in total, internal list has 189 entries
##  FILTER: sets with num genes in [10, 200]
##    => 14 pathways excluded
##    => 175 left
x[1:3]
## $GLI1_UP.V1_DN
##  [1] "CACUL1"  "CCNL2"   "YIPF1"   "GTDC1"   "OPN3"   
##  [6] "SLC22A1" "RNF10"   "RUNX1"   "SEPTIN9" "CD2AP"  
## [11] "ACSL6"   "ZMYND11" "FOLR1"   "PPIL1"   "ESYT1"  
## [16] "BTBD16"  "SPECC1" 
## 
## $GLI1_UP.V1_UP
##  [1] "IGFBP6"  "CCND2"   "PTH1R"   "TXNDC12" "M6PR"   
##  [6] "PPT2"    "STAU1"   "JCHAIN"  "TMOD3"   "SAR1A"  
## [11] "ZBTB16"  "YBX1"    "SGTA"    "WEE1"    "TRIM23" 
## [16] "BMERB1"  "NPHP1"   "IFT74"   "PRKD3"   "PLPP6"  
## [21] "FEM1A"   "LMO7"    "C7orf50" "ZC3H18" 
## 
## $GCNP_SHH_UP_EARLY.V1_DN
##   [1] "SRRM2"    "KIAA1217" "DEFA1"    "DLK1"    
##   [5] "PITX2"    "CCL2"     "UPK3B"    "SEZ6"    
##   [9] "TAF15"    "EMP1"     "SH3BGR"   "CCR2"    
##  [13] "VAV1"     "LGALS9"   "WIPF1"    "LYST"    
##  [17] "RRAGD"    "CPT1A"    "FAM241A"  "MAN2B1"  
##  [21] "WWC2"     "TXLNG"    "CNGA2"    "NPPB"    
##  [25] "MYO5B"    "XIAP"     "HOXB9"    "TASOR"   
##  [29] "PEX7"     "SOX18"    "CCRL2"    "NR6A1"   
##  [33] "CITED2"   "LTBP3"    "EEPD1"    "SIM1"    
##  [37] "CRABP2"   "PPFIBP2"  "CYBA"     "RAMP1"   
##  [41] "TM4SF1"   "LIN9"     "ASPH"     "UBTF"    
##  [45] "PIGT"     "EOMES"    "GBP7"     "MERTK"   
##  [49] "MAP3K12"  "IL1R1"    "HAS2"     "SLC43A3" 
##  [53] "CBX1"     "GLB1L"    "PHLDB2"   "GPR19"   
##  [57] "GUSB"     "CRYGS"    "DBP"      "VPS54"   
##  [61] "FRMD5"    "DCAF8"    "BTD"      "ADCY9"   
##  [65] "SIX4"     "PLA2G10"  "SLC6A4"   "GCSAM"   
##  [69] "CBFA2T3"  "GABRA1"   "S100A9"   "LAT2"    
##  [73] "SMYD5"    "MX2"      "AKR1C1"   "PLK2"    
##  [77] "PDE1B"    "ID2"      "HOXB3"    "GAPDHS"  
##  [81] "IER2"     "DDB1"     "ADAM7"    "ETS1"    
##  [85] "CDK5RAP3" "POLD4"    "ITGA4"    "FOXA1"   
##  [89] "PEPD"     "AMOTL2"   "CALCR"    "BPIFB1"  
##  [93] "MAF"      "EGR1"     "CYP27B1"  "URI1"    
##  [97] "HTRA3"    "LCP1"     "ACSS1"    "SETDB1"  
## [101] "BTC"      "NOTCH1"   "NR5A2"    "LRRC57"  
## [105] "TM2D1"    "RABAC1"   "DPPA4"    "CNN1"    
## [109] "JPH3"     "MAP4K4"   "CD38"     "SLC22A6" 
## [113] "MIPOL1"   "NLRX1"    "GDF3"     "ZNF436"  
## [117] "TIRAP"    "CFB"      "MST1R"    "NUB1"    
## [121] "H2BC21"   "IFNG"     "UROS"     "POPDC3"  
## [125] "ERO1B"    "GRIN1"    "PPY"      "HES1"    
## [129] "UIMC1"    "CAMLG"    "LRRC23"   "SLC7A3"  
## [133] "C4BPA"    "PPP2R3C"  "CDK5"     "MGP"     
## [137] "STS"      "PCBD2"    "IRF3"     "RAG2"    
## [141] "GABRD"    "ANKRD44"  "TERF2IP"  "CCL27"   
## [145] "LCLAT1"   "MIIP"     "TMPO"     "AGTR2"   
## [149] "ELMOD3"   "GNA14"    "JAM3"     "RSPO2"   
## [153] "CRYBG1"   "PTGER3"   "LSR"      "NSUN3"   
## [157] "MAP2K1"   "TMEM160"  "ZNF518B"  "NR4A2"   
## [161] "PCP2"     "TMEM176A" "DDX3Y"    "MYRF"    
## [165] "LAMB3"    "PLA2G4A"  "HTR1A"

For this tutorial we will limit ourselves to the first set of pathways and use groupList to tell netDx to group transcriptomic data using pathways.

The pathway definition file should use the same identifier type as your patient data. For instance, if the genes in your transcriptomic data are represented using HGNC symbols, then your pathway definition file must also use HGNC symbols (e.g. ID2S), and not a different type of identifier, such as Ensembl IDs (which look like this: ENSG00000010404).

groupList[["BRCA_mRNAArray-20160128"]] <- pathList[1:3] 

Clinical: Single variables

Models often include clinical variables such as demographic or disease-related features such as age, sex, or treatment regimen. In this example, we take two variables from the sample metadata, and include create one PSN for each variable.

So here, we map the following:

  • patient.age_at_initial_pathologic_diagnosis to age (just shorter)
  • stage to STAGE

These variables must be present in the colData() slot:

pheno <- colData(brca)
head(pheno[,c("patient.age_at_initial_pathologic_diagnosis","STAGE")])
## DataFrame with 6 rows and 2 columns
##              patient.age_at_initial_pathologic_diagnosis
##                                                <integer>
## TCGA-A1-A0SD                                          59
## TCGA-A1-A0SE                                          56
## TCGA-A1-A0SH                                          39
## TCGA-A1-A0SJ                                          39
## TCGA-A1-A0SK                                          54
## TCGA-A1-A0SO                                          67
##                  STAGE
##              <integer>
## TCGA-A1-A0SD         2
## TCGA-A1-A0SE         1
## TCGA-A1-A0SH         2
## TCGA-A1-A0SJ         3
## TCGA-A1-A0SK         2
## TCGA-A1-A0SO         2

We now add the entry into groupList. The entry for clinical is special because netDx will look for corresponding variables in the sample metadata table, colData(), rather than looking for “clinical” within assays(brca).

groupList[["clinical"]] <- list(
      age="patient.age_at_initial_pathologic_diagnosis",
       stage="STAGE"
)

We now tell netDx what similarity metric to use

This is makeNets() just as we had seen in the previous exercise, which provides netDx with a custom function to generate similarity networks (i.e. features). We previously used the following code to create PSN based on Pearson correlation: makePSN_NamedMatrix(..., writeProfiles=TRUE,...)

We will now make a different call to makePSN_NamedMatrix() but this time, requesting the use of the normalized difference similarity metric.

This is achieved by calling the following code: makePSN_NamedMatrix(,..., simMetric="custom", customFunc=normDiff,writeProfiles=FALSE)

normDiff is a function provided in the netDx package, but the user may define custom similarity functions in this block of code and pass those to makePSN_NamedMatrix(), using the customFunc parameters; additionally set simMetric to custom.

Other presets provided by netDx include:

  • sim.pearscale: Pearson correlation followed by exponential scaling; used with a vector
  • sim.eucscale: Euclidean disance followed by exponential scaling; used with a vector
  • normDiff: Normalized difference; used with a single variable such as age
  • avgNormDiff: Average normalized difference; used with a vector

When we’re done, this is what the makeNets function look like.

Note: I realize this may be complicated. Future versions of netDx will simplify this syntax.

makeNets <- function(dataList, groupList, netDir,...) {
    netList <- c() 

    # make RNA nets (Pearson correlation)
    rna <- "BRCA_mRNAArray-20160128"
    if (!is.null(groupList[[rna]])) { ## REMEMBER TO CHECK FOR NULL
        netList <- makePSN_NamedMatrix(
            dataList[[rna]],
            rownames(dataList[[rna]]),
            groupList[[rna]],
            netDir,
            verbose=FALSE,
            writeProfiles=TRUE,         ## define Pearson similarity as before
            ...) 
    }
    
    # make clinical nets (normalized difference)
    netList2 <- c()
    if (!is.null(groupList[["clinical"]])) {
    netList2 <- makePSN_NamedMatrix(
        dataList$clinical, 
        rownames(dataList$clinical),
        groupList[["clinical"]],netDir,
        simMetric="custom",customFunc=normDiff, ### Notice simMetric & customFunc
        writeProfiles=FALSE,
        sparsify=TRUE,
        verbose=FALSE,
        ...)
    }
    netList <- c(unlist(netList),unlist(netList2))
    return(netList)
}

Build predictor

Finally! We have:

  • prepared our data,
  • grouped RNA by pathways (fetchPathwayDefinitions(), readPathways()),
  • created two PSN using clinical variables (groupList$clinical), and
  • defined our similarity metrics (makeNets()).

Now we build our predictors. For this tutorial, we use two train/test splits. For each split, we assign features a score between zero and two, and call features with score of 1+ “feature-selected”.

Realistic parameters: When running this with your project, reasonable values are numSplits=10L (10 is a good start, just to see if you get signal, ~100 if firming up for publication), featScoreMax=10L, featSelCutoff=9L.

set.seed(42) # make results reproducible
outDir <- paste(sprintf("%s2",tempdir()),"pred_output",sep=getFileSep()) # use absolute path
if (file.exists(outDir)) unlink(outDir,recursive=TRUE)
numSplits <- 10L
##model <- suppressMessages(
##   buildPredictor(
##     dataList=brca,
##     groupList=groupList,
##     makeNetFunc=makeNets,
##     outDir=outDir, 
##     numSplits=numSplits, 
##     featScoreMax=10L, 
##     featSelCutoff=9L,
##     numCores=8L
##     )
##)

Let’s load the results:

outFile <- sprintf("%s2/CBW_Lab2_full.rda",tempdir())
download.file("https://github.com/RealPaiLab/CBW_CAN_DataIntegration_2021/raw/master/supporting_files/Lab2_files/brca_binary_pathways.rda",
    destfile=outFile)
lnames <- load(outFile)

We can see the objects in the file using lnames().

Examine results

As before, we get model results, using getResults() from our helper script, helper.R.

source("helper.R")
results <- getResults(brca,model_full,
    featureSelCutoff=9L,
    featureSelPct=0.9)
## Detected 10 splits and 2 classes
## * Plotting performance
## * Compiling feature scores and calling selected features

As this is binary classification, we get ROC and precision-recall curves (bottom panel). The average AUROC and AUPR are shown in the top panel. Let’s take a look at the average performance:

perf <- results$performance
round(mean(perf$splitAUROC),2)*100
## [1] 99
round(mean(perf$splitAUPR),2)*100
## [1] 97
round(mean(perf$splitAccuracy),2)*100
## [1] 86

Notice that although the AUROC is near perfect, the accuracy is lower, at around 86%. It is important to consider several measures of predictor performance to understand the behaviour of the predictor.

Let’s examine our confusion matrix:

confMat <- confusionMatrix(model_full)

Note: Rows of this matrix don’t add up to 100% because the matrix is an average of the confusion matrices from all of the train/test splits.

We see that the predictor performs perfectly at classifying Luminal A tumours but does worse at classifying the residual, mainly because it often classifies other tumours as Luminal A!

Particularly when there is class imbalance, i.e. one class has several-fold the number of samples than the other, predictors can achieve a baseline high accuracy by “cheating” and simply predicting all samples as being of the dominating label. e.g. if you have 99:1 class imbalance of two classes A and B, the predictor can achieve 99% accuracy simply by calling all samples of type “A”!

In practice, class imbalance should be handled using suitable performance evaluation metrics and sampling proportionally for training/test sets.

We indeed have a class imbalance in this dataset:

table(colData(brca)$STATUS,useNA="always")
## 
## Luminal.A     other      <NA> 
##       230        94         0

This function returns the confusion matrix for each split, as well as the average shown in the image above:

summary(confMat)
##                     Length Class  Mode   
## splitWiseConfMatrix 10     -none- list   
## average              4     table  numeric

Visualize top pathways in Cytoscape

We will now visualize top-scoring pathways using the Cytoscape visualization EnrichmentMap. Recall from Module 8 that an EnrichmentMap is a network visualization of related gene-sets, where each node is a pathway and edges connect similar pathways.

Let us say we only want to see pathways that scored in some target range for “most” of the trials; i.e. consistently high-scoring pathways. We quantify this by asking only for features that score [EMapMinScore,EMapMaxScore] for EMapPctPass fraction of trials.

Here we ask for features scoring 7+ out of 10 for at least 70% of the train/test splits.

emap <- makeInputForEnrichmentMap (
    model=model_full,
    results=results,
    pathwayList=pathList,
    EMapMinScore=7L, 
    EMapMaxSore=10L,
    EMapPctPass=0.7,
    outDir="~/workspace"
)
## * Creating input files for EnrichmentMap
## * Writing files for network visualization

This call will return paths to the output files, which you need to now download from AWS to your personal computer:

emap
## $GMTfiles
## $GMTfiles$Luminal.A
## [1] "~/workspace/Luminal.A.gmt"
## 
## $GMTfiles$other
## [1] "~/workspace/other.gmt"
## 
## 
## $NodeStyles
## $NodeStyles$Luminal.A
## [1] "~/workspace/Luminal.A_nodeAttrs.txt"
## 
## $NodeStyles$other
## [1] "~/workspace/other_nodeAttrs.txt"

netDx can directly generate EnrichmentMaps on a locally-installed version of Cytosape using RCy37. RCy3 allows programmatic control of Cytoscape from within R, so that visualizations such as EnrichmentMap can be programmatically created. This step needs a local copy of Cytoscape installed, so it won’t work on the lab AWS instance. Try it on your laptop after the lab.

###plotEmap(
### emap$GMTfiles[[1]],
### emap$NodeStyles[[1]],
### groupClusters=TRUE, 
### hideNodeLabels=TRUE
### )

If you wanted to look at an EnrichmentMap anyway, you can download the input files shown in the output of the emap object above by opening your web browser to: http://your-student-instance and downloading the files from there. Then build an Enrichment map using Cytoscape on your laptop as discussed earlier in the Cancer Analysis workshop.

When you’re done, the Enrichment Map for top pathways predictive of Luminal A status would look like this (below). The speech bubbles show pathway names within a couple top clusters.

Lab 2: EnrichmentMap for top-scoring pathways predictive of Luminal A status. Nodes show pathways scoring 7+ out of 10 in over 70% of train/test splits. Edges connect pathways with shared genes. Yellow bubbles show pathway clusters, labelled by AutoAnnotate [@Kucera2016-yj]. Node fills indicate top score in 70%+ of the splits. Speech bubbles show pathways in example clusters.

Figure 5: Lab 2: EnrichmentMap for top-scoring pathways predictive of Luminal A status. Nodes show pathways scoring 7+ out of 10 in over 70% of train/test splits. Edges connect pathways with shared genes. Yellow bubbles show pathway clusters, labelled by AutoAnnotate.8 Node fills indicate top score in 70%+ of the splits. Speech bubbles show pathways in example clusters.

Optional: You can also download the Cytoscape session file here to see the finished result.

In practice, it can take a good portion of an hour to adjust the layout of the EnrichmentMap, and often longer to explore the contents. The automatically-generated pathway theme labels are often a reasonable first guess, but you’ll find you’ll often revise them upon closer inspection of the pathways within.

We notice that the Enrichment Map identifies several pathways known to be dysregulated in Luminal A breast tumours, including estrogen receptor-mediating signaling (PLASMA_MEMBRANE_ESTROGEN_RESPONSE_SIGNALING), p38 signaling, pathways related to genomic stability and DNA damage response (ATM), cell profileration-related signaling (ErbB signaling).

Outlook

Where do you go from here? That depends on your study goals and prior knowledge. Here are some possibilities:

  • You may choose to focus on particular pathways for which you have targeting tools, for follow-up in vitro or in vivo experiments.
  • You may want to improve performance, in which case consider changing predictor design, including features, predictor settings, or pathway set choices. Note: This is usually a good idea.
  • Once you’re satisfied with predictor design, you may want to validate on an independent dataset. This is something we haven’t covered in this lab.

Machine learning is very much trial-and-error, and you have to be prepared to try a variety of designs to see what works.

Future versions of netDx will include tools to help with some of this automation & also digging deeper into predictive features such as pathways, to understand what components contribute to predictive power.

That’s it! This completes our lab exercises for Module 11.

sessionInfo

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=C                
##  [2] LC_NUMERIC=C              
##  [3] LC_TIME=C                 
##  [4] LC_COLLATE=C              
##  [5] LC_MONETARY=C             
##  [6] LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8      
##  [8] LC_NAME=C                 
##  [9] LC_ADDRESS=C              
## [10] LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8
## [12] LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices
## [6] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotrix_3.8-1              
##  [2] ggplot2_3.3.3              
##  [3] Rtsne_0.15                 
##  [4] netDx_1.2.3                
##  [5] bigmemory_4.5.36           
##  [6] rhdf5_2.34.0               
##  [7] curatedTCGAData_1.12.1     
##  [8] MultiAssayExperiment_1.16.0
##  [9] SummarizedExperiment_1.20.0
## [10] Biobase_2.50.0             
## [11] GenomicRanges_1.42.0       
## [12] GenomeInfoDb_1.26.7        
## [13] IRanges_2.24.1             
## [14] S4Vectors_0.28.1           
## [15] BiocGenerics_0.36.1        
## [16] MatrixGenerics_1.2.1       
## [17] matrixStats_0.58.0         
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.1                   
##   [2] R.utils_2.10.1               
##   [3] tidyselect_1.1.1             
##   [4] RSQLite_2.2.7                
##   [5] AnnotationDbi_1.52.0         
##   [6] grid_4.0.5                   
##   [7] combinat_0.0-8               
##   [8] BiocParallel_1.24.1          
##   [9] RNeXML_2.4.5                 
##  [10] munsell_0.5.0                
##  [11] codetools_0.2-18             
##  [12] withr_2.4.2                  
##  [13] colorspace_2.0-1             
##  [14] highr_0.9                    
##  [15] knitr_1.33                   
##  [16] uuid_0.1-4                   
##  [17] zinbwave_1.12.0              
##  [18] rstudioapi_0.13              
##  [19] SingleCellExperiment_1.12.0  
##  [20] ROCR_1.0-11                  
##  [21] NMF_0.23.0                   
##  [22] labeling_0.4.2               
##  [23] GenomeInfoDbData_1.2.4       
##  [24] farver_2.1.0                 
##  [25] bit64_4.0.5                  
##  [26] vctrs_0.3.8                  
##  [27] generics_0.1.0               
##  [28] xfun_0.23                    
##  [29] BiocFileCache_1.14.0         
##  [30] R6_2.5.0                     
##  [31] doParallel_1.0.16            
##  [32] ggbeeswarm_0.6.0             
##  [33] netSmooth_1.10.0             
##  [34] rsvd_1.0.5                   
##  [35] RJSONIO_1.3-1.4              
##  [36] locfit_1.5-9.4               
##  [37] bitops_1.0-7                 
##  [38] rhdf5filters_1.2.1           
##  [39] cachem_1.0.5                 
##  [40] DelayedArray_0.16.3          
##  [41] assertthat_0.2.1             
##  [42] promises_1.2.0.1             
##  [43] scales_1.1.1                 
##  [44] beeswarm_0.3.1               
##  [45] gtable_0.3.0                 
##  [46] phylobase_0.8.10             
##  [47] beachmat_2.6.4               
##  [48] rlang_0.4.11                 
##  [49] genefilter_1.72.1            
##  [50] splines_4.0.5                
##  [51] lazyeval_0.2.2               
##  [52] BiocManager_1.30.15          
##  [53] yaml_2.2.1                   
##  [54] reshape2_1.4.4               
##  [55] httpuv_1.6.1                 
##  [56] tools_4.0.5                  
##  [57] bookdown_0.22                
##  [58] gridBase_0.4-7               
##  [59] ellipsis_0.3.2               
##  [60] jquerylib_0.1.4              
##  [61] RColorBrewer_1.1-2           
##  [62] Rcpp_1.0.6                   
##  [63] plyr_1.8.6                   
##  [64] sparseMatrixStats_1.2.1      
##  [65] progress_1.2.2               
##  [66] zlibbioc_1.36.0              
##  [67] purrr_0.3.4                  
##  [68] RCurl_1.98-1.3               
##  [69] prettyunits_1.1.1            
##  [70] viridis_0.6.1                
##  [71] cluster_2.1.1                
##  [72] tinytex_0.31                 
##  [73] magrittr_2.0.1               
##  [74] data.table_1.14.0            
##  [75] RSpectra_0.16-0              
##  [76] hms_1.1.0                    
##  [77] mime_0.10                    
##  [78] evaluate_0.14                
##  [79] xtable_1.8-4                 
##  [80] XML_3.99-0.6                 
##  [81] jpeg_0.1-8.1                 
##  [82] gridExtra_2.3                
##  [83] shape_1.4.6                  
##  [84] compiler_4.0.5               
##  [85] scater_1.18.6                
##  [86] tibble_3.1.2                 
##  [87] RCy3_2.10.2                  
##  [88] crayon_1.4.1                 
##  [89] R.oo_1.24.0                  
##  [90] htmltools_0.5.1.1            
##  [91] entropy_1.3.0                
##  [92] later_1.2.0                  
##  [93] tidyr_1.1.3                  
##  [94] howmany_0.3-1                
##  [95] DBI_1.1.1                    
##  [96] ExperimentHub_1.16.1         
##  [97] dbplyr_2.1.1                 
##  [98] MASS_7.3-53.1                
##  [99] rappdirs_0.3.3               
## [100] Matrix_1.3-2                 
## [101] ade4_1.7-16                  
## [102] R.methodsS3_1.8.1            
## [103] igraph_1.2.6                 
## [104] pkgconfig_2.0.3              
## [105] bigmemory.sri_0.1.3          
## [106] rncl_0.8.4                   
## [107] registry_0.5-1               
## [108] locfdr_1.1-8                 
## [109] scuttle_1.0.4                
## [110] xml2_1.3.2                   
## [111] foreach_1.5.1                
## [112] annotate_1.68.0              
## [113] vipor_0.4.5                  
## [114] bslib_0.2.5.1                
## [115] rngtools_1.5                 
## [116] pkgmaker_0.32.2              
## [117] XVector_0.30.0               
## [118] stringr_1.4.0                
## [119] digest_0.6.27                
## [120] pracma_2.3.3                 
## [121] graph_1.68.0                 
## [122] softImpute_1.4-1             
## [123] rmarkdown_2.8                
## [124] edgeR_3.32.1                 
## [125] DelayedMatrixStats_1.12.3    
## [126] curl_4.3.1                   
## [127] kernlab_0.9-29               
## [128] shiny_1.6.0                  
## [129] lifecycle_1.0.0              
## [130] nlme_3.1-152                 
## [131] jsonlite_1.7.2               
## [132] clusterExperiment_2.10.1     
## [133] Rhdf5lib_1.12.1              
## [134] BiocNeighbors_1.8.2          
## [135] viridisLite_0.4.0            
## [136] limma_3.46.0                 
## [137] fansi_0.4.2                  
## [138] pillar_1.6.1                 
## [139] lattice_0.20-41              
## [140] fastmap_1.1.0                
## [141] httr_1.4.2                   
## [142] survival_3.2-10              
## [143] interactiveDisplayBase_1.28.0
## [144] glue_1.4.2                   
## [145] png_0.1-7                    
## [146] iterators_1.0.13             
## [147] BiocVersion_3.12.0           
## [148] glmnet_4.1-1                 
## [149] bit_4.0.4                    
## [150] stringi_1.6.2                
## [151] sass_0.4.0                   
## [152] HDF5Array_1.18.1             
## [153] blob_1.2.1                   
## [154] BiocSingular_1.6.0           
## [155] AnnotationHub_2.22.1         
## [156] memoise_2.0.0                
## [157] dplyr_1.0.6                  
## [158] irlba_2.3.3                  
## [159] ape_5.5

References

5.
Merico, D., Isserlin, R., Stueker, O., Emili, A. & Bader, G. D. Enrichment map: A network-based method for gene-set enrichment visualization and interpretation. PLoS One 5, e13984 (2010).
6.
Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–15550 (2005).
7.
Gustavsen, J. A., Pai, S., Isserlin, R., Demchak, B. & Pico, A. R. RCy3: Network biology using cytoscape from within R. F1000Res. 8, 1774 (2019).
8.
Kucera, M., Isserlin, R., Arkhangorodsky, A. & Bader, G. D. AutoAnnotate: A cytoscape app for summarizing networks with semantic annotations. F1000Res. 5, 1717 (2016).