Module 3: Generalized Linear Models
In this module we’re going to cover:
- Reading data from files and evaluating missingness
- Creating publication-quality plots with ggplot2
- Fitting models with binary outcomes using generalized linear models
Essential R: Read tables from files and merge
For this we’re going to use two data files available in the course data directory.
Download these for use and put them in your R working directory.
- tomerge1.csv: comma-separated values
- tomerge2.txt: space-delimited values
Use read.delim() to read in tables. Note the use of the sep= parameter to indicate what the column separator is:
x1 <- read.delim("data/tomerge1.csv", sep = ",")
head(x1)
## Sample_ID Exposure Biomarker_value
## 1 A 0 35
## 2 B 0 22
## 3 C 0 91
## 4 D 0 3
## 5 E 1 56
## 6 F 1 37
x2 <- read.delim("data/tomerge2.txt", sep = " ")
head(x2)
## sampleID Exposure_level Biomarker2_detected
## 1 A 0.65405517 1
## 2 B 0.67202852 0
## 3 C 0.88646372 1
## 4 D 0.28433256 0
## 5 E 0.04166839 0
## 6 F 0.45263534 1
Use merge() to combine the two tables by sample ID. Note the use of by.x and by.y to tell merge() which columns are equivalent:
x_merge <- merge(x1, x2, by.x = "Sample_ID", by.y = "sampleID")
head(x_merge)
## Sample_ID Exposure Biomarker_value Exposure_level Biomarker2_detected
## 1 A 0 35 0.65405517 1
## 2 B 0 22 0.67202852 0
## 3 C 0 91 0.88646372 1
## 4 D 0 3 0.28433256 0
## 5 E 1 56 0.04166839 0
## 6 F 1 37 0.45263534 1
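By default merge() keeps only sample IDs present in both tables (an inner join). A small self-contained sketch, with made-up toy data frames, shows how the all.x argument keeps unmatched rows as well:

```r
# Toy tables: IDs A and B match; C and D appear in only one table each.
a <- data.frame(id = c("A", "B", "C"), value = c(1, 2, 3))
b <- data.frame(sample = c("A", "B", "D"), score = c(10, 20, 40))

inner <- merge(a, b, by.x = "id", by.y = "sample")                # matching IDs only
left  <- merge(a, b, by.x = "id", by.y = "sample", all.x = TRUE)  # keep all rows of a

nrow(inner)  # 2 (A, B)
nrow(left)   # 3 (A, B, C; score is NA for C)
```

Use all = TRUE to keep unmatched rows from both tables (a full outer join).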
Read data from files and explore variable distribution
We’re going to use a popular dataset for data analysis, pertaining to survival of passengers aboard the Titanic.
Download the dataset here and copy it into your working directory.
Let’s read in the data from a file:
dat <- read.delim(
    "data/titanic.csv",
    sep = ","  # indicate the column separator
)
Examine the columns:
head(dat)
## PassengerId Survived Pclass Name Sex Age SibSp
## 1 1 0 3 Braund, Mr. Owen Harris male 22 1
## 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 3 1 3 Heikkinen, Miss. Laina female 26 0
## 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 5 0 3 Allen, Mr. William Henry male 35 0
## 6 6 0 3 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
Some of the columns are categorical; use table() to look at the tallies:
table(dat$Survived)
##
## 0 1
## 549 342
table(dat$Pclass)
##
## 1 2 3
## 216 184 491
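Counts from table() can be turned into proportions with prop.table(). A minimal sketch, using a made-up vector of class labels:

```r
# Toy vector of class labels
cls <- c(1, 1, 2, 3, 3, 3)
tab <- table(cls)   # counts per class: 2, 1, 3
prop.table(tab)     # fractions of the total, summing to 1
```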
Use summary() to look at continuous-valued data:
summary(dat$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
Notice that there are 177 NA (missing) values for age.
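Missing values propagate through many base-R summary functions, which is why it pays to spot them early; na.rm = TRUE drops NAs before computing. A toy vector as illustration:

```r
# Toy ages, one missing
age <- c(22, 38, 26, 35, NA)

mean(age)                # NA: the missing value poisons the result
mean(age, na.rm = TRUE)  # 30.25: computed over the 4 observed values
sum(is.na(age))          # 1 missing value
```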
Let’s visualize the missing data more systematically.
Explore missing data
For this let’s use a little script that converts a table into black-and-white squares to visualize missing data. First, install the plotrix package:
if (!requireNamespace("plotrix", quietly = TRUE)) install.packages("plotrix")
suppressMessages(require(plotrix))
#' show data missingness as a chequered matrix
#'
#' @param x (matrix) data matrix.
#' @param xlab (char) x-axis label
#' @param ylab (char) y-axis label
#' @param border colour of cell borders (NA for no borders)
#' @return plots missingness matrix to the current graphics device
#' @import plotrix
#' @export
plotMissMat <- function(x, xlab = "columns",
                        ylab = "rows", border = NA) {
    x <- !is.na(x)
    class(x) <- "numeric"
    color2D.matplot(x, show.values = FALSE, axes = FALSE,
        cs1 = c(1, 0), cs2 = c(1, 0), cs3 = c(1, 0),
        border = border, cex = 0.8,
        xlab = xlab, ylab = ylab)
}
Let’s look at the missingness in the Titanic dataset. Missing data is shown as a white cell, and non-missing data is shown in black.
plotMissMat(dat)
We can see a column with many missing values. This is probably the “age” data. Let’s count the number of missing values on a per-column level.
For this we combine is.na(), which returns TRUE for each missing (NA) value, and colSums(), which adds up the TRUE values down each column.
colSums(is.na(dat))
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 0 0 0 0 0 177 0 0
## Ticket Fare Cabin Embarked
## 0 0 0 0
This confirms that Age is the only column with missing data.
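A common next step after counting NAs is to subset to rows with no missing values; base R provides complete.cases() for this. A sketch on a made-up data frame:

```r
# Toy data frame with missing values in two rows
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c("a", NA, "c", "d"))

complete.cases(df)                   # TRUE FALSE FALSE TRUE
df_complete <- df[complete.cases(df), ]
nrow(df_complete)                    # 2 rows survive
```

Note that dropping incomplete rows discards information; model functions such as glm() do this silently, as we will see below.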
Now let’s explore the data using plots.
Create plots with ggplot2 to explore variable relationships
ggplot2 is a popular plotting package that uses an additive paradigm to build plots. The ggplot2 website is a wealth of reference information, so here we just touch on the basics to get you started.
Anytime you need to generate a specific kind of plot, the website will most likely have documentation for how to achieve it.
Let’s start by creating a scatterplot. For this let’s load a dataset measuring statistics around quality of life in US states in the late 70’s:
state.x77 <- as.data.frame(state.x77)
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
Create a base plot using the ggplot() function, then “add” a scatterplot to it. Notice that the plot has been assigned to a variable named p. This setup is standard for ggplot2 and allows multiple visualizations to be applied to the same base plot. We use aes() to tell ggplot2 what the x and y axes are and, later, if we want to colour-code by a particular column.
require(ggplot2)
p <- ggplot(state.x77,
    aes(x = Illiteracy, y = Income)
)
p <- p + geom_point()  # scatter plot
p
Now let’s add a smoothed trend line with confidence intervals:
p + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
It looks like there is a negative relationship between illiteracy and income. We can confirm this by looking at correlation:
x <- state.x77$Illiteracy
y <- state.x77$Income
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -3.3668, df = 48, p-value = 0.001505
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6378257 -0.1807128
## sample estimates:
## cor
## -0.4370752
cor.test(x,y)$p.value
## [1] 0.001505073
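As the line above shows, cor.test() returns an object whose components (estimate, confidence interval, p-value) can be extracted programmatically for reporting. A self-contained sketch on made-up, strongly correlated data (variable names xx and yy are arbitrary):

```r
set.seed(1)
xx <- 1:20
yy <- 2 * xx + rnorm(20)   # linear trend plus a little noise

ct <- cor.test(xx, yy)
ct$estimate   # correlation coefficient (close to 1 here)
ct$conf.int   # 95 percent confidence interval
ct$p.value    # very small for this strong a relationship
```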
Let’s now examine categorical variables. In the titanic set, let’s look at the fare paid based on passenger class:
p <- ggplot(dat)
p + geom_boxplot(
    aes(x = factor(Pclass),  # make categorical
        y = Fare))
We can use barplots to examine counts and proportions. Let’s look at the number of survivors, split by passenger class:
p + geom_bar(
    aes(x = Pclass, fill = factor(Survived))
)
The plot above shows count data. Let’s convert this to proportions. We can see that the fraction of non-survivors in “Class 3” is high.
p + geom_bar(
    aes(x = Pclass, fill = factor(Survived)),
    position = "fill"
)
How about males versus females?
p + geom_bar(
    aes(x = Sex, fill = factor(Survived)),
    position = "fill"
)
Fit a binary response variable using glm() and logistic regression
Let’s fit a model to a binary outcome. For this we load a dataset that measures physiological variables in a cohort of Pima Indians.
require(mlbench)
data(PimaIndiansDiabetes2)
# type ?PimaIndiansDiabetes2 to learn more about the dataset.
dat <- PimaIndiansDiabetes2
head(dat)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 NA 33.6 0.627 50 pos
## 2 1 85 66 29 NA 26.6 0.351 31 neg
## 3 8 183 64 NA NA 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 NA NA 25.6 0.201 30 neg
Let’s look at the impact of blood glucose levels on diabetes diagnosis. First let’s make a scatterplot.
Could there be a relationship?
p <- ggplot(dat, aes(x = glucose, y = factor(diabetes)))
p + geom_point()
## Warning: Removed 5 rows containing missing values (`geom_point()`).
Could this be fit with a linear model?
p <- ggplot(dat, aes(x = glucose, y = factor(diabetes)))
p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 5 rows containing missing values (`geom_point()`).
This is a situation where a logistic regression model would be an appropriate choice. We’re going to use glm()
to fit a model to these data:
mod <- glm(factor(diabetes) ~ glucose,
    data = dat,
    family = "binomial"  # set to model binary outcome
)
summary(mod)
##
## Call:
## glm(formula = factor(diabetes) ~ glucose, family = "binomial",
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1846 -0.7773 -0.5094 0.8232 2.2896
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.715088 0.438100 -13.04 <2e-16 ***
## glucose 0.040634 0.003382 12.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 986.70 on 762 degrees of freedom
## Residual deviance: 786.56 on 761 degrees of freedom
## (5 observations deleted due to missingness)
## AIC: 790.56
##
## Number of Fisher Scoring iterations: 4
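The coefficients above are on the log-odds scale; once a logistic model is fit, predict(..., type = "response") converts it back to a probability. A self-contained sketch on simulated toy data (the names toy, outcome, and the coefficient values are made up for illustration and loosely echo the fit above):

```r
set.seed(42)
glucose <- runif(200, 70, 200)                 # simulated glucose values
p_true  <- plogis(-5.7 + 0.04 * glucose)       # true probability of a positive outcome
outcome <- rbinom(200, 1, p_true)              # simulated binary outcomes
toy     <- data.frame(glucose, outcome)

fit <- glm(outcome ~ glucose, data = toy, family = "binomial")
coef(fit)  # intercept and slope on the log-odds scale

# Fitted probability of a positive outcome at glucose = 150
predict(fit, newdata = data.frame(glucose = 150), type = "response")
```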
Which factors explain a diabetes diagnosis?
What if we include a couple other factors?
mod <- glm(factor(diabetes) ~ glucose + pregnant + age + pressure + triceps,
    data = dat,
    family = "binomial")
summary(mod)
##
## Call:
## glm(formula = factor(diabetes) ~ glucose + pregnant + age + pressure +
## triceps, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1790 -0.6813 -0.4061 0.7167 2.3204
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.6685706 0.8451262 -9.074 < 2e-16 ***
## glucose 0.0365926 0.0041105 8.902 < 2e-16 ***
## pregnant 0.0972493 0.0417560 2.329 0.019860 *
## age 0.0249707 0.0135945 1.837 0.066235 .
## pressure -0.0009321 0.0099322 -0.094 0.925230
## triceps 0.0420045 0.0117044 3.589 0.000332 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 678.40 on 533 degrees of freedom
## Residual deviance: 494.38 on 528 degrees of freedom
## (234 observations deleted due to missingness)
## AIC: 506.38
##
## Number of Fisher Scoring iterations: 5
Note: This morning’s session only intends to introduce you to fitting non-linear models.
In practice you may need to test multiple models to ascertain which best fits your data, using measures such as goodness-of-fit. You will also likely compute odds ratios (the change in the odds of Y per unit increase in X), which is out of scope for the current tutorial.
We strongly recommend that you learn these topics before applying these methods to your own data.
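To give a flavour of the model comparison mentioned above: for nested logistic models, anova() with a likelihood-ratio test asks whether an extra term is worth keeping, and AIC() gives a penalized fit score (lower is better). A sketch on simulated toy data where only one predictor truly matters (names x1t, x2t, yt are arbitrary):

```r
set.seed(7)
x1t <- rnorm(150)                          # informative predictor
x2t <- rnorm(150)                          # pure noise
yt  <- rbinom(150, 1, plogis(0.8 * x1t))   # outcome depends only on x1t

m1 <- glm(yt ~ x1t,       family = "binomial")
m2 <- glm(yt ~ x1t + x2t, family = "binomial")

anova(m1, m2, test = "Chisq")  # likelihood-ratio test for the extra term
AIC(m1)
AIC(m2)
```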
Exercise
Now let’s apply the ideas above to a dataset for classifying a breast cell as being either benign or malignant.
data(BreastCancer)
bc <- BreastCancer
for (k in 2:10) {  # altered for current lab
    bc[, k] <- as.numeric(bc[, k])
}
head(bc)
## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin
## 1 1000025 5 1 1 1 2 1 3
## 2 1002945 5 4 4 5 7 10 3
## 3 1015425 3 1 1 1 2 2 3
## 4 1016277 6 8 8 1 3 4 3
## 5 1017023 4 1 1 3 2 1 3
## 6 1017122 8 10 10 8 7 10 9
## Normal.nucleoli Mitoses Class
## 1 1 1 benign
## 2 2 1 benign
## 3 1 1 benign
## 4 7 1 benign
## 5 1 1 benign
## 6 7 1 malignant
Learn more about the dataset:
?BreastCancer
For your exercise, answer the following questions:
- Is there missing data?
- Which columns are categorical?
- Use plots to explore the relationship between explanatory variables.
- Fit a logistic model to identify which factors explain class (benign vs. malignant).