Summary of the Dataset

In this report, we are going to clean and normalize GEO data GSE75168 for future analysis. The data target methlyation and acethylation of histone H3 lysine 4 correlation with breast cancer sub-types. Using a genome-wide ChIP-Seq approach, the data provide raw genomic count data of three human mammary cell lines: normal-like sub-type (MCF10A), and two cancer sub-types: luminal(MCF7) and basal-like metastatic (MDA-MB-231)(Messier et al. 2016).

Step 1: Clean the data and map to HUGO symbols

Load in required package

# install and loading required packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
library(BiocManager)

if (!requireNamespace("GEOquery", quietly = TRUE))
  BiocManager::install("GEOquery")
library(GEOquery)

if (!requireNamespace("knitr", quietly = TRUE))
  install.packages("knitr")
library(knitr)

if (!requireNamespace("kableExtra", quietly = TRUE))
  install.packages("kableExtra")
library(kableExtra)

if (!requireNamespace("edgeR", quietly = TRUE))
  BiocManager::install("edgeR")
library(edgeR)

if (!requireNamespace("tidyr", quietly = TRUE))
  install.packages("tidyr")
library(tidyr)

if (!requireNamespace("biomaRt", quietly = TRUE))
  BiocManager::install("biomaRt")
library(biomaRt)

Download the Data

Download data using GEOquery package.

# downloading data
raw_files <- getGEOSuppFiles('GSE75168')

# get file path of the downloaded file
fpath <- rownames(raw_files)

# read in data
raw_data <- read.delim(fpath[1], header = TRUE, check.names = TRUE)

A quick look over the data.

# only display the first 15 rows due to space limit
kable(raw_data[1:15, ], format = "html") %>% kable_styling(full_width = T, font_size = 12)
Ensembl_ID MCF10A_R1 MCF10A_R2 MCF10A_R3 MCF7_R1 MCF7_R2 MCF7_R3 MDA.MB.231_R1 MDA.MB.231_R2 MDA.MB.231_R3
ENSG00000000003.12 713 1084 982 651 750 662 564 709 720
ENSG00000000419.10 686 392 1242 2287 2618 2461 582 733 875
ENSG00000000457.11 498 396 469 285 422 408 224 346 294
ENSG00000000460.14 347 69 746 392 636 568 650 796 959
ENSG00000000971.13 49 562 46 1 1 1 550 771 839
ENSG00000001036.11 3007 977 3913 2396 3245 3066 1579 1917 2207
ENSG00000001084.8 950 1397 3065 1134 1282 1309 435 577 713
ENSG00000001167.12 346 424 657 525 596 646 428 472 592
ENSG00000001460.15 137 95 331 69 102 91 104 146 142
ENSG00000001461.14 2863 3002 3083 259 413 395 905 1148 1400
ENSG00000001497.14 407 358 1173 605 997 988 705 918 1145
ENSG00000001561.6 221 355 297 381 413 440 110 157 162
ENSG00000001617.9 894 1326 838 368 757 660 30 38 42
ENSG00000001629.7 2217 2241 3905 3330 3635 3630 1753 2398 2728
ENSG00000001630.13 205 94 338 125 328 297 127 282 198


Each row represent a difference gene, in which the gene’s Ensembl ID is the rowname, and each column repsence a different sample.

Access

To look at the total coverage of the raw data:

dim(raw_data)
## [1] 20575    10

We can see that there are 20575 rows in total. Since each row represent a different gene, the inital coverage of the raw data is 20575 genes. Moreover, we have 10 columns, with one column containing gene Ensembl ID, we have 9 difference samples, which display below:

colnames(raw_data)
##  [1] "Ensembl_ID"    "MCF10A_R1"     "MCF10A_R2"     "MCF10A_R3"    
##  [5] "MCF7_R1"       "MCF7_R2"       "MCF7_R3"       "MDA.MB.231_R1"
##  [9] "MDA.MB.231_R2" "MDA.MB.231_R3"

We can see that we have total of 9 samples, including 3 replicates of each cell line.

Dataset Description and Platform Description

Contact information about dataset:

# getting GEO SOFT format file
gse <- getGEO("GSE75168", GSEMatrix = FALSE)
kable(data.frame(head(Meta(gse))), format = "html") %>% kable_styling(full_width = F)
contact_address contact_city contact_country contact_department contact_email contact_institute
89 Beaumont Ave Given E209 Burlington USA Biochemistry University of Vermont


Platform Information:

# getting platform information
gpl_info <- Meta(getGEO(names(GPLList(gse))[1]))
Category <- c("Platform title", "Submission data", "Last update data", "Organims", "Number of GEO datasets that use this techology", "Number of GEO samples that use this technology")
Infomation <- c(gpl_info$title, gpl_info$submission_date, gpl_info$last_update_date, gpl_info$organism, length(gpl_info$series_id), length(gpl_info$sample_id))
kable(data.frame(Category, Infomation), format = "html") %>% kable_styling(full_width = F)
Category Infomation
Platform title Illumina HiSeq 1500 (Homo sapiens)
Submission data Mar 24 2014
Last update data Oct 02 2018
Organims Homo sapiens
Number of GEO datasets that use this techology 329
Number of GEO samples that use this technology 6407

Clean

Remove gene with Low Count

We need to remove gene without at least 1 read per million in n of samples. Since we have 3 replciates for each different cell line, thus, we need to filter out genes that don’t have 3 read per million in each cell line.

# convert count to count per miliion
cpms <- cpm(raw_data[, 2:10])
rownames(cpms) <- raw_data[, 1]

# only keep rows with at least 3 read per million
keep <- rowSums(cpms > 1) >= 3
raw_data_hig <- raw_data[keep, ]
rownames(raw_data_hig) <- raw_data_hig[, 1]

Number of genes we filtered out:

nrow(raw_data) - nrow(raw_data_hig)
## [1] 6297

Duplicate Gene

# sort by number of times each Ensembl ID appears in the data
summarized_gene_counts <- sort(table(raw_data_hig$Ensembl_ID), decreasing = TRUE)
kable(summarized_gene_counts[which(summarized_gene_counts > 1)[1: 10]], format = "html") %>% kable_styling(full_width = F)
Var1 Freq
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA


Since the table is empty, there are no duplicate genes in our data.

Define Sample Groups

We than create a data frame containing the group infomation, whcih stored the infomation of each sample’s cell line and replicate.

# split the sample name into cell line and replicate
sample_group <- data.frame(lapply(colnames(raw_data)[2:10], 
                                  FUN=function(x){unlist(strsplit(x, split = "\\_"))}))
colnames(sample_group) <- colnames(raw_data)[2:10]
rownames(sample_group) <- c("cell_line", "replicate")

# transfrom into dataframe
sample_group <- data.frame(t(sample_group))
kable(sample_group, format = "html") %>% kable_styling(full_width = F)
cell_line replicate
MCF10A_R1 MCF10A R1
MCF10A_R2 MCF10A R2
MCF10A_R3 MCF10A R3
MCF7_R1 MCF7 R1
MCF7_R2 MCF7 R2
MCF7_R3 MCF7 R3
MDA.MB.231_R1 MDA.MB.231 R1
MDA.MB.231_R2 MDA.MB.231 R2
MDA.MB.231_R3 MDA.MB.231 R3

Map

By the data displayed above, we see that our data only the Ensembl ID. We need to convert ensemble id to HUGO gene symbol. First, we need to get a conversion table of Ensembl ID map to corresponding gene symbol.

# select Mart and Dataset
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# use desired data to convert ensemble id to HUGO gene symbol
id_conversion <- getBM(attributes = c("ensembl_gene_id_version","hgnc_symbol"),
                         filters = c("ensembl_gene_id_version"),
                         values = raw_data_hig$Ensembl_ID,
                         mart = ensembl)
kable(id_conversion[1:10, ], format = "html") %>% kable_styling(full_width = F)
ensembl_gene_id_version hgnc_symbol
ENSG00000019995.6 ZRANB1
ENSG00000064195.7 DLX3
ENSG00000065491.8 TBC1D22B
ENSG00000073905.8 VDAC1P1
ENSG00000076641.4 PAG1
ENSG00000102096.9 PIM2
ENSG00000105988.6 NHP2P1
ENSG00000106125.14 MINDY4
ENSG00000108958.4
ENSG00000109089.7 CDR2L


However, consider the number of map we get:

nrow(id_conversion)
## [1] 1532

We only get about 10% match. In order to resolve this issue, since the Ensembl ID of our data include the version number, we will drop the version and map with only the Ensembl ID.

Another important factor we need to consider is that due to the time of submission data date and the context of the paper, we will not using the latest Ensembl-HUGO conversion version, instead, we will use the Ensembl 77(published on 2014 Oct), which is the version the data used.

With the consideration of the above two factors, we will remap our data.

# select Mart and Dataset with wanted version
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host="https://oct2014.archive.ensembl.org")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# split current Ensembl ID with version into two columns: Ensembl ID and version number
raw_data_hig <- raw_data_hig %>% tidyr::separate(Ensembl_ID, c("id", "version"), "\\.")

# use desired data to convert ensemble id to HUGO gene symbol
id_conversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                         filters = c("ensembl_gene_id"),
                         values = raw_data_hig$id,
                         mart = ensembl)
kable(id_conversion[1:10, ], format = "html") %>% kable_styling(full_width = F)
ensembl_gene_id hgnc_symbol
ENSG00000000003 TSPAN6
ENSG00000000419 DPM1
ENSG00000000457 SCYL3
ENSG00000000460 C1orf112
ENSG00000000971 CFH
ENSG00000001036 FUCA2
ENSG00000001084 GCLC
ENSG00000001167 NFYA
ENSG00000001460 STPG1
ENSG00000001461 NIPAL3


To check again the number of match we got:

nrow(id_conversion)
## [1] 14276

Which is a much better match result(~99%).

Since we omit the version number, we need to check our conversion table to prevent the issue of one ensembl id map to more than one gene symbol.

# sort by number of times each Ensembl ID appears in the data
n_occur <- data.frame(table(id_conversion$ensembl_gene_id))

# filter out only genes appears more than once in the data
kable(n_occur[n_occur$Freq > 1, ], format = "html") %>% kable_styling(full_width = F)
Var1 Freq
14233 ENSG00000279010 2


There is one Ensembl id that map to 2 different gene symbol ENSG00000279010.

kable(id_conversion %>% dplyr::filter(ensembl_gene_id == 'ENSG00000279010'), 
      format = "html") %>% kable_styling(full_width = F)
ensembl_gene_id hgnc_symbol
ENSG00000279010 MIR4534
ENSG00000279010 MIR6820


With the consideration of its version number:

kable((raw_data_hig %>% dplyr::filter(id == 'ENSG00000279010'))[, 1:6], 
      format = "html") %>% kable_styling(full_width = F)
id version MCF10A_R1 MCF10A_R2 MCF10A_R3 MCF7_R1
ENSG00000279010.1 ENSG00000279010 1 223 241 212 39


We find the the proper Ensembl ID with version number is ENSG00000279010.1. After searching online, we will keep the row with gene symbol MIR4534.

# remove the row with gene symbol MIR6820
id_conversion <- subset(id_conversion, !(id_conversion$ensembl_gene_id == "ENSG00000279010" 
                                 & id_conversion$hgnc_symbol == "MIR6820"))

With no duplicate Ensembl ID, we will merge the conversion table and our cleaned raw data.

# merge raw data with mapping conversion using Ensembl ID
raw_data_hig_annot <- merge(id_conversion, raw_data_hig, by.x = 1, by.y = 1, all.y = TRUE)

Now we need to check the number of gene we were not able to match.

nrow(raw_data_hig) - length(which(raw_data_hig$id %in% id_conversion$ensembl_gene_id))
## [1] 3

There are 3 genes we were not able to match, we will print them out in table to investigate why.

# filter out rows that don't have HUGO symbol
ensembl_id_missing_gene <- raw_data_hig_annot$ensembl_gene_id[which(is.na(raw_data_hig_annot$hgnc_symbol))]
kable(raw_data_hig_annot[is.na(raw_data_hig_annot$hgnc_symbol), ][, 1:6], 
      format = "html") %>% kable_styling(full_width = F)
ensembl_gene_id hgnc_symbol version MCF10A_R1 MCF10A_R2 MCF10A_R3
1 alignment_not_unique NA NA 11805625 16487627 25094879
2 ambiguous NA NA 1018332 1805665 1281009
14278 no_feature NA NA 3019369 2326257 3102003


We will not filter those rows our right now since for our research purpose of differential expression, those rows might contribute to potential significant result. Also, those rows each represent unique element and there is no need to worried them right now. Notice that the alignment not unique row could explain why we did not find any replicate gene before.

Step 2: Apply Normalization

Normalization Process

In our data, the deregulated genes and the non-deregulated genes does not have similar behaviors, but the total expression are the same. Also, according to the paper, the experimental design and condition are similar and the cell amount is the same for each sample. Thus, with the above assumptions, we will use the normalize method total count normalization (Evans et al. 2017).

# convert data into DGEList object
filtered_data_matrix <- as.matrix(raw_data_hig_annot[, 4:12])
rownames(filtered_data_matrix) <- raw_data_hig_annot$hgnc_symbol
d <- DGEList(counts=filtered_data_matrix, group = sample_group$cell_line)

# normalize by count per million
nf <- calcNormFactors(d)
normalzied_data_annon <- cpm(nf)

Pre-normalization and Post-normalization Data Distribution

After our data is normalized, we will compare the pre-normalization and post-normalization data distribution to examine the quality change of the normalized data.

Boxplot

# draw pre-normalized boxplot
bplot_raw <- log2(cpm(raw_data_hig_annot[,4:12]))
boxplot(bplot_raw, xlab = "Samples", ylab = "log2 Counts per Million",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Pre-normalization box plot")
# draw the median on box plot
abline(h = median(apply(bplot_raw, 2, median)),
       col = "red", lwd = 2, lty = "dashed")

# draw post-normalized boxplot
bplot_norm <- log2(normalzied_data_annon)
boxplot(bplot_norm, xlab = "Samples", ylab = "log2 Counts per Million",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Post-normalization box plot")
# draw the median on each box plot
abline(h = median(apply(bplot_norm, 2, median)),
       col = "red", lwd = 2, lty = "dashed")

Density Plot

# pre-normalization plot
counts_density_raw <- apply(log2(cpm(raw_data_hig_annot[, 4:12])), 2, density)
# calculate the limits across all the samples
xlim <- 0
ylim <- 0
for (i in 1:length(counts_density_raw)) {
  xlim <- range(c(xlim, counts_density_raw[[i]]$x))
  ylim <- range(c(ylim, counts_density_raw[[i]]$y))
}
cols <- rainbow(length(counts_density_raw))
ltys <- rep(1, length(counts_density_raw))
# plot the first density plot to initialize the plot
plot(counts_density_raw[[1]], xlim = xlim, ylim = ylim, type = "n",
     ylab = "Smoothing density of log2-CPM",
     main = "Pre-normalization denstiy plot", cex.lab = 0.85)
# plot each line
for (i in 1:length(counts_density_raw)) {
  lines(counts_density_raw[[i]], col = cols[i], lty = ltys[i])
}
# create legend
legend("topright", colnames(bplot_raw),
       col = cols, lty = ltys, cex = 0.75,
       border = "blue",  text.col = "green4",
       merge = TRUE, bg = "gray90")

# post-normalization plot
counts_density_norm <- apply(log2(normalzied_data_annon), 2, density)
# calculate the limits across all the samples
xlim <- 0
ylim <- 0
for (i in 1:length(counts_density_norm)) {
  xlim <- range(c(xlim, counts_density_norm[[i]]$x))
  ylim <- range(c(ylim, counts_density_norm[[i]]$y))
}
cols <- rainbow(length(counts_density_norm))
ltys <- rep(1, length(counts_density_norm))
# plot the first density plot to initialize the plot
plot(counts_density_norm[[1]], xlim = xlim, ylim = ylim, type = "n",
     ylab = "Smoothing density of log2-CPM",
     main = "Post-normalization denstiy plot", cex.lab = 0.85)
# plot each line
for (i in 1:length(counts_density_norm)) {
  lines(counts_density_norm[[i]], col = cols[i], lty = ltys[i])
}
# create legend
legend("topright", colnames(bplot_norm),
       col = cols, lty = ltys, cex = 0.75,
       border ="blue",  text.col = "green4",
       merge = TRUE, bg = "gray90")

By comparing the pre-normalization and post-normalization data distribution, we can obersve that there are less variance within data. From the box plot, the post-normalization median for each sample are more close to the total median line and the data is less dispersed compare to the pre-normalization. Similar trend is also observed from the density plot.

Post Normalization MDS Plot

We will also use MDS plot to compare between single samples, and to investigate the distance between samples. Ideally, samples with similaire expression will cluster together.

plotMDS(nf, labels = rownames(sample_group), 
        col = c("darkgreen", "blue")[factor(sample_group$cell_line)], 
        main = "MDS Plot")

From this plot, the blue cluster is the luminal cell line, the black cluster is the basal-like metastatic, and the green cluster is the normal-like subtype. We can see that our data is strong that our data cluster together, which indicates there is little variation between replicates when grouping by cell lines.

Dispersion

# setup plot assumptions
model_design <- model.matrix(~sample_group$replicate + sample_group$cell_line + 0)
nf_d <- estimateDisp(nf, model_design)

# plot BCV plot to investigate dispersion
plotBCV(nf_d,col.tagwise = "black",col.common = "red",main = "BCV Plot")

# plot Mean-Variance plot to investigate relationship between mean and variance 
# of expression data
plotMeanVar(nf_d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
            show.ave.raw.vars = TRUE,
            NBline = TRUE,
            show.binned.common.disp.vars = TRUE, main = "Mean Variance")

Biological coefficient of variation(BCV) Plots can be used to to look at dispersion. Each of the dots in the given BCV plot is the BCV for a gene from our data set. Since the common dispersion is between 0.2 - 0.4, our data is considered reasonable and could detect more deregulated genes(Gu 2015). The Mean-Variance plot also shows a promising correlation between mean and variance of the sample data. The factors dispalyed by the two plot indicate that out data has been normalized to the point where it is suitable for further analysis.

Step 3: Interpret and Document

Final Prepared Data

Here display our final data cleaned and normalized. The row name is HUGO gene symbol, and each row represent a different gene. The column name is the sample name, and each column represent a different sample’s normalized count per million expression data.

kable(normalzied_data_annon[10:20, ], format = "html") %>% 
  kable_styling(full_width = F, font_size = 12)
MCF10A_R1 MCF10A_R2 MCF10A_R3 MCF7_R1 MCF7_R2 MCF7_R3 MDA.MB.231_R1 MDA.MB.231_R2 MDA.MB.231_R3
NFYA 10.986435 13.949391 11.320310 14.847536 12.0786911 13.7190011 15.438415 12.7555456 14.800126
STPG1 4.350120 3.125453 5.703231 1.951391 2.0671585 1.9325528 3.751390 3.9455713 3.550030
NIPAL3 90.907983 98.764317 53.121028 7.324784 8.3699655 8.3885533 32.644311 31.0240812 35.000298
LAS1L 12.923349 11.778023 20.211147 17.110018 20.2054614 20.9820017 25.430099 24.8084551 28.625244
ENPP4 7.017347 11.679325 5.117400 10.775069 8.3699655 9.3442113 3.967817 4.2428404 4.050034
SEMA3F 28.386915 43.624745 14.438995 10.407416 15.3415590 14.0163169 1.082132 1.0269295 1.050009
ANKIB1 70.395738 73.727793 67.284338 94.175800 73.6678558 77.0897432 63.232572 64.8046573 68.200580
CYP51A1 6.509304 3.092554 5.823843 3.535128 6.6473333 6.3073426 4.581025 7.6208980 4.950042
KRIT1 40.484694 39.282010 37.458681 26.216507 21.1579756 22.1712650 16.664831 17.2956550 18.600158
RAD52 8.223950 7.435288 5.427546 3.506847 3.8100569 3.2280003 4.436741 5.4589411 5.450046
MYH16 1.301861 1.546277 1.533497 0.000000 0.0405325 0.0212368 1.478913 0.9458561 1.475013


Final coverage:

nrow(normalzied_data_annon)
## [1] 14278

Interpretation Questions

What are the control and test conditions of the dataset?

The control of the dataset is the cell line is normal-like subtype (MCF10A), and two test conditions are luminal(MCF7) and basal-like metastatic (MDA-MB-231).

Why is the dataset of interest to you?

Breast cancer is one of the most common type of the cancer, thus, investigate the potential genetic factors invovled in breast cancer’s mechanism becomes crucial. Moreover, previous studies also indicate that methlyation and acethylation of histone H3 lysine 4 is a well-established marker of active or poised transcription(Kimura 2013). With this information, I think it would be intersting to look at how the modification of histone H3 lysine 4 would affect gene expression of the two breast cancer subtypes luminal and basal-like metastatic compare to normal breast cancer cell line.

Were there expression values that were not unique for specific genes? How did you handle these?

There were no expression value that were not unique for specific genes.

Were there expression values that could not be mapped to current HUGO symbols?

There are 3 observations that could not mapped to current HUGO symbols, which is mention above. The 3 observerations are alignment not unique, no feathers, and ambiguous, which will not affect our result and we will not filtered out in current phases.

How many outliers were removed?

There are total of 6297 genes removed due to their low count number that they don’t have 3 read per million in each cell line. No other outliers were removed.

How did you handle replicates?

There are 3 biological replicate per each of the 3 cell lines like mentioned above. When we normalize our data, we group the data by cell line, and perform normalization across replicates for each cell line. Also, genes don’t have 3 read per million in each cell line(because we have 3 replciates) were removed.

What is the final coverage of your dataset?

The final coverage is 14278 genes, which is about 69% of the initial raw data.

Reference

Messier TL, Gordon JAR, Boyd JR, Tye CE, Browne G, Stein JL, Lian JB, Stein GS. 2016. Histone H3 lysine 4 acetylation and methylation dynamics define breast cancer subtypes. Oncotarget. 7(5):5094–5109. doi:https://doi.org/10.18632/oncotarget.6922.

Evans C, Hardin J, Stoebel DM. 2017. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in Bioinformatics. 19(5):776–792. doi:https://doi.org/10.1093/bib/bbx008.

Some key factors for number of significant DE genes. 2015 Nov 2. CVR Bioinformatics. https://bioinformatics.cvr.ac.uk/some-key-factors-for-number-of-significant-de-genes/.

Kimura H. Histone modifications for human epigenome analysis. J Hum Genet. 2013; 58:439-445.

Ruth Isserlin, Course Lectures (2023)