Summary of the Dataset

In this report, we are going to clean and normalize GEO data GSE75168 for future analysis. The data target methlyation and acethylation of histone H3 lysine 4 correlation with breast cancer sub-types. Using a genome-wide ChIP-Seq approach, the data provide raw genomic count data of three human mammary cell lines: normal-like sub-type (MCF10A), and two cancer sub-types: luminal(MCF7) and basal-like metastatic (MDA-MB-231)(Messier et al. 2016).

Step 1: Clean the data and map to HUGO symbols

Load in required package

# install and loading required packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
library(BiocManager)

if (!requireNamespace("GEOquery", quietly = TRUE))
  BiocManager::install("GEOquery")
library(GEOquery)

if (!requireNamespace("knitr", quietly = TRUE))
  install.packages("knitr")
library(knitr)

if (!requireNamespace("kableExtra", quietly = TRUE))
  install.packages("kableExtra")
library(kableExtra)

if (!requireNamespace("edgeR", quietly = TRUE))
  BiocManager::install("edgeR")
library(edgeR)

if (!requireNamespace("tidyr", quietly = TRUE))
  install.packages("tidyr")
library(tidyr)

if (!requireNamespace("biomaRt", quietly = TRUE))
  BiocManager::install("biomaRt")
library(biomaRt)

Download the Data

Download data using GEOquery package.

# downloading data
raw_files <- getGEOSuppFiles('GSE75168')

# get file path of the downloaded file
fpath <- rownames(raw_files)

# read in data
raw_data <- read.delim(fpath[1], header = TRUE, check.names = TRUE)

A quick look over the data.

# only display the first 15 rows due to space limit
kable(raw_data[1:15, ], format = "html") %>% kable_styling(full_width = T, font_size = 12)

Ensembl_ID	MCF10A_R1	MCF10A_R2	MCF10A_R3	MCF7_R1	MCF7_R2	MCF7_R3	MDA.MB.231_R1	MDA.MB.231_R2	MDA.MB.231_R3
ENSG00000000003.12	713	1084	982	651	750	662	564	709	720
ENSG00000000419.10	686	392	1242	2287	2618	2461	582	733	875
ENSG00000000457.11	498	396	469	285	422	408	224	346	294
ENSG00000000460.14	347	69	746	392	636	568	650	796	959
ENSG00000000971.13	49	562	46	1	1	1	550	771	839
ENSG00000001036.11	3007	977	3913	2396	3245	3066	1579	1917	2207
ENSG00000001084.8	950	1397	3065	1134	1282	1309	435	577	713
ENSG00000001167.12	346	424	657	525	596	646	428	472	592
ENSG00000001460.15	137	95	331	69	102	91	104	146	142
ENSG00000001461.14	2863	3002	3083	259	413	395	905	1148	1400
ENSG00000001497.14	407	358	1173	605	997	988	705	918	1145
ENSG00000001561.6	221	355	297	381	413	440	110	157	162
ENSG00000001617.9	894	1326	838	368	757	660	30	38	42
ENSG00000001629.7	2217	2241	3905	3330	3635	3630	1753	2398	2728
ENSG00000001630.13	205	94	338	125	328	297	127	282	198

Each row represent a difference gene, in which the gene’s Ensembl ID is the rowname, and each column repsence a different sample.

Access

To look at the total coverage of the raw data:

dim(raw_data)

## [1] 20575    10

We can see that there are 20575 rows in total. Since each row represent a different gene, the inital coverage of the raw data is 20575 genes. Moreover, we have 10 columns, with one column containing gene Ensembl ID, we have 9 difference samples, which display below:

colnames(raw_data)

##  [1] "Ensembl_ID"    "MCF10A_R1"     "MCF10A_R2"     "MCF10A_R3"    
##  [5] "MCF7_R1"       "MCF7_R2"       "MCF7_R3"       "MDA.MB.231_R1"
##  [9] "MDA.MB.231_R2" "MDA.MB.231_R3"

We can see that we have total of 9 samples, including 3 replicates of each cell line.

Dataset Description and Platform Description

Contact information about dataset:

# getting GEO SOFT format file
gse <- getGEO("GSE75168", GSEMatrix = FALSE)
kable(data.frame(head(Meta(gse))), format = "html") %>% kable_styling(full_width = F)

contact_address	contact_city	contact_country	contact_department	contact_email	contact_institute
89 Beaumont Ave Given E209	Burlington	USA	Biochemistry	Jonathan.A.Gordon@med.uvm.edu	University of Vermont

Platform Information:

# getting platform information
gpl_info <- Meta(getGEO(names(GPLList(gse))[1]))
Category <- c("Platform title", "Submission data", "Last update data", "Organims", "Number of GEO datasets that use this techology", "Number of GEO samples that use this technology")
Infomation <- c(gpl_info$title, gpl_info$submission_date, gpl_info$last_update_date, gpl_info$organism, length(gpl_info$series_id), length(gpl_info$sample_id))
kable(data.frame(Category, Infomation), format = "html") %>% kable_styling(full_width = F)

Category	Infomation
Platform title	Illumina HiSeq 1500 (Homo sapiens)
Submission data	Mar 24 2014
Last update data	Oct 02 2018
Organims	Homo sapiens
Number of GEO datasets that use this techology	329
Number of GEO samples that use this technology	6407

Clean

Remove gene with Low Count

We need to remove gene without at least 1 read per million in n of samples. Since we have 3 replciates for each different cell line, thus, we need to filter out genes that don’t have 3 read per million in each cell line.

# convert count to count per miliion
cpms <- cpm(raw_data[, 2:10])
rownames(cpms) <- raw_data[, 1]

# only keep rows with at least 3 read per million
keep <- rowSums(cpms > 1) >= 3
raw_data_hig <- raw_data[keep, ]
rownames(raw_data_hig) <- raw_data_hig[, 1]

Number of genes we filtered out:

nrow(raw_data) - nrow(raw_data_hig)

## [1] 6297

Duplicate Gene

# sort by number of times each Ensembl ID appears in the data
summarized_gene_counts <- sort(table(raw_data_hig$Ensembl_ID), decreasing = TRUE)
kable(summarized_gene_counts[which(summarized_gene_counts > 1)[1: 10]], format = "html") %>% kable_styling(full_width = F)

Var1	Freq
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA

Since the table is empty, there are no duplicate genes in our data.

Define Sample Groups

We than create a data frame containing the group infomation, whcih stored the infomation of each sample’s cell line and replicate.

# split the sample name into cell line and replicate
sample_group <- data.frame(lapply(colnames(raw_data)[2:10], 
                                  FUN=function(x){unlist(strsplit(x, split = "\\_"))}))
colnames(sample_group) <- colnames(raw_data)[2:10]
rownames(sample_group) <- c("cell_line", "replicate")

# transfrom into dataframe
sample_group <- data.frame(t(sample_group))
kable(sample_group, format = "html") %>% kable_styling(full_width = F)

	cell_line	replicate
MCF10A_R1	MCF10A	R1
MCF10A_R2	MCF10A	R2
MCF10A_R3	MCF10A	R3
MCF7_R1	MCF7	R1
MCF7_R2	MCF7	R2
MCF7_R3	MCF7	R3
MDA.MB.231_R1	MDA.MB.231	R1
MDA.MB.231_R2	MDA.MB.231	R2
MDA.MB.231_R3	MDA.MB.231	R3

Map

By the data displayed above, we see that our data only the Ensembl ID. We need to convert ensemble id to HUGO gene symbol. First, we need to get a conversion table of Ensembl ID map to corresponding gene symbol.

# select Mart and Dataset
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# use desired data to convert ensemble id to HUGO gene symbol
id_conversion <- getBM(attributes = c("ensembl_gene_id_version","hgnc_symbol"),
                         filters = c("ensembl_gene_id_version"),
                         values = raw_data_hig$Ensembl_ID,
                         mart = ensembl)
kable(id_conversion[1:10, ], format = "html") %>% kable_styling(full_width = F)

ensembl_gene_id_version	hgnc_symbol
ENSG00000019995.6	ZRANB1
ENSG00000064195.7	DLX3
ENSG00000065491.8	TBC1D22B
ENSG00000073905.8	VDAC1P1
ENSG00000076641.4	PAG1
ENSG00000102096.9	PIM2
ENSG00000105988.6	NHP2P1
ENSG00000106125.14	MINDY4
ENSG00000108958.4
ENSG00000109089.7	CDR2L

However, consider the number of map we get:

nrow(id_conversion)

## [1] 1532

We only get about 10% match. In order to resolve this issue, since the Ensembl ID of our data include the version number, we will drop the version and map with only the Ensembl ID.

Another important factor we need to consider is that due to the time of submission data date and the context of the paper, we will not using the latest Ensembl-HUGO conversion version, instead, we will use the Ensembl 77(published on 2014 Oct), which is the version the data used.

With the consideration of the above two factors, we will remap our data.

# select Mart and Dataset with wanted version
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host="https://oct2014.archive.ensembl.org")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# split current Ensembl ID with version into two columns: Ensembl ID and version number
raw_data_hig <- raw_data_hig %>% tidyr::separate(Ensembl_ID, c("id", "version"), "\\.")

# use desired data to convert ensemble id to HUGO gene symbol
id_conversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                         filters = c("ensembl_gene_id"),
                         values = raw_data_hig$id,
                         mart = ensembl)
kable(id_conversion[1:10, ], format = "html") %>% kable_styling(full_width = F)

ensembl_gene_id	hgnc_symbol
ENSG00000000003	TSPAN6
ENSG00000000419	DPM1
ENSG00000000457	SCYL3
ENSG00000000460	C1orf112
ENSG00000000971	CFH
ENSG00000001036	FUCA2
ENSG00000001084	GCLC
ENSG00000001167	NFYA
ENSG00000001460	STPG1
ENSG00000001461	NIPAL3

To check again the number of match we got:

nrow(id_conversion)

## [1] 14276

Which is a much better match result(~99%).

Since we omit the version number, we need to check our conversion table to prevent the issue of one ensembl id map to more than one gene symbol.

# sort by number of times each Ensembl ID appears in the data
n_occur <- data.frame(table(id_conversion$ensembl_gene_id))

# filter out only genes appears more than once in the data
kable(n_occur[n_occur$Freq > 1, ], format = "html") %>% kable_styling(full_width = F)

	Var1	Freq
14233	ENSG00000279010	2

There is one Ensembl id that map to 2 different gene symbol ENSG00000279010.

kable(id_conversion %>% dplyr::filter(ensembl_gene_id == 'ENSG00000279010'), 
      format = "html") %>% kable_styling(full_width = F)

ensembl_gene_id	hgnc_symbol
ENSG00000279010	MIR4534
ENSG00000279010	MIR6820

With the consideration of its version number:

kable((raw_data_hig %>% dplyr::filter(id == 'ENSG00000279010'))[, 1:6], 
      format = "html") %>% kable_styling(full_width = F)

	id	version	MCF10A_R1	MCF10A_R2	MCF10A_R3	MCF7_R1
ENSG00000279010.1	ENSG00000279010	1	223	241	212	39

We find the the proper Ensembl ID with version number is ENSG00000279010.1. After searching online, we will keep the row with gene symbol MIR4534.

# remove the row with gene symbol MIR6820
id_conversion <- subset(id_conversion, !(id_conversion$ensembl_gene_id == "ENSG00000279010" 
                                 & id_conversion$hgnc_symbol == "MIR6820"))

With no duplicate Ensembl ID, we will merge the conversion table and our cleaned raw data.

# merge raw data with mapping conversion using Ensembl ID
raw_data_hig_annot <- merge(id_conversion, raw_data_hig, by.x = 1, by.y = 1, all.y = TRUE)

Now we need to check the number of gene we were not able to match.

nrow(raw_data_hig) - length(which(raw_data_hig$id %in% id_conversion$ensembl_gene_id))

## [1] 3

There are 3 genes we were not able to match, we will print them out in table to investigate why.

# filter out rows that don't have HUGO symbol
ensembl_id_missing_gene <- raw_data_hig_annot$ensembl_gene_id[which(is.na(raw_data_hig_annot$hgnc_symbol))]
kable(raw_data_hig_annot[is.na(raw_data_hig_annot$hgnc_symbol), ][, 1:6], 
      format = "html") %>% kable_styling(full_width = F)

	ensembl_gene_id	hgnc_symbol	version	MCF10A_R1	MCF10A_R2	MCF10A_R3
1	alignment_not_unique	NA	NA	11805625	16487627	25094879
2	ambiguous	NA	NA	1018332	1805665	1281009
14278	no_feature	NA	NA	3019369	2326257	3102003

We will not filter those rows our right now since for our research purpose of differential expression, those rows might contribute to potential significant result. Also, those rows each represent unique element and there is no need to worried them right now. Notice that the alignment not unique row could explain why we did not find any replicate gene before.

Step 2: Apply Normalization

Normalization Process

In our data, the deregulated genes and the non-deregulated genes does not have similar behaviors, but the total expression are the same. Also, according to the paper, the experimental design and condition are similar and the cell amount is the same for each sample. Thus, with the above assumptions, we will use the normalize method total count normalization (Evans et al. 2017).

# convert data into DGEList object
filtered_data_matrix <- as.matrix(raw_data_hig_annot[, 4:12])
rownames(filtered_data_matrix) <- raw_data_hig_annot$hgnc_symbol
d <- DGEList(counts=filtered_data_matrix, group = sample_group$cell_line)

# normalize by count per million
nf <- calcNormFactors(d)
normalzied_data_annon <- cpm(nf)

Pre-normalization and Post-normalization Data Distribution

After our data is normalized, we will compare the pre-normalization and post-normalization data distribution to examine the quality change of the normalized data.

Boxplot

# draw pre-normalized boxplot
bplot_raw <- log2(cpm(raw_data_hig_annot[,4:12]))
boxplot(bplot_raw, xlab = "Samples", ylab = "log2 Counts per Million",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Pre-normalization box plot")
# draw the median on box plot
abline(h = median(apply(bplot_raw, 2, median)),
       col = "red", lwd = 2, lty = "dashed")

# draw post-normalized boxplot
bplot_norm <- log2(normalzied_data_annon)
boxplot(bplot_norm, xlab = "Samples", ylab = "log2 Counts per Million",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Post-normalization box plot")
# draw the median on each box plot
abline(h = median(apply(bplot_norm, 2, median)),
       col = "red", lwd = 2, lty = "dashed")

Density Plot

# pre-normalization plot
counts_density_raw <- apply(log2(cpm(raw_data_hig_annot[, 4:12])), 2, density)
# calculate the limits across all the samples
xlim <- 0
ylim <- 0
for (i in 1:length(counts_density_raw)) {
  xlim <- range(c(xlim, counts_density_raw[[i]]$x))
  ylim <- range(c(ylim, counts_density_raw[[i]]$y))
}
cols <- rainbow(length(counts_density_raw))
ltys <- rep(1, length(counts_density_raw))
# plot the first density plot to initialize the plot
plot(counts_density_raw[[1]], xlim = xlim, ylim = ylim, type = "n",
     ylab = "Smoothing density of log2-CPM",
     main = "Pre-normalization denstiy plot", cex.lab = 0.85)
# plot each line
for (i in 1:length(counts_density_raw)) {
  lines(counts_density_raw[[i]], col = cols[i], lty = ltys[i])
}
# create legend
legend("topright", colnames(bplot_raw),
       col = cols, lty = ltys, cex = 0.75,
       border = "blue",  text.col = "green4",
       merge = TRUE, bg = "gray90")

# post-normalization plot
counts_density_norm <- apply(log2(normalzied_data_annon), 2, density)
# calculate the limits across all the samples
xlim <- 0
ylim <- 0
for (i in 1:length(counts_density_norm)) {
  xlim <- range(c(xlim, counts_density_norm[[i]]$x))
  ylim <- range(c(ylim, counts_density_norm[[i]]$y))
}
cols <- rainbow(length(counts_density_norm))
ltys <- rep(1, length(counts_density_norm))
# plot the first density plot to initialize the plot
plot(counts_density_norm[[1]], xlim = xlim, ylim = ylim, type = "n",
     ylab = "Smoothing density of log2-CPM",
     main = "Post-normalization denstiy plot", cex.lab = 0.85)
# plot each line
for (i in 1:length(counts_density_norm)) {
  lines(counts_density_norm[[i]], col = cols[i], lty = ltys[i])
}
# create legend
legend("topright", colnames(bplot_norm),
       col = cols, lty = ltys, cex = 0.75,
       border ="blue",  text.col = "green4",
       merge = TRUE, bg = "gray90")

By comparing the pre-normalization and post-normalization data distribution, we can obersve that there are less variance within data. From the box plot, the post-normalization median for each sample are more close to the total median line and the data is less dispersed compare to the pre-normalization. Similar trend is also observed from the density plot.

Post Normalization MDS Plot

We will also use MDS plot to compare between single samples, and to investigate the distance between samples. Ideally, samples with similaire expression will cluster together.

plotMDS(nf, labels = rownames(sample_group), 
        col = c("darkgreen", "blue")[factor(sample_group$cell_line)], 
        main = "MDS Plot")

From this plot, the blue cluster is the luminal cell line, the black cluster is the basal-like metastatic, and the green cluster is the normal-like subtype. We can see that our data is strong that our data cluster together, which indicates there is little variation between replicates when grouping by cell lines.

Dispersion

# setup plot assumptions
model_design <- model.matrix(~sample_group$replicate + sample_group$cell_line + 0)
nf_d <- estimateDisp(nf, model_design)

# plot BCV plot to investigate dispersion
plotBCV(nf_d,col.tagwise = "black",col.common = "red",main = "BCV Plot")

# plot Mean-Variance plot to investigate relationship between mean and variance 
# of expression data
plotMeanVar(nf_d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
            show.ave.raw.vars = TRUE,
            NBline = TRUE,
            show.binned.common.disp.vars = TRUE, main = "Mean Variance")

Biological coefficient of variation(BCV) Plots can be used to to look at dispersion. Each of the dots in the given BCV plot is the BCV for a gene from our data set. Since the common dispersion is between 0.2 - 0.4, our data is considered reasonable and could detect more deregulated genes(Gu 2015). The Mean-Variance plot also shows a promising correlation between mean and variance of the sample data. The factors dispalyed by the two plot indicate that out data has been normalized to the point where it is suitable for further analysis.

Step 3: Interpret and Document

Final Prepared Data

Here display our final data cleaned and normalized. The row name is HUGO gene symbol, and each row represent a different gene. The column name is the sample name, and each column represent a different sample’s normalized count per million expression data.

kable(normalzied_data_annon[10:20, ], format = "html") %>% 
  kable_styling(full_width = F, font_size = 12)

	MCF10A_R1	MCF10A_R2	MCF10A_R3	MCF7_R1	MCF7_R2	MCF7_R3	MDA.MB.231_R1	MDA.MB.231_R2	MDA.MB.231_R3
NFYA	10.986435	13.949391	11.320310	14.847536	12.0786911	13.7190011	15.438415	12.7555456	14.800126
STPG1	4.350120	3.125453	5.703231	1.951391	2.0671585	1.9325528	3.751390	3.9455713	3.550030
NIPAL3	90.907983	98.764317	53.121028	7.324784	8.3699655	8.3885533	32.644311	31.0240812	35.000298
LAS1L	12.923349	11.778023	20.211147	17.110018	20.2054614	20.9820017	25.430099	24.8084551	28.625244
ENPP4	7.017347	11.679325	5.117400	10.775069	8.3699655	9.3442113	3.967817	4.2428404	4.050034
SEMA3F	28.386915	43.624745	14.438995	10.407416	15.3415590	14.0163169	1.082132	1.0269295	1.050009
ANKIB1	70.395738	73.727793	67.284338	94.175800	73.6678558	77.0897432	63.232572	64.8046573	68.200580
CYP51A1	6.509304	3.092554	5.823843	3.535128	6.6473333	6.3073426	4.581025	7.6208980	4.950042
KRIT1	40.484694	39.282010	37.458681	26.216507	21.1579756	22.1712650	16.664831	17.2956550	18.600158
RAD52	8.223950	7.435288	5.427546	3.506847	3.8100569	3.2280003	4.436741	5.4589411	5.450046
MYH16	1.301861	1.546277	1.533497	0.000000	0.0405325	0.0212368	1.478913	0.9458561	1.475013

Final coverage:

nrow(normalzied_data_annon)

## [1] 14278

Interpretation Questions

What are the control and test conditions of the dataset?

The control of the dataset is the cell line is normal-like subtype (MCF10A), and two test conditions are luminal(MCF7) and basal-like metastatic (MDA-MB-231).

Why is the dataset of interest to you?

Breast cancer is one of the most common type of the cancer, thus, investigate the potential genetic factors invovled in breast cancer’s mechanism becomes crucial. Moreover, previous studies also indicate that methlyation and acethylation of histone H3 lysine 4 is a well-established marker of active or poised transcription(Kimura 2013). With this information, I think it would be intersting to look at how the modification of histone H3 lysine 4 would affect gene expression of the two breast cancer subtypes luminal and basal-like metastatic compare to normal breast cancer cell line.

Were there expression values that were not unique for specific genes? How did you handle these?

There were no expression value that were not unique for specific genes.

Were there expression values that could not be mapped to current HUGO symbols?

There are 3 observations that could not mapped to current HUGO symbols, which is mention above. The 3 observerations are alignment not unique, no feathers, and ambiguous, which will not affect our result and we will not filtered out in current phases.

How many outliers were removed?

There are total of 6297 genes removed due to their low count number that they don’t have 3 read per million in each cell line. No other outliers were removed.

How did you handle replicates?

There are 3 biological replicate per each of the 3 cell lines like mentioned above. When we normalize our data, we group the data by cell line, and perform normalization across replicates for each cell line. Also, genes don’t have 3 read per million in each cell line(because we have 3 replciates) were removed.

What is the final coverage of your dataset?

The final coverage is 14278 genes, which is about 69% of the initial raw data.

Reference

Messier TL, Gordon JAR, Boyd JR, Tye CE, Browne G, Stein JL, Lian JB, Stein GS. 2016. Histone H3 lysine 4 acetylation and methylation dynamics define breast cancer subtypes. Oncotarget. 7(5):5094–5109. doi:https://doi.org/10.18632/oncotarget.6922.

Evans C, Hardin J, Stoebel DM. 2017. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in Bioinformatics. 19(5):776–792. doi:https://doi.org/10.1093/bib/bbx008.

Some key factors for number of significant DE genes. 2015 Nov 2. CVR Bioinformatics. https://bioinformatics.cvr.ac.uk/some-key-factors-for-number-of-significant-de-genes/.

Kimura H. Histone modifications for human epigenome analysis. J Hum Genet. 2013; 58:439-445.

Ruth Isserlin, Course Lectures (2023)

BCB420 Assignment 1: Data Selection and Initial Processing

Wenzhu Ye

2023-02-14