In this report, we are going to clean and normalize GEO data GSE75168 for future analysis. The data target methlyation and acethylation of histone H3 lysine 4 correlation with breast cancer sub-types. Using a genome-wide ChIP-Seq approach, the data provide raw genomic count data of three human mammary cell lines: normal-like sub-type (MCF10A), and two cancer sub-types: luminal(MCF7) and basal-like metastatic (MDA-MB-231)(Messier et al. 2016).
# install and loading required packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library(BiocManager)
if (!requireNamespace("GEOquery", quietly = TRUE))
BiocManager::install("GEOquery")
library(GEOquery)
if (!requireNamespace("knitr", quietly = TRUE))
install.packages("knitr")
library(knitr)
if (!requireNamespace("kableExtra", quietly = TRUE))
install.packages("kableExtra")
library(kableExtra)
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
library(edgeR)
if (!requireNamespace("tidyr", quietly = TRUE))
install.packages("tidyr")
library(tidyr)
if (!requireNamespace("biomaRt", quietly = TRUE))
BiocManager::install("biomaRt")
library(biomaRt)
Download data using GEOquery package.
# downloading data
raw_files <- getGEOSuppFiles('GSE75168')
# get file path of the downloaded file
fpath <- rownames(raw_files)
# read in data
raw_data <- read.delim(fpath[1], header = TRUE, check.names = TRUE)
A quick look over the data.
# only display the first 15 rows due to space limit
kable(raw_data[1:15, ], format = "html") %>% kable_styling(full_width = T, font_size = 12)
Ensembl_ID | MCF10A_R1 | MCF10A_R2 | MCF10A_R3 | MCF7_R1 | MCF7_R2 | MCF7_R3 | MDA.MB.231_R1 | MDA.MB.231_R2 | MDA.MB.231_R3 |
---|---|---|---|---|---|---|---|---|---|
ENSG00000000003.12 | 713 | 1084 | 982 | 651 | 750 | 662 | 564 | 709 | 720 |
ENSG00000000419.10 | 686 | 392 | 1242 | 2287 | 2618 | 2461 | 582 | 733 | 875 |
ENSG00000000457.11 | 498 | 396 | 469 | 285 | 422 | 408 | 224 | 346 | 294 |
ENSG00000000460.14 | 347 | 69 | 746 | 392 | 636 | 568 | 650 | 796 | 959 |
ENSG00000000971.13 | 49 | 562 | 46 | 1 | 1 | 1 | 550 | 771 | 839 |
ENSG00000001036.11 | 3007 | 977 | 3913 | 2396 | 3245 | 3066 | 1579 | 1917 | 2207 |
ENSG00000001084.8 | 950 | 1397 | 3065 | 1134 | 1282 | 1309 | 435 | 577 | 713 |
ENSG00000001167.12 | 346 | 424 | 657 | 525 | 596 | 646 | 428 | 472 | 592 |
ENSG00000001460.15 | 137 | 95 | 331 | 69 | 102 | 91 | 104 | 146 | 142 |
ENSG00000001461.14 | 2863 | 3002 | 3083 | 259 | 413 | 395 | 905 | 1148 | 1400 |
ENSG00000001497.14 | 407 | 358 | 1173 | 605 | 997 | 988 | 705 | 918 | 1145 |
ENSG00000001561.6 | 221 | 355 | 297 | 381 | 413 | 440 | 110 | 157 | 162 |
ENSG00000001617.9 | 894 | 1326 | 838 | 368 | 757 | 660 | 30 | 38 | 42 |
ENSG00000001629.7 | 2217 | 2241 | 3905 | 3330 | 3635 | 3630 | 1753 | 2398 | 2728 |
ENSG00000001630.13 | 205 | 94 | 338 | 125 | 328 | 297 | 127 | 282 | 198 |
Each row represent a difference gene, in which the gene’s
Ensembl ID is the rowname, and each column repsence a different
sample.
To look at the total coverage of the raw data:
dim(raw_data)
## [1] 20575 10
We can see that there are 20575 rows in total. Since each row represent a different gene, the inital coverage of the raw data is 20575 genes. Moreover, we have 10 columns, with one column containing gene Ensembl ID, we have 9 difference samples, which display below:
colnames(raw_data)
## [1] "Ensembl_ID" "MCF10A_R1" "MCF10A_R2" "MCF10A_R3"
## [5] "MCF7_R1" "MCF7_R2" "MCF7_R3" "MDA.MB.231_R1"
## [9] "MDA.MB.231_R2" "MDA.MB.231_R3"
We can see that we have total of 9 samples, including 3 replicates of each cell line.
Contact information about dataset:
# getting GEO SOFT format file
gse <- getGEO("GSE75168", GSEMatrix = FALSE)
kable(data.frame(head(Meta(gse))), format = "html") %>% kable_styling(full_width = F)
contact_address | contact_city | contact_country | contact_department | contact_email | contact_institute |
---|---|---|---|---|---|
89 Beaumont Ave Given E209 | Burlington | USA | Biochemistry | Jonathan.A.Gordon@med.uvm.edu | University of Vermont |
Platform Information:
# getting platform information
gpl_info <- Meta(getGEO(names(GPLList(gse))[1]))
Category <- c("Platform title", "Submission data", "Last update data", "Organims", "Number of GEO datasets that use this techology", "Number of GEO samples that use this technology")
Infomation <- c(gpl_info$title, gpl_info$submission_date, gpl_info$last_update_date, gpl_info$organism, length(gpl_info$series_id), length(gpl_info$sample_id))
kable(data.frame(Category, Infomation), format = "html") %>% kable_styling(full_width = F)
Category | Infomation |
---|---|
Platform title | Illumina HiSeq 1500 (Homo sapiens) |
Submission data | Mar 24 2014 |
Last update data | Oct 02 2018 |
Organims | Homo sapiens |
Number of GEO datasets that use this techology | 329 |
Number of GEO samples that use this technology | 6407 |
We need to remove gene without at least 1 read per million in n of samples. Since we have 3 replciates for each different cell line, thus, we need to filter out genes that don’t have 3 read per million in each cell line.
# convert count to count per miliion
cpms <- cpm(raw_data[, 2:10])
rownames(cpms) <- raw_data[, 1]
# only keep rows with at least 3 read per million
keep <- rowSums(cpms > 1) >= 3
raw_data_hig <- raw_data[keep, ]
rownames(raw_data_hig) <- raw_data_hig[, 1]
Number of genes we filtered out:
nrow(raw_data) - nrow(raw_data_hig)
## [1] 6297
# sort by number of times each Ensembl ID appears in the data
summarized_gene_counts <- sort(table(raw_data_hig$Ensembl_ID), decreasing = TRUE)
kable(summarized_gene_counts[which(summarized_gene_counts > 1)[1: 10]], format = "html") %>% kable_styling(full_width = F)
Var1 | Freq |
---|---|
NA | NA |
NA | NA |
NA | NA |
NA | NA |
NA | NA |
NA | NA |
NA | NA |
NA | NA |
NA | NA |
NA | NA |
Since the table is empty, there are no duplicate genes in our
data.
We than create a data frame containing the group infomation, whcih stored the infomation of each sample’s cell line and replicate.
# split the sample name into cell line and replicate
sample_group <- data.frame(lapply(colnames(raw_data)[2:10],
FUN=function(x){unlist(strsplit(x, split = "\\_"))}))
colnames(sample_group) <- colnames(raw_data)[2:10]
rownames(sample_group) <- c("cell_line", "replicate")
# transfrom into dataframe
sample_group <- data.frame(t(sample_group))
kable(sample_group, format = "html") %>% kable_styling(full_width = F)
cell_line | replicate | |
---|---|---|
MCF10A_R1 | MCF10A | R1 |
MCF10A_R2 | MCF10A | R2 |
MCF10A_R3 | MCF10A | R3 |
MCF7_R1 | MCF7 | R1 |
MCF7_R2 | MCF7 | R2 |
MCF7_R3 | MCF7 | R3 |
MDA.MB.231_R1 | MDA.MB.231 | R1 |
MDA.MB.231_R2 | MDA.MB.231 | R2 |
MDA.MB.231_R3 | MDA.MB.231 | R3 |
By the data displayed above, we see that our data only the Ensembl ID. We need to convert ensemble id to HUGO gene symbol. First, we need to get a conversion table of Ensembl ID map to corresponding gene symbol.
# select Mart and Dataset
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# use desired data to convert ensemble id to HUGO gene symbol
id_conversion <- getBM(attributes = c("ensembl_gene_id_version","hgnc_symbol"),
filters = c("ensembl_gene_id_version"),
values = raw_data_hig$Ensembl_ID,
mart = ensembl)
kable(id_conversion[1:10, ], format = "html") %>% kable_styling(full_width = F)
ensembl_gene_id_version | hgnc_symbol |
---|---|
ENSG00000019995.6 | ZRANB1 |
ENSG00000064195.7 | DLX3 |
ENSG00000065491.8 | TBC1D22B |
ENSG00000073905.8 | VDAC1P1 |
ENSG00000076641.4 | PAG1 |
ENSG00000102096.9 | PIM2 |
ENSG00000105988.6 | NHP2P1 |
ENSG00000106125.14 | MINDY4 |
ENSG00000108958.4 | |
ENSG00000109089.7 | CDR2L |
However, consider the number of map we get:
nrow(id_conversion)
## [1] 1532
We only get about 10% match. In order to resolve this issue, since the Ensembl ID of our data include the version number, we will drop the version and map with only the Ensembl ID.
Another important factor we need to consider is that due to the time of submission data date and the context of the paper, we will not using the latest Ensembl-HUGO conversion version, instead, we will use the Ensembl 77(published on 2014 Oct), which is the version the data used.
With the consideration of the above two factors, we will remap our data.
# select Mart and Dataset with wanted version
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host="https://oct2014.archive.ensembl.org")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# split current Ensembl ID with version into two columns: Ensembl ID and version number
raw_data_hig <- raw_data_hig %>% tidyr::separate(Ensembl_ID, c("id", "version"), "\\.")
# use desired data to convert ensemble id to HUGO gene symbol
id_conversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
filters = c("ensembl_gene_id"),
values = raw_data_hig$id,
mart = ensembl)
kable(id_conversion[1:10, ], format = "html") %>% kable_styling(full_width = F)
ensembl_gene_id | hgnc_symbol |
---|---|
ENSG00000000003 | TSPAN6 |
ENSG00000000419 | DPM1 |
ENSG00000000457 | SCYL3 |
ENSG00000000460 | C1orf112 |
ENSG00000000971 | CFH |
ENSG00000001036 | FUCA2 |
ENSG00000001084 | GCLC |
ENSG00000001167 | NFYA |
ENSG00000001460 | STPG1 |
ENSG00000001461 | NIPAL3 |
To check again the number of match we got:
nrow(id_conversion)
## [1] 14276
Which is a much better match result(~99%).
Since we omit the version number, we need to check our conversion table to prevent the issue of one ensembl id map to more than one gene symbol.
# sort by number of times each Ensembl ID appears in the data
n_occur <- data.frame(table(id_conversion$ensembl_gene_id))
# filter out only genes appears more than once in the data
kable(n_occur[n_occur$Freq > 1, ], format = "html") %>% kable_styling(full_width = F)
Var1 | Freq | |
---|---|---|
14233 | ENSG00000279010 | 2 |
There is one Ensembl id that map to 2 different gene symbol
ENSG00000279010.
kable(id_conversion %>% dplyr::filter(ensembl_gene_id == 'ENSG00000279010'),
format = "html") %>% kable_styling(full_width = F)
ensembl_gene_id | hgnc_symbol |
---|---|
ENSG00000279010 | MIR4534 |
ENSG00000279010 | MIR6820 |
With the consideration of its version number:
kable((raw_data_hig %>% dplyr::filter(id == 'ENSG00000279010'))[, 1:6],
format = "html") %>% kable_styling(full_width = F)
id | version | MCF10A_R1 | MCF10A_R2 | MCF10A_R3 | MCF7_R1 | |
---|---|---|---|---|---|---|
ENSG00000279010.1 | ENSG00000279010 | 1 | 223 | 241 | 212 | 39 |
We find the the proper Ensembl ID with version number is
ENSG00000279010.1. After searching online, we will keep
the row with gene symbol MIR4534.
# remove the row with gene symbol MIR6820
id_conversion <- subset(id_conversion, !(id_conversion$ensembl_gene_id == "ENSG00000279010"
& id_conversion$hgnc_symbol == "MIR6820"))
With no duplicate Ensembl ID, we will merge the conversion table and our cleaned raw data.
# merge raw data with mapping conversion using Ensembl ID
raw_data_hig_annot <- merge(id_conversion, raw_data_hig, by.x = 1, by.y = 1, all.y = TRUE)
Now we need to check the number of gene we were not able to match.
nrow(raw_data_hig) - length(which(raw_data_hig$id %in% id_conversion$ensembl_gene_id))
## [1] 3
There are 3 genes we were not able to match, we will print them out in table to investigate why.
# filter out rows that don't have HUGO symbol
ensembl_id_missing_gene <- raw_data_hig_annot$ensembl_gene_id[which(is.na(raw_data_hig_annot$hgnc_symbol))]
kable(raw_data_hig_annot[is.na(raw_data_hig_annot$hgnc_symbol), ][, 1:6],
format = "html") %>% kable_styling(full_width = F)
ensembl_gene_id | hgnc_symbol | version | MCF10A_R1 | MCF10A_R2 | MCF10A_R3 | |
---|---|---|---|---|---|---|
1 | alignment_not_unique | NA | NA | 11805625 | 16487627 | 25094879 |
2 | ambiguous | NA | NA | 1018332 | 1805665 | 1281009 |
14278 | no_feature | NA | NA | 3019369 | 2326257 | 3102003 |
We will not filter those rows our right now since for our
research purpose of differential expression, those rows might contribute
to potential significant result. Also, those rows each represent unique
element and there is no need to worried them right now. Notice that the
alignment not unique row could explain why we did not find any replicate
gene before.
In our data, the deregulated genes and the non-deregulated genes does not have similar behaviors, but the total expression are the same. Also, according to the paper, the experimental design and condition are similar and the cell amount is the same for each sample. Thus, with the above assumptions, we will use the normalize method total count normalization (Evans et al. 2017).
# convert data into DGEList object
filtered_data_matrix <- as.matrix(raw_data_hig_annot[, 4:12])
rownames(filtered_data_matrix) <- raw_data_hig_annot$hgnc_symbol
d <- DGEList(counts=filtered_data_matrix, group = sample_group$cell_line)
# normalize by count per million
nf <- calcNormFactors(d)
normalzied_data_annon <- cpm(nf)
After our data is normalized, we will compare the pre-normalization and post-normalization data distribution to examine the quality change of the normalized data.
# draw pre-normalized boxplot
bplot_raw <- log2(cpm(raw_data_hig_annot[,4:12]))
boxplot(bplot_raw, xlab = "Samples", ylab = "log2 Counts per Million",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Pre-normalization box plot")
# draw the median on box plot
abline(h = median(apply(bplot_raw, 2, median)),
col = "red", lwd = 2, lty = "dashed")
# draw post-normalized boxplot
bplot_norm <- log2(normalzied_data_annon)
boxplot(bplot_norm, xlab = "Samples", ylab = "log2 Counts per Million",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Post-normalization box plot")
# draw the median on each box plot
abline(h = median(apply(bplot_norm, 2, median)),
col = "red", lwd = 2, lty = "dashed")
# pre-normalization plot
counts_density_raw <- apply(log2(cpm(raw_data_hig_annot[, 4:12])), 2, density)
# calculate the limits across all the samples
xlim <- 0
ylim <- 0
for (i in 1:length(counts_density_raw)) {
xlim <- range(c(xlim, counts_density_raw[[i]]$x))
ylim <- range(c(ylim, counts_density_raw[[i]]$y))
}
cols <- rainbow(length(counts_density_raw))
ltys <- rep(1, length(counts_density_raw))
# plot the first density plot to initialize the plot
plot(counts_density_raw[[1]], xlim = xlim, ylim = ylim, type = "n",
ylab = "Smoothing density of log2-CPM",
main = "Pre-normalization denstiy plot", cex.lab = 0.85)
# plot each line
for (i in 1:length(counts_density_raw)) {
lines(counts_density_raw[[i]], col = cols[i], lty = ltys[i])
}
# create legend
legend("topright", colnames(bplot_raw),
col = cols, lty = ltys, cex = 0.75,
border = "blue", text.col = "green4",
merge = TRUE, bg = "gray90")
# post-normalization plot
counts_density_norm <- apply(log2(normalzied_data_annon), 2, density)
# calculate the limits across all the samples
xlim <- 0
ylim <- 0
for (i in 1:length(counts_density_norm)) {
xlim <- range(c(xlim, counts_density_norm[[i]]$x))
ylim <- range(c(ylim, counts_density_norm[[i]]$y))
}
cols <- rainbow(length(counts_density_norm))
ltys <- rep(1, length(counts_density_norm))
# plot the first density plot to initialize the plot
plot(counts_density_norm[[1]], xlim = xlim, ylim = ylim, type = "n",
ylab = "Smoothing density of log2-CPM",
main = "Post-normalization denstiy plot", cex.lab = 0.85)
# plot each line
for (i in 1:length(counts_density_norm)) {
lines(counts_density_norm[[i]], col = cols[i], lty = ltys[i])
}
# create legend
legend("topright", colnames(bplot_norm),
col = cols, lty = ltys, cex = 0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
By comparing the pre-normalization and post-normalization data distribution, we can obersve that there are less variance within data. From the box plot, the post-normalization median for each sample are more close to the total median line and the data is less dispersed compare to the pre-normalization. Similar trend is also observed from the density plot.
We will also use MDS plot to compare between single samples, and to investigate the distance between samples. Ideally, samples with similaire expression will cluster together.
plotMDS(nf, labels = rownames(sample_group),
col = c("darkgreen", "blue")[factor(sample_group$cell_line)],
main = "MDS Plot")
From this plot, the blue cluster is the luminal cell line, the black cluster is the basal-like metastatic, and the green cluster is the normal-like subtype. We can see that our data is strong that our data cluster together, which indicates there is little variation between replicates when grouping by cell lines.
# setup plot assumptions
model_design <- model.matrix(~sample_group$replicate + sample_group$cell_line + 0)
nf_d <- estimateDisp(nf, model_design)
# plot BCV plot to investigate dispersion
plotBCV(nf_d,col.tagwise = "black",col.common = "red",main = "BCV Plot")
# plot Mean-Variance plot to investigate relationship between mean and variance
# of expression data
plotMeanVar(nf_d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE,
NBline = TRUE,
show.binned.common.disp.vars = TRUE, main = "Mean Variance")
Biological coefficient of variation(BCV) Plots can be used to to look at dispersion. Each of the dots in the given BCV plot is the BCV for a gene from our data set. Since the common dispersion is between 0.2 - 0.4, our data is considered reasonable and could detect more deregulated genes(Gu 2015). The Mean-Variance plot also shows a promising correlation between mean and variance of the sample data. The factors dispalyed by the two plot indicate that out data has been normalized to the point where it is suitable for further analysis.
Here display our final data cleaned and normalized. The row name is HUGO gene symbol, and each row represent a different gene. The column name is the sample name, and each column represent a different sample’s normalized count per million expression data.
kable(normalzied_data_annon[10:20, ], format = "html") %>%
kable_styling(full_width = F, font_size = 12)
MCF10A_R1 | MCF10A_R2 | MCF10A_R3 | MCF7_R1 | MCF7_R2 | MCF7_R3 | MDA.MB.231_R1 | MDA.MB.231_R2 | MDA.MB.231_R3 | |
---|---|---|---|---|---|---|---|---|---|
NFYA | 10.986435 | 13.949391 | 11.320310 | 14.847536 | 12.0786911 | 13.7190011 | 15.438415 | 12.7555456 | 14.800126 |
STPG1 | 4.350120 | 3.125453 | 5.703231 | 1.951391 | 2.0671585 | 1.9325528 | 3.751390 | 3.9455713 | 3.550030 |
NIPAL3 | 90.907983 | 98.764317 | 53.121028 | 7.324784 | 8.3699655 | 8.3885533 | 32.644311 | 31.0240812 | 35.000298 |
LAS1L | 12.923349 | 11.778023 | 20.211147 | 17.110018 | 20.2054614 | 20.9820017 | 25.430099 | 24.8084551 | 28.625244 |
ENPP4 | 7.017347 | 11.679325 | 5.117400 | 10.775069 | 8.3699655 | 9.3442113 | 3.967817 | 4.2428404 | 4.050034 |
SEMA3F | 28.386915 | 43.624745 | 14.438995 | 10.407416 | 15.3415590 | 14.0163169 | 1.082132 | 1.0269295 | 1.050009 |
ANKIB1 | 70.395738 | 73.727793 | 67.284338 | 94.175800 | 73.6678558 | 77.0897432 | 63.232572 | 64.8046573 | 68.200580 |
CYP51A1 | 6.509304 | 3.092554 | 5.823843 | 3.535128 | 6.6473333 | 6.3073426 | 4.581025 | 7.6208980 | 4.950042 |
KRIT1 | 40.484694 | 39.282010 | 37.458681 | 26.216507 | 21.1579756 | 22.1712650 | 16.664831 | 17.2956550 | 18.600158 |
RAD52 | 8.223950 | 7.435288 | 5.427546 | 3.506847 | 3.8100569 | 3.2280003 | 4.436741 | 5.4589411 | 5.450046 |
MYH16 | 1.301861 | 1.546277 | 1.533497 | 0.000000 | 0.0405325 | 0.0212368 | 1.478913 | 0.9458561 | 1.475013 |
Final coverage:
nrow(normalzied_data_annon)
## [1] 14278
What are the control and test conditions of the dataset?
The control of the dataset is the cell line is normal-like subtype (MCF10A), and two test conditions are luminal(MCF7) and basal-like metastatic (MDA-MB-231).
Why is the dataset of interest to you?
Breast cancer is one of the most common type of the cancer, thus, investigate the potential genetic factors invovled in breast cancer’s mechanism becomes crucial. Moreover, previous studies also indicate that methlyation and acethylation of histone H3 lysine 4 is a well-established marker of active or poised transcription(Kimura 2013). With this information, I think it would be intersting to look at how the modification of histone H3 lysine 4 would affect gene expression of the two breast cancer subtypes luminal and basal-like metastatic compare to normal breast cancer cell line.
Were there expression values that were not unique for specific genes? How did you handle these?
There were no expression value that were not unique for specific genes.
Were there expression values that could not be mapped to current HUGO symbols?
There are 3 observations that could not mapped to current HUGO symbols, which is mention above. The 3 observerations are alignment not unique, no feathers, and ambiguous, which will not affect our result and we will not filtered out in current phases.
How many outliers were removed?
There are total of 6297 genes removed due to their low count number that they don’t have 3 read per million in each cell line. No other outliers were removed.
How did you handle replicates?
There are 3 biological replicate per each of the 3 cell lines like mentioned above. When we normalize our data, we group the data by cell line, and perform normalization across replicates for each cell line. Also, genes don’t have 3 read per million in each cell line(because we have 3 replciates) were removed.
What is the final coverage of your dataset?
The final coverage is 14278 genes, which is about 69% of the initial raw data.
Messier TL, Gordon JAR, Boyd JR, Tye CE, Browne G, Stein JL, Lian JB, Stein GS. 2016. Histone H3 lysine 4 acetylation and methylation dynamics define breast cancer subtypes. Oncotarget. 7(5):5094–5109. doi:https://doi.org/10.18632/oncotarget.6922.
Evans C, Hardin J, Stoebel DM. 2017. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in Bioinformatics. 19(5):776–792. doi:https://doi.org/10.1093/bib/bbx008.
Some key factors for number of significant DE genes. 2015 Nov 2. CVR Bioinformatics. https://bioinformatics.cvr.ac.uk/some-key-factors-for-number-of-significant-de-genes/.
Kimura H. Histone modifications for human epigenome analysis. J Hum Genet. 2013; 58:439-445.
Ruth Isserlin, Course Lectures (2023)