Differential gene expression (DGE) analysis is a critical tool in the study of biological systems, particularly in understanding how specific conditions or treatments alter cellular processes at the molecular level. This research aims to elucidate changes in gene expression profiles among different developmental stages of Saccharomyces cerevisiae biofilms on sherry velum to identify key genes involved in biofilm maturation. The primary goal is to uncover significant alterations in gene activity that correlate with distinct phases of biofilm formation, which can provide insights into the underlying mechanisms governing these transitions.
The analysis presented here faces several challenges inherent to DGE studies: 1. High Dimensionality: Gene expression datasets typically contain thousands of genes, necessitating robust statistical methods for identifying biologically significant changes amidst noise. 2. Small Sample Size: The study utilizes three replicates per biofilm stage, which poses limitations in terms of the reliability and power of hypothesis tests. 3. Multiple Testing Issues: With a large number of genes being tested simultaneously, controlling false discovery rates (FDR) is crucial to avoid misdirected interpretation.
To address these challenges, my approach aims to employ stringent filtering criteria and advanced statistical models. edgeR is used for normalization and differential expression analysis, leveraging the Trimmed Mean of M-values (TMM) method to correct for library size differences. - Generalized Linear Models (GLMs) with quasi-likelihood tests are applied to account for overdispersion in count data, offering a balance between sensitivity and specificity.
Our analysis workflow involves several stages: 1. Quality Control and Read Alignment: FastQC is used for initial quality control checks, followed by alignment of RNA-seq reads to the yeast genome using STAR1. 2. Feature Counting: featureCounts quantifies gene expression levels across all samples. 3. Differential Expression Analysis: DGE analysis with edgeR includes filtering low-expression genes and normalization before estimating dispersion and fitting GLMs.
By carefully addressing these considerations, this assignment aims to provide a detailed and reliable analysis of gene expression dynamics during biofilm development.
Input files were retrieved from Dr. Lukens’ directory on the graham cluster of the digital research alliance of Canada (DRAC). These included the Saccharomyces cerevisiae reference genome (version undisclosed) and output files from STAR indexing (see Appendix A). The raw data files used by Dr. Lukens’ were pulled from Mardanov et al., 20204.
All data retrieval/processing was performed as per Appendix A (see Appendix A: Data Retrieval/Processing). The reasoning behind much of the pipeline is that it was the pipeline taught/discussed in BINF*6110. Specifics are described in the discussion. Read and execution permissions to relevant project files were provided to the course instructor and TA via the bash command “chmod 755 /scratch/nzeidenb/A2/ -R”.
Figures and tables (for requested lists 1 and 2 in the assignment outline) were generated as per Appendices B and C, respectively (see Appendix B: Figure Generation and Appendix C: Generating Tables). Reasoning behind which packages were used to generate figures was primarily based on a desire for figures to be visually appealing, interactive and detailed. Though ggplot2 can create visually appealing and detailed figures, they are not interactive.
Lastly, the main methodological difference in this pipeline and that discussed in class is the emphasis on false-discovery rate (FDR) over p-values as a measure of confidence. Reasoning is discussed below (see Discussion).
FastQC (aggregated by MultiQC above) indicates that the majority of reads across all samples have a high per-sequence PHRED (quality) score. Few unknown nucleotides (“N”) were found. Sequence GC content and per-base sequence content warn of bias in both categories, and all samples had high duplication levels. Less than 1% of sequences had polyA fragments, and no Illumina adapters were detected.
Table 1 shows a searchable list of genes and their \(log_2(\)fold change\()\) (magnitude + direction) as well as raw p-value between all three treatment groups, as per the assignment instructions. The greatest upregulatory change was a \(2^8\) fold-change between early to thin biofilms, and was for the gene with the alias YJL217W. The greatest negative change in expression (\(2^{12}\)) was between early to mature biofilms, and was for the gene with the alias YCR105W. The gene alias YJL217W corresponds to gene REE1, which is a cytoplasmic protein involved in enolase 1 regulation5. The gene alias YCR105W corresponds to the gene ADH7, which is a medium-chain alcohol dehydrogenase5.
Table 2 shows a searchable list of genes and their \(log_2(\)fold change\()\) (magnitude + direction) as well as raw p-value between early vs late & mature treatment groups, as per the assignment instructions. The greatest upregulatory difference was a \(2^{15}\) fold-change between early to later biofilms and was for the gene with the alias YOL158C. The greatest downregulatory difference (\(2^{18}\) fold-change) was for the gene with the alias YCR106W. The gene alias YOL158C corresponds to gene ENB1, which is an enterobactin transmembrane transport protein5. The gene alias YCR106W corresponds to the gene RDS1, which is a transcription factor involved in cyclohexamide resistance5.
Figure 1: Heatmap of pairwise differential gene expression between early, thin and mature biofilms. The details of the plot can be magnified by selecting a rectangular area via mouse. The orignal view can be restored via double-clicking the plot. The dendrogram generated groups genes together based on how similar logFC values are across pairwise comparisons, using Euclidean clustering (the default).
A heatmap of pairwise differential gene expression shows a visual representation of Table 1. Differences worth noting include genes with aliases YHR180W and YFL055W being downregulated in both early to late and early to mature biofilms, with no significant difference between late and mature biofilms. The gene with alias YOR385W and YOR386W had significant downregulation from early to late, and early to mature biofilms but significant upregulation from late to mature biofilms. YHR180W and YOR385W correspond to conserved proteins of unknown function5. YFL055W corresponds to an amino acid permease believed to supply the cell with nitrogen from amino acids in nitrogen-poor conditions, while YOR385W corresponds to a DNA photolyase involved in photoreactivation5.
Figure 2: Volcano plot comparing gene expression between early vs thin and mature biofilms. The details of the plot can be magnified by selecting a rectangular area via mouse. The orignal view can be restored via double-clicking the plot.
The differential expression in figure 2 compares early biofilm expression to an average of thin and mature expression. Twelve genes are downregulated by more than \(2^{10}\) times in magnitude in thin/mature biofilms compared to early biofilms, while only 7 genes are upregulated by more than \(2^{10}\) times. The downregulated genes are involved in , while the upregulated genes are involved in
The method described in appendices A through C appear to capture differential gene expression well. Though the data was total RNA rather than isolated mRNA, adapters had been removed beforehand which suggests some data manipulation of even the raw files. As such, we cannot be sure of the quality of the raw data as it may have been tampered with in other ways undisclosed by the authors. That said, the STAR procedure for RNA seq is well-established and validated by the bioinformatics community1. The pipeline involved filtering out low-expression genes which should have improved the quality of the end analysis, though a comparison was outside the scope of this assignment. A negative binomial general linear model (GLM) incorporating quasi-likelihood dispersion estimates was used rather than a simple negative binomial GLM, as it performs better for small sample sizes (e.g. 3 replicates over 3 treatments) and accounts for gene-specific variability with greater accuracy3. Moreover, the usefulness of p-values has long been under scrutiny in the scientific community, in the sense that they can be misleading to both author and reader2. FDR-adjusted p-values are calculated as \(q=(m\times p_r)/r\) where \(q\) is the expected number of false positives \((m\times p_r)\) over the number of selections of \(r\) genes6. I used these q-values in quasi-likelihood F-tests rather than p-values and likelihood-ratio tests to be more conservative in my estimates, particularly given the primary concern of false positives when performing multiple testing on so many genes.
Regarding the results of the analysis, it is very hard to attribute downregulated and upregulated genes to a pathway without further work, though it could be done with the KEGG database. In general, looking at the genes with a fold-change greater than \(2^{10}\) for list 2 (early vs. other biofilms), the 7 genes that are upregulated are involved multiple pathways including glucose metabolism, apoptosis, secretory pathways and protein/vitamin trafficking. That is to say, there is no simple way to visually inspect the tables/figures and extract a particular pathway that is upregulated or downregulated between treatments. In that sense, the chosen method does not sufficiently address the hypothesis that there is some difference (assuming that were the hypothesis, though more specific) between treatments.
# run fastqc to check quality of input files
system("for f in raw_files/*.fastq; do fastqc $f; done")
# get multiqc for more readable qc results
system("multiqc .")
# indexing the reference genome was done in advance by Dr. Lukens. The output files were copied from his directory
# the bash command run was "STAR --runThreadN 2 --runMode genomeGenerate --genomeDir Genome --genomeFastaFiles Genome/yeast_genome.fa --sjdbGTFfile Genome/genomic.gtf --sjdbOverhang 49 --genomeSAindexNbases 10"
# If fastqc indicated Illumina adapters were present in the raw data, we would trim those however they appear to have been trimmed by the original authors.
# system("for f in raw_files/*.fastq; do
# trimmed_f=trimmed_files/$(basename $f)
# fastp -i $f -o $trimmed_f --detect_adapter_for_pe --trim_poly_g --cut_right --cut_mean_quality 20
# done")
# sample reads are aligned to the reference genome and sorted bam files generated
system("for i in ../raw_files/*.fastq; do
j=$(basename $i .fastq) # Remove .fastq for cleaner output names
STAR --runThreadN 8 --runMode alignReads \
--genomeDir ../Genome/Assignment_2_Genome/ \
--readFilesIn $i \
--outFileNamePrefix $j \
--outSAMtype BAM SortedByCoordinate
done")
# process .bam files
gtf.file <- file.path("/scratch/nzeidenb/A2/Genome/Assignment_2_Genome/genomic.gtf")
bam.files<- list.files("/scratch/nzeidenb/A2/aligned_reads",pattern="*.bam", full.names=TRUE)
fc <- featureCounts(bam.files, annot.ext=gtf.file, isGTFAnnotationFile=TRUE,isPaired=FALSE)
write.csv(fc$counts, file = "featureCounts_matrix.csv", row.names = TRUE)
# use edgeR for feature counts
if (exists(fc)){
counts <- fc$counts
} else {
fc_matrix <- read.csv("featureCounts_matrix.csv")
counts <- fc_matrix[, -1]
genes <- fc_matrix[, 1]
}
# create sample treatment matrix for early vs thin biofilm samples
group <- factor(rep(c("early", "thin", "mature"), each = 3))
# create design matrix
design <- model.matrix(~0 + group) # No intercept; avoids redundant coefficients
colnames(design) <- levels(group) # Naming the columns for clarity
# create DGEList obj
dge <- DGEList(counts = counts, group = group, gene = genes)
# filter out low expr genes
keep <- filterByExpr(dge)
# how many rows are we dropping?
summary(keep) # 698
# drop
dge <- dge[keep, , keep.lib.sizes=FALSE]
# perform TMM normalization
dge <- calcNormFactors(dge, method="TMM")
# calculate dispersion
dge <- estimateDisp(dge, design, robust = TRUE)
# fit generalized linear model
fit <- glmQLFit(dge, design)
# counts per million on a log2 scale
CPM <- cpm(fit, log=TRUE)
# Pairwise comparison tests:
# using Quasi-Likelihood F-test rather than Likelihood Ratio Test for a few reasons. First, the QLF test is more robust at accounting for false positives which is a primary concern in DE since there are so many genes being tested. Second, though QLF is better at detecting small differences in expression between genes/samples, it is less statistically robust for smaller sample sizes (e.g. 3 replicates per group) when accounting for dispersion (https://www.biostars.org/p/247174/#:~:text=While%20the%20likelihood%20ratio%20test,number%20of%20replicates%20is%20small.) (https://support.bioconductor.org/p/84291/)
# early vs thin biofilm
qlf_early_vs_thin <- glmQLFTest(fit, contrast = c(1, -1, 0))
# thin vs mature biofilm
qlf_thin_vs_mature <- glmQLFTest(fit, contrast = c(0, 1, -1))
# and early vs mature biofilm
qlf_early_vs_mature <- glmQLFTest(fit, contrast = c(1, 0, -1))
# save results if not already done
if (!file.exists("DEG_Early_vs_Thin.csv")){
write.csv(topTags(qlf_early_vs_thin, n=Inf)$table, "DEG_Early_vs_Thin.csv")
}
if (!file.exists("DEG_Thin_vs_Mature.csv")){
write.csv(topTags(qlf_thin_vs_mature, n=Inf)$table, "DEG_Thin_vs_Mature.csv")
}
if (!file.exists("DEG_Early_vs_Mature.csv")){
write.csv(topTags(qlf_early_vs_mature, n=Inf)$table, "DEG_Early_vs_Mature.csv")
}
# these ^ pairwise comparisons will be used for calculating logFC
# list 1 (thin vs. late + mature biofilms):
contrast_early_vs_others <- c(2, -1, -1) # Initially I did c(1, -1, -1) which would test early vs (thin + mature *combined*)... though the assignment does not specify to average late + mature, I decided it was the most statistically interpretable and a reasonable assumption to make on my part for the assignment, so I changed the contrast to c(2, -1, -1).
qlf_early_vs_others <- glmQLFTest(fit, contrast = contrast_early_vs_others)
# list 2: differences across all groups (i.e. ANOVA)
qlf_anova <- glmQLFTest(fit)
QLFTest <- topTags(qlf_anova, n=Inf)$table[, c("genes", "logCPM", "F", "PValue", "FDR")]
# save list 1 and list 2
if (!file.exists("DEG_All_Groups.csv")){
write.csv(QLFTest, "DEG_All_Groups.csv")
}
if (!file.exists("DEG_Early_vs_Others.csv")){
write.csv(topTags(qlf_early_vs_others, n=Inf)$table, "DEG_Early_vs_Others.csv")
}# 1. generate volcano plot
generate_volcano <- function(DGE_data) {
# Add logP for y axis
DGE_data$logP <- -log10(DGE_data$FDR) # Compute -log10(FDR) for visualization
# Define significance thresholds
DGE_data$Significance <- dplyr::case_when( # vectorized multi if-else
DGE_data$FDR < 0.05 & DGE_data$logFC > 0 ~ "Upregulated", # Upregulated if logFC > 0 and FDR < 0.05
DGE_data$FDR < 0.05 & DGE_data$logFC < 0 ~ "Downregulated", # Downregulated if logFC < 0 and FDR < 0.05
TRUE ~ "Not Significant" # Otherwise, Not Significant
)
# Define colors for the points
color_map <- c("Upregulated" = "#c46666", "Downregulated" = "#1a80bb", "Not Significant" = "#b0b0b0")
# Create interactive volcano plot
p <- plot_ly(
data = DGE_data,
x = ~logFC, # logFC on x-axis
y = ~logP, # -log10(FDR) on y-axis
text = ~paste("Gene:", genes, # Gene names
"<br>logFC:", round(logFC, 2),
"<br>FDR:", formatC(FDR, format = "e", digits = 2)), # Show logFC and FDR values
hoverinfo = "text", # Display gene info on hover
mode = "markers", # Use markers (points)
marker = list(size = 6) # Marker size
) %>%
add_markers(
color = ~Significance, # Color by significance (Upregulated, Downregulated, Not Significant)
colors = color_map # Apply the color map
) %>%
layout(
title = "Volcano Plot of Gene Expression", # Title of the plot
xaxis = list(title = "log2 Fold Change"), # x-axis label
yaxis = list(title = "-log10(FDR)"), # y-axis label
shapes = list(
list(type = "line", x0 = -2, x1 = -2, y0 = 0, y1 = max(DGE_data$logP), line = list(dash = "dot"), opacity = 0.5), # Vertical line at logFC = -2
list(type = "line", x0 = 2, x1 = 2, y0 = 0, y1 = max(DGE_data$logP), line = list(dash = "dot"), opacity = 0.5), # Vertical line at logFC = 2
list(type = "line", x0 = min(DGE_data$logFC), x1 = max(DGE_data$logFC),
y0 = -log10(0.05), y1 = -log10(0.05), line = list(dash = "dash", color = "black"), opacity = 0.3) # Horizontal line at FDR = 0.05
)
)
return(p)
}
# Load results
DGE_early_vs_others <- read.csv("DEG_Early_vs_Others.csv")
# plot of thin vs late+mature biofilms
p1 <- generate_volcano(DGE_early_vs_others) %>%
layout(title="Volcano Plot of Gene Expression in Early Biofilms vs. Thin and Mature Biofilms")
htmlwidgets::saveWidget(p1, "thin_v_all_volcano.html", selfcontained = TRUE)
# 2. Generate heatmaps
# load ANOVA table
DGE_All <- read.csv("DEG_All_Groups.csv")
# Load pairwise logFC values
early_vs_thin <- read.csv("DEG_Early_vs_Thin.csv")[, c("genes", "logFC", "FDR")]
early_vs_mature <- read.csv("DEG_Early_vs_Mature.csv")[, c("genes", "logFC", "FDR")]
thin_vs_mature <- read.csv("DEG_Thin_vs_Mature.csv")[, c("genes", "logFC", "FDR")]
# At this point, there are ~ 6000 genes total... rather than plot them all, I'll remove those with p > 0.05
early_vs_thin <- early_vs_thin %>% filter(FDR < 0.05) %>% select(!FDR) # now only ~3400 rows
early_vs_mature <- early_vs_mature %>% filter(FDR < 0.05) %>% select(!FDR) # now only ~3000 rows
thin_vs_mature <- thin_vs_mature %>% filter(FDR < 0.05) %>% select(!FDR) # now only ~2500 rows
# Rename logFC columns for clarity
colnames(early_vs_thin)[2] <- "Early_vs_Thin"
colnames(early_vs_mature)[2] <- "Early_vs_Mature"
colnames(thin_vs_mature)[2] <- "Thin_vs_Mature"
# Merge datasets based on the "genes" column. This will leave NAs on the rows (genes) for which a column (group) had an adj.P value (FDR) too large for our confidence threshold (> 0.05).
logFC_matrix <- purrr::reduce(
list(early_vs_thin, early_vs_mature, thin_vs_mature),
full_join, by = "genes"
)
logFC_matrix <- logFC_matrix %>%
filter(rowSums(!is.na(select(., -genes))) >= 2) %>% # Keep genes that are present in at least 2 of the 3 comparisons by conditionally removing rows with > 2 "NA"s. At about 3400 rows (genes) at this point
filter(rowSums(abs(select(., -genes)) > 1, na.rm = TRUE) >= 1) # keep genes that have |logFC| > 1 in at least 1 of the 3 comparisons
# Convert to matrix format (genes as row names)
rownames(logFC_matrix) <- logFC_matrix$genes
logFC_matrix$genes <- NULL
# # Use plotly again for interactive mapping
# heatmap_plot <- plot_ly(
# x = colnames(logFC_matrix),
# y = rownames(logFC_matrix),
# z = as.matrix(logFC_matrix),
# text = ~paste("Condition:", colnames(logFC_matrix),
# "<br>Gene:", rownames(logFC_matrix),
# "<br>logFC:", round(as.matrix(logFC_matrix), 2)),
# type = "heatmap",
# colorscale = "RdBu", # Red for upregulated, blue for downregulated
# reversescale = TRUE,
# colorbar = list(title = "LogFC") # Title for the legend
# ) %>%
# layout(
# title = "Pairwise Differential Gene Expression Across Treatments",
# xaxis = list(title = "Pairwise Comparisons"),
# yaxis = list(title = "Genes")
# )
# using heatmaply package (plotly wrapper) since it can fit all the genes. The only downside is it's buggy so some things don't work like custom hover text and colorbar labelling.
heatmap_plot <- heatmaply(
logFC_matrix,
xlab = "Pairwise Comparisons",
ylab = "Genes",
colors = colorRampPalette(RColorBrewer::brewer.pal(9, "RdBu"))(100),
main = "Pairwise Differential Expression between Early, Thin and Mature Biofilms",
dendrogram = "row",
showticklabels = c(TRUE, FALSE)
)
# save plot
htmlwidgets::saveWidget(heatmap_plot, "heatmaply_pairwise_heatmap.html", selfcontained = TRUE)# Generate tables 1 and 2
# load pairwise comparisons
dge_early_vs_thin <- read.csv("DEG_Early_vs_Thin.csv") %>%
select(genes, logFC, PValue) %>% # keep only genes, logFC (magnitude + direction) and p-value
mutate(EvT_logFC = logFC, # rename columns to match group
EvT_PValue = PValue, # ""
logFC = NULL, # drop unneccesary columns
PValue = NULL) # ""
# repeat for the other groups
dge_early_vs_mature <- read.csv("DEG_Early_vs_Mature.csv") %>% select(genes, logFC, PValue) %>% mutate(EvM_logFC = logFC, EvM_PValue = PValue, logFC = NULL, PValue = NULL)
dge_thin_vs_mature <- read.csv("DEG_Thin_vs_Mature.csv") %>% select(genes, logFC, PValue) %>% mutate(TvM_logFC = logFC, TvM_PValue = PValue, logFC = NULL, PValue = NULL)
# combine by gene
dge_combined <- purrr::reduce(
list(dge_early_vs_thin, dge_early_vs_mature, dge_thin_vs_mature),
full_join, by = "genes"
)
# Save as a table
write.table(dge_combined, "A2_list1.txt")
# Table 2 is already generated, but with more columns than necessary
table2 <- read.csv("DEG_Early_vs_Others.csv") %>%
select(genes, logFC, PValue)
write.table(table2, "A2_list2.txt")