Population genomics aims to understand how genetic diversity, allelic variation and population structure can provide insights into evolutionary and ecological processes1. Accurately aligning sequence reads to a reference genome, and identifying single nucleotide polymorphisms (SNPs) are critical steps in population genomics. This project focuses on analyzing a threespine stickleback (Gasterosteus aculeatus) dataset derived from Catchen et al. (2013), which investigated the population structure and colonization history of sticklebacks across different regions of Oregon.
The selected dataset includes 24 samples from four collection sites—CS, PCR, SJ and WC—processed using restriction site-associated DNA sequencing (RADseq). This sequencing technique targets a subset of the genome, allowing cost-effective SNP identification across populations2. By aligning these sequences to the threespine stickleback reference genome (Ensembl release 113), we can infer genetic relatedness and make informed hypotheses regarding evolutionary history2.
The use of next-generation sequencing (NGS) technologies like RADseq has enabled population-level studies with high resolution across genomic regions, overcoming historical limitations of marker-based analyses2. However, challenges such as sequencing errors, incorrect read mapping, and biases in SNP calling can impact downstream analyses. Proper parameter tuning and filtering criteria are essential for accurately capturing genetic variation with high confidence and minimizing false positives.
The goals of this assignment include assessing allelic variation and
the efficacy of Stacks’ gstacks and
populations commands for assessing population
differentiation. Though short and preliminary, this work highlights the
power of combining computational tools with genomic data to reveal the
evolutionary dynamics of natural populations.
The project is structured around several key steps:
Data Retrieval: Downloading raw sequencing reads from the dataset.
Quality Control: Ensuring high-quality reads for accurate analysis.
Alignment and SNP Calling: Aligning reads to the
reference genome using bwa and performing SNP calling with
Stacks.
Population Analyses: Evaluating genetic differentiation and population structure through metrics like FST and Principal Components Analysis (PCA).
By following these steps, I aim to assess the allelic variation and population differentiation in the four G. aculeatus populations, and form some hypothesis for further investigation.
The most up-to-date stable G. aculeatus reference genome was retrieved from Ensemble (Appendix A, line 18). Gzipped fastq files were retrieved from Dr. Lukens’ via Cedar (Appendix A, line 32). Files were processed as described in Appendix B, based on lectures from 6110. Figures were generated as described in Appendix C. Further detail can be found in code comments.
Parameters for Appendix B were based on reasoning discussed in-lecture, otherwise are specified as code comments.
As outlined in the multiqc report all reads pass the quality tests, with the note of high estimated duplication rate. The percentage of low-quality reads was < 0.0% for all samples, and the PHRED score for each base averaged between 30 to 40, with the exception of the 95th bp which had a PHRED score of approximately 29 for all groups. The GC content was approximately 50% for all the samples, excluding bps 1 to 6 for the bar-code. The N content was 0% for all reads (no ambiguous nucleotides).
| Pop.ID | Private | Num_Indv | Variant_Sites | %Polymorphic_Loci | Obs_Het | Exp_Het | Pi | Fis | cs_FST | pcr_FST | sj_FST |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cs | 733 | 5.95085 | 39860 | 0.87193 | 0.00286 | 0.00279 | 0.00305 | 0.00045 | NA | NA | NA |
| pcr | 1301 | 5.95884 | 40360 | 0.82647 | 0.00298 | 0.00286 | 0.00313 | 0.00034 | 0.0548212 | NA | NA |
| sj | 470 | 5.93696 | 37911 | 0.86460 | 0.00285 | 0.00289 | 0.00316 | 0.00072 | 0.0174663 | 0.0329177 | NA |
| wc | 925 | 5.94704 | 40610 | 0.92181 | 0.00295 | 0.00307 | 0.00336 | 0.00094 | 0.0171675 | 0.0328171 | 0.0108634 |
Table 1 lists vital summary statistics from
gstacks populations, including:
the number of alleles that are unique to each population
the average number of individuals per site included in the analysis
the number and percentage of polymorphic loci
the observed and expected heterozygosity (i.e the proportion of heterozygous individuals observed at polymorphic loci versus the expected proportion based on allele frequencies under Hardy-Weinberg Equilibria assumptions)
Nucleotide diversity (\(\pi\))
Inbreeding Coefficient (FIS)
Fixation Indices (FST)
Principal Component Analysis revealed two distinct clusters
for the CS and PCR groups when measured by PCs 1 and 2. The SJ and WC
groups had significantly more spread in both dimensions, as well as
similar cluster centroids. The eigenvalues for PCs 1 and 2 were 2.88 and
1.62, respectively. The eigenvalues for PCs 3 to 20 were all close to
1.10.
Figure 2 displays the gene diversity across the four
populations of interest, coloured by chromosome and ordered by Locus ID
(i.e. position). Gene diversity measures the probability that two
randomly selected alleles from the population are heterozygous at a
given locus3.
The quality of data appeared to be good based on the multiqc report. The high duplication rate likely stems from a combination of the RAD tags and single-end sequencing.
From Table 1, it appears that the SJ group had the fewest alleles unique to the population (i.e. the most ‘shared’ alleles), followed by the CS and WC groups, while the PCR group had the highest quantity of unique alleles. This could imply that the PCR group has been more isolated, or has experienced founder effects or genetic drift. The average number of individuals per site reflects the sampling depth for each population, and they are quite low (~ 6). This limits the confidence with which we can form conclusions about our results, particularly relating to population diversity.
Most loci are polymorphic (~80-90%) even with the restriction with the gstacks command specifying that a locus must have reads in at least 5 of the 6 samples to be considered (Appendix B, line 115), as well as filtering out minor alleles present in < 5% of the dataset (Appendix B, line 117). This high polymorphic percentage is evidence against the conclusion for the PCR group being isolated or experiencing founder effects or genetic drift, as it is an indicator of genetic variability. Despite this high polymorphic percentage, the observed proportion of heterozygous individuals appears low at first glance. However, the expected heterozygosity which measures the theoretical genetic diversity if the population were in Hardy-Weinberg Equilibrium (allele and genotype frequencies remain constant from generation to generation), is quite close to the observed values. The coefficient pi (\(\pi\)) is the average number of nucleotide differences per site between two randomly chosen sequences, another measure of polymorphism. The pi values are quite low for all populations, implying that it is a select few nucleotides that make up the difference between alleles.
The Inbreeding Coefficient (FIS) is a ratio between the observed heterozygosity and expected heterozygosity, and the closer the value is to 0, the stronger the evidence that the population is in Hardy-Weinberg equilibrium. As implied by the observed heterozygosity being only slightly greater than the expected heterozygosity, the inbreeding coefficient for all populations is approximately 0, with a slight positive skew.
Lastly, the fixation index is a measure of genetic differentiation (i.e. shared vs unique alleles between populations). FST was highest between CS and PCR, which ties to the results of PCA analysis. A higher FST implies greater genetic difference between the two populations, either due to earlier speciation or lack of gene flow. The lowest FST value was between SJ and WC, which also ties into the results from the PCA analysis. CS was more genetically similar to SJ and WC than PCR was to SJ and WC.
PCA was selected as a dimensionality reduction technique for its interpretability, especially in the context of visualizing genetic differenatiation among populations (and because it was a requirement for the assignment). Based on eigenvalues of 2.88 and 1.62 for PCs 1 and 2, respectively, they make up approximately 22.5% of the total variance for PCs 1 to 20. In terms of limitations, if more PCs were included then differentiating populations by PCs might be viable. However, as in Figure 1 there is still overlap between SJ and WC clusters when using PCs 1 and 2 alone. This suggests that additional factors beyond the two major axes of genetic variation likely contribute to the differentiation observed among these populations. As mentioned in the results section, the PCR and CS groups appear distinct, while SJ and WC groups have greater spread. This aligns with the earlier observations from the fixation index (FST), where PCR and CS were the most distinct, whereas SJ and WC were the most similar.
Table 1 and Figure 1 provide evidence that the PCR group (and the CS group, to a lesser extent) was (were) genetically distinct from the other groups. Figure 2 provides an opportunity to identify where that evidence comes from. Upon visually inspecting the gene diversity plot for PCR versus the other 3 groups, chromosome I, IV and VII are the most distinct. This can be validated with:
read.csv("populations/populations.fst_cs-pcr.tsv", sep = "\t") %>%
filter(Fisher.s.P < 0.05 & # confidence level of 95%
Odds.Ratio > 5 & # highlight loci with substantial differences between populations
Corrected.AMOVA.Fst > 0.6) %>% # ""
.[, 1:8] %>% # only really need cols 1-8
head()## X..Locus.ID Pop.1.ID Pop.2.ID Chr BP Column Fisher.s.P Odds.Ratio
## 1 1868 cs pcr I 7740649 48 0.000053991 121
## 2 1896 cs pcr I 7883333 41 0.000289184 99
## 3 1991 cs pcr I 8203582 9 0.000053991 121
## 4 1991 cs pcr I 8203585 12 0.000053991 121
## 5 1991 cs pcr I 8203645 72 0.000053991 121
## 6 1991 cs pcr I 8203646 73 0.000053991 121
Upon inspection, chromosomes I, IV, VII as well as certain loci in XIV, XV, XX and XXI contribute significantly to genetic differentiation. This is indicated by the large odds-ratios, which measure how frequent an SNP is in one population versus another (though it does not include direction, only magnitude). This investigation could be done for all pairs, however this report is limited by size restrictions. However, we can still conclude that this is a powerful method of capturing allelic variation and describing population differentiation, especially with a well-annotated reference genome. With a high-quality reference genome such as that used here, these loci could be investigated with the potential to uncover differences in particular genetic and metabolic pathways.
In conclusion, the methods described in Appendix B were effective in assessing allelic variation in the four G. aculeatus populations, as well as quantifying population differentiation in a way that allows for more specific investigation of loci, and even nucleotide-level differences. In this particular dataset, the PCR group has been more genetically isolated from the other species comparatively, than they have to each other. Since there was no significant differentiation on the Y chromosome, an investigation of mtDNA may shed light on the evolutionary and ecological history of the four groups.
# set wkdir
cd ~/scratch/A1/
# load modules
module load fastp
module load sra-toolkit
module load bwa
module load samtools
# 1. get the reference genome if it isn't already in the working dir
#REF_GENOME="/home/lukens/scratch/Sequences/file.txt"
REF_GENOME=~/scratch/A1/Gasterosteus_aculeatus.GAculeatus_UGA_version5.dna.toplevel.fa.gz
if [ -e "$REF_GENOME" ]; then
echo "Reference genome file '$REF_GENOME' already exists."
else
echo "Reference genome '$REF_GENOME' does not exist. Downloading..."
wget https://ftp.ensembl.org/pub/release-113/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.GAculeatus_UGA_version5.dna.toplevel.fa.gz .
echo "Download complete."
fi
# end if/else stream
# 2. get the fq.fz input files
# Define the directory containing .fq.gz files
INPUT_DIR='./lukens_files'
# Check if directory exists
if [ -d "$INPUT_DIR" ]; then
echo "Directory '$INPUT_DIR' already exists."
else
echo "Creating directory '$INPUT_DIR' and copying files..."
mkdir -p "$INPUT_DIR"
cp home/lukens/scratch/Assignment_1/* "$INPUT_DIR" && echo "Files copied successfully."
fi
# 3. Index the reference genome
# Check if the reference genome is indexed
if [ -f "${REF_GENOME}.bwt" ] && [ -f "${REF_GENOME}.sa" ]; then
# if the output files from indexing already exist then move on
echo "Reference genome has already been indexed."
else
echo "Indexing reference genome..."
bwa index "$REF_GENOME"
gunzip -c "$REF_GENOME" | bgzip > "${REF_GENOME%.gz}" # avoids some downstream errors
samtools faidx "${REF_GENOME%.gz}"
# bcftools requires .fai index file so both index commands are necessary
if [ $? -eq 0 ]; then
# $? points to the exit code of the last command; 0 = success
echo "Indexing completed successfully."
else
echo "Error during indexing. Please check the reference genome file."
# stop if indexing exits unexpectedly
exit 1
fi
fi
# 4. unzip fq.fz files
# Loop through all .fq.gz files in the input directory and unzip
for FILE in "$INPUT_DIR"/*.fq.gz; do
# Extract the base name (e.g., cs_1335.01) without the .fq.gz extension
BASENAME=$(basename "$FILE" .fq.gz)
echo "Processing $BASENAME ..."
# Define the expected output file
FASTQ_FILE="$INPUT_DIR/${BASENAME}.fastq"
# Check if the output file already exists
if [ -f "$FASTQ_FILE" ]; then
echo "File for $BASENAME already exists. Skipping..."
continue # skips to next iteration of loop
fi
# Uncompress the .fq.gz file
gunzip -c "$FILE" > "$FASTQ_FILE"
# Check if the file was successfully Decompressed
if [ $? -eq 0 ]; then
echo "Decompressed $FILE to $FASTQ_FILE successfully."
else
echo "Error processing $FILE. Skipping..."
continue
fi
echo "Finished processing $BASENAME."
done
echo "All files decompressed."
# 5. run multiqc
# multiqc requires html and json reports
for file in ./raw_files/*.fastq; do
base=$(basename "$file" .fastq)
fastp -i "$file" -o ./filtered_files/"${base}_filtered.fastq" \
--average_qual 20 \
-h ./fastp_reports/"${base}_fastp.html" \
-j ./fastp_reports/"${base}_fastp.json"
done && \
multiqc ./fastp_reports -o ./fastp_reports
# 6. Generate .SAM, .BAM, sorted .BAM and indexed .BAM files
OUTPUT_DIR='./processed_files'
mkdir -p "$OUTPUT_DIR" # Create the output directory if it doesn't exist
# Loop through all .fq files in the input directory
for FILE in ${INPUT_DIR}/*.fastq; do
# Extract the base name (e.g., cs_1335.01) from the file name
BASENAME=$(basename "$FILE" .fastq)
# Define input FASTQ file
FASTQ_FILE="${INPUT_DIR}/${BASENAME}.fastq"
# Define output file paths
SAM_FILE="${OUTPUT_DIR}/${BASENAME}.sam"
BAM_FILE="${OUTPUT_DIR}/${BASENAME}.bam"
SORTED_BAM_FILE="${OUTPUT_DIR}/${BASENAME}_sorted.bam"
echo "Processing $BASENAME..."
# Align reads to the reference genome using bwa mem
bwa mem "$REF_GENOME" "$FASTQ_FILE" > "$SAM_FILE"
# Convert SAM to BAM
samtools view -b "$SAM_FILE" > "$BAM_FILE"
# Sort BAM file
samtools sort "$BAM_FILE" -o "$SORTED_BAM_FILE"
# Index BAM file
samtools index "$SORTED_BAM_FILE"
# Optionally clean up intermediate files. In a true pipeline I would do this.
# rm "$SAM_FILE" "$BAM_FILE"
echo "Finished processing $BASENAME."
done
echo "All files processed."
# 7. Run gstacks for aligning/calling genotypes
# originally I'd used mpileup but did not like that it included regions with a coverage of 0, so I made .bed files for the high-coverage regions in the .bam files and then used mpileup using those .bed files... it was a whole mess. So, I used gstacks instead, since it addresses the population genetics component of the assignment.
mkdir gstacks
gstacks -I bam_files/ -M mpileup/pop_map_2.tsv -O gstacks/ # I had to remove the header from pop_map.tsv
# 8. Then populations
populations -P gstacks/ -O gstacks/populations/ -M mpileup/pop_map_2.tsv \
# at least two populations read that locus
-p 2
# at least 5 out of 6 samples per group must have data at a locus
--min-samples-per-pop 0.8 \
# variant must be present in at least 5% of all alleles across the dataset
--min-maf 0.05 \
# minimum number of reads per genotype call to reduce false positives -> set based on mean depth
--min-gt-depth 5 \
# removes loci with high heterozygosity (possible sequencing artifacts). A threshold of 0.7 (70%) is commonly used to filter loci deviating from Hardy-Weinberg expectations.
--max-obs-het 0.7 \
# output in vcf format and plink files (for pca), with F-statistics and hardy-weinberg equilibria values
--vcf --plink --fstats --hwe \
# FST (fixation index) values -> how much genetic variation is partitioned among populations versus within populations. Ranges from 0 (identical) to 1 (completely different)
--fst-correction \
# only keep values above significant threshold of 95%
--p-value-cutoff 0.05
# 9. Run PCA using plink files
plink --file populations.plink --make-bed --allow-extra-chr --out plink_data
plink --bfile plink_data --pca 20 --allow-extra-chr --out pca_out# 10. Generate PCA plot
# load data
pca_data <- read.table("pca/pca_out.eigenvec", header=FALSE)
# Add colnames
colnames(pca_data) <- c("Population", "SampleID", paste0("PC", 1:10))
# Plot PC1 vs PC2
# PCA plot with ggplot2
p <- ggplot(pca_data, aes(x = PC1, y = PC2, color = Population)) +
geom_point(size = 3, alpha = 0.8) + # Add points
labs(
x = "Principal Component 1",
y = "Principal Component 2",
title = "PCA of Populations"
) +
theme_bw() +
theme(
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold")
)
# Save the plot
ggsave("PCA_plot_populations.png", p, dpi = 600)
# 11. FST heatmap
# I did not end up using this, and just included the FST values in the summary statistics table. I kept the code as a "show-your-work" ethic.
fst_raw <- read.table("populations/populations.fst_summary.tsv", header=TRUE, sep="\t")
colnames(fst_raw) <- c("Population1", "cs", "pcr", "sj", "wc")
fst_data <- fst_raw %>%
pivot_longer(
cols = -Population1, # all cols besides first are treated as value cols
names_to = "Population2", # create Pop 2 column
values_to = "Fst") %>% # add Fst values
dplyr::filter(!is.na(.$Fst)) # drop NA's
ggplot(fst_data, aes(x = Population1, y = Population2, fill = Fst)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "blue", high = "red", na.value = "grey") +
labs(title = "Pairwise Fst Heatmap", x = "Population 1", y = "Population 2") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# 12. Hapmap heterozygosity stats
hapstats_data <- read.table("populations/populations.hapstats.tsv", header=TRUE, sep = "\t")
colnames(hapstats_data) <- c("Locus_ID", "Chr", "BP", "Pop_ID", "N", "Haplotype_Cnt", "Gene_Diversity", "Smoothed_Gene_Diversity", "Smoothed_Gene_Diversity_Pval", "Haplotype_Diversity", "Smoothed_Haplotype_Diversity", "Smoothed_Haplotype_Diversity_Pval", "HWE_Pval", "HWE_Pval_SE", "Haplotypes")
# distinguish contigs in order to group them together later as grey
hapstats_data$Chrom_Type <- ifelse(grepl("^JACDQR", hapstats_data$Chr), "Contig", "Chromosome")
# Function to plot Gene Diversity for a given Pop_ID with randomized high-contrast colors. Prior to doing this I couldn't tell where one Chr ended and the next began. I tried adding Chromosome name annotations as a second x axis but that was too messy.
plot_gene_diversity <- function(data, pop_id) {
# filter data for the specified Pop_ID
pop_data <- data %>%
filter(Pop_ID == pop_id)
# order the data by Locus_ID for each chromosome
pop_data <- pop_data %>%
arrange(Chr, Locus_ID)
# convert Chr to a factor to enforce order in the plot
pop_data$Chr <- factor(pop_data$Chr, levels = unique(pop_data$Chr))
# number of chromosomes
num_chromosomes <- length(unique(pop_data$Chr[pop_data$Chrom_Type == "Chromosome"]))
# generate high-contrast colours (I'm colour-blind) using the "Spectral" palette
vibrant_colors <- hcl.colors(num_chromosomes, palette = "Spectral", alpha = 1)
# shuffle the colors to avoid similar hues being placed next to each other
set.seed(42) # for reproducibility
shuffled_colors <- sample(vibrant_colors)
# map colors to chromosomes
chromosome_levels <- unique(pop_data$Chr[pop_data$Chrom_Type == "Chromosome"])
chromosome_color_mapping <- setNames(shuffled_colors, chromosome_levels)
pop_data$Color <- ifelse(pop_data$Chrom_Type == "Chromosome", chromosome_color_mapping[pop_data$Chr], "grey")
# Create the bar plot
p <- ggplot(pop_data, aes(x = Locus_ID, y = Gene_Diversity, fill = Chr)) +
geom_bar(stat = "identity", width = 1.35) + # 1.35 was a good width for visualization
scale_fill_manual(values = c(chromosome_color_mapping, "grey" = "grey")) +
labs(title = paste(pop_id),
x = "Locus ID",
y = "Gene Diversity",
fill = "Chromosome") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none")
return(p)
}
# generate plots for each population
pop_ids <- c("cs", "pcr", "wc", "sj")
plots <- lapply(pop_ids, function(pop_id) plot_gene_diversity(hapstats_data, pop_id))
# Combine the plots using patchwork
combined_plot <- plots[[1]] + plots[[2]] + plots[[3]] + plots[[4]] +
plot_layout(ncol = 2, nrow = 2) +
# add legend
plot_annotation(title = "Gene Diversity across Populations") +
theme(legend.position = "right", # Place legend on the right
legend.justification = c(0.5, 0.5), # Center the legend vertically and horizontally
legend.box.margin = margin(0, 20, 0, 0)) # Add space to the right side
combined_plot
ggsave("Gene_Diversity_Plot_combined.png", plot = combined_plot, dpi = 600)