Introduction

Population genomics aims to understand how genetic diversity, allelic variation and population structure can provide insights into evolutionary and ecological processes¹. Accurately aligning sequence reads to a reference genome, and identifying single nucleotide polymorphisms (SNPs) are critical steps in population genomics. This project focuses on analyzing a threespine stickleback (Gasterosteus aculeatus) dataset derived from Catchen et al. (2013), which investigated the population structure and colonization history of sticklebacks across different regions of Oregon.

The selected dataset includes 24 samples from four collection sites—CS, PCR, SJ and WC—processed using restriction site-associated DNA sequencing (RADseq). This sequencing technique targets a subset of the genome, allowing cost-effective SNP identification across populations². By aligning these sequences to the threespine stickleback reference genome (Ensembl release 113), we can infer genetic relatedness and make informed hypotheses regarding evolutionary history².

The use of next-generation sequencing (NGS) technologies like RADseq has enabled population-level studies with high resolution across genomic regions, overcoming historical limitations of marker-based analyses². However, challenges such as sequencing errors, incorrect read mapping, and biases in SNP calling can impact downstream analyses. Proper parameter tuning and filtering criteria are essential for accurately capturing genetic variation with high confidence and minimizing false positives.

The goals of this assignment include assessing allelic variation and the efficacy of Stacks’ gstacks and populations commands for assessing population differentiation. Though short and preliminary, this work highlights the power of combining computational tools with genomic data to reveal the evolutionary dynamics of natural populations.

The project is structured around several key steps:

Data Retrieval: Downloading raw sequencing reads from the dataset.
Quality Control: Ensuring high-quality reads for accurate analysis.
Alignment and SNP Calling: Aligning reads to the reference genome using bwa and performing SNP calling with Stacks.
Population Analyses: Evaluating genetic differentiation and population structure through metrics like FST and Principal Components Analysis (PCA).

By following these steps, I aim to assess the allelic variation and population differentiation in the four G. aculeatus populations, and form some hypothesis for further investigation.

Methods

The most up-to-date stable G. aculeatus reference genome was retrieved from Ensemble (Appendix A, line 18). Gzipped fastq files were retrieved from Dr. Lukens’ via Cedar (Appendix A, line 32). Files were processed as described in Appendix B, based on lectures from 6110. Figures were generated as described in Appendix C. Further detail can be found in code comments.

Parameters for Appendix B were based on reasoning discussed in-lecture, otherwise are specified as code comments.

Results

As outlined in the multiqc report all reads pass the quality tests, with the note of high estimated duplication rate. The percentage of low-quality reads was < 0.0% for all samples, and the PHRED score for each base averaged between 30 to 40, with the exception of the 95th bp which had a PHRED score of approximately 29 for all groups. The GC content was approximately 50% for all the samples, excluding bps 1 to 6 for the bar-code. The N content was 0% for all reads (no ambiguous nucleotides).

Table 1. Summary statistics from gstacks populations
Pop.ID	Private	Num_Indv	Variant_Sites	%Polymorphic_Loci	Obs_Het	Exp_Het	Pi	Fis	cs_FST	pcr_FST	sj_FST
cs	733	5.95085	39860	0.87193	0.00286	0.00279	0.00305	0.00045	NA	NA	NA
pcr	1301	5.95884	40360	0.82647	0.00298	0.00286	0.00313	0.00034	0.0548212	NA	NA
sj	470	5.93696	37911	0.86460	0.00285	0.00289	0.00316	0.00072	0.0174663	0.0329177	NA
wc	925	5.94704	40610	0.92181	0.00295	0.00307	0.00336	0.00094	0.0171675	0.0328171	0.0108634

Table 1 lists vital summary statistics from gstacks populations, including:

the number of alleles that are unique to each population
the average number of individuals per site included in the analysis
the number and percentage of polymorphic loci
the observed and expected heterozygosity (i.e the proportion of heterozygous individuals observed at polymorphic loci versus the expected proportion based on allele frequencies under Hardy-Weinberg Equilibria assumptions)
Nucleotide diversity (\(\pi\))
Inbreeding Coefficient (FIS)
Fixation Indices (FST)

Figure 1. Principal Components Analysis of Genetic Differentiation among four G. aculeatus Populations

Principal Component Analysis revealed two distinct clusters for the CS and PCR groups when measured by PCs 1 and 2. The SJ and WC groups had significantly more spread in both dimensions, as well as similar cluster centroids. The eigenvalues for PCs 1 and 2 were 2.88 and 1.62, respectively. The eigenvalues for PCs 3 to 20 were all close to 1.10.

Figure 2. Genetic Diversity across four populations of G. aculeatus. Note that contigs are represented in grey at the far right of each plot, but are not included in the Chromosome legend.

Figure 2 displays the gene diversity across the four populations of interest, coloured by chromosome and ordered by Locus ID (i.e. position). Gene diversity measures the probability that two randomly selected alleles from the population are heterozygous at a given locus³.

Discussion

The quality of data appeared to be good based on the multiqc report. The high duplication rate likely stems from a combination of the RAD tags and single-end sequencing.

Summary Statistics

From Table 1, it appears that the SJ group had the fewest alleles unique to the population (i.e. the most ‘shared’ alleles), followed by the CS and WC groups, while the PCR group had the highest quantity of unique alleles. This could imply that the PCR group has been more isolated, or has experienced founder effects or genetic drift. The average number of individuals per site reflects the sampling depth for each population, and they are quite low (~ 6). This limits the confidence with which we can form conclusions about our results, particularly relating to population diversity.

Most loci are polymorphic (~80-90%) even with the restriction with the gstacks command specifying that a locus must have reads in at least 5 of the 6 samples to be considered (Appendix B, line 115), as well as filtering out minor alleles present in < 5% of the dataset (Appendix B, line 117). This high polymorphic percentage is evidence against the conclusion for the PCR group being isolated or experiencing founder effects or genetic drift, as it is an indicator of genetic variability. Despite this high polymorphic percentage, the observed proportion of heterozygous individuals appears low at first glance. However, the expected heterozygosity which measures the theoretical genetic diversity if the population were in Hardy-Weinberg Equilibrium (allele and genotype frequencies remain constant from generation to generation), is quite close to the observed values. The coefficient pi (\(\pi\)) is the average number of nucleotide differences per site between two randomly chosen sequences, another measure of polymorphism. The pi values are quite low for all populations, implying that it is a select few nucleotides that make up the difference between alleles.

The Inbreeding Coefficient (FIS) is a ratio between the observed heterozygosity and expected heterozygosity, and the closer the value is to 0, the stronger the evidence that the population is in Hardy-Weinberg equilibrium. As implied by the observed heterozygosity being only slightly greater than the expected heterozygosity, the inbreeding coefficient for all populations is approximately 0, with a slight positive skew.

Lastly, the fixation index is a measure of genetic differentiation (i.e. shared vs unique alleles between populations). FST was highest between CS and PCR, which ties to the results of PCA analysis. A higher FST implies greater genetic difference between the two populations, either due to earlier speciation or lack of gene flow. The lowest FST value was between SJ and WC, which also ties into the results from the PCA analysis. CS was more genetically similar to SJ and WC than PCR was to SJ and WC.

Principal Components Analysis

PCA was selected as a dimensionality reduction technique for its interpretability, especially in the context of visualizing genetic differenatiation among populations (and because it was a requirement for the assignment). Based on eigenvalues of 2.88 and 1.62 for PCs 1 and 2, respectively, they make up approximately 22.5% of the total variance for PCs 1 to 20. In terms of limitations, if more PCs were included then differentiating populations by PCs might be viable. However, as in Figure 1 there is still overlap between SJ and WC clusters when using PCs 1 and 2 alone. This suggests that additional factors beyond the two major axes of genetic variation likely contribute to the differentiation observed among these populations. As mentioned in the results section, the PCR and CS groups appear distinct, while SJ and WC groups have greater spread. This aligns with the earlier observations from the fixation index (FST), where PCR and CS were the most distinct, whereas SJ and WC were the most similar.

Gene Diversity

Table 1 and Figure 1 provide evidence that the PCR group (and the CS group, to a lesser extent) was (were) genetically distinct from the other groups. Figure 2 provides an opportunity to identify where that evidence comes from. Upon visually inspecting the gene diversity plot for PCR versus the other 3 groups, chromosome I, IV and VII are the most distinct. This can be validated with:

read.csv("populations/populations.fst_cs-pcr.tsv", sep = "\t") %>%
  filter(Fisher.s.P < 0.05 &     # confidence level of 95%
           Odds.Ratio > 5 &      # highlight loci with substantial differences between populations
           Corrected.AMOVA.Fst > 0.6) %>% # ""
  .[, 1:8] %>% # only really need cols 1-8
  head()

##   X..Locus.ID Pop.1.ID Pop.2.ID Chr      BP Column  Fisher.s.P Odds.Ratio
## 1        1868       cs      pcr   I 7740649     48 0.000053991        121
## 2        1896       cs      pcr   I 7883333     41 0.000289184         99
## 3        1991       cs      pcr   I 8203582      9 0.000053991        121
## 4        1991       cs      pcr   I 8203585     12 0.000053991        121
## 5        1991       cs      pcr   I 8203645     72 0.000053991        121
## 6        1991       cs      pcr   I 8203646     73 0.000053991        121

Upon inspection, chromosomes I, IV, VII as well as certain loci in XIV, XV, XX and XXI contribute significantly to genetic differentiation. This is indicated by the large odds-ratios, which measure how frequent an SNP is in one population versus another (though it does not include direction, only magnitude). This investigation could be done for all pairs, however this report is limited by size restrictions. However, we can still conclude that this is a powerful method of capturing allelic variation and describing population differentiation, especially with a well-annotated reference genome. With a high-quality reference genome such as that used here, these loci could be investigated with the potential to uncover differences in particular genetic and metabolic pathways.

In conclusion, the methods described in Appendix B were effective in assessing allelic variation in the four G. aculeatus populations, as well as quantifying population differentiation in a way that allows for more specific investigation of loci, and even nucleotide-level differences. In this particular dataset, the PCR group has been more genetically isolated from the other species comparatively, than they have to each other. Since there was no significant differentiation on the Y chromosome, an investigation of mtDNA may shed light on the evolutionary and ecological history of the four groups.

References

DeFaveri, J., and Merilä, J. (2015). Temporal stability of genetic variability and differentiation in the three-spined stickleback (gasterosteus aculeatus). PLoS One 10, e0123891.

Lemopoulos, A., Prokkola, J.M., Uusi-Heikkilä, S., Vasemägi, A., Huusko, A., Hyvärinen, P., Koljonen, M.-L., Koskiniemi, J., and Vainikka, A. (2019). Comparing RADseq and microsatellites for estimating genetic diversity and relatedness—implications for brown trout conservation. Ecology and Evolution 9, 2106–2120.

Catchen, J., Bassham, S., Wilson, T., Currey, M., O’Brien, C., Yeates, Q., and Cresko, W.A. (2013). The population structure and recent colonization history of o regon threespine stickleback determined using restriction-site associated DNA-sequencing. Molecular ecology 22, 2864–2883.

R Core Team (2024). R: A language and environment for statistical computing (R Foundation for Statistical Computing).

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (Springer-Verlag New York).

Rochette, N.C., and Catchen, J.M. (2017). Deriving genotypes from RAD-seq short-read data using stacks. Nature protocols 12, 2640–2659.

Cheng, J., Sievert, C., Schloerke, B., Chang, W., Xie, Y., and Allen, J. (2024). Htmltools: Tools for HTML.

Appendices

Appendix A: Data Acquisition

# set wkdir
cd ~/scratch/A1/

# load modules
module load fastp
module load sra-toolkit
module load bwa
module load samtools


# 1. get the reference genome if it isn't already in the working dir
#REF_GENOME="/home/lukens/scratch/Sequences/file.txt"
REF_GENOME=~/scratch/A1/Gasterosteus_aculeatus.GAculeatus_UGA_version5.dna.toplevel.fa.gz
if [ -e "$REF_GENOME" ]; then
        echo "Reference genome file '$REF_GENOME' already exists."
else
        echo "Reference genome '$REF_GENOME' does not exist. Downloading..."
        wget https://ftp.ensembl.org/pub/release-113/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.GAculeatus_UGA_version5.dna.toplevel.fa.gz .
        echo "Download complete."
fi                                                                              
# end if/else stream

# 2. get the fq.fz input files
# Define the directory containing .fq.gz files
INPUT_DIR='./lukens_files'
# Check if directory exists
if [ -d "$INPUT_DIR" ]; then
    echo "Directory '$INPUT_DIR' already exists."
else
    echo "Creating directory '$INPUT_DIR' and copying files..."
    mkdir -p "$INPUT_DIR"
    cp home/lukens/scratch/Assignment_1/* "$INPUT_DIR" && echo "Files copied successfully."
fi

Appendix B: Data Processing


# 3. Index the reference genome
# Check if the reference genome is indexed
if [ -f "${REF_GENOME}.bwt" ] && [ -f "${REF_GENOME}.sa" ]; then                
# if the output files from indexing already exist then move on
    echo "Reference genome has already been indexed."
else
    echo "Indexing reference genome..."
    bwa index "$REF_GENOME"
    gunzip -c "$REF_GENOME" | bgzip > "${REF_GENOME%.gz}" # avoids some downstream errors
    samtools faidx "${REF_GENOME%.gz}"                                          
    # bcftools requires .fai index file so both index commands are necessary
    if [ $? -eq 0 ]; then                                                       
    # $? points to the exit code of the last command; 0 = success
        echo "Indexing completed successfully."
    else
        echo "Error during indexing. Please check the reference genome file."   
        # stop if indexing exits unexpectedly
        exit 1
    fi
fi

# 4. unzip fq.fz files
# Loop through all .fq.gz files in the input directory and unzip
for FILE in "$INPUT_DIR"/*.fq.gz; do
    # Extract the base name (e.g., cs_1335.01) without the .fq.gz extension
    BASENAME=$(basename "$FILE" .fq.gz)

    echo "Processing $BASENAME ..."

    # Define the expected output file
    FASTQ_FILE="$INPUT_DIR/${BASENAME}.fastq"

    # Check if the output file already exists
    if [ -f "$FASTQ_FILE" ]; then
        echo "File for $BASENAME already exists. Skipping..."
        continue                                                                # skips to next iteration of loop
    fi

    # Uncompress the .fq.gz file
    gunzip -c "$FILE" > "$FASTQ_FILE"

    # Check if the file was successfully Decompressed
    if [ $? -eq 0 ]; then
        echo "Decompressed $FILE to $FASTQ_FILE successfully."
    else
        echo "Error processing $FILE. Skipping..."
        continue
    fi

    echo "Finished processing $BASENAME."
done

echo "All files decompressed."

# 5. run multiqc
# multiqc requires html and json reports
for file in ./raw_files/*.fastq; do 
    base=$(basename "$file" .fastq)
    fastp -i "$file" -o ./filtered_files/"${base}_filtered.fastq" \
          --average_qual 20 \
          -h ./fastp_reports/"${base}_fastp.html" \
          -j ./fastp_reports/"${base}_fastp.json"
done && \
multiqc ./fastp_reports -o ./fastp_reports

# 6. Generate .SAM, .BAM, sorted .BAM and indexed .BAM files
OUTPUT_DIR='./processed_files'
mkdir -p "$OUTPUT_DIR"                                                          # Create the output directory if it doesn't exist

# Loop through all .fq files in the input directory
for FILE in ${INPUT_DIR}/*.fastq; do
    # Extract the base name (e.g., cs_1335.01) from the file name
    BASENAME=$(basename "$FILE" .fastq)

    # Define input FASTQ file
    FASTQ_FILE="${INPUT_DIR}/${BASENAME}.fastq"

    # Define output file paths
    SAM_FILE="${OUTPUT_DIR}/${BASENAME}.sam"
    BAM_FILE="${OUTPUT_DIR}/${BASENAME}.bam"
    SORTED_BAM_FILE="${OUTPUT_DIR}/${BASENAME}_sorted.bam"

    echo "Processing $BASENAME..."

    # Align reads to the reference genome using bwa mem
    bwa mem "$REF_GENOME" "$FASTQ_FILE" > "$SAM_FILE"

    # Convert SAM to BAM
    samtools view -b "$SAM_FILE" > "$BAM_FILE"

    # Sort BAM file
    samtools sort "$BAM_FILE" -o "$SORTED_BAM_FILE"

    # Index BAM file
    samtools index "$SORTED_BAM_FILE"

    # Optionally clean up intermediate files. In a true pipeline I would do this.
    # rm "$SAM_FILE" "$BAM_FILE"

    echo "Finished processing $BASENAME."
done

echo "All files processed."

# 7. Run gstacks for aligning/calling genotypes
# originally I'd used mpileup but did not like that it included regions with a coverage of 0, so I made .bed files for the high-coverage regions in the .bam files and then used mpileup using those .bed files... it was a whole mess. So, I used gstacks instead, since it addresses the population genetics component of the assignment.
mkdir gstacks
gstacks -I bam_files/ -M mpileup/pop_map_2.tsv -O gstacks/ # I had to remove the header from pop_map.tsv

# 8. Then populations
populations -P gstacks/ -O gstacks/populations/ -M mpileup/pop_map_2.tsv \
# at least two populations read that locus
  -p 2
# at least 5 out of 6 samples per group must have data at a locus
  --min-samples-per-pop 0.8 \
# variant must be present in at least 5% of all alleles across the dataset
  --min-maf 0.05 \ 
# minimum number of reads per genotype call to reduce false positives -> set based on mean depth
  --min-gt-depth 5 \
# removes loci with high heterozygosity (possible sequencing artifacts). A threshold of 0.7 (70%) is commonly used to filter loci deviating from Hardy-Weinberg expectations.
  --max-obs-het 0.7 \
# output in vcf format and plink files (for pca), with F-statistics and hardy-weinberg equilibria values
  --vcf --plink --fstats --hwe \
# FST (fixation index) values -> how much genetic variation is partitioned among populations versus within populations. Ranges from 0 (identical) to 1 (completely different)
--fst-correction \
# only keep values above significant threshold of 95%
--p-value-cutoff 0.05

# 9. Run PCA using plink files
plink --file populations.plink --make-bed --allow-extra-chr --out plink_data
plink --bfile plink_data --pca 20 --allow-extra-chr --out pca_out

Appendix C: Generating Figures

# 10. Generate PCA plot
# load data
pca_data <- read.table("pca/pca_out.eigenvec", header=FALSE)

# Add colnames
colnames(pca_data) <- c("Population", "SampleID", paste0("PC", 1:10))

# Plot PC1 vs PC2
# PCA plot with ggplot2
p <- ggplot(pca_data, aes(x = PC1, y = PC2, color = Population)) +
  geom_point(size = 3, alpha = 0.8) +  # Add points
  labs(
    x = "Principal Component 1",
    y = "Principal Component 2",
    title = "PCA of Populations"
  ) +
  theme_bw() +
  theme(
    legend.title = element_text(size = 12, face = "bold"),
    legend.text = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold")
  )
  

# Save the plot
ggsave("PCA_plot_populations.png", p, dpi = 600)

# 11. FST heatmap
# I did not end up using this, and just included the FST values in the summary statistics table. I kept the code as a "show-your-work" ethic.
fst_raw <- read.table("populations/populations.fst_summary.tsv", header=TRUE, sep="\t")

colnames(fst_raw) <- c("Population1", "cs", "pcr", "sj", "wc")

fst_data <- fst_raw %>%
  pivot_longer(
    cols = -Population1, # all cols besides first are treated as value cols
    names_to = "Population2", # create Pop 2 column
    values_to = "Fst") %>% # add Fst values
  dplyr::filter(!is.na(.$Fst)) # drop NA's

ggplot(fst_data, aes(x = Population1, y = Population2, fill = Fst)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "blue", high = "red", na.value = "grey") +
  labs(title = "Pairwise Fst Heatmap", x = "Population 1", y = "Population 2") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# 12. Hapmap heterozygosity stats
hapstats_data <- read.table("populations/populations.hapstats.tsv", header=TRUE, sep = "\t")

colnames(hapstats_data) <- c("Locus_ID", "Chr", "BP", "Pop_ID", "N", "Haplotype_Cnt", "Gene_Diversity", "Smoothed_Gene_Diversity", "Smoothed_Gene_Diversity_Pval", "Haplotype_Diversity", "Smoothed_Haplotype_Diversity", "Smoothed_Haplotype_Diversity_Pval", "HWE_Pval", "HWE_Pval_SE", "Haplotypes")

# distinguish contigs in order to group them together later as grey
hapstats_data$Chrom_Type <- ifelse(grepl("^JACDQR", hapstats_data$Chr), "Contig", "Chromosome")

# Function to plot Gene Diversity for a given Pop_ID with randomized high-contrast colors. Prior to doing this I couldn't tell where one Chr ended and the next began. I tried adding Chromosome name annotations as a second x axis but that was too messy.
plot_gene_diversity <- function(data, pop_id) {
  # filter data for the specified Pop_ID
  pop_data <- data %>%
    filter(Pop_ID == pop_id)

  # order the data by Locus_ID for each chromosome
  pop_data <- pop_data %>%
    arrange(Chr, Locus_ID)

  # convert Chr to a factor to enforce order in the plot
  pop_data$Chr <- factor(pop_data$Chr, levels = unique(pop_data$Chr))

  # number of chromosomes
  num_chromosomes <- length(unique(pop_data$Chr[pop_data$Chrom_Type == "Chromosome"]))

  # generate high-contrast colours (I'm colour-blind) using the "Spectral" palette
  vibrant_colors <- hcl.colors(num_chromosomes, palette = "Spectral", alpha = 1)

  # shuffle the colors to avoid similar hues being placed next to each other
  set.seed(42)  # for reproducibility
  shuffled_colors <- sample(vibrant_colors)

  # map colors to chromosomes
  chromosome_levels <- unique(pop_data$Chr[pop_data$Chrom_Type == "Chromosome"])
  chromosome_color_mapping <- setNames(shuffled_colors, chromosome_levels)
  pop_data$Color <- ifelse(pop_data$Chrom_Type == "Chromosome", chromosome_color_mapping[pop_data$Chr], "grey")

  # Create the bar plot
  p <- ggplot(pop_data, aes(x = Locus_ID, y = Gene_Diversity, fill = Chr)) +
    geom_bar(stat = "identity", width = 1.35) +   # 1.35 was a good width for visualization
    scale_fill_manual(values = c(chromosome_color_mapping, "grey" = "grey")) +
    labs(title = paste(pop_id),
         x = "Locus ID",
         y = "Gene Diversity",
         fill = "Chromosome") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(legend.position = "none")

  return(p)
}

# generate plots for each population
pop_ids <- c("cs", "pcr", "wc", "sj")
plots <- lapply(pop_ids, function(pop_id) plot_gene_diversity(hapstats_data, pop_id))

# Combine the plots using patchwork
combined_plot <- plots[[1]] + plots[[2]] + plots[[3]] + plots[[4]] + 
  plot_layout(ncol = 2, nrow = 2) +
  # add legend
  plot_annotation(title = "Gene Diversity across Populations") +
  theme(legend.position = "right",  # Place legend on the right
        legend.justification = c(0.5, 0.5),  # Center the legend vertically and horizontally
        legend.box.margin = margin(0, 20, 0, 0))  # Add space to the right side

combined_plot
ggsave("Gene_Diversity_Plot_combined.png", plot = combined_plot, dpi = 600)

Assignment 1, BINF*6110

Noah Zeidenberg

2025-02-12