Next-generation sequencing (NGS) technologies have revolutionized the bioinformatic community’s ability to explore microbial diversity and understand the genetic basis of various diseases caused by bacteria such as Escherichia coli (E. coli). Genome assembly is a critical step in this process, transforming raw sequence reads into contiguous sequences that can provide insights into gene function, metabolic pathways, and evolutionary history. Accurate genome assembly not only facilitates the identification of novel genes or variations but also aids in understanding pathogenicity, antibiotic resistance, and virulence factors4.
Escherichia coli, a Gram-negative bacterium widely studied for its diverse phenotypes ranging from commensalism to severe pathogenesis, presents unique challenges for genomic studies5. The strain’s genetic diversity, including the presence of large plasmids and mobile genetic elements, complicates assembly efforts. Moreover, the repetitive nature of E. coli genomes often leads to fragmented assemblies with numerous gaps or misassembled regions when using traditional short-read sequencing technologies alone7.
Recent advances in long-read sequencing techniques have significantly improved assembly quality by providing better resolution of repetitive regions and complex genomic structures9. This has led to the development of various assembly strategies: short-read only assemblies, long-read only assemblies, and hybrid approaches that combine both data types. In this assignment, I compare three distinct assembly approaches using both Illumina short-read and Oxford Nanopore long-read sequencing data from an E. coli isolate.
My first pipeline utilizes SPAdes (St. Petersburg Assembler), a widely-used tool optimized for short-read assembly that employs sophisticated algorithms to handle complex genomic regions6. The second pipeline, run previously by Dr. Ricker, uses Flye. Flye specializes in long-read assembly using Oxford Nanopore Technology (ONT) data. Finally, the third approach (also run by Dr. Ricker) employs Unicycler, a hybrid assembler that integrates both short and long reads to potentially leverage the advantages of both technologies - the accuracy of short reads and the spanning capability of long reads.
Through this comparative analysis, my aim is to evaluate how these different assembly strategies perform in terms of contiguity, accuracy, and biological relevance. We assess each assembly using QUAST for quality metrics, while also examining the biological implications through the identification of antimicrobial resistance genes, plasmids, and virulence factors using specialized tools such as Abricate and Prokka.
See Appendix A: Data retrieval and Processing for the complete pipeline including logic behind parameter choice. Log files are zipped in the provided folder with the assignment. I refer to the SPAdes assembly pipeline as Pipeline 1 and the assessment of Flye and Unicycler assemblies as Pipeline 2.
The reference genome used was E.coli str. K-12 substr. MG16555.
Based on the fastqc reports, the input short-read data appears to be
of decent quality. Both forward and reverse reads had ~ 30% duplicate
sequences and no adapters were detected- Trimmomatic had the output of
Input Read Pairs: 1140172 Both Surviving: 1106488 (97.05%) Forward Only Surviving: 24677 (2.16%) Reverse Only Surviving: 6422 (0.56%) Dropped: 2585 (0.23%)
which indicates adapters had been removed prior to upload. Ideally, the
technician that sequenced the data would provide more information about
any processing that had been performed on the raw data before being
provided to me for analysis. I proceeded assuming that the only
modifications to the raw sequencing data was adapter trimming with
removal of bases with a PHRED33 score < 14, based on the multiqc
report of per-sequence base quality.
After running Trimmomatic and SPAdes assembly, QUAST and Abricate were applied to assess the quality of the assembly (see Appendix A: Data Retrieval and Processing). See the QUAST results below.
An assessment of the assembly would be in comparison to what would be expected of an E. coli assembly. A paper by Haendiges et al., 2021, stated: “The quality of [an E. coli assembly] was evaluated by examining the Quast and SPAdes outputs, focusing mainly on the values of contigs above >500 bp, total genome size, N50, average insert size, and Q30 read length quality”10. In the original release paper for QUAST, Table 1 compares the performance of various assembly tools on the same E. coli reference genome used here7. As such, the assessment of the assemblies in Appendix A will be in comparison to the expected values10.
The total size of the genome was ~ 5 Mbp, which is within the expected range of 4.6-5.2 Mbp for E. coli5. There were 185 contigs ≥ 1,000 bp. While a lower number of contigs is preferable for continuity, this count is reasonable for assemblies derived from short-read sequencing, and is actually smaller (better) than what Haendiges et al. found across 30 E. coli datasets10. It is also worth noting that based on QUAST alone we cannot make conclusions about the origin of these contigs, i.e. whether they are chromosomal or plasmid DNA. For example, the two reference genomes immediately available from NCBI, GCF_000005845.2 and GCF_000008865.2, have 1 contig (1 chromosome) and 3 contigs (1 chromosomal and 2 plasmid) respectively. Tools like PlasmidFinder could help distinguish the nature of each contig11. The largest contig was 232,547bp, ~5% of the whole genome on its own. The expected largest contig was between 200-400 kbp, so the observed value is reasonable if on the smaller side7. QUAST reported an N50 value of ~ 60 kbp and an LG50 value of 23, where N50 is the sequence length of the shortest contig at 50% of the total assembly length and L50 is the number of contigs of length at least N50. These values are poor compared to values observed in the literature- for example, Haendiges et al. calculated N50s all > 90 kbp and averaging ~120 kbp across 30 E. coli datasets, while Ishak (2020) calculated an N50 of ~ 270 kbp using SPAdes on PE Illumina data from an E. coli sample10,12]. Moreover, since N50 values can be artificially inflated due to concatenating contigs at the expense of more mismatches, the NA50 value was introduced7. The NA50 value here was even lower (~33 kbp).
As described by QUAST, “Genome fraction (%) is the total number of aligned bases in the reference, divided by genome size. A base in the reference genome is counted as aligned if there is at least one contig with at least one alignment to this base. Contigs … may be counted multiple times”. The genome fraction was calculated to be ~ 90%, meaning ~ 90% of the reference was covered by the short-read assembly, and the 10% may be missing or fragmented. This is in-line with the expected value7. Given that the duplication rate was 1.0 (i.e. for every 1 aligned base in the assembly there was ~1 aligned base in the genome) contigs contributing to the genome fraction likely are not covering the same region of the reference7.
There were 70 total misassemblies, all relocations (the left flanking sequence aligns > 1kbp away from the right flanking sequence on the reference OR they overlap by more than 1kbp) over 44 contigs (meaning some contigs had more than just one misassembly). This is significantly greater than the expected value of 11 misassemblies7. 39 contigs (total length 262 kbp) did not align at all, while 80 contigs had parial unalignment (total length 570 kbp). These could be contamination or regions not present in the reference, and it is very possibly either! It can be very challenging to keep culture isolates uncontaminated, and E. coli has demonstrated significant genomic diversity between strains (up to 800 kbp differences in chromosome length!)1.
Overall the assembly was not bad, though the low N50 and high number of misassemblies/unalignments are areas of concern. The Abricate package was used with NCBI, plasmidfinder, and the virulence factors database to produce tabs 1-3 of the following table, while prokka was used to generate the gene annotations for tab 4.
To summarize the results of the table, 13 plasmids were identified with relatively high % identity, with some characterized in Klebsiella pneumoniae but most characterized as E. coli-derived2. These plasmids ranged in size, but most were fragments and quite small. Five AMR genes (for 4 antimicrobials) were identified, all with 100% identity. All five genes are expected for E. coli15. In terms of virulence factors, 65 were found and only 1 had < 98% coverage while only 3 had < 88% identity. Most were associated with E. coli, though 11 were associated with Shigella dysenteriae, 4 were associated with Salmonella enterica, and 11 were associated with Yersinia pestis. In conclusion, future work could include using the KEGG database with the gene annotation data from prokka, which could improve understanding of which pathways had their genes sequenced in this dataset.
For both Flye and Unicycler, the total length was again ~ 5 Mbp which is expected for E. coli. For the Flye pipeline, there were 19 contigs ≥ 1000 bp while for Unicycler there were 27. In both cases this is significantly lower than the SPAdes assembly from Pipeline 1. The Flye assembly had an N50 of ~1 Mbp while the Unicycler assembly had an N50 of ~ 1.3 Mbp. These are ~ 16X higher than the N50 from pipeline 1, and more in-line with what was expected. Interestingly, the NA50 values are still similar to the SPAdes assembly- ~ 50 kbp for Flye and 51 kbp for Unicycler. As previously discussed, N50 values can be artificially inflated by concatenating contigs at the expense of more misassemblies.
The genome fraction for the Flye and Unicycler assemblies were both ~ 91%, which was similar to that calculated for the SPAdes assembly. The largest alignment was ~200, 000 for both Flye and Unicycler, which is ~2x larger than that of SPAdes. Duplication ratios were similar to SPAdes’.
Flye and Unicycler had larger counts for misassemblies- 156 and 152, respectively. This was predicted by the large difference between N50 and NA50 values, but it is interesting that the N50 values for Flye and Unicycler were ~ 16X greater than that of SPAdes, but the # of misassemblies is only twice as high. The implication with that, is there is a “sweet spot” for contig concatenation threshold such that N50 is maximized and # misassemblies is minimized in an optimization problem. There were 10 and 13 fully unaligned contigs, which is better than SPAdes’ 39, but this was likely at the expense of increasing # of misassemblies. For example, the missasembled contig length was ~ 4.8 Mbp for both Flye and Unicycler- in a genome that’s only ~ 5 Mbp!
Overall, the Flye and Unicycler assemblies are suprisingly poor (particularly for Unicycler, which had both short- and long-read data as input!). However, this is likely a reflection of the quality of the input data (garbage in, garbage out) and the parameters used in the analysis. Parameter tuning might solve the issue with misassembly almost entirely, and should be included in future work.
The above table was generated as in pipeline 1, and displays the same information. The 5 AMR genes and 65 virulence factors found in the SPAdes assembly were also found in the Flye and Unicycler assemblies, with one more virulence factor found in the Flye and Unicycler assemblies. This was possibly captured by the long-read data. Interestingly, running Prokka on the Flye assembly resulted in > 10,000 annotated genes, but SPAdes and Unicycler only had ~ 5,000.
KEGG could be used on the Flye assembly, possibly with better results than SPAdes or Unicycler. However, many of the annotated proteins are hypothetical, so more work must be done to characterize them first.
In conclusion, I would argue that the SPAdes assembly performed best despite the low N50 value, and would characterize the assembly as more “reliable” or “accurate” than the Flye and Unicycler assemblies due to their inflated misassembly count.
# The data is from an Illumina 150bp paired-end run and was provided by Dr. Ricker. I used BLASTn to validate that the sequences were E. coli, and reached out to Dr. Ricker to check if the data was a pure culture isolate or a mixture (it was an isolate).
# 1. I first ran fastqc and multiqc to aggregate quality information for short-reads
fastqc 136x2_R1.fastq 136x2_R2.fastq -o fastqc_results/
multiqc fastqc_results/
# 2. After investigating the multiqc report, I used Trimmomatic to remove Illumina adapters and any bases that did not meet quality standards (see below).
# For Trimmomatic, the manual outlines the order of steps quite clearly. First is clipping adapters from the 3' end.
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE \
# PE is for 'paired-end' mode
136x2_R1.fastq 136x2_R2.fastq \
# forward and reverse reads are passed in as input
trimmed_reads/136x2_R1_trimmed.fastq trimmed_reads/136x2_R1_unpaired.fastq \
# trimmed forward and orphaned forward reads are partial output
trimmed_reads/136x2_R2_trimmed.fastq trimmed_reads/136x2_R2_unpaired.fastq \
# trimmed reverse and orphaned reverse and the other part of the output
ILLUMINACLIP:$EBROOTTRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10 \
# the ILLUMINACLIP command follows the syntax "":<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>.
# I chose to allow 2 mismatches max in the adapter alignments, 30 maximum mismatches while still allowing a full match, 10 percent match minimum between the two ‘adapter ligated’ reads in order for PE palindrome read alignment. I did not specify a simple clip threshold, which specifies the min % match required for an adapter and read.
MINLEN:36 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:15
# after trimming, reads with a min length > 36 will be kept. Bases marked with an "Illumina Low Quality" PHRED score of 2 will be removed from the 5' and 3' ends after adapter trimming. On top of this, a sliding window of 4 bases will remove bases if they cause an average score < 15, which avoids removing high-quality bases that happen to be adjacent to a low-quality base.
# These parameters are based on those observed in literature, and though somewhat arbitrary, they do offer a good balance between removing low-quality bases without removing too much of the actual information present in the data.
# 3. I then used SPAdes for de novo assembly of the genome. I chose SPAdes because it is optimized for short-read Illumina sequencing data, common for bacterial genome assembly, and was already available on the digital research alliance of Canada's computing clusters.
# Like Trimmomatic, the SPAdes manual also outlines its use well.
spades.py \
-1 trimmed_reads/136x2_R1_trimmed.fastq \
-2 trimmed_reads/136x2_R2_trimmed.fastq \
-o spades_output \
--isolate
# The isolate flag should improve assembly quality and is "highly recommended for high-coverage isolate Illumina data". If Dr. Ricker had been unable to verify that the data was from a culture isolate, I would have used the "careful" flag instead, as recommended in the manual.
# 4. I then used QUAST to assess the quality of the SPAdes assembly.
module load StdEnv/2020 gcc/9.3.0 quast # must downgrade to use quast module
quast.py spades_output/contigs.fasta \
# using an E. coli reference should improve accuracy
-r /reference/Ecoli_K12_MG1655.fasta
# providing the short reads too
--pe1 /trimmed_reads/136x2_R1_trimmed.fastq \
--pe2 /trimmed_reads/136x2_R2_trimmed.fastq \
-o analysis_results/quast \
# contigs < 1kb are typically low-quality or artifacts, so I'm excluding them
--min-contig 1000 \
# and I want statistics for the distribution of contig size
--contig-thresholds 0, 1000, 5000, 10000, 25000, 50000
# 5. To improve what interpretations could be made on the data, I used Abricate to search for AMR genes, plasmids and virulence factors. Abricate is an aggregation of various tools rather than it's own single tool.
module load abricate
# For example, ncbi is called *through* abricate in order to search for AMR genes
abricate --db ncbi spades_output/contigs.fasta > analysis_results/amr/ncbi_amr.tab
# the PlasmidFinder database is also called via abricate in order to search for plasmids
abricate --db plasmidfinder spades_output/contigs.fasta > analysis_results/plasmids/plasmids.tab
# then I searched the vfdb (virulence factors database) via abricate
abricate --db vfdb spades_output/contigs.fasta > analysis_results/virulence/vfdb.tab
# 6. Lastly, I used Prokka for gene annotation
module load prokka
prokka --outdir analysis_results/annotation \
--prefix assembly \
# prefix sets the prefix for output files
spades_output/contigs.fasta \
--kingdom Bacteria --genus Escherichia --species coli \
# the default kingdom flag is set to Bacteria but I'm specifying it for proof-of-work. I have also included the genus and species flags to improve accuracy for CDS annotations that may be genus/species specific.
# --centre X --compliant --addgenes \
# if I were submitting to GenBank I'd include these flags as well to ensure my submission fulfilled the required criteria
--force
# The "force" tag just makes sure prokka runs even if I've run it before in the outdir.
# Before I repeated steps 4-6 on the provided output from Dr. Ricker running Flye and Unicycler on the long-read data, I checked the .log files for each to verify the parameters used.
# For flye: "/home/boerlin/anaconda3/bin/flye --nano-raw /home/boerlin/Desktop/GC_JM_plasmids/long_reads/136x2LRmerged.fastq --genome-size 5M -o /home/boerlin/Desktop/flyeOUT_136x2" # specifies nanopore raw (uncorrected) reads and an expected genome size of 5M (reasonable for E. coli)
# For Unicycler: "/home/boerlin/anaconda3/bin/unicycler -1 /home/boerlin/Desktop/GC_JM_plasmids/short_reads/136x2/136x2_S89_L001_R1_001.fastq -2 /home/boerlin/Desktop/GC_JM_plasmids/short_reads/136x2/136x2_S89_L001_R2_001.fastq -l /home/boerlin/Desktop/GC_JM_plasmids/long_reads/136x2LRmerged.fastq -o /home/boerlin/Desktop/GC_JM_plasmids/UNICYCLER_136x2" # hybrid assembly using SPAdes for the short-reads and improve bridging estimates with miniasm for long-reads.
# I repeated steps 4-6 on the provided output from Dr. Ricker running Flye and Unicycler on the longread data.i.e.
quast.py Flye_136x2/Flye_assembly.fasta \
# using an E coli reference sequence helps
-r /reference/Ecoli_K12_MG1655.fasta \
# specifying nanopore long-reads
--nanopore 136x2_longreads.fastq \
-o /analysis_results/flye/quast \
# only looking at contigs > 1kb
--min-contig 1000 \
# generating statistics for different sized contig thresholds
--contig-thresholds 0, 1000, 5000, 10000, 25000, 50000
quast.py Unicycler_136x2/Unicycler_assembly.fasta \
# providing forward and reverse short reads
--pe1 trimmed_reads/136x2_R1_trimmed.fastq \
--pe2 trimmed_reads/136x2_R2_trimmed.fastq \
-o /analysis_results/unicycler/quast
# ... ""# Generate summary report for my pipeline (see Appendix A, steps 1-6)
DATA_DIR=$(readlink -f /home/nzeidenb/scratch/A3/)
CURRENT_DIR=$(pwd)
echo "Generating summary report..."
echo "Assembly and Analysis Summary" > ${CURRENT_DIR}/analysis_results/summary.txt
echo "============================" >> ${CURRENT_DIR}/analysis_results/summary.txt
echo "" >> ${CURRENT_DIR}/analysis_results/summary.txt
# Add QUAST summary
echo "Assembly Statistics:" >> ${CURRENT_DIR}/analysis_results/summary.txt
grep "Total length" ${CURRENT_DIR}/analysis_results/quast/report.txt >> ${CURRENT_DIR}/analysis_results/summary.txt
grep "# contigs" ${CURRENT_DIR}/analysis_results/quast/report.txt >> ${CURRENT_DIR}/analysis_results/summary.txt
grep "N50" ${CURRENT_DIR}/analysis_results/quast/report.txt >> ${CURRENT_DIR}/analysis_results/summary.txt
echo "" >> ${CURRENT_DIR}/analysis_results/summary.txt
# Add AMR summary
echo "Antimicrobial Resistance Genes:" >> ${CURRENT_DIR}/analysis_results/summary.txt
abricate --summary ${CURRENT_DIR}/analysis_results/amr/ncbi_amr.tab >> ${CURRENT_DIR}/analysis_results/summary.txt
echo "" >> ${CURRENT_DIR}/analysis_results/summary.txt
# Add Plasmid summary
echo "Plasmid Replicons:" >> ${CURRENT_DIR}/analysis_results/summary.txt
abricate --summary ${CURRENT_DIR}/analysis_results/plasmids/plasmids.tab >> ${CURRENT_DIR}/analysis_results/summary.txt
echo "" >> ${CURRENT_DIR}/analysis_results/summary.txt
# Add Virulence Factors summary
echo "Virulence Factors:" >> ${CURRENT_DIR}/analysis_results/summary.txt
abricate --summary ${CURRENT_DIR}/analysis_results/virulence/vfdb.tab >> ${CURRENT_DIR}/analysis_results/summary.txt