Introduction

Next-generation sequencing (NGS) has made it practical to characterize bacterial isolates at nucleotide resolution—from commensal strains to pathogens such as Escherichia coli (E. coli). Genome assembly is the central step that turns raw reads into contigs and scaffolds suitable for gene prediction, pathway analysis, and comparative genomics. High-quality assemblies support detection of resistance and virulence loci that inform surveillance and basic research⁴.

Escherichia coli, a Gram-negative bacterium widely studied for its diverse phenotypes ranging from commensalism to severe pathogenesis, presents unique challenges for genomic studies⁵. The strain’s genetic diversity, including the presence of large plasmids and mobile genetic elements, complicates assembly efforts. Moreover, the repetitive nature of E. coli genomes often leads to fragmented assemblies with numerous gaps or misassembled regions when using traditional short-read sequencing technologies alone⁷.

Recent advances in long-read sequencing techniques have significantly improved assembly quality by providing better resolution of repetitive regions and complex genomic structures⁹. This exploratory project compares three assembly strategies on the same E. coli culture isolate: Illumina paired-end short reads, Oxford Nanopore (ONT) long reads, and a hybrid approach that integrates both data types.

I implemented a short-read workflow with SPAdes (St. Petersburg Assembler), which is widely used for bacterial Illumina assemblies and handles uneven coverage and repeats better than many de Bruijn alternatives⁶. Flye was used for ONT-only assembly because it is designed for long, error-prone reads and can resolve repetitive regions that fragment short-read graphs. Unicycler combines Illumina and ONT data in a hybrid graph, aiming to pair the base-level accuracy of short reads with the long-range connectivity of nanopore reads. Flye and Unicycler assemblies were produced on shared compute resources; I re-ran downstream QC and annotation on all three assemblies so comparisons reflect the same evaluation criteria.

The workflow emphasizes reproducible bacterial genomics: raw reads were validated, trimmed, and assembled on a Linux HPC cluster using Bash shell scripts to chain standard command-line tools. FastQC and MultiQC summarize read quality; QUAST quantifies assembly contiguity and reference agreement (N50, genome fraction, misassembly counts); Abricate screens assemblies against curated AMR, plasmid, and virulence databases; and Prokka annotates coding sequences for functional follow-up. Together, these steps mirror common isolate characterization pipelines in public-health and research microbiology labs.

Methods

See Appendix A: Data retrieval and Processing for command-level detail, parameter rationale, and reproducibility notes. The SPAdes short-read workflow is summarized in Short-read assembly (SPAdes); Flye and Unicycler results are discussed under Long-read and hybrid assemblies, evaluated with the same QUAST and Abricate steps applied to the SPAdes contigs.

The reference genome used was E.coli str. K-12 substr. MG1655⁵.

Results and Discussion

Short-read assembly (SPAdes)

FastQC and MultiQC reports indicate the Illumina paired-end libraries were suitable for assembly: both mates showed ~30% duplicate reads and no residual adapters after Trimmomatic (Input Read Pairs: 1140172 Both Surviving: 1106488 (97.05%) Forward Only Surviving: 24677 (2.16%) Reverse Only Surviving: 6422 (0.56%) Dropped: 2585 (0.23%)), consistent with adapter removal and light quality filtering upstream. Without full provenance metadata, I treated the files as adapter-trimmed with bases below PHRED33 14 removed, matching the per-base quality profile in MultiQC.

After Trimmomatic preprocessing and SPAdes assembly, contigs were evaluated with QUAST (reference-based statistics) and Abricate (functional screening); see Appendix A for commands. QUAST summary output is embedded below.

Assembly quality was benchmarked against published E. coli expectations. Haendiges et al. (2021) emphasize contig count above 500 bp, total genome size, N50, insert size, and Q30 read quality when judging SPAdes/QUAST outputs¹⁰. The original QUAST manuscript reports reference-based metrics for multiple assemblers on the same K-12 reference used here⁷, providing external anchors for interpreting N50, misassembly, and coverage statistics¹⁰.

Assembly Size and Contiguity

The total size of the genome was ~ 5 Mbp, which is within the expected range of 4.6-5.2 Mbp for E. coli⁵. There were 185 contigs ≥ 1,000 bp. While a lower number of contigs is preferable for continuity, this count is reasonable for assemblies derived from short-read sequencing, and is actually smaller (better) than what Haendiges et al. found across 30 E. coli datasets¹⁰. It is also worth noting that based on QUAST alone we cannot make conclusions about the origin of these contigs, i.e. whether they are chromosomal or plasmid DNA. For example, the two reference genomes immediately available from NCBI, GCF_000005845.2 and GCF_000008865.2, have 1 contig (1 chromosome) and 3 contigs (1 chromosomal and 2 plasmid) respectively. Tools like PlasmidFinder could help distinguish the nature of each contig¹¹. The largest contig was 232,547bp, ~5% of the whole genome on its own. The expected largest contig was between 200-400 kbp, so the observed value is reasonable if on the smaller side⁷. QUAST reported an N50 value of ~ 60 kbp and an LG50 value of 23, where N50 is the sequence length of the shortest contig at 50% of the total assembly length and L50 is the number of contigs of length at least N50. These values are poor compared to values observed in the literature- for example, Haendiges et al. calculated N50s all > 90 kbp and averaging ~120 kbp across 30 E. coli datasets, while Ishak (2020) calculated an N50 of ~ 270 kbp using SPAdes on PE Illumina data from an E. coli sample^10,¹²]. Moreover, since N50 values can be artificially inflated due to concatenating contigs at the expense of more mismatches, the NA50 value was introduced⁷. The NA50 value here was even lower (~33 kbp).

Alignment Metrics

As described by QUAST, “Genome fraction (%) is the total number of aligned bases in the reference, divided by genome size. A base in the reference genome is counted as aligned if there is at least one contig with at least one alignment to this base. Contigs … may be counted multiple times”. The genome fraction was calculated to be ~ 90%, meaning ~ 90% of the reference was covered by the short-read assembly, and the 10% may be missing or fragmented. This is in-line with the expected value⁷. Given that the duplication rate was 1.0 (i.e. for every 1 aligned base in the assembly there was ~1 aligned base in the genome) contigs contributing to the genome fraction likely are not covering the same region of the reference⁷.

Misassemblies and Partial Alignments

There were 70 total misassemblies, all relocations (the left flanking sequence aligns > 1kbp away from the right flanking sequence on the reference OR they overlap by more than 1kbp) over 44 contigs (meaning some contigs had more than just one misassembly). This is significantly greater than the expected value of 11 misassemblies⁷. 39 contigs (total length 262 kbp) did not align at all, while 80 contigs had partial unalignment (total length 570 kbp). These could be contamination or regions not present in the reference, and it is very possibly either! It can be very challenging to keep culture isolates uncontaminated, and E. coli has demonstrated significant genomic diversity between strains (up to 800 kbp differences in chromosome length!)¹.

Overall Interpretations and Future Analyses

Overall, the short-read assembly is biologically plausible but fragmented: modest N50 and elevated misassembly/unaligned contig counts warrant cautious interpretation. Abricate (tabs 1–3) queries NCBI AMR, PlasmidFinder, and VirulenceFinder databases in one pass; Prokka supplies structural gene calls for tab 4 and downstream pathway analysis.

Functional screening recovered 13 plasmid replicon hits (mostly E. coli-associated, with a subset annotated in Klebsiella pneumoniae)², many on small contigs consistent with incomplete plasmid resolution from short reads alone. Five AMR markers spanning four drug classes matched at 100% identity—profiles typical of environmental or clinical E. coli¹⁵. Abricate reported 65 virulence-associated loci with strong coverage/identity; cross-species database hits (e.g., Shigella, Salmonella, Yersinia) likely reflect shared virulence gene families rather than mixed cultures. Mapping Prokka annotations to KEGG would be a natural next step to link resistance and virulence loci to metabolic pathways.

Long-read and hybrid assemblies (Flye & Unicycler)

Flye QUAST
Unicycler QUAST

Assembly Size and Contiguity

For both Flye and Unicycler, the total length was again ~ 5 Mbp which is expected for E. coli. For the Flye pipeline, there were 19 contigs ≥ 1000 bp while for Unicycler there were 27. In both cases this is significantly lower than the SPAdes assembly from Pipeline 1. The Flye assembly had an N50 of ~1 Mbp while the Unicycler assembly had an N50 of ~ 1.3 Mbp. These are ~ 16X higher than the N50 from pipeline 1, and more in-line with what was expected. Interestingly, the NA50 values are still similar to the SPAdes assembly- ~ 50 kbp for Flye and 51 kbp for Unicycler. As previously discussed, N50 values can be artificially inflated by concatenating contigs at the expense of more misassemblies.

Alignment Metrics

The genome fraction for the Flye and Unicycler assemblies were both ~ 91%, which was similar to that calculated for the SPAdes assembly. The largest alignment was ~200, 000 for both Flye and Unicycler, which is ~2x larger than that of SPAdes. Duplication ratios were similar to SPAdes’.

Misassemblies and Partial Alignments

Flye and Unicycler had larger counts for misassemblies- 156 and 152, respectively. This was predicted by the large difference between N50 and NA50 values, but it is interesting that the N50 values for Flye and Unicycler were ~ 16X greater than that of SPAdes, but the # of misassemblies is only twice as high. The implication with that, is there is a “sweet spot” for contig concatenation threshold such that N50 is maximized and # misassemblies is minimized in an optimization problem. There were 10 and 13 fully unaligned contigs, which is better than SPAdes’ 39, but this was likely at the expense of increasing # of misassemblies. For example, the misassembled contig length was ~ 4.8 Mbp for both Flye and Unicycler- in a genome that’s only ~ 5 Mbp!

Overall Interpretations and Future Analyses

Despite higher headline N50 values, Flye and Unicycler contigs show heavy misassembly signals—especially surprising for Unicycler, which ingested both read types. That pattern underscores why contiguity metrics alone are insufficient for bacterial QC: read depth, ONT error profile, and assembler tuning all influence graph correctness. Iterative parameter optimization (or polishing with short reads) would be the logical extension of this comparison.

Flye Assembly
Unicycler Assembly

The same Abricate/Prokka workflow was applied to long-read and hybrid contigs. Core AMR and virulence calls from the SPAdes assembly were recovered in Flye and Unicycler, with one additional virulence hit on the long-read–based graphs—possibly reflecting improved resolution of repeat-rich loci. Prokka gene counts diverged sharply (~10,000 on Flye versus ~5,000 on SPAdes/Unicycler), highlighting how assembly fragmentation can inflate hypothetical open-reading-frame predictions and complicate comparative annotation.

Pathway reconstruction (e.g., via KEGG) would be most informative once hypothetical annotations are filtered or confirmed with orthogonal evidence; the Flye graph may offer longer coding sequences, but only if misassemblies are addressed first.

Taken together, this project shows end-to-end bacterial genomics skills: HPC workflow design in Bash, Illumina QC/trimming, de novo assembly, and statistical assembly QC with QUAST, followed by database-driven screening for clinically relevant loci. For this isolate and dataset, the SPAdes short-read assembly balanced interpretability and misassembly rate better than the long-read or hybrid graphs, even though nanopore assemblies looked more contiguous on paper—a useful reminder to pair contiguity metrics with reference alignment and functional consistency checks.

References

Ochman, H., and Jones, I.B. (2000). Evolutionary dynamics of full genome content in escherichia coli. The EMBO journal.

Gelalcha, B.D., Mohammed, R.I., Gelgie, A.E., and Kerro Dego, O. (2023). Molecular epidemiology and pathogenomics of extended-spectrum beta-lactamase producing-escherichia coli and-klebsiella pneumoniae isolates from bulk tank milk in tennessee, USA. Frontiers in Microbiology 14, 1283165.

Jakobsen, L., Sandvang, D., Hansen, L.H., Bagger-Skjøt, L., Westh, H., Jørgensen, C., Hansen, D.S., Pedersen, B.M., Monnet, D.L., Frimodt-Møller, N., et al. (2008). Characterisation, dissemination and persistence of gentamicin resistant escherichia coli from a danish university hospital to the waste water environment. Environment international 34, 108–115.

Johnson, A., Burns, L., Woodford, N., Threlfall, E., Naidoo, J., Cooke, E., and George, R. (1994). Gentamicin resistance in clinical isolates of escherichia coli encoded by genes of veterinary origin. Journal of medical microbiology 40, 221–226.

Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., et al. (1997). The complete genome sequence of escherichia coli k-12. science 277, 1453–1462.

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., et al. (2012). SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology 19, 455–477.

Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D., and Gurevich, A. (2018). Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150.

Jain, M., Koren, S., Miga, K.H., Quick, J., Rand, A.C., Sasani, T.A., Tyson, J.R., Beggs, A.D., Dilthey, A.T., Fiddes, I.T., et al. (2018). Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature biotechnology 36, 338–345.

Sereika, M., Kirkegaard, R.H., Karst, S.M., Michaelsen, T.Y., Sørensen, E.A., Wollenberg, R.D., and Albertsen, M. (2022). Oxford nanopore R10. 4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nature methods 19, 823–826.

10.

Haendiges, J., Jinneman, K., and Gonzalez-Escalona, N. (2021). Choice of library preparation affects sequence quality, genome assembly, and precise in silico prediction of virulence genes in shiga toxin-producing escherichia coli. Plos one 16, e0242294.

11.

Carattoli, A., and Hasman, H. (2020). PlasmidFinder and in silico pMLST: Identification and typing of plasmid replicons in whole-genome sequencing (WGS). Horizontal gene transfer: methods and protocols, 285–294.

12.

Ishak, N.A.M. (2020). Performance analysis of bacterial genome assemblers using illumina next generation sequencing data.

13.

Lalak, A., Wasyl, D., Zając, M., Skarżyńska, M., Hoszowski, A., Samcik, I., Woźniakowski, G., and Szulowski, K. (2016). Mechanisms of cephalosporin resistance in indicator escherichia coli isolated from food animals. Veterinary Microbiology 194, 69–73.

14.

Webber, M., and Piddock, L.J. (2001). Quinolone resistance in escherichia coli. Veterinary research 32, 275–284.

15.

Pepoyan, A., Manvelyan, A., Balayan, M., Harutyunyan, N., Tsaturyan, V., Batikyan, H., Bren, A., Chistyakov, V., Weeks, R., and Chikindas, M. (2023). Tetracycline resistance of escherichia coli isolated from water, human stool, and fish gills from the lake sevan basin. Letters in Applied Microbiology 76, ovad021.

16.

Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. (No Title).

17.

Ewels, P., Magnusson, M., Lundin, S., and Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048.

18.

Bolger, A.M., Lohse, M., and Usadel, B. (2014). Trimmomatic: A flexible trimmer for illumina sequence data. Bioinformatics 30, 2114–2120.

19.

Seemann, T. (2014). Prokka: Rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069.

20.

Silva, J.G.P. A METHODOLOGICAL APPROACH TO ASSESS THE QUALITY OF BACTERIAL GENOMES.

Appendix A: Data retrieval and Processing

# The data is from an Illumina 150bp paired-end run and was provided by Dr. Ricker. I used BLASTn to validate that the sequences were E. coli, and reached out to Dr. Ricker to check if the data was a pure culture isolate or a mixture (it was an isolate).

# 1. I first ran fastqc and multiqc to aggregate quality information for short-reads
fastqc 136x2_R1.fastq 136x2_R2.fastq -o fastqc_results/

multiqc fastqc_results/

# 2. After investigating the multiqc report, I used Trimmomatic to remove Illumina adapters and any bases that did not meet quality standards (see below).
# For Trimmomatic, the manual outlines the order of steps quite clearly. First is clipping adapters from the 3' end.
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE \
                                                # PE is for 'paired-end' mode
    136x2_R1.fastq 136x2_R2.fastq \
# forward and reverse reads are passed in as input
    trimmed_reads/136x2_R1_trimmed.fastq trimmed_reads/136x2_R1_unpaired.fastq \
# trimmed forward and orphaned forward reads are partial output
    trimmed_reads/136x2_R2_trimmed.fastq trimmed_reads/136x2_R2_unpaired.fastq \
# trimmed reverse and orphaned reverse and the other part of the output
    ILLUMINACLIP:$EBROOTTRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10 \
# the ILLUMINACLIP command follows the syntax "":<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>.
# I chose to allow 2 mismatches max in the adapter alignments, 30 maximum mismatches while still allowing a full match, 10 percent match minimum between the two ‘adapter ligated’ reads in order for PE palindrome read alignment. I did not specify a simple clip threshold, which specifies the min % match required for an adapter and read.
    MINLEN:36 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:15 
# after trimming, reads with a min length > 36 will be kept. Bases marked with an "Illumina Low Quality" PHRED score of 2 will be removed from the 5' and 3' ends after adapter trimming. On top of this, a sliding window of 4 bases will remove bases if they cause an average score < 15, which avoids removing high-quality bases that happen to be adjacent to a low-quality base.
# These parameters are based on those observed in literature, and though somewhat arbitrary, they do offer a good balance between removing low-quality bases without removing too much of the actual information present in the data.

# 3. I then used SPAdes for de novo assembly of the genome. I chose SPAdes because it is optimized for short-read Illumina sequencing data, common for bacterial genome assembly, and was already available on the digital research alliance of Canada's computing clusters.

# Like Trimmomatic, the SPAdes manual also outlines its use well.
spades.py \
    -1 trimmed_reads/136x2_R1_trimmed.fastq \
    -2 trimmed_reads/136x2_R2_trimmed.fastq \
    -o spades_output \
    --isolate
# The isolate flag should improve assembly quality and is "highly recommended for high-coverage isolate Illumina data". If Dr. Ricker had been unable to verify that the data was from a culture isolate, I would have used the "careful" flag instead, as recommended in the manual.

# 4. I then used QUAST to assess the quality of the SPAdes assembly.
module load StdEnv/2020 gcc/9.3.0 quast # must downgrade to use quast module
quast.py spades_output/contigs.fasta \
    # using an E. coli reference should improve accuracy
    -r /reference/Ecoli_K12_MG1655.fasta
    # providing the short reads too
    --pe1 /trimmed_reads/136x2_R1_trimmed.fastq \
    --pe2 /trimmed_reads/136x2_R2_trimmed.fastq \
    -o analysis_results/quast \
    # contigs < 1kb are typically low-quality or artifacts, so I'm excluding them
    --min-contig 1000 \
    # and I want statistics for the distribution of contig size
    --contig-thresholds 0, 1000, 5000, 10000, 25000, 50000


# 5. To improve what interpretations could be made on the data, I used Abricate to search for AMR genes, plasmids and virulence factors. Abricate is an aggregation of various tools rather than it's own single tool.
module load abricate
# For example, ncbi is called *through* abricate in order to search for AMR genes
abricate --db ncbi spades_output/contigs.fasta > analysis_results/amr/ncbi_amr.tab
# the PlasmidFinder database is also called via abricate in order to search for plasmids
abricate --db plasmidfinder spades_output/contigs.fasta > analysis_results/plasmids/plasmids.tab
# then I searched the vfdb (virulence factors database) via abricate
abricate --db vfdb spades_output/contigs.fasta > analysis_results/virulence/vfdb.tab

# 6. Lastly, I used Prokka for gene annotation
module load prokka
prokka --outdir analysis_results/annotation \
    --prefix assembly \
# prefix sets the prefix for output files
    spades_output/contigs.fasta \
    --kingdom Bacteria --genus Escherichia --species coli \
# the default kingdom flag is set to Bacteria but I'm specifying it for proof-of-work. I have also included the genus and species flags to improve accuracy for CDS annotations that may be genus/species specific.
#    --centre X --compliant --addgenes \
# if I were submitting to GenBank I'd include these flags as well to ensure my submission fulfilled the required criteria
      --force
# The "force" tag just makes sure prokka runs even if I've run it before in the outdir.

# Before I repeated steps 4-6 on the provided output from Dr. Ricker running Flye and Unicycler on the long-read data, I checked the .log files for each to verify the parameters used.
# For flye: "/home/boerlin/anaconda3/bin/flye --nano-raw /home/boerlin/Desktop/GC_JM_plasmids/long_reads/136x2LRmerged.fastq --genome-size 5M -o /home/boerlin/Desktop/flyeOUT_136x2" # specifies nanopore raw (uncorrected) reads and an expected genome size of 5M (reasonable for E. coli)

# For Unicycler: "/home/boerlin/anaconda3/bin/unicycler -1 /home/boerlin/Desktop/GC_JM_plasmids/short_reads/136x2/136x2_S89_L001_R1_001.fastq -2 /home/boerlin/Desktop/GC_JM_plasmids/short_reads/136x2/136x2_S89_L001_R2_001.fastq -l /home/boerlin/Desktop/GC_JM_plasmids/long_reads/136x2LRmerged.fastq -o /home/boerlin/Desktop/GC_JM_plasmids/UNICYCLER_136x2" # hybrid assembly using SPAdes for the short-reads and improve bridging estimates with miniasm for long-reads.

# I repeated steps 4-6 on the provided output from Dr. Ricker running Flye and Unicycler on the longread data.i.e.
quast.py Flye_136x2/Flye_assembly.fasta \
    # using an E coli reference sequence helps 
    -r /reference/Ecoli_K12_MG1655.fasta \
    # specifying nanopore long-reads
    --nanopore 136x2_longreads.fastq \
    -o /analysis_results/flye/quast \
    # only looking at contigs > 1kb
    --min-contig 1000 \
    # generating statistics for different sized contig thresholds
    --contig-thresholds 0, 1000, 5000, 10000, 25000, 50000
  
    
quast.py Unicycler_136x2/Unicycler_assembly.fasta \
    # providing forward and reverse short reads
    --pe1 trimmed_reads/136x2_R1_trimmed.fastq \
    --pe2 trimmed_reads/136x2_R2_trimmed.fastq \
    -o /analysis_results/unicycler/quast
    
# ... ""

Appendix B: Data analysis

# Generate summary report for my pipeline (see Appendix A, steps 1-6)
DATA_DIR=$(readlink -f /home/nzeidenb/scratch/A3/)
CURRENT_DIR=$(pwd)
echo "Generating summary report..."
echo "Assembly and Analysis Summary" > ${CURRENT_DIR}/analysis_results/summary.txt
echo "============================" >> ${CURRENT_DIR}/analysis_results/summary.txt
echo "" >> ${CURRENT_DIR}/analysis_results/summary.txt

# Add QUAST summary
echo "Assembly Statistics:" >> ${CURRENT_DIR}/analysis_results/summary.txt
grep "Total length" ${CURRENT_DIR}/analysis_results/quast/report.txt >> ${CURRENT_DIR}/analysis_results/summary.txt
grep "# contigs" ${CURRENT_DIR}/analysis_results/quast/report.txt >> ${CURRENT_DIR}/analysis_results/summary.txt
grep "N50" ${CURRENT_DIR}/analysis_results/quast/report.txt >> ${CURRENT_DIR}/analysis_results/summary.txt
echo "" >> ${CURRENT_DIR}/analysis_results/summary.txt

# Add AMR summary
echo "Antimicrobial Resistance Genes:" >> ${CURRENT_DIR}/analysis_results/summary.txt
abricate --summary ${CURRENT_DIR}/analysis_results/amr/ncbi_amr.tab >> ${CURRENT_DIR}/analysis_results/summary.txt
echo "" >> ${CURRENT_DIR}/analysis_results/summary.txt

# Add Plasmid summary
echo "Plasmid Replicons:" >> ${CURRENT_DIR}/analysis_results/summary.txt
abricate --summary ${CURRENT_DIR}/analysis_results/plasmids/plasmids.tab >> ${CURRENT_DIR}/analysis_results/summary.txt
echo "" >> ${CURRENT_DIR}/analysis_results/summary.txt

# Add Virulence Factors summary
echo "Virulence Factors:" >> ${CURRENT_DIR}/analysis_results/summary.txt
abricate --summary ${CURRENT_DIR}/analysis_results/virulence/vfdb.tab >> ${CURRENT_DIR}/analysis_results/summary.txt

Comparative Bacterial Genome Assembly

Noah Zeidenberg

2025-03-20

Introduction

Methods

Results and Discussion

Short-read assembly (SPAdes)

Assembly Size and Contiguity

Alignment Metrics

Misassemblies and Partial Alignments

Overall Interpretations and Future Analyses

Long-read and hybrid assemblies (Flye & Unicycler)

Assembly Size and Contiguity

Alignment Metrics

Misassemblies and Partial Alignments

Overall Interpretations and Future Analyses

References

Appendix A: Data retrieval and Processing

Appendix B: Data analysis