long-reads-workshop

Assembly of complete bacterial genomes using long-reads

Welcome to the bacterial long-read genome assembly workshop as part of Bioinfosummer 2021.

Accessing a workshop VM

For this workshop you will access a server through RStudio in your web browser. Your instructor should provide you with a URL, username and password for the server.

Accessing sequencing data

Sequencing data is provided for a marine bacterium that was isolated from coral. Only a subset of the reads from this experiment is provided. Start by creating the directory to work in.

mkdir ~/tutorial

We want to list the sequence files and combine all the reads into a single file. Use cat to concatenate them. The '>>' operator appends the output to the target file, so if you re-run the command, delete the target file first or the reads will be duplicated. Concatenation also works directly on gzipped files.

cd ~/tutorial
ls ~/data/amtp4/*.fastq
cat ~/data/amtp4/*fastq >> ~/tutorial/amtp4_20X.fastq
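To confirm that the combined file looks as expected, it can help to count the reads and bases it contains. The function below is a minimal sketch rather than part of the workshop materials: the name fastq_summary is made up for illustration, and it assumes every FASTQ record spans exactly four lines, which is the usual layout for basecaller output.

```shell
# Hypothetical helper: summarise a FASTQ file as read count and total bases.
# Assumes four lines per record (header, sequence, '+', qualities).
fastq_summary() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END { printf "reads=%d bases=%.0f\n", reads, bases }' "$1"
}
# Example: fastq_summary ~/tutorial/amtp4_20X.fastq
```

If the command is run twice by accident, the read count reported here will be double the expected value, which is an easy way to spot duplicated input.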

Adapter trimming

There is a wide range of library preparation methods for Nanopore sequencing, but all of them use adapters of some type. For whole genome sequencing, many downstream tools (e.g. the Canu genome assembler) automatically detect and trim adapter sequences. Not all tools do this, however, so it may be a good idea to trim adapters before running further analyses. Check the instructions for the tool you are using and, unless otherwise stated, assume that it requires adapter-trimmed sequences.

Porechop by Ryan Wick is a great tool for adapter trimming. Although it is no longer actively maintained, it still works well.

cd ~/tutorial
porechop -t 2 -i amtp4_20X.fastq -v 2 -o amtp4_trim.fastq

Check the size of the two files. The trimmed file is smaller than the original, as expected: roughly 6M of adapter sequence was removed.

ls -lh
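File sizes include headers and quality strings, so a more direct check is to compare the number of sequenced bases before and after trimming. The sketch below assumes four-line FASTQ records; the function names fastq_bases and bases_trimmed are illustrative and not part of Porechop.

```shell
# Hypothetical helpers: count bases in a FASTQ file and report how many were trimmed.
# Assumes four lines per record, so the sequence is every 4th line starting at line 2.
fastq_bases() {
    awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

bases_trimmed() {
    before=$(fastq_bases "$1")
    after=$(fastq_bases "$2")
    echo "before=$before after=$after removed=$((before - after))"
}
# Example: bases_trimmed amtp4_20X.fastq amtp4_trim.fastq
```

Porechop can also split reads that contain an adapter in the middle, so the number of reads may differ between the two files as well as the number of bases.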

Assembly using Redbean

Redbean (aka wtdbg2) by Ruan Jue is a very fast long-read assembler. We will use it to run our first assembly.

mkdir ~/tutorial/redbean
cd ~/tutorial/redbean
wtdbg2 -x ont -g 5m -t 2 -i ~/tutorial/amtp4_trim.fastq -fo amtp4_wtdbg

The first step creates a contig layout. We need to run a second command to generate a final set of consensus sequences.

wtpoa-cns -t 2 -i ~/tutorial/redbean/amtp4_wtdbg.ctg.lay.gz -fo amtp4_wtdbg.ctg.fa

Assembly using Flye

The Flye assembler is usually slightly slower than Redbean but often produces better assemblies.

mkdir ~/tutorial/flye
cd ~/tutorial
conda activate flye27
flye --nano-raw ~/tutorial/amtp4_trim.fastq --out-dir flye --genome-size 5m --threads 2

Contiguity of assemblies

One of the properties we strive for in a good assembly is high contiguity. For bacteria it is often possible to assemble the entire genome into a small number of circular contigs that correspond to the chromosome(s) and any plasmids.
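Flye reports which contigs it considers circular in the assembly_info.txt file in its output directory. The sketch below lists them; it assumes the file is tab-separated with a header row containing a column named "circ." (check the header of your own file, since the layout can vary between Flye versions), and the function name circular_contigs is made up for illustration.

```shell
# Hypothetical helper: print name and length of contigs that Flye marked as circular.
# Assumes a tab-separated file whose header row contains a "circ." column with Y/N values.
circular_contigs() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "circ.") col = i; next }
        col && $col == "Y" { print $1 "\t" $2 }' "$1"
}
# Example: circular_contigs ~/tutorial/flye/assembly_info.txt
```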

Use the assembly-stats program to calculate contiguity statistics for all the assemblies we have created.

cd ~/tutorial/
assembly-stats redbean/amtp4_wtdbg.ctg.fa
assembly-stats flye/assembly.fasta

In this case it looks like Flye produced the more contiguous assembly (higher N50, fewer contigs, etc.), so let’s take this assembly forward.
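N50 is the contig length L such that contigs of length L or longer together contain at least half of the assembled bases. To make the definition concrete, here is a minimal sketch that computes it from a FASTA file using standard tools. It is an illustration only, with made-up function names; assembly-stats remains the tool to use in the workshop.

```shell
# Hypothetical helper: print one length per contig (handles multi-line sequences).
contig_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Hypothetical helper: sort lengths from longest to shortest, then walk down the
# list until the running total reaches half of the assembly size.
n50() {
    contig_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running * 2 >= total) { print lens[i]; exit }
            }
        }'
}
# Example: n50 flye/assembly.fasta
```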

Polishing

Although long reads provide very high contiguity, they generally have low per-base accuracy. To a large extent this can be overcome by obtaining a high read depth and taking the consensus of many reads at each position. Some genome assemblers (e.g. Canu) incorporate such error correction, but even then it is a good idea to use a dedicated polishing tool to correct small errors in the assembly.

First we are going to run racon, a very popular tool for polishing assemblies built from uncorrected long reads. It is typically run for several rounds, with each round taking the previous round’s output as its draft and usually improving the assembly. While we could run these rounds by hand one after another, it’s more efficient to write a simple loop. Let’s have a look at such a loop.

cd ~/tutorial/flye
cat ~/bin/racon.sh

What do you think this bash script is doing? How many rounds of polishing is it doing? Now let’s run the script.

bash ~/bin/racon.sh
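For reference when working outside the workshop server, the following is a sketch of how a multi-round racon loop can be written. It is not the contents of ~/bin/racon.sh. It assumes minimap2 and racon are on the PATH, and the function name racon_rounds and the intermediate .paf file names are made up for illustration. The output naming (flye.0.fasta for the unpolished draft through flye.4.fasta after four rounds) follows the file names used in this tutorial.

```shell
# Hypothetical helper: polish DRAFT with READS for ROUNDS rounds (default 4).
# Writes PREFIX.0.fasta (a copy of the draft) through PREFIX.<ROUNDS>.fasta.
racon_rounds() {
    reads=$1; draft=$2; prefix=$3; rounds=${4:-4}
    cp "$draft" "$prefix.0.fasta" || return 1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        next=$((i + 1))
        # Map the reads back to the current draft; racon needs these alignments.
        minimap2 -x map-ont -t 2 "$prefix.$i.fasta" "$reads" > "$prefix.$i.paf" || return 1
        # Build a new consensus from the reads, the alignments and the current draft.
        racon -t 2 "$reads" "$prefix.$i.paf" "$prefix.$i.fasta" > "$prefix.$next.fasta" || return 1
        i=$next
    done
}
# Example (run from ~/tutorial/flye):
#   racon_rounds ~/tutorial/amtp4_trim.fastq assembly.fasta flye 4
```

The reads are remapped in every round because the draft changes each time, so alignments from an earlier round no longer match the sequence being polished.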

A good final polishing tool for Oxford Nanopore data is medaka. It expects as input an assembly that has been through four rounds of racon polishing, so we will pass in flye.4.fasta. The -m option selects the medaka model, which should match the pore, device and basecaller used to generate the reads.

. ~/medaka/venv/bin/activate
cd ~/tutorial
medaka_consensus -i ~/tutorial/amtp4_trim.fastq -d ~/tutorial/flye/flye.4.fasta -o medaka_consensus -t 2 -m r941_min_fast_g507 -b 20
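Model names of the kind used above encode four fields separated by underscores: pore type, sequencing device, basecaller variant and basecaller version. The small sketch below splits a name into those fields. It assumes this four-field pattern, which fits r941_min_fast_g507 but not every model name, and the function name describe_medaka_model is made up for illustration.

```shell
# Hypothetical helper: split a four-field medaka model name into its parts.
# Assumes the pattern <pore>_<device>_<variant>_<version>.
describe_medaka_model() {
    echo "$1" | awk -F_ '{ printf "pore=%s device=%s variant=%s version=%s\n", $1, $2, $3, $4 }'
}
# Example: describe_medaka_model r941_min_fast_g507
```

Here r941 refers to the R9.4.1 pore, min to the MinION device, fast to the fast basecalling mode and g507 to the Guppy version. In recent medaka releases, running medaka tools list_models should print the models that are available.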

Effect of Polishing

Let’s have a look at the effect of racon and medaka polishing compared to the original Flye assembly.

cd ~/tutorial
#original flye
assembly-stats flye/flye.0.fasta
#racon flye
assembly-stats flye/flye.4.fasta
#medaka flye
assembly-stats medaka_consensus/consensus.fasta

Notice that the overall assembly stats don’t look dramatically different, but remember that many small errors will have been fixed by polishing.
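Because polishing mostly fixes small insertions and deletions, the clearest change in the summary numbers is usually a small shift in total length. The following sketch prints the contig count and total bases for several FASTA files on one line each, which makes that shift easier to see. The function name fasta_totals is made up for illustration.

```shell
# Hypothetical helper: print contig count and total bases for each FASTA file given.
fasta_totals() {
    for f in "$@"; do
        awk -v name="$f" '
            /^>/ { contigs++; next }
            { bases += length($0) }
            END { printf "%s\tcontigs=%d\tbases=%.0f\n", name, contigs, bases }' "$f"
    done
}
# Example:
#   fasta_totals flye/flye.0.fasta flye/flye.4.fasta medaka_consensus/consensus.fasta
```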

Correctness

Another very important property of a good assembly is correctness. This is challenging to measure since we often don’t have a high-quality reference to compare against.

For this particular example we have a reference assembly which we previously created using very deep coverage and best-practice genome polishing tools. By comparing our assemblies to the reference we can assess them on all the important metrics: contiguity, completeness and correctness. For this we will use QUAST.

cd ~/tutorial
quast.py -o quast_flye -l flye_final,flye_original ~/tutorial/medaka_consensus/consensus.fasta ~/tutorial/flye/flye.0.fasta -t 2 --circos --glimmer -r ~/data/amtp4/reference/vibrio/vibrio.fna --features  ~/data/amtp4/reference/vibrio/vibrio.gff --nanopore ~/tutorial/amtp4_trim.fastq

Normally we would look at the web page that QUAST generates, but here let’s have a look at report.txt instead.

cat ~/tutorial/quast_flye/report.txt

There is a lot of information here, but notice that the original assembly has one misassembled contig, more indels and fewer large genes than the polished assembly.
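To focus on the correctness metrics, the relevant rows can be pulled out of the report with grep. This is a sketch that assumes the row labels QUAST normally writes to report.txt (for example "# misassemblies" and "# indels per 100 kbp"); adjust the pattern if your report uses different labels. The function name quast_key_rows is made up for illustration.

```shell
# Hypothetical helper: show only the header row and correctness-related rows of a QUAST report.
quast_key_rows() {
    grep -E '^(Assembly|# misassemblies|# misassembled contigs|Genome fraction|# mismatches per 100 kbp|# indels per 100 kbp|# genomic features)' "$1"
}
# Example: quast_key_rows ~/tutorial/quast_flye/report.txt
```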

Additional Steps

There is a lot more you can do with an assembly, including assembly visualisation, annotation and manual curation.

Some additional tools to consider include Bandage, Prokka and Apollo (now offered by BioCommons).

Archive