long-reads-workshop

For this exercise will use data from a previous MinION run. The same process can be used on reads from the run we started in the workshop. If there is time it would be a useful exercise to repeat all the steps on that data.

The example data was generated from a culture dominated by a bacterium in the genus Vibrio. The dominant organism has two large chromosomes with sizes around 3.4Mb and 1.8Mb respectively. We also expect some smaller plasmids. At very high coverage it was also revealed that this data contained another bacterium (a Gammaproteobacterium) which was around 100 times less abundant than the Vibrio.

Be sure to start a new RMarkdown notebook to take notes on this section. I recommend saving it using the filename 03_assembly.Rmd

Redbean Assembly

Assemble a genome with redbean

We will start by using one of the fastest long read genome assemblers, which is now called redbean, likely because the original name WTDBG2 was just plain terrible. This software is available from github at https://github.com/ruanjue/wtdbg2. Lucky for you, redbean has been preinstalled for you but if you would like to try install it yourself on your own machine at another time, we recommend a manual installation from the github repo because the conda version is out of date.

First prepare a directory for running the software and change to it

mkdir ~/nanopore_workshop/redbean
cd redbean

Collect a small subset of data for the initial assembly. We will use just the first 3 files which equates to 12000 reads and around 6x coverage of the target genome. This will almost certainly not be enough to generate a good assembly. As a learning exercise it will be useful to compare the results with this small dataset to results we will obtain later using more reads.

cat /pvol/data/amtp4/*_[012].fastq > amtp4.fastq

Run the assembly

wtdbg2 -x ont -g 5m -t 4 -i amtp4.fastq -fo amtp4
wtpoa-cns -t 4 -i amtp4.ctg.lay.gz -fo amtp4.ctg.fa

Take a rough note of how long this assembly takes.

Activity: Inspect assembly statistics

assembly-stats amtp4.ctg.fa

How good does the assembly look in terms of contiguity and completeness?

Run redbean with higher coverage

Now let’s redo the assembly using a better coverage depth. We will use data equivalent to around 20x coverage.

cat /pvol/data/amtp4/*.fastq > amtp4_20x.fastq
wtdbg2 -x ont -g 5m -t 4 -i amtp4_20x.fastq -fo amtp4_20x
wtpoa-cns -t 4 -i amtp4_20x.ctg.lay.gz -fo amtp4_20x.ctg.fa

Using this larger dataset redbean takes around 13 minutes

Activity: Inspect assembly stats for higher coverage

Reassess the assembly using assembly-stats.

Flye Assembly

Now we will try an alternative assembler called Flye. This assembler has been around a bit longer than redbean and is a more mature software project. As far as long read assemblers go it is fairly quick but not as fast as redbean. Again, lucky you, it has already been installed on your system but if you want to try it out on your own machine later, you can go with the conda version for flye.

Run Flye on both datasets

Now create a new directory for running Flye

mkdir ~/nanopore_workshop/flye
cd ~/nanopore_workshop/flye

Make symbolic links to the same raw data we used for redbean

ln -s ../redbean/amtp4.fastq .
ln -s ../redbean/amtp4_20x.fastq .

Start by running Flye on the smaller dataset. Even this will take quite a while (about 15 minutes). We would like to keep working while flye runs in the background. To do this we will using some special shell magic. Please enter commands exactly as written below (cut-paste is best).

Take care with the next command. ONLY enter it once! If you enter it multiple times it will create multiple (potentially conflicting) runs of flye.

flye --nano-raw amtp4.fastq --out-dir amtp4 --genome-size 5m --threads 4 >> flye.log 2>&1 &

What is happening here? We are using some bash magic at the end of the command to do the following

Redirect standard output to a file with >> flye.log
Redirect standard error to standard output with 2>&1
Run in the background with & at the end.

All of the output from this command is being sent to flye.log so we can look at this to see what is happening.

tail flye.log

or to follow (-f) the output as it is created use

tail -f flye.log

Activity: Monitor running jobs

You can use the ps command to see the processes that are running. This should show your instance of flye.

Also try top which shows all the most CPU intensive tasks being run by the entire system. Here you should be able to see everybody’s flye processes.

Activity: Assess assemblies made using Flye

Once flye has finished, the final assembly will be available inside the output directory and will be called assembly.fasta. Use the assembly-stats program to assess the contiguity and completeness of this assembly.

Now run flye on the larger dataset. This will take a very long time. We will come back to it later.

flye --nano-raw amtp4_20x.fastq --out-dir amtp4_20x --genome-size 5m --threads 4 >> flye_20x.log 2>&1 &

After the large assembly has finished assess it with assembly-stats and look at its GFA file with Bandage (see below).

Inspect assembly graphs with Bandage

One of the nice features of Flye is that is produces output in the GFA format. We can use a nice graphical tool called Bandage to inspect GFA files.

Activity: Inspect low coverage flye assembly with Bandage

First download and install bandage. It is available for all major operating systems.

Look inside the output directory for the first flye run. Find the file called assembly-graph.gfa. Use RStudio to download this file to your laptop.

Now open bandage and use the File->Load Graph menu to load the file assembly-graph.gfa. Then click Draw Graph.

In this visualisation contiguous segments are shown as solid coloured lines. Some of these are separated into independent graphs whereas others (eg the mess in the top left) are connected to each other via black lines. There are few things to keep in mind here;

Biologically distinct entities such as chromosomes or plasmids will form separate graphs
Separated graphs can also occur if coverage is very low and no reads exist to join parts of the assembly together
Highly joined regions (eg top left) tend to occur because repeats create ambiguities. There are too many connections and the assembler has trouble choosing the correct path between them.

Try zooming in on the highly connected region. Click and move contigs around to get a clearer look. Clicking on a contig shows the depth of read coverage assigned to it. Based on read coverage and the shape of the graph can you spot a segment that might correspond to a repeat?

Activity: Inspect high coverage flye assembly with Bandage

Repeat the exercise above using the high coverage (20x) assembly generated with Flye. Can you spot the same repeat region? How has increased coverage changed the appearance of the assembly.

Which contigs look like fully assembled chromosomes? Which ones look like plasmids?

Check Assembly Quality with QUAST

Up to this point we have performed four separate assemblies with 2 different assemblers and 2 different levels of coverage. Although we could clearly see a huge improvement in contiguity with the addition of more reads we could not easily check for the accuracy of assemblies.

For this particular example we have a reference assembly which we previously created using very deep coverage and best-practice genome polishing tools. By comparing our assemblies to the reference we can assess them in terms of all the important metrics, contiguity, completeness and correctness.

Prepare assembly data

mkdir ~/nanopore_workshop/quast
cd ~/nanopore_workshop/quast
mkdir flye redbean

Our reference assembly only contains contigs for the two major Vibrio chromosomes. For our two deeper coverage assemblies this will correspond to the two largest contigs. We extract these (and ignore all others) in order to make the assembly comparison easier to interpret.

Use this command to look at the contig sizes in the flye assembly

cat ../flye/amtp4_20x/assembly.fasta | bioawk -c fastx '{print $name, length($seq)}' | sort -nr -k2

For me this was contig_11 and contig_56 but these numbers may differ from assembly to assembly. Use samtools to extract these

samtools faidx ../flye/amtp4_20x/assembly.fasta contig_3 contig_16 > flye/vibrio.fasta

Now repeat the process for the redbean assembly

cat ../redbean/amtp4_20x.ctg.fa | bioawk -c fastx '{print $name, length($seq)}' | sort -nr -k2
samtools faidx ../redbean/amtp4_20x.ctg.fa ctg1 ctg2 > redbean/vibrio.fasta

Now create a symbolic link to the original reads

ln ../flye/amtp4_20x.fastq .

We also need the reference assembly so make a symbolic link to that as well

ln -s /pvol/data/amtp4/reference/vibrio .

Run QUAST

Run quast as follows

conda activate circos
quast.py -o quast_vibrio redbean/vibrio.fasta flye/vibrio.fasta -L -t 4 --circos --glimmer -r vibrio/vibrio.fna --features  vibrio/vibrio.gff --nanopore amtp4_20x.fastq

Activity: Inspect quast outputs

Quast outputs should be saved to a folder called quast_vibrio. Download this folder to your laptop and unzip it. Some things to look at include

The main report report.html
A circos plot in circos/circos.png

Both assemblers seem to have done a very good job of assembling the genome but both have produced regions where there are a large number of small differences (snps and indels) between the assembly and the reference. The redbean assembly also seems to have a couple of gaps which most likely correspond with the highly repetitive region we observed in the Bandage visualisation.

Optional: Check final assembly for adapters

Notice that we didn’t actually do any adapter trimming prior to running flye or redbean. In a real project we would probably run porechop on all the reads to trim adapters. This would most likely improve the assembly and reduce the possibility of incorporating adapter sequences into our final product.

As an exercise use porechop to check for adapters in the various assembly files you created.

Good to do if we have time: Polishing the assembly

Although long reads provide very high contiguity they generally have low accuracy. To a large extent this low accuracy can be overcome obtaining a high read depth and taking the consensus of many reads at each position. Some genome assemblers incorporate such error correction (eg Canu) but even in such cases it is always a good idea to use a dedicated assembly polishing tool to correct small errors in the assembly.

A good polishing tool for Oxford Nanopore data is medaka.

Install polishing tools

Create a directory for polishing and switch to it

mkdir ~/nanopore_workshop/medaka
cd medaka

conda create -n polish medaka racon minimap2

Link in the flye assembly (our best so far) and the raw reads

ln -s ../quast/flye/vibrio.fasta flye.fasta
ln -s ../flye/amtp4_20x.fastq .

Before running medaka we first need to run another polishing software called racon. The medaka manual specifically states

Medaka has been trained to correct draft sequences processed through racon, specifically racon run four times iteratively.

Here’s how to run a single racon iteration.

First map the reads to the current version of the reference

minimap2 -x ava-ont -t 40 \
	flye.fasta \
	amtp4_20x.fastq > \
	reads.minimap.fastq.1.paf

Then run racon to generate a new polished assembly. This is round 1.

racon -m 8 -x -6 -g -8 -w 500 \
	-t 4 \
	amtp4_20x.fastq \
	reads.minimap.fastq.1.paf \
	flye.fasta > \
	flye.1.fasta

It could get very laborious and error prone to repeat these steps four times. Updating filenames each time and making sure everything uses the correct file is quite tricky.

Instead of doing this we will create a short script to automate the process. Open a new text file and call it racon.sh

Enter the following code into your new file exactly as shown below (cut and paste it)

reads=amtp4_20x.fastq

ln -s flye.fasta flye.0.fasta

for prev_round in $(seq 0 3);do
  round=$((prev_round+1))

  minimap2 -x ava-ont -t 4 \
	flye.${prev_round}.fasta $reads > reads.minimap.fastq.${round}.paf

  racon -m 8 -x -6 -g -8 -w 500 -t 4 \
  $reads \
	reads.minimap.fastq.${round}.paf \
	flye.${prev_round}.fasta > flye.${round}.fasta
done

This code is using a bash for loop to automate the process of running racon four times. It should result in a series of successively more polished assemblies ending with flye.4.fasta

Run the script from within the medaka directory like this

bash racon.sh

Now we are ready to run medaka which we do as follows;

medaka_consensus -i amtp4_20x.fastq -d flye.3.fasta -o medaka_consensus -t 4 -m r941_min_fast

Rerun QUAST on the polished assembly

Now switch back to the quast directory and redo the analysis. This time incorporating the medaka consensus fasta file.

cd ~/nanopore_workshop/quast

mkdir medaka
cp ../medaka/medaka_consensus/consensus.fasta medaka/vibrio.fasta

quast -o quast_vibrio redbean/vibrio.fasta flye/vibrio.fasta medaka/vibrio.fasta -L -t 32 --circos --glimmer -r vibrio/vibrio.fna --features  vibrio/vibrio.gff --nanopore amtp4_20x.fastq

Activity: Use quast to compare the polished to unpolished assemblies

What next?

After polishing our final assembly should be available as ../medaka/medaka_consensus/consensus.fasta. Ideally we would probably want some more accurate Illumina data for final polishing, or PacBio data which has less biased errors than Nanopore.

Once you are happy with your assembly you should give it an appropriate filename that incorporates some version information. The important thing at this point is that you keep things organised and that you have a clear trace from the final assembly back through the steps used to produce it.

mkdir ~/nanopore_workshop/amtp4_1.0
cd ~/nanopore_workshop/amtp4_1.0
cp ../medaka/medaka_consensus/consensus.fasta amtp4_vibrio_1.0.fasta

We will work from this file going forward for annotation and genomic comparisons