For this exercise will use data from a previous MinION run. The same process can be used on reads from the run we started in the workshop. If there is time it would be a useful exercise to repeat all the steps on that data.
The example data was generated from a culture dominated by a bacterium in the genus Vibrio. The dominant organism has two large chromosomes with sizes around 3.4Mb and 1.8Mb respectively. We also expect some smaller plasmids. At very high coverage it was also revealed that this data contained another bacterium (a Gammaproteobacterium) which was around 100 times less abundant than the Vibrio.
Be sure to start a new RMarkdown notebook to take notes on this section. I recommend saving it using the filename 03_assembly.Rmd
We will start by using one of the fastest long read genome assemblers, which is now called redbean, likely because the original name WTDBG2 was just plain terrible. This software is available from github at https://github.com/ruanjue/wtdbg2. Lucky for you, redbean has been preinstalled for you but if you would like to try install it yourself on your own machine at another time, we recommend a manual installation from the github repo because the conda version is out of date.
First prepare a directory for running the software and change to it
mkdir ~/nanopore_workshop/redbean
cd redbean
Collect a small subset of data for the initial assembly. We will use just the first 3 files which equates to 12000 reads and around 6x coverage of the target genome. This will almost certainly not be enough to generate a good assembly. As a learning exercise it will be useful to compare the results with this small dataset to results we will obtain later using more reads.
cat /pvol/data/amtp4/*_[012].fastq > amtp4.fastq
Run the assembly
wtdbg2 -x ont -g 5m -t 4 -i amtp4.fastq -fo amtp4
wtpoa-cns -t 4 -i amtp4.ctg.lay.gz -fo amtp4.ctg.fa
Take a rough note of how long this assembly takes.
Activity: Inspect assembly statistics
assembly-stats amtp4.ctg.fa
How good does the assembly look in terms of contiguity and completeness?
Now let’s redo the assembly using a better coverage depth. We will use data equivalent to around 20x coverage.
cat /pvol/data/amtp4/*.fastq > amtp4_20x.fastq
wtdbg2 -x ont -g 5m -t 4 -i amtp4_20x.fastq -fo amtp4_20x
wtpoa-cns -t 4 -i amtp4_20x.ctg.lay.gz -fo amtp4_20x.ctg.fa
Using this larger dataset redbean takes around 13 minutes
Activity: Inspect assembly stats for higher coverage
Reassess the assembly using assembly-stats.
Now we will try an alternative assembler called Flye. This assembler has been around a bit longer than redbean and is a more mature software project. As far as long read assemblers go it is fairly quick but not as fast as redbean. Again, lucky you, it has already been installed on your system but if you want to try it out on your own machine later, you can go with the conda version for flye.
Now create a new directory for running Flye
mkdir ~/nanopore_workshop/flye
cd ~/nanopore_workshop/flye
Make symbolic links to the same raw data we used for redbean
ln -s ../redbean/amtp4.fastq .
ln -s ../redbean/amtp4_20x.fastq .
Start by running Flye on the smaller dataset. Even this will take quite a while (about 15 minutes). We would like to keep working while flye runs in the background. To do this we will using some special shell magic. Please enter commands exactly as written below (cut-paste is best).
Take care with the next command. ONLY enter it once! If you enter it multiple times it will create multiple (potentially conflicting) runs of flye.
flye --nano-raw amtp4.fastq --out-dir amtp4 --genome-size 5m --threads 4 >> flye.log 2>&1 &
What is happening here? We are using some bash magic at the end of the command to do the following
>> flye.log2>&1& at the end.All of the output from this command is being sent to flye.log so we can look at this to see what is happening.
tail flye.log
or to follow (-f) the output as it is created use
tail -f flye.log
Activity: Monitor running jobs
You can use the ps command to see the processes that are running. This should show your instance of flye.
Also try top which shows all the most CPU intensive tasks being run by the entire system. Here you should be able to see everybody’s flye processes.
Activity: Assess assemblies made using Flye
Once flye has finished, the final assembly will be available inside the output directory and will be called assembly.fasta. Use the assembly-stats program to assess the contiguity and completeness of this assembly.
Now run flye on the larger dataset. This will take a very long time. We will come back to it later.
flye --nano-raw amtp4_20x.fastq --out-dir amtp4_20x --genome-size 5m --threads 4 >> flye_20x.log 2>&1 &
After the large assembly has finished assess it with assembly-stats and look at its GFA file with Bandage (see below).
One of the nice features of Flye is that is produces output in the GFA format. We can use a nice graphical tool called Bandage to inspect GFA files.
Activity: Inspect low coverage flye assembly with Bandage
First download and install bandage. It is available for all major operating systems.
Look inside the output directory for the first flye run. Find the file called assembly-graph.gfa. Use RStudio to download this file to your laptop.
Now open bandage and use the File->Load Graph menu to load the file assembly-graph.gfa. Then click Draw Graph.
In this visualisation contiguous segments are shown as solid coloured lines. Some of these are separated into independent graphs whereas others (eg the mess in the top left) are connected to each other via black lines. There are few things to keep in mind here;
Try zooming in on the highly connected region. Click and move contigs around to get a clearer look. Clicking on a contig shows the depth of read coverage assigned to it. Based on read coverage and the shape of the graph can you spot a segment that might correspond to a repeat?
Activity: Inspect high coverage flye assembly with Bandage
Repeat the exercise above using the high coverage (20x) assembly generated with Flye. Can you spot the same repeat region? How has increased coverage changed the appearance of the assembly.
Which contigs look like fully assembled chromosomes? Which ones look like plasmids?
Up to this point we have performed four separate assemblies with 2 different assemblers and 2 different levels of coverage. Although we could clearly see a huge improvement in contiguity with the addition of more reads we could not easily check for the accuracy of assemblies.
For this particular example we have a reference assembly which we previously created using very deep coverage and best-practice genome polishing tools. By comparing our assemblies to the reference we can assess them in terms of all the important metrics, contiguity, completeness and correctness.
mkdir ~/nanopore_workshop/quast
cd ~/nanopore_workshop/quast
mkdir flye redbean
Our reference assembly only contains contigs for the two major Vibrio chromosomes. For our two deeper coverage assemblies this will correspond to the two largest contigs. We extract these (and ignore all others) in order to make the assembly comparison easier to interpret.
Use this command to look at the contig sizes in the flye assembly
cat ../flye/amtp4_20x/assembly.fasta | bioawk -c fastx '{print $name, length($seq)}' | sort -nr -k2
For me this was contig_11 and contig_56 but these numbers may differ from assembly to assembly. Use samtools to extract these
samtools faidx ../flye/amtp4_20x/assembly.fasta contig_3 contig_16 > flye/vibrio.fasta
Now repeat the process for the redbean assembly
cat ../redbean/amtp4_20x.ctg.fa | bioawk -c fastx '{print $name, length($seq)}' | sort -nr -k2
samtools faidx ../redbean/amtp4_20x.ctg.fa ctg1 ctg2 > redbean/vibrio.fasta
Now create a symbolic link to the original reads
ln ../flye/amtp4_20x.fastq .
We also need the reference assembly so make a symbolic link to that as well
ln -s /pvol/data/amtp4/reference/vibrio .
Run quast as follows
conda activate circos
quast.py -o quast_vibrio redbean/vibrio.fasta flye/vibrio.fasta -L -t 4 --circos --glimmer -r vibrio/vibrio.fna --features vibrio/vibrio.gff --nanopore amtp4_20x.fastq
Activity: Inspect quast outputs
Quast outputs should be saved to a folder called quast_vibrio. Download this folder to your laptop and unzip it. Some things to look at include
report.htmlcircos/circos.pngBoth assemblers seem to have done a very good job of assembling the genome but both have produced regions where there are a large number of small differences (snps and indels) between the assembly and the reference. The redbean assembly also seems to have a couple of gaps which most likely correspond with the highly repetitive region we observed in the Bandage visualisation.
Notice that we didn’t actually do any adapter trimming prior to running flye or redbean. In a real project we would probably run porechop on all the reads to trim adapters. This would most likely improve the assembly and reduce the possibility of incorporating adapter sequences into our final product.
As an exercise use porechop to check for adapters in the various assembly files you created.
Although long reads provide very high contiguity they generally have low accuracy. To a large extent this low accuracy can be overcome obtaining a high read depth and taking the consensus of many reads at each position. Some genome assemblers incorporate such error correction (eg Canu) but even in such cases it is always a good idea to use a dedicated assembly polishing tool to correct small errors in the assembly.
A good polishing tool for Oxford Nanopore data is medaka.
Create a directory for polishing and switch to it
mkdir ~/nanopore_workshop/medaka
cd medaka
conda create -n polish medaka racon minimap2
Link in the flye assembly (our best so far) and the raw reads
ln -s ../quast/flye/vibrio.fasta flye.fasta
ln -s ../flye/amtp4_20x.fastq .
Before running medaka we first need to run another polishing software called racon. The medaka manual specifically states
Medaka has been trained to correct draft sequences processed through racon, specifically racon run four times iteratively.
Here’s how to run a single racon iteration.
First map the reads to the current version of the reference
minimap2 -x ava-ont -t 40 \
flye.fasta \
amtp4_20x.fastq > \
reads.minimap.fastq.1.paf
Then run racon to generate a new polished assembly. This is round 1.
racon -m 8 -x -6 -g -8 -w 500 \
-t 4 \
amtp4_20x.fastq \
reads.minimap.fastq.1.paf \
flye.fasta > \
flye.1.fasta
It could get very laborious and error prone to repeat these steps four times. Updating filenames each time and making sure everything uses the correct file is quite tricky.
Instead of doing this we will create a short script to automate the process. Open a new text file and call it racon.sh
Enter the following code into your new file exactly as shown below (cut and paste it)
reads=amtp4_20x.fastq
ln -s flye.fasta flye.0.fasta
for prev_round in $(seq 0 3);do
round=$((prev_round+1))
minimap2 -x ava-ont -t 4 \
flye.${prev_round}.fasta $reads > reads.minimap.fastq.${round}.paf
racon -m 8 -x -6 -g -8 -w 500 -t 4 \
$reads \
reads.minimap.fastq.${round}.paf \
flye.${prev_round}.fasta > flye.${round}.fasta
done
This code is using a bash for loop to automate the process of running racon four times. It should result in a series of successively more polished assemblies ending with flye.4.fasta
Run the script from within the medaka directory like this
bash racon.sh
Now we are ready to run medaka which we do as follows;
medaka_consensus -i amtp4_20x.fastq -d flye.3.fasta -o medaka_consensus -t 4 -m r941_min_fast
Now switch back to the quast directory and redo the analysis. This time incorporating the medaka consensus fasta file.
cd ~/nanopore_workshop/quast
mkdir medaka
cp ../medaka/medaka_consensus/consensus.fasta medaka/vibrio.fasta
quast -o quast_vibrio redbean/vibrio.fasta flye/vibrio.fasta medaka/vibrio.fasta -L -t 32 --circos --glimmer -r vibrio/vibrio.fna --features vibrio/vibrio.gff --nanopore amtp4_20x.fastq
Activity: Use quast to compare the polished to unpolished assemblies
After polishing our final assembly should be available as ../medaka/medaka_consensus/consensus.fasta. Ideally we would probably want some more accurate Illumina data for final polishing, or PacBio data which has less biased errors than Nanopore.
Once you are happy with your assembly you should give it an appropriate filename that incorporates some version information. The important thing at this point is that you keep things organised and that you have a clear trace from the final assembly back through the steps used to produce it.
mkdir ~/nanopore_workshop/amtp4_1.0
cd ~/nanopore_workshop/amtp4_1.0
cp ../medaka/medaka_consensus/consensus.fasta amtp4_vibrio_1.0.fasta
We will work from this file going forward for annotation and genomic comparisons