long-reads-workshop

In this session we will explore some initial data that has been produced from our MinION run. We will evaluate the data in terms of its quantity and quality as well as perform adapter trimming.

Listing output files

As data is produced it will be added to the directory /pvol/data/minion.

Activity : List fastq files in the output directory

Use the ls command to list all files in the minion output directory as follows

ls /pvol/data/minion/

Just in case our run doesn’t work on the day we have also provided some files from a previous run under /pvol/data/amtp4. (This amtp4 data will also be used for exploring genome assembly on day 2).

Try listing the fastq files in the amtp4 directory

ls /pvol/data/amtp4

Activity : Find all files ending in .fastq.

Notice that there are multiple files and they all in end in the suffix .fastq. If we wanted to find all the .fastq files across both directories we could use the unix find command as follows;

find /pvol/data/ -name '*.fastq'

Activity : Count numbers of files

The wc command stands for Word Count. It actually counts numbers of characters, words and lines in text. We can use it to count the number of lines output by ls or find and since there is one line per file it will effectively count the number of files.

find /pvol/data/ -name '*.fastq' | wc 

or to just get the number of lines

find /pvol/data/ -name '*.fastq' | wc -l

Notice that in these two examples we are actually combining two commands, find and wc into a single line. This is done using the | (pipe) operator. This operator has the effect of taking the output from the first command (ie find) and sending it as input to the second command (ie wc). Pipes underpin the design philosophy of unix because they allow many small and specialised tools to be combined together to do more complex tasks.

Exploring data in fastq format

Activity : Inspect a fastq file

Use the head command to view the first 10 lines of a fastq file. For example;

head /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq

Quite a bit of output is produced. Let’s unpick it.

The fastq format is designed for storing sequencing data along with quality scores. It is a standard format used for many types of sequencing including Illumina short reads and longer reads produced by Nanopore or PacBio sequencing. See this wikipedia page for a more detailed description of the fastq format.

Each read occupies 4 lines in a fastq file. The lines are as follows;

Starts with @ and contains a unique identifier for the sequence
Is the sequence itself
Starts with + and almost always nothing else
Contains quality scores. These are encoded using ASCII characters with one character per base.

The full list of qualty score letters from 0-94 is

!\"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_\`abcdefghijklmnopqrstuvwxyz{|}~

Read Quality Control Checks

Activity : Start a new Markdown Document 02_readqc

Create a new RMarkdown document called “Read QC” and name the file 02_readqc. Use this file to take notes on the following exercises.

The first step with any sequencing analysis is to check the data. Some key questions are;

How much data is there? How many reads?
What is the distribution of read lengths?
What is the distribution of quality scores?
Does the data contain adapter sequences that need to be removed?

For short read data all these questions can be answered using the FastQC program but for long reads there is no equivalent piece of software.

We will use a few pieces of software and some custom scripts to answer these questions for Nanopore data

How much data do we have?

Activity : Count reads with grep

grep is a unix program that performs very fast searches for patterns within text files. One handy use of grep is to count the number of entries within a fastq or fastq file. From the fastq definition we now that each entry has a line that starts with the @ character. We could use grep to find all the lines that satisfy this condition like this

grep '^@' /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq

Note the use of ^ in front of the @. This tells grep to only find instances of @ at the start of a line. If we didn’t have the ^ it would also return the quality score lines since @ is a quality score character as well.

Let’s use wc again. This time to count the number of lines returned by grep

grep '^@' /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | wc -l

Notice how this gives a very nice round number. This is because the Nanopore basecalling software splits data up into 4000 read chunks. We can therefore obtain a fairly quick estimate of the number of reads by multiplying the number of fastq files by 4000.

How many reads to we have so far?

Read lengths with `bioawk`

bioawk is a powerful piece of software that will enable us to inspect fastq data.

In order to harness the full power of bioawk you need to know the awk language. We don’t have time to go into that properly in this workshop but I will try to explain the commands we use as we go.

This command shows a basic usage of bioawk to inspect fastq data

head /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print $name}'

Stop for a moment and think about what this command is doing? First note that it contains a pipe. What is the command on the left of the pipe? Try running this on its own. Look at the header lines for each sequence (lines starting with @).

Now run the full command again. Notice that it prints the header line (sequence ID) for each entry.

Next we modify the command to be more useful. Instead of printing the name of each entry we pring its sequence with;

head /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print $seq}'

We can augment this command slightly to print the lengths of each sequence as follows

head /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print length($seq)}'

So far we have used the head command to work with just the first 10 lines of a single file. This is useful when troubleshooting but we are now ready to print out all the read lengths. To do this change the head command to cat. The cat command will send the entire contents of the file to our bioawk command.

cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print length($seq)}'

Activity : Calculate total sequencing volume

To work out the total number of sequenced bases we just need to calculate the sum of all the numbers output from the last command. This can be done by modifying the bioawk script as follows;

cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{sum+= length($seq)} END{print sum}'

Read length distribution

Activity : Make a histogram of read lengths

Start by modifying the command used in the last exercise. We will use the redirection operator, > to ensure that all this output gets saved to a file

cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print length($seq)}' > read_lengths.tsv

Now we will switch to R to create the required plot. Don’t worry if you don’t understand exactly how this R code works. You will learn more about it on Wednesday and Thursday.

Copy the following code chunk into your RMarkdown document. This chunk loads a package that includes tools for reading files and making plots

library(tidyverse)

When you first try to run this chunk it will most likely result in an error because the package is not yet installed. RStudio will prompt you to install the package. Just click install and let it do its thing. Package installation will take a few minutes.

Now use the next code chunk to read the data in lengths.tsv and use it to make a histogram

lengths <- read_tsv("read_lengths.tsv",col_names="length")

ggplot(lengths,aes(x=length)) + geom_histogram()

This histogram is made from just the first 4000 reads from the example data. Think about how you might generate data and make a histogram for all the reads in the run we started on Monday.

To figure out how to do this first note that the wildcard operator * can be used to list multiple files. So we can list all the fastq files in our output directory like this;

ls /pvol/data/minion/*.fastq

If we can list files we can also use cat to dump the output of files and send this to bioawk

cat /pvol/data/minion/*.fastq | bioawk -c fastx '{print length($seq)}' > read_lengths_minion.tsv

Now remake the histogram of read lengths but use the data in read_lengths_minion.tsv. This should represent the output of our current run.

Activity : Estimate genome coverage

Use the following command to calculate the sequencing volume

(For example data)

cat /pvol/data/amtp4/*.fastq | bioawk -c fastx '{sum+= length($seq)} END{print sum}'

(For our run data)

cat /pvol/data/minion/*.fastq | bioawk -c fastx '{sum+= length($seq)} END{print sum}'

Assuming a genome size of roughly 4Mb calculate the depth of coverage in either case.

Quality score distribution

I have provide an awk script to summarise quality scores from a fastq file.

BEGIN{
  split("!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~",ordered_scores,"")

  for(i=1;i<95;i++){
    sc=ordered_scores[i]
    scores[sc]=0
  }

}

NR%4==0{
  split($0,read_scores,"")
  for(i in read_scores){
    sc=read_scores[i]
    scores[sc]=scores[sc]+1
  }
}

END{
  printf("score\tcharacter\tcount\n")
  for(i=1;i<95;i++){
    sc=ordered_scores[i]
    printf("%s\t%s\t%s\n",i-1,sc,scores[sc])
  }
}

You don’t need to understand this script to use it. Use it as follows

Create a new empty text file in RStudio.
Cut and paste the entire contents of the script into your file
Name the file qualdist.awk

How you can calculate a score distribution from fastq like this

cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | awk -f qualdist.awk

Activity : Plot the distribution of quality scores

Run the qualdist script and send its output to a file called scores.tsv

cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | awk -f qualdist.awk > scores.tsv

Use the following R code to save the read scores and plot them

scores <- read_tsv("scores.tsv",quote = "")

ggplot(scores,aes(x=score,y=count)) + geom_line()

Activity : Compare quality scores from Nanopore and Illumina data

For comparison lets have a look at quality scores from Illumina data. Data from a MiSeq run on the same bacterial isolate is available at /pvol/data/amtp4/MiSeq/

To dump out the scores use

zcat /pvol/data/amtp4/MiSeq/AMT-P-4_S2_L001_R1_001.fastq.gz | head -n 20000 | awk -f qualdist.awk > miseq_scores.tsv

Then to make a plot use

miseq_scores <- read_tsv("miseq_scores.tsv",quote = "") %>% add_column(platform="MiSeq")
ont_scores <- read_tsv("scores.tsv",quote = "") %>% add_column(platform="MinION")

all_scores <- rbind(miseq_scores,ont_scores)

ggplot(all_scores,aes(x=score,y=count)) + geom_line(aes(color=platform))

Detect and Trim Adapters

There are a wide range of library preparation methods for Nanopore sequencing but all of them use adapters of some type. For whole genome sequencing many downstream tools (eg the Canu genome assembler) will automatically detect and trim adapter sequences. Not all tools do this however so it may be a good idea to trim adapters prior to running further analyses. You should generally check the instructions for the tool you are using and unless otherwise stated you should assume that it requires adapter trimmed sequences.

The software tool porechop by Ryan Wick is a great tool for adapter trimming. Be aware that this tool is no longer maintained so you should check in future to see if other tools have emerged to replace it. Oxford Nanopore distribute another tool called qcat which is similar to porechop but seems more oriented toward demultiplexing (barcoded adapters) than removing simple sequencing adapters.

Activity: Create a qc conda environment and install porechop

conda create -n qc porechop

Switch to the qc environment

conda activate qc

Test that it worked by running the program without arguments

porechop

Activity: Run porechop to check adapter content

Run porechop on a single fastq file and use verbose mode so that it shows adapters in color

porechop -i /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq -v 2 -o 0_trim.fastq

Things to notice about the output

Most reads contain “Rapid_adapter” sequences at one end which is what we would expect since we used the rapid sequencing kit
A small number of reads have adapters in the middle. These are chimeric sequences