In this session we will explore some initial data that has been produced from our MinION run. We will evaluate the data in terms of its quantity and quality as well as perform adapter trimming.
As data is produced it will be added to the directory /pvol/data/minion.
Activity : List fastq files in the output directory
Use the ls command to list all files in the minion output directory as follows
ls /pvol/data/minion/
Just in case our run doesn’t work on the day we have also provided some files from a previous run under /pvol/data/amtp4. (This amtp4 data will also be used for exploring genome assembly on day 2).
Try listing the fastq files in the amtp4 directory
ls /pvol/data/amtp4
Activity : Find all files ending in
.fastq.
Notice that there are multiple files and they all in end in the suffix .fastq. If we wanted to find all the .fastq files across both directories we could use the unix find command as follows;
find /pvol/data/ -name '*.fastq'
Activity : Count numbers of files
The wc command stands for Word Count. It actually counts numbers of characters, words and lines in text. We can use it to count the number of lines output by ls or find and since there is one line per file it will effectively count the number of files.
find /pvol/data/ -name '*.fastq' | wc
or to just get the number of lines
find /pvol/data/ -name '*.fastq' | wc -l
Notice that in these two examples we are actually combining two commands, find and wc into a single line. This is done using the | (pipe) operator. This operator has the effect of taking the output from the first command (ie find) and sending it as input to the second command (ie wc). Pipes underpin the design philosophy of unix because they allow many small and specialised tools to be combined together to do more complex tasks.
Activity : Inspect a fastq file
Use the head command to view the first 10 lines of a fastq file. For example;
head /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq
Quite a bit of output is produced. Let’s unpick it.
The fastq format is designed for storing sequencing data along with quality scores. It is a standard format used for many types of sequencing including Illumina short reads and longer reads produced by Nanopore or PacBio sequencing. See this wikipedia page for a more detailed description of the fastq format.
Each read occupies 4 lines in a fastq file. The lines are as follows;
@ and contains a unique identifier for the sequence+ and almost always nothing elseThe full list of qualty score letters from 0-94 is
!\"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_\`abcdefghijklmnopqrstuvwxyz{|}~
Activity : Start a new Markdown Document
02_readqc
Create a new RMarkdown document called “Read QC” and name the file 02_readqc. Use this file to take notes on the following exercises.
The first step with any sequencing analysis is to check the data. Some key questions are;
For short read data all these questions can be answered using the FastQC program but for long reads there is no equivalent piece of software.
We will use a few pieces of software and some custom scripts to answer these questions for Nanopore data
Activity : Count reads with grep
grep is a unix program that performs very fast searches for patterns within text files. One handy use of grep is to count the number of entries within a fastq or fastq file. From the fastq definition we now that each entry has a line that starts with the @ character. We could use grep to find all the lines that satisfy this condition like this
grep '^@' /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq
Note the use of ^ in front of the @. This tells grep to only find instances of @ at the start of a line. If we didn’t have the ^ it would also return the quality score lines since @ is a quality score character as well.
Let’s use wc again. This time to count the number of lines returned by grep
grep '^@' /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | wc -l
Notice how this gives a very nice round number. This is because the Nanopore basecalling software splits data up into 4000 read chunks. We can therefore obtain a fairly quick estimate of the number of reads by multiplying the number of fastq files by 4000.
How many reads to we have so far?
bioawkbioawk is a powerful piece of software that will enable us to inspect fastq data.
In order to harness the full power of bioawk you need to know the awk language. We don’t have time to go into that properly in this workshop but I will try to explain the commands we use as we go.
This command shows a basic usage of bioawk to inspect fastq data
head /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print $name}'
Stop for a moment and think about what this command is doing? First note that it contains a pipe. What is the command on the left of the pipe? Try running this on its own. Look at the header lines for each sequence (lines starting with @).
Now run the full command again. Notice that it prints the header line (sequence ID) for each entry.
Next we modify the command to be more useful. Instead of printing the name of each entry we pring its sequence with;
head /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print $seq}'
We can augment this command slightly to print the lengths of each sequence as follows
head /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print length($seq)}'
So far we have used the head command to work with just the first 10 lines of a single file. This is useful when troubleshooting but we are now ready to print out all the read lengths. To do this change the head command to cat. The cat command will send the entire contents of the file to our bioawk command.
cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print length($seq)}'
Activity : Calculate total sequencing volume
To work out the total number of sequenced bases we just need to calculate the sum of all the numbers output from the last command. This can be done by modifying the bioawk script as follows;
cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{sum+= length($seq)} END{print sum}'
Activity : Make a histogram of read lengths
Start by modifying the command used in the last exercise. We will use the redirection operator, > to ensure that all this output gets saved to a file
cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | bioawk -c fastx '{print length($seq)}' > read_lengths.tsv
Now we will switch to R to create the required plot. Don’t worry if you don’t understand exactly how this R code works. You will learn more about it on Wednesday and Thursday.
Copy the following code chunk into your RMarkdown document. This chunk loads a package that includes tools for reading files and making plots
library(tidyverse)
When you first try to run this chunk it will most likely result in an error because the package is not yet installed. RStudio will prompt you to install the package. Just click install and let it do its thing. Package installation will take a few minutes.
Now use the next code chunk to read the data in lengths.tsv and use it to make a histogram
lengths <- read_tsv("read_lengths.tsv",col_names="length")
ggplot(lengths,aes(x=length)) + geom_histogram()
This histogram is made from just the first 4000 reads from the example data. Think about how you might generate data and make a histogram for all the reads in the run we started on Monday.
To figure out how to do this first note that the wildcard operator * can be used to list multiple files. So we can list all the fastq files in our output directory like this;
ls /pvol/data/minion/*.fastq
If we can list files we can also use cat to dump the output of files and send this to bioawk
cat /pvol/data/minion/*.fastq | bioawk -c fastx '{print length($seq)}' > read_lengths_minion.tsv
Now remake the histogram of read lengths but use the data in read_lengths_minion.tsv. This should represent the output of our current run.
Activity : Estimate genome coverage
Use the following command to calculate the sequencing volume
(For example data)
cat /pvol/data/amtp4/*.fastq | bioawk -c fastx '{sum+= length($seq)} END{print sum}'
(For our run data)
cat /pvol/data/minion/*.fastq | bioawk -c fastx '{sum+= length($seq)} END{print sum}'
Assuming a genome size of roughly 4Mb calculate the depth of coverage in either case.
I have provide an awk script to summarise quality scores from a fastq file.
BEGIN{
split("!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~",ordered_scores,"")
for(i=1;i<95;i++){
sc=ordered_scores[i]
scores[sc]=0
}
}
NR%4==0{
split($0,read_scores,"")
for(i in read_scores){
sc=read_scores[i]
scores[sc]=scores[sc]+1
}
}
END{
printf("score\tcharacter\tcount\n")
for(i=1;i<95;i++){
sc=ordered_scores[i]
printf("%s\t%s\t%s\n",i-1,sc,scores[sc])
}
}
You don’t need to understand this script to use it. Use it as follows
How you can calculate a score distribution from fastq like this
cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | awk -f qualdist.awk
Activity : Plot the distribution of quality scores
Run the qualdist script and send its output to a file called scores.tsv
cat /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq | awk -f qualdist.awk > scores.tsv
Use the following R code to save the read scores and plot them
scores <- read_tsv("scores.tsv",quote = "")
ggplot(scores,aes(x=score,y=count)) + geom_line()
Activity : Compare quality scores from Nanopore and Illumina data
For comparison lets have a look at quality scores from Illumina data. Data from a MiSeq run on the same bacterial isolate is available at /pvol/data/amtp4/MiSeq/
To dump out the scores use
zcat /pvol/data/amtp4/MiSeq/AMT-P-4_S2_L001_R1_001.fastq.gz | head -n 20000 | awk -f qualdist.awk > miseq_scores.tsv
Then to make a plot use
miseq_scores <- read_tsv("miseq_scores.tsv",quote = "") %>% add_column(platform="MiSeq")
ont_scores <- read_tsv("scores.tsv",quote = "") %>% add_column(platform="MinION")
all_scores <- rbind(miseq_scores,ont_scores)
ggplot(all_scores,aes(x=score,y=count)) + geom_line(aes(color=platform))
There are a wide range of library preparation methods for Nanopore sequencing but all of them use adapters of some type. For whole genome sequencing many downstream tools (eg the Canu genome assembler) will automatically detect and trim adapter sequences. Not all tools do this however so it may be a good idea to trim adapters prior to running further analyses. You should generally check the instructions for the tool you are using and unless otherwise stated you should assume that it requires adapter trimmed sequences.
The software tool porechop by Ryan Wick is a great tool for adapter trimming. Be aware that this tool is no longer maintained so you should check in future to see if other tools have emerged to replace it. Oxford Nanopore distribute another tool called qcat which is similar to porechop but seems more oriented toward demultiplexing (barcoded adapters) than removing simple sequencing adapters.
Activity: Create a qc conda environment and install porechop
conda create -n qc porechop
Switch to the qc environment
conda activate qc
Test that it worked by running the program without arguments
porechop
Activity: Run porechop to check adapter content
Run porechop on a single fastq file and use verbose mode so that it shows adapters in color
porechop -i /pvol/data/amtp4/FAK82735_c1e699b0aaba5cdd8e4f3a986f81dbe576112c87_0.fastq -v 2 -o 0_trim.fastq
Things to notice about the output