- Write a for-loop that will
  - run FastQC on each of the FASTQ files that you previously downloaded from the Gierlinski dataset,
  - run TrimGalore on each FASTQ file,
  - run FastQC on the trimmed dataset.
- Briefly summarize any QC results or differences.
- Based on the QC, would you be justified in combining any of the FASTQ files given that they are technical replicates?
- Even if the answer to the previous question is “no”, how could you combine the several FASTQ files into one?
- Which base call is more likely to be incorrect: one with a Phred score of `#` or one with a Phred score of `;`?
- Explain at least 2 reasons for base calling uncertainties (i.e. what factors could explain lower than expected/desired sequencing scores) and how they can be avoided/alleviated.
- What is the baseline uncertainty that Illumina attaches to its base calls? In other words, how likely is it that a base call is wrong even if it got the highest possible Phred score of 41? How many bases can you therefore expect to be wrong in a file with 1 million 50bp-long reads? Does this concern you? (Briefly justify your answer)
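
As a reminder for the Phred-score questions above: a Phred score Q and the probability P that a base call is wrong are related by Q = -10 · log10(P), i.e. P = 10^(-Q/10). The following is a minimal sketch of that conversion, not a prescribed solution. It assumes the Phred+33 (Sanger / Illumina 1.8+) encoding, in which the score is the ASCII code of the quality character minus 33; the function names are hypothetical and chosen only for illustration.

```python
import math

# Assumption: Phred+33 encoding (Sanger / Illumina 1.8+).
# Older Illumina pipelines used an offset of 64 instead.
PHRED_OFFSET = 33


def char_to_phred(ch, offset=PHRED_OFFSET):
    """Convert a single FASTQ quality character to its Phred score."""
    return ord(ch) - offset


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def expected_errors(n_reads, read_length, q):
    """Expected number of wrong bases if every base had Phred score q."""
    return n_reads * read_length * phred_to_error_prob(q)


# Print score and error probability for a few example quality characters.
for ch in "!5I":
    q = char_to_phred(ch)
    print(ch, q, phred_to_error_prob(q))
```

The same helpers can be applied to any quality character or score mentioned in the questions, for example `char_to_phred("#")` or `expected_errors(n_reads, read_length, q)` with the numbers given above.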
Project work:
- Expand your project ideas. Come up with (at least) one specific hypothesis that you want to test. Include (i) how you came up with this hypothesis and why it is interesting, (ii) why NGS data is well suited to address it, and (iii) how you will go about testing it.
- Specify the data you will need.
- Locate potential datasets and describe them (when/where were they generated, what sequencing platform was used, etc).
- Think about possible biases or technical problems that you might run into if you were to use these data. (Hint: remember the lecture about experimental design!)
Compile the .Rmd file and send both the .Rmd and the HTML files to angsd_2019@zoho.com by Sunday night.