  1. Write a for-loop that will:
    • run FastQC on each of the FASTQ files that you previously downloaded from the Gierlinski dataset.
    • run TrimGalore on each FASTQ file.
    • run FastQC on the trimmed dataset.
  2. Briefly summarize any QC results or differences.
  3. Based on the QC, would you be justified in combining any of the FASTQ files given that they are technical replicates?
  4. Even if the answer to the previous question is “no”, how could you combine the several FASTQ files into one?
  5. Which base call is more likely to be incorrect – one whose Phred score is encoded as # or one whose Phred score is encoded as ;?
  6. Explain at least 2 reasons for base calling uncertainties (i.e. what factors could explain lower than expected/desired sequencing scores) and how they can be avoided/alleviated.
  7. What is the baseline uncertainty that Illumina attaches to its base calls? In other words, how likely is it that a base call is wrong even if it got the highest possible Phred score of 41? How many bases can you therefore expect to be wrong in a file with 1 million 50bp-long reads? Does this concern you? (Briefly justify your answer.)
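
The Phred-score questions above rest on two relations: a quality character is decoded as Q = ASCII code - offset, and the error probability of a base call is P = 10^(-Q/10). The following is a minimal Python sketch for checking your own arithmetic. It assumes the Phred+33 encoding (the offset used by Sanger-style and recent Illumina FASTQ files; older Illumina files used an offset of 64), and the function names are made up for illustration. The usage lines deliberately use values other than the ones asked about.

```python
def phred_from_char(symbol: str, offset: int = 33) -> int:
    """Decode one FASTQ quality character into a Phred score.

    The default offset of 33 is an assumption (Phred+33 encoding).
    """
    if len(symbol) != 1:
        raise ValueError("expected a single quality character")
    return ord(symbol) - offset


def error_probability(phred: float) -> float:
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def expected_errors(phred: float, n_reads: int, read_length: int) -> float:
    """Expected number of wrong calls if every base had the same Phred score."""
    return error_probability(phred) * n_reads * read_length


if __name__ == "__main__":
    q = phred_from_char("I")  # 'I' is ASCII 73, so Q = 40 under Phred+33
    print(q, error_probability(q))
    # Hypothetical file: 1,000 reads of 100 bp, every base at Q30
    print(expected_errors(30, n_reads=1000, read_length=100))
```

Note that expected_errors treats every base as having the same score, which gives a best-case estimate when the highest possible score is plugged in; real reads contain a mix of scores.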

Project work:

  1. Expand your project ideas. Come up with (at least) one specific hypothesis that you want to test. Include (i) how you came up with this hypothesis and why it is interesting, (ii) why NGS data are well suited to address it, and (iii) how you will go about testing it.
  2. Specify the data you will need.
    • Locate potential datasets and describe them (when/where were they generated, what sequencing platform was used, etc.).
    • Think about possible biases or technical problems that you might run into if you were to use these data. (Hint: remember the lecture about experimental design!)

Compile the .Rmd file and send both the .Rmd and the HTML files to angsd_2019@zoho.com by Sunday night.