Exercises (14 pts)

The goal is to create a package that contains:

The package should be installable via install.packages().

Exercises that involve reading in data and plotting should become part of the vignette (which is really just a good old R Markdown document in which you keep track of example analyses you’ve done with the functions of a given package).

Feel free to add any other functions you might come up with along the exercises to your package.

  1. Set up a new package. (0.5 pts)
  2. Use the reading_in function (shown below) as your first function in the newly generated package. Describe the steps you have to take in order to make that function part of the package. (1pt)
  3. Make sure to adapt the DESCRIPTION file to note all the packages that this function depends on. (1pt)
  4. Load the function into your workspace and use it to extract the values of FastQC’s diagnostic “Per base sequence quality” from a single fastqc_data.txt file into an R object. (1pt)
    • Each FastQC run should have produced such a file (usually stored in the zipped output folder) – it’s fine to download these files to your computer.
    • The command would go into the Rmd document that will become your vignette.
  5. Explain the logic of the function’s sed command. (1pt)
    • Put that in the vignette, too. In principle, the @details section of the function’s documentation would be a good place to put it, too, but for the sake of the homework, just keep it in the vignette.
  6. Now go back to the function’s code and add an argument that lets the user specify a sample name (e.g. “WT_1_ERR458493”), which is then stored in an additional column of the resulting data frame. I.e., the function should get at least one more argument. (2pts)
  7. Use your updated function to read in the FastQC results of at least 4 fastq files that should cover 2 biological replicates and 2 technical replicates of each. Make sure to keep track of the sample name in the new R objects you’re creating. (2pts)
    • It’s fine to use an R-appropriate version of a for-loop for this (go back to the course notes for a refresher).
  8. Combine all these data.frames into one (check out rbind(); if you’ve generated a list in the previous exercise, also look into the do.call() function). Save that composite data frame as an .rda file (with the save() function), giving the file the same name as the R object (e.g. combined_df.rda). (1pt)
  9. The goal is to include that combined data frame as a data object with your package.
    • Figure out where to store the .rda file within the package infrastructure. (0.5pt)
    • Document your object following the examples here. Where do you keep the documentation of the data file? (1pt)
  10. How do you build your package? (1pt)
    • You can include the answer to this in the vignette, too, for the sake of the homework answers all being kept in one place. Make sure to set the code chunk option eval=FALSE though (why?).
  11. Make a ggplot2-based plot using the combined data frame. Try to mimic the basic features of the example plot below, but feel free to change the color palette, remove the grey background and other details. (2pts)
    • This should be part of the vignette, too.
    • You will probably have to add a couple of columns to your original combined data frame.
    • You will get a bonus point if you (i) install the package (instead of loading it via devtools) and (ii) use the data stored in the package to make the plot.
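
As a hint for exercises 7 and 8, the general pattern of "read several files into a list, then bind the list into one data frame and save it" can be sketched roughly as follows. This is only an illustration under stated assumptions: `fake_read()` is a hypothetical stand-in for your updated reading function, the sample names and values are made up, and the file is written to a temporary directory rather than into a package.

```r
# Hypothetical stand-in for the updated reading function of exercise 6:
# it returns a small made-up data frame tagged with the sample name.
fake_read <- function(sample_name) {
  data.frame(
    Base = 1:3,
    Mean = c(31.5, 32.0, 32.4),
    sample = sample_name,
    stringsAsFactors = FALSE
  )
}

# Made-up sample names: 2 biological replicates x 2 technical replicates
sample_names <- c("WT_1_techrep1", "WT_1_techrep2",
                  "WT_2_techrep1", "WT_2_techrep2")

# lapply() is the R-style loop: it returns a list with one data frame per sample
df_list <- lapply(sample_names, fake_read)

# do.call() hands every element of the list to rbind() as a separate argument
combined_df <- do.call(rbind, df_list)

# save() stores the object under its own name, so load() restores `combined_df`
rda_file <- file.path(tempdir(), "combined_df.rda")
save(combined_df, file = rda_file)
```

Because `save()` keeps the object name, a later `load(rda_file)` recreates an object called `combined_df` no matter what the file itself is called; giving the file and the object the same name simply avoids confusion.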

Here’s the function to get you started with your package:

#' Function for parsing the text output of FastQC
#'
#' This function extracts the values for a specific test run by FastQC on a
#' single fastq file.
#'
#' @param file string that specifies the path to an individual FastQC result file
#' (typically named "fastqc_data.txt")
#' @param test Indicate which test results should be extracted. Default:
#' "Per base sequence quality". Other options are, for example, "Per tile sequence quality",
#' "Per sequence quality score" etc.
#'
#' @return data.frame with the values of a single FastQC test result.
#'
#' @examples \dontrun{
#' res <- reading_in(file = "acinar-3_S9_L001_R1_001_fastqc/fastqc_data.txt")
#' }
reading_in <- function(file, test = "Per base sequence quality"){

    ## generate the string that will be used for the file parsing
    syscommand <- paste0("sed -n '/", test, "/,/END_MODULE/p' ", file, " | grep -v '^>>'")
    
    ## use the fread command, which can interpret UNIX commands on the fly to
    ## read in the correct portion of the FastQC result
    dat <- data.table::fread( cmd = syscommand, header = TRUE) %>% as.data.frame
    return(dat)
}
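
To get a feeling for what the sed/grep pipeline in the function selects (exercise 5), it can help to mimic its two steps in base R on a tiny example. The sketch below is not how the function works internally; it only imitates the effect of the shell command. The toy lines are made up and merely follow the general layout of a fastqc_data.txt file, and the sketch handles only the first matching block, whereas a sed range would start again if the pattern matched later in the file.

```r
# Made-up excerpt that imitates the layout of a fastqc_data.txt file
toy <- c(
  ">>Basic Statistics\tpass",
  "#Measure\tValue",
  "Encoding\tSanger / Illumina 1.9",
  ">>END_MODULE",
  ">>Per base sequence quality\tpass",
  "#Base\tMean\tMedian",
  "1\t31.5\t33.0",
  "2\t31.8\t34.0",
  ">>END_MODULE",
  ">>Per sequence quality scores\tpass",
  "#Quality\tCount",
  "30\t1200.0",
  ">>END_MODULE"
)

test <- "Per base sequence quality"

# Step 1, like sed -n '/<test>/,/END_MODULE/p':
# keep everything from the first line matching `test`
# up to and including the next line matching END_MODULE
start <- grep(test, toy, fixed = TRUE)[1]
end_candidates <- grep("END_MODULE", toy, fixed = TRUE)
end <- end_candidates[end_candidates > start][1]
block <- toy[start:end]

# Step 2, like grep -v '^>>': drop all lines that start with ">>"
kept <- block[!grepl("^>>", block)]

# What remains is a tab-separated table whose first line is the header;
# comment.char = "" keeps the leading "#" from being treated as a comment
dat <- read.delim(text = kept, header = TRUE,
                  comment.char = "", check.names = FALSE)
```

Printing `block` and `kept` side by side shows which lines each step removes, which is a good starting point for the explanation in your vignette.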

Example plot:

Build the package and send it in by Saturday night. If you need support, get in touch with Merv on Thursday, 3-4pm.