Practical Project

Your project = small scientific project! The project will make up 70% of your final grade.

You will be presenting the results on April 16, 2019.

The 4 main points that we’re going to look out for are:

have a question – this could be a fairly simple one!

have an idea of how to answer it (and understand when you may not be able to address it)

generate the answer in a way that others can reproduce your analysis and results

put a strong emphasis on data processing

The project will run until the end of the semester. The first steps will be:

Identify a question of interest, e.g. “What is the difference between the transcriptome of cells of the inner ear and the outer ear?”
- formulate (at least) one hypothesis: e.g. “There is no difference.”
Identify the type of data that you will analyze.
- e.g. RNA-seq data
Identify the specific data set that you will use.
- e.g. a GEO or SRA accession number
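These first decisions can be written down in a small structure so that the plan is explicit from the start. The sketch below is only an illustration: the `ProjectPlan` class, its field names, and the `accession_source` helper are hypothetical names introduced here, and the regular expressions are an assumption that covers only the common GEO (GSE, GSM, GDS, GPL) and SRA/ENA/DDBJ (e.g. SRP, SRR, ERR, DRR) prefixes. The accession in the usage example is a placeholder, not a real data set.

```python
import re
from dataclasses import dataclass

# Assumed patterns for common accession prefixes; not exhaustive.
GEO_PATTERN = re.compile(r"^G(SE|SM|DS|PL)\d+$")
SRA_PATTERN = re.compile(r"^[SED]R[APXRS]\d+$")


def accession_source(accession: str) -> str:
    """Return 'GEO', 'SRA', or 'unknown' based on the accession prefix."""
    acc = accession.strip().upper()
    if GEO_PATTERN.match(acc):
        return "GEO"
    if SRA_PATTERN.match(acc):
        return "SRA"
    return "unknown"


@dataclass
class ProjectPlan:
    """Hypothetical container for the first project decisions."""
    question: str
    hypothesis: str
    data_type: str
    accession: str

    def is_complete(self) -> bool:
        """True if every field is filled in and the accession looks plausible."""
        fields = (self.question, self.hypothesis, self.data_type, self.accession)
        all_filled = all(text.strip() for text in fields)
        return all_filled and accession_source(self.accession) != "unknown"


if __name__ == "__main__":
    plan = ProjectPlan(
        question="What is the difference between the transcriptome of cells "
                 "of the inner ear and the outer ear?",
        hypothesis="There is no difference.",
        data_type="RNA-seq",
        accession="GSE000000",  # placeholder accession, for illustration only
    )
    print(accession_source(plan.accession), plan.is_complete())
```

A pattern check like this only catches typos in the accession format; it does not confirm that the data set exists or that it suits the question.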

At the end of the semester, we expect the following result:

A coherent report (similar to a short research paper) about how you addressed your question of interest and whether the data you used supported that analysis, including
- "Introduction" A brief paragraph summarizing the scientific background and/or why that particular question is interesting (you should cite at least 3-5 papers). Of course, you should clearly state the question you’re going to investigate as well as the specific hypothesis you set out to test.
- "Results" Another brief paragraph summarizing your key insights and possible future experiments/analyses that might enhance your own analysis. Make sure to include a discussion of the limits that your data set has!
- "Methods" A detailed description of all the steps you took to arrive at the conclusion, including how and where the data was downloaded, pre-processed and analyzed. This should also include some brief reasoning of why you chose certain tools/solutions.
- "Discussion" A brief description/list of issues/problems/limitations you encountered along the way and how you addressed them.
- A table that summarizes the key data sets that you have generated during the analyses and decided to keep.
The report should have at least 3 figures that illustrate some aspects of your analysis, either the ones you find most difficult to explain without a graph or the points that are most important in your opinion – there is no upper limit on the number of figures to include, but make sure to describe/reference every single one! In addition, we want all the code (and data) that one would need to recreate those images.
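One possible way to start the table of key data sets, and to let others check that they are working with the same files, is to record each file's size and checksum. The following is a minimal sketch under stated assumptions: the function names, the `results` directory, the `*.tsv` pattern, and the column layout are all made up for illustration, and the output is a Markdown table that could be pasted into the report.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_table(paths, descriptions=None) -> str:
    """Build a Markdown table with one row per file.

    `descriptions` is an optional dict mapping file names to free text.
    """
    descriptions = descriptions or {}
    lines = [
        "| File | Size (bytes) | SHA-256 (first 12 chars) | Description |",
        "| --- | --- | --- | --- |",
    ]
    for path in sorted(Path(p) for p in paths):
        lines.append(
            f"| {path.name} | {path.stat().st_size} | "
            f"{sha256_of(path)[:12]} | {descriptions.get(path.name, '')} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    results_dir = Path("results")  # hypothetical output directory
    files = results_dir.glob("*.tsv") if results_dir.is_dir() else []
    print(manifest_table(files))
```

Truncating the checksum keeps the table readable; the full value can be kept in a separate file if exact verification matters.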

The report should be written in R Markdown, and we expect you to share both the compiled HTML or PDF file and the “source” Rmd file.

You will give a 10-15 minute presentation that should be a condensed version of your report. Be prepared to answer questions at the end (5 minutes Q&A).

What is important to us?

Critical assessment of the data at hand. Are there any issues with it?
Perseverance – do not give up! Even if the data does not seem to contain the answer you were looking for, explain to us why you feel that way and what type of data you would need instead.
Keep as many notes as you need, whether handwritten or electronic – whatever works for you!

Thoughts about hypotheses

A hypothesis is a declarative sentence that predicts the results of a research study based on existing scientific knowledge and stated assumptions. It is a prediction that answers the research question. Hypotheses are statements that, if true, would explain the researchers’ observations (Lipowsky, 2008).

Examples of statements that are not hypotheses:

“Investigating the transcriptome of colon cancer patients.”
“Understanding the sequences of flu viruses.”
“Comparing method 1 with method 2.”

Examples of how these statements may give rise to hypotheses:

“There are numerous genes that are differentially expressed when comparing male and female colon cancer patients.”
“Different strains of flu viruses carry different gene sequences for specific surface proteins.”
“Method 1 yields more comprehensive information about the transcriptome of heterogeneous tissues than method 2.” [Here, one would have to further specify what is meant by “comprehensive information”, e.g. the number of genes that can be detected as expressed is higher in method 1 than in method 2]
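The bracketed remark in the last example turns a vague claim into something that can be counted. The sketch below shows one way this could look, assuming per-gene read counts for each method are available as dictionaries that map gene IDs to a list of counts per sample. The function name, the thresholds `min_count` and `min_samples`, and the toy counts are arbitrary illustrations, not recommended values or real data.

```python
def detected_genes(counts, min_count=10, min_samples=2):
    """Return the set of genes with at least `min_count` reads
    in at least `min_samples` samples."""
    return {
        gene
        for gene, values in counts.items()
        if sum(value >= min_count for value in values) >= min_samples
    }


if __name__ == "__main__":
    # Toy read counts (gene -> counts in three samples); made up for illustration.
    method1 = {"geneA": [25, 30, 12], "geneB": [0, 11, 15], "geneC": [3, 2, 0]}
    method2 = {"geneA": [40, 18, 22], "geneB": [1, 0, 12], "geneC": [0, 0, 1]}

    found1 = detected_genes(method1)
    found2 = detected_genes(method2)

    print("method 1:", len(found1), "genes detected")
    print("method 2:", len(found2), "genes detected")
    print("only in method 1:", sorted(found1 - found2))
```

Comparing two such numbers is only a starting point: the result depends on the chosen thresholds and on sequencing depth, so both would have to be reported and justified alongside the hypothesis.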

The aim is that the purpose and objectives of your project become clear and unambiguous: What do we need to know and why?

For a very practical approach to developing a hypothesis, this article may be helpful.

If you want to dive into the history and importance of (and controversies around) hypothesis-led research, here are two primers:

Some suggestions

Biological questions

Is there a certain type of disease/cell state/cell type you’re interested in? Perhaps a specific animal model interests you?

Are there certain molecular mechanisms that particularly interest you? Such as transcription factor binding motifs, evolution of dosage compensation, identifying non-protein-coding RNAs, examining histone mark distributions etc.

It’s fine to start your brainstorming with a fairly broad question, but make sure to narrow its scope enough that it can eventually be addressed by one or more NGS experiments.

population studies:
- genetic variation within and between populations (caution: this will require lots of data and possibly a good deal of statistics)
- Evolution of mammalian miRNA genes
“small-scale” experiments – the most common types of NGS-based experiments

Technical questions

Technical questions may actually be more straightforward to address because they tend to be fairly limited in scope.

The effects of sequencing platforms on phylogenetic resolution in 16S rRNA gene profiling of human feces
Comparing different “large scale efforts” in terms of data quality and possibly results: An RNA-Seq atlas of gene expression in mouse and rat normal tissues vs. Cross-platform ultradeep transcriptomic profiling of human reference RNA samples by RNA-Seq
Investigating the potential impact of sample preparation factors, including library storage time, quantity of input RNA, and sample cryopreservation, on RNA-seq experiments
Comparing different DNA extraction methods e.g. Performance comparison of three DNA extraction kits on human whole-exome data from formalin-fixed paraffin-embedded normal and tumor samples
Comparing different library preparations, e.g. Poly-A vs. ribo-depletion or different libraries for small RNAs or different strand-specific RNA-seq library preps
Understanding the impact of sequencing depth for ATAC-seq: this could be done using both published and unpublished data (we have some data at hand in case you’d be interested)
Understanding biases of ChIP-seq data, e.g. Teytelman et al., 2013 and Jain et al., 2015
Assessing GC content bias in old (before 2012) and new (after 2016) NGS data sets (follow, for example, Benjamini & Speed, 2012)
Understanding issues with (clinically relevant!) tests based on cell-free DNA fragments, e.g. for prenatal screening
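As an illustration of how small a first technical analysis can be, the GC content suggestion in the list above could begin with a per-read GC summary. The sketch below assumes read sequences are already available as plain strings (in practice they would be parsed from FASTQ files); the function names, the number of bins, and the toy sequences are assumptions made here. It is a far simpler summary than the analysis in the cited paper.

```python
from typing import List, Optional, Tuple


def gc_counts(sequence: str) -> Tuple[int, int]:
    """Return (number of G/C bases, number of A/C/G/T bases); N and others are ignored."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc, gc + at


def gc_fraction(sequence: str) -> Optional[float]:
    """GC fraction of a sequence, or None if it has no A/C/G/T bases."""
    gc, total = gc_counts(sequence)
    return gc / total if total else None


def gc_histogram(sequences, n_bins: int = 10) -> List[int]:
    """Count sequences per GC bin; bin i covers [i/n_bins, (i+1)/n_bins),
    and a GC fraction of exactly 1.0 falls into the last bin."""
    histogram = [0] * n_bins
    for sequence in sequences:
        gc, total = gc_counts(sequence)
        if total == 0:
            continue  # skip reads without any called bases
        # Integer arithmetic avoids floating-point rounding at bin edges.
        index = min(gc * n_bins // total, n_bins - 1)
        histogram[index] += 1
    return histogram


if __name__ == "__main__":
    toy_reads = ["ATATATAT", "ATGCATGC", "GGCCGGCC", "NNNNNNNN"]  # made-up reads
    print([gc_fraction(read) for read in toy_reads])
    print(gc_histogram(toy_reads, n_bins=4))
```

Comparing such histograms between older and newer data sets would only hint at a difference; relating GC content to coverage requires aligned reads and a proper model.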

Friederike Duendar and Luce Skrabanek

ANGSD Course 2019
