r-make

Documentation - Usage.

To begin the r-make build process, follow these steps.

Setup directory structure.

First, create a Project directory and populate it with Sample directories. Within each Sample directory, place your gzip-compressed FASTQ files. The directory structure should follow the layout shown in Figure 1.

Figure 1: Starting directory structure for r-make.

NOTE: Project and sample names do not need the prefix 'Project-' or 'Sample-'; plain names such as 'Mouse1exp', 'M2cntl', 'Liver', or 'Brain' are fine.
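
The layout described above can be sketched in shell as follows. This is only an illustration: the helper name make_layout and the project and sample names (Liver, Mouse1exp, M2cntl) are assumptions chosen for the example, and the directories are created in a scratch location rather than a real experiment folder.

```shell
# Sketch: create one Project directory containing one directory per sample.
# make_layout is a hypothetical helper, not part of r-make.
make_layout() {
    project_dir=$1
    shift
    for sample in "$@"; do
        mkdir -p "$project_dir/$sample"
    done
}

# Example: build the layout under a temporary directory and list it.
root=$(mktemp -d)
make_layout "$root/Liver" Mouse1exp M2cntl
find "$root" -type d | sort
```

The gzip-compressed FASTQ files for each sample would then be copied into the matching sample directory.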

Fastq naming scheme

FASTQ files must be named following Illumina's naming scheme as in CASAVA 1.8, with the addition of the flowcell ID:

<sample name>_<flowcell id>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

For example, the following is a valid FASTQ file name:

NA10831_D0KWFACXX_CGATGT_L001_R1_001.fastq.gz
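
A file name can be checked against this scheme with a regular expression. The sketch below is not part of r-make; the function name is_valid_fastq_name is hypothetical, and the pattern assumes that the sample name, flowcell ID, and barcode fields contain no underscores and that the read number is one or more digits.

```shell
# Sketch: return success if the file name (without directory) follows
# <sample>_<flowcell>_<barcode>_L<3 digits>_R<digits>_<3 digits>.fastq.gz
# Assumption: no field before the lane contains an underscore.
is_valid_fastq_name() {
    printf '%s\n' "$1" |
        grep -Eq '^[^_]+_[^_]+_[^_]+_L[0-9]{3}_R[0-9]+_[0-9]{3}\.fastq\.gz$'
}

# Example usage
if is_valid_fastq_name "NA10831_D0KWFACXX_CGATGT_L001_R1_001.fastq.gz"; then
    echo "name follows the scheme"
fi
```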

To insert the flowcell ID into the names of FASTQ files generated by Illumina's CASAVA 1.8 pipeline, you may use the following code:

for f in $(find . -name "*fastq.gz"); do
  dir=$(dirname "$f"); base=$(basename "$f")
  # the flowcell ID is the third colon-separated field of the first read header
  fcid=$(zcat "$f" | head -n1 | cut -d":" -f3)
  mv "$f" "$dir/${base%%_*}_${fcid}_${base#*_}"
done

NOTE: the <sample name> segment in the FASTQ file name should match the name of its parent Sample directory.
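
One way to check this rule is to compare the text before the first underscore in the file name with the name of the parent directory. The function name sample_matches_dir is hypothetical, and the sketch assumes the sample name itself contains no underscore.

```shell
# Sketch: succeed if the <sample name> segment of a FASTQ file name
# equals the name of the directory that contains the file.
sample_matches_dir() {
    parent=$(basename "$(dirname "$1")")
    name=$(basename "$1")
    # ${name%%_*} keeps everything before the first underscore
    [ "${name%%_*}" = "$parent" ]
}

# Example usage (the path does not need to exist for the check itself)
if sample_matches_dir "Liver/NA10831/NA10831_D0KWFACXX_CGATGT_L001_R1_001.fastq.gz"; then
    echo "sample name matches its directory"
fi
```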

Illegal Characters

Project and sample names cannot contain the space character or any of the following characters, some of which are not allowed by certain file systems:

? ( ) [ ] / \ = + < > : ; " ' , * ^ | & . _
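
A name can be screened for these characters before the directories are created. The following is a sketch only; the names illegal_pattern and has_illegal_chars are hypothetical, and the bracket expression simply lists the space character and the characters shown above.

```shell
# Sketch: succeed if the given name contains a space or any listed character.
# In a POSIX bracket expression, ']' must come first to be taken literally.
illegal_pattern='[] ?()[/\=+<>:;",*^|&._'"'"']'

has_illegal_chars() {
    printf '%s\n' "$1" | grep -q -e "$illegal_pattern"
}

# Example usage
for name in Mouse1exp "Mouse 1" Project_A; do
    if has_illegal_chars "$name"; then
        echo "illegal: $name"
    else
        echo "ok: $name"
    fi
done
```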

Create a configuration file.

r-make requires a configuration file to inform the build process of key parameters. The configuration file is translated into a unique makefile for each build, which specifies exactly what commands need to be executed to carry out the requested analysis. The following table lists the parameters that can be specified in the configuration file:

parameter (requirement) definition
EXPT_DIR (required) Provide the path to the project directory in the experiment folder.
GENOME (required) Specify the genome code.
INPUT (required) Specify input type: 'fastq' or 'bam'.
ALIGN (required) Should the reads be aligned? BOOL 1,0.
ALIGNER (required if ALIGN=1) Name of the aligner to use. NOTE: only STAR is supported at present.
ALIGNER_INDEX (required if ALIGN=1) Name of the index to be aligned to.
ALIGNER_OPTIONS (optional) Any option that is a valid aligner option can be specified here. STAR options are listed in the STAR manual.
ANNOTATION (required if GENE_COUNT, GENEBODY, or DISTRIBUTION=1) Specify the name of the reference annotation.
GENE_COUNT (optional) Count reads per gene? BOOL 1,0.
FASTQ_STATS (optional) Generate raw FASTQ stats? BOOL 1,0.
NVC (optional) Generate nucleotide intensity vs. cycle data? BOOL 1,0.
GC (optional) Generate GC content density of reads? BOOL 1,0.
GENEBODY (optional) Calculate coverage across genebody? BOOL 1,0.
INSERT (optional) Estimate insert size of reads? BOOL 1,0. NOTE: for paired-end reads only.
DISTRIBUTION (optional) Determine distribution of reads across genic features? BOOL 1,0.
BIOTYPE (optional) Print frequency of biotypes of detected genes? BOOL 1,0.
DUPLICATES (optional) Calculate number of duplicated reads as determined by sequence composition and mapping coordinates? BOOL 1,0.
ERROR (optional) Estimate error rate from mapping statistics? BOOL 1,0.
MAPPED (optional) Print number of mapped and unmapped reads? BOOL 1,0.
STRAND (optional) Print number of reads mapping to forward and reverse strands? BOOL 1,0.
TAR (optional) Print 'transcriptionally active regions' (i.e., regions of the genome which are transcribed, but which are outside of the reference annotation)? BOOL 1,0.
QUAL (optional) Plot quality scores of mapped and unmapped reads as a function of read position? BOOL 1,0.
KARYOGRAM (optional) Create a karyogram depicting expression of exons across each chromosome? BOOL 1,0. NOTE: only works for hg19 at present.
PROJECT_PLOTS (optional) Create project-level plots? BOOL 1,0.
WEB_USERNAME, WEB_HOSTNAME, WEB_DIR, WEB_HTML_ADDRESS (optional) Upload the results to a web server. NOTE: keyless login must be configured.
EMAIL_ADDRESS (optional) Specify a comma-separated list of e-mail addresses for notification upon completion of analysis.

An example configuration file is listed below:

EXPT_DIR /scratchLocal/Project_A
INPUT fastq
ALIGN 1
ALIGNER star
ALIGNER_OPTIONS --outFilterMultimapNmax 1 --readFilesCommand zcat
GENOME hg19
ALIGNER_INDEX hg19
ANNOTATION refseq
MAPPED 1
STRAND 1
GENE_COUNT 1
FASTQ_STATS 1
NVC 1
GC 1
GENEBODY 1
INSERT 1
BIOTYPE 1
DISTRIBUTION 1
DUPLICATES 1
ERROR 1
TAR 1
QUAL 1
KARYOGRAM 1
PROJECT_PLOTS 1
WEB_USERNAME chmweb
WEB_HOSTNAME okeeffe.med.cornell.edu
WEB_DIR /pubnet_store001/cmlab_data/r-make/
WEB_HTML_ADDRESS physiology.med.cornell.edu/faculty/mason/lab/data3/rmake/
EMAIL_ADDRESS paz2005@med.cornell.edu
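
Before running the configuration script, it can be useful to confirm that the required parameters are present. The sketch below is not part of r-make and does not reflect how Config.pl parses its input; the function names get_param and check_config are hypothetical, and the code assumes one whitespace-separated "PARAMETER value" pair per line, as in the example above.

```shell
# Sketch: print the value of parameter $2 from configuration file $1.
get_param() {
    awk -v key="$2" '$1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }' "$1"
}

# Sketch: report missing required parameters; return non-zero if any is missing.
check_config() {
    cfg=$1
    status=0
    required="EXPT_DIR GENOME INPUT ALIGN"
    # ALIGNER and ALIGNER_INDEX are required only when ALIGN is 1
    if [ "$(get_param "$cfg" ALIGN)" = "1" ]; then
        required="$required ALIGNER ALIGNER_INDEX"
    fi
    for key in $required; do
        if [ -z "$(get_param "$cfg" "$key")" ]; then
            echo "missing required parameter: $key" >&2
            status=1
        fi
    done
    return $status
}
```

For example, `check_config config.txt && echo "config looks complete"` would print the message only if every required parameter has a value.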

Run the configuration script.

Before starting the r-make build process, you must first generate the makefiles. Run the configuration script ('Config.pl') from the r-make directory, pointing to the configuration file and the EXPT_DIR, to create the makefiles.

/path-to-rmake/Config.pl config.txt --EXPT_DIR path_to_ProjectDir

Begin r-make.

To begin the r-make build process, move into the project directory and type "make". Because make is inherently capable of parallelization, you may choose to use the '-j' switch. For example, to start the build process using 40 CPUs:

make -j 40

WARNING: because make itself is parallelizable, it is *not* advisable to internally thread STAR; doing so may overload the node.
NOTE: it is a good idea to keep a log file of the make build. To do so, you may append '> make.log 2>&1' to the end of the make command; the redirections must appear in this order for both standard output and standard error to reach make.log.

p. zumbo