r-make

Preparing reference files.

To prepare r-make-compatible reference and annotation datasets, follow the instructions below.

Preparing STAR index files.

Download the reference FASTA file from, for example, the UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/downloads.html). Create the STAR index within the r-make index directory. Taking human (hg19) as an example:

#move into the r-make index directory
cd ~/rmake-1.0c/references/indexes

#create an hg19 reference directory
mkdir hg19

#move into the hg19 reference directory
cd hg19

#download the reference files
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

#extract and decompress the reference files
tar -xvzf chromFa.tar.gz

#remove random reference files
rm *random*

#remove haplotypes
rm *hap*

#remove contigs which cannot be confidently placed into a chromosome
rm *chrUn*

#concatenate the chromosomes into a single file
cat *chr*fa > hg19.fa

#remove the chromosomes
rm *chr*fa

#create the STAR reference index
~/rmake-1.0c/third-party/star --runMode genomeGenerate --genomeDir ./ \
--genomeFastaFiles hg19.fa --runThreadN 4

#index the reference FASTA file
~/rmake-1.0c/third-party/samtools faidx hg19.fa
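
Before building the index, it can help to confirm which sequences ended up in the concatenated FASTA file. The sketch below is not part of r-make; `list_fasta_names` is a hypothetical helper name. For hg19, the removal steps above should leave only chr1 to chr22, chrX, chrY, and chrM.

```shell
# list_fasta_names: hypothetical helper, not part of r-make.
# Prints the name of every sequence in a FASTA file, one per line.
list_fasta_names() {
    grep '^>' "$1" | sed 's/^>//'
}

# Usage (run inside the hg19 index directory):
#   list_fasta_names hg19.fa
#   list_fasta_names hg19.fa | grep -E -c 'random|hap|chrUn'   # ideally prints 0
```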

Preparing the reference annotation files.

Retrieve the appropriate reference annotation files in the BED-6 and BED-12 format, along with rRNA coordinates, from, for example, the UCSC Table Browser. Using human (hg19) and RefSeq gene annotation as an example:

 1) Point your web browser here: http://genome.ucsc.edu/cgi-bin/hgTables

 2) Download the BED-12 format. Set up the Table Browser as follows:
      *clade: Mammal
      *genome: Human
      *assembly: Feb. 2009 (GRCh37/hg19)
      *group: Genes and Gene Prediction Tracks
      *track: RefSeq Gene
      *table: refFlat
      *region: genome
      *output format: BED - browser extensible data
      *output file: refseq.bed12
      *file type returned: plain text

Click "get output". Then, choose "Whole Gene", and click "get BED" to download.

 3) Download the BED-6 format for each feature. Set up the Table Browser as follows:
      *clade: Mammal
      *genome: Human
      *assembly: Feb. 2009 (GRCh37/hg19)
      *group: Genes and Gene Prediction Tracks
      *track: RefSeq Gene
      *table: refFlat
      *region: genome
      *output format: BED - browser extensible data
      *output file: refseq.exons
      *file type returned: plain text

Click "get output". Then, choose "Exons", and click "get BED" to download.

Repeat this process for "Introns", "5' UTR Exons", and "3' UTR Exons", changing the name of the output file to 'refseq.introns', 'refseq.5utr', and 'refseq.utr3', respectively.
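
A quick way to confirm that each download has the expected layout is to count the tab-separated fields per line. The sketch below is not part of r-make; `bed_field_counts` is a hypothetical helper name. Assuming the Table Browser settings above, 'refseq.bed12' should report 12 and each feature file should report 6.

```shell
# bed_field_counts: hypothetical helper, not part of r-make.
# Prints each distinct number of tab-separated fields found in a BED file,
# skipping empty lines and "track", "browser", and "#" header lines.
bed_field_counts() {
    awk -F '\t' '!/^(track|browser|#)/ && NF > 0 { seen[NF] = 1 }
                 END { for (n in seen) print n }' "$1" | sort -n
}

# Usage:
#   bed_field_counts refseq.bed12
#   bed_field_counts refseq.exons
```

More than one number in the output suggests a truncated or mixed-format file.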

 4) Retrieve the ribosomal RNA (rRNA) coordinates. Set up the Table Browser as follows:
      *clade: Mammal
      *genome: Human
      *assembly: Feb. 2009 (GRCh37/hg19)
      *group: All Tables
      *database: hg19
      *table: rmsk
      *region: genome
      *output format: BED - browser extensible data
      *output file: ribosomal.bed
      *file type returned: plain text

Click "Filter". On the next page, set up the filters as follows:
      *repClass does match rRNA
      *OR Free-form query: repClass = "tRNA"

Click "Submit" to apply the filter. Then, choose "Whole Gene", and click "get BED" to download.

 5) Move the files into the r-make references folder.

#move into the r-make references directory
cd ~/rmake-1.0c/references

#create an hg19 annotation directory
mkdir hg19

#create a refseq annotation directory
mkdir hg19/refseq

#move into the refseq annotation directory
cd hg19/refseq

Put 'refseq.bed12', 'refseq.exons', 'refseq.introns', 'refseq.5utr', 'refseq.utr3', and 'ribosomal.bed' into the newly created 'hg19/refseq' annotation directory.
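
Since the next steps read these files by name, a short check that all six are present and non-empty can catch a misnamed download early. The sketch below is not part of r-make; `check_annotation_files` is a hypothetical helper name, and the file list is taken from the names used above.

```shell
# check_annotation_files: hypothetical helper, not part of r-make.
# Reports any expected annotation file that is missing or empty in the
# given directory. Returns 0 if all files are present, 1 otherwise.
check_annotation_files() {
    dir=$1
    missing=0
    for f in refseq.bed12 refseq.exons refseq.introns \
             refseq.5utr refseq.utr3 ribosomal.bed; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (from inside the annotation directory):
#   check_annotation_files . && echo "all annotation files found"
```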

Concatenate and sort the annotation feature files:

cat refseq.utr3 refseq.5utr refseq.introns refseq.exons > refseq.bed
sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 refseq.bed -o refseq.bed

Append the rRNA coordinates:

grep -v "random\|hap\|chrUn" ribosomal.bed | \
awk 'BEGIN{OFS="\t";}{if ($6=="+") \
print $1,$2,$3,"ribosomal_abd_0_0_"$1"_"$2+1"_f",$5,$6 ; \
else print $1,$2,$3,"ribosomal_abd_0_0_"$1"_"$2+1"_r",$5,$6 }' \
>> refseq.bed
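
To see what the renaming does to a single record, the same awk logic can be run on one made-up line. This is only an illustrative sketch: `name_rrna` is a hypothetical function name and the input record is invented. The name is built from the chromosome, the start coordinate plus one, and "f" or "r" for the strand.

```shell
# name_rrna: hypothetical wrapper around the awk renaming logic above.
# Reads BED-6 records on standard input and rewrites column 4.
name_rrna() {
    awk 'BEGIN { OFS = "\t" }
         { strand = ($6 == "+") ? "f" : "r"
           print $1, $2, $3, "ribosomal_abd_0_0_" $1 "_" ($2 + 1) "_" strand, $5, $6 }'
}

# One invented plus-strand record (tab-separated):
printf 'chr1\t100\t200\tLSU-rRNA_Hsa\t500\t+\n' | name_rrna
# chr1	100	200	ribosomal_abd_0_0_chr1_101_f	500	+
```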

Sort the bed file:

sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 refseq.bed -o refseq.bed
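
The final sort order can be verified with the check mode of sort, using the same keys as above. The sketch below is not part of r-make; `check_bed_sorted` is a hypothetical helper name.

```shell
# check_bed_sorted: hypothetical helper, not part of r-make.
# Returns 0 if the file is already ordered by the keys used above
# (chromosome, start, end, strand); otherwise sort reports the first
# out-of-order line on standard error and a non-zero status is returned.
check_bed_sorted() {
    sort -c -k 1,1 -k 2,2n -k 3,3n -k 6,6 "$1"
}

# Usage:
#   check_bed_sorted refseq.bed && echo "refseq.bed is sorted"
```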


p. zumbo