#########################################################################
diffReps: Detecting differential chromatin modification sites 
from ChIP-seq data with biological replicates

by Li Shen, Ph.D.
Assistant Professor
Department of Neuroscience
Mount Sinai School of Medicine
New York, New York, U.S.A.

Created:       Jun. 2011
Last modified: Mar 29, 2012
For most updated information, visit: https://code.google.com/p/diffreps/
#########################################################################


INTRODUCTION

ChIP-seq is now widely used to profile the enrichment of a DNA-binding protein 
on a genome. It is of high interest to compare the binding differences of a 
histone mark or transcription factor between two contrasting conditions, such as
disease vs. control. diffReps is developed to serve this purpose. It scans the
whole genome using a sliding window, performing millions of statistical tests 
and report the significant hits. diffReps takes into account the biological 
variations within a group of samples and uses that information to enhance the
statistical power. Considering biological variation is of high importance, 
especiallly for in vivo brain tissues (which is my group's high priority).


PREREQUISITES

diffReps requires the following two CPAN modules:

Statistics::TTest
Math::CDF

they can be downloaded and installed from CPAN. If you use cpanminus to install
diffReps, they will be automatically installed.

Some systems have reported missing package Time::HiRes. If that is the case, it
can be installed from CPAN or your choice of package manager. To test if it is 
already installed, use "perldoc Time::HiRes".

If you want to use the annotation tool - region_analysis.pl, it will also call
two external programs, refgene_getnearestgene and intersectBed. They are developed
by other researchers (see CREDITS). The former is already included in the bin/
directory and should be installed already. The intersectBed is from BEDTools
package and it can be downloaded from: http://code.google.com/p/bedtools/.


INSTALLATION

Installing diffReps is just like a standard PERL module. Basically you extract 
the package downloaded, go to the program directory and type the following 
commands:

perl Makefile.PL (Optional, PREFIX=your_perl_directory)
make
make test
make install

If you have root privileges, diffReps.pl will most likely be installed in
/usr/bin/. If you specified PREFIX in Makefile, it will be installed in
your_perl_directory/bin. Add your_perl_directory/bin to your PATH environmental 
variable, or copy diffReps.pl from your_perl_directory/bin to a directory that
is already in PATH, such as /home/yourname/bin.

Alternative. If you have cpanminus installed, you can also install diffReps with
one line command

cpanm diffReps-XXX.tar.gz

it will try to satisfy all the dependencies for you.


CHROMOSOME LENGTHS

It is important to supply diffReps with chromosome length information. diffReps 
requires that to bin the chromosomes into smaller sections. diffReps has a few
genomes built-in so what you need to do is just give a genome name, such as mm9
or hg19. If the genome you are interested in is not already defined, you can 
give a text file for chorosome lengths. An example input is like 

chr1    197195431
chr2    181748086
chr3    159599782
...


A NOTE ABOUT STATISTICAL TESTS

When you have biological replicates, Negative Binomial(NB) is the recommended 
test for differential analysis. An exact NB test is implemented in diffReps. 
Because NB distribution models discrete count data and over-dispersion among 
different samples, it appears to be an ideal model for ChIP-seq data. Many 
studies to date have used T-test on normalized counts for differential analysis.
However, this is sub-optimal because normalized counts are NOT Normally 
distributed! As a result, detection power can be significantly degraded. Another
caveat about T-test is that regions with very small counts may be picked up. 
Those regions should never pass cutoff because they don't have statistical 
significance. T-test ignores this fact because it simply treats them as 
continuous values. I still provide T-test in diffReps just for comparison 
purpose. 

If your experiment doesn't contain biological replicates, you can choose between
G-test and Chi-square test for differential analysis. They both give similar 
results but G-test is more recommended and has gained its popularity recently. 
See "http://en.wikipedia.org/wiki/G-test" for explanation. When they are chosen,
diffReps performs a goodness-of-fit test on the normalized counts of treatment
and control groups.

You can also use G-test or Chi-square test on data WITH biological replicates.
An incentive of doing this is that this may give you more sensitivity but with a
possibility of incurring false positives. diffReps automatically combines the
biological replicates and generate a probablity vector accordingly. That means,
if you have TWO replicates for treatment group and THREE replicates for control
group, the probablity vector will be adjusted to reflect the replicate number
difference.


DIFFERENTIAL SITES ANNOTATION

diffReps includes a script for annotation of a differential sites list. By default,
it will be evoked after diffReps finished running and annotate the differential
sites based on their locations to the nearest genes. If no nearby genes can be found,
it will also associate the differential sites with heterochromatic regions. A 
differential site will be assigned to one of the following categories:

ProximalPromoter: +/- 250bp of TSS
Promoter1k:       +/- 1kbp of TSS
Promoter3k:       +/- 3kbp of TSS
Genebody:         Anywhere between a gene's promoter and up to 1kbp downstream i
                  of the TES.
Genedeserts:      Genomic regions that are depleted with genes and are at least 
                  1Mbp long.
Pericentromere:   Between the boundary of a centromere and the closest gene minus
                  10kbp of that gene's regulatory region.
Subtelomere:      Similary defined as pericentromere.
OtherIntergenic:  Any region that does not belong to the above categories.

The script can also be triggered manually. For example, if you want to annotate 
a differential list diff.h3k4me3.txt, you can use command like:

region_analysis.pl -i diff.h3k4me3.txt -r -d refseq -g mm9

will annotate the list using reference genome mm9 and the RefSeq database. The 
output will write to diff.h3k4me3.txt.annotated. 


FINDING HISTONE MODIFICATION HOTSPOTS

The distance between two adjacent differential sites can be approximated by a 
Poisson distribution if they were positioned by random allocation. In reality, 
differential sites are often discovered to be spatially clustered together, 
forming so called chromatin modification hotspots. diffReps finds the hotspots 
by first building a null model on site-to-site distance, and then looking for 
regions that violate the null model with statistical significance using greedy 
search. diffReps reports the start, end positions as well as associated p-values 
and FDR of the hotspots. In addition, diffReps can accept more than one 
differential list of different histone marks as input, so that one can predict 
hotspots that show interaction between two or more histone marks.

By default, the finding hotspots routine will be called after diffReps finishes
detecting differential sites. It will try to identify hotspots from the 
differential list just generated. The routine can also be used separately. For
example, if you have two differential lists named diff.h3k4me3.txt and 
diff.polII.txt, you can look for hotspots that represent the interaction between
H3K4me3 and Pol II using command like:

findHotspots.pl -d diff.h3k4me3.txt diff.polII.txt -o hotspot_k4.pol2.txt

will generate a hotspots list in "hotspot_k4.pol2.txt" file.


EXAMPLES

diffReps requires input of BED files for ChIP-seq alignments for both treatment 
and control groups. BED files can be converted from any alignment format, such 
as BAM(Tip: you can use BedTools for this). An example of using diffReps for 
differential analysis is as follows 

diffReps.pl -tr C1.bed C2.bed C3.bed -co S1.bed S2.bed S3.bed -gn mm9 \
-re diff.nb.txt -me nb

The output will be in diff.nb.txt file. By default, a sliding window of 1kbp is
used with a moving step size of 100bp. There are other parameters that can be 
tuned for your data. Just type diffReps.pl in command console without specifying
any arguments and hit Enter, you will see a usage summary.


RUNNING TIME

The running time of diffReps totally depends on data and parameter settings. It
can vary wildly between 30min and 10h. The most influential parameters on 
running time are window size and step size. The smaller the window size, the 
longer the running time which scales linearly.