GSNormalize              package:GSEAlm              R Documentation

_A_g_g_r_e_g_a_t_i_n_g _a_n_d _c_a_l_c_u_l_a_t_i_n_g _e_x_p_r_e_s_s_i_o_n _s_t_a_t_i_s_t_i_c_s _b_y _G_e_n_e _S_e_t

_D_e_s_c_r_i_p_t_i_o_n:

     Provides an interface for producing aggregate gene-set statistics,
     for gene-set-enrichment analysis (GSEA).  The function is best
     suited for mean or rescaled-mean GSEA approaches, but is hopefully
     generic enough  to enable other approaches as well.

_U_s_a_g_e:

     GSNormalize(dataset, incidence, gseaFun = crossprod, fun1 = "/", fun2 = sqrt, removeShift=FALSE, removeStat=mean, ...)
     identity(x)
     one(x)

_A_r_g_u_m_e_n_t_s:

 dataset: a numeric matrix, typically of some gene-level statistics 

incidence: 0/1 incidence matrix indicating genes' membership in
          gene-sets

 gseaFun: function name for the type of aggregation to take place,
          defaults to 'crossprod'. See 'Details' 

    fun1: function name for normalization, defaults to "/". See
          'Details' 

    fun2: function name for scaling, defaults to 'sqrt'. See 'Details'

removeShift: logical: should normalization begin with a column-wise
          removal of the mean shift?

removeStat: (if above is TRUE) the column-wise statistic to be swept
          out of 'dataset'.

     ...: Additional arguments optionally passed on to 'gseaFun'.

       x: any numerical value

_D_e_t_a_i_l_s:

     In gene-set-enrichment analysis (GSEA), the core step is
     aggregating (or calculating) gene-set-level statistics from
     gene-set statistics. This utility achieves the feat. It is
     tailored specifically for rescaled-sums of the type suggested by
     Jiang and Gentleman (2007), but is designed as a generic template
     that should other GSEA approaches.  In such cases, at this moment
     users should provide their own version of 'gseaFun'.

     The default will generate sums of gene-level values divided by the
     square-root of the gene-set size (in other words, gene-set means
     multiplied by the square-root of gene-set size). The arithmetic
     works like this:

     gene-set stat = gseaFun(t(incidence),dataset),...) 'fun1'
     fun2(gene-set size).

     In case there is a known (or suspected) overall baseline shift
     (i.e., the mass of gene-level stats is not centered around zero)
     it may be scientifically more meaningful to look for gene-set
     deviating from this baseline rather than from zero. In this case,
     you can set 'removeShift=TRUE'.

     Also provided are the 'identity' function (identity = function(x)
     x), so that leaving 'gseaFun' and 'fun1' at their default and
     setting 'fun2 = identity' will generate gene-set means - and the
     'one' function to neutralize the effect of both 'fun1' and 'fun2'
     (see note below).

_V_a_l_u_e:

     'GSNormalize' returns a matrix with the same number of rows as
     'incidence' and the same number of columns as 'dataset' (if
     'dataset' is a vector, the output will be a vector as well). The
     respective row and column names will carry through from 'dataset'
     and 'incidence' to the output.

     'identity' simply returns x. 'one' returns the number 1.

_N_o_t_e:

     If you want to create your own GSEA function for 'gseaFun', note
     that it should receive the transposed incidence matrix as its
     first argument, and the gene-level stats as its second argument. 
     In other words, both should have genes as rows. also, you can
     easily neutralize the effect of 'fun1', 'fun2' by setting "fun2 =
     one".

_A_u_t_h_o_r(_s):

     Assaf Oron

_R_e_f_e_r_e_n_c_e_s:

     Z. Jiang and R. Gentleman, "Extensions to Gene Set Enrichment
     Analysis",Bioinformatics (23),306-313, 2007.

_S_e_e _A_l_s_o:

     'gsealmPerm', which relies heavily on this function

_E_x_a_m_p_l_e_s:

     data(sample.ExpressionSet)
     lm1 = lmPerGene(sample.ExpressionSet,~sex+type)

     ### Generating random pseudo-gene-sets
     fauxGS=matrix(sample(c(0,1),size=50000,replace=TRUE,prob=c(.9,.1)),nrow=100)

     ### "tau-stats" for gene-SET-level type effect, adjusting for sex
     fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS)

     qqnorm(fauxEffects)
     ### diagonal line represents zero-shift null; note that it doesn't fit
     abline(0,1,col=2)
     ### a better option may be to run a diagonal through the middle of the
     ### data (nonzero-shift null, i.e. type may have an effect but it is the
     ### same for all gene-sets); note that if any outlier shows, it is a purely random one!

     abline(median(fauxEffects),1,col=4)

     #### Now try with baseline-shift removal

     fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS,removeShift=TRUE)

     qqnorm(fauxEffects)
     abline(0,1,col=2)

