dfbetasPerGene            package:GSEAlm            R Documentation

_L_i_n_e_a_r-_M_o_d_e_l _D_e_l_e_t_i_o_n _D_i_a_g_n_o_s_t_i_c_s _f_o_r _G_e_n_e _E_x_p_r_e_s_s_i_o_n (_o_r _s_i_m_i_l_a_r)
_D_a_t_a _S_t_r_u_c_t_u_r_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     This is an extension of standard linear-model diagnostics for use
     with gene-expression datasets, in which the same model was run
     simultaneously on each row of a response matrix.

_U_s_a_g_e:

      dfbetasPerGene(lmobj)

      CooksDPerGene(lmobj)

      dffitsPerGene(lmobj)

      Leverage(lmobj)

_A_r_g_u_m_e_n_t_s:

   lmobj: An object produced by 'lmPerGene'. 

_D_e_t_a_i_l_s:

     Deletion diagnostics gauge the influence of each observation upon
     model fit, by calculating values after removal of the observation
     and comparing to the complete-data version.

     DFFITS_i measures the distance, on the response scale, between
     fitted values at point i with and without observation y_i. The
     distance is normalized by the regression standard error and the
     point's leverage (see below).
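
     As a rough illustration of this definition, the sketch below
     computes DFFITS for one observation by explicit deletion and
     compares it with base R's 'dffits'. It uses an ordinary
     single-response 'lm' fit on the built-in 'cars' data, not an
     'lmPerGene' object, and the index i = 49 is an arbitrary choice.
     It assumes the base R convention of normalizing by the standard
     error estimated without observation i; whether the per-gene
     functions use exactly the same convention is not stated here.

```r
# Sketch: DFFITS for one observation by explicit deletion.
# 'cars' and the index i are illustrative choices, not part of GSEAlm.
fit <- lm(dist ~ speed, data = cars)
i <- 49
fit_drop <- lm(dist ~ speed, data = cars[-i, ])

# Fitted value at point i, with and without observation i in the fit
yhat_full <- fitted(fit)[i]
yhat_drop <- predict(fit_drop, newdata = cars[i, ])

# Normalize by the leave-one-out standard error and sqrt(leverage)
s_drop <- summary(fit_drop)$sigma
h_i <- hatvalues(fit)[i]
dffits_manual <- (yhat_full - yhat_drop) / (s_drop * sqrt(h_i))

# Compare with the closed-form value computed by base R
all.equal(unname(dffits_manual), unname(dffits(fit)[i]))
```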

     Cook's D_i is the square of the distance, in parameter space,
     between parameter estimates with and without observation y_i,
     normalized and rescaled by standard errors and by a factor
     depending upon leverage.
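
     The following sketch spells out one common form of this
     definition for a single-response fit: the change in the
     coefficient vector is measured in the metric given by X'X and
     scaled by p times the residual variance. The data set and index
     are illustrative, and the comparison is against base R's
     'cooks.distance', not against 'CooksDPerGene'.

```r
# Sketch: Cook's distance as a scaled squared distance in parameter space.
# 'cars' and the index i are illustrative choices, not part of GSEAlm.
fit <- lm(dist ~ speed, data = cars)
i <- 49
fit_drop <- lm(dist ~ speed, data = cars[-i, ])

X <- model.matrix(fit)
p <- ncol(X)                    # number of parameters, incl. intercept
s2 <- summary(fit)$sigma^2      # residual variance from the full fit

# Change in the parameter estimates caused by deleting observation i
d <- coef(fit) - coef(fit_drop)

# Quadratic form d' (X'X) d, rescaled by p * s^2
cooks_manual <- drop(t(d) %*% crossprod(X) %*% d) / (p * s2)

all.equal(cooks_manual, unname(cooks.distance(fit)[i]))
```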

     DFBETAS_{i,j} breaks the square root of Cook's D into its
     Euclidean components for each parameter j, but uses a somewhat
     different scaling function from Cook's D.
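
     A minimal sketch of the per-parameter version, again for a
     single-response fit on illustrative data: each coefficient change
     is divided by the leave-one-out standard error times the square
     root of the matching diagonal element of (X'X)^{-1}. This is the
     scaling used by base R's 'dfbetas'; it is assumed, not confirmed,
     that 'dfbetasPerGene' follows the same convention.

```r
# Sketch: DFBETAS for one observation, one value per model parameter.
# 'cars' and the index i are illustrative choices, not part of GSEAlm.
fit <- lm(dist ~ speed, data = cars)
i <- 49
fit_drop <- lm(dist ~ speed, data = cars[-i, ])

X <- model.matrix(fit)
xtx_inv_diag <- diag(solve(crossprod(X)))   # diagonal of (X'X)^{-1}
s_drop <- summary(fit_drop)$sigma           # leave-one-out standard error

# Per-parameter change in the estimates, each scaled separately
dfbetas_manual <- (coef(fit) - coef(fit_drop)) / (s_drop * sqrt(xtx_inv_diag))

all.equal(unname(dfbetas_manual), unname(dfbetas(fit)[i, ]))
```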

     The leverage is the diagonal of the "hat matrix" X(X'X)^{-1}X'.
     This measure provides the relative weight of observation y_i in
     the fitted value y-hat_i. Typically observations with extreme X
     values (or belonging to smaller groups if model variables are
     categorical) will have high leverage.
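
     To make the leverage concrete, the sketch below builds the hat
     matrix, written here as X(X'X)^{-1}X' for an n x p design matrix
     X, and checks its diagonal against base R's 'hatvalues'. The
     second part uses a hypothetical two-group factor with unequal
     group sizes to show that members of the smaller group get the
     higher leverage.

```r
# Sketch: leverage as the diagonal of the hat matrix (illustrative data).
fit <- lm(dist ~ speed, data = cars)
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X), t(X))   # X (X'X)^{-1} X'
lev <- diag(H)
all.equal(unname(lev), unname(hatvalues(fit)))

# Hypothetical one-factor design: group "a" has 3 members, "b" has 7
g <- factor(rep(c("a", "b"), times = c(3, 7)))
Xg <- model.matrix(~ g)
lev_g <- diag(Xg %*% solve(crossprod(Xg), t(Xg)))

# Each observation's leverage is 1 / (size of its group)
round(unname(lev_g), 4)
```

     With a single categorical variable the leverage takes one value
     per group, so the smaller group stands out directly.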

     All these functions exist for standard regression; see
     'influence.measures'.

     The functions described here are extensions for the case in which
     the response is a matrix, and the same linear model is run on each
     row separately.

     For more details, see the references below.

     All functions are implemented in matrix form, which means they run
     quite fast.
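
     The idea of a matrix-form, row-wise extension can be sketched as
     follows for Cook's D. This is not the GSEAlm source code: the
     simulated response matrix 'Y', the design, and all variable names
     are assumptions made for illustration. The matrix result is
     checked against a plain loop that fits 'lm' to each row.

```r
# Sketch: Cook's D for every row of a G x n response matrix at once.
# All names and the simulated data are hypothetical, not GSEAlm internals.
set.seed(1)
G <- 5
n <- 12
x <- rnorm(n)
grp <- factor(rep(c("a", "b"), each = n / 2))
X <- model.matrix(~ x + grp)          # shared n x p design matrix
p <- ncol(X)
Y <- matrix(rnorm(G * n), nrow = G)   # one "gene" per row

H <- X %*% solve(crossprod(X), t(X))  # hat matrix, shared by all rows
h <- diag(H)                          # leverage, length n
E <- Y %*% (diag(n) - H)              # G x n matrix of residuals
s2 <- rowSums(E^2) / (n - p)          # per-row residual variance

# Column i is scaled by h_i / (1 - h_i)^2, row g by 1 / (p * s2_g)
CD <- sweep(E^2, 2, h / (1 - h)^2, "*") / (p * s2)

# Reference: one lm() fit per row, stacked into a G x n matrix
CD_loop <- t(apply(Y, 1, function(y) cooks.distance(lm(y ~ x + grp))))
all.equal(unname(CD), unname(CD_loop))
```

     Because the design matrix is the same for every row, the hat
     matrix and leverage are computed once and reused, which is where
     the speed advantage over row-by-row fitting comes from.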

_V_a_l_u_e:

     'dfbetasPerGene': A G x n x p array, where G and n are the numbers
     of rows and columns in the input's expression matrix,
     respectively, and p is the number of parameters in the linear
     model (including the intercept).

     'CooksDPerGene': A G x n matrix.

     'dffitsPerGene': A G x n matrix.

     'Leverage': A vector of length n, corresponding to the diagonal of
     the "hat matrix".

_N_o_t_e:

     The commonly-cited reference alert thresholds for diagnostic
     measures such as Cook's D and DFBETAS, found in older
     references, appear to be out of date. See LaMotte (1999) and
     Jensen (2001) for a more recent discussion. Our suggested practice
     is to inspect any samples or values that are visibly separate from
     the pack.

_A_u_t_h_o_r(_s):

     Robert Gentleman, Assaf Oron

_R_e_f_e_r_e_n_c_e_s:

     Belsley, D. A., Kuh, E. and Welsch, R. E. (1980) Regression
     Diagnostics. New York: Wiley.

     Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in
     Regression. London: Chapman and Hall.

     Williams, D. A. (1987) Generalized linear model diagnostics using
     the deviance and single case deletions. Applied Statistics 36,
     181-191.

     Fox, J. (1997) Applied Regression, Linear Models, and Related
     Methods. Sage.

     LaMotte, L. R. (1999) Collapsibility hypotheses and diagnostic
     bounds in regression analysis. Metrika 50, 109-119.

     Jensen, D. R. (2001) Properties of selected subset diagnostics in
     regression. Statistics and Probability Letters 51, 377-388.

_S_e_e _A_l_s_o:

     'influence.measures' for the analogous simple regression
     diagnostic functions

_E_x_a_m_p_l_e_s:

     data(sample.ExpressionSet)
     layout(1)
     lm1 = lmPerGene( sample.ExpressionSet,~score+type)
     CD = CooksDPerGene(lm1)
     ### How does the distribution of Cook's distances look across samples?

     boxplot(log2(CD) ~ col(CD),names=colnames(CD),
             ylab="Log Cook's Distance",xlab="Sample")
     ### There are a few gross individual-observation outliers (which is why we plot on the log
     ### scale), but otherwise no single sample pops out as problematic. Here's
     ### one commonly-used alert level for problems:
     lines(c(-5,30),rep(log2(2/sqrt(26)),2),col=2)

     DFB = dfbetasPerGene(lm1)

     ### Looking for simultaneous two-effect outliers - 500 genes times 26
     ### samples makes 13000 data points on this plot

     plot(DFB[,,2],DFB[,,3],main="DFBETAS for Score and Type (all genes)",
          xlab="Score Effect Offset (normalized units)",
          ylab="Type Effect Offset (normalized units)",pch='+',cex=.5)
     lines(c(-100,100),rep(0,2),col=2)
     lines(rep(0,2),c(-100,100),col=2)

     DFF = dffitsPerGene(lm1)
     summary(apply(DFF,2,mean))

     Lev = Leverage(lm1)
     table(Lev)
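     ### A model with a single dichotomous factor (e.g. ~type) would give only
     ### two unique leverage values; with the numeric 'score' term in the model,
     ### leverage can differ from sample to sample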

