gsMMD2             package:GeneSelectMMD             R Documentation

_G_e_n_e _s_e_l_e_c_t_i_o_n _b_a_s_e_d _o_n _a _m_i_x_t_u_r_e _o_f _m_a_r_g_i_n_a_l _d_i_s_t_r_i_b_u_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     Gene selection based on the marginal distributions of gene
     profiles that are characterized by a mixture of three-component
     multivariate distributions. Input is an object derived from the
     class 'ExpressionSet'. The user needs to provide initial gene
     cluster membership.

_U_s_a_g_e:

     gsMMD2(obj.eSet, 
            memSubjects, 
            memIni,
            maxFlag = TRUE, 
            thrshPostProb = 0.5, 
            geneNames = NULL, 
            alpha = 0.05, 
            transformFlag = FALSE, 
            transformMethod = "boxcox", 
            scaleFlag = FALSE, 
            if.center = TRUE, 
            if.scale = TRUE, 
            criterion = c("cor", "skewness", "kurtosis"), 
            minL = -10, 
            maxL = 10, 
            stepL = 0.1, 
            eps = 0.001, 
            ITMAX = 100, 
            plotFlag = FALSE,
            quiet=TRUE)

_A_r_g_u_m_e_n_t_s:

obj.eSet: an object derived from the class 'ExpressionSet' which
          contains the matrix of gene expression levels. The rows of
          the matrix are genes. The columns of the matrix are subjects.

memSubjects: a vector of membership of subjects. 'memSubjects[i]=1'
          means that the i-th subject belongs to diseased group, 0
          otherwise.  

  memIni: a vector of user-provided gene cluster membership.

 maxFlag: logical. Indicates how to assign gene class membership.
          'maxFlag'=TRUE means that a gene will be assigned to the
          class for which its posterior probability is largest.
          'maxFlag'=FALSE means that a gene will be assigned to class 1
          if the posterior probability that the gene belongs to class 1
          is greater than 'thrshPostProb'. Similarly, a gene will be
          assigned to class 3 if the posterior probability that the
          gene belongs to class 3 is greater than 'thrshPostProb'. If
          neither posterior probability exceeds 'thrshPostProb', the
          gene will be assigned to class 2 (non-differentially
          expressed gene group).

thrshPostProb: threshold for posterior probabilities. For example, if
          the posterior probability that a gene belongs to cluster 1
          given its gene expression levels is larger than
          'thrshPostProb', then this gene will be assigned to cluster
          1.

geneNames: an optional character vector of gene names

   alpha: significance level, equal to '1-conf.level', where
          'conf.level' is the argument of the function 't.test'.

transformFlag: logical. Indicate if data transformation is needed

transformMethod: method for transforming data. Available methods
          include "boxcox", "log2", "log10", "log", "none".

scaleFlag: logical. Indicate if gene profiles are to be scaled. If
          'transformFlag=TRUE' and 'scaleFlag=TRUE', then scaling is
          performed after transformation.

if.center: logical. If 'scaleFlag=TRUE' and 'if.center=TRUE', then each
          gene profile will be centered to have mean zero.

if.scale: logical. If 'scaleFlag=TRUE' and 'if.scale=TRUE', then each
          gene profile will be scaled to have variance one.

criterion: if 'transformFlag=TRUE', 'criterion' indicates which
          criterion is used to judge whether the transformed data look
          normal. "cor" means using Pearson's correlation. The idea is
          that the observed quantiles after transformation should be
          close to the theoretical normal quantiles, so Pearson's
          correlation can be used to check whether the scatter plot of
          theoretical normal quantiles versus observed quantiles is a
          straight line. "skewness" means using a skewness measure to
          check whether the distribution of the transformed data is
          close to a normal distribution; "kurtosis" means using a
          kurtosis measure to check normality.

    minL: lower limit for the 'lambda' parameter used in Box-Cox
          transformation

    maxL: upper limit for the 'lambda' parameter used in Box-Cox
          transformation

   stepL: step increase when searching the optimal 'lambda' parameter
          used in Box-Cox transformation

     eps: a small positive value. If the absolute value of a value is
          smaller than 'eps', this value is regarded as zero.  

   ITMAX: maximum number of iterations allowed in the EM algorithm.

plotFlag: logical. Indicate if the Box-Cox normality plot should be
          output.

   quiet: logical. Indicates whether printing of intermediate results
          should be suppressed.
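
     As a rough illustration of how 'minL', 'maxL', 'stepL', and the
     "cor" criterion could interact, the sketch below runs a grid search
     over 'lambda' and keeps the value whose transformed data correlate
     best with theoretical normal quantiles. The helpers
     'boxcox_transform' and 'search_lambda' are hypothetical and are not
     functions of this package; the use of 'eps' to treat a near-zero
     'lambda' as zero is likewise an assumption.

```r
# Hypothetical helper (not part of GeneSelectMMD): Box-Cox transform of a
# positive numeric vector for a given lambda.
boxcox_transform <- function(x, lambda, eps = 0.001) {
  # Assumption: |lambda| < eps is treated as zero, where the Box-Cox
  # transform reduces to the natural logarithm.
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical helper: grid search for the lambda that maximizes the
# Pearson correlation between the sorted transformed values and the
# theoretical normal quantiles (a normal Q-Q correlation).
search_lambda <- function(x, minL = -10, maxL = 10, stepL = 0.1, eps = 0.001) {
  lambdas <- seq(minL, maxL, by = stepL)
  theo <- qnorm(ppoints(length(x)))
  cors <- sapply(lambdas, function(l) {
    cor(sort(boxcox_transform(x, l, eps)), theo)
  })
  lambdas[which.max(cors)]
}

set.seed(1)
# Simulated log-normal values, for which a lambda near 0 would be plausible
x <- exp(rnorm(200, mean = 2, sd = 0.5))
lambda_hat <- search_lambda(x, minL = -2, maxL = 2, stepL = 0.1)
lambda_hat
```

     A narrower search range than the defaults is used here only to keep
     the illustration fast.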

_D_e_t_a_i_l_s:

     We assume that the distribution of gene expression profiles is a
     3-component mixture of multivariate normal distributions
     sum_{k=1}^{3} pi_k f_k(x|theta). Each component distribution f_k
     corresponds to a gene cluster. The 3 components correspond to 3
     gene clusters: (1) up-regulated gene cluster, (2)
     non-differentially expressed gene cluster, and (3) down-regulated
     gene cluster. The model parameter vector is theta=(pi_1, pi_2,
     pi_3, mu_{c1}, sigma^2_{c1}, rho_{c1}, mu_{n1}, sigma^2_{n1},
     rho_{n1}, mu_2, sigma^2_2, rho_2, mu_{c3}, sigma^2_{c3},
     rho_{c3}, mu_{n3}, sigma^2_{n3}, rho_{n3}), where pi_1, pi_2, and
     pi_3 are the mixing proportions; mu_{c1}, sigma^2_{c1}, and
     rho_{c1} are the marginal mean, variance, and correlation of gene
     expression levels of cluster 1 (up-regulated genes) for diseased
     subjects; mu_{n1}, sigma^2_{n1}, and rho_{n1} are the marginal
     mean, variance, and correlation of gene expression levels of
     cluster 1 (up-regulated genes) for non-diseased subjects; mu_2,
     sigma^2_2, and rho_2 are the marginal mean, variance, and
     correlation of gene expression levels of cluster 2
     (non-differentially expressed genes); mu_{c3}, sigma^2_{c3}, and
     rho_{c3} are the marginal mean, variance, and correlation of gene
     expression levels of cluster 3 (down-regulated genes) for diseased
     subjects; mu_{n3}, sigma^2_{n3}, and rho_{n3} are the marginal
     mean, variance, and correlation of gene expression levels of
     cluster 3 (down-regulated genes) for non-diseased subjects.

     Note that genes in cluster 2 are non-differentially expressed
     across abnormal and normal tissue samples. Hence there are only 3
     parameters for cluster 2.

     We apply the EM algorithm to estimate the model parameters.  We
     regard the cluster membership of genes as missing values.
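
     To make the EM idea concrete, the sketch below fits a deliberately
     simplified model: a univariate three-component normal mixture,
     started from user-provided memberships. It is not the multivariate
     marginal model fitted by this package; the function 'em_mixture3',
     its convergence tolerance 'tol', and the simulated data are all
     assumptions made for illustration.

```r
# Simplified illustration only: EM for a univariate three-component
# normal mixture, not the multivariate model used by GeneSelectMMD.
em_mixture3 <- function(x, mem_ini, tol = 1e-6, itmax = 100) {
  n <- length(x)
  # Initial estimates from the provided memberships (values 1, 2, 3)
  pi_k <- as.numeric(table(factor(mem_ini, levels = 1:3))) / n
  mu_k <- sapply(1:3, function(k) mean(x[mem_ini == k]))
  sd_k <- sapply(1:3, function(k) sd(x[mem_ini == k]))
  llkh_old <- -Inf
  for (iter in seq_len(itmax)) {
    # E-step: memberships are treated as missing and replaced by
    # posterior probabilities given the current parameters
    dens <- sapply(1:3, function(k) pi_k[k] * dnorm(x, mu_k[k], sd_k[k]))
    llkh <- sum(log(rowSums(dens)))
    w <- dens / rowSums(dens)
    # Stop once the loglikelihood changes by less than tol
    if (abs(llkh - llkh_old) < tol) break
    llkh_old <- llkh
    # M-step: posterior-weighted updates of proportions, means, and sds
    n_k <- colSums(w)
    pi_k <- n_k / n
    mu_k <- colSums(w * x) / n_k
    sd_k <- sqrt(colSums(w * outer(x, mu_k, "-")^2) / n_k)
  }
  list(pi = pi_k, mu = mu_k, sd = sd_k, llkh = llkh, wiMat = w)
}

set.seed(1)
x <- c(rnorm(50, mean = 3), rnorm(200, mean = 0), rnorm(50, mean = -3))
# Crude initial memberships: 1 = high, 2 = middle, 3 = low
mem_ini <- ifelse(x > 1.5, 1, ifelse(x < -1.5, 3, 2))
fit <- em_mixture3(x, mem_ini)
round(fit$pi, 3)
```

     The returned 'wiMat' holds the posterior probabilities from the
     last E-step, one column per cluster.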

_V_a_l_u_e:

     A list with the following 11 elements:

     dat: the (transformed) microarray data matrix. If a transformation
          is performed, then 'dat' will differ from the input
          microarray data matrix.

memSubjects: the same as the input 'memSubjects'.

memGenes: a vector of cluster membership of genes. 1 means up-regulated
          gene; 2 means non-differentially expressed gene;  3 means
          down-regulated gene.

memGenes2: a variant of the vector of cluster membership of genes.  1
          means differentially expressed gene; 0 means
          non-differentially expressed gene.

    para: parameter estimates (see Details).

    llkh: value of the loglikelihood function.

   wiMat: posterior probability that a gene belongs to a cluster given
          the expression levels of this gene. Column i is for cluster
          i.

  memIni: the initial cluster membership of genes.

 paraIni: the parameter estimates based on initial gene cluster
          membership.

 llkhIni: the value of the loglikelihood function based on the initial
          gene cluster membership.

  lambda: the parameter used to do Box-Cox transformation
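
     The relationship between 'wiMat', 'memGenes', and 'memGenes2' can
     be sketched as follows. The helper 'assign_clusters' is
     hypothetical and only mimics the rules described for 'maxFlag' and
     'thrshPostProb', assuming that cluster 3 is handled symmetrically
     to cluster 1; the posterior matrix is made up for illustration.

```r
# Hypothetical helper (not part of GeneSelectMMD): derive cluster
# memberships from a matrix of posterior probabilities whose column i
# corresponds to cluster i.
assign_clusters <- function(wiMat, maxFlag = TRUE, thrshPostProb = 0.5) {
  if (maxFlag) {
    # Assign each gene to the cluster with the largest posterior probability
    memGenes <- apply(wiMat, 1, which.max)
  } else {
    # Default to cluster 2 unless cluster 1 or cluster 3 exceeds the threshold
    # (assumes thrshPostProb >= 0.5, so at most one of them can exceed it)
    memGenes <- rep(2, nrow(wiMat))
    memGenes[wiMat[, 1] > thrshPostProb] <- 1
    memGenes[wiMat[, 3] > thrshPostProb] <- 3
  }
  # 1 = differentially expressed (cluster 1 or 3), 0 = cluster 2
  memGenes2 <- as.numeric(memGenes != 2)
  list(memGenes = memGenes, memGenes2 = memGenes2)
}

# Made-up posterior probabilities for four genes
wiMat <- rbind(c(0.70, 0.20, 0.10),
               c(0.10, 0.80, 0.10),
               c(0.05, 0.15, 0.80),
               c(0.45, 0.40, 0.15))
res_max <- assign_clusters(wiMat, maxFlag = TRUE)
res_thr <- assign_clusters(wiMat, maxFlag = FALSE, thrshPostProb = 0.5)
res_max$memGenes
res_thr$memGenes
```

     The two rules differ only for the last gene, whose largest
     posterior probability (0.45) does not exceed the threshold.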

_N_o_t_e:

     The speed of the program is slow for large data sets.

_A_u_t_h_o_r(_s):

     Weiliang Qiu stwxq@channing.harvard.edu, Wenqing He
     whe@stats.uwo.ca, Xiaogang Wang stevenw@mathstat.yorku.ca, Ross
     Lazarus ross.lazarus@channing.harvard.edu

_R_e_f_e_r_e_n_c_e_s:

     Qiu, W.-L., He, W., Wang, X.-G. and Lazarus, R. (2008).  A
     Marginal Mixture Model for Selecting Differentially Expressed
     Genes across Two Types of Tissue Samples. _The International
     Journal of Biostatistics. 4(1):Article 20._ <URL:
     http://www.bepress.com/ijb/vol4/iss1/20>

_S_e_e _A_l_s_o:

     'gsMMD', 'gsMMD.default', 'gsMMD2.default'

_E_x_a_m_p_l_e_s:

       library(GeneSelectMMD)
       library(ALL)
       data(ALL)
       eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"]
       
       mem.str <- as.character(eSet1$BT)
       nSubjects <- length(mem.str)
       memSubjects <- rep(0,nSubjects)
       # B3 coded as 0, T2 coded as 1
       memSubjects[mem.str == "T2"] <- 1
       
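       # Wilcoxon rank-sum test for one gene; returns the p-value and a
       # centered statistic whose sign indicates the direction of the shift
       myWilcox <-
       function(x, memSubjects, alpha = 0.05)
       {
         xc <- x[memSubjects == 1]
         xn <- x[memSubjects == 0]
       
         m <- sum(memSubjects == 1)
         n <- sum(memSubjects == 0)
         res <- wilcox.test(x = xc, y = xn, conf.level = 1 - alpha)
         # wilcox.test() reports W, the rank sum of 'xc' minus m * (m + 1) / 2;
         # subtracting its null mean m * n / 2 centers it at zero
         res2 <- c(res$p.value, res$statistic - m * n / 2)
         names(res2) <- c("p.value", "statistic")
       
         return(res2)
       }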
       
       mat <- exprs(eSet1)
       tmp <- t(apply(mat, 1, myWilcox, memSubjects = memSubjects))
       colnames(tmp) <- c("p.value", "statistic")
       memIni <- rep(2, nrow(mat))
       memIni[tmp[, 1] < 0.05 & tmp[, 2] > 0] <- 1
       memIni[tmp[, 1] < 0.05 & tmp[, 2] < 0] <- 3
       
       cat("initial gene cluster size>>\n"); print(table(memIni)); cat("\n");

       obj.gsMMD <- gsMMD2(eSet1, memSubjects, memIni, transformFlag = TRUE, 
            transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE)
       round(obj.gsMMD$para, 3)

