logicFS               package:logicFS               R Documentation

_F_e_a_t_u_r_e _S_e_l_e_c_t_i_o_n _w_i_t_h _L_o_g_i_c _R_e_g_r_e_s_s_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Identification of interesting interactions between binary
     variables using logic regression. Currently available for the
     classification, the linear regression and the logistic regression
     approach of 'logreg' and for a multinomial logic regression as
     implemented in 'mlogreg'.

_U_s_a_g_e:

     ## S3 method for class 'formula':
     logicFS(formula, data, recdom = TRUE, ...)

     ## Default S3 method:
     logicFS(x, y, B = 100, useN = TRUE, ntrees = 1, nleaves = 8, 
       glm.if.1tree = FALSE, replace = TRUE, sub.frac = 0.632, 
       anneal.control = logreg.anneal.control(), onlyRemove = FALSE,
       prob.case = 0.5, addMatImp = TRUE, fast = FALSE, rand = NULL, ...)

_A_r_g_u_m_e_n_t_s:

 formula: an object of class 'formula' describing the model that should
          be fitted.

    data: a data frame containing the variables in the model. Each row
          of 'data' must correspond to an observation, and each column,
          except for the column containing the response, to a binary
          variable (coded by 0 and 1) or a factor (for details, see
          'recdom'). The response must be either binary (coded by 0 and
          1), categorical or continuous. If continuous, a linear model
          is fitted in each of the 'B' iterations of 'logicFS'. If
          categorical, the column of 'data' specifying the response
          must be a factor, and multinomial logic regressions are
          performed as implemented in 'mlogreg'. Otherwise, depending
          on 'ntrees' (and 'glm.if.1tree'), the classification or the
          logistic regression approach of logic regression is used.

  recdom: a logical value or a logical vector of length 'ncol(data)'
          specifying whether a SNP should be transformed into two
          binary dummy variables coding for a recessive and a dominant
          effect. If 'TRUE' (logical value), all factors (variables)
          with three levels are coded by two dummy variables as
          described in 'make.snp.dummy', and each level of each of the
          other factors (including factors specifying a SNP that shows
          only two genotypes) is coded by one indicator variable. If
          'FALSE' (logical value), each level of each factor is coded
          by an indicator variable. If 'recdom' is a logical vector,
          all factors corresponding to an entry in 'recdom' that is
          'TRUE' are assumed to be SNPs and transformed into the two
          binary variables described above. Each variable that
          corresponds to an entry of 'recdom' that is 'TRUE' (no matter
          whether 'recdom' is a vector or a value) must be coded by the
          integers 1 (coding for the homozygous reference genotype), 2
          (heterozygous), and 3 (homozygous variant).
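
          The recessive/dominant coding can be illustrated in base R.
          This is only a sketch: it does not call 'make.snp.dummy', the
          column names are hypothetical, and it assumes that both
          effects are defined relative to the variant allele.

```r
# Hypothetical SNP genotypes coded 1 (homozygous reference),
# 2 (heterozygous) and 3 (homozygous variant).
snp <- c(1, 2, 3, 2, 1, 3)

# Dominant effect: at least one variant allele (genotype 2 or 3).
# Recessive effect: two variant alleles (genotype 3 only).
snp_dummy <- cbind(
  SNP_dominant  = as.integer(snp >= 2),
  SNP_recessive = as.integer(snp == 3)
)
snp_dummy
```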

       x: a matrix consisting of 0's and 1's. Each column must
          correspond to a binary variable and each row to an
          observation.

       y: a numeric vector or a factor specifying the values of a
          response for all the observations represented in 'x'. If a
          numeric vector, then 'y' either contains the class labels
          (coded by 0 and 1) or the values of a continuous response,
          depending on whether the classification or logistic
          regression approach of logic regression, or the linear
          regression approach, respectively, should be used. If the
          response is categorical, then 'y' must be a factor naming the
          class labels of the observations.

       B: an integer specifying the number of iterations.

    useN: logical specifying if the number of correctly classified
          out-of-bag observations should be used in the computation of
          the importance measure. If 'FALSE', the proportion of
          correctly classified oob observations is used instead.

  ntrees: an integer indicating how many trees should be used. 

          For a binary response: If 'ntrees' is larger than 1, the
          logistic regression approach of logic regression will be
          used. If 'ntrees' is 1, then by default the classification
          approach of logic regression will be used (see
          'glm.if.1tree').

          For a continuous response: A linear regression model with
          'ntrees' trees is fitted in each of the 'B' iterations.

          For a categorical response: n.lev-1 logic regression models
          with 'ntrees' trees are fitted, where n.lev is the number of
          levels of the response (for details, see 'mlogreg').

 nleaves: a numeric value specifying the maximum number of leaves used
          in all trees combined. For details, see the help page of the
          function 'logreg' of the package 'LogicReg'.

glm.if.1tree: if 'ntrees' is 1 and 'glm.if.1tree' is 'TRUE', the
          logistic regression approach of logic regression is used
          instead of the classification approach. Ignored if 'ntrees'
          is not 1, or if the response is not binary.
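
          The interplay of the response type, 'ntrees' and
          'glm.if.1tree' described above can be summarized in a small
          base R sketch. The helper 'choose_approach' is hypothetical
          and not part of 'logicFS'; in particular, treating a numeric
          response with only 0/1 values as binary is an assumption made
          for illustration.

```r
# Hypothetical helper (not part of logicFS): names the approach that the
# argument descriptions imply for a given response and tree setting.
choose_approach <- function(y, ntrees = 1, glm.if.1tree = FALSE) {
  if (is.factor(y)) {
    return("multinomial logic regression")
  }
  # Assumption: a numeric response with only 0/1 values is binary.
  if (all(y %in% c(0, 1))) {
    if (ntrees > 1 || glm.if.1tree) {
      return("logistic regression")
    }
    return("classification")
  }
  "linear regression"
}

choose_approach(c(0, 1, 1, 0))
choose_approach(c(0, 1, 1, 0), ntrees = 2)
choose_approach(c(0, 1, 1, 0), glm.if.1tree = TRUE)
choose_approach(c(2.3, 0.7, 1.9))
choose_approach(factor(c("a", "b", "c")))
```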

 replace: should sampling of the cases be done with replacement? If
          'TRUE', a bootstrap sample of size 'length(y)' is drawn from
          the 'length(y)' observations in each of the 'B' iterations.
          If 'FALSE', 'ceiling(sub.frac * length(y))' of the
          observations are drawn without replacement in each iteration.

sub.frac: a proportion specifying the fraction of the observations that
          are used in each iteration to build a classification rule if
          'replace = FALSE'. Ignored if 'replace = TRUE'.
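
          The two sampling schemes controlled by 'replace' and
          'sub.frac' can be sketched in base R as follows. The helper
          'draw_iteration' is hypothetical and not part of 'logicFS';
          it assumes that the out-of-bag observations are simply those
          not drawn in the iteration.

```r
# Hypothetical helper (not part of logicFS): draws the indices used to
# fit the model in one iteration and returns the out-of-bag remainder.
draw_iteration <- function(n, replace = TRUE, sub.frac = 0.632) {
  in_bag <- if (replace) {
    sample(n, n, replace = TRUE)        # bootstrap sample of size n
  } else {
    sample(n, ceiling(sub.frac * n))    # subsample without replacement
  }
  list(in_bag = in_bag, oob = setdiff(seq_len(n), in_bag))
}

set.seed(123)
boot <- draw_iteration(100)
sub <- draw_iteration(100, replace = FALSE)
length(boot$in_bag)   # size of the bootstrap sample
length(sub$in_bag)    # ceiling(0.632 * 100) observations
length(sub$oob)       # observations left out of the subsample
```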

anneal.control: a list containing the parameters for simulated
          annealing. See the help of the function
          'logreg.anneal.control' in the 'LogicReg' package.

onlyRemove: should the multiple tree measure be used in the single tree
          case? If 'TRUE', the prime implicants are only removed from
          the trees when determining the importance in the single tree
          case. If 'FALSE', the original single tree measure is
          computed for each prime implicant, i.e. a prime implicant is
          not only removed from the trees in which it is contained, but
          also added to the trees that do not contain this interaction.
          Ignored in all cases other than the classification case.

prob.case: a numeric value between 0 and 1. If the outcome of the
          logistic regression, i.e. the predicted probability, for an
          observation is larger than 'prob.case', this observation will
          be classified as case (or 1).
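
          As a brief base R illustration of this rule, with
          hypothetical predicted probabilities and reading "larger
          than" as a strict inequality (an assumption about the
          boundary case):

```r
prob.case <- 0.5
# Hypothetical predicted probabilities from a logistic model.
pred.prob <- c(0.10, 0.50, 0.51, 0.90)
# Strictly larger than the threshold means case (1), otherwise 0.
pred.class <- as.integer(pred.prob > prob.case)
pred.class
```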

addMatImp: should the matrix containing the improvements due to the
          prime implicants in each of the iterations be added to the
          output? (For each of the prime implicants, the importance is
          computed by the average over the 'B' improvements.) Must be
          set to 'TRUE' if standardized importances should be computed
          using 'vim.norm', or if permutation based importances should
          be computed using 'vim.perm'.
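
          The averaging mentioned in parentheses above can be sketched
          in base R. The matrix below is made up for illustration, and
          its orientation (one row per prime implicant, one column per
          iteration) is an assumption rather than a documented
          property of 'mat.imp'.

```r
# Hypothetical improvement matrix: one row per prime implicant,
# one column per iteration (B = 4 here).
mat.imp <- rbind(
  "X1 & X2" = c(5, 7, 6, 6),
  "!X3"     = c(1, 0, 2, 1),
  "X4"      = c(0, 0, -1, 1)
)
# Importance as the average improvement over the B iterations.
vim <- rowMeans(mat.imp)
sort(vim, decreasing = TRUE)
```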

    fast: should a greedy search (as implemented in 'logreg') be used
          instead of simulated annealing?

    rand: numeric value. If specified, the random number generator will
          be set into a reproducible state.

     ...: for the 'formula' method, optional parameters to be passed to
          the low level function 'logicFS.default'. Otherwise, ignored.

_V_a_l_u_e:

     An object of class 'logicFS' containing 

  primes: the prime implicants,

     vim: the importance of the prime implicants,

    prop: the proportion of logic regression models that contain the
          prime implicants,

    type: the type of model (1: classification, 2: linear regression,
          3: logistic regression),

   param: further parameters (if 'addInfo = TRUE'),

 mat.imp: the matrix containing the improvements if 'addMatImp = TRUE',
          otherwise, 'NULL',

 measure: the name of the used importance measure,

    useN: the value of 'useN',

threshold: NULL,

      mu: NULL.

_A_u_t_h_o_r(_s):

     Holger Schwender, holger.schwender@udo.edu

_R_e_f_e_r_e_n_c_e_s:

     Ruczinski, I., Kooperberg, C., LeBlanc, M.L. (2003). Logic
     Regression. _Journal of Computational and Graphical Statistics_,
     12, 475-511.

     Schwender, H., Ickstadt, K. (2007). Identification of SNP
     Interactions Using Logic Regression. _Biostatistics_, 9(1),
     187-198.

_S_e_e _A_l_s_o:

     'plot.logicFS', 'logic.bagging'

_E_x_a_m_p_l_e_s:

     ## Not run: 
        # Load data (provides the matrix data.logicfs and the class
        # labels cl.logicfs).
        data(data.logicfs)
        
        # For logic regression and hence logicFS, the variables must
        # be binary. data.logicfs, however, contains categorical data
        # with realizations 1, 2 and 3. Such data can be transformed
        # into binary data by
        bin.snps <- make.snp.dummy(data.logicfs)
        
        # To speed up the search for the best logic regression models
        # only a small number of iterations is used in simulated annealing.
        my.anneal<-logreg.anneal.control(start=2,end=-2,iter=10000)
        
        # Feature selection using logic regression is then done by
        log.out<-logicFS(bin.snps,cl.logicfs,B=20,nleaves=10,
            rand=123,anneal.control=my.anneal)
        
        # The output of logicFS can be printed
        log.out
        
        # One can specify another number of interactions that should be
        # printed, here, e.g., 15.
        print(log.out,topX=15)
        
        # The variable importance can also be plotted.
        plot(log.out)
        
        # And the original variable names are displayed in
        plot(log.out,coded=FALSE)
     ## End(Not run)

