KEstimate             package:pcaMethods             R Documentation

_E_s_t_i_m_a_t_e _b_e_s_t _n_u_m_b_e_r _o_f _C_o_m_p_o_n_e_n_t_s _f_o_r _m_i_s_s_i_n_g _v_a_l_u_e _e_s_t_i_m_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Perform cross validation to estimate the optimal number of
     components for missing value estimation. Cross validation is done
     on the subset containing only complete observations because
     including incomplete observations may distort the results. The
     underlying assumption is that genes that are highly correlated in
     one region (here the non-missing observations) are also
     correlated in another (here the missing observations). This also
     implies that the complete subset must be large enough to be
     representative. For each incomplete gene, the available values are
     divided into a user-defined number of cv-segments. The segments
     have equal size, and their members are drawn at random from a
     uniform distribution. Together the segments cover all non-missing
     values of the gene. PPCA, BPCA, SVDimpute and Nipals PCA may be
     used for imputation.

     The whole cross validation is repeated several times. The NRMSEP
     (see Feten et al., 2005) is used as the error measure. It
     normalises the RMSD between original data and estimate by the
     gene-wise variance, because a higher variance leads to a higher
     estimation error.
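
     The segmentation and the error measure described above can be
     sketched in base R. This is only an illustration under stated
     assumptions: the helper names 'cv_segments' and 'nrmsep_gene' are
     hypothetical, the NRMSEP is read here as the RMSD scaled by the
     gene's variance, and a simple mean predictor stands in for the
     real imputation methods. It is not the package's implementation.

```r
# Split the indices of a gene's observed values into `segs` random,
# equally sized segments that together cover every index exactly once.
cv_segments <- function(idx, segs) {
  shuffled <- idx[sample.int(length(idx))]
  split(shuffled, rep_len(seq_len(segs), length(shuffled)))
}

# Assumed error measure: RMSD between original and estimate, scaled by
# the variance of the original values of that gene.
nrmsep_gene <- function(original, estimate) {
  sqrt(mean((original - estimate)^2) / var(original))
}

set.seed(1)
gene <- c(2.1, NA, 3.4, 1.8, NA, 2.9, 3.1, 2.5)  # made-up values
observed <- which(!is.na(gene))
segments <- cv_segments(observed, segs = 3)

# Stand-in estimator (NOT PPCA/BPCA/SVDimpute/Nipals): predict each
# held-out segment by the mean of the remaining observed values.
estimate <- gene
for (held_out in segments) {
  estimate[held_out] <- mean(gene[setdiff(observed, held_out)])
}

err <- nrmsep_gene(gene[observed], estimate[observed])
```

     In a real run, the stand-in predictor would be replaced by one of
     the supported imputation methods, and the procedure would be
     repeated for every number of components from 1 to maxPcs.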

_U_s_a_g_e:

     kEstimate(data, method = "ppca", maxPcs = 3, segs = 3, nruncv = 10,
     allGenes = FALSE, verbose = interactive(), random = FALSE)

_A_r_g_u_m_e_n_t_s:

    data: 'matrix' - numeric matrix containing observations in rows and
           genes in columns

  method: 'character' - One of ppca | bpca | svdImpute | nipals

  maxPcs: 'numeric' - maximum number of principal components to use
          for cross validation. The NRMSEP is calculated for 1:maxPcs
          components.

    segs: 'numeric' - number of segments for cross validation

  nruncv: 'numeric' - number of times the whole cross validation is
          repeated

allGenes: 'boolean' - If TRUE, the NRMSEP is calculated for all genes;
          if FALSE, only the incomplete ones are included. You may
          want to set this to TRUE to compare several methods on a
          complete data set.

 verbose: 'boolean' - If TRUE, the NRMSEP and the variance are printed
          to the console in each iteration.

  random: 'boolean' - Impute normally distributed random values with
          the same mean and standard deviation as the original data.
          This is intended only for comparison.
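
     As a rough illustration of what the 'random' baseline means, the
     base R sketch below fills missing entries with normal draws. The
     function name 'impute_random' is hypothetical, and computing the
     mean and standard deviation over the whole matrix (rather than
     per gene) is an assumption; the package may do this differently.

```r
# Replace NAs with draws from a normal distribution whose mean and
# standard deviation match the observed values. Assumption: both
# statistics are taken over the whole matrix, not per gene.
impute_random <- function(data) {
  missing <- is.na(data)
  data[missing] <- rnorm(sum(missing),
                         mean = mean(data, na.rm = TRUE),
                         sd = sd(data, na.rm = TRUE))
  data
}

set.seed(42)
x <- matrix(rnorm(40, mean = 5, sd = 2), nrow = 10)  # made-up data
x[c(3, 17, 28)] <- NA
filled <- impute_random(x)
```

     Such a baseline uses no correlation structure at all, so any
     useful imputation method should achieve a lower error than it.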

_D_e_t_a_i_l_s:

     Run time may be very high on large data sets, especially when
     used with methods like BPCA or Nipals PCA, which are already quite
     slow. The estimation method is called (g_miss * segs * nruncv)
     times, where g_miss is the number of genes showing missing values.
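
     To gauge the cost before starting a long run, the call count
     stated above can be computed directly from the data. A minimal
     sketch, assuming genes are in columns as described under
     Arguments; the helper name 'estimation_calls' is hypothetical.

```r
# Number of calls to the estimation method: g_miss * segs * nruncv,
# where g_miss counts the genes (columns) with at least one NA.
estimation_calls <- function(data, segs = 3, nruncv = 10) {
  g_miss <- sum(colSums(is.na(data)) > 0)
  g_miss * segs * nruncv
}

x <- matrix(as.numeric(1:20), nrow = 5)  # 5 observations, 4 genes
x[2, 1] <- NA
x[4, 3] <- NA
x[5, 3] <- NA

calls <- estimation_calls(x)  # 2 incomplete genes * 3 * 10 = 60
```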

_V_a_l_u_e:

    list: Returns a list with the elements:

             *  mink - number of PCs for which the minimal average
                NRMSEP was obtained

             *  nrmsep - a matrix of dimension (nruncv, maxPcs). Each
                row contains the NRMSEP obtained for one repeat of the
                cross validation, with one column per number of
                components.
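
     The relation between the two elements can be illustrated with a
     small, made-up matrix. The values below are invented for
     illustration only, and taking the column means as the "average
     NRMSEP" is an assumption based on the stated (nruncv, maxPcs)
     layout.

```r
# Hypothetical NRMSEP matrix: 2 cross validation repeats (rows) and
# 3 candidate numbers of components (columns). Values are invented.
nrmsep <- matrix(c(0.52, 0.41, 0.47,
                   0.55, 0.39, 0.44),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, paste0("PC", 1:3)))

# Average over the repeats, then pick the number of components with
# the smallest average error.
avg <- colMeans(nrmsep)
mink <- unname(which.min(avg))  # 2 for these invented values
```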

_A_u_t_h_o_r(_s):

     Wolfram Stacklies 
      Max Planck Institut fuer Molekulare Pflanzenphysiologie, Potsdam,
     Germany 
      wolfram.stacklies@gmail.com 

_S_e_e _A_l_s_o:

     'bpca, svdImpute, prcomp, nipalsPca, pca'.

_E_x_a_m_p_l_e_s:

     ## Load a sample metabolite dataset (metaboliteData)
     data(metaboliteData)

     ## Now remove about 10% of the data at random
     rows <- nrow(metaboliteData)
     cols <- ncol(metaboliteData)
     cond <- matrix(runif(rows * cols), rows, cols) < 0.1
     metaboliteData[cond] <- NA

     ## Do cross validation with ppca for 1 to 3 components
     nrmsep <- kEstimate(metaboliteData, method = "ppca", maxPcs = 3,
                         nruncv = 1)

     ## Plot the result
     barplot(drop(nrmsep$nrmsep), xlab = "Components",
             ylab = "NRMSEP (1 iteration)")

     ## The best k value is:
     print(nrmsep$mink)

