KEstimate         package:pcaMethods         R Documentation(latin1)

_E_s_t_i_m_a_t_e _b_e_s_t _n_u_m_b_e_r _o_f _C_o_m_p_o_n_e_n_t_s _f_o_r _m_i_s_s_i_n_g _v_a_l_u_e _e_s_t_i_m_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Perform cross validation to estimate the optimal number of
     components for missing value estimation.

     Cross validation is done for the complete subset of a variable.
     The underlying assumption is that variables that are highly
     correlated in one region (here the non-missing observations) are
     also correlated in another (here the missing observations).  This
     also implies that the complete subset must be large enough to be
     representative.  For each incomplete variable, the available
     values are divided into a user-defined number of cv-segments.  The
     segments have equal size, and their members are drawn at random
     from a uniform distribution.  The non-missing values of the
     variable are covered completely.  PPCA, BPCA, SVDimpute, Nipals
     PCA, llsImpute and NLPCA may be used for imputation.
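
     The segmentation described above can be sketched in a few lines
     of base R.  This is only an illustration, not the code used by
     the package: 'makeSegments' is a hypothetical helper, and the
     round-robin assignment (segment sizes differ by at most one when
     the count is not divisible by 'segs') is an assumption.

```r
# Hypothetical helper, not part of pcaMethods: split the indices of the
# non-missing values of one variable into `segs` random segments.
makeSegments <- function(x, segs = 3) {
  observed <- which(!is.na(x))                        # available values
  shuffled <- observed[sample.int(length(observed))]  # random order
  # Round-robin assignment: segment sizes differ by at most one, and
  # every non-missing value lands in exactly one segment.
  split(shuffled, rep_len(seq_len(segs), length(shuffled)))
}

set.seed(1)
x <- c(2.1, NA, 3.4, 1.8, NA, 2.9, 3.1, 2.2)
segments <- makeSegments(x, segs = 3)
```

     Each element of 'segments' holds the row indices that would be
     set to missing and re-estimated in one cross validation fold.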

     The whole cross validation is repeated several times so, depending
     on the parameters, the calculations can take a very long time.  As
     error measure, the NRMSEP (see Feten et al., 2005) or the Q2
     distance is used.  The NRMSEP basically normalises the RMSD
     between original data and estimate by the variable-wise variance.
     The reason for this is that a higher variance will generally lead
     to a higher estimation error.  If the number of samples is small,
     the variable-wise variance may become an unstable criterion and
     the Q2 distance should be used instead.  The Q2 distance should
     also be used if variance normalisation was applied previously.

     The method proceeds variable-wise: the NRMSEP / Q2 distance is
     calculated for each incomplete variable and averaged afterwards.
     This makes it easy to see for which set of variables missing value
     imputation makes sense and for which set no imputation, or
     something like mean imputation, should be used.

     Use 'kEstimateFast' or 'Q2' if you are not interested in
     variable-wise values.

_U_s_a_g_e:

     kEstimate(Matrix, method = "ppca", evalPcs = 1:3, segs = 3, nruncv = 5,
     em = "q2", allVariables = FALSE, verbose = interactive(),...)

_A_r_g_u_m_e_n_t_s:

  Matrix: 'matrix' - numeric matrix containing observations in rows and
           variables in columns

  method: 'character' - One of ppca | bpca | svdImpute | nipals | nlpca
          | llsImpute | llsImputeAll. The option llsImputeAll calls
          llsImpute with the allVariables = TRUE parameter.

 evalPcs: 'numeric' - The principal components to use for cross
          validation or the number of neighbour variables if used with
          llsImpute. Should be a vector containing integer values, e.g.
          evalPcs = 1:10 or evalPcs = c(2,5,8). The NRMSEP or Q2 is
          calculated for each component.

    segs: 'numeric' - number of segments for cross validation

  nruncv: 'numeric' - Times the whole cross validation is repeated

      em: 'character' - The error measure. This can be nrmsep or q2

allVariables: 'boolean' - If TRUE, the NRMSEP is calculated for all
          variables. If FALSE, only the incomplete ones are included.
          You may want to set this to TRUE to compare several methods
          on a complete data set.

 verbose: 'boolean' - If TRUE, some output, such as the variable
          indexes, is printed to the console in each iteration.

     ...: Further arguments to 'pca()' or 'nni()'

_D_e_t_a_i_l_s:

     Run time may be very high on large data sets, especially when used
     with complex methods like BPCA or Nipals PCA.  For PPCA, BPCA,
     Nipals PCA and NLPCA the estimation method is called (v_miss *
     segs * nruncv) times as the error for all numbers of principal
     components can be calculated at once.  For LLSimpute and SVDimpute
     this is not possible, and the method is called (v_miss * segs *
     nruncv * length(evalPcs)) times. This should still be fast for
     LLSimpute because the method allows the estimation to be
     restricted to one particular variable, which saves a lot of
     iterations.  Here, v_miss is the number of variables showing
     missing values.
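
     To get a feeling for the two call counts, the arithmetic can be
     worked through for made-up settings.  The numbers below are
     purely illustrative, and 'vMiss' is a hypothetical name standing
     in for v_miss.

```r
# Illustrative values only; vMiss is a hypothetical name for v_miss.
vMiss   <- 10     # number of variables showing missing values
segs    <- 3      # cross validation segments
nruncv  <- 5      # repetitions of the whole cross validation
evalPcs <- 1:3    # evaluated numbers of components / neighbours

# PPCA, BPCA, Nipals PCA, NLPCA: one call covers all component numbers
callsPcaLike <- vMiss * segs * nruncv

# LLSimpute, SVDimpute: one call per evaluated entry of evalPcs
callsPerK <- vMiss * segs * nruncv * length(evalPcs)
```

     With these settings the second count is length(evalPcs) times
     the first, which is why long 'evalPcs' vectors are costly for
     SVDimpute in particular.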

     As cross validation is done variable-wise, in this function Q2 is
     defined on single variables, not on the entire data set. That is,
     Q2 is calculated as sum((x - xe)^2) / sum(x^2), where x is the
     currently used variable and xe its estimate. The values are then
     averaged over all variables. The NRMSEP is already defined
     variable-wise. For a single variable it is sqrt(sum((x - xe)^2) /
     (n * var(x))), where x is the variable, xe its estimate and n the
     length of x.  The variable-wise estimation errors are returned in
     the list element variableWiseError.
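
     As a rough sketch, the two variable-wise measures can be written
     out in base R, reading Q2 as sum((x - xe)^2) / sum(x^2) and the
     NRMSEP as sqrt(sum((x - xe)^2) / (n * var(x))).  The helper
     names 'q2Distance' and 'nrmsep' are hypothetical, and the
     package's internal code may differ, e.g. in sign convention or
     normalisation.

```r
# Hypothetical helpers, not part of pcaMethods; they follow the
# formulas quoted in the text for a single variable x and estimate xe.
q2Distance <- function(x, xe) {
  sum((x - xe)^2) / sum(x^2)
}

nrmsep <- function(x, xe) {
  n <- length(x)
  sqrt(sum((x - xe)^2) / (n * var(x)))
}

# Toy data: the estimate is exact except for the last value
x  <- c(1, 2, 3, 4)
xe <- c(1, 2, 3, 6)

q2Value     <- q2Distance(x, xe)
nrmsepValue <- nrmsep(x, xe)
```

     Both helpers return zero for a perfect estimate and grow with
     the squared deviation between x and xe.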

_V_a_l_u_e:

    list: Returns a list with the elements:

             *  bestNPcs - number of PCs or k for which the minimal
                average NRMSEP or the maximal Q2 was obtained.

             *  eError - an array of size length(evalPcs). Contains
                the average error of the cross validation runs for each
                number of components.

             *  variableWiseError - Matrix of size incomplete_variables
                x length(evalPcs). Contains the NRMSEP or Q2 distance
                for each variable and each number of PCs. This makes
                it easy to see for which variables imputation makes
                sense and for which ones it should not be done or mean
                imputation should be used.

             *  evalPcs - The evaluated numbers of components or number
                of neighbours (the same as the evalPcs input
                parameter).

             *  variableIx - Index of the incomplete variables. This
                can be used to map the variable-wise error to the
                original data.

_A_u_t_h_o_r(_s):

     Wolfram Stacklies 
      CAS-MPG Partner Institute for Computational Biology, Shanghai,
     China 
              wolfram.stacklies@gmail.com 

_S_e_e _A_l_s_o:

     'kEstimateFast', 'Q2', 'pca', 'nni'.

_E_x_a_m_p_l_e_s:

     ## Load a sample metabolite dataset with 5% missing values (metaboliteData)
     data(metaboliteData)

     # Do cross validation with ppca for component 2:4
     esti <- kEstimate(metaboliteData, method = "ppca", evalPcs = 2:4, nruncv=1, em="nrmsep")

     # Plot the average NRMSEP
     barplot(drop(esti$eError), xlab = "Components", ylab = "NRMSEP (1 iteration)")

     # The best result was obtained for this number of PCs:
     print(esti$bestNPcs)

     # Now have a look at the variable wise estimation error
     barplot(drop(esti$variableWiseError[, which(esti$evalPcs == esti$bestNPcs)]), 
             xlab = "Incomplete variable Index", ylab = "NRMSEP")

