ppca               package:pcaMethods               R Documentation

_P_r_o_b_a_b_i_l_i_s_t_i_c _P_C_A _M_i_s_s_i_n_g _V_a_l_u_e _E_s_t_i_m_a_t_o_r

_D_e_s_c_r_i_p_t_i_o_n:

     Implementation of probabilistic PCA (PPCA). PPCA can perform PCA
     on incomplete data and may be used for missing value estimation.
     This implementation is based on the Matlab version provided by
     Jakob Verbeek (see <URL: http://lear.inrialpes.fr/~verbeek/>) and
     on the draft "EM Algorithms for PCA and Sensible PCA" written by
     Sam Roweis. Thanks a lot!

     Probabilistic PCA combines an EM approach for PCA with a
     probabilistic model. The EM approach is based on the assumption
     that the latent variables as well as the noise are normally
     distributed.
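
     As a point of reference, the model behind this description is
     usually written as follows. The symbols (x, z, W, mu, sigma, d,
     k) are the conventional ones and are an assumption of this
     sketch; they are not names used by the package.

```latex
% x: observed d-dimensional data vector, z: k-dimensional latent variable
% W: d x k loading matrix, \mu: data mean, \sigma^2: isotropic noise variance
\begin{aligned}
  x &= W z + \mu + \epsilon, \\
  z &\sim \mathcal{N}(0, I_k), \\
  \epsilon &\sim \mathcal{N}(0, \sigma^2 I_d).
\end{aligned}
```

     EM alternates between estimating the latent variables z given
     the current W and sigma^2, and re-estimating W and sigma^2 given
     those latent variables.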

     In standard PCA, data that are far from the training set but
     close to the principal subspace may have the same reconstruction
     error as data near the training set. PPCA defines a likelihood
     function such that the likelihood for data far from the training
     set is much lower, even if they are close to the principal
     subspace. This can improve the estimation accuracy.
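
     The contrast can be made explicit with the standard PPCA
     marginal density. The notation is assumed for this sketch: W is
     the loading matrix, mu the data mean, sigma^2 the noise
     variance, d the number of variables, and P the orthogonal
     projection onto the principal subspace.

```latex
% Marginal distribution of an observation x under PPCA
\begin{aligned}
  x &\sim \mathcal{N}(\mu, C), \qquad C = W W^{\top} + \sigma^2 I_d, \\
  \log p(x) &= -\tfrac{1}{2}\left[ d \log(2\pi) + \log\lvert C \rvert
               + (x - \mu)^{\top} C^{-1} (x - \mu) \right].
\end{aligned}

% Reconstruction error of standard PCA, for comparison
E_{\mathrm{rec}}(x) = \lVert (I_d - P)(x - \mu) \rVert^2
```

     The reconstruction error ignores any displacement of x within
     the subspace, whereas the quadratic term in log p(x) grows with
     the distance from mu in every direction, including directions
     inside the subspace.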

     A method called 'kEstimate' is provided to estimate the optimal
     number of components via cross validation. In general, a few
     components are sufficient for reasonable estimation accuracy. See
     also the package documentation for further discussion of the kinds
     of data for which PCA-based missing value estimation is advisable.

     Requires 'MASS'

_U_s_a_g_e:

       ppca(Matrix, nPcs = 2, center = TRUE, completeObs = TRUE, seed = NA, ...)

_A_r_g_u_m_e_n_t_s:

  Matrix: 'matrix' - Data containing the variables in columns and
          observations in rows. The data may contain missing values,
          denoted as 'NA'.

    nPcs: 'numeric' - Number of components to estimate. The accuracy
          of the missing value estimation depends on the number of
          components, which should reflect the internal structure of
          the data.

  center: 'boolean' - Mean center the data if TRUE.

completeObs: 'boolean' - Return the complete observations if TRUE. This
          is the original data with NA values filled with the estimated
          values.

    seed: 'numeric' - Set the seed for the random number generator.
          PPCA fills the initial loading matrix with random numbers
          drawn from a normal distribution, so results may vary
          slightly between runs. Set the seed for exact reproduction
          of your results.

     ...: Reserved for future use. Currently no further parameters are
          used.

_D_e_t_a_i_l_s:

     *Complexity:* Runtime is linear in the number of data points, the
     number of data dimensions and the number of principal components.
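
     To illustrate where this scaling comes from, here is a minimal
     base R sketch of EM for PPCA on complete data. It is not the
     code used by this package and does not handle missing values;
     the function name 'ppca_em_sketch' and its arguments are
     hypothetical. In each iteration the largest products are between
     n x d and d x q matrices, which cost on the order of n*d*q
     operations, while the q x q solves stay cheap when few
     components are used.

```r
# Hypothetical helper, NOT part of pcaMethods: EM for PPCA on COMPLETE data.
ppca_em_sketch <- function(Y, q = 2, n_iter = 100, seed = 1) {
  set.seed(seed)
  Y  <- as.matrix(Y)
  n  <- nrow(Y)
  d  <- ncol(Y)
  mu <- colMeans(Y)
  Yc <- sweep(Y, 2, mu)                  # mean-centred data, n x d
  W  <- matrix(rnorm(d * q), d, q)       # random initial loadings, d x q
  sigma2 <- 1                            # initial noise variance

  for (i in seq_len(n_iter)) {
    # E-step: posterior moments of the latent variables
    Minv <- solve(crossprod(W) + sigma2 * diag(q))  # q x q
    Ez   <- Yc %*% W %*% Minv                       # n x q, ~ n*d*q work
    Szz  <- n * sigma2 * Minv + crossprod(Ez)       # q x q

    # M-step: update loadings and noise variance
    W <- crossprod(Yc, Ez) %*% solve(Szz)           # d x q, ~ n*d*q work
    sigma2 <- (sum(Yc^2) - 2 * sum((Yc %*% W) * Ez) +
               sum(Szz * crossprod(W))) / (n * d)
  }
  list(W = W, sigma2 = sigma2, mu = mu)
}

# Toy data: 200 observations, 6 variables, 2 underlying components
set.seed(42)
Y <- matrix(rnorm(200 * 2), 200, 2) %*% matrix(rnorm(2 * 6), 2, 6) +
  matrix(rnorm(200 * 6, sd = 0.1), 200, 6)

fit <- ppca_em_sketch(Y, q = 2)
dim(fit$W)     # loadings: one row per variable, one column per component
fit$sigma2     # estimated noise variance
```

     The columns of 'fit$W' span the estimated principal subspace but
     are not orthonormal; an orthogonalisation step would be needed
     to compare them with the loadings returned by 'prcomp'.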

     *Convergence:* In rare cases the algorithm does not converge to
     proper results because unfavourable initial random numbers were
     chosen. To work around this, you can set the seed (parameter
     'seed') of the random number generator to a different value.
     If used for missing value estimation, results may be checked by
     running the algorithm several times with different seeds. If
     the estimated values show little variance across runs, the
     algorithm converged well.
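
     The check described above could look like the following sketch.
     It assumes that 'pca' forwards the 'seed' argument to 'ppca' and
     that the result exposes the 'completeObs' slot used in the
     Examples section; the seed values and the variable names ('md',
     'estimates', 'cell_sd') are arbitrary choices for illustration.

```r
library(pcaMethods)

data(metaboliteData)

# Remove roughly 10% of the entries, as in the Examples section
set.seed(1)
md <- as.matrix(metaboliteData)
cond <- matrix(runif(length(md)), nrow(md), ncol(md)) < 0.1
md[cond] <- NA
missing <- is.na(md)   # also covers values that were already missing

# Run PPCA once per seed and collect the estimates for the missing cells
seeds <- c(11, 22, 33, 44, 55)
estimates <- sapply(seeds, function(s) {
  res <- pca(md, method = "ppca", nPcs = 3, center = TRUE, seed = s)
  res@completeObs[missing]
})

# One row per missing cell, one column per seed
cell_sd <- apply(estimates, 1, sd)
summary(cell_sd)
```

     Small standard deviations relative to the scale of the data
     suggest that the runs agree; a few cells with large values point
     to runs that did not converge well.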

_V_a_l_u_e:

  pcaRes: Standard PCA result object used by all PCA-based methods of
          this package. Contains scores, loadings, data mean and more.
          See 'pcaRes' for details.

_A_u_t_h_o_r(_s):

     Wolfram Stacklies 
      Max Planck Institut fuer Molekulare Pflanzenphysiologie, Potsdam,
     Germany 
      wolfram.stacklies@gmail.com 

_S_e_e _A_l_s_o:

     'bpca, svdImpute, prcomp, nipalsPca, pca, pcaRes'.

_E_x_a_m_p_l_e_s:

     ## Load a sample metabolite dataset (metaboliteData)
     data(metaboliteData)

     # Now remove 10% of the data
     rows <- nrow(metaboliteData)
     cols <- ncol(metaboliteData)
     cond <- matrix(runif(rows * cols),rows,cols) < 0.1
     metaboliteData[cond] <- NA

     ## Perform probabilistic PCA using the 3 largest components
     result <- pca(metaboliteData, method="ppca", nPcs=3, center=TRUE)

     ## Get the estimated principal axes (loadings)
     loadings <- result@loadings

     ## Get the estimated scores
     scores <- result@scores

     ## Get the estimated complete observations
     cObs <- result@completeObs

     ## Now plot the scores
     plotPcs(result, scoresLoadings=c(TRUE,FALSE))

