flowClust             package:flowClust             R Documentation

_R_o_b_u_s_t _M_o_d_e_l-_b_a_s_e_d _C_l_u_s_t_e_r_i_n_g _f_o_r _F_l_o_w _C_y_t_o_m_e_t_r_y

_D_e_s_c_r_i_p_t_i_o_n:

     This function performs automated clustering for identifying cell
     populations in flow cytometry data.  The approach is based on the
     t mixture model with the Box-Cox transformation, which provides a
     unified framework to handle outlier identification and data
     transformation simultaneously.

_U_s_a_g_e:

     flowClust(x, expName="Flow Experiment", varNames=NULL, K, B=500, 
               tol=1e-5, nu=4, lambda=1, nu.est=0, trans=1,
               min.count=10, max.count=10, min=NULL, max=NULL,
               level=0.9, u.cutoff=NULL, z.cutoff=0, randomStart=10, 
               B.init=B, tol.init=1e-2, seed=1, criterion="BIC",
               control=NULL)

_A_r_g_u_m_e_n_t_s:

       x: A numeric vector, matrix, data frame of observations, or
          object of class 'flowFrame'.  Rows correspond to observations
          and columns correspond to variables.

 expName: A character string giving the name of the experiment.

varNames: A character vector specifying the variables (columns) to be
          included in clustering.  When it is left unspecified, all the
          variables will be used.

       K: An integer vector indicating the numbers of clusters.

       B: The maximum number of EM iterations.

     tol: The tolerance used to assess the convergence of the EM.

      nu: The degrees of freedom used for the t distribution.  Default
          is 4.  If 'nu=Inf', a Gaussian distribution will be used.

  lambda: The initial transformation to be applied to the data.

  nu.est: A numeric indicating whether 'nu' is to be estimated or not. 
          May take 0 (no estimation, default), 1 (estimation) or 2
          (cluster-specific estimation).

   trans: A numeric indicating whether the Box-Cox transformation
          parameter is estimated from the data.  May take 0 (no
          estimation), 1 (estimation, default) or 2 (cluster-specific
          estimation).

min.count: An integer specifying the threshold count for filtering data
          points from below.  The default is 10, meaning that if 10 or
          more data points are smaller than or equal to 'min', they
          will be excluded from the analysis.  If 'min' is 'NULL', the
          minimum of the data for each variable is used.  To suppress
          filtering, set 'min.count' to -1.

max.count: An integer specifying the threshold count for filtering data
          points from above.  Interpretation is similar to that of
          'min.count'.

     min: The lower boundary set for data filtering.  This is a vector
          of length equal to the number of variables (columns), so a
          different value can be set for each variable.

     max: The upper boundary set for data filtering.  Interpretation is
          similar to that of 'min'.

   level: A numeric value between 0 and 1 specifying the threshold
          quantile level used to call a point an outlier.  The default
          is 0.9, meaning that any point outside the 90% quantile
          region will be called an outlier.

u.cutoff: Another criterion used to identify outliers.  If this is
          'NULL', then 'level' will be used.  Otherwise, this specifies
          the threshold (e.g., 0.5) for u, a quantity used to measure
          the degree of outlyingness based on the Mahalanobis
          distance.  Please refer to Lo et al. (2008) for more details.

z.cutoff: A numeric value between 0 and 1 underlying a criterion which
          may be used together with 'level'/'u.cutoff' to identify
          outliers.  A point with the probability of assignment z
          (i.e., the posterior probability that a data point belongs to
          the cluster assigned) smaller than 'z.cutoff' will be called
          an outlier.  The default is 0, meaning that assignment will
          be made no matter how small the associated probability is,
          and outliers will be identified solely based on the rule set
          by 'level' or 'u.cutoff'.

randomStart: A numeric value indicating how many times a random
          partition of the data is generated for initialization.  The
          default is 10, meaning that 10 random partitions of the data
          will be generated, each of which is followed by a short EM
          run.  The partition leading to the highest likelihood value
          will be adopted as the initial partition for the eventual
          long EM run.  If 'randomStart' is 0, this initialization
          strategy is not applied and hierarchical clustering is used
          instead.

  B.init: The maximum number of EM iterations following each random
          partition in random initialization.

tol.init: The tolerance used as the stopping criterion for the short EM
          runs in random initialization.

    seed: An integer giving the seed number used when 'randomStart>0'.

criterion: A character string stating the criterion used to choose the
          best model.  May take either '"BIC"' or '"ICL"'.  This
          argument is only relevant when 'length(K)>1'.

 control: An argument reserved for internal use.
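
     As a rough illustration of how 'min' and 'min.count' interact, the
     following base R sketch filters a single variable from below.  The
     helper 'filter_below' is hypothetical and not part of the package;
     'flowClust' applies the corresponding rule to each variable
     internally, and its exact implementation may differ.

```r
# Hypothetical helper (not a flowClust function): returns TRUE for the
# observations that are kept after filtering one variable from below.
filter_below <- function(x, min = NULL, min.count = 10) {
  keep_all <- rep(TRUE, length(x))
  if (min.count < 0) return(keep_all)        # -1 suppresses filtering
  if (is.null(min)) min <- base::min(x)      # default boundary: data minimum
  at_boundary <- x <= min
  # Filter only when at least 'min.count' points sit at or below the boundary
  if (sum(at_boundary) >= min.count) !at_boundary else keep_all
}

x <- c(rep(0, 12), 1:5)                    # 12 points piled up at the minimum
kept <- filter_below(x)                    # the 12 boundary points are dropped
untouched <- filter_below(c(0, 0, 1:5))    # only 2 boundary points: all kept
```

     Filtering from above via 'max' and 'max.count' would mirror this
     logic with the comparison reversed.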

_D_e_t_a_i_l_s:

     Estimation of the unknown parameters (including the Box-Cox
     parameter) is done via an Expectation-Maximization (EM) algorithm.
     At each EM iteration, Brent's algorithm is used to find the
     optimal value of the Box-Cox transformation parameter.
     Conditional on the transformation parameter, all other estimates
     can be obtained in closed form.  Please refer to Lo et al. (2008)
     for more details.
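
     To make the role of the transformation parameter concrete, the
     sketch below estimates it for a single univariate Gaussian
     component using base R only.  It assumes the textbook form of the
     Box-Cox transformation (the package may use a variant), and the
     helper names 'bc_transform', 'bc_inverse' and 'profile_loglik' are
     hypothetical.  'optimize' is used as a stand-in for the
     one-dimensional search; it combines golden section search with
     successive parabolic interpolation, in the spirit of Brent's
     method.

```r
# Textbook Box-Cox transformation and its inverse (assumed form; positive y)
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
bc_inverse <- function(x, lambda) {
  if (abs(lambda) < 1e-8) exp(x) else (lambda * x + 1)^(1 / lambda)
}

# Profile log-likelihood of lambda (up to a constant) for one Gaussian
# component: given lambda, the mean and variance have closed-form estimates.
profile_loglik <- function(lambda, y) {
  x  <- bc_transform(y, lambda)
  n  <- length(y)
  s2 <- mean((x - mean(x))^2)                  # ML variance given lambda
  -n / 2 * log(s2) + (lambda - 1) * sum(log(y))  # second term: Jacobian
}

set.seed(1)
y <- exp(rnorm(500, mean = 2, sd = 0.5))   # log-normal data: lambda near 0

# One-dimensional search over lambda
fit <- optimize(profile_loglik, interval = c(-2, 2), y = y, maximum = TRUE)
lambda_hat <- fit$maximum
```

     In the mixture setting the same idea applies within each
     M-step, with the likelihood weighted across clusters; that part is
     not shown here.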

     The 'flowClust' package makes extensive use of the GSL as well as
     BLAS.  If an optimized BLAS library is provided when compiling the
     package, the 'flowClust' package will be able to run
     multi-threaded processes.

     Various operations have been defined for the object returned from
     'flowClust'.  These include:

       Subsetting operations:      '%in%', 'Subset' and 'split'
       Slot retrieval operations:  'ruleOutliers', 'Map', 'criterion',
                                   'posterior', 'importance',
                                   'uncertainty' and 'getEstimates'
       Graphical operations:       'plot', 'density' and 'hist'

     In addition, to facilitate the integration with the 'flowCore'
     package for processing flow cytometry data, the 'flowClust'
     operation can be done through a method pair ('tmixFilter' and
     'filter') such that various methods defined in 'flowCore' can be
     applied on the object created from the filtering operation.

_V_a_l_u_e:

     If 'K' is of length 1, the function returns an object of class
     'flowClust' containing the following slots, where K is the number
     of clusters, N is the number of observations and P is the number
     of variables: 

 expName: Content of the 'expName' argument.

varNames: Content of the 'varNames' argument if provided; otherwise
          generated if available.

       K: An integer showing the number of clusters.

       w: A vector of length K, containing the estimates of the K
          cluster proportions.

      mu: A matrix of size K x P, containing the estimates of the K
          mean vectors.

   sigma: An array of dimension K x P x P, containing the estimates of
          the K covariance matrices.

  lambda: The Box-Cox transformation parameter estimate.

      nu: The degrees of freedom for the t distribution.

       z: A matrix of size N x K, containing the posterior
          probabilities of cluster memberships.  The probabilities in
          each row sum up to one.

       u: A matrix of size N x K, containing the weights (the
          contribution for computing cluster mean and covariance
          matrix) of each data point in each cluster.  Since this
          quantity decreases monotonically with the Mahalanobis
          distance, it can also be interpreted as the level of
          outlyingness of a data point.  Note that, when 'nu=Inf',
          this slot is used to store the Mahalanobis distances instead.

   label: A vector of size N, showing the cluster membership according
          to the initial partition (i.e., hierarchical clustering if
          'randomStart=0' or random partitioning if 'randomStart>0'). 
          Filtered observations will be labelled as 'NA'.  Unassigned
          observations (which may occur since only 1500 observations at
          maximum are taken for hierarchical clustering) will be
          labelled as 0.

uncertainty: A vector of size N, containing the uncertainty about the
          cluster assignment.  Uncertainty is defined as 1 minus the
          posterior probability that a data point belongs to the
          cluster to which it is assigned.

ruleOutliers: A numeric vector of size 3, storing the rule used to call
          outliers.  The first element is 0 if the criterion is set by
          the 'level' argument, or 1 if it is set by 'u.cutoff'.  The
          second element copies the content of either the 'level' or
          'u.cutoff' argument.  The third element copies the content of
          the 'z.cutoff' argument.  For instance, if points are called
          outliers when they lie outside the 90% quantile region or
          have assignment probabilities less than 0.5, then
          'ruleOutliers' is 'c(0, 0.9, 0.5)'.  If points are called
          outliers only if their weights in the assigned clusters are
          less than 0.5 regardless of the assignment probabilities,
          then 'ruleOutliers' becomes 'c(1, 0.5, 0)'.

flagOutliers: A logical vector of size N, showing whether each data
          point is called an outlier or not based on the rule defined
          by 'level'/'u.cutoff' and 'z.cutoff'.

  rm.min: Number of points filtered from below.

  rm.max: Number of points filtered from above.

 logLike: The log-likelihood of the fitted mixture model.

     BIC: The Bayesian Information Criterion for the fitted mixture
          model.

     ICL: The Integrated Completed Likelihood for the fitted mixture
          model.
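
     The relationship between the 'z', 'u', 'uncertainty' and
     'flagOutliers' slots can be sketched in base R on toy values.  The
     matrices below are made up for illustration, and the sketch assumes
     that each point is assigned to the cluster with the largest
     posterior probability.  It mimics the rule 'c(1, 0.5, 0.6)', i.e.
     'u.cutoff=0.5' combined with 'z.cutoff=0.6'; it is not the
     package's own code.

```r
# Toy posterior probabilities z (N = 3 points, K = 2 clusters); rows sum to 1
z <- matrix(c(0.90, 0.10,
              0.55, 0.45,
              0.20, 0.80), ncol = 2, byrow = TRUE)
# Toy weights u of the same shape (a smaller weight means more outlying)
u <- matrix(c(0.95, 0.30,
              0.40, 0.35,
              0.10, 0.85), ncol = 2, byrow = TRUE)

# Index of the assigned cluster for each point (assumed: largest posterior)
idx <- cbind(seq_len(nrow(z)), max.col(z, ties.method = "first"))
z_assigned <- z[idx]

# Uncertainty: 1 minus the posterior probability of the assigned cluster
uncertainty <- 1 - z_assigned

# A point is flagged if its weight OR its assignment probability is too small
u.cutoff <- 0.5
z.cutoff <- 0.6
flag <- (u[idx] < u.cutoff) | (z_assigned < z.cutoff)
```

     Here only the second point is flagged: it is assigned to cluster 1
     with probability 0.55 and weight 0.40, both below their cutoffs.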

     If 'K' has a length >1, the function returns an object of class
     'flowClustList'.  Its data part is a list with the same length as
     'K', each element of which is a 'flowClust' object corresponding
     to a specific number of clusters.  In addition, the resultant
     'flowClustList' object contains the following slots:

   index: An integer giving the index of the list element
          corresponding to the best model as selected by 'criterion'.

criterion: The criterion used to choose the best model, either
          '"BIC"' or '"ICL"'.

     Note that when a 'flowClustList' object is used in place of a
     'flowClust' object, in most cases the list element corresponding
     to the best model will be extracted and passed to the
     method/function call.
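
     A minimal sketch of working with a 'flowClustList' directly, using
     only the slots described above.  It assumes the 'flowClust'
     package and its 'rituximab' data set are available; the object
     names 'resList' and 'best' are arbitrary, and fitting several
     models may take some time.

```r
library(flowClust)
data(rituximab)

# Fit models with 1 to 3 clusters; the result is a 'flowClustList'
resList <- flowClust(rituximab, varNames = c("FSC.H", "SSC.H"), K = 1:3)

resList@criterion                   # criterion used for model selection
resList@index                       # position of the best model in the list
best <- resList[[resList@index]]    # the corresponding 'flowClust' object
best@K                              # number of clusters in the selected model
```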

_A_u_t_h_o_r(_s):

     Raphael Gottardo <raph@stat.ubc.ca>, Kenneth Lo <c.lo@stat.ubc.ca>

_R_e_f_e_r_e_n_c_e_s:

     Lo, K., Brinkman, R. R. and Gottardo, R. (2008) Automated Gating
     of Flow Cytometry Data via Robust Model-based Clustering.
     _Cytometry A_ *73*, 321-332.

_S_e_e _A_l_s_o:

     'summary', 'plot', 'density', 'hist', 'Subset', 'split',
     'ruleOutliers', 'Map', 'SimulateMixture'

_E_x_a_m_p_l_e_s:

     data(rituximab)

     ### cluster the data using FSC.H and SSC.H
     res1 <- flowClust(rituximab, varNames=c("FSC.H", "SSC.H"), K=1)

     ### remove outliers before proceeding to the second stage
     # %in% operator returns a logical vector indicating whether each
     # of the observations lies within the cluster boundary or not
     rituximab2 <- rituximab[rituximab %in% res1,]
     # a shorthand for the above line
     rituximab2 <- rituximab[res1,]
     # this can also be done using the Subset method
     rituximab2 <- Subset(rituximab, res1)

     ### cluster the data using FL1.H and FL3.H (with 3 clusters)
     res2 <- flowClust(rituximab2, varNames=c("FL1.H", "FL3.H"), K=3)
     show(res2)
     summary(res2)

     # to demonstrate the use of the split method
     split(rituximab2, res2)
     split(rituximab2, res2, population=list(sc1=c(1,2), sc2=3))

     # to show the cluster assignment of observations
     table(Map(res2))

     # to show the cluster centres (i.e., the mean parameter estimates
     # transformed back to the original scale)
     getEstimates(res2)$locations

     ### demonstrate the use of various plotting methods
     # a scatterplot
     plot(res2, data=rituximab2, level=0.8)
     plot(res2, data=rituximab2, level=0.8, include=c(1,2), grayscale=TRUE,
         pch.outliers=2)
     # a contour / image plot
     res2.den <- density(res2, data=rituximab2)
     plot(res2.den)
     plot(res2.den, scale="sqrt", drawlabels=FALSE)
     plot(res2.den, type="image", nlevels=100)
     plot(density(res2, include=c(1,2), from=c(0,0), to=c(400,600)))
     # a histogram (1-D density) plot
     hist(res2, data=rituximab2, subset="FL1.H")

     ### to demonstrate the use of the ruleOutliers method
     summary(res2)
     # change the rule to call outliers
     ruleOutliers(res2) <- list(level=0.95)
     # augmented cluster boundaries lead to fewer outliers
     summary(res2)

     # the following line illustrates how to select a subset of data 
     # to perform cluster analysis through the min and max arguments;
     # also note the use of level to specify a rule to call outliers
     # other than the default
     flowClust(rituximab2, varNames=c("FL1.H", "FL3.H"), K=3, B=100, 
         min=c(0,0), max=c(400,800), level=0.95, z.cutoff=0.5)

