convert             package:GeneticsBase             R Documentation

_E_f_f_i_c_i_e_n_c_t_l_y _c_o_n_v_e_r_t _s_t_r_i_n_g_s _o_f _c_h_a_r_a_c_t_e_r_s _i_n_t_o _i_n_t_e_g_e_r _c_o_d_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Efficiently convert strings of characters into integer codes.

_U_s_a_g_e:

     convert(source, levels, byrow=FALSE, aslist=FALSE)

_A_r_g_u_m_e_n_t_s:

  source: Vector of character strings

  levels: Vector of characters used to determine levels 

   byrow: Boolean. If FALSE (the default), return a matrix with one
          column per string.  If TRUE, return a matrix with one row per
          string.

  aslist: Boolean, return matrix (FALSE) or list of vectors (TRUE).

_D_e_t_a_i_l_s:

     This function efficiently converts character strings into vectors
     of integer codes, one code per character.  Its primary purpose is
     to allow translation of genotypes stored as character strings, one
     character per genotype, to a factor-coded matrix. The equivalent
     code using 'factor' is considerably slower, as shown by the last
     section of the examples below.

     The 'levels' argument should be a vector of 1-character strings.
     This vector is used to determine the translation: each character
     in 'source' is replaced by the index of the matching element of
     'levels'.
     Characters not present in 'levels' will be converted to NA's.
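
     The translation rule can be illustrated in base R. The following
     is a minimal sketch of the mapping described above, not the code
     used by 'convert' itself; the helper name 'convert_sketch' is
     hypothetical.

```r
# Hypothetical helper: a base-R illustration of the documented mapping,
# not the implementation used by 'convert'.
convert_sketch <- function(string, levels)
  {
    # Split the string into single characters
    chars <- strsplit(string, split='')[[1]]
    # match() gives the index of each character in 'levels',
    # or NA when the character is not present
    match(chars, levels)
  }

# 'x' is not among the levels, so it is translated to NA
convert_sketch('chrx', c('c','h','r'))
```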

_V_a_l_u_e:

     If 'aslist=TRUE', the return value is a list of vectors. Each
     vector will contain the translation of the corresponding input
     string.

     If 'aslist=FALSE' (the default), the return value will be a
     matrix. 'byrow' controls whether each string is converted into a
     column ('byrow=FALSE', the default) or row ('byrow=TRUE').

     When 'byrow=FALSE', each element of the 'source' vector is
     converted to a column, and the number of rows will be the number
     of characters in the longest element of the 'source' vector.  Any
     shorter strings will be padded with NA's.

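     When 'byrow=TRUE', the matrix is created with one row per element
     of the 'source' vector, and the number of columns will be the
     number of characters in the longest element.  Shorter strings are
     padded with NA's in the same way.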
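
     The matrix layout and NA padding can be sketched in base R. This
     is only an illustration of the shapes described above, assuming
     the padding is applied at the end of each shorter string; the
     helper name 'layout_sketch' is hypothetical and it is not the
     implementation used by 'convert'.

```r
# Hypothetical helper illustrating the documented matrix layout;
# not the implementation used by 'convert'.
layout_sketch <- function(source, levels, byrow=FALSE)
  {
    # Translate each string to a vector of integer codes
    codes <- lapply(strsplit(source, split=''), match, table=levels)
    # Pad every vector with NA up to the length of the longest string
    maxlen <- max(lengths(codes))
    padded <- lapply(codes,
                     function(v) c(v, rep(NA_integer_, maxlen - length(v))))
    # Fill column-wise: one column per string
    mat <- matrix(unlist(padded), nrow=maxlen)
    # Transpose to obtain one row per string
    if (byrow) t(mat) else mat
  }

m <- layout_sketch(c('abc', 'a'), c('a','b','c'))
dim(m)                                                   # rows = longest string
dim(layout_sketch(c('abc', 'a'), c('a','b','c'), byrow=TRUE))
```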

_N_o_t_e:

     Only the first character of each element of 'levels' is used.
     Any other characters will be ignored.
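
     As a small base-R illustration of this rule (a sketch only; the
     variable names are hypothetical), multi-character levels reduce
     to their first characters as follows:

```r
# Hypothetical multi-character levels
long_levels <- c('cat', 'hat', 'rat')
# Keep only the first character of each element, as the note describes
first_chars <- substr(long_levels, 1, 1)
first_chars
```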

_A_u_t_h_o_r(_s):

     Gregory R. Warnes warnes@bst.rochester.edu and Nitin Jain
     nitin.jain@pfizer.com

_S_e_e _A_l_s_o:

     'factor', 'as.factor'

_E_x_a_m_p_l_e_s:

     ###
     # Toy Genetics Example
     ###
     # 'c' = 'homozygote common allele'
     # 'h' = 'heterozygote'
     # 'r' = 'homozygote rare allele'
     marker.data <- c( m1='cchchrcr', m2='chccccrr')
     marker.data

     convert(marker.data, c('c','h','r'))

     ###
     # simple test example
     ###
     source <- c(one='abcabcabc', two='abc','ggg',buckle='aaa',my='bbb',
                 'shoe  '='bgb  ')
     levels <- c('a','b','c','d')

     convert(source,levels)
     convert(source,levels,aslist=TRUE)
     convert(source,levels,byrow=TRUE)

     ###
     # compare efficiency with equivalent code using 'factor'
     ###
     ## Not run: 
     makestr <- function(n)
       paste(sample(letters, size=n, replace=TRUE), collapse='')

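     timeit <- function( expr )
       {
         start <- Sys.time()
         expr   # force evaluation of the lazily evaluated argument
         end <- Sys.time()
         # always report seconds so that timings are comparable
         return( as.numeric(difftime(end, start, units='secs')) )
       }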

     # Step 1: create a large set of character strings
     x <- unlist(lapply(1:100000, function(x) makestr(1000)))

     # Step 2: Time convert  (~17 sec on Intel Xeon 3.0 GHz, 32 GB RAM)
     newtime <- timeit( yn <- convert(x, letters) )
     newtime

     # old method  (~4.7 min on Intel Xeon 3.0 GHz, 32 GB RAM)
     oldmethod <- function(x)
       {
         # assumes every string has the same length as x[1]
         yo <- factor(unlist(strsplit(x, split='')),levels=letters)
         attr(yo,'dim') <- c(nchar(x[1]), length(x))
         class(yo) <- 'matrix'
         yo
       }

     oldtime <- timeit( oldmethod(x) )
     oldtime

     # time difference
     oldtime - newtime
     ## End(Not run)

