matchPattern           package:Biostrings           R Documentation

_S_t_r_i_n_g _s_e_a_r_c_h_i_n_g _f_u_n_c_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

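     'matchPattern' is a generic that finds all matches of a pattern in
     a sequence (an XString object). 'countPattern' returns only the
     number of matches, and 'vcountPattern' returns that number for
     each element of 'subject'.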

_U_s_a_g_e:

       matchPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)
       countPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)
       vcountPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)

_A_r_g_u_m_e_n_t_s:

 pattern: The pattern string. 

 subject: An XString or XStringViews object containing the subject
          string, or a character vector (converted internally to an
          'XString' or 'XStringSet' object). For 'vcountPattern', an
          XStringSet object is also accepted. 

algorithm: One of the following: '"auto"', '"naive-exact"',
          '"naive-inexact"', '"boyer-moore"' or '"shift-or"'. 

max.mismatch: The maximum number of mismatching letters allowed (see
          'isMatching' for the details). If non-zero, an inexact
          matching algorithm is used. 

   fixed: If 'FALSE' then IUPAC extended letters are interpreted as
          ambiguities (see 'isMatching' for the details). 
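
     As an illustration of how 'max.mismatch' and 'fixed' change the
     result, here is a minimal sketch on a short subject. The subject
     and patterns are made up for the example, and the code assumes
     that the Biostrings package is installed.

```r
library(Biostrings)

# Hypothetical short subject, chosen so matches can be checked by eye
x <- DNAString("AAGCGCGATATG")

# Exact matching (defaults): "GCGA" occurs once, at positions 5-8
exact <- countPattern("GCGA", x)

# Allow at most one mismatching letter per match
inexact <- countPattern("GCGA", x, max.mismatch=1)

# fixed=TRUE (default): "N" in the pattern is treated as a literal letter
literal <- countPattern("GCNNNAT", x)

# fixed=FALSE: "N" in the pattern is an IUPAC ambiguity letter
ambiguous <- countPattern("GCNNNAT", x, fixed=FALSE)

c(exact=exact, inexact=inexact, literal=literal, ambiguous=ambiguous)
```

     Relaxing 'max.mismatch' can only add matches, so the inexact
     count is never smaller than the exact one.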

_D_e_t_a_i_l_s:

     Available algorithms are: ``naive exact'', ``naive inexact'',
     ``Boyer-Moore-like'' and ``shift-or''. Not all of them can be used
     in all situations: restrictions depend on the length of the
     pattern, the class of the subject and the values of 'max.mismatch'
     and 'fixed'.

     When more than one algorithm can be used for a given task, the
     choice affects only the performance, not the result, so there is
     no "wrong choice" (strictly speaking). In practice it is best to
     keep 'algorithm="auto"' (the default): 'matchPattern' will then
     choose the algorithm that is best suited for the task.
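
     The claim that the choice of algorithm does not change the result
     can be checked with a small sketch like the one below. It assumes
     that Biostrings is installed and that each of the listed
     algorithms accepts a short exact pattern on a DNAString subject;
     the subject and pattern are made up for the example.

```r
library(Biostrings)

# Hypothetical short subject and exact pattern
x <- DNAString("AAGCGCGATATG")
algos <- c("auto", "naive-exact", "boyer-moore", "shift-or")

# Collect the match start positions reported by each algorithm
starts <- lapply(algos, function(a) {
  start(matchPattern("GCG", x, algorithm=a))
})
names(starts) <- algos

# TRUE if every algorithm reports the same start positions
same <- all(sapply(starts, function(s) identical(s, starts[[1]])))
same
```

     On real genomes the algorithms differ in running time, which is
     why leaving the choice to '"auto"' is usually the simplest option.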

_V_a_l_u_e:

     An XStringViews object for 'matchPattern'.

     A single integer for 'countPattern'.

     An integer vector for 'vcountPattern', in which the i-th element
     is the number of matches in the i-th element of 'subject'.
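
     The three return values can be compared side by side with a short
     sketch. The sequences below are made up for the example, and the
     code assumes that Biostrings is installed.

```r
library(Biostrings)

# Hypothetical single subject
x <- DNAString("AAGCGCGATATG")

# matchPattern returns the matches as views on the subject
m <- matchPattern("GCG", x)
is_views <- is(m, "XStringViews")

# countPattern returns only how many matches there are
n <- countPattern("GCG", x)

# vcountPattern returns one count per element of a set of sequences
set <- DNAStringSet(c("AAGCGCG", "TTTT", "GCG"))
v <- vcountPattern("GCG", set)

list(is_views=is_views, starts=start(m), n=n, v=v)
```

     Overlapping occurrences are reported separately, so "GCG" is
     found twice in "GCGCG".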

_S_e_e _A_l_s_o:

     'isMatching', 'mismatch', 'matchPDict', 'matchLRPatterns',
     'matchProbePair', 'maskMotif', 'alphabetFrequency',
     XStringViews-class, XString-class

_E_x_a_m_p_l_e_s:

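       ## A simple example with ambiguity letters and a short subject
       x <- DNAString("AAGCGCGATATG")
       ## fixed=TRUE (the default): "N" in the pattern is a literal letter
       m1 <- matchPattern("GCNNNAT", x)
       m1
       ## fixed=FALSE: "N" in the pattern is an IUPAC ambiguity letter
       m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)
       m2
       as.matrix(m2)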

       ## With DNA sequence of yeast chromosome number 1
       data(yeastSEQCHR1)
       yeast1 <- DNAString(yeastSEQCHR1)
       PpiI <- "GAACNNNNNCTC" # a restriction enzyme pattern
       match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
       match2.PpiI <- matchPattern(PpiI, yeast1, max.mismatch=1, fixed=FALSE)

       ## With a genome containing isolated Ns
       library(BSgenome.Celegans.UCSC.ce2)
       chrII <- Celegans[["chrII"]]
       alphabetFrequency(chrII)
       matchPattern("N", chrII)
       matchPattern("TGGGTGTCTTT", chrII) # no match
       matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE) # 1 match

       ## Using wildcards ("N") in the pattern on a genome containing N-blocks
       library(BSgenome.Dmelanogaster.UCSC.dm3)
       chrX <- maskMotif(Dmelanogaster$chrX, "N")
       as(chrX, "XStringViews") # 4 unmasked regions
       matchPattern("TTTATGNTTGGTA", chrX, fixed=FALSE)
       ## Can also be achieved with no mask
       masks(chrX) <- NULL
       matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")

