matchPattern           package:Biostrings           R Documentation

_S_t_r_i_n_g _s_e_a_r_c_h_i_n_g _f_u_n_c_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     'matchPattern' is a generic that finds all matches of a pattern in
     a BString (or derived) object; 'countPattern' counts them.

_U_s_a_g_e:

       matchPattern(pattern, subject, algorithm="auto", mismatch=0, fixed=TRUE)
       countPattern(pattern, subject, algorithm="auto", mismatch=0, fixed=TRUE)
       mismatch(pattern, x, fixed=TRUE)

_A_r_g_u_m_e_n_t_s:

 pattern: The pattern string. 

 subject: A BString (or derived) object containing the subject string,
          or a BStringViews object. 

algorithm: One of the following: '"auto"', '"naive-exact"',
          '"naive-fuzzy"', '"boyer-moore"' or '"shift-or"'. 

mismatch: The number of mismatches allowed. If non-zero, a fuzzy string
          searching algorithm is used for matching. 

   fixed: A 'fixed' value other than the default ('TRUE') can be used
          only with a DNAString or RNAString subject.

          With 'fixed=FALSE', ambiguities (i.e. letters from the IUPAC
          Extended Genetic Alphabet (see 'IUPAC_CODE_MAP') that are not
          from the base alphabet) in the pattern _and_ in the subject
          are interpreted as wildcards: each one matches any letter
          that it stands for.

          'fixed' can also be a character vector, a subset of
          'c("pattern", "subject")'. 'fixed=c("pattern", "subject")' is
          equivalent to 'fixed=TRUE' (the default). An empty vector is
          equivalent to 'fixed=FALSE'. With 'fixed="subject"', only
          ambiguities in the pattern are interpreted as wildcards.
          With 'fixed="pattern"', only ambiguities in the subject are
          interpreted as wildcards. 

       x: A BStringViews object (typically, one returned by
          'matchPattern(pattern, subject)'). 
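
     The 'fixed' semantics described above can be sketched in base R.
     This is an illustration only, not the Biostrings implementation:
     'iupac' is a local stand-in for 'IUPAC_CODE_MAP', and 'bases',
     'letters_match' and 'match_starts' are hypothetical helper names.
     The sketch assumes that a wildcard matches a letter on the fixed
     side when that letter's base set is contained in the wildcard's
     set, and that two wildcards match when their base sets overlap.

```r
## Local stand-in for IUPAC_CODE_MAP (assumption: same letter-to-bases mapping).
iupac <- c(A="A", C="C", G="G", T="T", M="AC", R="AG", W="AT", S="CG",
           Y="CT", K="GT", V="ACG", H="ACT", D="AGT", B="CGT", N="ACGT")

## Set of bases that a letter stands for.
bases <- function(letter) strsplit(iupac[[letter]], "")[[1]]

## Does pattern letter 'p' match subject letter 's' for a given 'fixed' value?
letters_match <- function(p, s, fixed = TRUE) {
    if (isTRUE(fixed)) fixed <- c("pattern", "subject")
    if (identical(fixed, FALSE)) fixed <- character(0)
    pb <- bases(p)
    sb <- bases(s)
    p_fixed <- "pattern" %in% fixed
    s_fixed <- "subject" %in% fixed
    if (p_fixed && s_fixed) return(p == s)   # both taken literally
    if (s_fixed) return(all(sb %in% pb))     # wildcards in the pattern only
    if (p_fixed) return(all(pb %in% sb))     # wildcards in the subject only
    length(intersect(pb, sb)) > 0            # wildcards on both sides
}

## Start positions of all matches of 'pattern' in 'subject' (naive scan).
match_starts <- function(pattern, subject, fixed = TRUE) {
    p <- strsplit(pattern, "")[[1]]
    s <- strsplit(subject, "")[[1]]
    n_offsets <- length(s) - length(p) + 1
    if (n_offsets < 1) return(integer(0))
    hits <- vapply(seq_len(n_offsets), function(i) {
        window <- s[i:(i + length(p) - 1)]
        all(mapply(letters_match, p, window, MoreArgs = list(fixed = fixed)))
    }, logical(1))
    which(hits)
}

match_starts("GCNNNAT", "AAGCGCGATATG")                     # "N" taken literally
match_starts("GCNNNAT", "AAGCGCGATATG", fixed = FALSE)      # "N" is a wildcard
match_starts("GCNNNAT", "AAGCGCGATATG", fixed = "subject")  # pattern wildcards only
```

     The character-vector form of 'fixed' is normalized first, so that
     'TRUE', 'FALSE' and the subsets of 'c("pattern", "subject")' all
     go through the same comparison.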

_D_e_t_a_i_l_s:

     Available algorithms are: ``naive exact'', ``naive fuzzy'',
     ``Boyer-Moore-like'' and ``shift-or''. Not all of them can be used
     in all situations: restrictions depend on the length of the
     pattern, the class of the subject and the values of 'mismatch' and
     'fixed'.

     When two different algorithms can be used for a given task, the
     choice between them affects only the performance, not the result,
     so there is no "wrong choice" (strictly speaking). In short, it is
     better to just use 'algorithm="auto"' (the default): this way
     'matchPattern' will choose the algorithm that is best suited for
     the task.
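
     To illustrate why the choice of algorithm changes only the running
     time, here is a minimal base R sketch of a naive exact search and
     a naive fuzzy search. 'naive_exact', 'naive_fuzzy' and
     'n_mismatch' are hypothetical names, not Biostrings functions, and
     the sketch makes no claim about how Biostrings implements its own
     algorithms.

```r
## Number of positions at which pattern letters 'p' differ from the
## window of subject letters 's' that starts at 'start'.
n_mismatch <- function(p, s, start) {
    sum(p != s[start:(start + length(p) - 1)])
}

## Naive fuzzy search: keep every offset with at most 'max_mismatch'
## differing letters.
naive_fuzzy <- function(pattern, subject, max_mismatch = 0) {
    p <- strsplit(pattern, "")[[1]]
    s <- strsplit(subject, "")[[1]]
    n_offsets <- length(s) - length(p) + 1
    if (n_offsets < 1) return(integer(0))
    d <- vapply(seq_len(n_offsets), function(i) n_mismatch(p, s, i),
                integer(1))
    which(d <= max_mismatch)
}

## Naive exact search: compare each substring with the pattern.
naive_exact <- function(pattern, subject) {
    n_offsets <- nchar(subject) - nchar(pattern) + 1
    if (n_offsets < 1) return(integer(0))
    starts <- seq_len(n_offsets)
    which(substring(subject, starts, starts + nchar(pattern) - 1) == pattern)
}

subject <- "AAGCGCGATATG"
naive_exact("GCGA", subject)
naive_fuzzy("GCGA", subject, max_mismatch = 0)  # same offsets as naive_exact
naive_fuzzy("GCGA", subject, max_mismatch = 1)  # includes the exact hits
```

     With zero mismatches allowed, both functions return the same
     offsets; they differ only in how much work they do per offset.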

_V_a_l_u_e:

     A BStringViews object for 'matchPattern'.

     A single integer for 'countPattern'.

     A list of integer vectors for 'mismatch'.

_S_e_e _A_l_s_o:

     'matchLRPatterns', 'matchProbePair', 'mask', 'alphabetFrequency',
     'IUPAC_CODE_MAP', BStringViews-class, DNAString-class

_E_x_a_m_p_l_e_s:

       ## A simple wildcard matching example with a short subject
       x <- DNAString("AAGCGCGATATG")
       m1 <- matchPattern("GCNNNAT", x)               # "N" taken literally
       m1                                             # no match
       m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)  # "N" is a wildcard
       m2                                             # matches start at 3 and 5
       as.matrix(m2)

       ## With DNA sequence of yeast chromosome number 1
       data(yeastSEQCHR1)
       yeast1 <- DNAString(yeastSEQCHR1)
       PpiI <- "GAACNNNNNCTC" # a restriction enzyme pattern
       match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
       match2.PpiI <- matchPattern(PpiI, yeast1, mismatch=1, fixed=FALSE)

       ## With a genome containing isolated Ns
       library(BSgenome.Celegans.UCSC.ce2)
       chrII <- Celegans[["chrII"]]
       alphabetFrequency(chrII)
       matchPattern("N", chrII)
       matchPattern("TGGGTGTCTTT", chrII) # no match
       matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE) # 1 match

       ## Using wildcards ("N") in the pattern on a genome containing N-blocks
       library(BSgenome.Dmelanogaster.FlyBase.r51)
       chrX <- Dmelanogaster[["X"]]
       noN_chrX <- mask(chrX, "N")
       mask(noN_chrX) # See the N-blocks?
       matchPattern("TTTATGNTTGGTA", noN_chrX, fixed=FALSE)
       ## Can also be achieved with
       matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")

