match-utils            package:Biostrings            R Documentation

_U_t_i_l_i_t_y _f_u_n_c_t_i_o_n_s _r_e_l_a_t_e_d _t_o _p_a_t_t_e_r_n _m_a_t_c_h_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     This man page gives some background information about the concept
     of "match" ("exact match" or "inexact match") as understood by the
     various pattern matching functions available in the Biostrings
     package.

     The 'nmismatchStartingAt', 'nmismatchEndingAt' and 'isMatching'
     functions implement this concept.

     Other utility functions related to pattern matching are described
     here: the 'mismatch' function for getting the positions of the
     mismatching letters of a given pattern relative to its matches
     in a given subject, and the 'coverage' function that can be used
     to get the "coverage" of a subject by a given pattern or set of
     patterns.

_U_s_a_g_e:

       nmismatchStartingAt(pattern, subject, starting.at=1, fixed=TRUE)
       nmismatchEndingAt(pattern, subject, ending.at=1, fixed=TRUE)
       isMatching(pattern, subject, start=1, max.mismatch=0, fixed=TRUE)
       mismatch(pattern, x, fixed=TRUE)
       coverage(x, start=NA, end=NA)

_A_r_g_u_m_e_n_t_s:

 pattern: The pattern string. 

 subject: An XString object (or character vector) containing the
          subject sequence. 

starting.at: An integer vector specifying the starting positions of the
          pattern relative to the subject. 

ending.at: An integer vector specifying the ending positions of the
          pattern relative to the subject. 

   start: For 'isMatching': an integer vector specifying the starting
          positions of the pattern relative to the subject. For
          'coverage': a single integer specifying the position in 'x'
          where to start the extraction of the coverage. 

max.mismatch: The maximum number of mismatching letters allowed. Note
          that 'isMatching' doesn't support the kind of inexact
          matching where a given number of insertions or deletions are
          allowed. Therefore all the "matches" (i.e. the substrings in
          the subject that match the pattern) have the length of the
          pattern. 

   fixed: Only with a DNAString or RNAString subject can a 'fixed'
          value other than the default ('TRUE') be used.

          With 'fixed=FALSE', ambiguities (i.e. letters from the IUPAC
          Extended Genetic Alphabet (see 'IUPAC_CODE_MAP') that are not
          from the base alphabet) in the pattern _and_ in the subject
          are interpreted as wildcards, i.e. they match any letter that
          they stand for.

          'fixed' can also be a character vector, a subset of
          'c("pattern", "subject")'. 'fixed=c("pattern", "subject")' is
          equivalent to 'fixed=TRUE' (the default). An empty vector is
          equivalent to 'fixed=FALSE'. With 'fixed="subject"',
          ambiguities in the pattern only are interpreted as wildcards.
          With 'fixed="pattern"', ambiguities in the subject only are
          interpreted as wildcards. 

       x: For 'mismatch': an XStringViews object (typically, one
          returned by 'matchPattern(pattern, subject)').

          For 'coverage': typically an XStringViews or MIndex object,
          but IRanges, MaskCollection and MaskedXString objects are
          accepted too. 

     end: A single integer specifying the position in 'x' where to end
          the extraction of the coverage. 

_V_a_l_u_e:

     'nmismatchStartingAt', 'nmismatchEndingAt': an integer vector of
     the same length as 'starting.at' (or 'ending.at') reporting the
     number of mismatching letters for each starting (or ending)
     position.

     'isMatching': a logical vector of the same length as 'start'.

     'mismatch': a list of integer vectors.

     'coverage': an integer vector indicating the coverage of 'x' in
     the interval specified by the 'start' and 'end' arguments. An
     integer value called the "coverage" can be associated with each
     position in 'x', indicating how many times this position is
     covered by the views or matches stored in 'x'. For example, if 'x'
     is an XStringViews object, the coverage of a given position in 'x'
     is the number of views it belongs to. If 'x' is an MIndex object,
     the coverage of a given position in 'x' is the number of matches
     (or hits) it belongs to. Note that the positions in the returned
     vector are to be interpreted as relative to the interval specified
     by the 'start' and 'end' arguments.

_S_e_e _A_l_s_o:

     'matchPattern', 'matchPDict', 'IUPAC_CODE_MAP', XString-class,
     XStringViews-class, MIndex-class, IRanges-class,
     MaskCollection-class, MaskedXString-class

_E_x_a_m_p_l_e_s:

       ## ---------------------------------------------------------------------
       ## nmismatchStartingAt() / isMatching()
       ## ---------------------------------------------------------------------
       subject <- DNAString("GTATA")

       ## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
       nmismatchStartingAt("AT", subject, starting.at=3)
       isMatching("AT", subject, start=3)

       ## ... but not at position 1
       nmismatchStartingAt("AT", subject)
       isMatching("AT", subject)

       ## ... unless we allow 1 mismatching letter (inexact match)
       isMatching("AT", subject, max.mismatch=1)

       ## Here we look at 6 different starting positions and find 3 matches if
       ## we allow 1 mismatching letter
       isMatching("AT", subject, start=0:5, max.mismatch=1)
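
       ## A minimal base-R sketch of the mismatch count behind the results
       ## above. 'count_mismatches' is a hypothetical helper, not part of
       ## Biostrings. It assumes plain character strings, fixed=TRUE
       ## semantics, and that pattern letters falling outside the subject
       ## count as mismatches (an assumption consistent with the 3 matches
       ## reported above).
       count_mismatches <- function(pattern, subject, starting.at=1) {
           p <- strsplit(pattern, "")[[1]]
           s <- strsplit(subject, "")[[1]]
           sapply(starting.at, function(start) {
               pos <- start + seq_along(p) - 1L
               in_bounds <- pos >= 1L & pos <= length(s)
               ## NA marks a pattern letter with no subject letter facing it
               facing <- rep(NA_character_, length(p))
               facing[in_bounds] <- s[pos[in_bounds]]
               sum(is.na(facing) | facing != p)
           })
       }
       nmis <- count_mismatches("AT", "GTATA", starting.at=0:5)
       nmis              # 2 1 2 0 2 1
       (0:5)[nmis <= 1]  # 1 3 5: the starts that match with max.mismatch=1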

       ## No match
       nmismatchStartingAt("NT", subject, starting.at=1:4)
       isMatching("NT", subject, start=1:4)

       ## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
       nmismatchStartingAt("NT", subject, starting.at=1:4, fixed=FALSE)
       isMatching("NT", subject, start=1:4, fixed=FALSE)

       ## max.mismatch != 0 and fixed=FALSE can be used together
       nmismatchStartingAt("NCA", subject, starting.at=0:5, fixed=FALSE)
       isMatching("NCA", subject, start=0:5, max.mismatch=1, fixed=FALSE)
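
       ## A hedged base-R sketch of how the 'fixed' argument can be read.
       ## 'normalize_fixed', 'iupac_subset', 'stands_for' and 'letters_match'
       ## are hypothetical names, not Biostrings functions. Only a few IUPAC
       ## codes are listed, and an ambiguity facing another ambiguity is not
       ## modelled here.
       normalize_fixed <- function(fixed) {
           if (is.logical(fixed))
               return(c(pattern=isTRUE(fixed), subject=isTRUE(fixed)))
           c(pattern="pattern" %in% fixed, subject="subject" %in% fixed)
       }
       iupac_subset <- c(A="A", C="C", G="G", T="T", R="AG", Y="CT", N="ACGT")
       stands_for <- function(code) strsplit(iupac_subset[[code]], "")[[1]]
       letters_match <- function(p, s, fixed=TRUE) {
           f <- normalize_fixed(fixed)
           if (p == s) return(TRUE)
           ## a non-fixed side has its ambiguities treated as wildcards
           (!f[["pattern"]] && s %in% stands_for(p)) ||
               (!f[["subject"]] && p %in% stands_for(s))
       }
       letters_match("N", "G")                   # FALSE: N is a literal letter
       letters_match("N", "G", fixed="subject")  # TRUE: pattern N is a wildcard
       letters_match("N", "G", fixed="pattern")  # FALSE: pattern is literal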

       ## Starting positions can be given in any order and may fall outside
       ## the subject
       some_starts <- c(10:-10, NA, 6)
       subject <- DNAString("ACGTGCA")
       is_matching <- isMatching("CAT", subject, start=some_starts, max.mismatch=1)
       some_starts[is_matching]

       ## ---------------------------------------------------------------------
       ## mismatch()
       ## ---------------------------------------------------------------------
       m <- matchPattern("NCA", subject, max.mismatch=1, fixed=FALSE)
       mismatch("NCA", m)

       ## ---------------------------------------------------------------------
       ## coverage()
       ## ---------------------------------------------------------------------
       coverage(m)

       x <- IRanges(start=c(-2L, 6L, 9L, -4L, 1L, 0L, -6L, 10L),
                    width=c( 5L, 0L, 6L,  1L, 4L, 3L,  2L,  3L))
       coverage(x, start=-6, end=20)  # 'start' and 'end' must be specified for
                                      # an IRanges object.
       coverage(shift(x, 2), start=-6, end=20)
       coverage(restrict(x, 1, 10), start=-6, end=20)
       coverage(reduce(x), start=-6, end=20)
       coverage(gaps(x, start=-6, end=20), start=-6, end=20)
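
       ## A minimal base-R sketch of what "coverage" means for a set of
       ## ranges. 'range_coverage' is a hypothetical helper, not the
       ## Biostrings or IRanges implementation; it takes plain numeric
       ## starts and widths and assumes start <= end.
       range_coverage <- function(starts, widths, start, end) {
           positions <- seq(start, end)
           cvg <- integer(length(positions))
           for (i in seq_along(starts)) {
               ## a range of width 0 covers no position
               covered <- positions >= starts[i] &
                          positions <= starts[i] + widths[i] - 1
               cvg[covered] <- cvg[covered] + 1L
           }
           cvg  # element k corresponds to position start + k - 1
       }
       range_coverage(starts=c(1, 3), widths=c(4, 3), start=0, end=6)
       ## 0 1 1 2 2 1 0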

       mask1 <- Mask(mask.width=29, start=c(11, 25, 28), width=c(5, 2, 2))
       mask2 <- Mask(mask.width=29, start=c(3, 10, 27), width=c(5, 8, 1))
       mask3 <- Mask(mask.width=29, start=c(7, 12), width=c(2, 4))
       mymasks <- append(append(mask1, mask2), mask3)
       coverage(mymasks)

       ## See ?matchPDict for examples of using coverage() on an MIndex object...

