scoreSegments          package:tilingArray          R Documentation

_S_c_o_r_e _s_e_g_m_e_n_t_s

_D_e_s_c_r_i_p_t_i_o_n:

     Score the segments found by a previous call to findSegments by
     comparing to genome annotation

_U_s_a_g_e:

     scoreSegments(s, gff, 
       nrBasePerSeg = 1500, 
       probeLength  = 25,
       knownFeatures = c("CDS", "gene", "ncRNA", "nc_primary_transcript",
             "rRNA", "snRNA", "snoRNA", "tRNA",
             "transposable_element", "transposable_element_gene"),
       params = c(minOverlapFractionSame = 0.8, minOverlapOppo = 40,
         minIsolatedDistance=100, oppositeWindow = 100, utrScoreWidth=100),
       verbose = TRUE)

_A_r_g_u_m_e_n_t_s:

       s: environment. See details.

     gff: GFF dataframe.

nrBasePerSeg: Numeric of length 1. This parameter determines the number
          of segments.

probeLength: Numeric of length 1.

knownFeatures: Character vector. Names of those features in 'gff' which
          should be considered _known features_.

  params: vector of additional parameters, see details.

 verbose: Logical.

_D_e_t_a_i_l_s:

     This function scores segments. It is typically called after a
     _segmentation_. For an example segmentation script, see the script
     'segment.R' in the 'scripts' directory of this package. For an
     example scoring script, which loads the data and then calls this
     function, see the script 'scoreSegments.R'.

     To compare segment coordinates with genomic coordinates, a
     segment's start coordinate is defined as the coordinate of the
     middle (13th) base of the first probe in the segment, and
     similarly its end coordinate as the coordinate of the middle base
     of the last probe.
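
     The coordinate convention above can be sketched as follows. This
     is an illustration only: 'segmentCoords' is a hypothetical helper,
     not a function of this package, and it assumes that each probe is
     given by the genomic position of its first base.

```r
# Hypothetical helper, not part of tilingArray: genomic coordinates of a
# segment, given the start positions of its first and last probes.
segmentCoords <- function(firstProbeStart, lastProbeStart, probeLength = 25) {
  # offset from a probe's first base to its middle base (12 for 25-mers,
  # i.e. the 13th base); for even lengths this picks the lower middle base
  mid <- (probeLength - 1) %/% 2
  c(start = firstProbeStart + mid, end = lastProbeStart + mid)
}

coords <- segmentCoords(firstProbeStart = 1001, lastProbeStart = 1201)
```

     Here a segment whose first probe starts at base 1001 and whose
     last probe starts at base 1201 gets start 1013 and end 1213.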

     For each segment, we calculate and record its:

     _c_h_r, _s_t_r_a_n_d chromosome and strand

     _s_t_a_r_t, _e_n_d, _l_e_n_g_t_h start position, end position, length (in bases)

     _f_r_a_c._d_u_p fraction (0...1) of probes in this segment that have also
          hits otherwhere 

     _l_e_v_e_l mean signal level

     _g_e_n_e_I_n_S_e_g_m_e_n_t list of gene identifiers (these can be zero, one, or
          several identifiers). A gene is included if it is fully
          contained within the segment, i.e. its start coordinates are
          >= the segment's start and its end coordinates <= the
          segment's end.

     _o_v_e_r_l_a_p_p_i_n_g_F_e_a_t_u_r_e list of feature identifiers (these include
          genes, CDSs (=exons), ncRNAs ..., everything which has a line
          the GFF file). A feature is included if the overlap between
          it and the segment is more than a fraction
          'minOverlapFractionSame' of the length of the segment or of
          the feature, whichever is smaller. The overlappingFeature
          list by definition contains the elements of the geneInSegment
          list, but can be larger.

     _o_p_p_o_s_i_t_e_F_e_a_t_u_r_e list of feature identifiers. A feature is included
          if the overlap between it and the segment is >=
          minOverlapOppo (see below) bases.

     _o_p_p_o_s_i_t_e_E_x_p_r_e_s_s_i_o_n a number. The signal on the opposite strand is
          filtered with a moving average smoother of width
          oppositeWindow (see below) bases. oppositeExpression is the
          minimum of the result. It is later used to eliminate
          potential reverse transcription artifacts from the
          unannotated, potential antisense segments. If it is
          sufficiently small, we can assume that there is no
          transcription at least on parts of the strand opposite the
          segment, hence it cannot be a reverse transcription artifact.

     _i_s_I_s_o_l_a_t_e_d_S_a_m_e, _i_s_I_s_o_l_a_t_e_d_O_p_p_o logical. TRUE if the distance
          between the segment and and any annotated feature on the same
          or opposite strand, respectively, is >=
          'params["minIsolatedDistance"]' bases.

     _u_t_r_3, _u_t_r_5 positive integer numbers. These are calculated only for
          segments which have exactly one gene in the geneInSegment
          list. They are calculated as the difference between start
          points of segment and gene, and between end points of segment
          and gene, respectively.

     _d_i_s_t_L_e_f_t, _d_i_s_t_R_i_g_h_t positive integer numbers. distLeft is the
          distance between the start of the segment and the closest end
          of any annotated feature, and distRight is the distance
          between the end of the segment and the closest start any
          annotated feature.

     _z_L_e_f_t, _z_R_i_g_h_t z-scores for the left (right) flank of the segment.
          They are calculated as the difference between mean and of the
          segment and mean of the signal from a region of length
          utrScoreWidth (see below) immediately to the left (right),
          divided by the standard deviation of the region. Note that
          the standard deviation of the signal within the segment is
          not considered here.
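
     The interval-based columns above can be illustrated with a small
     sketch. It is not the package's implementation: the toy features,
     the variable names, and the choice to restrict geneInSegment and
     overlappingFeature to the segment's own strand are assumptions
     made for the example, as is the reading of distLeft and distRight
     as gaps to features lying entirely to the left or right.

```r
# Toy data (hypothetical); coordinates are inclusive base positions.
segStart <- 1000; segEnd <- 2000; segStrand <- "+"
feat <- data.frame(
  name   = c("geneA", "geneB", "geneC", "geneD"),
  start  = c(1100, 1900, 2300, 500),
  end    = c(1800, 2600, 2500, 800),
  strand = c("+", "-", "+", "+"),
  stringsAsFactors = FALSE
)
params <- c(minOverlapFractionSame = 0.8, minOverlapOppo = 40)

# assumption: "same" rules use same-strand features, "opposite" rules the rest
same <- feat$strand == segStrand

# number of bases shared by the segment and each feature (0 if disjoint)
overlap <- pmax(0, pmin(feat$end, segEnd) - pmax(feat$start, segStart) + 1)

# geneInSegment: feature fully contained in the segment
geneInSegment <- feat$name[same & feat$start >= segStart & feat$end <= segEnd]

# overlappingFeature: overlap exceeds the given fraction of the shorter of
# segment and feature
shorter <- pmin(segEnd - segStart + 1, feat$end - feat$start + 1)
overlappingFeature <-
  feat$name[same & overlap > params[["minOverlapFractionSame"]] * shorter]

# oppositeFeature: at least minOverlapOppo shared bases on the other strand
oppositeFeature <- feat$name[!same & overlap >= params[["minOverlapOppo"]]]

# distLeft / distRight: gap to the nearest feature end on the left and to
# the nearest feature start on the right
distLeft  <- min(segStart - feat$end[feat$end < segStart])
distRight <- min(feat$start[feat$start > segEnd] - segEnd)
```

     With this toy data, geneA is the only entry of geneInSegment and
     overlappingFeature, geneB is the only oppositeFeature, and
     distLeft and distRight are 200 and 300 bases.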

     The meaning of the parameters in the parameter vector 'params' is
     as follows:

     _m_i_n_O_v_e_r_l_a_p_F_r_a_c_t_i_o_n_S_a_m_e see the definition of 'overlappingFeature'
          above

     _m_i_n_O_v_e_r_l_a_p_O_p_p_o see the definition of 'oppositeFeature' above

     _o_p_p_o_s_i_t_e_W_i_n_d_o_w see the definition of 'oppositeExpression' above

     _u_t_r_S_c_o_r_e_W_i_d_t_h see the definition of 'zLeft', 'zRight' above

_V_a_l_u_e:

     A dataframe with columns as described in the details section.

_A_u_t_h_o_r(_s):

     W. Huber <huber@ebi.ac.uk>

_E_x_a_m_p_l_e_s:

