Title: Document Addressing Methods Used by SGML Applications
Source: SGML Rapporteur Group
Project: 1.18.15.1
Project editor: Charles F. Goldfarb
Status of document: Approved Working Paper
Requested action: For information
Date: 22 April 1994
Distribution: WG8 and liaisons
References:
Supersedes:

This paper discusses addressing schemes that can be derived from the "trees plus" models that are being proposed for SGML.

The addressing schemes are based on the following:

o Property values are addressed by the name of the property.

o Terms in a sequence are addressed by their numerical position in the sequence.

o Terms in the content sequence, and attributes, of a subelement are addressed by the address of the subelement and then the address of the term or attribute within (i.e., relative to) that subelement.

o If an addressable object has only one sequence-valued property, then the property name may be omitted and the terms of the sequence can be addressed directly relative to the address of the object. (For example, the address of the fifth term of the content sequence in the examples given below can be "5" rather than "content.5".)

Note that generating a good addressing scheme is not the only criterion for the "goodness" of a model. For example, a model used to explain the "shape" of a data structure might be judged on the ease with which people understand the data structure shape. Different models may be better for different purposes. (In a multi-model environment it is of course very important that the isomorphisms between models be carefully described.)

The 8879-Syntax Model

One model of elements that is simple and easy to understand is the one directly derived from the syntax productions of 8879.
A version of that model, simplified by assuming that marked section declarations and entity references have been resolved, is as follows:

Elements have these two properties,

o content, for which they exhibit a value that is a finite sequence

o type name, for which they exhibit a value that is an SGML name

and possibly other properties, collectively known as "attributes".

Because these elements have the content property, and because some terms of the content sequence may themselves be elements, elements are (among other things) inherently trees of elements.

The content sequence has terms that are either elements, processing instructions, individual data characters, or comment and certain other declarations.

This structure model is conveniently simple for describing the data structures of SGML, but the addressing scheme that it generates leaves much to be desired. For example, consider an element whose content consists of nine PCDATA characters followed by two subelements, such as:

    bcdefghij

The address of the second subelement, the "gorp" attribute's value in that subelement, and the fifth data character would be

    11
    11.gorp
and
    5

respectively.

The undesirable thing about this addressing scheme is that changes in an element that in some sense do not change the "SGML structure" of the element nonetheless change the address of things that are considered important to the "SGML structure". For example, if an additional character is added to a string of characters all of which satisfy a particular #PCDATA token of a content model, the "SGML structure" of the element has not changed, but every term beyond the new character in the content sequence will have had its address changed.
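
The addressing rules of this model, and the instability just described, can be illustrated with a minimal Python sketch. Everything named in it is an assumption made for illustration: the class Element, the functions resolve and build, the element type names "doc", "first" and "second", and the attribute value "some value" do not come from 8879 or from this paper. Only the attribute name "gorp", the nine characters, and the addresses are taken from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for an element of the 8879-syntax model."""
    type_name: str
    attributes: dict = field(default_factory=dict)
    # Terms are either single data characters (str) or Element objects.
    content: list = field(default_factory=list)


def resolve(element, address):
    """Resolve a dotted address such as "11.gorp" relative to `element`."""
    node = element
    for step in address.split("."):
        if step == "content":
            continue  # the property name may be omitted, so it is optional
        if step.isdigit():
            node = node.content[int(step) - 1]  # positions are 1-based
        else:
            node = node.attributes[step]  # any other step names an attribute
    return node


def build(text):
    """Element whose content is `text`, one character per term, then two subelements."""
    return Element("doc", content=list(text) + [
        Element("first"),
        Element("second", {"gorp": "some value"}),
    ])


doc = build("bcdefghij")
print(resolve(doc, "11").type_name)  # second
print(resolve(doc, "11.gorp"))       # some value
print(resolve(doc, "5"))             # f

# One more data character shifts the address of every later term.
longer = build("bcdefghijk")
print(resolve(longer, "11").type_name)  # first
```

In this sketch the second subelement moves from address 11 to address 12 when a single character is appended to the data, even though the "SGML structure" of the element is unchanged.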

Clumping

A transformation of the model called "clumping" can be used to eliminate this problem, at least for PCDATA strings: Instead of placing each data character in a separate term of the content sequence, we use a single term in the sequence to "hold the place" for all the characters in a contiguous PCDATA string--that term is itself a sequence of characters.

Thus, using the same nine-characters-and-two-subelements example used above, the address of the second subelement, its "gorp" attribute's value, the entire PCDATA string, and the fifth character thereof would be

    3
    3.gorp
    1
and
    1.5

respectively.

Clumping has the added -- sometimes more important -- characteristic that, in addition to limiting the effect on addresses of adding to something that has been clumped, it also permits addressing, focusing attention on, and hanging properties on the clump itself.

In point of fact, clumping is exactly the same procedure, with this alternate motivation, that created the tree structure we use from a flat sequence of data characters: There are substrings (= clumps) of the data characters around which it is useful to place bounds, treating them as separate things--objects in their own right that have properties. We call them elements.

A Different Kind of "Relative Addressing": Secondary Content

For many purposes, the terms of the content sequence of an element can be partitioned into two categories, "primary" and "secondary". Most typically, the characters in PCDATA character subsequences and subelements are categorized "primary", and declarations and processing instructions are categorized "secondary". The desire then is to have an addressing scheme in which the presence or absence of secondary content terms does not change the address of the primary terms.
For example, if a processing instruction were added between the second and third data characters and a second processing instruction were added between the two subelements in our preceding example, we would not want the address of the data characters and subelements to change.

The effect of this is that, even though the secondary terms occur in the content or data character sequences, they are not to be counted when determining the address of the primary terms.

How then to address these secondary terms? The simplest solution is to address them relative to the address of the primary term closest on the left. Thus, for example, the address (within the element) of the two processing instructions (assuming there are no other secondary terms, at least nearby) would be

    1.2/1
and
    2/1

respectively. When a secondary term occurs before the first primary term of a sequence, the primary address used should be "0".

Clumping and categorizing of the 8879-inspired "syntax-derived" model in a particular way results in a model with the particularly useful addressing scheme of ISO 10744 (HyTime).

A Problem

Unlike the boundaries of subelements, the boundaries of PCDATA strings are not marked by any particular markup in the flat marked-up character stream format that SGML prescribes for transmission. Therefore, if a secondary term such as a processing instruction is introduced before the first character of a PCDATA character subsequence, a parser cannot tell if it is in the content sequence of the element, preceding the character subsequence, or if it is in the character subsequence, preceding the first real data character. The model is richer than the flat format can describe.

We believe that there is no value in trying to modify the flat format to make it able to describe these particular distinctions. Rather, one should simply prescribe that these ambiguous-location objects must always be placed in one of the two (or more?) locations according to some prescribed rule.

8879 has gone to great lengths to avoid requiring a parser to do much lookahead. Therefore we recommend that the prescribed location for a secondary object in an ambiguous location be selected without lookahead:

o Before the first character of a data character subsequence, the parser does not in general know that the subsequence is about to start. Therefore, all such secondary terms should be in the content sequence, not the data subsequence.

o After the last character of a data character subsequence (but before the parser sees the next primary term or the markup that ends the content sequence), the parser does not in general know that the subsequence is about to end. Therefore, all such secondary terms should be at the end but still in the data character subsequence, not in the content sequence following the data character subsequence.

Other ambiguous cases should be dealt with similarly.
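
As an illustration only, the clumped model, the addressing of secondary terms, and the no-lookahead placement rule recommended above can be combined in one short Python sketch. The function assign_addresses, the (kind, value) event representation, and the names "first", "second", "pi-A" and "pi-B" are assumptions made for this sketch; attributes are left out for brevity. Each secondary term is placed using only what has already been seen, so nothing depends on the term that follows it.

```python
def assign_addresses(events):
    """Map each term of one element's content to its address.

    `events` lists the content in document order as (kind, value) pairs,
    where kind is "char" or "elem" (primary terms) or "pi" (secondary term).
    """
    addresses = {}    # address string -> value
    primaries = 0     # primary terms seen so far in the content sequence
    clump_len = None  # characters in the open clump; None when no clump is open
    trailing = 0      # secondary terms already attached to the current anchor

    for kind, value in events:
        if kind == "char":
            if clump_len is None:  # the first character opens a new clump
                primaries += 1
                clump_len = 0
                addresses[str(primaries)] = ""
            clump_len += 1
            trailing = 0
            addresses[str(primaries)] += value  # the clump as a whole
            addresses[f"{primaries}.{clump_len}"] = value
        elif kind == "elem":
            clump_len = None  # a subelement ends any open clump
            primaries += 1
            trailing = 0
            addresses[str(primaries)] = value
        else:
            # Secondary term, placed without lookahead.
            trailing += 1
            if clump_len is None:
                anchor = str(primaries)  # content sequence; "0" if nothing precedes
            else:
                anchor = f"{primaries}.{clump_len}"  # stays inside the open clump
            addresses[f"{anchor}/{trailing}"] = value
    return addresses


# Nine characters and two subelements, with one processing instruction after
# the second character and another between the two subelements.
example = (
    [("char", c) for c in "bc"] + [("pi", "pi-A")]
    + [("char", c) for c in "defghij"]
    + [("elem", "first"), ("pi", "pi-B"), ("elem", "second")]
)
addr = assign_addresses(example)
print(addr["1"], addr["1.5"], addr["3"])  # bcdefghij f second
print(addr["1.2/1"], addr["2/1"])         # pi-A pi-B
```

With these events the primary addresses are the same as they would be without the processing instructions. A processing instruction placed before the first data character would receive the address 0/1, and one placed after the ninth character but before the first subelement would receive 1.9/1, in line with the two rules above.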