.* TEI Document No: MLW13
.* Title: Guidelines for Using SGML
.* Drafted: DTB Feb 90
.* Revised: MSM put into GML, 5 Mar 90
.*
.im gmlpaper ;.* Use GMLPAPER or GMLGUIDE (or -MLA)
.sr docfile = &sysfnam. ;.sr docversion = 'Draft'
.im teigml
.* Document proper begins.
<gdoc>
<frontm>
<titlep>
<title>Guidelines
<title>For Using the Standard Generalized Markup Language (SGML)
<author>TEI Syntax and Metalanguage Committee
<docnum>TEI &docfile.
<date>February 1990
</titlep>
<!>
<toc>
</frontm>
<!>
<body>
<h1>Introduction
As expected, we recommend the use of the Standard Generalized Markup Language (SGML) for the TEI encoding scheme. By this we mean that we take the ISO standard ISO 8879 (1986) as the basis for the TEI's work.
We anticipate that features identified as part of the TEI encoding scheme (or family of schemes) by the other TEI committees will fall into one of these classes:
<gl>
<gt>Representable and validatable in SGML
<gd>SGML provides a construct intended for representing the feature. The most obvious example is a single structural hierarchy, which can be encoded easily. An SGML parser can validate that the feature is encoded ``properly.'' In these cases the available SGML construct should always be used.
<gt>Representable but not validatable in SGML
<gd>SGML provides no construct directly intended for representing the feature. There is nevertheless a way (perhaps an ``obvious'' way) to encode the feature using SGML, but it relies on a convention that is unknown to a parser and therefore cannot be checked by one. For example, attribute values may be used to encode information in a way that an application can interpret but a parser cannot validate, as when a string contains a list of values drawn from some predetermined enumeration. In these cases, the conventions being used have to be carefully documented for the writers of applications.
<gt>Not representable in SGML
<gd>We expect these features to be rare. In fact, we expect them to be essentially non-existent, in that some SGML encoding can be found for almost anything. If one of the committees encounters any such feature, we will attempt to suggest an encoding using SGML. The suggestion, even if not ``natural,'' would have the virtue of being consistent with other such encodings used in the project.
</gl>
SGML is a large and complex ``language.'' The categorization just given suggests that we expect few instances (perhaps none) where a feature cannot be encoded in SGML, whether in a validatable or a non-validatable manner. We thus expect that there will be essentially no need to extend SGML as an encoding scheme for documents. There may, in fact, be a need for some mechanism outside the scope of SGML for manipulating feature sets or tag sets to produce DTDs, but any such mechanism will not be inconsistent with SGML. We will address this issue in other papers.
The complexity of SGML suggests that it might be possible or advisable to limit the use of certain SGML features. The remainder of this document states our recommendations about which features may or may not be used. Features we do not comment on can be used freely. We comment only on features where we think restrictions should be imposed, and on features where we might have been expected to impose restrictions but do not; in the latter case we make the lack of restrictions explicit.
<h1>Contexts for Using the TEI Encoding Scheme
To proceed, we identify four different contexts in which the TEI encoding scheme might be used. These are:
<gl>
<gt>Capture
<gd>A document is being transcribed into machine-readable form, or created initially in machine-readable form.
<gt>Interchange
<gd>A document is being transmitted from one electronic ``environment'' to another.
In the most complex case, the two environments may be geographically separated, on different machines, using different operating systems and different character sets, and used by speakers of different languages.
<gt>Storage
<gd>A document is to be stored in a database or archive.
<gt>Processing
<gd>A document is to be used by an application such as formatting, retrieval, or analysis.
</gl>
While the focus of the TEI is primarily on interchange, we expect that the scheme developed will be used in these other settings whether or not the project so intends. We thus considered the desirability of various features in these different contexts. It turns out, after the fact, that our recommendations differ very little from one context to another. In the detailed listings the contexts are represented by their initial letters.
We now proceed to a detailed listing of the recommendations. The discussions leading up to these were long, involved, eloquent and emotional. We do not attempt to reproduce them here. The codes used here are ``Y'' if a feature's use is allowed, ``N'' if it is not allowed, and ``--'' if we don't care. The ``--'' entries are treated at the end of the document.
<h1>Minimization Features
These are features that allow the full SGML encoding to be reduced in different ways.
<table><tblbody cols='1 20 25 30 35 40 45 50'>
<row>SHORTREF <c>C: Y <c>I: N <c>S: N <c>P: Y
<p>This is a search and replace mechanism.
<row>DATATAG <c>C: N <c>I: N <c>S: N <c>P: N
<p>This allows data characters to be interpreted as markup.
<row>OMITTAG <c>C: Y <c>I: N <c>S: -- <c>P: Y
<p>The model (grammar) is allowed to indicate that tags can be omitted. While we allow the use of this feature in TEI DTDs, we see the need to make some explicit recommendations concerning its use. Careless use of this feature can result in marked-up documents that are difficult to read, or even ambiguous.
<row>RANK <c>C: N <c>I: N <c>S: N <c>P: N
<p>Permits occurrences of elements in a document to be numbered.
<row>SHORTTAG <c>C: Y <c>I: Y <c>S: Y <c>P: Y
<p>When this SGML feature is used, two kinds of minimization are enabled. The first is the omission of attribute names under some circumstances. This we allow.
<row>empty tag <c>C: N <c>I: N <c>S: N <c>P: N
<p>The second thing enabled by SHORTTAG is a tag with no tag name specified. This, of course, is allowed only in some circumstances. We think it should never be allowed in the TEI project.
</table>
<h1>Attributes
<table><tblbody cols='1 20 25 30 35 40 45 50'>
<row>attributes <c>C: Y <c>I: Y <c>S: Y <c>P: Y
</table>
Attributes are one of the most controversial features of SGML within the context of the TEI. The controversy is basically this: anything that can be encoded in an attribute can also be encoded as text surrounded by tags, so attributes are not necessary and make the formalism more complex. However, we recommend the use of attributes because we want to support multiple views of a document. We take the position that the ``content'' of a document is that part which is visible in all views. Parts of a document that do not appear in all views must be enclosed in markup. One possibility is to define such a part as an attribute. This is the approach that we take. Attributes will be visible only in those views that incorporate the tag to which the attribute pertains.
<h1>Inclusion and Exclusion Exceptions
<table><tblbody cols='1 20 25 30 35 40 45 50'>
<row>exceptions <c>C: Y <c>I: Y <c>S: Y <c>P: Y
</table>
These are features which allow the content model for a document element to be specified as having a certain form except for the inclusion of some other element, or the exclusion of some other element. They are also controversial, in that they do not add to the formal power of the notation but do require notational conventions more complex than the standard ones used in most formal language processing.
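<p>As an illustration only, exceptions of both kinds might be declared in a DTD roughly as follows. The element names used here (chapter, title, para, note, pb) are hypothetical and are not proposed as TEI tags.
<xmp font=mono>
<!-- inclusion: pb and note may occur anywhere within a chapter -->
<!ELEMENT chapter - - (title, para+)  +(pb | note)>
<!ELEMENT title   - - (#PCDATA)>
<!ELEMENT para    - - (#PCDATA)>
<!-- exclusion: nothing within a note may contain another note -->
<!ELEMENT note    - - (para+)         -(note)>
<!ELEMENT pb      - O EMPTY>
</xmp>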
We recommend the use of inclusion exceptions because they greatly simplify models in which ``incidental elements'' may appear anywhere. While such models could be rewritten without the inclusion exception, they might become much more complex, and the complexity would hide what is really happening.
The inclusion is valid in the entire structural subtree rooted at the node where the inclusion is specified. This can be modified by using exclusion exceptions. These say that a model can<emph>not</emph> contain a specified element. This, too, simplifies models, especially where models in different contexts are the same save that some elements possible in one context are excluded from the other. Exclusion is not reversible, so recursive definitions in which exclusions appear are tricky. Exclusions must be used ``with care'' in such settings.
<h1>CONCUR
<table><tblbody cols='1 20 25 30 35 40 45 50'>
<row>CONCUR <c>C: Y <c>I: Y <c>S: Y <c>P: Y
</table>
This is the SGML mechanism for specifying two different ``views'' (hierarchies) of a document. It is imperfect, but if we excluded it we would have to design another mechanism to accomplish the same thing, since we view the ability to specify multiple views as a necessity in the TEI.
<h1>SUBDOC
<table><tblbody cols='1 20 25 30 35 40 45 50'>
<row>SUBDOC <c>C: Y <c>I: Y <c>S: Y <c>P: Y
</table>
This feature allows the definition of a self-contained description of a part of a text with its own environment (entity definitions, element definitions and short reference maps) that stands apart from the main document. Once a SUBDOC is activated in the document, the current DTD is suspended and the DTD defined by the SUBDOC becomes current. The suspended DTD becomes active again when the end of the SUBDOC is reached. Thus SUBDOC defines context switching via a stack. No information can be exchanged between the ``stacked'' contexts. SGML requires that both the text and the DTD of the SUBDOC be in the same file.
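<p>As a sketch only, a subordinate document might be declared in the main DTD and referenced from an attribute value as follows. The entity name (glossary), the element name (inset) and the attribute name (doc) are hypothetical; the system identifier is omitted, so that locating the entity is left to the system.
<xmp font=mono>
<!-- in the DTD of the main document -->
<!ENTITY  glossary SYSTEM SUBDOC>
<!ELEMENT inset    - O EMPTY>
<!ATTLIST inset    doc ENTITY #REQUIRED>
<!-- in the document instance -->
<inset doc=glossary>
</xmp>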
The attractive aspect of this feature is the ability to reduce the complexity of the main DTD by keeping specific treatments apart from the general environment, and switching from the main DTD to a more specific one when the text requires it (treatment of special structures, support of embedded text with different characteristics, processing of complex documents with more than one type of text content).
This feature allows subordinate documents to be included in different ways. We allow it to be used only in attribute values.
<h1>LINK
<table><tblbody cols='1 20 25 30 35 40 45 50'>
<row>LINK <c>C: Y <c>I: N <c>S: N <c>P: Y
</table>
This feature can be used to define transductions. We do not think it is ``all it should be.'' We admit the possibility of a work item to define a better transduction scheme if requested to do so by the other committees.
<h1>``Lexical'' Issues
<table><tblbody cols='1 20 25 30 35 40 45 50'>
<row>tag delimiters <c>C: Y <c>I: Y <c>S: Y <c>P: Y
</table>
SGML by default expects tag delimiters to be ``$<$'' and ``$>$'' but we allow these to be changed as needed.
We want to avoid having to specify an SGML declaration with each document. (The SGML declaration is an optional part of a fully specified and marked document, containing details about lexical and other issues.) This means that we must be able to agree <ital>a priori</ital> on the values of things that would be specified in the SGML declaration so that it will be a ``TEI standard'' default.
<table><tblbody cols='1 20 25 30 35 40 45 50'>
<row>quantity and capacity sets <c>C: Y <c>I: Y <c>S: Y <c>P: Y
</table>
There are a number of quantities that are part of the SGML declaration. These appear to be there because of assumptions (unrealistic and limiting ones) about the implementation techniques that would be used for SGML parsers. The values used in the default (reference) concrete syntax are too limiting for the TEI.
In essence, our advice at this point is: do not be limited by these quantities in any way. We will later define TEI values that are ``big enough'' to deal with any situation that arises as the other committees do their work. Specifically, any desired element nesting can be used, and we will compute the nesting level parameter later. We expect that the maximum length of a name can be limited to 128 characters, and that of a literal to 32K.
There should be consistency in naming. We do not propose any rules here. The existing defaults for case sensitivity should be adhered to (sensitivity in entity names only) and existing defined entity names should be used. The character set used for names should, as far as possible, be the same set used for the document.
In document-dependent entities, entity names should be used without an attributed document type name only if the name is used to refer to the same entity in all the document types. Otherwise, all the document type names should be given explicitly.
<h1>Contexts Again
These observations can be made <ital>a posteriori</ital> on the details of contexts.
<ul>
<li>The choices for Capture and Processing are the same.
<li>The choices for Interchange and Storage are the same save for the ``don't care'' entry for OMITTAG in the Storage context. They can thus be considered to be the same, treating the ``don't care'' as ``N''.
<li>The choices for ``Capture and Processing'' differ from those for ``Interchange and Storage'' only in that SHORTREF, OMITTAG and LINK are allowed for C and P but not for I and S.
</ul>
<!>
<h1>Draft SGML declaration for the TEI
<xmp font=mono>
<!SGML ``ISO 8879-1986''
CHARSET
  BASESET  ``ISO-8859-1''
  DESCSET  0 256 0
CAPACITY SGMLREF
  TOTALCAP 35000
  ENTCAP   35000
  ENTCHCAP 35000
  ELEMCAP  35000
  GRPCAP   35000
  EXGRPCAP 35000
  EXNMCAP  35000
  ATTCAP   35000
  ATTCHCAP 35000
  AVGRPCAP 35000
  NOTCAP   35000
  NOTCHCAP 35000
  IDCAP    35000
  IDREFCAP 35000
  MAPCAP   35000
  LKSETCAP 35000
  LKNMCAP  35000
SCOPE DOCUMENT -- use SCOPE INSTANCE to switch to ref. syntax for prolog --
SYNTAX
  SHUNCHAR CONTROLS 0 1 2 3 4 ... 31 127 255
  BASESET  ``ISO 8859-1 ...''
  DESCSET  0 256 0
  FUNCTION RE 13
           RS 10
           SPACE 32
           TAB SEPCHAR 9
  NAMING   LCNMSTRT ``aeiouaeiouaeiou''
           UCNMSTRT ``AEIOUAEIOUAEIOU''
           LCNMCHAR ``B - .''
           UCNMCHAR ``B - .''
           NAMECASE GENERAL YES
                    ENTITY NO
  DELIM    GENERAL SGMLREF
           DSO ``(*''
           DSC ``*)''
           DTGO ``(*''
           DTGC ``*)''
           SHORTREF SGMLREF
  NAMES    SGMLREF
  QUANTITY SGMLREF
           ATTCNT 40
           ATTSPLEN 960
           BSEQLEN 960
           DTAGLEN 16
           DTEMPLEN 16
           ENTLVL 32
           GRPCNT 32
           GRPGTCNT 96
           GRPLVL 16
           LITLEN 32768 -- default is 240 --
           NAMELEN 128 -- default is 8 --
           NORMSEP 2
           PILEN 240
           TAGLEN 960
           TAGLVL 24 -- enough? --
FEATURES
  MINIMIZE DATATAG NO OMITTAG NO RANK NO SHORTTAG YES
  LINK     SIMPLE NO IMPLICIT NO EXPLICIT NO
  OTHER    CONCUR YES SUBDOC YES FORMAL YES
</xmp>
</body>
</appendix>
</gdoc>