Feature System Declarations and the Interpretation of Feature Structures

Gary F. Simons

Document Number: TEI AI1W3
Draft of 28 January 1991

Abstract

Discusses problems involved in interpreting the meaning of underspecified feature structures in TEI-conforming texts and then proposes that these be solved by requiring an external Feature System Declaration. The DTD for such a file is proposed and two sample FSDs, illustrating GPSG and systemic linguistics, are given.

1. INTRODUCTION

1.1 The problem

In writing any particular feature structure, analysts typically omit some features that occur in other feature structures used in the analysis of that language. What does it mean when a particular feature is not specified? At least the following interpretations are possible:

   - The value is unrestricted.
   - The feature takes a default value supplied by rule.
   - The feature does not apply in this context.
   - The feature is known to apply but its value is unknown.
   - No claim is made about the feature's value or applicability.

To avoid this ambiguity, the following universal atomic feature values are proposed for TEI feature structures:

   ANY       The form is compatible (or unifies) with all possible values of the feature.
   DEFAULT   The feature takes a specific default value which can be inferred by some general rule.
   N/A       The feature is not applicable in this context.
   ?         The feature is known to be applicable, but its value is not known.
   NO CLAIM  No claim is made about the feature's value or applicability. (If it is known whether or not the feature is applicable, use N/A or ? instead.)

But providing these special feature values still does not solve the problem of knowing how to interpret a given feature structure. The range of values possible for ANY or ? is still not known. The specific feature value to assign for DEFAULT is still not known, nor is the range of values for a feature specification which uses to permit any value other than the specific one given.

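The point that ANY and ? cannot be interpreted without a declared range can be illustrated with a small sketch. This is only an informal illustration in Python, not part of the proposal: the names Universal and possible_values are hypothetical, and treating ANY and ? alike as "expand to the declared range" is an assumption made for the example.

```python
from enum import Enum


class Universal(Enum):
    """The five universal atomic values proposed in section 1.1."""
    ANY = "ANY"              # compatible with all possible values
    DEFAULT = "DEFAULT"      # value inferable by some general rule
    NOT_APPLICABLE = "N/A"   # feature does not apply in this context
    UNKNOWN = "?"            # feature applies, but its value is not known
    NO_CLAIM = "NO CLAIM"    # nothing asserted about value or applicability


def possible_values(value, declared_range=None):
    """Return the set of concrete values that `value` may stand for.

    Assumption for this sketch: ANY and ? both expand to the declared
    range of the feature; every other value stands only for itself.
    """
    if value in (Universal.ANY, Universal.UNKNOWN):
        if declared_range is None:
            # Without an external declaration the value cannot be interpreted.
            raise ValueError("no range declared for this feature")
        return set(declared_range)
    return {value}
```

With a declared range such as ["ACC", "NOM"], possible_values(Universal.ANY, ["ACC", "NOM"]) yields both case values; without one, the call raises an error, which is the gap the rest of this paper sets out to close.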
The inability to interpret a feature structure could spring from a completely different kind of problem; the human reader of a feature structure might have no idea what the abbreviations for feature names and feature values stand for.

Finally, even though DEFAULT and N/A are available when needed, it is not parsimonious (and thus not typically desirable) to specify DEFAULT or N/A whenever a feature value can be inferred by a general rule or is known to be irrelevant by a general rule.

1.2 The proposed solution

To solve these problems, the working group of the A & I committee is proposing that a "Feature System Declaration" mechanism be set up. In a Feature System Declaration (or FSD), the designer of the feature system declares its formal properties in an external file which is separate from the analyzed text files or dictionary files which might use that feature system. Specifically, an FSD declares the names of the features that are used, the range of values allowed for each feature, rules for inferring default values, and constraints on the co-occurrence of features and values. The FSD functions further as a place for the analyst to provide prose description of the features and their possible values.

The TEI draft guidelines already employ the mechanism of using an auxiliary file (with a specialized DTD) to interpret the meaning of what is encoded in a TEI-conformant document. This is the mechanism of the Writing System Declaration (or WSD). For every unique value of the LANG attribute used in a text, there should be a matching WSD which documents how data in that language has been encoded, and thus how byte codes in content marked as being in that language are to be interpreted. The proposed FSD mechanism is therefore analogous to the WSD mechanism that already exists.

In order for application software to use such declarations to aid in automatic interpretation of encoded texts, or for human readers to find the appropriate declarations, there must be a formal link from the texts to the declarations. As far as I can tell, this is missing in the current draft. A logical place for this would be in the of the . For instance, could formally identify the WSD for a given language, and could identify the FSD associated with a given text file. The parallel to WSDs raises the question of whether we might want multiple FSDs associated with a single file, linked perhaps via the LANG or the NAME attribute of the feature structure.

2. THE FORMAL DEFINITION OF FEATURE SYSTEM DECLARATIONS

This section proposes a formal definition of Feature System Declarations in terms of an SGML document type definition (DTD). In developing this definition I have referred heavily to a recent landmark work dealing with feature structures, namely, Generalized Phrase Structure Grammar by Gerald Gazdar, Ewan Klein, Geoffrey Pullum, and Ivan Sag (Harvard University Press, 1985). I believe the formulation below to be adequate to handle the feature system they propose in the appendix for English.

A feature system declaration has four parts: a header, a specification of feature ranges, an optional specification of feature defaults, and an optional specification of feature co-occurrence constraints. That is,

We now elaborate each of the four components in turn, illustrating each by extracts from the feature system for English proposed by Gazdar, Klein, Pullum, and Sag.

2.1 The FSD header

The FSD.header, in analogy to a TEI.header, gives the necessary bibliographic documentation on the FSD. This includes identification of the language and purpose of the feature system, who developed it and when, details of revision history, references to literature on which the analysis is based, and so on. For the present, I define FSD.header as simply PCDATA.
I leave its elaboration to the TEI editors who are in a better position to make it consistent with the philosophy of TEI headers. Thus,

*** Note to TEI editors: This suggests that WSDs as well should have a header analogous to a TEI header, and very similar (if not identical) in definition to an FSD header. ***

The following is the FSD header for our sample based on the GPSG appendix:

   This sample FSD does not describe a complete feature system. It is based on extracts from the feature system for English presented in the appendix (pages 245-247) of Generalized Phrase Structure Grammar, by Gazdar, Klein, Pullum, and Sag (Harvard University Press, 1985). This sample was encoded by Gary F. Simons (Summer Institute of Linguistics, Dallas, TX) on January 28, 1991.

2.2 Feature range specifications

The specification of feature ranges consists of a sequence of range specifications, one for each feature in the system. For a feature to be sanctioned in feature structures, it must have a range specification. The feature attribute of the range tag declares the name of the feature. The content of the range specification may optionally begin with a prose description. Then follows a sequence of value declarations.

Feature values may be of three types: a predefined atomic value, an embedded feature structure, or an arbitrary text string. A name attribute in an tag is used to specify the name of a predefined value. In the case of values which are embedded feature structures, a name attribute is optional. When name is absent, any feature structure may occur as the value for that feature. When name is specified, only a feature structure with that name is allowed. Arbitrary text is declared as a legitimate feature value when, for instance, the feature is used to hold a word form of an open set.

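As an informal illustration of the three kinds of value just described, the following Python sketch checks whether a feature-value pair is sanctioned by a set of range specifications. It is not the proposed SGML encoding. The class and function names (FeatureStructure, RangeSpec, sanctioned) are hypothetical, the features CASE, AGR, and PFORM are borrowed from the GPSG feature system, and treating PFORM as open text is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class FeatureStructure:
    """A (possibly named) feature structure; hypothetical stand-in type."""
    name: Optional[str] = None
    features: Dict[str, object] = field(default_factory=dict)


@dataclass
class RangeSpec:
    """What one range specification allows as the value of a feature."""
    atoms: FrozenSet[str] = frozenset()     # predefined atomic values
    allows_structure: bool = False          # embedded feature structures allowed?
    structure_name: Optional[str] = None    # if set, only structures with this name
    allows_text: bool = False               # arbitrary text strings allowed?

    def permits(self, value) -> bool:
        if isinstance(value, FeatureStructure):
            if not self.allows_structure:
                return False
            # No name declared: any feature structure may occur.
            return self.structure_name is None or value.name == self.structure_name
        if isinstance(value, str):
            return self.allows_text or value in self.atoms
        return False


# Assumed sample ranges, loosely following the GPSG feature system.
RANGES = {
    "CASE": RangeSpec(atoms=frozenset({"ACC", "NOM"})),
    "AGR": RangeSpec(allows_structure=True),   # any category
    "PFORM": RangeSpec(allows_text=True),      # open set of word forms (assumption)
}


def sanctioned(feature, value, ranges=RANGES) -> bool:
    """A feature is sanctioned only if it has a range specification
    and the value falls within that range."""
    spec = ranges.get(feature)
    return spec is not None and spec.permits(value)
```

For example, sanctioned("CASE", "NOM") holds, while sanctioned("CASE", "GEN") and sanctioned("TENSE", "PAST") do not: the first fails the range, the second has no range specification at all.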
The DTD fragment for feature range specifications is thus as follows:

As an example, consider the following extract from Gazdar, Klein, Pullum, and Sag (1985:245):

   feature   value range
   CASE      {ACC, NOM}
   COMP      {for, that, whether, if, NIL}
   ADV       {+, -}
   AGR       CAT
   PFORM     {to, by, for, ...}

where CAT stands for any category (where category is equivalent to feature structure). The above would be encoded as follows in an FSD:

   specifies the form of complementizer used
   adverbial
   agreement for person and number
   word form of a preposition

2.3 Feature default specifications

The specification of feature defaults consists of a sequence of default specifications. Defaults need not be specified for every feature. If a default is specified, then it is possible to infer the correct value for an obligatory feature when it is missing from a feature structure. Otherwise, a feature structure with a missing obligatory feature will be viewed as in error.

The feature attribute of the default tag specifies the name of the feature for which the default is being declared. The first embedded element specifies the default value for the feature. A second optional embedded specifies a guard (or condition); that is, the specified value is the default only when the conditions enumerated in the guard are met.

The DTD fragment for feature default specifications is thus as follows:

As an example, consider the following extract from Gazdar, Klein, Pullum, and Sag (1985:246-247):

   FSD 1:  [-INV]
   FSD 2:  ~[CONJ]
   FSD 9:  [INF, +SUBJ] --> [COMP for]

Some comments on notation: ~ means undefined (that is, not applicable). If a given feature value is in the range set of only one feature, then the feature name may be omitted, as with INF above which is a value of the VFORM feature. --> is the implication (if-then) operator of boolean logic. (The book has just one example -- number 10 -- which uses a biconditional implication; I am having trouble understanding this as a default specification.
To my way of thinking, a biconditional would require that both sides be fully specified already, and therefore couldn't be a default specification but would be a co-occurrence constraint.)

The above would be encoded as follows in an FSD:

2.4 Feature co-occurrence constraints

The specification of feature co-occurrence constraints consists of a sequence of conditional and biconditional tests. A particular feature structure is valid only if all tests are true for it.

The element encodes the conventional if-then operation of boolean logic which fails only if the antecedent is true and the consequent is false; otherwise, it succeeds. In feature co-occurrence constraints the antecedent and consequent are expressed as feature structures; they are considered true if they unify with the target feature structure. The element encodes the biconditional (if and only if) operation of boolean logic. It succeeds only when both antecedent and consequent are true, or both are false.

The DTD fragment for feature co-occurrence constraints is thus as follows:

As an example, consider the following extract from Gazdar, Klein, Pullum, and Sag (1985:246-247):

   FCR 1:   [+INV] --> [+AUX, FIN]
   FCR 7:   [BAR 0] <--> [N] & [V] & [SUBCAT]
   FCR 8:   [BAR 1] --> ~[SUBCAT]
   FCR 20:  ~([SLASH] & [WH])

Further notational conventions are here introduced. <--> is the biconditional operator of boolean logic; that is, "a <--> b" means "(a --> b) & (b --> a)". FCR 20 is the only constraint that is not expressed with an implication operator; I assume it to mean if either SLASH or WH is defined, the other cannot be defined, and have thus translated it below into two conditionals.

The above would be encoded as follows in an FSD:

3. ANOTHER EXAMPLE

It turns out that feature system declarations work equally well to describe the networks of system delicacy used in systemic linguistics. In fact, systemic networks seem to offer a graphic notation for representing feature ranges and constraints.

The following simple network is from R. A. Hudson, English Complex Sentences (North Holland, 1972), page 60:

              /
              |       | 1st-person
              |       |
              |   --> | 2nd-person
              |       |
personal      |       | 3rd-person ---|         | masculine
pronoun -->  <                        |         |
              |                       > -->     | feminine
              |                       |         |
              |       | singular -----|         | neuter
              |   --> |
              |       | plural
              \

This systemic network could be translated into a feature system declaration like the following:

   This is a sample "toy" FSD based on a systemic network for the inflection of English personal pronouns in R. A. Hudson, English Complex Sentences (North Holland, 1972), page 60. It was encoded by Gary F. Simons, Summer Institute of Linguistics (Dallas, TX) on January 28, 1991.

   major syntactic category
   subtype of pronoun
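
To make the correspondence between a systemic network and the ranges and conditional tests of section 2.4 more concrete, here is an informal Python sketch of one possible reading of the pronoun network. It is not the SGML encoding proposed in this paper. The feature names (CAT, TYPE, PERSON, NUMBER, GENDER) are hypothetical, the reading that gender is chosen exactly when the pronoun is third person singular is an interpretation of the diagram, and "unifies with the target" is approximated as "every feature in the pattern is present in the target with a matching value", which is stricter than full unification.

```python
DEFINED = object()  # pattern value meaning "feature is present, any value"

# Assumed feature ranges read off the network.
RANGES = {
    "CAT": {"pronoun"},
    "TYPE": {"personal"},
    "PERSON": {"1st-person", "2nd-person", "3rd-person"},
    "NUMBER": {"singular", "plural"},
    "GENDER": {"masculine", "feminine", "neuter"},
}


def holds(pattern, target):
    """True if every feature in `pattern` is present in `target`
    with the same value (or with any value, for DEFINED)."""
    return all(
        feat in target and (want is DEFINED or target[feat] == want)
        for feat, want in pattern.items()
    )


def conditional(antecedent, consequent):
    """If-then test: fails only when antecedent holds and consequent does not."""
    return lambda fs: (not holds(antecedent, fs)) or holds(consequent, fs)


def biconditional(antecedent, consequent):
    """If-and-only-if test: both sides hold, or both fail."""
    return lambda fs: holds(antecedent, fs) == holds(consequent, fs)


CONSTRAINTS = [
    # A personal pronoun must choose both person and number.
    conditional({"CAT": "pronoun", "TYPE": "personal"},
                {"PERSON": DEFINED, "NUMBER": DEFINED}),
    # Gender is chosen exactly when person is 3rd and number is singular.
    biconditional({"PERSON": "3rd-person", "NUMBER": "singular"},
                  {"GENDER": DEFINED}),
]


def valid(fs):
    """A feature structure is valid only if every feature is in range
    and all co-occurrence tests are true for it."""
    in_range = all(f in RANGES and v in RANGES[f] for f, v in fs.items())
    return in_range and all(test(fs) for test in CONSTRAINTS)
```

Under these assumptions, a structure for "he" (third person, singular, masculine) is valid, as is one for "we" (first person, plural, no gender), whereas a first person singular pronoun marked feminine is rejected by the biconditional test.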