Notes on sgmls handling of search for entities

C. M. Sperberg-McQueen
September 1993, rev. January 1994

The resolution of public entity names (i.e. how the entity manager
finds them in the file system or elsewhere) is implementation-dependent.
These notes, originally intended solely for my own use, describe how
the public-domain parser sgmls actually implements the mapping. Note
that the approach used by sgmls does not provide a completely arbitrary
mapping between entity name or formal public identifier and the file
identifier; this is not a problem for most users.

Let us start with the basic question: 'How does sgmls locate an
external entity when a PUBLIC identifier is used?' To that, the
high-level answer is:

1  First it checks for an environment variable named SGML_PATH, which
   you will therefore probably want to set beforehand to something
   useful. If it doesn't find one, it uses a hardcoded default path
   (check your system documentation).

2  The SGML_PATH value is a set of one or more system identifier
   patterns, separated by colons or semicolons (OS-dependent, check
   your documentation). Each pattern can have literal characters as
   well as special keywords of the form %X (the sgmls documentation
   calls these 'substitution fields'). E.g.

       %S;%N.%C;/sgml/pub/%N.%C

3  For each system ID pattern, in sequence, sgmls replaces the
   substitution fields with appropriate string values. It then looks
   for a file with that system identifier. If it finds it, it opens
   it. If it can't find it, it tries the next pattern.

The meaning of the substitution fields is described in file sgmls.doc,
which is part of the standard distribution. If a substitution field is
meaningless (e.g. because it refers to a different type of PUBLIC
identifier from the one being processed), the pattern is passed over.

The low-level answer follows.

N.B. in version 1.1 some of the handling of blanks, underscores, etc.
has changed, and may not be as described: check the documentation for
the version you are using. (Since I use operating systems with
case-folded file names, I am also careless about case sensitivity
issues; if you have case-sensitive file names, be careful.)

1 Simple example

Set an environment variable called SGML_PATH to a value. For the
moment, let's assume you set it by saying something like this:

    set SGML_PATH %S:%N.%X:%N.%V:%N.%C

The default path (at least on Unix systems) is noted below.

The SGML_PATH environment variable governs the search SGMLS performs
for public entities. Specifically, given declarations like these:

SGMLS will ask the file system for a series of files, in an order
determined by the SGML_PATH value. The first one found by the file
system is the one used by SGMLS.

For ISOLat1:

    %S                         /* no search, no system id given */
    %N.%X  ISOLat1.vpe         /* we are looking for a parameter entity
                                  with a device-dependent version */
           ISOLat1.ppe         /* if we find no device-specific form,
                                  we use the device-independent one */
    %N.%V  ISOLat1.LOCAL       /* the device-specific form is called
                                  LOCAL */
    %N.%C  ISOLat1.ENTITIES    /* the public text class is ENTITIES */

For ISOLat2, similarly, substituting ISOLat2 for ISOLat1.

For the other one:

    %S     p2idlist.entities   /* system id */
    %N.%X  p2idmss.ppe         /* no local version, only public */
    %N.%V                      /* no search, no local version
                                  specified */
    %N.%C  p2idmss.ENTITIES    /* the public text class is ENTITIES */

This is as far as I understand things at present; certainly sgmls has
succeeded in picking up isolat1.local, isolat1.entities, isolat1.ppe,
p2idmss.entities, and p2idlist.entities, under appropriate conditions.

2 Tree-structured directories, a more complex example

In a system with tree-structured directories, of course, more of the
public identifier can be used to find stuff.
The default path search for ISOLat1 would be, as I understand it:

    /usr/local/lib/sgml/%O/%C/%T:%N.%X:%N.%D

or

    /usr/local/lib/sgml/%O/%C/%T
        gives one or the other of the following (I'm not sure about
        the case mapping):
            /usr/local/lib/sgml/iso_8879-1986/entities/added_latin_1
            /usr/local/lib/sgml/ISO_8879-1986/entities/Added_Latin_1
    %N.%X
        gives ISOLat1.vpe, then ISOLat1.ppe
    %N.%D
        appears to give nothing (I don't think there is a data content
        notation given here)

3 Overview of public identifiers and sgmls substitution fields

For the record, the overall structure of formal public identifiers is:

    pubid   ::= owner '//' class ('-//')? desc '//' lang ('//' version)?
    owner   ::= 'ISO' data | '+//' data | '-//' data
    class   ::= CHARSET | ENTITIES | DTD | DOCUMENT | etc.
    desc    ::= data
    lang    ::= /* code from ISO 639 */ | 'ESC' n/n n/n
    version ::= data

The components of the public identifier are picked up by different
'substitution fields':

    %P  the entire public ID?
    %O  the owner name (minus the '+//' and '-//')
        (%I, %R, and %U can be used to ensure that a search pattern
        only succeeds for ISO owners, registered owners, or
        unregistered owners; they expand to the empty string in the
        appropriate case, and to null (failure) otherwise)
    %C  the class
    %T  the description of the public entity
    %L  the language code (EN, FR, etc. from ISO 639)
    %E  the character set escape sequence
    %V  the version descriptor

Various types of case folding and character substitution or character
deletion are performed, which should be described in the documentation
for the version of sgmls you are using (they are set at compile time,
and differ in the Unix, VMS, and DOS versions, to suit the operating
systems).

Still other substitution fields pick up other parts of the entity
declaration:

    %D  the entity's "data content notation". I am not sure, but
        believe what is meant by this is the notation name given for
        an external entity declared as being in a specific notation.
        In the declaration the name 'tiff' is the notation name.
    %N  the entity name (the name used in references to this entity)
    %P  the public identifier (the whole string, it appears)
    %S  the system identifier (i.e. in the string 'bar.doc')
    %X  a string chosen from a rather complex table, depending on
        whether sgmls is searching for a data entity, subdocument
        entity, general text entity, parameter entity, dtd, or lpd,
        and on whether it is declared without a public identifier,
        with a public identifier, with a device-dependent version
        string or without one
    %Y  a string chosen from a simple table depending on whether sgmls
        is searching for a subdoc entity, a data entity, a text
        entity, a parameter entity, a dtd, or an lpd
    %A  causes the search to fail if the formal public identifier
        contains an 'unavailable text' indicator
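
The search described in these notes can be sketched in a few lines of
Python. This is only an illustration under simplifying assumptions, not
how sgmls itself is written. The function names (entity_fields, expand,
candidates, locate) are invented here; %X, %Y, %D, and %A are left out;
the 'unavailable text' indicator is not handled; and no case folding or
character substitution is applied, so the expanded names will differ
from what a real sgmls build asks the file system for. The public
identifier in the usage lines is the usual one for ISOLat1, supplied
here as an example value.

```python
import os


def entity_fields(name, pubid=None, sysid=None):
    """Build a {letter: value} table of substitution fields (hypothetical helper).

    A letter that is absent from the table means the field is undefined
    for this entity, which makes any pattern using it fail.
    """
    fields = {"N": name}
    if sysid is not None:
        fields["S"] = sysid
    if pubid is None:
        return fields
    fields["P"] = pubid

    # Owner type: '+//' registered, '-//' unregistered, otherwise assumed ISO.
    if pubid.startswith(("+//", "-//")):
        kind = "R" if pubid[0] == "+" else "U"
        rest = pubid[3:]
    else:
        kind, rest = "I", pubid

    parts = rest.split("//")
    if len(parts) < 3:
        return fields  # not a formal public identifier; leave the rest undefined

    fields[kind] = ""  # %I, %R, %U expand to the empty string when they apply
    fields["O"] = parts[0]
    cls, _, desc = parts[1].partition(" ")
    fields["C"] = cls
    fields["T"] = desc
    # The third component is an escape sequence for CHARSET, else a language code.
    fields["E" if cls == "CHARSET" else "L"] = parts[2]
    if len(parts) > 3:
        fields["V"] = parts[3]
    return fields


def expand(pattern, fields):
    """Replace each %X in pattern; return None if any field is undefined."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "%" and i + 1 < len(pattern):
            value = fields.get(pattern[i + 1])
            if value is None:
                return None  # meaningless field: the whole pattern is passed over
            out.append(value)
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


def candidates(path, fields, sep=":"):
    """List the file names that would be tried, in SGML_PATH order."""
    names = []
    for pattern in path.split(sep):
        name = expand(pattern, fields)
        if name:  # skip failed patterns and empty expansions
            names.append(name)
    return names


def locate(path, fields, sep=":"):
    """Return the first candidate that exists in the file system, else None."""
    for name in candidates(path, fields, sep):
        if os.path.exists(name):
            return name
    return None


if __name__ == "__main__":
    f = entity_fields("ISOLat1", "ISO 8879-1986//ENTITIES Added Latin 1//EN")
    # No system id and no version, so %S and %N.%V are passed over.
    print(candidates("%S:%N.%V:%N.%C", f))  # prints ['ISOLat1.ENTITIES']
```

Keeping undefined fields out of the table, rather than storing empty
strings for them, is what lets a pattern such as %N.%V drop out of the
search when no version is declared, while %I can still succeed with an
empty expansion for an ISO owner. The separator is a parameter because
the notes mention both colons and semicolons.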