Introductory Notes on the TEI Guidelines

1 Basic Characteristics and Design Goals

August 16, 1990

This list has had a lot of recent subscriptions in response to the announcement that the TEI Guidelines are now available in draft form; TEI-L now goes to over 275 addresses. The 600 pre-printed copies of the draft, which we originally thought might be a bit too many to get rid of in the year before version 2 is ready, may at this rate all be spoken for before the month of August is out.

We're happy about all the interest, because it suggests that many others agree with the organizers of the TEI that we need methods for text encoding suitable for multiple uses of the same texts, for exchange of texts among researchers and others interested, for languages other than English and scripts other than Latin, and which will work with all kinds of text, not only the most common.

This list should play a big role in the revision of the Guidelines, and to help get the relevant discussion started, it might be a good idea for the editors to discuss from time to time some of the background to the current draft -- a sort of TEI tutorial over the net. This will, we hope, provoke some questions from participants in the list, and will lead over time to discussions of the many thorny technical and other issues involved with a project like this.

Much of what we say at the beginning may seem (or be) basic and uncontroversial, and those who like fireworks may wish we would jump right to the burning questions and get some arguments going. It appears though that some of the noncontroversial basics are essential to even understanding some of the trickier burning questions, so we are going to go slow at first. Anyone who wants to start a second thread on any burning issue of their choice may do so.

We count on the many participants in this list who are serving on the TEI working committees to jump in and amplify or supplement our account wherever you see fit.

WHO IS THE TEI FOR?
Let's start with something fairly simple: who is the TEI for and what are the basic goals?

The goals of the TEI are to define a format for encoding texts in a linear data stream which is suitable for the interchange of textual material between researchers, and to provide concrete recommendations, for those who can use them, as to what features of texts should usually be recorded. As the letterhead puts it, the TEI is an "Initiative for Text Encoding Guidelines and a Common Interchange Format for Literary and Linguistic Data".

Note some non-obvious points:

1. The TEI came out of the community of those using computers to do research on or with texts, and they are our primary constituency. That is: literary scholars, linguists, computational linguists, historians, philosophers, theologians, philologists, people working on machine translation, ... you name it.

   The publishing industry, database vendors, software developers, and others with commercial interests in electronic text are interested in the TEI, and many are sharing their expertise with us, but they are not the *primary* constituency. If research and publishing were to turn out to require different things, the TEI would go with the needs of researchers.

   It's important to note that this is mostly an imaginary issue: so far the requirements of all these groups seem astonishingly close to identical. Very concretely: I have not encountered a single problem faced by humanists which does not have an analogue in a problem faced by linguists, and one in a problem faced by publishers or commercial database vendors. And vice versa. Sometimes the problems look different, but so far most differences have proven superficial. We believe that what will work for researchers must work for other applications as well.
   So in a real sense, though researchers are the primary constituency, the real intended constituency is everyone who works with electronic text in *any* way, and wants to be able (a) to move the text from system to system without information loss, or (b) to use the text for more than one thing.

2. One major intended use for the Guidelines is as a specification for an interchange format. Transfer between researchers, machines, programs, or networks would use such a format very simply: as a description of what my text will look like when it passes from my hands to yours, or what I would like yours to look like when yours reaches me.

   An interchange format does not tell anyone what to encode, any more than the ASCII code tells us how to write novels or manuals. What is encoded is the intellectual responsibility of the researcher; no one can take that responsibility away.

3. The other major intended use is as a guide for those encoding texts for general use (and one hopes that that includes most of those encoding texts). The Guidelines should provide a sample set of textual features that many people have found useful in textual work, together with ways of encoding those features. No one is required to encode all those textual features, but the list should (if we do our work right) be taken seriously as a checklist of what the community as a whole tends to find useful.

Software developers should also benefit from the Guidelines in both these ways: as a definition of an export-import format (or as an internal file format, if you wish!) *and* as a checklist of textual features commonly thought important. I suppose many of us have seen software which suffered from its makers' sometimes unconsciously narrow conception of the kinds of texts it would be used for -- the Guidelines should be useful as a sort of brain-storming, concept-broadening tool for developers.
1.1 Basic requirements

The basic requirements for a text encoding scheme have been stated in the NEH proposals for TEI funding. (Quick tip of the hat to the NEH, the EEC, and the Mellon Foundation for their funding. Without them, it wouldn't be happening nearly as fast.)

An encoding scheme is any (systematic?) method of representing or encoding textual data in machine-readable form. Typically, an encoding scheme must include:

1. methods for recording the characters in the text (including diacritics, special symbols, non-Roman alphabets, etc.)
2. conventions for rendering a text in a single linear sequence (specifying how footnotes, end-notes, critical apparatus, parallel texts, and other non-linear complications are handled)
3. methods for recording logical divisions of texts (e.g. book, chapter, paragraph; act, scene, speech, line; ...)
4. methods for recording analytic information like literary or linguistic analysis
5. conventions for delimiting in-line comments and other ancillary material
6. conventions for identifying the text being encoded and those responsible for encoding it

To create a single encoding scheme suitable for common use, the TEI first formulated (in the original planning conference in 1987 and in working papers since) the following requirements for the scheme to be developed:

1. It should specify a common interchange format.
2. It should provide a set of recommendations for encoding new textual materials.
3. It should document the existing major schemes and investigate the feasibility of developing a metalanguage in which to describe them.
4. It must be a set of guidelines, not a set of rigid requirements.
5. It must be extensible.
6. It should be device- and software-independent.
7. It should be language-independent.
8. It should be application-independent.

As design goals, it was specified that the guidelines should:

1. suffice to represent the textual features needed for research
2. be simple, clear, and concrete
3. be easy for researchers to use without special-purpose software
4. allow the rigorous definition and efficient processing of texts
5. provide for user-defined extensions
6. conform to existing and emergent standards

We can expatiate on these at great length, if anyone isn't sure what we mean by them, but I won't here.

1.2 How we stand

The current draft, be it noted, does *not* solve all these problems or wholly fulfill all of the design goals. It wasn't expected to -- some of the hard problems were intentionally saved for the second cycle. Here is my personal checklist of where we stand with respect to the goals listed above (which as you can tell from the overlaps were taken from different documents).

* The current draft (version 1.0) does specify both an interchange format and recommendations, though perhaps not as explicitly as one might have expected. It may need to become more explicit in defining the interchange format.

* It does not document any existing encoding schemes, though work is continuing on that topic.

* The metalanguage and syntax committee did consider the formulation of a metalanguage for defining existing schemes, but decided against it. Descriptions will take the form of prose and of algorithms for translating from a given scheme into the TEI scheme, using a variety of existing software tools (e.g. sed scripts, Rexx execs, Snobol programs, or even yacc and lex code).

* It is certainly a set of guidelines rather than requirements, and device- and software-independent. It is also, however, not fully implemented in software -- this has the advantage that the design is not unduly biased by implementation issues, but it makes it hard to demonstrate or validate the scheme.

* It is extensible, but the mechanisms for specifying extensions need work to be usable without heavy-duty knowledge of SGML.
* It has no bias that we have consciously put there in favor of any one language, but the TEI has not addressed, let alone solved, the problems of languages other than those already most effectively covered by international data-processing standards. The current draft is silent on topics where people need the most guidance: older forms of languages not covered by ISO standards, Asian scripts, treatment of bidirectional text (e.g. Hebrew and English), and so on. We expect to work on these in the next two years, but for some issues there is little we can do but document and call attention to existing methods of handling these problems (e.g. ISO 10646 or the Unicode effort -- two unfortunately incompatible approaches to handling Chinese and other Asian scripts).

* It does provide what we think is an adequate *basis* for handling all the known needs of research; it probably needs extension in many areas to provide not just the *basis* for the required solutions, but some version of the solutions themselves.

* It's as simple and clear as we could make it, but we expect to hear about lots of obscurities in the draft. (Let's say it again -- please let us know if there are things that aren't clear!)

* It can be used without special software, at least at the simpler levels. A lot of work is needed, however, before we have something we can hand to the average literary scholar who uses Nota Bene or WordPerfect or Microsoft Word and wants to create a TEI-conformant file. (Volunteer macro-writers sought!)

* So far, at least, the Guidelines can be used as specified in the ISO standard which defines SGML. There are some technical reasons which mean that the TEI Guidelines may not be definable as a "conforming application" of SGML -- these mostly relate to syntactic freedoms of SGML which are forbidden by the current version of the Guidelines.

That's it for the basic goals of the TEI.
Coming up: discussions of SGML basics, the TEI tags for core structural features, other core tags in the TEI scheme, and character-set issues. After that, we should be able to raise some of the more advanced questions.
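
A postscript for the programmers on the list. Section 1.2 mentions describing existing schemes by means of algorithms that translate them into the TEI scheme. The following is only a rough sketch of what such a translation might look like, written in Python rather than any of the tools named above. Everything in it is an assumption made for illustration: the source scheme (a line "@C n" opens chapter n, and every other non-blank line is one paragraph) is hypothetical, and the output tag names "chapter" and "p" are placeholders, not tags taken from the Guidelines.

```python
import re

# Hypothetical source scheme: a line "@C <n>" opens chapter n; every other
# non-blank line is one paragraph. Output tag names are illustrative only.
CHAPTER = re.compile(r"^@C\s+(\w+)$")


def escape(text):
    """Protect characters that act as markup delimiters in the output."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def translate(lines):
    """Translate the hypothetical scheme into explicitly tagged text."""
    out = []
    in_chapter = False
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no information in this scheme
        match = CHAPTER.match(line)
        if match:
            if in_chapter:
                out.append("</chapter>")  # close the previous chapter
            out.append('<chapter n="%s">' % match.group(1))
            in_chapter = True
        else:
            out.append("<p>%s</p>" % escape(line))
    if in_chapter:
        out.append("</chapter>")
    return out


sample = ["@C 1", "Call me Ishmael.", "", "@C 2", "Fish & ships."]
print("\n".join(translate(sample)))
```

The point of the sketch is only that the implicit conventions of the source scheme (a chapter ends where the next one begins) become explicit start and end marks in the output, so that nothing has to be guessed by whoever receives the text.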