=========================================================================
Date:         Mon, 4 Mar 91 11:57:00 EDT
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         John Lavagnino
Subject:      Comments on Critique by Literature Working Group

I've been wondering at the lack of response on this list to the report released last month by the Literature Working Group. Here, at any rate, are my comments on the most important issues they raise---comments which skip over the questions about the content and audience of TEI P1, and about its relation to actual software, because my notes on those subjects would add nothing to what Michael Sperberg-McQueen says in his Valentine's Day progress report.

1) ``The Perspective of the Literature Scholar'' as seen by the Working Group

The group's report rests on an assumption about what the encoding of literary texts is for, and how it's done, that doesn't reflect the breadth of actual practice. Section 1.B of their report talks about the ``pragmatics of work on literature texts'' and states their assumption: that the only reason you enter literary texts into a computer is to make mechanical analyses of large amounts of previously-edited text. This is one kind of work the TEI guidelines need to support. But it's not the only one, and other kinds of work have different needs, which the group does not take into consideration when making many of its absolute statements about what should and should not be.

For example, they assert that descriptive markup is always preferable. Their argument sounds dubious to me even for the kind of work they're talking about; but it's entirely irrelevant for the editors of modernized texts, in which the encoding can indicate the meaning rather than the appearance with ease and certainty---because the editor who chooses to italicize a word is alive and present, ready to swear to a belief that it is a foreign word. This is not a trivial application. The number of modernized texts produced is far greater than the number of critical editions, and having them available in a standard electronic encoding would be of great value. But the Working Group's report assumes that this kind of work does not exist and is not of interest---a position it takes with respect to many other computer applications that literature scholars have thought up.

In justification for this position, the report claims that the survey conducted by the Working Group revealed that everybody agrees with them in wanting a very minimal markup, and in opposing the general distribution of texts containing interpretive information: ``Literature scholars are not interested in, in fact many object vehemently to, the perspective of obtaining texts which already contain - explicitly or implicitly - literary interpretations. The responses and comments elicited by the Survey bear eloquent witness to this.''

This claim is false. The report released by the Working Group on their survey turned up, on the question of interpretive information, 15 people who considered it ``essential'' or ``important,'' and 20 who thought it ``should not be included.'' This does not justify the group's sweeping claim about what all literary scholars want. What that report instead shows is that some people think an encoding for interpretive information is a splendid idea---the Survey elicited eloquent comments to this effect---and that others hate the idea.
The Working Group should sort out for us the reasons behind both positions, and suggest how to accommodate both groups of scholars. Instead, they've decided that one group of scholars just doesn't matter.

2) The Working Group's notion of the TEI's value

The Working Group also seems pretty convinced that you'd never want to follow the TEI guidelines for anything except writing tapes to ship off to the Oxford Text Archive; that they're obviously not appropriate for ``local'' use. Their report asserts that the guidelines are only for ``interchange and possibly archival purposes''---``interchange'' here evidently meaning ``between distantly-separated people,'' and never just ``between programs on your computer,'' as TEI P1 1.1.3 suggests. The report tells us: ``SGML should not take precedence over the needs of scholars.''

I find this surprising: the Working Group never seems to have considered that SGML---or some encoding standard for use in literary programs---is itself a need of scholars. I always assumed that standardization would help to bring us better software for literary computing, because it would make it possible to get general tools for processing texts; right now you can only use the ones designed for the particular encoding you've chosen, or invented locally. The Working Group seems to think that it's better if we all use incompatible home-grown encodings (and hence only software we've written locally), and translate to SGML only for communication with the outside world. (Of course, if it's only for such communication nobody's ever going to use it.)

The only justifications they give for local encoding are archaic: it's easier to type, and you can read it, whereas SGML encoding of any complexity is hard to read. These are not serious arguments against a scheme which is not intended for data entry and is not intended to serve as a visual representation of a text for people to read. I, at least, am sticking with SGML.

John Lavagnino
Department of English and American Literature
Brandeis University
Waltham, MA 02254 USA
Internet: lav@binah.cc.brandeis.edu
Bitnet: lav@brandeis

=========================================================================
Date:         Mon, 4 Mar 91 08:59:47 PST
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         Rindfleisch@SUMEX-AIM.STANFORD.EDU
Subject:      Away from my Mail

I will be gone and not reading my mail until Tuesday, March 19. Your message regarding "Comments on Critique by Literature Working Group" will be read when I return. If your message concerns something urgent, please contact Monica Wong (Wong@SUMEX-AIM) or phone my office at (415) 723-5569.

Tom R.

=========================================================================
Date:         Tue, 5 Mar 91 20:41:48 MET
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         Harry Gaylord
Subject:      Coptic and Greek Papyrology

If you do not work on Coptic texts or in Greek papyrology, please disregard this message.

Greek papyrologists need more symbols than other researchers in Greek texts, and the new standards for character sets seem to concentrate on ordinary Greek texts. If you need more characters in your work than are described in the TLG complete set, please send me a list of the additional character descriptions.
I will include the necessary characters either in the Dutch Standards Institute commentary on the proposals for the new ISO 10646 standard or in entity declarations worked out as set out in the ISO SGML standard.

Coptic has been removed from the proposed ISO 10646 standard. If you have proposals for adding Coptic characters, please contact me. There are a number of Coptic characters attached to the Greek section of UNICODE, but I am not sure whether this set is complete.

Harry Gaylord (galiard@let.rug.nl)
Groningen University, The Netherlands

=========================================================================
Date:         Wed, 6 Mar 91 13:25:00 +0100
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         Jan Hajic
Date:         6 Mar 91 13:24 +0100
From:         Jan Hajic
To:
Message-ID:   <224*haj@divsun.unige.ch>
Subject:      Prague Summer School

*** SUMMER SCHOOL IN COMPUTATIONAL LINGUISTICS ***
*** Formal and Computational Models of Meaning ***

TIME AND PLACE: July 8-21, 1991, Prague, Czechoslovakia

ORGANIZERS: Faculty of Mathematics and Physics and Faculty of Philosophy, Charles University

TEACHERS: Jurij D. Apresjan, B.T.Sue Atkins, Christian Boitet, Jens Erik Fenstad, Charles J. Fillmore, Eva Hajicova, Petr Sgall, George Lakoff, Martha E. Pollack, James Pustejovsky, Mats Rooth, Hans Uszkoreit, Wolfgang Wahlster

For more information, contact:
e-mail: MATRACE@CSPUNI12.BITNET (or UMLEH@CSEARN.BITNET)
MFF UK - linguistics, c/o Dr. Eva Hajicova
Malostranske nam. 25, 118 00 Praha 1, Czechoslovakia
Voice: +42-2-532136
Fax: +42-2-847688 (attn. MFF UK linguistics)

=========================================================================
Date:         Fri, 8 Mar 91 10:59:56 DNT
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         Hans Jørgen Marker
Subject:      Other text encoding initiatives?

At the Danish Data Archive we have been approached by the national council for standardisation, DS. They are involved in an international cooperation on laying out standards for encoding text. They were most astonished when we informed them of the TEI, and none of their international partners had ever told them of the TEI. The question they raised was: how is the TEI related to standardisation measures carried out on the governmental level for the standardisation of text encoding? Could you people out there help me with the answer?

Hans Joergen Marker
Danish Data Archives

=========================================================================
Date:         Fri, 8 Mar 91 15:45:37 -0500
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         Don Walker
Subject:      Away from the office from 8 to 25 March

I will be in California and then in Tempe at the ACH/ALLC Conference. For urgent matters, contact my secretary Elaine Molchan at em@flash.bellcore.com or (+1-201)829-4594 for information on how to reach me there.

Don Walker

=========================================================================
Date:         Thu, 14 Mar 91 09:57:07 EST
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
Comments:     Warning -- original Sender: tag was
From:         Elli Mylonas
Subject:      Response to Final Critique

A Response to the Literature Working Group's "Final Critique"

Introduction

Tagging and Interpretation

All markup is interpretation.

Most texts are already interpreted.

All text entry is interpretation.

Tagging is scholarly work.

Presentational markup is also interpretation

The Importance of Descriptive Markup

DTDs

A DTD Helps Create Software and Helps Formalize Interpretations

The Apparent Loss of Flexibility Created by a DTD is an Illusion

Different Kinds of Scholarship & Texts

Only Texts that are Based on Book Technology are Discussed

Reference Systems are more complex than just pages & lines

Verbosity of Tagging / Localization

Verbosity is not What it Seems

Examples of Software and Macros for Working with SGML

Introduction

In their "Final Critique", Paul Fortier and the Literature Working Group present an evaluation of the TEI guidelines (1.1) as they apply to the tagging of literary texts. Following are comments on several of the points made in the "Final Critique." However, before presenting a detailed response, we discuss some facts about the solutions chosen by the TEI. We will also explicitly present some of the theoretical assumptions on which our responses are based.

SGML serves to facilitate our work as preparers of texts for other scholars to use and as scholars ourselves. SGML provides a general mechanism that may easily be used to encode any structures that can be abstracted as a hierarchy. It also allows encoding of more complex non-hierarchical structures, though with a concomitant increase in the complexity of the markup. The information represented in this way can cover a large range of different semantic domains -- typography, cross-reference structures, metrical and verse patterns, imagery and discourse structures, even the physical state of a text. For most types of text, certain text features provide a basic level that most readers or users of a text would require in order for the text to be useful. Examples of such features are those which are conventionally expressed by the appearance of a text.
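As a minimal sketch of the kind of hierarchy meant here (the element names are our own, chosen for illustration, and not a prescription of any particular TEI tag set):

   <anthology>
     <poem>
       <title>The Sick Rose</title>
       <stanza>
         <line>O Rose thou art sick.</line>
         <line>The invisible worm</line>
       </stanza>
     </poem>
   </anthology>

The nesting of line within stanza within poem is exactly the sort of containment that SGML expresses directly; overlapping features, such as a metaphor running across a stanza boundary, require the more complex mechanisms alluded to above.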

It is also important to remember that SGML encoding adds great value to electronic texts because electronic texts are not only interchanged by different researchers with differing research agendas and hardware, but may also be subjected to different kinds of processing by the same person on the same computer. SGML makes an electronic text more versatile and amenable to processing by computer by providing unambiguous indications of those features that a scholar deems important. This processing ranges from output on different devices, to statistical analysis, to standard text searches. At the same time, moving a text that has been written and printed in book form into electronic form will always entail some compromise. We must try to retain as much information as possible, and to be as general as we can in not excluding features that other readers and users of the text may want. However, even if we photograph each page of a book and present it as a page image, there may still be information on the page that is not represented in the resulting computer file.

In order to maximize our ability to share these texts and to encourage the development of software for using and analyzing them, it is important to minimize the needless and unmotivated diversity of basic tagging that would otherwise result. Indeed, as text entry projects have increased in number, the current range of encoding schemes is already creating problems in sharing texts. Although many of the respondents to the survey did not appear to be interested in interchange, it is the experience of the writers that as soon as they have a text on-line, they begin to receive requests for that text. Also, a number of the larger text projects have the dissemination of texts as their primary goal. This is why the TEI must come up with a basic tagging vocabulary, and define the basic structures that are likely to be commonly tagged. Of course the textual features that scholars may wish to encode are without limit -- this is why the TEI does not attempt to preempt scholarly creativity, and provides several means to extend and supplement its recommended encoding scheme.

Finally, texts are not studied only by scholars of literature, nor is literature the entire scope of the TEI--the same text may prove informative for linguists, historians, philosophers and other scholars from disciplines with their own particular interests and methodologies. It is prudent, when encoding a text, to provide as much generally useful information as possible, so the text may be of value to many different fields and disciplines. When discussing encoding schemes, the range of media on which a text might have been written must also be considered: there are texts whose forms vary from collections of papyrus fragments to the increasing number of texts that were created in machine-readable form and may never have existed as paper publications. /* ----------------------------------------------------------------- */

Tagging and Interpretation

All markup is interpretation.

The Literature Working Group makes this statement, and we certainly would not disagree with it. Indeed, we would go even further, and say that interpretation is required for all sorts of markup, both presentational and descriptive. Word boundaries, italics, themes, and font shifts are all subject to opinion. Some perhaps are more controversial than others, but even straight physical descriptions can cover a wide range of levels of detail and precision. For any particular purpose, some levels of markup are more relevant than others. For example, Antonio Zampolli tells us that the lexicographer concerned with building a machine-parsable dictionary may be indifferent to font shifts. (Discussion at TEI workshop, Chicago, 9/90.)

Most texts are already interpreted.

All texts, except perhaps an author's manuscript, contain interpretation. Even if the edition is not critical or canonical, it nevertheless contains the interpretations of the publisher, editor, and in some cases, even the compositor. The Literature Working Group points out that scholars tend to work with a canonical or prestigious version of a text which is recognized by those engaged in serious professional work. (paragraph 14 of the Critique) Such texts gain their value from the interpretive work that has been put into them by the scholar who created the edition. There are cases where the exact preservation of the physical presentation of a text is important. Such cases require detailed presentational markup, and are discussed below.

All text entry is interpretation.

Entering a text that is a primary source is itself a task that entails interpretation. Whether it is the scanner operator or the scholar, someone has to decide which specific features on a specific page are part of the tagging scheme. In many cases, scanner operators are working from (manually) marked up copies of texts, which have been marked by a scholar to disambiguate features that could otherwise be confused. At other times, the scanner operator is someone who can perform the disambiguation and make the decision her- or himself. Finally, most texts are proofread by a person who has the knowledge and the authority to make decisions about the tagging.

Tagging is scholarly work.

Before a text is entered into electronic form, someone has to make the decisions about what features are to be marked and how they are to be marked. The decision must also be made as to where the chosen features actually appear, so they may be entered correctly (see above). Only experts who are intimately acquainted with the texts and related scholarly problems involved can make these decisions, develop the tag sets and tagging schemes, and instruct the encoders. The work of devising markup schemes and Document Type Definitions is not an easy task, not so much because it is technically difficult, but because it involves complex decisions about the final encoding of the electronic document. These decisions must be based on a detailed and comprehensive knowledge of the form and content of the text.

It is also not true that the scholar and the person putting in markup will never be the same person. ("Final Critique", ad TEI Guidelines 7.3.1.1, ad TEI Guidelines 5.11.1) Most likely, texts are being entered under the supervision of a scholar in order that she can ultimately work on them. When she does, she will need markup of the chosen features in the text in order to aid her own work; markup may also be used to preserve certain aspects of her own textual interpretations for use by future scholars. She will, in that case, have to decide on the relevant tags, and enter them into the text. It is worth noting that the presence of information in a text does not require that it be used by another scholar -- a purely statistical analysis of word frequencies might completely ignore markup, while an information retrieval application would use it extensively.

Presentational markup is also interpretation

There are also texts for which it is extremely important that the conversion into electronic form retain as much information as possible about their presentation. Examples of such texts are early printed works in which the typography is significant, manuscripts and fragmentary texts where the position of the letters on the page is important, and visually creative genres like concrete poetry and literary collages. In order to tag such a text, one may need to be able to describe any nuance of variation in point size, or typeface, or line spacing. It is not enough to say that a portion of a text is indented, or set in bold, or in italics. A vocabulary of such tags that is sufficiently broad to cover all potentialities seems impossible to create, and even if it were created, would not be suitable for interchange.

Therefore, marking the presentation of a text entails creating abstractions, and then interpreting the page image in terms of these abstractions. The person in charge of entering the text, or of marking up the text to be entered, must decide which presentational features are to be preserved in the markup, and to what extent nuances of spacing, printing or layout are to be regarded as distinctive.
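To make the point concrete, here is one hedged sketch of what such presentational abstraction might look like for an early printed page (the element and attribute names are invented for this example):

   <page n=17>
   <line indent=2 face=blackletter size=14>Whan that Aprill with his shoures soote</line>
   </page>

The interpretation lies precisely in deciding that indentation, typeface, and point size are the distinctions worth recording here, and that finer nuances of the page image are not. /* --------------------------------------------------------------- */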

The Importance of Descriptive Markup

The importance of descriptive markup is that it makes the structure of a text explicit and thus allows processing to take place that makes use of that structure. It also tends to include more information in a text than simple presentational markup can. When text features like headings, quotations, and direct speech are tagged, it is possible to use the text as if it were a database, display and output it on very different media, and do various types of context-sensitive analysis. Furthermore, it is always easier to remove detail and make the tagging of a text simpler than to try to insert detail once it has been removed.
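A small example may clarify this (the tag names are illustrative, not canonical). The same passage can be encoded presentationally or descriptively:

   presentational:  He whispered <italic>carpe diem</italic> to her.
   descriptive:     He whispered <quote lang=la>carpe diem</quote> to her.

A program asked to find all Latin quotations can answer reliably from the second encoding; from the first it can only guess, since italics also mark titles, emphasis, and foreign words generally.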

DTDs

The Literature Working Group also protests the use of DTDs to encode the structure of literary texts, because they feel that this forces a particular interpretation on a text. ("Final Critique" ad TEI Guidelines 2.1.4, ad TEI Guidelines 6.1 para 2)

A DTD Helps Create Software and Helps Formalize Interpretations

The existence of a DTD is helpful for analysis software since it provides a concise description of which textual features can occur in particular contexts. The purpose underlying the original conceptualization of the DTD was to ensure that newly written machine readable documents corresponded to the "correct" form decreed for that type of document, thus ensuring that it would be suitable for automatic processing by a computer. It turns out that a DTD is also surprisingly useful as a way to rigorously describe the tagging decisions made in encoding a text. The verification facilities of SGML can be used to determine if the actual document, as tagged, matches the descriptive theory of the document formed by the scholar as recorded in the DTD. This can reveal shortcomings in the DTD as well as errors in the encoding of the text. In either case, it provokes a deeper consideration of the interpretation which is (inevitably) being done.
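As an illustration, here is a small fragment in standard SGML declaration syntax (the element names are ours, not the TEI's):

   <!ELEMENT poem   - -  (title?, stanza+) >
   <!ELEMENT stanza - -  (line+)           >
   <!ELEMENT line   - O  (#PCDATA)         >

A validating SGML parser checking a tagged text against these declarations will flag a stanza containing no lines, or a title appearing after the first stanza, forcing the encoder to decide whether the text or the theory of the text embodied in the DTD is at fault.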

The Apparent Loss of Flexibility Created by a DTD is an Illusion

DTDs are not meant to be rigid molds into which we cram texts. On the contrary, they grow and change as our understanding of a text changes. Creating a DTD and then using it to validate a document provides interesting and useful information about the structures of the document and the text type. SGML, because of the validation process, can counter many of our assumptions about the structure of a text, and thus enriches the scholarly process of analyzing and understanding it.

The DTD for textual features of interest to scholars can be modified by the mechanisms documented in Chapter 8 of the TEI guidelines. This description is distinctly non-tutorial at the moment, and DTD modification will probably always remain a task for a person who is not afraid to plunge into SGML. Nevertheless, DTD extension is an important area that is an essential and integral part of the TEI approach to textual tagging.

Different Kinds of Scholarship & Texts

Only Texts that are Based on Book Technology are Discussed

When the Working Group discusses markup of texts, they appear to take only a certain type of text into account: texts written in Europe in the last four or five centuries that have been printed as books. In that light, their comments about reference systems and pagination have some validity, as do the elements that they single out for tagging. However, many literary texts do not fit these criteria. Examples are ancient texts, where we have many manuscript copies of lost papyrus originals and the lines and pages may no longer represent the original lineation (except in poetry), and texts created on the computer, which do not have pages, lines and other features derived primarily from books. In these cases, the elements that should be tagged differ from those in a book. The "Final Critique" does not address these issues. Finally, the technology of the printed book is only one phase in the development of text: it was preceded by the papyrus scroll, and is being followed by electronic texts and hypertexts. Basing markup solely on typography and book pagination is to build into an electronic text the artifacts of one particular display technology.

Reference Systems are more complex than just pages & lines

An examination of reference systems will make these comments clearer. It is not possible to locate a place unambiguously in every text by using pages and lines. In the case of ancient texts, physical pages and lines are not significant (except in the case of poetry, and even there, line breaks in lyric are disputed). Reference is usually based on the line or page breaks of a particular edition, the rest of whose particulars are no longer known, as in the case of Stephanus pages in Plato. This information, which is no longer tied to the physical aspect of a text, provides clear location information for all editions of that text. In the case of texts created on the computer, a reference system may have to rely on the tagging, since tagged features are what correspond most closely to pages. The Guidelines actually give a fairly detailed description of tagging alternate reference schemes (TEI Guidelines 5.6, 5.7).
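One hedged sketch of how such a scheme might be encoded, using an invented milestone-style empty element rather than any specific TEI tag:

   <stephanus page=327 section=a>I went down yesterday to the Piraeus ...

A citation such as Republic 327a can then be resolved in any edition of the text, whatever its pagination and lineation happen to be.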

Verbosity of Tagging / Localization

Throughout the Critique, the Literature Working Group appears to be overly concerned with problems of data entry and display. As they point out themselves, data entry shortcuts and minimization for internal purposes are possible. Many of these arguments are based on the assumption that the primary function of electronic texts is to be read by humans in their electronic form. Coding schemes like SGML make an effort to be human-readable, as an aid in data preparation and a surety against electronic obsolescence; however, the primary function of these texts is to be processed and displayed by the computer. If a text is tagged with extreme shortcuts and control characters for brevity, it may ultimately be much harder to process and interchange than one which contains verbose but generic tagging. It should also be noted that one of the uses of the TEI modification features is the renaming of any tag.

Verbosity is not What it Seems

Tags that seem very verbose actually contribute toward an economy of tagging ("Final Critique" ad TEI Guidelines 5.3.2). If, instead of the rendition attribute, the TEI recommended a separate tag for each type of rendition, it would be necessary to have thousands of tags to cover every presentational nuance possible in a text. Instead, the attribute that records the rendition may have any value the tagger of the text chooses to give it. It is also possible to restrict the values of an attribute by specifying a list of allowed values.
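For instance (the element and attribute names follow the general pattern under discussion, but are our own):

   <highlighted rend=italic>sine qua non</highlighted>
   <highlighted rend='bold expanded'>NOTICE</highlighted>

A single element with an open-ended rendition attribute does the work that would otherwise demand a distinct <italic>, <bold>, <expanded>, and so on, for every typographic effect a printer can produce.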

Examples of Software and Macros for Working with SGML

In several places in the Critique, the Literature Working Group requests the inclusion of examples of macros, to show how local encoding can be changed to SGML encoding. This is not an appropriate task for the TEI to undertake in the Guidelines. The Guidelines are for the most part a language specification, like the ANSI specification for the C programming language, or the formal description of the WordPerfect document format. Specifying a complicated language is difficult and important, and it is a distinctly different task from devising ways to _use_ the language efficiently or easily. In the ANSI specification for C, for instance, there are no examples of how to implement syntax-aware C editors. Since macros and other pieces of code are extremely dependent on the software and hardware platform being used (just among the projects represented by the writers, there are 3 operating systems and 5 or 6 pieces of software being used for preparing texts as SGML documents), the people who are best suited to creating such macros are the computer experts who are working on individual projects. Not only that, but, as we all know, software and hardware are mercurial and evanescent. Including examples of that sort would mean that the Guidelines would be providing incorrect and outdated information almost from the outset.

Notwithstanding the above, we think that a tutorial introduction to tagging literary texts that covers specific techniques and methods would be a valuable companion to the TEI Guidelines, as would a survey of available software tools. The recent bibliography of SGML compiled by David Barnard, Robin Cover and Nicholas Duncan is a wonderful resource (Queens University TR 90-281), as is the Markup Manual for the Milton Textbase, written by Lou Burnard.

Elaine Brennan, Assistant Director, Women Writers Project, Brown University
Steve DeRose, Senior Software Engineer, EBT, Providence, RI
David Durand, Computer Science, Boston University
Elli Mylonas, Managing Editor, Perseus Project; Research Associate, Classics, Harvard University
Allen Renear, Senior Planning Analyst for Humanities Computing, Brown University

In composing this response, we benefited from many valuable discussions with members of the Brown University Computing in the Humanities Users' Group.

=========================================================================
Date:         Fri, 29 Mar 91 10:30:00 EST
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         "NANCY M. IDE (914) 437 5988"
Subject:      query concerning pre-processing of texts

Although this is not strictly a TEI matter, when reading this query on the LN list I thought there might be a number of people on TEI-L who have some information. --ni

===========================================================================

Hello everyone! My name is Alain Matthey and I am a computer scientist. I am working as a research assistant in the Laboratory of Speech and Language Processing of the University of Neuchatel (Switzerland). The members of this laboratory are now working on a research project to develop a kind of spelling and grammar checker, like Grammatik, IBM's Critique, Mac Proof, Hugo or Sans fautes, but for French native speakers who write in English. In this project, I will have to implement the "preprocessing step", which consists of recognising and delimiting the sentences and the words of a text. How can sentences and words be found and delimited automatically in any kind of ASCII text? That's the problem!!!

So I am looking for some information (bibliography, papers, etc.) about the preprocessing of ASCII texts. For any further information, or to send an answer, please contact me at the address below:

Alain Matthey
Laboratoire de traitement du langage et de la parole
UNIVERSITE DE NEUCHATEL
Avenue du Premier-Mars 26
CH-2000 NEUCHATEL
SWITZERLAND
Phone: 038 25 38 51 (int. 27)
Fax: 038 25 18 32
E-mail: LTMATTHEY@CNEDCU51.BITNET

Thank you very much for your help! Best regards.

Alain Matthey

P.S. It is not forbidden to write in French!!!