=========================================================================
Date:         Thu, 1 Oct 1992 12:55:45 CDT
Reply-To:     U59467@UICVM.BITNET
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         U59467@UICVM.BITNET
Subject:      Forwarded note below
=======================================================================
Received: from UICVM.BITNET by UICVM (Mailer R2.07) with BSMTP id 6778;
          Thu, 01 Oct 92 06:43:47 CDT
Received: from ACADVM1.UOTTAWA.CA by UICVM (Mailer R2.07) with BSMTP id
          6739; Thu, 01 Oct 92 06:42:54 CDT
Received: from UOTTAWA (DMEGGINS) by ACADVM1.UOTTAWA.CA (Mailer R2.07)
          with BSMTP id 6271; Fri, 25 Sep 92 19:21:52 EDT
Date:         Fri, 25 Sep 92 19:19:25 EDT
From:         David Megginson
Subject:      Re: Names of IPA symbols
To:           "TEI-L: Text Encoding Initiative public discussion list"
In-Reply-To:  Message of Fri, 25 Sep 1992 15:02:35 -0400 from

On Fri, 25 Sep 1992 15:02:35 -0400 Glenn Adams said:

>Why would anyone want to create a new character set for IPA?  ISO10646
>already provides a full encoding of IPA.

Not for e-mail or Usenet news interchange, where only a subset of plain
ASCII (to use the terminology loosely) will survive transmission.  They
are probably looking to find a way of encoding IPA in ASCII, possibly
using SGML entity notation.
David

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
David Megginson                     Department of English,
dmeggins@acadvm1.uottawa.ca         University of Ottawa
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
=========================================================================
Date:         Thu, 1 Oct 1992 18:45:36 CDT
Reply-To:     "John Plate"
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         "John Plate"
Subject:      Re: Forwarded note below
In-Reply-To:  <9210011757.AA02780@dkuug.dk>; from "U59467@UICVM.BITNET"
              at Oct 1, 92 12:55 pm
----------------------------Original message----------------------------
> On Fri, 25 Sep 1992 15:02:35 -0400 Glenn Adams said:
>
> >Why would anyone want to create a new character set for IPA?  ISO10646
> >already provides a full encoding of IPA.
>
> Not for e-mail or Usenet news interchange, where only a subset of plain
> ASCII (to use the terminology loosely) will survive transmission.  They
> are probably looking to find a way of encoding IPA in ASCII, possibly
> using SGML entity notation.

FYI: An encoding scheme has been developed that maps (all?) known
characters into plain ASCII characters.  The work has been done by
Keld Simonsen.

--
John Plate        InfoTek aps, Ellebjergvej 2, DK-2450 Copenhagen, Denmark
                  Fax (+45) 3116 1607
=========================================================================
Date:         Fri, 2 Oct 1992 07:12:39 CDT
Reply-To:     Glenn Adams
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         Glenn Adams
Subject:      Forwarded note below
In-Reply-To:  U59467%UICVM.BITNET@pucc.princeton.edu's message of Thu,
              1 Oct 1992 12:55:45 CDT <9210012201.AA02679@sapir.metis.com>
----------------------------Original message----------------------------
Date: Thu, 1 Oct 1992 12:55:45 CDT
From: U59467%UICVM.BITNET@pucc.princeton.edu
Subject: Re: Names of IPA symbols

On Fri, 25 Sep 1992 15:02:35 -0400 Glenn Adams said:

>Why would anyone want to create a new character set for IPA?  ISO10646
>already provides a full encoding of IPA.

Not for e-mail or Usenet news interchange, where only a subset of plain
ASCII (to use the terminology loosely) will survive transmission.  They
are probably looking to find a way of encoding IPA in ASCII, possibly
using SGML entity notation.

E-mail and Usenet are not a problem for ISO 10646 (at least if one is
using an 8-bit clean mailer); simply use the UTF (universal
transformation format) which is part of the 10646 standard.  This
transformation converts 10646 into an 8-bit octet stream that protects
the C0, C1, SPACE, and DEL octet values, i.e., it makes it ISO 2022
8-bit compatible.

I have written some conversion routines; for anyone interested, they
can be obtained by anonymous FTP from METIS.COM [140.186.33.40], or, if
you don't have FTP access, send me mail and I can email it.

Glenn Adams
=========================================================================
Date:         Fri, 2 Oct 1992 07:17:18 CDT
Reply-To:     Erik Naggum
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         Erik Naggum
Subject:      Re: Forwarded note below
In-Reply-To:  <9210011915.AA04284@dkuug.dk> (01 Oct 1992 18:45:36 -0500
              (19921001234536))
----------------------------Original message----------------------------
| An encoding scheme has been developed that maps (all?) known
| characters into plain ASCII characters.  The work has been done by
| Keld Simonsen.

I have tried to use this encoding scheme.  I was in the working group
(IAB/IETF 822 WG) which rejected it on technical grounds, and on several
procedural factors.

The encoding scheme uses characters from the invariant set of ISO 646,
usually in pairs, to encode other characters.  It's called "mnemonic".
It works well for the accented characters of ISO Latin 1 (ISO 8859-1).
Included in the tables are approximately 2000 characters, with sharply
decreasing "mnemonic" characteristics in their encoding.  Two characters
isn't enough, and can't be enough.
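[The UTF that Glenn mentions transforms ISO 10646 data so that the C0,
C1, SPACE, and DEL octet values never occur in the output.  The sketch
below is NOT the actual transformation format defined in the standard;
it is only a toy per-character encoder, under the assumption that any
character number can be carried in octets drawn from the remaining 190
"safe" values.  --ed.]

```python
# Toy illustration (NOT the UTF defined in ISO 10646): encode a single
# character number using only octets that are neither C0 (0x00-0x1F),
# SPACE (0x20), DEL (0x7F), nor C1 (0x80-0x9F).
SAFE = [b for b in range(256)
        if not (b <= 0x20 or b == 0x7F or 0x80 <= b <= 0x9F)]
BASE = len(SAFE)            # 190 usable octet values remain

def encode(cp):
    """Encode one character number as a sequence of mail-safe octets."""
    digits = []
    while True:
        digits.append(SAFE[cp % BASE])
        cp //= BASE
        if cp == 0:
            break
    return bytes(reversed(digits))

def decode(data):
    """Invert encode() for a single character's octet sequence."""
    cp = 0
    for b in data:
        cp = cp * BASE + SAFE.index(b)
    return cp
```

(A real transformation format must also be self-delimiting so that
multi-character streams can be split back up; this toy only shows the
octet-protection constraint.)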
Apart from the ideographic characters of Chinese, Japanese, and Korean,
ISO DIS 10646-1.2 contains 7500 characters.  It is obvious that with 84
characters, only 7056 two-character combinations are available, 95% of
which are far from mnemonic.

The encoding scheme is static, with a fixed binding between a given
"mnemonic" character sequence and a character.  SGML, on the other hand,
has established an _enabling_ notation, in which it is possible to
choose _any_ name for a character, according to conventions and user
needs.  The key is the "definitional character entity set", which maps
an entity name to a _character_ (meaning), as opposed to the "display
character entity set", which maps it to a _glyph id_ or _coded
character_ for display or other processing purposes.

The definitional character entity set has usually seen entity
declarations of a form in which the comment was intended to embody some
kind of standardized definitional name of the character, while the
entity itself was just a bracketed version of the entity name.  These
have been largely useless, as the entity name is not part of the ESIS,
and a display version of the definitional character entity set had to
be remapped manually.  Much effort has been put into adopting a
standardized set of entity _names_, whereas the purpose has been to
standardize an encoding for the _characters_ some entity is used to
access.

With the adoption of ISO 10646, this can change.  ISO 10646 contains
all (all!) known characters in the universe, and part 1 (ISO 10646-1),
the Basic Multilingual Plane, is already here, with around 40 thousand
characters, all with unique names.
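[The arithmetic behind the objection, spelled out with the figures given
above.  --ed.]

```python
# Pairs drawn from the 84-character invariant subset of ISO 646 cannot
# cover even the non-ideographic part of ISO DIS 10646-1.2.
invariant = 84              # characters in the ISO 646 invariant set
pairs = invariant ** 2      # distinct two-character "mnemonics": 7056
needed = 7500               # non-CJK characters in ISO DIS 10646-1.2

shortfall = needed - pairs  # characters that cannot get a pair at all
```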
We can use the unique names in the definitional character entity sets,
and then the mapping to a new display version can be done mechanically,
by describing the native character set or glyph set in terms of these
unique names.  If we also use the same names to describe characters in
public character sets (and private versions, too), we don't need to
worry about mapping and conversion tables anymore.  They can be created
on the fly.

I attach a fairly long article, which has two functions: to expose the
weaknesses of Keld Simonsen's design, and to suggest a solution using
SGML's character set declarations.  I present our version of the
complete, encoded ISO Latin 1 (ISO 8859-1:1987) as an example.  (This
is what makes the article so long.)

As Glenn correctly states, ISO 10646 has already done a complete
encoding of IPA.  Although I couldn't care less which numbers are
assigned to a given character, I'm deeply appreciative of the unique
names that SC 2/WG 2 have assigned to them.  It makes the number a
matter of convention, one which is only a handle to the name and
meaning of a character in the context of coded representation.
Transformation between coded representations is thereby possible by
meaning, rather than by hand-prepared number-to-number conversion
tables.

My work is complete for character sets registered according to ISO
2375.  I'm working on the IBM code pages, and the assorted randomness
from other vendors.  The work will not be published until I can get
some funding for it.  I can give away many individual hours every week,
but I can't give away four months of my time without some remuneration.
If, on the strength of the following article, the TEI should wish to
take part in the funding, I think much work in the character set WG can
be saved, and a cleaner design can be used.
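[A minimal sketch of such an on-the-fly conversion table, assuming each
set is described as a mapping from code value to unique character name.
The "vendor" code values here are hypothetical, not any real code page.
--ed.]

```python
# Describe each coded character set by its unique ISO 10646 names, then
# derive a number-to-number conversion table by joining on the names.
latin1_right = {0xF8: "LATIN SMALL LETTER O WITH STROKE",
                0xE6: "LATIN SMALL LETTER AE"}

# A hypothetical vendor code page, described with the same unique names:
vendor_page = {0x9B: "LATIN SMALL LETTER O WITH STROKE",
               0x91: "LATIN SMALL LETTER AE"}

def conversion_table(src, dst):
    """Build src-code -> dst-code for every name both sets share."""
    by_name = {name: code for code, name in dst.items()}
    return {code: by_name[name]
            for code, name in src.items() if name in by_name}

table = conversion_table(latin1_right, vendor_page)
```

No hand-prepared number-to-number table is typed in anywhere; the
mapping falls out of the two name-based descriptions.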
Included in my work is a complete set of library functions to access
the tables, a complete ISO 2022 data stream parser and translator to
ISO 10646, and utilities to convert SGML documents between any two
document character sets, given appropriate and available entity sets.
Apart from the character set declarations, there are also definitional
character entity sets from ISO 8879, with the same names.  The complete
character set declaration for ISO 10646 is 1.4M in size, and contains
names for 29,000 characters.

I plan to write the specifications for and to implement a "character
set manager" for SGML, to be located between the entity manager and the
parser proper, so a parser can always work with ISO 10646, and the
application can provide the parser with its own character set, in which
it will receive the document data.  A forthcoming article will describe
the design.  It will be submitted to ISO for consideration in the
revised SGML, and new entity sets defined in this scheme will be
suggested.  I think this will be of interest to the TEI, too.

Note: Many ask me why I haven't published this yet.  I don't like to
publish half-finished work with errors and omissions, and I have so far
only published the fact that I'm working on this, in order to fill in
the picture when reinventions of the wheel are marketed as round.  I
don't intend to publish this until ISO 10646-1 is published, which will
hopefully be this year.

Best regards,

--
Erik Naggum             | ISO 8879 SGML    | +47 295 0313
                        | ISO 10744 HyTime |
                        | ISO 10646 UCS    | Memento, terrigena.
                        | ISO 9899 C       | Memento, vita brevis.
------------------------------------------------------------------------
Newsgroups: comp.fonts,comp.protocols.iso,comp.os.os2.programmer,
            comp.os.ms-windows.programmer.misc,comp.os.misc,sci.lang
Path: enag
From: Erik Naggum
Organization: Department of Informatics, University of Oslo, Norway
Message-ID: <23361B@erik.naggum.no>
Date: 29 Sep 1992 00:06:10 +0100
References: <1992Sep28.080944.22880@ugle.unit.no>
Subject: Re: Character Sets (AGAIN)
Lines: 355

Harald Tveit Alvestrand writes:
|
|

I think it would be fair to include the reasons I don't like it, rather
than imply that it's just a matter of "liking" a format or not: the
tables are generally unreadable, they're full of errors, and they can't
be debugged by inspection, so you can't even find the errors without
doing very time-consuming comparisons with the original material.  I
started doing this time-consuming work, but found it easier to go to
the original sources myself, and start over.  That's why I don't "like"
RFC 1345.  All other users of it will also have to do this painstaking
checking all over, because the RFC's content can't be trusted.  (The
author has announced a new, improved edition, but again, we have to
trust it, since its correctness and accuracy are extremely hard to
inspect, even for character sets you know well by heart.)

Compare the following two definitions of ISO 8859-1 (ISO Latin 1).

From RFC 1345:

&charset ISO_8859-1:1987
&rem source: ECMA registry
&alias iso-ir-100
&g1esc x2d41
&g2esc x2e41
&g3esc x2f41
&alias ISO_8859-1
&alias ISO-8859-1
&alias latin1
&alias l1
&alias IBM819
&alias CP819
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
NS !I Ct Pd Cu Ye BB SE ': Co -a << NO -- Rg '-
DG +- 2S 3S '' My PI .M ', 1S -o >> 14 12 34 ?I
A! A' A> A? A: AA AE C, E! E' E> E: I! I' I> I:
D- N? O! O' O> O? O: *X O/ U! U' U> U: Y' TH ss
a! a' a> a? a: aa ae c, e! e' e> e: i! i' i> i:
d- n? o! o' o> o? o: -: o/ u! u' u> u: y' th y:

(Note: this is really four different character sets, including two
control character sets and two graphic character sets: ASCII, and the
right half of ISO 8859-1, which is known as ISO registration number
100, and it's only this right half which has ISO 2022 escape code
(g1esc) "ESC 2/13 4/1" (or x2d41).  The complete ISO 2022 escape code
sequence is "ESC 2/0 4/3 ESC 2/1 4/0 ESC 2/8 4/2 ESC 2/2 4/3
ESC 2/13 4/1".  We see that this RFC does a major disservice to the
community by pretending that an escape sequence is more than it is, and
lists four character sets where only one is actually identified (ISO
#100).  It should also be said that ISO #100 is intended to be used
with ISO #6 (ASCII, or the new IRV), but it is still identified as only
96 graphic characters, not 256 graphic and control characters.  It
should also be noted that not all control characters in C1 (PA, HO,
BH, ...) had been standardized at the time of publication of the RFC,
and many have in fact been withdrawn by now.  In general, C1 control
codes are used as escape sequences, too.  E.g., CSI (CI in the above)
is uniformly used as "ESC 5/11", not as "9/11".  ISO 10646 requires
that escape sequences shall be used.)

Compare this with my encoding of the ISO 2375 Register of Character
Sets to Be Used with Escape Sequences (according to ISO 2022), of all
four character sets.  (The descriptions are according to ISO 8879
(SGML) character set declarations, and were mainly intended for use
with SGML, but have greater utility than that.  Comments are surrounded
by "--".)
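[The column/row notation used in these escape sequences can be
mechanized: "2/13" names the octet in column 2, row 13, i.e. 2*16+13 =
0x2D.  A small helper, for illustration only.  --ed.]

```python
# Turn an ISO 2022 column/row designation like "2/13 4/1" into the
# actual escape sequence bytes: ESC followed by the named octets.
def escape_sequence(designation):
    octets = [int(col) * 16 + int(row)
              for col, row in (pair.split("/")
                               for pair in designation.split())]
    return bytes([0x1B] + octets)

g1_latin1 = escape_sequence("2/13 4/1")   # designate ISO #100 into G1
g0_ascii = escape_sequence("2/8 4/2")     # designate ISO #6 into G0
```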
-- "ISO Registration Number 1//CHARSET C0 Set of ISO 646//ESC 2/1 4/0" --
-- "ISO 646:1991//CHARSET C0 Set//ESC 2/1 4/0" --
BASESET "ISO 2022//CHARSET Empty C0 Set//ESC 2/1 7/14"
DESCSET
0 -- 0000 -- 1 -- NUL -- "NULL"
1 -- 0001 -- 1 -- SOH -- "START OF HEADING"
2 -- 0002 -- 1 -- STX -- "START OF TEXT"
3 -- 0003 -- 1 -- ETX -- "END OF TEXT"
4 -- 0004 -- 1 -- EOT -- "END OF TRANSMISSION"
5 -- 0005 -- 1 -- ENQ -- "ENQUIRY"
6 -- 0006 -- 1 -- ACK -- "ACKNOWLEDGE"
7 -- 0007 -- 1 -- BEL -- "BELL"
8 -- 0008 -- 1 -- BS -- "BACKSPACE"
9 -- 0009 -- 1 -- HT -- "HORIZONTAL TABULATION"
10 -- 000A -- 1 -- LF -- "LINE FEED"
11 -- 000B -- 1 -- VT -- "VERTICAL TABULATION"
12 -- 000C -- 1 -- FF -- "FORM FEED"
13 -- 000D -- 1 -- CR -- "CARRIAGE RETURN"
14 -- 000E -- 1 -- SO -- "SHIFT OUT"
15 -- 000F -- 1 -- SI -- "SHIFT IN"
16 -- 0010 -- 1 -- DLE -- "DATA LINK ESCAPE"
17 -- 0011 -- 1 -- DC1 -- "DEVICE CONTROL ONE"
18 -- 0012 -- 1 -- DC2 -- "DEVICE CONTROL TWO"
19 -- 0013 -- 1 -- DC3 -- "DEVICE CONTROL THREE"
20 -- 0014 -- 1 -- DC4 -- "DEVICE CONTROL FOUR"
21 -- 0015 -- 1 -- NAK -- "NEGATIVE ACKNOWLEDGE"
22 -- 0016 -- 1 -- SYN -- "SYNCHRONOUS IDLE"
23 -- 0017 -- 1 -- ETB -- "END OF TRANSMISSION BLOCK"
24 -- 0018 -- 1 -- CAN -- "CANCEL"
25 -- 0019 -- 1 -- EM -- "END OF MEDIUM"
26 -- 001A -- 1 -- SUB -- "SUBSTITUTE"
27 -- 001B -- 1 -- ESC -- "ESCAPE"
28 -- 001C -- 1 -- IS4 -- "INFORMATION SEPARATOR FOUR"
29 -- 001D -- 1 -- IS3 -- "INFORMATION SEPARATOR THREE"
30 -- 001E -- 1 -- IS2 -- "INFORMATION SEPARATOR TWO"
31 -- 001F -- 1 -- IS1 -- "INFORMATION SEPARATOR ONE"

-- "ISO Registration Number 6//CHARSET ISO 646:1991 IRV//ESC 2/8 4/2" --
-- "ISO 646:1991//CHARSET IRV//ESC 2/8 4/2" --
BASESET "ISO 2022//CHARSET Empty G0 Set//ESC 2/8 7/14"
DESCSET
32 -- 0020 -- 1 "SPACE"
33 -- 0021 -- 1 "EXCLAMATION MARK"
34 -- 0022 -- 1 "QUOTATION MARK"
35 -- 0023 -- 1 "NUMBER SIGN"
36 -- 0024 -- 1 "DOLLAR SIGN"
37 -- 0025 -- 1 "PERCENT SIGN"
38 -- 0026 -- 1 "AMPERSAND"
39 -- 0027 -- 1 "APOSTROPHE"
40 -- 0028 -- 1 "LEFT PARENTHESIS"
41 -- 0029 -- 1 "RIGHT PARENTHESIS"
42 -- 002A -- 1 "ASTERISK"
43 -- 002B -- 1 "PLUS SIGN"
44 -- 002C -- 1 "COMMA"
45 -- 002D -- 1 "HYPHEN-MINUS"
46 -- 002E -- 1 "PERIOD"
47 -- 002F -- 1 "SOLIDUS"
48 -- 0030 -- 1 "DIGIT ZERO"
49 -- 0031 -- 1 "DIGIT ONE"
50 -- 0032 -- 1 "DIGIT TWO"
51 -- 0033 -- 1 "DIGIT THREE"
52 -- 0034 -- 1 "DIGIT FOUR"
53 -- 0035 -- 1 "DIGIT FIVE"
54 -- 0036 -- 1 "DIGIT SIX"
55 -- 0037 -- 1 "DIGIT SEVEN"
56 -- 0038 -- 1 "DIGIT EIGHT"
57 -- 0039 -- 1 "DIGIT NINE"
58 -- 003A -- 1 "COLON"
59 -- 003B -- 1 "SEMICOLON"
60 -- 003C -- 1 "LESS-THAN SIGN"
61 -- 003D -- 1 "EQUALS SIGN"
62 -- 003E -- 1 "GREATER-THAN SIGN"
63 -- 003F -- 1 "QUESTION MARK"
64 -- 0040 -- 1 "COMMERCIAL AT"
65 -- 0041 -- 1 "LATIN CAPITAL LETTER A"
66 -- 0042 -- 1 "LATIN CAPITAL LETTER B"
67 -- 0043 -- 1 "LATIN CAPITAL LETTER C"
68 -- 0044 -- 1 "LATIN CAPITAL LETTER D"
69 -- 0045 -- 1 "LATIN CAPITAL LETTER E"
70 -- 0046 -- 1 "LATIN CAPITAL LETTER F"
71 -- 0047 -- 1 "LATIN CAPITAL LETTER G"
72 -- 0048 -- 1 "LATIN CAPITAL LETTER H"
73 -- 0049 -- 1 "LATIN CAPITAL LETTER I"
74 -- 004A -- 1 "LATIN CAPITAL LETTER J"
75 -- 004B -- 1 "LATIN CAPITAL LETTER K"
76 -- 004C -- 1 "LATIN CAPITAL LETTER L"
77 -- 004D -- 1 "LATIN CAPITAL LETTER M"
78 -- 004E -- 1 "LATIN CAPITAL LETTER N"
79 -- 004F -- 1 "LATIN CAPITAL LETTER O"
80 -- 0050 -- 1 "LATIN CAPITAL LETTER P"
81 -- 0051 -- 1 "LATIN CAPITAL LETTER Q"
82 -- 0052 -- 1 "LATIN CAPITAL LETTER R"
83 -- 0053 -- 1 "LATIN CAPITAL LETTER S"
84 -- 0054 -- 1 "LATIN CAPITAL LETTER T"
85 -- 0055 -- 1 "LATIN CAPITAL LETTER U"
86 -- 0056 -- 1 "LATIN CAPITAL LETTER V"
87 -- 0057 -- 1 "LATIN CAPITAL LETTER W"
88 -- 0058 -- 1 "LATIN CAPITAL LETTER X"
89 -- 0059 -- 1 "LATIN CAPITAL LETTER Y"
90 -- 005A -- 1 "LATIN CAPITAL LETTER Z"
91 -- 005B -- 1 "LEFT SQUARE BRACKET"
92 -- 005C -- 1 "REVERSE SOLIDUS"
93 -- 005D -- 1 "RIGHT SQUARE BRACKET"
94 -- 005E -- 1 "CIRCUMFLEX ACCENT"
95 -- 005F -- 1 "LOW LINE"
96 -- 0060 -- 1 "GRAVE ACCENT"
97 -- 0061 -- 1 "LATIN SMALL LETTER A"
98 -- 0062 -- 1 "LATIN SMALL LETTER B"
99 -- 0063 -- 1 "LATIN SMALL LETTER C"
100 -- 0064 -- 1 "LATIN SMALL LETTER D"
101 -- 0065 -- 1 "LATIN SMALL LETTER E"
102 -- 0066 -- 1 "LATIN SMALL LETTER F"
103 -- 0067 -- 1 "LATIN SMALL LETTER G"
104 -- 0068 -- 1 "LATIN SMALL LETTER H"
105 -- 0069 -- 1 "LATIN SMALL LETTER I"
106 -- 006A -- 1 "LATIN SMALL LETTER J"
107 -- 006B -- 1 "LATIN SMALL LETTER K"
108 -- 006C -- 1 "LATIN SMALL LETTER L"
109 -- 006D -- 1 "LATIN SMALL LETTER M"
110 -- 006E -- 1 "LATIN SMALL LETTER N"
111 -- 006F -- 1 "LATIN SMALL LETTER O"
112 -- 0070 -- 1 "LATIN SMALL LETTER P"
113 -- 0071 -- 1 "LATIN SMALL LETTER Q"
114 -- 0072 -- 1 "LATIN SMALL LETTER R"
115 -- 0073 -- 1 "LATIN SMALL LETTER S"
116 -- 0074 -- 1 "LATIN SMALL LETTER T"
117 -- 0075 -- 1 "LATIN SMALL LETTER U"
118 -- 0076 -- 1 "LATIN SMALL LETTER V"
119 -- 0077 -- 1 "LATIN SMALL LETTER W"
120 -- 0078 -- 1 "LATIN SMALL LETTER X"
121 -- 0079 -- 1 "LATIN SMALL LETTER Y"
122 -- 007A -- 1 "LATIN SMALL LETTER Z"
123 -- 007B -- 1 "LEFT CURLY BRACKET"
124 -- 007C -- 1 "VERTICAL LINE"
125 -- 007D -- 1 "RIGHT CURLY BRACKET"
126 -- 007E -- 1 "TILDE"

-- "ISO Registration Number 77//CHARSET C1 Control Set//ESC 2/2 4/3" --
-- "ISO 6429:1983//CHARSET C1 Control Set//ESC 2/2 4/3" --
BASESET "ISO 2022//CHARSET Empty C1 Set//ESC 2/2 7/14"
DESCSET
128 -- 0080 -- 4 UNUSED
132 -- 0084 -- 1 -- IND -- "INDEX"
133 -- 0085 -- 1 -- NEL -- "NEXT LINE"
134 -- 0086 -- 1 -- SSA -- "START OF SELECTED AREA"
135 -- 0087 -- 1 -- ESA -- "END OF SELECTED AREA"
136 -- 0088 -- 1 -- HTS -- "CHARACTER TABULATION SET"
137 -- 0089 -- 1 -- HTJ -- "CHARACTER TABULATION WITH JUSTIFICATION"
138 -- 008A -- 1 -- VTS -- "LINE TABULATION SET"
139 -- 008B -- 1 -- PLD -- "PARTIAL LINE FORWARD"
140 -- 008C -- 1 -- PLU -- "PARTIAL LINE BACKWARD"
141 -- 008D -- 1 -- RI -- "REVERSE LINE FEED"
142 -- 008E -- 1 -- SS2 -- "SINGLE-SHIFT TWO"
143 -- 008F -- 1 -- SS3 -- "SINGLE-SHIFT THREE"
144 -- 0090 -- 1 -- DCS -- "DEVICE CONTROL STRING"
145 -- 0091 -- 1 -- PU1 -- "PRIVATE USE ONE"
146 -- 0092 -- 1 -- PU2 -- "PRIVATE USE TWO"
147 -- 0093 -- 1 -- STS -- "SET TRANSMIT STATE"
148 -- 0094 -- 1 -- CCH -- "CANCEL CHARACTER"
149 -- 0095 -- 1 -- MW -- "MESSAGE WAITING"
150 -- 0096 -- 1 -- SPA -- "START OF GUARDED AREA"
151 -- 0097 -- 1 -- EPA -- "END OF GUARDED AREA"
152 -- 0098 -- 3 UNUSED
155 -- 009B -- 1 -- CSI -- "CONTROL SEQUENCE INTRODUCER"
156 -- 009C -- 1 -- ST -- "STRING TERMINATOR"
157 -- 009D -- 1 -- OSC -- "OPERATING SYSTEM COMMAND"
158 -- 009E -- 1 -- PM -- "PRIVACY MESSAGE"
159 -- 009F -- 1 -- APC -- "APPLICATION PROGRAM COMMAND"

-- "ISO Registration Number 100//CHARSET Latin 1//ESC 2/13 4/1" --
-- "ISO 8859-1:1987//CHARSET Latin 1, right half//ESC 2/13 4/1" --
BASESET "ISO 2022//CHARSET Empty G1 Set//ESC 2/13 7/14"
DESCSET
32 -- 0020 -- 1 "NO-BREAK SPACE"
33 -- 0021 -- 1 "INVERTED EXCLAMATION MARK"
34 -- 0022 -- 1 "CENT SIGN"
35 -- 0023 -- 1 "POUND SIGN"
36 -- 0024 -- 1 "CURRENCY SIGN"
37 -- 0025 -- 1 "YEN SIGN"
38 -- 0026 -- 1 "BROKEN BAR"
39 -- 0027 -- 1 "SECTION SIGN"
40 -- 0028 -- 1 "DIAERESIS"
41 -- 0029 -- 1 "COPYRIGHT SIGN"
42 -- 002A -- 1 "FEMININE ORDINAL INDICATOR"
43 -- 002B -- 1 "LEFT-POINTING DOUBLE ANGLE QUOTATION MARK"
44 -- 002C -- 1 "NOT SIGN"
45 -- 002D -- 1 "SOFT HYPHEN"
46 -- 002E -- 1 "REGISTERED SIGN"
47 -- 002F -- 1 "OVERLINE"
48 -- 0030 -- 1 "DEGREE SIGN"
49 -- 0031 -- 1 "PLUS-MINUS SIGN"
50 -- 0032 -- 1 "SUPERSCRIPT DIGIT TWO"
51 -- 0033 -- 1 "SUPERSCRIPT DIGIT THREE"
52 -- 0034 -- 1 "ACUTE ACCENT"
53 -- 0035 -- 1 "MICRO SIGN"
54 -- 0036 -- 1 "PILCROW SIGN"
55 -- 0037 -- 1 "MIDDLE DOT"
56 -- 0038 -- 1 "CEDILLA"
57 -- 0039 -- 1 "SUPERSCRIPT DIGIT ONE"
58 -- 003A -- 1 "MASCULINE ORDINAL INDICATOR"
59 -- 003B -- 1 "RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK"
60 -- 003C -- 1 "VULGAR FRACTION ONE QUARTER"
61 -- 003D -- 1 "VULGAR FRACTION ONE HALF"
62 -- 003E -- 1 "VULGAR FRACTION THREE QUARTERS"
63 -- 003F -- 1 "INVERTED QUESTION MARK"
64 -- 0040 -- 1 "LATIN CAPITAL LETTER A WITH GRAVE"
65 -- 0041 -- 1 "LATIN CAPITAL LETTER A WITH ACUTE"
66 -- 0042 -- 1 "LATIN CAPITAL LETTER A WITH CIRCUMFLEX"
67 -- 0043 -- 1 "LATIN CAPITAL LETTER A WITH TILDE"
68 -- 0044 -- 1 "LATIN CAPITAL LETTER A WITH DIAERESIS"
69 -- 0045 -- 1 "LATIN CAPITAL LETTER A WITH RING ABOVE"
70 -- 0046 -- 1 "LATIN CAPITAL LETTER AE"
71 -- 0047 -- 1 "LATIN CAPITAL LETTER C WITH CEDILLA"
72 -- 0048 -- 1 "LATIN CAPITAL LETTER E WITH GRAVE"
73 -- 0049 -- 1 "LATIN CAPITAL LETTER E WITH ACUTE"
74 -- 004A -- 1 "LATIN CAPITAL LETTER E WITH CIRCUMFLEX"
75 -- 004B -- 1 "LATIN CAPITAL LETTER E WITH DIAERESIS"
76 -- 004C -- 1 "LATIN CAPITAL LETTER I WITH GRAVE"
77 -- 004D -- 1 "LATIN CAPITAL LETTER I WITH ACUTE"
78 -- 004E -- 1 "LATIN CAPITAL LETTER I WITH CIRCUMFLEX"
79 -- 004F -- 1 "LATIN CAPITAL LETTER I WITH DIAERESIS"
80 -- 0050 -- 1 "LATIN CAPITAL LETTER ETH"
81 -- 0051 -- 1 "LATIN CAPITAL LETTER N WITH TILDE"
82 -- 0052 -- 1 "LATIN CAPITAL LETTER O WITH GRAVE"
83 -- 0053 -- 1 "LATIN CAPITAL LETTER O WITH ACUTE"
84 -- 0054 -- 1 "LATIN CAPITAL LETTER O WITH CIRCUMFLEX"
85 -- 0055 -- 1 "LATIN CAPITAL LETTER O WITH TILDE"
86 -- 0056 -- 1 "LATIN CAPITAL LETTER O WITH DIAERESIS"
87 -- 0057 -- 1 "MULTIPLICATION SIGN"
88 -- 0058 -- 1 "LATIN CAPITAL LETTER O WITH STROKE"
89 -- 0059 -- 1 "LATIN CAPITAL LETTER U WITH GRAVE"
90 -- 005A -- 1 "LATIN CAPITAL LETTER U WITH ACUTE"
91 -- 005B -- 1 "LATIN CAPITAL LETTER U WITH CIRCUMFLEX"
92 -- 005C -- 1 "LATIN CAPITAL LETTER U WITH DIAERESIS"
93 -- 005D -- 1 "LATIN CAPITAL LETTER Y WITH ACUTE"
94 -- 005E -- 1 "LATIN CAPITAL LETTER THORN"
95 -- 005F -- 1 "LATIN SMALL LETTER SHARP S"
96 -- 0060 -- 1 "LATIN SMALL LETTER A WITH GRAVE"
97 -- 0061 -- 1 "LATIN SMALL LETTER A WITH ACUTE"
98 -- 0062 -- 1 "LATIN SMALL LETTER A WITH CIRCUMFLEX"
99 -- 0063 -- 1 "LATIN SMALL LETTER A WITH TILDE"
100 -- 0064 -- 1 "LATIN SMALL LETTER A WITH DIAERESIS"
101 -- 0065 -- 1 "LATIN SMALL LETTER A WITH RING ABOVE"
102 -- 0066 -- 1 "LATIN SMALL LETTER AE"
103 -- 0067 -- 1 "LATIN SMALL LETTER C WITH CEDILLA"
104 -- 0068 -- 1 "LATIN SMALL LETTER E WITH GRAVE"
105 -- 0069 -- 1 "LATIN SMALL LETTER E WITH ACUTE"
106 -- 006A -- 1 "LATIN SMALL LETTER E WITH CIRCUMFLEX"
107 -- 006B -- 1 "LATIN SMALL LETTER E WITH DIAERESIS"
108 -- 006C -- 1 "LATIN SMALL LETTER I WITH GRAVE"
109 -- 006D -- 1 "LATIN SMALL LETTER I WITH ACUTE"
110 -- 006E -- 1 "LATIN SMALL LETTER I WITH CIRCUMFLEX"
111 -- 006F -- 1 "LATIN SMALL LETTER I WITH DIAERESIS"
112 -- 0070 -- 1 "LATIN SMALL LETTER ETH"
113 -- 0071 -- 1 "LATIN SMALL LETTER N WITH TILDE"
114 -- 0072 -- 1 "LATIN SMALL LETTER O WITH GRAVE"
115 -- 0073 -- 1 "LATIN SMALL LETTER O WITH ACUTE"
116 -- 0074 -- 1 "LATIN SMALL LETTER O WITH CIRCUMFLEX"
117 -- 0075 -- 1 "LATIN SMALL LETTER O WITH TILDE"
118 -- 0076 -- 1 "LATIN SMALL LETTER O WITH DIAERESIS"
119 -- 0077 -- 1 "DIVISION SIGN"
120 -- 0078 -- 1 "LATIN SMALL LETTER O WITH STROKE"
121 -- 0079 -- 1 "LATIN SMALL LETTER U WITH GRAVE"
122 -- 007A -- 1 "LATIN SMALL LETTER U WITH ACUTE"
123 -- 007B -- 1 "LATIN SMALL LETTER U WITH CIRCUMFLEX"
124 -- 007C -- 1 "LATIN SMALL LETTER U WITH DIAERESIS"
125 -- 007D -- 1 "LATIN SMALL LETTER Y WITH ACUTE"
126 -- 007E -- 1 "LATIN SMALL LETTER THORN"
127 -- 007F -- 1 "LATIN SMALL LETTER Y WITH DIAERESIS"

Slightly more verbose :-), but also easily debuggable: it's parsable,
and the character number is explicitly identified with the character
name.  (Missing characters and resulting "shifts" account for about 400
errors in RFC 1345.)  The names are drawn from ISO/IEC DIS 10646-1.2,
and will be updated to include the official names from the published
standard.  The fact that these are actually delimited strings also
makes it possible to construct conversion tables by name lookup,
instead of typing in (with errors) a pre-composed conversion table.

If this piques your interest, please drop me a line.
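[As a sketch of what "parsable" buys you: a few entries in the G-set
format above can be turned into a name-to-number table mechanically.
The pattern below is simplified and ignores the C0/C1 mnemonic
comments; illustrative only.  --ed.]

```python
import re

# Each DESCSET entry pairs an explicit character number with a
# delimited name string, so a table can be built by pattern matching
# instead of being typed in by hand.
ENTRY = re.compile(r'(\d+)\s+--\s+[0-9A-F]+\s+--\s+1\s+"([^"]+)"')

def parse_descset(text):
    """Map unique character name -> character number."""
    return {name: int(number) for number, name in ENTRY.findall(text)}

sample = (
    '120 -- 0078 -- 1 "LATIN SMALL LETTER O WITH STROKE"\n'
    '121 -- 0079 -- 1 "LATIN SMALL LETTER U WITH GRAVE"\n'
)
positions = parse_descset(sample)
```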
Best regards,

--
Erik Naggum             | ISO 8879 SGML    | +47 295 0313
                        | ISO 10744 HyTime |
                        | ISO 10646 UCS    | Memento, terrigena.
                        | ISO 9899 C       | Memento, vita brevis.
=========================================================================
Date:         Sat, 3 Oct 1992 21:42:27 CDT
Reply-To:     Glenn Adams
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         Glenn Adams
Subject:      Character Names [Encoding IPA]
In-Reply-To:  Erik Naggum's message of Fri, 2 Oct 1992 07:17:18 CDT
              <9210021228.AA02957@sapir.metis.com>
----------------------------Original message----------------------------
I agree with Erik Naggum about the impracticality of the mnemonic
naming proposed by Keld Simonsen.  While I believe Keld's intentions
were good, I also think that his working model is untenable, i.e., that
one can derive a set of useful mnemonics (2 characters in length) for
each character in the union of all character sets.  As Erik points out,
the nature of Keld's proposed notation limits the collection of
mnemonics to a number far less than the number of elements in 10646;
furthermore, the mnemonic value is quickly lost, and, indeed, is
completely irrelevant in the case of Han characters.  [How should one
choose a mnemonic for a Han character?  Should its meaning be used?  Or
its pronunciation?  In either case, both meaning and pronunciation
differ across the different uses of the same character among different
writing systems, e.g., Chinese, Japanese, Korean, and Vietnamese.]

On the other hand, if one were to use the full names of 10646, a file
may become quite unwieldy in size due to the enormous expansion
required to convert non-ISO 646 character references to entity names.

Personally, I think folks should be thinking about concrete syntaxes
whose baseset is ISO 10646, rather than building systems based on the
reference concrete syntax.  Of course these two concrete syntaxes are
isomorphic by means of entity referencing.  But we should really be
building full 10646 syntaxes.
Document transfer can easily be accomplished by means of appropriate
transformation methods.

Glenn Adams

P.S.  Erik exaggerates when he says that "ISO 10646 contains all (all!)
known characters in the universe."  There are many characters which are
known but are not yet encoded in 10646; they will be there eventually,
but we're not there yet.
=========================================================================
Date:         Sat, 3 Oct 1992 21:43:54 CDT
Reply-To:     Keld Jørn Simonsen
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         Keld Jørn Simonsen
Subject:      Re: Character Names [Encoding IPA]
----------------------------Original message----------------------------
> I agree with Erik Naggum about the impracticality of the mnemonic
> naming proposed by Keld Simonsen.  While I believe Keld's intentions
> were good, I also think that his working model is untenable, i.e.,
> that one can derive a set of useful mnemonics (2 characters in length)
> for each character in the union of all character sets.

It is a misunderstanding that my scheme is only 2 characters; about 500
characters in RFC 1345 have short identifiers with 3 or more characters
in them.  About 24,000 Chinese characters are also defined, and all of
these are 5 characters long.  The mechanism for naming Chinese
characters is just referencing a code point; this method is also used
in ISO 10646.
Keld
=========================================================================
Date:         Sat, 3 Oct 1992 21:44:45 CDT
Reply-To:     Glenn Adams
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         Glenn Adams
Subject:      Character Names [Encoding IPA]
In-Reply-To:  Keld Jørn Simonsen's message of Fri, 2 Oct 92 17:17:26
              +0100 <9210021617.AA03772@dkuug.dk>
----------------------------Original message----------------------------
Date: Fri, 2 Oct 92 17:17:26 +0100
From: Keld Jørn Simonsen

   It is a misunderstanding that my scheme is only 2 characters; about
   500 characters in RFC 1345 have short identifiers with 3 or more
   characters in them.  About 24,000 Chinese characters are also
   defined, and all of these are 5 characters long...

Thanks for the clarification.  If this is the case, then why don't you
simply use the ISO 10646 names?  This would eliminate having a
redundant name collection which can only cause confusion and allow
errors to creep in.  Other than for efficiency reasons (i.e., the
length of the character name), there doesn't seem to be any
justification for having another set of names.  And, if storage
efficiency is the only possible justification, I'm sure there are
better ways to accomplish this, e.g., using 10646 as the BASESET in the
concrete syntax.

Glenn
=========================================================================
Date:         Sat, 3 Oct 1992 21:45:31 CDT
Reply-To:     Keld Jørn Simonsen
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         Keld Jørn Simonsen
Subject:      Re: Character Names [Encoding IPA]
----------------------------Original message----------------------------
> Date: Fri, 2 Oct 92 17:17:26 +0100
> From: Keld Jørn Simonsen
>
>    It is a misunderstanding that my scheme is only 2 characters; about
>    500 characters in RFC 1345 have short identifiers with 3 or more
>    characters in them.  About 24,000 Chinese characters are also
>    defined, and all of these are 5 characters long...
>
> Thanks for the clarification.  If this is the case, then why don't
> you simply use the ISO 10646 names?  This would eliminate having a
> redundant name collection which can only cause confusion and allow
> errors to creep in.  Other than for efficiency reasons (i.e., the
> length of the character name), there doesn't seem to be any
> justification for having another set of names.  And, if storage
> efficiency is the only possible justification, I'm sure there are
> better ways to accomplish this, e.g., using 10646 as the BASESET
> in the concrete syntax.
>
> Glenn

There are several reasons for the design; here are a few:

1.  For readability.  The 10646 names are too long to be useful for
humans reading text, while my notation is at least more adequate.  For
example, my name:

   Keld J<LATIN SMALL LETTER O WITH STROKE>rn Simonsen

vs:

   Keld Jo/rn Simonsen

It is all a matter of taste, but I find the latter readable; it does
not disturb my rhythm of reading too much.  Reading the 10646 names
would fill my brain with LETTER and LATIN and STROKE, which is really
not that relevant.

2.  For writability.  If I cannot generate a character directly from
the keyboard, I can use a kind of compose character sequence to input
it.  The 10646 name is then very long and very error-prone to input;
the above example is 33 characters, which would lead to frequent
mistyping, while the two-letter combination is much easier to type,
and (at least for this example) easier to remember.

3.  For presentation.  The character set tables in RFC 1345 can be
presented in about 100 pages, while an equivalent presentation using
the 10646 names would be about a factor of 10 larger.  Thus short names
save trees and are more manageable in publication, etc.
Keld ========================================================================= Date: Mon, 5 Oct 1992 22:59:46 CDT Reply-To: Erik Naggum Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Erik Naggum Subject: Character Names [Encoding IPA] In-Reply-To: <9210021442.AA00775@boas.metis.com> (02 Oct 1992 10:42:05 -0400 (19921002144205)) ----------------------------Original message---------------------------- Glenn, I need to clarify a point. You write: | On the other hand, if one were to use the full names of 10646, a | file may be quite unwieldy in its size due to the enormous expansion | required to convert non-ISO646 character references to entity names. This is not what I suggested, and Keld has argued against this, from his point of view, too. What I suggest is that we _describe_ a character by means of its unique name, as in the character set declaration I posted. Here, character number 248 when G1 is invoked into the right half is "LATIN SMALL LETTER O WITH STROKE". If we don't have ISO Latin 1 available to an SGML application, we can use an entity reference, and the name is immaterial if we define it in terms of the full name: <!ENTITY oe SDATA "LATIN SMALL LETTER O WITH STROKE"> After this declaration, we can use the entity reference "&oe;" to access the character. Using Keld's example, we wouldn't write: Keld J&LATIN-SMALL-LETTER-O-WITH-STROKE;rn Simonsen but Keld J&oe;rn Simonsen The SGML parser will resolve this reference for us, and if the application has defined a display version of the entity set, it will come out right if he has Latin 1 capability in the display engine. It's crucial to understand the difference between the _definitional_ and the _display_ version of character entity sets. I'm addressing the problem of using unique names to bind characters to entity definitions. Having done this, and having an application character set, or code sequences to accomplish a given glyph on the display device, it's trivial to produce a mapping by name lookup.
E.g., if TeX is used as the processing back-end: Definition: Local mapping: "LATIN SMALL LETTER O WITH STROKE" = "{\o}" Produces a display version: Keld J{\o}rn Simonsen We can also use Keld's mnemonic encoding (as long as we stick to ISO Latin 1, it's a good idea, and well done). | Personally, I think folks should be thinking about concrete syntaxes | whose baseset is ISO10646, rather than building systems based on the | reference concrete syntax. Of course these two concrete syntaxes | are isomorphic by means of entity referencing. But we should really | be building full 10646 syntaxes. Document transfer can easily be | accomplished by means of appropriate transformation methods. This is what I'm doing, already. However, believing that text entry systems will be ISO 10646 compliant within the next few billion dollars of software sales is a pipe dream. Therefore, we need a "character set manager" which can read any character data stream, compliant with ISO 2022 or IBM CDRA, or whatever, and let the parser see it as pure and undiluted ISO 10646. Passing to the application, we need to invoke the character set manager once again to convert the internal representation (ISO 10646) to whatever the application will understand. SGML already supports understanding a document character set based on a syntax reference character set, but it's not powerful enough to describe a data stream encoding. (That's why the TEI needs a "writing system declaration", for instance.) I think this should not be handled by the application, but by a general utility sitting between the entity manager and the parser. Thus, the parser will use ISO 10646 as its document _and_ syntax reference character set. SGML cannot, however, communicate very well with the application on the application's terms when it comes to character set, and the ESIS as defined does not include _any_ information about character sets (boo! hiss!), so something needs to be done, in this area, too.
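Erik's "character set manager" idea — any incoming stream encoding is normalized to one canonical coded character set before the parser sees it, and converted back for the application — can be sketched in modern terms. In this sketch Python's Unicode strings stand in for ISO 10646, the `codecs` machinery stands in for the ISO 2022/CDRA stream decoders, and the class name is invented for illustration:

```python
import codecs

# Minimal sketch of a "character set manager" sitting between the
# entity manager and the parser: the parser only ever sees one
# canonical character set, whatever encoding the streams use.
class CharacterSetManager:
    def to_canonical(self, data: bytes, encoding: str) -> str:
        """Decode an incoming byte stream into the canonical form."""
        return codecs.decode(data, encoding)

    def from_canonical(self, text: str, encoding: str) -> bytes:
        """Re-encode canonical text for an application-side charset."""
        return codecs.encode(text, encoding)

mgr = CharacterSetManager()
# Character number 248 in ISO 8859-1 is LATIN SMALL LETTER O WITH STROKE.
canonical = mgr.to_canonical(b"Keld J\xf8rn", "latin-1")
print(canonical)                                # the parser sees 'Keld Jørn'
print(mgr.from_canonical(canonical, "utf-8"))   # the application gets UTF-8 bytes
```

The design point is that conversion happens only at the two boundaries; everything between them works in the canonical representation.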
It's my intention to bring this kind of change about in SGML II, the sequel. I hope I have managed to clarify my design. Best regards, -- Erik Naggum | ISO 8879 SGML | +47 295 0313 | ISO 10744 HyTime | | ISO 10646 UCS | Memento, terrigena. | ISO 9899 C | Memento, vita brevis. ========================================================================= Date: Mon, 5 Oct 1992 23:00:44 CDT Reply-To: "Liam R. E. Quin" Sender: "TEI-L: Text Encoding Initiative public discussion list" From: "Liam R. E. Quin" Subject: Re: Forwarded note below ----------------------------Original message---------------------------- Glenn Adams wrote that > Email & Usenet is not a problem for ISO10646 (at least if one is using an > 8-bit clean mailer); simply use the UTF (universal transformation format) > which is part of the 10646 standard Unfortunately, as the readers of TEI-L are aware, LISTSERV and other BITNET mailers are by no means 8-bit clean. Neither are many Unix-based mailers, for that matter. Hence the use of uuencode. One reason for wanting an ASCII IPA encoding is so that one can work with the text in environments that do not even support an IPA font, let alone Unicode or ISO 10646. Lee -- Liam Quin, lee@sq.com, SoftQuad, Toronto, 416 239-4801; the barefoot programmer lq-text (Unix text retrieval package) mailing list: lq-text-request@sq.com ========================================================================= Date: Mon, 5 Oct 1992 23:01:12 CDT Reply-To: Olle Jarnefors Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Olle Jarnefors Subject: ISO 10646/UCS/Unicode and Email In-Reply-To: <9210012313.AA00717@boas.metis.com> "Fri, 2 Oct 1992 07:12:39 CDT" ----------------------------Original message---------------------------- Glenn Adams writes: > Email & Usenet is not a problem for ISO10646 (at least if one is using an > 8-bit clean mailer); simply use the UTF (universal transformation format) > which is part of the 10646 standard. 
A word of warning though: Email on the Internet is governed by two Internet standards, RFC 821 (SMTP) for mail transport and RFC 822 for the format of messages. So-called 8-bit clean mailers are ILLEGAL according to these two standards, which allow only transport of octets 0-127. A very significant extension of RFC 822 called MIME (RFC 1341) is proposed as a new Internet standard and will probably be adopted in 1993. It was published in June 1992. Among other things MIME makes it possible to include 8-bit data in Internet email messages by providing two alternative 8-bit => 7-bit transformations, BASE64 and QUOTED-PRINTABLE. The proper way to send ISO 10646 or Unicode text in Internet email then will probably be by means of the double transformation 10646-text ==UTF==> control character safe 8-bit text ==BASE64/QUOTED-PRINTABLE==> 7-bit text The working group within the Internet Engineering Task Force that has developed the MIME proposal is well aware of the 16/32-bit character sets Unicode and ISO 10646. The work to adjust MIME to 10646/Unicode was postponed though, because of widespread scepticism in the group about the chances of the draft standard 10646 to pass the international vote. Now that the standard is approved, this issue should be reopened. I have yet to see any activity on this on the mailing list of the IETF working group, though. The two editors of RFC 1341, the MIME specification, are: Nathaniel S. Borenstein MRE 2D-296, Bellcore 445 South St. Morristown, NJ 07962-1910 Phone: +1 201 829 4270 Fax: +1 201 829 7019 Email: nsb@bellcore.com Ned Freed Innosoft International, Inc.
250 West First Street Suite 240 Claremont, CA 91711 Phone: +1 714 624 7907 Fax: +1 714 621 5319 Email: ned@innosoft.com ========================================================================= Date: Mon, 5 Oct 1992 23:01:36 CDT Reply-To: Keld J|rn Simonsen Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Keld J|rn Simonsen Subject: Re: Character Names [Encoding IPA] ----------------------------Original message---------------------------- > Sure, shorter character names will be more efficient IF ONE HAS TO TYPE > THEM IN AS NAMES. While this may be necessary for a transition period > until 10646-based editors are available, it definitely should not be seen > as a long term solution. It is completely absurd to require human users > to enter characters as names of characters. Nobody in their right mind > is going to do this for a Japanese or Chinese text, or a Thai or Arabic > text for that matter. Instead of coming up with ineffective interim solutions > that will never be used, we should be building fully-enabled 10646 systems. > A task which I have been working at now for the last three years. I agree that the best thing is to have the real thing right there at your fingertips. But this is not the case now, and it will take a long time before all equipment will have all of 10646 easily accessible. And there will even be new versions of 10646 with characters not in the older versions, so there will always be the problem that some equipment cannot just generate/display all of the current 10646. And thus there will always be a need for a fallback representation. Furthermore, a fallback representation will increase the possibilities of interoperation. Actually, with the mechanisms outlined in RFC 1345, you can have all of 10646 capabilities on almost all of today's hardware, making migration much smoother.
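The double transformation Olle Jarnefors outlines above (10646 text, via UTF, through a MIME content-transfer-encoding, to 7-bit text) can be sketched in modern terms. Here today's UTF-8 stands in for the UTF of the 1992 standard, and BASE64 is one of the two RFC 1341 transfer encodings; the function names are invented for illustration:

```python
import base64

# Sketch of the double transformation for sending 10646 text through
# 7-bit mail transport:
#   10646 text ==UTF==> byte form ==BASE64==> 7-bit-safe text
def to_seven_bit(text: str) -> str:
    utf_bytes = text.encode("utf-8")                    # first transformation
    return base64.b64encode(utf_bytes).decode("ascii")  # second transformation

def from_seven_bit(payload: str) -> str:
    # Reverse both steps to recover the original text.
    return base64.b64decode(payload).decode("utf-8")

wire = to_seven_bit("J\u00f8rn")
print(wire)                  # pure ASCII, safe for octets-0-127 transport
print(from_seven_bit(wire))  # round-trips to the original text
```

The point of the two-stage design is separation of concerns: the UTF step handles the character set, while the transfer encoding handles the transport restriction, so either can be swapped (e.g. QUOTED-PRINTABLE for BASE64) independently.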
For the Japanese/Chinese input: well today it is much used in those countries to type in the information in latin characters, or in Hiragana, because having a full Han/Kanji keyboard is just unmanageable. I understand that at least the Japanese are quite happy with this method. This is very similar to typing in a short identifier. Also for Latin use, it is very commonplace to type a non-spacing diacritic and then the base letter, my PC does that and it is an established standard all over Europe and other places using such characters. This is also very similar to typing in the short identifier of RFC 1345. So I do not think the inputting of character names for characters is absurd, this is very similar to what is in wide use all over the world today. keld ========================================================================= Date: Mon, 5 Oct 1992 23:01:59 CDT Reply-To: Erik Naggum Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Erik Naggum Subject: Character Names [Encoding IPA] In-Reply-To: <9210021808.AA00826@boas.metis.com> (02 Oct 1992 14:08:01 -0400 (19921002180801)) ----------------------------Original message---------------------------- Glenn, We have a philosophical difference at this point (in your reply to Keld), one which I think it's important to bring up. | Sure, shorter character names will be more efficient IF ONE HAS TO | TYPE THEM IN AS NAMES. Not necessarily only for this reason. A character and a glyph are not the same thing. For a detailed exposition of the difference, see [1]. A user must be able to access glyphs, not just characters. The SGML entity mechanism supports this. ISO 10646 does not support this distinction. Indeed, it's the opinion of SC 18 (1) that it confuses it. | While this may be necessary for a transition period until 10646- | based editors are available, it definitely should not be seen as a | long term solution.
We have to realize that the present plethora of character sets will survive for decades, in documents, in text entry devices, in software systems, in display devices, etc, etc, and making them inaccessible and/or useless after some time T is counter-productive in a strong sense. A long-term solution will need to address this problem. It should be noted that the solution is _not_ to have mapping tables, and code-point to code-point roundtrip conversion, as character coding is *much* more complex than this simple model affords. | It is completely absurd to require human users to enter characters | as names of characters. I don't think anybody has advocated that, either. However, SGML (and Keld, too) realizes that we have to access a character somehow, and whether it be produced by a sequence of keystrokes which result in one character number from the text entry device, or many, is immaterial. | Nobody in their right mind is going to do this for a Japanese or | Chinese text, or a Thai or Arabic text for that matter. Not if you write lots of text in that writing system, because then you have text entry devices which support your natural habitat, so to speak. However, if we allow for the existence of several habitats and diverse populations of them, we find that some of them are more likely to aid the survival of large character sets than others. The Japanese and Chinese already have multiple-key-stroke-one-character entry schemes. Arabic has many problems because of its incredibly out-dated focus on handwritten characters which tie together and are weakly adapted to the idea of individual "characters" to begin with (i.e. the "smallest freely combining units of [a writing] system" [1] is fuzzy). | Instead of coming up with ineffective interim solutions that will | never be used, ... Objection!
There are already more than a handful of "interim" solutions used by millions of people, who won't stop now, unless they can get _more_ with the new technology than they can with the present. I don't need to mention more than TeX, SGML and Word Perfect. | ... we should be building fully-enabled 10646 systems. A task which | I have been working at now for the last three years. Excellent! However, the rest of the world has been working against you, if you view your work as saving the world for the future (which is how I'm inclined to view it). We need to reach out and capture someone's interest before he will let go of his millions of characters' worth of extant documents. Matter of fact, we have enough problems making people realize that it's a good idea to use standardized 8-bit character sets, and I won't even mention the problems we have trying to have people identify the character sets they do choose. We can't demand that the world will adopt ISO 10646 unless the changeover will be less painful than continuing with what they're doing, no matter how much better they will get it afterwards. After all, most users are cowards who shy away from any short-term pain although we who know better know that they will be happier afterwards. It's the "software dentist problem". | For that matter, I would also argue that no human user should be | forced to learn anything about SGML to use it. This is the heart of the philosophical difference. SGML is much more a philosophy of information representation than an actual language. The idea that we represent structure and identify attributes (one of which is the generic identifier) with (element) contents, is very much different from the immediately gratifying visual presentation that users have been brought up to think is the solution to their information processing needs, not the means to _present_ the solution.
Therefore, it's important that the users think in terms of structure, of elements and element types, of data attributes, of notations and special formats, of the separation of information from presentation. This doesn't have to mean that they will see SGML source documents, but, to use the words of Yuri Rubinsky, "[the users] will be invited to abandon their worst habits" [2]. Bad habits don't go away by themselves. It's important that Yuri stresses "invite". That's also what ISO 10646 does. We can't force them to come to the party if they don't want to, however much we would like to see them there. | It should simply be an underlying serial representation like RTF. RTF is an encoding of the past, when presentation and information were inseparable. SGML is addressing the future, when "a gigabyte is a small amount of information", and information _management_ will completely overshadow other information technology disciplines. (Key question: what can you do with an RTF document, apart from looking at it?) | Applications should provide user interfaces that present the | functional abstraction of SGML without requiring any specific | knowledge of SGML syntax or representations. I'm not sure this is possible, precisely because the abstractions are _very_ hard to communicate to someone who isn't already accustomed to syntax and representations. After all, SGML affords abstractions over representations of information, into very high-level concepts. Many people who use SGML daily aren't aware of them, because they don't know what it is SGML _can_ do, if put to it. It's _very_ unlikely that a user interface should be able to communicate a functional abstraction where the information necessary to make the abstraction is absent or weakly refined in the user. Best regards, ------- (1) I'm not speaking officially on behalf of ISO/IEC JTC 1/SC 18, but this is the position taken in [1], and in ongoing discussions. ------- [1] Character-Glyph Model Discussion.
Attachment 1 to ISO/IEC JTC 1/ SC 18 N3592 Rev. "Liaison statement to JTC 1/SC 2 from JTC 1/SC 18 on ISO/IEC DIS 10646-1.2" (1992-05-26) [2] Yuri Rubinsky, in the Foreword to Charles F. Goldfarb: The SGML Handbook. Oxford University Press, 1991. ISBN 0-19-853737-9. -- Erik Naggum | ISO 8879 SGML | +47 295 0313 | ISO 10744 HyTime | | ISO 10646 UCS | Memento, terrigena. | ISO 9899 C | Memento, vita brevis. ========================================================================= Date: Mon, 5 Oct 1992 23:02:19 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: Keld J|rn Simonsen's message of Fri, 2 Oct 92 20:05:16 +0100 <9210021905.AA07509@dkuug.dk> ----------------------------Original message---------------------------- Date: Fri, 2 Oct 92 20:05:16 +0100 From: Keld J|rn Simonsen For the Japanese/Chinese input: well today it is much used in those countries to type in the information in latin characters, or in Hiragana, because having a full Han/Kanji keyboard is just unmanageable. I understand that at least the Japanese are quite happy with this method. This is very similar to typing in a short identifier. No, I'm afraid this won't work. One cannot simply type out the Romaji or Kana representation of Japanese, the Pinyin or Zhuyin representation of Chinese, or the Hangul representation of Korean, and use that as a shorthand for the Han characters which possess such pronunciations. The mapping from these pronunciations to Han characters is one to many, e.g., one of my smaller Chinese dictionaries, Han-ying cidian, has over 100 characters which correspond to the syllable YI. There are a few Han character name conventions in use, e.g., the Chinese Telegraph Code, wherein operators memorize a 4-digit number for each character. But this is quite impractical these days.
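Glenn's one-to-many objection can be sketched with a toy table. The three characters listed under "yi" really do share that pronunciation, but the table is illustrative, not a real input-method dictionary (a real one, as Glenn notes, would list over 100 entries for this syllable):

```python
# Toy sketch of why a pronunciation cannot serve as a character name:
# the mapping from syllable to Han characters is one-to-many, so the
# syllable alone does not identify a character.
CANDIDATES = {
    "yi": ["\u4e00", "\u8863", "\u533b"],  # 一, 衣, 医 - all pronounced yi
}

def lookup(syllable):
    """Return every character matching a pronunciation."""
    return CANDIDATES.get(syllable, [])

print(lookup("yi"))  # ambiguous: the user must still choose one
```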
So I do not think the inputting of character names for characters is absurd, this is very similar to what is in wide use all over the world today. I think this is a leap of faith on your part and not corroborated at all by any evidence. I am familiar with word processors which are used in India, in the Middle East, in Southeast Asia (Thailand, Burma, Vietnam), those of Korea, China, and Japan, and in no case do I know of an instance where a word processor represents characters by means of their names, nor do I know of any process by which users (simply) enter the names of characters. [Han input methods are not merely the entry of character names; they require significant user interaction to select the appropriate character from an ambiguous name]. Glenn Adams ========================================================================= Date: Mon, 5 Oct 1992 23:02:38 CDT Reply-To: Keld J|rn Simonsen Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Keld J|rn Simonsen Subject: Re: Character Names [Encoding IPA] ----------------------------Original message---------------------------- > Date: Fri, 2 Oct 92 20:05:16 +0100 > From: Keld J|rn Simonsen > > For the Japanese/Chinese input: well today it is much used in > those countries to type in the information in latin characters, > or in Hiragana, because having a full Han/Kanji keyboard is > just unmanageable. I understand that at least the Japanese are > quite happy with this method. This is very similar to typing in > a short identifier. > > No, I'm afraid this won't work. One cannot simply type out the Romaji > or Kana representation of Japanese, the Pinyin or Zhuyin representation > of Chinese, or the Hangul representation of Korean, and use that as > a shorthand for the Han characters which possess such pronunciations.
> The mapping from these pronunciations to Han characters is one to many, > e.g., one of my smaller Chinese dictionaries, Han-ying cidian, has over > 100 characters which correspond to the syllable YI. There are a few > Han character name conventions in use, e.g., the Chinese Telegraph Code, > wherein operators memorize a 4-digit number for each character. But this > is quite impractical these days. I understand that a very common Japanese input method consists of typing in Romaji (Hiragana represented in Latin), via the hiragana characters specifying the sound of the Kanji, and then a list of Kanji characters that have this sound, which the user then can select from. If the user then selects item 8 on the list, this would not be very different from writing the romaji representation of the kana, followed by the digit 8. > So I do not think the inputting of character names for characters > is absurd, this is very similar to what is in wide use all over > the world today. > > I think this is a leap of faith on your part and not corroborated at > all by any evidence. I am familiar with word processors which are used > in India, in the Middle East, in Southeast Asia (Thailand, Burma, Vietnam), > those of Korea, China, and Japan, and in no case do I know of an instance > where a word processor represents characters by means of their names, > nor do I know of any process by which users (simply) enter the names of > characters. [Han input methods are not merely the entry of character names; > they require significant user interaction to select the appropriate character > from an ambiguous name]. In Europe it is common practice to input special characters by name in several word processors, like TeX, SGML, troff; in WordPerfect you do it by character code, and secretaries do this. In WordPerfect you can also input characters by name, like "alpha", "beta" etc, in formulae. But maybe you are not informed on European practice?
Yes, many Input Methods are smarter than that, but input methods based on character naming are widely used in some areas of the world, and not "absurd". Keld ========================================================================= Date: Mon, 5 Oct 1992 23:03:05 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: SGML Philosophy, Character vs. Glyph In-Reply-To: Erik Naggum's message of 02 Oct 1992 20:08:10 +0100 (19921002190810) <23365.09@erik.naggum.no> ----------------------------Original message---------------------------- Regarding the philosophy of SGML, I am not suggesting that a user shouldn't begin to change his/her perception of textual information in the structured way that SGML allows. What I *am* suggesting is that mortal users should not be forced to learn SGML syntax. I don't believe a user should be typing raw SGML text. I also think likewise about TeX, which, over the years, I have become quite expert at; but, because I know the pain of learning it (and SGML), I shouldn't want it forced on anyone else. I have also used very high-level structured text systems that give as much or more structure, e.g., Concordia on the Symbolics Lisp Machine. In the latter, the user *did* have to learn the functional abstractions of structured text, but *did not* have to learn an arcane syntax that only computer scientists could learn to love. I certainly agree that users will indeed have to learn SGML (and TeX) until such high-level systems are more wide-spread. But I would like to suggest that this is an interim step to a more user-friendly technology that gives the same power, but leaves abstract syntax behind. [Of course a high level system could use SGML (or another structure-inducing syntax) underneath the covers, and may indeed publish this layer for use by expert users].
As for the Glyph/Character distinction, I have just reviewed the Liaison statement to which you refer, Attachment 1 to ISO/IEC JTC 1/SC 18 N3592 Rev. "Liaison statement to JTC 1/SC 2 from JTC 1/SC 18 on ISO/IEC DIS 10646-1.2" (1992-05-26). I agree with the goals of this liaison statement, particularly in regard to the desire for WG2 to specify an operational model that specifies the distinction between characters and glyphs. I urged as much in the discussions that led to the 10646 ballot. However, there is a serious problem with the Liaison statement regarding its definition of character. It defines character thus: "Characters are the abstract lexical (or logical) elements of writing systems, considered independently of language, culture, and other external perceptions. They generally represent the smallest freely- combining units of such a system. Characters, since they more directly correspond to the abstract "meaning" or identity of the text they convey, are used in nearly all operations except for actual text presentation." The first sentence is logically impossible, at least according to the best definition of writing system that I think one can find. This sentence asserts that "a character is the element of a writing system," yet, at the same time, it states that such an element is "considered independently of language..." But a writing system is bound not only to a particular language but also to a particular set of orthographic rules (i.e., cultural uses). The French writing systems are different from the English writing systems; the British English writing system is distinct from the American English writing system. A writing system is a particular form which written language takes in the context of a particular language, a particular set of orthographic rules (e.g., correspondences between formal and functional entities), and a collection of symbols drawn from one or more scripts (e.g., Latin, Arabic, Han, etc.).
Like phonology, in which phonemes are defined only in relation to a particular system of language, graphemes also can be defined only in relation to a particular language and set of rules which give rise to these units' capacity as "the smallest freely-combining units." If this definition had left out the second part of the first sentence, i.e., "considered independently...", then I would be happy with this definition; although I would say that this is simply the "intuitive" definition of a character (as understood by an element of an Alphabet, e.g., an element of the Spanish alphabet, or an element of the English alphabet, or an element of the Vietnamese alphabet - all being separate writing systems based on the Latin script). Such an "intuitive" notion of character is not tenable in the context of a universal character set which, indeed, does remove linguistic distinctions. Because of this latter fact, one cannot use the intuitive definition of character, but must, instead, adopt a different definition that is centered around the formal elements of scripts rather than the elements of writing systems. I would suggest that if SGML systems think the user is specifying glyphs, then these systems are in serious need of conceptual clarification. Character sets (10646 included) are not enumerations of glyphs. They are enumerations of abstract elements which, singularly or jointly, serve to represent meaning, and which, during display functions, may be mapped to glyphs for the purpose of display.
Glenn Adams ========================================================================= Date: Tue, 6 Oct 1992 17:01:59 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: Keld J|rn Simonsen's message of Fri, 2 Oct 92 22:02:43 +0100 <9210022102.AA10861@dkuug.dk> ----------------------------Original message---------------------------- Date: Fri, 2 Oct 92 22:02:43 +0100 From: Keld J|rn Simonsen I understand that a very common Japanese input method consists of typing in Romaji (Hiragana represented in Latin), via the hiragana characters specifying the sound of the Kanji, and then a list of Kanji characters that have this sound, which the user then can select from. If the user then selects item 8 on the list, this would not be very different from writing the romaji representation of the kana, followed by the digit 8. No, this won't work either. The appearance of an item in the list is based on (1) context and (2) frequency of appearance. The first applies mainly to Japanese systems, which do not convert a single character at a time, but instead convert entire phrases or sequences of phrases (bunsetsu); consequently, the appearance of an item in the menu is sensitive to its surrounding phrasal context. The second applies because, in pretty much any CJK system, the system remembers how frequently a user selects a character and rearranges the selection menu to reflect usage. This frequency data is keyed to individual users (but may also be keyed to individual documents which store their own hindo (frequency) data).
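The frequency-sensitive candidate menu Glenn describes — each selection is remembered, and frequent choices float to the top, so a fixed "reading plus digit" shorthand cannot name a character stably — can be sketched minimally. The reading "kan" and its candidate characters are illustrative:

```python
from collections import Counter

# Minimal sketch of a frequency-sensitive conversion menu: the menu
# order changes as the user makes selections, which is why "reading
# followed by an item number" is not a stable character name.
class ConversionMenu:
    def __init__(self, dictionary):
        self.dictionary = dictionary      # reading -> candidate characters
        self.frequency = Counter()        # how often each candidate was chosen

    def candidates(self, reading):
        """Candidates for a reading, most frequently chosen first."""
        return sorted(self.dictionary[reading],
                      key=lambda c: -self.frequency[c])

    def select(self, reading, index):
        """User picks item `index` from the current menu; remember it."""
        choice = self.candidates(reading)[index]
        self.frequency[choice] += 1
        return choice

menu = ConversionMenu({"kan": ["\u611f", "\u6f22", "\u5b98"]})  # 感, 漢, 官
menu.select("kan", 1)             # user picks the second candidate (漢)
print(menu.candidates("kan")[0])  # that character now heads the menu
```

After one selection, "item 1" on the menu no longer names the character it named before, which is exactly Glenn's objection.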
Glenn ========================================================================= Date: Tue, 6 Oct 1992 17:02:40 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: Keld J|rn Simonsen's message of Fri, 2 Oct 92 22:02:43 +0100 <9210022102.AA10861@dkuug.dk> ----------------------------Original message---------------------------- Date: Fri, 2 Oct 92 22:02:43 +0100 From: Keld J|rn Simonsen In Europe it is common practice to input special characters by name in several word processors, like TeX, SGML, troff; in Wordperfect you do it by character code, and secretaries do this. In WordPerfect you can also input characters by name, like "alpha", "beta" etc, in formulae. But maybe you are not informed on European practice? I happen to think this practice is hopelessly outdated. A modern system should provide virtual keyboards and multiple input methods that allow the user to enter character data without recourse to knowing the names of characters, i.e., by having key mappings to the desired characters, by employing appropriate key(s) -> composite character sequence mappings, or by employing the necessary language representation conversion as needed to enter Han characters. Glenn ========================================================================= Date: Tue, 6 Oct 1992 17:03:26 CDT Reply-To: Keld J|rn Simonsen Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Keld J|rn Simonsen Subject: Re: Character Names [Encoding IPA] ----------------------------Original message---------------------------- > Date: Fri, 2 Oct 92 22:02:43 +0100 > From: Keld J|rn Simonsen > > In Europe it is common practice to input special characters > by name in several word processors, like TeX, SGML, troff; in > Wordperfect you do it by character code, and secretaries do this. > In WordPerfect you can also input characters by name, like > "alpha", "beta" etc, in formulae. 
But maybe you are not informed on > European practice? > > I happen to think this practice is hopelessly outdated. A modern system > should provide virtual keyboards and multiple input methods that allow the > user to enter character data without recourse to knowing the names of > characters, i.e., by having key mappings to the desired characters, by > employing appropriate key(s) -> composite character sequence mappings, or by > employing the necessary language representation conversion as needed to > enter Han characters. > > Glenn Well, yes, this may not be the technology that one is dreaming of. But it is current technology and it will be in use for many years to come. And this is the point. I am talking about a fallback technology, and fallbacks are only relevant if you do not have the real thing. Fallbacks can also be utilized for producing the real thing, in a kind of bootstrap mode. So the notation is for the use of those users which lag behind in technology, e.g. today's 7- and 8-bit users, which actually amounts to quite a few people, and this population will only decline slowly. keld ========================================================================= Date: Tue, 6 Oct 1992 17:04:01 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: Erik Naggum's message of 02 Oct 1992 19:15:19 +0100 (19921002181519) <23365.07@erik.naggum.no> ----------------------------Original message---------------------------- From: Erik Naggum Date: 02 Oct 1992 19:15:19 +0100 (19921002181519) Having done this, and having an application character set, or code sequences to accomplish a given glyph on the display device, it's trivial to produce a mapping by name lookup.
E.g., if TeX is used as the processing back-end: Definition: Local mapping: "LATIN SMALL LETTER O WITH STROKE" = "{\o}" Produces a display version: "ø". I think this model doesn't take into account the full generality of character to glyph mappings. Assuming that SGML is specifying characters (and not glyphs), I gather that, during display, these characters will be mapped to glyphs. But this should not be confused with the use of entity names to represent characters. How will this model handle the 1-N mappings of Arabic and Indic scripts, and the N-1 mappings to ligature glyphs? May I assume that this char->glyph mapping is outside the scope of SGML? [I've seen early drafts of DSSSL in which this process is made explicit; however, I saw no evidence that those drafts take into account the generality of the char <-> glyph relationship. Perhaps that has been rectified by now?] Therefore, we need a "character set manager" which can read any character data stream, compliant with ISO 2022 or IBM CDRA, or whatever, and let the parser see it as pure and undiluted ISO 10646. Passing to the application, we need to invoke the character set manager once again to convert the internal representation (ISO 10646) to whatever the application will understand. I completely agree with this model. Applications will continue to use old character sets forever. 10646 is a good choice as a canonical representation for a parser, with conversion to/from extant charsets at the boundary. I think many systems will use this model.
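The boundary-conversion model endorsed here can be sketched in modern terms. A minimal illustration in Python, where the built-in str type plays the role of the ISO 10646 canonical representation and the codecs machinery plays the role of the "character set manager" (the encodings and sample bytes below are illustrative assumptions, not from the original discussion):

```python
# Sketch of the "character set manager" model: any incoming byte
# stream is converted to one canonical representation before the
# parser sees it, and converted back at the boundary on output.

def to_canonical(raw: bytes, source_encoding: str) -> str:
    """Decode an incoming byte stream into canonical Unicode/10646 text."""
    return raw.decode(source_encoding)

def from_canonical(text: str, target_encoding: str) -> bytes:
    """Re-encode canonical text for an application expecting a legacy charset."""
    # errors="replace" stands in for whatever fallback policy the
    # application chooses for characters the target set lacks.
    return text.encode(target_encoding, errors="replace")

raw = b"sm\xf8rbr\xf8d"                   # Latin-1 bytes from one system
canonical = to_canonical(raw, "latin-1")  # parser works only on this form
for_mac = from_canonical(canonical, "mac-roman")  # same text, another charset
```

The parser and application never see raw legacy bytes; only the two boundary functions know about particular character sets.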
Glenn ========================================================================= Date: Tue, 6 Oct 1992 17:05:49 CDT Reply-To: Erik Naggum Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Erik Naggum Subject: Re: Character Names [Encoding IPA] In-Reply-To: <9210030111.AA00925@boas.metis.com> (Fri, 2 Oct 92 21:11:11 EDT) ----------------------------Original message---------------------------- | May I assume that this char->glyph mapping is outside the scope of | SGML? Yes, this is true. Maybe the example with TeX as the back-end was a bad example, since I didn't want to use it for char->glyph mapping. Such is, strictly speaking, only possible in the application, since the parser doesn't know what the data characters are supposed to mean, in the eyes of the application. All I'm doing with my model is to allow them to agree on an encoding that both will be happy with. It is, after all, up to the application to support the parser or entity manager with the "display version" for the public entity set references. Still, an SGML document _must_ allow a user-specified char->glyph mapping, or perhaps user-specified glyphs. The only way this can be done is with entity references to a glyph ID of some sort. | [I've seen early drafts of DSSSL in which this process is made | explicit; however, I saw no evidence that those drafts take into | account the generality of the char <-> glyph relationship. Perhaps | that has been rectified by now?] I'm not following the DSSSL work as closely as I feel I should, so I can't answer this, unfortunately. It's an important question. | I completely agree with this model. Applications will continue to | use old character sets forever. 10646 is a good choice as a canonical | representation for a parser, with conversion to/from extant charsets | at the boundary. I think many systems will use this model. Thanks.
I'm writing a specification for this model, and will send you (and anyone else who might be interested) a copy upon completion. Best regards, -- Erik Naggum | ISO 8879 SGML | +47 295 0313 | ISO 10744 HyTime | | ISO 10646 UCS | Memento, terrigena. | ISO 9899 C | Memento, vita brevis. ========================================================================= Date: Tue, 6 Oct 1992 17:07:08 CDT Reply-To: Keld Jørn Simonsen Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Keld Jørn Simonsen Subject: IPA short identifiers ----------------------------Original message---------------------------- I vaguely remember that this discussion originated in a request for short names for the International Phonetic Alphabet. It is true that RFC 1345 does not currently cover IPA, but the intention is to cover it, and I would be very happy to work together with somebody from the IPA community to make this happen. Keld ========================================================================= Date: Tue, 6 Oct 1992 17:07:44 CDT Reply-To: Lou Burnard Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Lou Burnard Subject: MIME ----------------------------Original message---------------------------- I was interested to see Glenn Adams' reference to the MIME RFC, though not because of its support for eight-bit characters (which, like most things to do with character sets, I regard as marginally less interesting than watching paint dry). How many other readers of this list have read the RFC? How many others have been as depressed as I was by its espousal of something called "rich text" format? This is nothing (or not much) to do with any proprietary format of a similar name, nor, despite appearances, is it much to do with SGML. What it offers is the ability to put labelled bracketing into your text and have the receiving mailer reformat it, maybe. The RFC is a bit coy about the relationship between what it proposes and SGML.
It clearly *isn't* SGML, because there isn't any DTD -- a set of example names for elements is proposed with some vague semantics (bold, italic etc.) but you can add to it at will and there's no way of telling your recipient what your extensions mean -- but it clearly would like to be because you must make sure your brackets are properly nested. On the other hand, there's no indication of what the element nesting means: at one point it says that <a><b> and <b><a> are both legal and *both have the same effect*, which seems somewhat counter-intuitive to me. The example in the RFC is for bold and italic, which makes sense; but suppose (choosing pairs at random from the proposed list) a and b were 'heading' and 'footing', or 'flushleft' and 'center'? The acronym 'RFC', as we all know, doesn't stand for "request for comment" but "really firm concrete" so I don't anticipate getting this sort of idiocy changed in the near future. What might be nice though is to work towards a world in which a MIME message could specify 'TEI.2' as its message type. Any advice on how we might cause that to happen would be gratefully received... Lou Burnard ========================================================================= Date: Tue, 6 Oct 1992 17:09:13 CDT Reply-To: David Megginson Sender: "TEI-L: Text Encoding Initiative public discussion list" From: David Megginson Subject: Re: Character Names [Encoding IPA] In-Reply-To: Message of Tue, 6 Oct 1992 00:01:36 -0400 from ----------------------------Original message---------------------------- Not everyone is working to develop standards. I am a language specialist, and I would love a way to post IPA transcriptions _now_, not five years from now. And, BTW, I am not willing to type in 20-character entity names.
David %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% David Megginson Department of English, dmeggins@acadvm1.uottawa.ca University of Ottawa %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ========================================================================= Date: Wed, 7 Oct 1992 06:37:21 CDT Reply-To: Erik Naggum Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Erik Naggum Subject: Re: Character Names [Encoding IPA] In-Reply-To: (06 Oct 1992 17:09:13 -0500 (19921006220913)) ----------------------------Original message---------------------------- David, and all the rest of you, too: I can sympathize with your desire to not engage in the standards process and instead get on with the real work, but I have no sympathy if you don't carefully read messages about such processes before you debunk what isn't even under consideration. (No, nobody has tried to push very long entity names in any way.) Please _read_ the following, before you decide to cast a vote of disapproval on a proposal you have no interest in before it's a fact that will annoy you in practical applications. Trust me, I wouldn't work on this if I weren't really, really, deeply annoyed by the fact that the present public entity sets are almost completely useless, and require hours of manual labor to change into something useful for a given display device. _You_ don't do this work, so _you_ may not care, but you have to pay (and trust) the moron who is set to do such menial labor, because SGML gurus sure try to spend their time on other things. OK, bear with me, and follow me beneath the cover. This will be a fairly short trip, and it won't hurt. Take a "schwa", or some other favorite IPA character, with you, and come along. In a document you type, you would rather have a "schwa" if you could, and if you can, more power to your editing tool. I assume, however, that you can't, or you wouldn't be here. 
We already know that if we can associate a name with this symbol, we can reference the name and get the symbol. So far so good. SGML even provides the syntax for these entity references, so we don't have that wheel to invent, too. The problem you still face is the one of finding out what your "schwa" is called this week, and whether the application of the month can handle it in the new version. If you're always using documents with the same application on the same old system, you can be satisfied with a local solution (a.k.a. "hack"), and let others have their problems. However, there are standards bozos like me who care about the general case. I care about two things: (1) how you can find the name of your "schwa" in a definitional entity set (which merely lists the entity names and what they "mean"), (2) how the parser can select the right display version (which lists entity names and the associated magic to have the desired character end up in the resulting document). I propose that we use the facilities of SGML to differentiate between _definitional_ and _display_ character entity sets, and to use the full name of the _character_ the entity is intended to capture. Note: SGML already has this differentiation built-in, it's just that most parsers don't use it, because it hasn't added value to do so. (Until now.) Instead, SGML gurus waste their time on moving public entity sets between machines and applications, and they hate it. If you don't see many public entity sets that support many different display devices, this will give you a clue why. With ISO 10646 (the huge character set standard), we got a list of names of characters for free (or as free as anything published by ISO). This was just a means to an end to the people who wrote _that_ standard, but we can use the same means to another end. For instance, our schwa is called "LATIN SMALL LETTER SCHWA".
(You can look this up in ISO 10646, and you can even gaze at the nice little characteristic glyph and nod in recognition.) Given a unique name for every conceivable character, and then some, we can make a mapping from _entity_ to _character_ which tells us that &schwa; will yield a schwa, properly parsed. Note that I picked this name at random. You may not think it's particularly random, and I don't think so either, but I could have chosen "frotz", and if your SGML system insisted on fixed entity names, you would have to put up with that, or complain. Now, "frotz" is much better than "latin-small-letter-schwa-this-week", and I promise you that you'll never see the latter, but it still is a pain. That is, you face a choice, as a user, of an entity name, thus: (1) you can use an entity name that has been standardized, in the hopes that everybody will use the same entity name for the same thing, which is actually a character, or (2) you can use any entity name which refers to the character you really want if you properly declare it. The present trend is the former, and we already know that this won't and can't work. The reasons are very simple: People already differ in their tastes, and there are only so many characters that can be named with short entity names. Also, there are at least two widely disseminated public entity sets which differ _radically_ in entity names for the same characters. I propose the second solution, which will enable us to map _any_ definitional entity set onto a display entity set for a given environment (i.e., display device) by _name_ (and that's the long _character_ name, as opposed to whatever entity name you choose, probably very short if I get your message, something like "e", right?).
That is, given such a definitional mapping and a display device which will render the schwa if we give it the decimal code 159, we can have a mapping (which "happens" to look very much like a character declaration): 159 -- 009F -- 1 "LATIN SMALL LETTER SCHWA" to produce the resulting display entity declaration. Note, again, that this is completely irrespective of the _entity_ name, which can be everything from one character to your local maximum number of characters long. What will this buy you? If you have an in-house SGML guru, it will buy you a friendlier in-house SGML guru, because he has to do that anal-retentive work on those stupid tables only once, and he doesn't have to do a lot of boring work if you get a document from somebody with a different choice for entity names. If you don't have an in-house SGML guru, you'll get better telephone support. Somewhere down the line, you'll even get SGML support for a wider variety of output devices, and perhaps more useful and more easily obtainable public entity sets which will actually work with your application, _without_ calling said SGML guru (at night). As the user, you'll also be able to select entities by looking at a list of full names of characters in a published, reliable standard (unlike some vendor's (missing) documentation), and you can be assured that whatever public entity set you use, it will come out just like you want, even after you move it to a different vendor's system. (At least this is what I wake up remembering that I dreamt.) You can even invent local characters that aren't in _any_ public entity set, if your parser is smart enough. (Mine is, of course.) Now, why do I care about this? I develop solutions that I try to propose for standards, I develop software to use those solutions in practice, to field test them, and to gain that all-important feedback, and I use the results of this stuff to talk to 3 widely different printers in my own SGML system.
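The two-level scheme described above can be sketched in a few lines. A minimal illustration in Python, with invented table contents (the entity names, the code 159, and the TeX macro name are assumptions for illustration, not drawn from any actual entity set): a definitional entity set binds entity names to full ISO 10646 character names, a per-device display entity set binds character names to output magic, and the two are joined by the character name, so the entity name itself can be anything:

```python
# Definitional entity set: entity name -> ISO 10646 character name.
definitional = {
    "schwa": "LATIN SMALL LETTER SCHWA",  # &schwa; -> the character
    "e":     "LATIN SMALL LETTER SCHWA",  # a shorter local alias, same character
}

# One display entity set per environment, keyed by character name.
display_for_terminal = {
    "LATIN SMALL LETTER SCHWA": chr(159),  # device renders schwa at code 159
}
display_for_tex = {
    "LATIN SMALL LETTER SCHWA": r"\textschwa{}",  # hypothetical TeX macro
}

def resolve(entity_name: str, display: dict) -> str:
    """Map an entity reference to its display form via the character name."""
    character_name = definitional[entity_name]
    return display[character_name]

# Both entity names come out the same on a given device:
assert resolve("schwa", display_for_tex) == resolve("e", display_for_tex)
```

Swapping output devices means swapping the display table only; the document and its definitional entity set are untouched.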
Guess who got sick and tired of fiddling with unreadable and unmaintainable code tables and other assorted randomness? That's me, and I set out to do something about it. Of course, _I_ don't want to type in "LATIN SMALL LETTER SCHWA" every time I want a schwa, either. I may be crazy for all this work I do on SGML, but I'm not a complete idiot. Matter of fact, I'd like to tell my SGML editing system that "Hey, you! I'm going to use a schwa over and over in this document, and I'd like to have it accessible with a minimum of fuss." So, I proceeded to solve this problem, too. If you have one of those workstations that can give you whatever random graphic character you want, and you can map this character to some code you can input from your keyboard, why shouldn't you be able to be _saved_ all this "&schwa;" business to begin with? The solution is called "dynamically redefinable character sets", and it doesn't take very much to have one running on _your_ workstation, even on your Windows system. Under this scheme, you would tell your parser that you'd used code 159 for schwa by mapping 159 to LATIN SMALL LETTER SCHWA as above. (Only, I expect you won't have to do this yourself.) Then your parser and application will be happy, because they know what to do with a character named LATIN SMALL LETTER SCHWA, even if they wouldn't have a single clue about "code 159" if it bit them in the rear. See, as a user, myself, "I'm a little irritated" (that's a trademarked, prize-winning understatement) for having to use entity names for common Norwegian letters. (They occur about as frequently as "q" in English.) Unlike other users, I'm not docile enough to accept "it's supposed to be that way". I rebel, I throw out the rascals, and I solve the problem. If other users don't see the problem, and are willing to wait until they repeatedly meet it, and end up throwing thousands of dollars' worth of computing machinery down nine floors, be my guest.
If I solve it before you get thus aggravated, send me the net worth of the computer you _didn't_ destroy in frustration and anger, OK? God knows I'm not getting paid for this work in _this_ life. I'm trying to solve a problem you have, and if you don't want to listen, that's your prerogative. If you recognize the problem, and have any comments, I'd be very happy to hear about them. If you don't recognize the problem, and think you're happy where you are, don't try to obstruct the solution to problems that every foreign-language SGML user has to fight every single day, unless they have _very_ friendly SGML guru(s) in their immediate vicinity. Best regards, -- Erik Naggum : ISO 8879 SGML : +47 295 0313 : ISO 10744 HyTime : : ISO 10646 UCS : Memento, terrigena. : ISO 9899 C : Memento, vita brevis. ========================================================================= Date: Wed, 7 Oct 1992 06:37:40 CDT Reply-To: Erik Naggum Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Erik Naggum Subject: Re: MIME In-Reply-To: <00961AE9.1CE88120.26683@vax.ox.ac.uk> (06 Oct 1992 17:07:44 -0500 (19921006220744)) ----------------------------Original message---------------------------- Lou Burnard writes: | | The RFC is a bit coy about the relationship between what it proposes | and SGML. Its authors don't think very highly of SGML, because they don't know what it is, but behave as if they did. I tried to stop richtext, on the same grounds that the authors rightfully rejected a number of other things: on technical merit (i.e., lack thereof). Many people saw the point that there was no need to reinvent a markup language in the presence of SGML, but the authors insisted, quite adamantly, that this be included, or they'd quit. They won, for some mysterious reason. I drafted a paper on the difference between RichText and SGML, which I can send to anyone who might be interested.
(Please state clearly that you think the topic is vastly more interesting than watching paint dry.) | The acronym 'RFC', as we all know, doesn't stand for "request for | comment" but "really firm concrete" so I don't anticipate getting this | sort of idiocy changed in the near future. This is utter nonsense. RFC is "Request for Comments", as it has always been. It's up to the Internet Architecture Board to "promote" certain RFCs on the standards track. MIME is a "draft standard" at present, and will go through a review period before it's "promoted" to full standard status. I'm going to be there, and I'm going to get richtext cleaned up. | What might be nice though is to work towards a world in which a MIME | message could specify 'TEI.2' as its message type. Any advice on how | we might cause that to happen would be gratefully received... The RFC itself lists the procedures and authority to do so on page 2. If you haven't read this RFC closely enough to find it, I don't blame you. It's one of the worst RFCs I've seen (and I started my work with Internet mail in 1987; I've seen many of them). Here's the pertinent paragraph: MIME has been carefully designed as an extensible mechanism, and it is expected that the set of content-type/subtype pairs and their associated parameters will grow significantly with time. Several other MIME fields, notably including character set names, are likely to have new values defined over time. In order to ensure that the set of such values is developed in an orderly, well-specified, and public manner, MIME defines a registration process which uses the Internet Assigned Numbers Authority (IANA) as a central registry for such values. Appendix F provides details about how IANA registration is accomplished. Contact the Internet Assigned Numbers Authority if you have questions. I would recommend delaying such a move until MIME has been reviewed and promoted to "standard", though. 
Best regards, -- Erik Naggum : ISO 8879 SGML : +47 295 0313 : ISO 10744 HyTime : : ISO 10646 UCS : Memento, terrigena. : ISO 9899 C : Memento, vita brevis. ========================================================================= Date: Wed, 7 Oct 1992 06:38:12 CDT Reply-To: Peter Flynn Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Peter Flynn Subject: Re: Character Names [Encoding IPA] ----------------------------Original message---------------------------- Glenn comments on Keld: > >In Europe it is common practice to input special characters > >by name in several word processors, like TeX, SGML, troff; in > >Wordperfect you do it by character code, and secretaries do this. > >In WordPerfect you can also input characters by name, like > >"alpha", "beta" etc, in formulae. But maybe you are not informed on > >European practice? >I happen to think this practice is hopelessly outdated. A modern system >should provide virtual keyboards and multiple input methods that allow the >user to enter character data without recourse to knowing the names of >characters, i.e., by having key mappings to the desired characters, by >employing appropriate key(s) -> composite character sequence mappings, or by >employing the necessary language representation conversion as needed to >enter Han characters. I agree with Glenn, but let's face it, we just don't have the hardware. About 10 years ago I saw a demo keyboard from (I think) Anderson-Jacobsen which had LEDs in each keycap: as you got the host to send escape sequences to the terminal to change alphabet, the keycaps lit up with the right symbols. It wouldn't be *that* expensive to do, perhaps make a keyboard cost $200 instead of $100. Fixing the characters shown on the screen is (by now) trivial, except that very little software actually does it right.
I for one would love to go back to my first keyboard layout, used on Linotype hot-metal composing machines, because it was ergonomic, but that's fantasy because the physical design was different. So while we wait for the world to wake up, what do we do? As Keld says, we carry on using name-sequences because it's comprehensible, even though it's more keystrokes, or using (re-using or even mis-using?) control characters or characters from 128d thru 255d as in many wordprocessors. Some pragmatic fixes: We use PC-Write a lot on campus. It's a bit middle-of-the-road, and not a graphical WP system, but it's robust, cheap and easy to use. Diacriticals are entered by pressing the letter, the grave-accent, and then the accent character, so an á goes in as an "a", a "`" and a "'". The grave accent key causes a cursor-left movement over the letter, and when the accent character is typed, the screen character changes to the requisite diacritical. There's a set of keystrokes like this to enter all the IBM PC code page 437 and 850 non-ASCII characters. It's OK, faster than Alt-x-y-z and in any case PC-Write can define keyboard keys to generate any character you want, if you really need to emulate QWERTZ or AZERTY or anything else (although you still have to stick paper on the keycaps to show what's what). Why am I saying this? Well, we also use TeX and SGML, so we can use this reprogramming capability (and most four-wheel WP and DTP should be able to do this) to let the user have single keystrokes, but generate the required characters in the file. Nothing special here, lots of folks do it. But one of the "hidden" reasons we pressed for PC-Write is that you can also reprogram the print drivers (which are actually character-conversion specs in a plain editable text file) to output whatever you want when you print-to-disk. So a user can have arbitrary IBM PC characters in a file so that the text looks right on the screen, but if they print to disk, they get a file full of TeX or SGML.
Not only that, but so long as they stick to a specific filetype it's transparent to them (.doc will output .tex and .iso will output .sgm) because the print driver can be made sensitive to the input filetype and use the correct definitions. You want SGML instead of TeX? Rename the file from mythesis.doc to mythesis.iso and print to disk again. And if you don't want TeX or SGML, you still have a valid PC-Write file for plain WP purposes. In any case, TeX v3 is 8-bit, so you can put a (say) IBM PC character 160d (á) in your input file and use \catcode`\x=\active \defx{\'a} in your macro file (where `x' is actually a 160d, omitted here to honor email restrictions) and TeX will quite happily print an a-acute. I'm sorry if this sounds like a plug, it's not meant to be, just that we saw this one coming years ago and tackled it head-on. A skeleton version of this stuff is in PCWRITEX.ARC (or .ZIP) in SIMTEL-20 and a variety of other archives, if you want to look at it. OK, so it's not perfect. It's restricted to the crappy and proprietary IBM PC character set, and PC-Write is a visual WP, not a logical one, so it has no clue about structure, making it silly to use this and expect to get structured LaTeX or valid SGML out of it; all it's doing is mapping characters. But it sure as hell takes the pain out of *my* job, when students come up with a file, suddenly having decided they want to use TeX instead of WP. NOW...who's going to genericise this kind of process so it will (a) work with an arbitrary character set (ISO or otherwise); (b) work in a sensible graphical mode (inside MS-Windows, X-Windows etc); (c) run on something other than DOS; and (d) be publicly available? The source code of PC-Write is available from the authors, but there are other, probably much better, editors also available.
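The print-driver trick described above amounts to a per-byte translation table selected by output filetype. A minimal sketch in Python (the table entries are illustrative, covering only two code page 437 characters, and do not reproduce PC-Write's actual driver files) of how the same 8-bit file can "print to disk" as either TeX or SGML:

```python
# Driver tables: byte value in the 8-bit file -> markup in the output.
CP437_TO_TEX = {
    0xA0: r"\'a",   # á, IBM PC code 160, as in the TeX example above
    0x82: r"\'e",   # é, IBM PC code 130
}
CP437_TO_SGML = {
    0xA0: "&aacute;",
    0x82: "&eacute;",
}

def print_to_disk(raw: bytes, table: dict) -> str:
    """Translate each byte through the driver table; ASCII passes through."""
    return "".join(table.get(b, chr(b)) for b in raw)

doc = bytes([0x6D, 0xA0, 0x73])   # "más" in code page 437
tex_output = print_to_disk(doc, CP437_TO_TEX)    # m\'as
sgml_output = print_to_disk(doc, CP437_TO_SGML)  # m&aacute;s
```

Choosing the table by filetype, as the PC-Write driver does, is then just a dictionary lookup from extension to table.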
The alternative is to shell out $6000 for a copy of Arbortext's SGML/edit (no TEI discount) or use SoftQuad's Author/Editor (our choice for the future), or one of the other offerings: and they all have their drawbacks, including an assumption that the user (often a WP-oriented student, staff or faculty member) has a sound working knowledge of SGML. Ah, I hear you say: wait for WordPerfect's SGML version. Hehehe. Much as I hate to plug WordPerfect (IMHO *the* antithesis of ergonomic editing), if we all take it on board, and publicly praise it, Microsoft _et al_ will not be far behind giving SGML compatibility to their competing products (actually I was under the impression that Word already had some form of SGML sensitivity). Then perhaps we will be on the right road, but we still have the major hurdle to overcome: getting users to think structured rather than think visual, until the software becomes capable of divining (or asking for) *why* the user has switched to boldface, or has entered an isolated line beginning with a digit and a period :-) Sorry for the ramble... ///Peter ========================================================================= Date: Fri, 9 Oct 1992 15:57:50 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: Peter Flynn's message of Wed, 7 Oct 1992 06:38:12 CDT <9210071142.AA04806@sapir.metis.com> ----------------------------Original message---------------------------- Date: Wed, 7 Oct 1992 06:38:12 CDT From: Peter Flynn We use PC-Write ... Diacriticals are entered by pressing the letter, the grave-accent, and then the accent character ... I'm glad to see that you used postfix diacritic entry rather than the dead-key approach.
The reason I may sound a bit idealistic in my recent mail is that I have been busy for the last couple of years creating Unicode/10646 text processing libraries, including rich-text layout engines for unrestricted multilingual text. Rather than focusing on how to use what is out there, I've been actively building new technology that solves these problems from the start. Of course I now have the real challenge of integrating this into existing technology. I have cause to hope for good progress though, as a large PC software house is actively supporting this work. Glenn ========================================================================= Date: Fri, 9 Oct 1992 15:58:13 CDT Reply-To: Terry Crowley Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Terry Crowley Subject: Re: MIME ----------------------------Original message---------------------------- Don't condemn all of MIME because of richtext. The inclusion of richtext was a mistake for a whole variety of reasons. However, the other parts of MIME that specify the ways to describe content types and encodings in mail messages are valuable and will assist in achieving interoperability between heterogeneous mail and document systems. This will not happen because of MIME's inherent document structuring facilities, but because it will enable me to know what I'm getting and therefore reliably and automatically convert it into a form that is readable to the user. I believe that even if richtext is left in the final version of the document, it will be mostly ignored by implementors since it is so poorly defined and of limited utility.
Terry ========================================================================= Date: Fri, 9 Oct 1992 15:58:41 CDT Reply-To: JOHN DUKE Sender: "TEI-L: Text Encoding Initiative public discussion list" From: JOHN DUKE Subject: Electronic AACR2 ----------------------------Original message---------------------------- This message is being cross-posted to AUTOCAT, PACS-L, and TEI-L. The Joint Steering Committee for Revision of AACR and the publishers of AACR2 have authorized the production of an electronic version of AACR2 using the Standard Generalized Markup Language (SGML). George Alexander (a non-librarian who is versed in SGML tagging) and I were selected to convert the heavily-coded typographer's version of AACR2 into an SGML format that can be used by developers to construct useful products for the cataloging community. There are two goals to our efforts to tag AACR2 in SGML format: 1. To supply enough information to permit a computer program to reproduce a reasonable facsimile of the printed page of AACR2, including varying sizes of type, italics and bold, tab settings, etc. 2. To provide certain hooks into the structure of the rules (such as rule references, options, and examples) to permit a computer program to manipulate the rules intelligently. The completed SGML tagged rules, or AACR2-e, will not be an end-user's product. It will require an interface to interpret the tags and to take advantage of the structural hooks. Developers of AACR2-e may develop linkages to other products, such as the LC Rule Interpretations or the MARC format documents. Developers may also insert additional SGML tags into the text for a more sophisticated product. In short, it is expected that developers will add value to the raw AACR2-e file. We are close to completing prototype models for chapters 2 and 22.
The purpose of this communication is to solicit reviewers who are willing to evaluate the work that has been done and to make suggestions for further improvements prior to the final release of the complete AACR2-e. Ideally, the reviewers will be those who anticipate developing a product from the raw AACR2-e file. Reviewers should be familiar with the printed AACR2 and have an awareness of (and preferably actual working experience with) SGML coding. We expect to ship the two prototype chapters at the end of October or early November. If you are interested in serving as a reviewer, please respond to me personally (_NOT_ through the list) by October 23. Include a statement of your qualifications for the project and what, if any, products you are considering developing using AACR2-e. A limited number of reviewers will be selected. Those who are accepted as reviewers will receive the SGML-tagged file of chapters 2 and 22 on a 3 1/2" MS-DOS diskette that can be viewed through a standard ASCII text editor. Reviewers will also be required to sign an agreement that, among other things, binds them not to disseminate the files to others, not to modify the text of AACR2, not to engage in any development work in advance of the full release of the machine-readable AACR2, and to share the results of any experimentation and research of the files with the publishers of AACR2. I am also interested in the views and comments of those who are not reviewers regarding the development of AACR2-e, either as a personal communication or through the forum of this list. Thank you for your assistance on this important project.
******************************************************************* John Duke BITNET: jduke@vcuvax Assistant Director INTERNET: jduke@ruby.vcu.edu Network and Technical Services PHONE: 804/367-1100 University Library Services FAX: 804/367-0151 Virginia Commonwealth University Richmond, VA 23284-2033 ========================================================================= Date: Fri, 9 Oct 1992 15:59:07 CDT Reply-To: "Liam R. E. Quin" Sender: "TEI-L: Text Encoding Initiative public discussion list" From: "Liam R. E. Quin" Subject: Re: Character Names [Encoding IPA] ----------------------------Original message---------------------------- Peter Flynn wrote a wonderful ramble (as he modestly called it) about the pragmatics of text editing on today's -- yesterday's, some would say -- hardware. > The source code of PC-Write is available from the authors, but there are > other, probably much better, editors also available. The alternative is to > shell out $6000 for a copy of Arbortext's SGML/edit (no TEI discount) or > use SoftQuad's Author/Editor (our choice for the future), or one of the other > offerings: and they all have their drawbacks, including an assumption that > the user (often a WP-oriented student, staff or faculty member) has a sound > working knowledge of SGML. We like to think (and many of our users agree) that an Author/Editor user does _not_ need a sound knowledge of SGML. Instead, one uses a graphical representation of the document, with icons representing the start & end of document elements. One has to know about the idea of structured documents, but that transcends SGML or any one product, I'd hope. Of course, if one is going to use RulesBuilder to create DTDs one has to know SGML. It's also worth mentioning that no-one ever need type, or even see, any actual SGML with Author/Editor. Many of our users don't even realise they're using SGML! 
I do agree with Peter that > but we still have > the major hurdle to overcome: getting users to think structured rather than > think visual, until the software becomes capable of divining (or asking for) > *why* the user has switched to boldface, or has entered an isolated line > beginning with a digit and a period The nearest we've come up with so far in our released product, I think, apart from the obvious context sensitivity, is associating a natural-language description with each tag, so you can at least have something like E1 Emphasised, or bold, text KW Keywords or jargon and so forth. Associating screen styles helps a lot, too. Showing an IPA string with the IPA font is, in the absence of Unicode, 10646, or suitable ASCII representations of the IPA, a necessity, of course. We are all looking forward to a world where 16-bit character sets are the norm, and where large numbers of special and national characters are available, but in the meantime we have to live with what we have. I think this is a large part of the difference between Glenn, Erik and Keld: Glenn is already using Unicode (I understand); it's less clear to some of the other people on this list how they should go from where they are now to where they would like to be in three or five years' time, without spending more money than they have. Reviewing a book on X Windows recently which described a terminal window as `just like an ordinary alphanumerical terminal', I made a note asking the author how many of the readers - many of whom will be students - will ever have seen an `alphanumerical terminal'. Probably not many. It's easy to forget that there can be a huge gulf between an R & D establishment or University and what people are actually using through necessity.
Lee -- Liam Quin, lee@sq.com, SoftQuad, Toronto, 416 239-4801; the barefoot programmer lq-text (Unix text retrieval package) mailing list: lq-text-request@sq.com HexSweeper NeWS game/OPEN LOOK UI FAQ & program list/Metafont list/XView ports ========================================================================= Date: Fri, 9 Oct 1992 15:59:51 CDT Reply-To: marchand@ux1.cso.uiuc.edu Sender: "TEI-L: Text Encoding Initiative public discussion list" From: James Marchand Subject: double byte ----------------------------Original message---------------------------- I don't want to get into this argument, but I did want to mention that BYTE magazine has this month (October 1992) a discussion of the problems one encounters in dealing with Japanese: "In the Land of the Double Byte," 47f. Although occasionally somewhat amateurish, it seems to be based on "hands-on" experience. ========================================================================= Date: Sat, 10 Oct 1992 22:52:16 CDT Reply-To: Keld J|rn Simonsen Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Keld J|rn Simonsen Subject: Re: Character Names [Encoding IPA] ----------------------------Original message---------------------------- > I think this is a large part of the difference between Glenn, Erik and > Keld: Glenn is already using Unicode (I understand); it's less clear > to some of the other people on this list how they should go from where > they are now to where they would like to be in three or five years' > time, without spending more money than they have. My comments originate from the experience with the migration from 7-bit national ISO 646 variants, and national EBCDICs (Danish), to the 8-bit world of ISO 8859-1 and PC, HP, and Macintosh character sets. This has been a long and painful process, starting 10 years ago, and we are still in the midst of this process.
In some areas it has just begun, and it seems that the software is slower to convert than the hardware; maybe the data is the slowest to convert. The 7- to 8-bit migration should then not be that difficult compared to an 8-bit to 16- or 32-bit world; the 7 -> 8 bit conversion does not change the allocation of data needed, as text can still be stored in bytes. But I do not think it is a coincidence that Erik and I have these considerations: Nordic countries like Norway and Denmark have had widespread use of national 7-bit character sets, while it is my impression that the rest of Europe has not had the same emphasis on national character sets. From my experience in heterogeneous computer environments and as a national and international internet electronic mail provider for heterogeneous systems it is clear that the migration in character set support is very slow - we are talking multiple decades - and there is a need for intercommunication between the different character set worlds during these many years, which schemes like the SGML or the Mnemonic internet specification may facilitate. But then I agree with Glenn that the best thing is to have the real thing: full support for all of the characters of the world in all products. 10646 (implies also UNICODE) is the only candidate for this, with its various forms. Keld ========================================================================= Date: Sat, 10 Oct 1992 22:52:37 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: "Liam R. E. Quin"'s message of Fri, 9 Oct 1992 15:59:07 CDT <9210100712.AA06693@sapir.metis.com> ----------------------------Original message---------------------------- Date: Fri, 9 Oct 1992 15:59:07 CDT From: "Liam R. E.
Quin" I think this is a large part of the difference between Glenn, Erik and Keld: Glenn is already using Unicode (I understand); it's less clear to some of the other people on this list how they should go from where they are now to where they would like to be in three or five years' time, without spending more money than they have. The problem, as I see it, is that many users, including myself, need full 10646 (Unicode) systems yesterday. Many more also need it but don't yet know this. It's a little like networking. Before TCP/IP became a de facto standard after the introduction of BSD4.2 Unix in 1984, many different, incompatible systems were built, e.g., Houston SPOOLER, ARPA/NCP, Berknet, UUCP, BITNET, and so forth. Even after TCP/IP was published in '81, and DARPA contracted BBN and Berkeley to develop implementations, it wasn't until Sun Micro began mass producing their Berkeley-based systems that internetworking really became possible in a larger community. Even now we don't quite have a single internetworking standard, with battles still occurring between TCP/IP and OSI. We are even farther away from a standard email transport. I believe the same (or greater) effort will be required by system vendors before 10646 sees widespread usage. When a popular system uses 10646 as its native character set, when Emacs and other popular editors use it as their primary character set, when TeX and other commercial document production systems are converted to use it, when mail transports and user agents are capable of using 10646, when C compilers allow source representations in 10646, then, and only then, will users in the large be able to reap the benefits of 10646. But none of this will happen until there is a set of underlying tools provided by the system vendor to convert these software subsystems.
I completely agree with Peter and Liam about the need for interim solutions, and about the lack in current systems (though I wouldn't say hardware, but software - keyboard hardware is rather incidental to this whole issue). I have made a choice to devote my energy to creating 10646 systems rather than building interim solutions based on existing software. I believe that system vendors are very interested in doing the same, yet they can't abandon their existing customer base and existing software. So even they won't be quickly offering full 10646 systems. The only one I know of to have made this commitment is the Plan 9 OS of Bell Labs. But that isn't a commercial product yet as far as I know. It is also possible that GO's PenPoint will converge quickly to full 10646. Microsoft Windows NT supports both Unicode and the standard 8-bit Windows character sets at the same time; however, I haven't seen any commitment from them to provide 10646/Unicode support in DOS or Windows 3.X. Apple is supporting Unicode in QuickDraw GX; NeXT has announced intentions to do the same, though I suspect we must wait for NeXTstep 4.0 or greater before they do so. I know that other system vendors of the Unicode Consortium are also working on it, but have made no public announcements about its use in their system software. Of course Xerox already has been using Unicode for a while. Unfortunately, all of these vendors are going to produce different ways for programmers to interact with 10646/Unicode-encoded text. If you think the differences between Berkeley Sockets and AT&T Streams are bad, just wait until you see the different APIs that are going to be promoted for 10646/Unicode text processing. What we need is a DARPA or NSF project to promote a standard set of APIs for what is going to be an extremely important set of programmer interfaces.
But, given the current hands-off-the-market policy promoted by the present administration, I don't see any direction whatsoever occurring from that end. The real technical challenge that will face vendors, beyond the problems of getting a decent API in place, is backwards software compatibility and efficiency. Many vendors will try to figure out how to fit 10646 into the current world of strings.h (or some form of wide string) functions, e.g., strcpy(), strlen(), getc(), isalpha(), etc. This is going to be quite difficult given that these functions (particularly ones like getc()) think only in code elements, whereas the traditional semantics of these functions is that they are dealing with text elements (w.r.t. Unicode). What should the system software do when a program expects A WITH ACUTE to be one code element, and not two? Will uses of getc() have to be replaced by gets() in order to return more than one code element? Or will getc() have to compose sequences into precomposed code elements (but then getc() would have to look at all the character code elements, etc.)? The other problem will be efficiency - both space and time. In cases where a two-fold expansion of string or text file storage requirements is not acceptable, systems will have to perform automatic compression or compaction behind the programmer's or user's back. But this will affect time performance, as will the more general nature of text processing algorithms needed to perform normal tasks, such as display. For example, in the past, printf() could treat each character code as a glyph code to be drawn to a character (glyph) terminal; while DrawString [Apple Quickdraw], TextOut [Microsoft Windows], XDrawString [X Window System], and the like could do the same but increment the current point by a variable instead of a fixed amount. The process becomes much more complex in the general case when displaying a 10646-encoded string. 
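[Editor's note: the code-element/text-element mismatch behind the getc() questions above can be sketched in a few lines of C. This is purely illustrative, not any vendor's API: the function names are invented, and only the U+0300-U+036F combining-diacritics range is recognized, where a real implementation would need the full 10646 combining classes. It shows why a strlen()-style count of code elements and the count of text elements a user perceives part ways once combining marks appear.]

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: treat just U+0300..U+036F as combining marks.
   Real 10646 text has many more combining code elements. */
static int is_combining(uint32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Count text elements in a UCS-4 buffer: a base code element plus
   any trailing combining marks counts as ONE text element, whereas
   a strlen()-style count reports every code element separately. */
size_t count_text_elements(const uint32_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || !is_combining(s[i]))
            count++;
    return count;
}
```

Both the precomposed spelling {0x00C1} and the decomposed spelling {0x0041, 0x0301} of A WITH ACUTE count as one text element here, even though their code-element lengths differ - exactly the equivalence that byte-at-a-time interfaces cannot see.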
A general-purpose 10646 display algorithm must be able to perform one-to-many and many-to-one mappings between character code elements and glyphs, it must handle bidirectional and horizontal/vertical layout requirements, it will probably need to separate a text into segments according to the primary script of a segment [so as to invoke script-sensitive or writing-system-sensitive display behavior], it will have to accommodate the different notions of a baseline that hold among different scripts, it will have to deal with diacritic placement issues, and so on. It will take a lot of effort to create a general-purpose display algorithm and have it perform as well as current display functions in the special case of ASCII-only characters. Given these problems, I do expect that interim solutions will continue to be created. I even expect that many 10646 APIs will be intermediate steps as more experience is gained in the more general problems of 10646-based text processing. At the same time, I hope that enough folks will set their sights on the whole 10646 picture that we can come up with some workable solutions. Personally, I would like to see this all done and finished, so we (I) can get to the real business of using full multilingual universal text systems to do the actual text processing business at hand. My apologies for the length of this message... Glenn Adams ========================================================================= Date: Wed, 14 Oct 1992 09:39:08 CDT Reply-To: "Masataka Ohta" Sender: "TEI-L: Text Encoding Initiative public discussion list" From: "Masataka Ohta" Subject: Re: Character Names [Encoding IPA] In-Reply-To: <9210121347.AA27688@kuis.kuis.kyoto-u.ac.jp>; from "Glenn Adams" at Oct 10, 92 10:52 pm ----------------------------Original message---------------------------- > The problem, as I see it, is that many users, including myself, need full > 10646 (Unicode) systems yesterday. Many more also need it but don't yet > know this.
> It's a little like networking. Before TCP/IP became a de facto > standard after the introduction of BSD4.2 Unix in 1984, many different, > incompatible systems were built, e.g., Houston SPOOLER, ARPA/NCP, Berknet, > UUCP, BITNET, and so forth. While your suggestive analogy between networking and character encoding is quite interesting, wouldn't it be more accurate to regard 10646 (Unicode) as OSI? > What should the system software do when a program expects A WITH ACUTE to > be one code element, and not two? Will uses of getc() have to be replaced > by gets() in order to return more than one code element? Or will getc() > have to compose sequences into precomposed code elements (but then > getc() would have to look at all the character code elements, etc.)? I'm afraid it is too late. Those well-known questions should have been answered before 10646 was standardized, so that 10646 could have been modified to be a much more usable standard. > Given these problems, I do expect that interim solutions will continue to > be created. What we need now is, according to your analogy, TCP/IP. Masataka Ohta ========================================================================= Date: Wed, 14 Oct 1992 09:39:48 CDT Reply-To: NEUMAN@GUVAX.BITNET Sender: "TEI-L: Text Encoding Initiative public discussion list" From: NEUMAN@GUVAX.BITNET Subject: Please post 2nd call. ----------------------------Original message---------------------------- Please consider posting this reminder of the upcoming deadline for submissions to ACH-ALLC93. Thank you. M.N. ...................................................................... Dear Colleagues, November 1st, the deadline for submitting proposals for ACH-ALLC93, is fast approaching. We welcome your inquiries and your submissions. For more details, see the call for papers below. Regards, Michael Neuman Georgetown University for ACH-ALLC93 .........................................................................
ASSOCIATION FOR COMPUTERS AND THE HUMANITIES ASSOCIATION FOR LITERARY AND LINGUISTIC COMPUTING 1993 JOINT INTERNATIONAL CONFERENCE ACH-ALLC93 JUNE 16-19, 1993 GEORGETOWN UNIVERSITY, WASHINGTON, D.C. CALL FOR PAPERS This conference is the major forum for literary, linguistic and humanities computing. It is concerned with the development of new computing methodologies for research and teaching in the humanities, the development of significant new network-based and computer-based resources for humanities research, and the application and evaluation of computing techniques in humanities subjects. TOPICS: We welcome submissions on topics such as text encoding; statistical methods for text analysis; hypertext; text corpora; computational lexicography; morphological, syntactic, semantic and other forms of text analysis; also, computer applications in history, philosophy, music and other humanities disciplines. For the 1993 conference, ACH and ALLC extend a special invitation to members of the library community to contribute to the conference on the topics of creating and cataloguing network-based resources in the humanities, developing and integrating databases of texts and images of works central to the humanities, and refining retrieval techniques for humanities databases. LOCATION: Georgetown, an historic residential district along the Potomac River, is a six-mile ride by taxi from Washington National Airport. International flights arrive at Dulles Airport, which offers regular bus service to the Nation's Capital. REQUIREMENTS: Proposals should describe substantial and original work. Proposals describing the development of new computing methodologies should make clear how these methodologies are applied to research and/or teaching in the humanities. Those concerned with a particular application (e.g., a study of the style of an author) should cite previous approaches to the problem and should include some critical assessment of the computing methodologies used.
All proposals should include references to important sources. ABSTRACT LENGTH: Abstracts of 1500-2000 words in length should be submitted for presentations of thirty minutes including questions. SESSION PROPOSALS: Proposals for sessions (90 minutes) are also invited. These should take the form of either: (a) Three papers. The session organizer should submit a 500-word statement describing the session topic, include abstracts of 1000-1500 words for each paper, and indicate that each author is willing to participate in the session. (b) A panel of up to 6 speakers. The panel organizer should submit an abstract of 1500-2000 words describing the panel topic, how it will be organized, the names of all the speakers, and an indication that each speaker is willing to participate in the session. DEADLINE FOR SUBMISSIONS: November 1, 1992 NOTIFICATION OF ACCEPTANCE: February 1, 1993 FORMAT FOR SUBMISSIONS: Electronic submissions are strongly encouraged, and should follow strictly the format given below. Submissions that do not conform to this format will be returned to the authors for reformatting, or may not be considered if they arrive near the deadline. All submissions should include a header in the following format: TITLE: title of paper AUTHOR(S): names of authors AFFILIATION: affiliations of author(s) CONTACT ADDRESS: full postal address of main author (for contact) E-MAIL: electronic mail address of main author followed by other authors (if any) FAX NUMBER: fax for main author PHONE NUMBER: phone for main author ELECTRONIC SUBMISSIONS: Please submit plain ASCII text files. Files that include formatting by a wordprocessor, TAB characters, and soft hyphens are not acceptable. Paragraphs should be separated by blank lines. Headings and subheadings should be on separate lines and be numbered. References (up to six) and notes should appear at the end of the abstract. 
Where necessary, a simple markup scheme for accents and other characters that cannot be transmitted by electronic mail should be used; provide an explanation of the markup scheme after the title information. If diagrams are necessary for the evaluation of an electronic submission, they should be faxed to 1-202-687-6003 (after dialing one's international access code) or 202-687-6003 (from within the US), and a note to indicate the presence of diagrams should be inserted at the beginning of the abstract. Address for electronic submissions: Neuman@GUVAX.Georgetown.edu (include a subject line " Submission for ACH-ALLC93"). PAPER SUBMISSIONS: Submissions should be typed or printed on one side of the paper only, with ample margins. Six copies should be sent to ACH-ALLC93 (Paper submission) Dr. Michael Neuman Academic Computer Center 238 Reiss Science Building Georgetown University Washington, D.C. 20057 PUBLICATION: A selection of papers presented at the conference will be published in the series Research in Humanities Computing edited by Susan Hockey and Nancy Ide, published by Oxford University Press. INTERNATIONAL PROGRAM COMMITTEE Chair: Marianne Gaunt, Rutgers University (ACH) Thomas Corns, University of Wales, Bangor (ALLC) Paul Fortier, University of Manitoba (ACH) Jacqueline Hamesse, Universite Catholique Louvain-la-Neuve (ALLC) Susan Hockey, Rutgers and Princeton Universities (ALLC) Nancy Ide, Vassar College (ACH) Randall Jones, Brigham Young University (ACH) Michael Neuman, Georgetown University (ACH) (Local organizer) Antonio Zampolli, University of Pisa (ALLC) INQUIRIES Please address all inquiries to: ACH-ALLC93 Dr. Michael Neuman, Local Organizer Academic Computer Center 238 Reiss Science Building Georgetown University Washington, D.C. 20057 Phone: 202-687-6096 FAX: 202-687-6003 Bitnet: Neuman@Guvax Internet: Neuman@Guvax.Georgetown.edu Please include your name, full mailing address, telephone and fax numbers, and e-mail address with any inquiry. 
========================================================================= Date: Thu, 15 Oct 1992 14:24:07 CDT Reply-To: CSC301AMD@HOFSTRA.BITNET Sender: "TEI-L: Text Encoding Initiative public discussion list" From: CSC301AMD@HOFSTRA.BITNET Subject: TITLES vs. HEADERS vs. FOOTERS ----------------------------Original message---------------------------- I am doing some work with SGML document markup, and have become interested in the question of "titles" vs. "headers" vs. "footers". For example, is a title just a special case of a header? How about if the format of a document is such that each chapter's title is repeated on the top of each page? Would that be considered a title, a header or both? Could you have a "header" which is not the title of anything? And are footers (not footnotes) simply headers in a different physical location? Can anyone point me to a formal definition or archives of previous discussion on the subject? I am also interested in anyone's personal thoughts and opinions. Thanks in advance - Anne Daro Mgr, Tech. Services Hofstra University Hempstead, NY CSC301AMD@VAXC.HOFSTRA.EDU ========================================================================= Date: Thu, 15 Oct 1992 17:47:39 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: "Masataka Ohta"'s message of Wed, 14 Oct 1992 09:39:08 CDT <9210141534.AA07875@sapir.metis.com> ----------------------------Original message---------------------------- Date: Wed, 14 Oct 1992 09:39:08 CDT From: "Masataka Ohta" > The problem, as I see it, is that many users, including myself, need full > 10646 (Unicode) systems yesterday. Many more also need it but don't yet > know this. It's a little like networking. 
Before TCP/IP became a de facto > standard after the introduction of BSD4.2 Unix in 1984, many different, > incompatible systems were built, e.g., Houston SPOOLER, ARPA/NCP, Berknet, > UUCP, BITNET, and so forth. While your suggestive analogy between networking and character encoding is quite interesting, wouldn't it be more accurate to regard 10646 (Unicode) as OSI? No. I wasn't comparing 10646 to any one of these networking standards; rather, I was comparing the rise of networking support in general to what I believe will be the case in the rise of 10646 support. This is particularly true since 10646 has no competing character set; this clearly is not the case with TCP/IP vs. OSI, where there is strong competition. If the merger hadn't occurred between Unicode and 10646, we probably would have ended up in a similar situation; but since they were merged, we can thankfully avoid this situation. > What should the system software do when a program expects A WITH ACUTE to > be one code element, and not two? Will uses of getc() have to be replaced > by gets() in order to return more than one code element? Or will getc() > have to compose sequences into precomposed code elements (but then > getc() would have to look at all the character code elements, etc.)? Those well-known questions should have been answered before 10646 was standardized, so that 10646 could have been modified to be a much more usable standard. These questions *were* addressed prior to 10646 standardization. The resolution was that TEXT ELEMENTS != CODE ELEMENTS. The problem now is to evolve text processing software to recognize this new, more general situation. The correct decision was made with 10646; it will now require implementors to find the correct implementation solutions. > Given these problems, I do expect that interim solutions will continue to > be created. What we need now is, according to your analogy, TCP/IP. No.
What we need now is two or three implementations from which we can learn, rather than arguing ex nihilo. Glenn Adams ========================================================================= Date: Thu, 15 Oct 1992 17:50:13 CDT Reply-To: "Peter Graham, Rutgers U., (908) 932-2741" Sender: "TEI-L: Text Encoding Initiative public discussion list" From: "Peter Graham, Rutgers U., (908) 932-2741" Subject: Titles etc. vs running heads--Victorian ----------------------------Original message---------------------------- Anne Daro asked about titles vs. headers vs. footers. If for some reason one wanted to publish a document with each page having its own running title, a la Victorian novels, is there provision for that as well? --Peter Graham Rutgers University ========================================================================= Date: Fri, 16 Oct 1992 17:43:46 CDT Reply-To: U59467@UICVM.BITNET Sender: "TEI-L: Text Encoding Initiative public discussion list" Comments: Resent-From: U59467@UICVM From: U59467@UICVM.BITNET ----------------------------Original message----------------------------
Date: Sat, 10 Oct 1992 00:09:38 -0700 To: TEI-L@UICVM.BITNET From: randy%halcyon.halcyon.com@harvunxw.BITNET (C. Brandon Gresham, Jr.) Subject: SGML on the Macintosh I have just returned to civilization after twelve years in Saudi Arabia. Since I am an infomaniac and a book lover, my discovery of the Internet and particularly Project Gutenberg has been a wonder. Michael S. Hart (HART@vmd.cso.uiuc.edu) has indicated that a project is under way to distribute the Oxford Text Archives, BUT by distributing materials in TEI SGML only. He indicated this may be a source of an answer to my question. My question: do you know of any sources for programs that handle SGML on the Macintosh? randy@halcyon C. Brandon Gresham, Jr. Ad Hoc Enterprises Issaquah WA U.S.A. ========================================================================= Date: Fri, 16 Oct 1992 17:44:32 CDT Reply-To: tucusito@kokuki.kuaero.kyoto-u.ac.jp Sender: "TEI-L: Text Encoding Initiative public discussion list" From: tucusito@kokuki.kuaero.kyoto-u.ac.jp Subject: Information request. ----------------------------Original message---------------------------- I am doing research on text encoding and text analysis. I would welcome bibliographic information or other help related to these subjects. Thanks, F Zeidan. e-mail: tucusito@kokuki.kuaero.kyoto-u.ac.jp Kyoto University, Japan. ========================================================================= Date: Fri, 16 Oct 1992 17:44:59 CDT Reply-To: Pius ten Hacken Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Pius ten Hacken Subject: Fonts for Macintosh ----------------------------Original message---------------------------- I hope I do not abuse this list if I use it to ask the following question. I am working on an Apple Macintosh, and I sometimes want to use characters which are not available to me at present. They are of three kinds: 1.
Latin characters with diacritics, such as used in Slavic languages (e.g. s + acute), Romanian (e.g. t + cedilla), Turkish (e.g. i without dot), the transcription of Sanskrit (e.g. m with a dot under it), etc. 2. Greek characters with accent, spiritus, or small iota below it. 3. Special phonetic characters from the IPA set, e.g. schwa, n with one of the legs prolonged, etc. Most fonts I know are OK for Western-European languages only. The only exception is Symbol, which binds Greek characters to the keys, but in a way which is only good for mathematical formulae with an occasional Greek character. I tried a font editor, Fontographer 3.5, which gives reasonable (= hardly satisfactory) results for group 1 and for schwa, but is a disaster in all other cases. Does anyone know how I can obtain a font, or fonts, containing the characters of groups 1, 2, and 3, that can be used on a Macintosh and printed on a PostScript printer? Thanks in advance, Pius ten Hacken ========================================================================= Date: Fri, 16 Oct 1992 17:45:29 CDT Reply-To: Keld J|rn Simonsen Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Keld J|rn Simonsen Subject: Re: Character Names [Encoding IPA] ----------------------------Original message---------------------------- > > What should the system software do when a program expects A WITH ACUTE to > > be one code element, and not two? Will uses of getc() have to be replaced > > by gets() in order to return more than one code element? Or will getc() > > have to compose sequences into precomposed code elements (but then > > getc() would have to look at all the character code elements, etc.)? > > Those well-known questions should have been answered before 10646 was > standardized, so that 10646 could have been modified to be a much > more usable standard.
The question is answered in the forthcoming 10646 standard: if you want the character LATIN CAPITAL LETTER A WITH ACUTE you can only code it as one code element. The two code elements LATIN CAPITAL LETTER A and COMBINING ACUTE ACCENT do not together constitute a character LATIN CAPITAL LETTER A WITH ACUTE. So you do not have to have getc() look at more than one code element, and you only have to test for one value when you look for this character, namely the accented character coded as one code element, and not for the combined two-code entity. Keld Simonsen ========================================================================= Date: Fri, 16 Oct 1992 17:46:03 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: Keld Jørn Simonsen's message of Fri, 16 Oct 1992 01:27:35 +0100 <199210160027.AA06496@dkuug.dk> ----------------------------Original message---------------------------- Date: Fri, 16 Oct 1992 01:27:35 +0100 From: Keld Jørn Simonsen X-Charset: ASCII X-Char-Esc: 38 > > What should the system software do when a program expects A WITH > > ACUTE to be one code element, and not two? Will uses of getc() > > have to be replaced by gets() in order to return more than one > > code element? Or will getc() have to compose sequences into > > precomposed code elements (but then getc() would have to look > > at all the character code elements, etc.)? > > Those well known questions should have been answered before 10646 was > standardized, so that 10646 could have been modified to be a much > more usable standard. The question is answered in the forthcoming 10646 standard: if you want the character LATIN CAPITAL LETTER A WITH ACUTE you can only code it as one code element. The two code elements LATIN CAPITAL LETTER A and COMBINING ACUTE ACCENT do not together constitute a character LATIN CAPITAL LETTER A WITH ACUTE.
I can't figure out from this whether Keld is being intentionally misleading or whether he isn't aware of the details of 10646. If by "character," Keld means an element of 10646, then he is correct: a combination of characters in 10646 does not create another "character," that is, if by "character" one means an element of 10646. However, if by "character" one means an element of a writing system (or of an alphabet), then Keld is quite wrong, since, indeed, one can arbitrarily form a "character" in the sense of an element of a writing system by combining code elements in 10646. So, if I have an alphabet which has the element LATIN CAPITAL LETTER A WITH ACUTE, I am completely free to encode this as either one or two code elements. In this sense, LATIN CAPITAL LETTER A WITH ACUTE constitutes a text element in the context of some text process and writing system. A user of 10646 is quite free to encode such a text element with more than one code element or with alternative code element spellings. So you do not have to have getc() look at more than one code element, and you only have to test for one value when you look for this character, namely the accented character coded as one code element, and not for the combined two-code entity. Most programs operate on text elements, which, in pre-10646 days, corresponded to code elements. getc() was designed in a context where a code element could be equated with a text element. With 10646 this situation has changed. If an implementation desires to impose text element/code element equivalence, then it must be prepared to translate text elements which are spelled out by multiple code elements into single code elements to be returned by getc(). Since 10646 doesn't encode all possible combinations of code elements to be treated thus as single (precomposed) code elements, such an implementation must be prepared to dynamically assign elements from the Private Use Zone in order to represent unencoded composite text elements.
Such a system must also be able to convert such private encodings into public encodings when interchanging text. Glenn Adams ========================================================================= Date: Fri, 16 Oct 1992 17:46:32 CDT Reply-To: fab@fungus.zso.dec.com Sender: "TEI-L: Text Encoding Initiative public discussion list" From: fab@fungus.zso.dec.com Subject: Unicode Implementors Workshop, Sulzbach (Taunus) Germany ----------------------------Original message---------------------------- UNICODE / ISO 10646 IMPLEMENTERS WORKSHOP #4 December 3 & 4, 1992 Sulzbach (Taunus) Germany This is an announcement for the upcoming fourth workshop in a series of workshops, sponsored by the Unicode Consortium, on implementing the Unicode Standard. This is the first time the workshop is being offered in Europe. Should you not be able to attend yourself, please pass this notice on to somebody else who might be interested in attending. Unicode: Unicode is an international character encoding standard that encompasses all the world's national scripts in a 16-bit code space. Unicode is a profile of the International Standard ISO 10646. Supported by most major computer and software vendors, Unicode greatly facilitates the development of internationally accepted software. All currently used character sets can be transferred to Unicode. Target Audience: Software developers, technical writers, managers or engineers developing or considering development of software for the international market. Date and Time: December 3 and 4, 1992, 9am - 5pm Location: The workshop will be held at the Holiday Inn in the German town of Sulzbach (Taunus). Agenda: December 3 will feature a full-day, professionally developed lecture course covering the Goals and Architecture of the Unicode Standard. It will address the problems posed by support for writing systems worldwide and how Unicode's design enables their solution.
In addition, the course will explore several implementation strategies for Unicode and illustrate them with specific examples. December 4 will feature invited papers on implementation aspects of Unicode and its relation to other standards. These seminar-style talks will cover specific, practical problems and point out solutions, as well as provide case studies and demonstrations of Unicode implementations underway.

Speakers:
  Introductory Course -- Glenn Adams, Metis Technology, Inc. & The Institute for Advanced Professional Studies
  Keynote Address -- Götz H. Siebrecht, General Manager, Unisys Deutschland GmbH
  Unicode and 10646 -- Isai Scheinberg, IBM Corporation-Canada
  Program Migration to Unicode -- Alan Barrett, Lotus Development Ireland
  Non-Spacing Marks -- Mark Davis, Taligent, Inc.
  Operating System Support (Windows N/T) -- Michel Suignard, Microsoft Europe
  Codeset Conversions -- Lloyd Honomichl, Novell, Inc.
  Unicode in the XPG/Posix Model -- Gary Miller, IBM Corporation
  Collating Unicode Data -- Alain LaBonté, Ministry of Communications, Quebec
  Unicode Support in the Application Development Tool Kit -- Tuoc Vinh Luong, Borland International
  Internationalization in Windows Past and Future -- Bill Hall, Novell, Inc.
  Unicode and Print Servers -- Tadao Yamasaki, IBM Corporation/Pennant
  Unicode/UCS: What does it mean for the European environment? -- Jürgen Bettels, Digital Switzerland
  Practical Experience with Unicode Bidi Algorithms -- Alex Morcos, Microsoft Corporation
  Unicode and Network Internationalization -- Wayne Taylor, Novell, Inc.

Proceedings: A complete set of notes will be provided for the course at no extra charge. Hotel: The workshop will be held at the Holiday Inn, Sulzbach (Taunus), Germany. The conference rate for a single room is DM 165 per night, which includes breakfast. Lunch and dinner on December 3, and lunch on December 4, are included in the conference registration fee. Room arrangements should be made directly with the hotel, requesting the Unicode Workshop rate.
Reservations +49-6196-763810 Fax +49-6196-72996 Registration: To register, please complete the registration form below and send it, together with a check or credit card information, to either of the locations below. Cancellations must be received by November 27, 1992 and will carry a DM 50 (US$37) cancellation fee. Contacts: European Contact: Unicode Implementer's Workshop c/o Unisys Deutschland GmbH Frau Helga Mifka/ML Postfach 1110 D-6231 Sulzbach (Taunus) Germany Phone: +49-6196-991259 Fax: +49-6196-991860 U.S./Canada Contact: Unicode Implementer's Workshop Classic Consulting, International 2249 LeClair Drive Coquitlam, British Columbia V3K 6P6 Canada Phone: 604-931-7600 Fax: 604-937-5898 E-mail: 72630.107@compuserve.com

Registration form:
  Name: ________________________________________
  Company: _____________________________________
  Address: _____________________________________
  City: ________________________________________
  Country: _____________________________________
  Postal Code: _________________________________
  Phone: _______________________________________
  Fax: _________________________________________
  Non Member:     DM 550______  US$370______
  Unicode Member: DM 450______  US$300______
  __ Check Enclosed  __ AMEX  __ Visa  __ MC
  Card#: ________________________ Exp Date:_____
  Signature: ___________________________________

Make checks payable to Unicode, Inc. Employees of Unicode Full and Associate Members are eligible for the member discount. Cancellations must be received by November 27, 1992 and will carry a cancellation charge of DM 50 (US$37).
========================================================================= Date: Fri, 16 Oct 1992 17:57:11 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: Character Names [Encoding IPA] In-Reply-To: Keld Jørn Simonsen's message of Fri, 2 Oct 92 18:40:39 +0100 <9210021740.AA06062@dkuug.dk> [Note: This message was accidentally excluded from the recent discussion on Character Names. We apologize for any confusion this may have caused. -Ed.] Date: Fri, 2 Oct 92 18:40:39 +0100 From: Keld Jørn Simonsen 1. For readability. 2. For writability: 3. For presentation: Sure, shorter character names will be more efficient IF ONE HAS TO TYPE THEM IN AS NAMES. While this may be necessary for a transition period until 10646-based editors are available, it definitely should not be seen as a long-term solution. It is completely absurd to require human users to enter characters as names of characters. Nobody in their right mind is going to do this for a Japanese or Chinese text, or a Thai or Arabic text for that matter. Instead of coming up with ineffective interim solutions that will never be used, we should be building fully-enabled 10646 systems, a task I have been working at now for the last three years. For that matter, I would also argue that no human user should be forced to learn anything about SGML to use it. It should simply be an underlying serial representation like RTF. Applications should provide user interfaces that present the functional abstraction of SGML without requiring any specific knowledge of SGML syntax or representations. Glenn Adams Cambridge, Massachusetts ========================================================================= Date: Tue, 20 Oct 1992 11:12:07 CDT Reply-To: "Liam R. E. Quin" Sender: "TEI-L: Text Encoding Initiative public discussion list" From: "Liam R. E.
Quin" Subject: SGML on Mac ----------------------------Original message---------------------------- Welcome back to civilisation :-) Our Author/Editor has been available on the Mac for several years -- a new version should be announced later this year or early next year. There is a discount for TEI-L members, I understand. Author/Editor lets you view and edit SGML documents. You will also need either Rules/Builder (our SGML DTD compiler) or a copy of a rules file from someone else. Contact sales@sq.com for the details, and they can send you brochures and demo versions and so forth. It's under $500; I'm not sure by how much. Lee -- Liam Quin, lee@sq.com, SoftQuad, Toronto, 416 239-4801; the barefoot programmer lq-text (Unix text retrieval package) mailing list: lq-text-request@sq.com HexSweeper NeWS game/OPEN LOOK UI FAQ & program list/Metafont list/XView ports ========================================================================= Date: Tue, 20 Oct 1992 11:12:34 CDT Reply-To: Keld Jørn Simonsen Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Keld Jørn Simonsen Subject: Re: Character Names [Encoding IPA] ----------------------------Original message---------------------------- > The question is answered in the forthcoming 10646 standard: if you want > the character LATIN CAPITAL LETTER A WITH ACUTE you can only code it as > one code element. The two code elements LATIN CAPITAL LETTER A and > COMBINING ACUTE ACCENT do not together constitute a character > LATIN CAPITAL LETTER A WITH ACUTE. > > I can't figure from this whether Keld is being intentionally misleading > or whether he isn't aware of the details of 10646. If by "character," > Keld means an element of 10646, then he is correct: a combination of > characters in 10646 does not create another "character," that is, > if by "character" one means an element of 10646.
However, if by > "character" one means an element of a writing system (or of an alphabet), > then Keld is quite wrong, since, indeed, one can arbitrarily form a > "character" in the sense of an element of a writing system by combining > code elements in 10646. Well, I cannot figure out if Glenn is being intentionally misleading about the facts of the new 10646. ISO 10646 is a character set standard; it defines characters, and in this respect 'characters' are a well-defined concept. It does not mean 'an element of a writing system', and 10646 does not define elements of a writing system. 'Element of a writing system' is not well defined in ISO standards, and should thus be used with care as a concept. > So, if I have an alphabet which has the element > LATIN CAPITAL LETTER A WITH ACUTE, I am completely free to encode this > as either one or two code elements. In this sense, LATIN CAPITAL LETTER > A WITH ACUTE constitutes a text element in the context of some text > process and writing system. A user of 10646 is quite free to encode > such a text element with more than one code element or with alternative > code element spellings. Well, this is, as far as I know, a misrepresentation of the forthcoming 10646 standard. You are not 'completely free' to encode LATIN CAPITAL LETTER A WITH ACUTE in the ISO 10646 standard, and you cannot code this letter as two characters according to the standard. At levels 1 and 2 of the standard it is explicitly forbidden to use combining characters for this purpose, but in level 3 you are allowed to code the LATIN CAPITAL LETTER A and a COMBINING ACUTE; however, they do not constitute the letter LATIN CAPITAL LETTER A WITH ACUTE, and this is also explicitly stated in the standard. Thus there is no way to encode the character LATIN CAPITAL LETTER A WITH ACUTE as two characters according to the approved 10646 standard. And there are good reasons why ISO chose to specify it in this way.
Allowing more encodings for the same character would have introduced a very complex and costly need for programming; for example, when testing two strings for equality, a big database specifying all the equivalences would have to be available, instead of just the byte-for-byte equality needed with the present standard. And this big specification of equality has not been specified precisely anywhere, not even in previous UNICODE standards. > So you do not have to have getc() look at more than one code element, > and you only have to test for one value when you look for this > character, namely the accented character coded as one code element, > and not for the combined two-code entity. > > Most programs operate on text elements, which, in pre-10646 days > corresponded to code elements. I find this statement a bit out of touch with reality. Most programs today operate on characters; I would expect only a few UNICODE programs to work on text elements. > getc() was designed in a context where > a code element could be equated with a text element. With 10646 > this situation has changed. If an implementation desires to impose > text element/code element equivalence, then it must be prepared to > translate text elements which are spelled out by multiple code elements > into single code elements to be returned by getc(). I believe this is a specification coming from the previous version of UNICODE and not in line with the 10646 standard. The 10646 standard has removed the very messy definitions of equivalences of combining characters with precomposed characters, which for one purpose could be equivalent and for another not equivalent in the previous UNICODE standard. The definition there did not live up to normal requirements of unique assignments of codes to characters, which is normally the case for character sets. I expect this to be changed in the new UNICODE standard, which is to be aligned with the approved 10646 standard.
Keld Simonsen ========================================================================= Date: Tue, 20 Oct 1992 11:13:00 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: ISO10646: Code Elements vs Text Elements (Orthographic Units) In-Reply-To: Keld Jørn Simonsen's message of Sat, 17 Oct 1992 04:08:33 +0100 <199210170308.AA25658@dkuug.dk> ----------------------------Original message---------------------------- Date: Sat, 17 Oct 1992 04:08:33 +0100 From: Keld Jørn Simonsen in level 3 you are allowed to code the LATIN CAPITAL LETTER A and a COMBINING ACUTE, but they do not constitute the letter ^^^^^^^^^^ LATIN CAPITAL LETTER A WITH ACUTE, this is also explicitly stated ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ in the standard. And this is where you are incorrect. The standard *absolutely does not* state this. This was never agreed to in Seoul. I ask you to show me where what you claim was agreed to. Perhaps you are operating under the mistaken assumption that the Danish comments asking for such a restriction were adopted. They were not. Thus there is no way to encode the character LATIN CAPITAL LETTER A WITH ACUTE as two characters according to the approved 10646 standard. You are correct: two characters (code elements) of 10646 do not form a character (code element) of 10646; however, two characters (code elements) *may* encode a letter (text element) of any writing system that desires to encode it thus. You continue to conflate the term "character" with "letter." In 10646 terms, a character is *merely* an element of a character set, and has no necessary relation to letterhood. You are correct that Levels 1 & 2 do not allow composing "letters" or "natural characters" in this fashion; however, level 3 allows it at will. The latter is what Unicode sanctions. And there are good reasons why ISO chose to specify it in this way.
Allowing more encodings for the same character would have introduced a very complex and costly need for programming; for example, when testing two strings for equality, a big database specifying all the equivalences would have to be available, instead of just the byte-for-byte equality needed with the present standard. And this big specification of equality has not been specified precisely anywhere, not even in previous UNICODE standards. Full Unicode systems *will* have to address this issue, as will any level 3 implementation of 10646. Such databases cannot be avoided in level 3 (you imply they can be avoided). > So you do not have to have getc() look at more than one code element, > and you only have to test for one value when you look for this > character, namely the accented character coded as one code element, > and not for the combined two-code entity. > > Most programs operate on text elements, which, in pre-10646 days > corresponded to code elements. I find this statement a bit out of touch with reality. Most programs today operate on characters; I would expect only a few UNICODE programs to work on text elements. No, most programs today work on code elements that just happen to (mostly) correspond to text elements; full Unicode and level 3 10646 systems will have to deal with the full generality of the code element != text element equation. Given that Microsoft NT, Apple Quickdraw GX, and other systems are being built on Unicode (10646 level 3), I disagree that "few UNICODE programs will work on text elements"; rather, I expect that the most popular systems will over the course of the next few years attain a much higher state of sophistication regarding text, one in which the abstraction of text element over code element is as easy as current (limited) character and text abstractions. If older systems wish to limit themselves to level 1 or 2, then they can safely ignore this issue.
[Indeed, if you look closely at the next paragraph, which was in my previous message, you will see a technique that even allows such systems to operate with level 3 10646 data. Erik picked up on this quickly, I might add.] > getc() was designed in a context where > a code element could be equated with a text element. With 10646 > this situation has changed. If an implementation desires to impose > text element/code element equivalence, then it must be prepared to > translate text elements which are spelled out by multiple code elements > into single code elements to be returned by getc(). I believe this is a specification coming from the previous version of UNICODE and not in line with the 10646 standard. No, this is not coming from a previous version of Unicode. It is coming from the current version of Unicode and 10646. I would suggest that you talk with the 10646 editor (or WG2 convenor) if you aren't currently up on where things stand. The 10646 standard has removed the very messy definitions of equivalences of combining characters with precomposed characters, which for one purpose could be equivalent and for another not equivalent in the previous UNICODE standard. I too argued for rather serious revision of the language used in 10646 to describe the notion of code element combination. Thankfully, it has been cleaned up. However, it does not remove the problem as you seem to think (i.e., that the issue of equivalence of different spellings of a letter, text element, or "character" in the naive sense has disappeared); the same issue is still present in Level 3 and in Unicode - that's what makes it level 3. The definition there did not live up to normal requirements of unique assignments of codes to characters, which is normally the case for character sets. Yes, and that is why I argued against it.
It gave the incorrect impression that a combination of code elements formed a "character" in the sense of an element of a character set (i.e., a code element), which was patently not true. However, and this seems to be a problem for you to understand, the principle of combining code elements to form "letters," "orthographic units," "text elements," or "characters" (in the naive sense) is still present in Level 3 and Unicode. These "natural characters," if I might call them that, won't necessarily satisfy the criteria of character-set-character-hood, since they won't have a unique codepoint, nor a name of their own. However, they will still be used by level 3 implementations in order to encode orthographic units - natural characters - of writing systems whose units aren't already present in 10646. And, I might add, level 3 explicitly permits any combination of combining marks to be used to create orthographic units - natural characters - in this fashion. The issue of "natural character equivalence" or "orthographic unit equivalence" will not be specified by 10646; this doesn't mean that other standards can't be created that do specify specific uses of 10646 under level 3 usage, and define equivalence in those terms. I expect this to be changed in the new UNICODE standard, which is to be aligned with the approved 10646 standard. No, it will not change, as the approved 10646 does not require such a change. If you would like to discuss all of this in more detail, I'd certainly be glad to see you at the upcoming Unicode/10646 Implementor's Workshop in Sulzbach, Germany. I will be talking about the distinction of code elements and text elements in quite a bit of detail in the introductory tutorial. Others, particularly Mark Davis, will be discussing in a fair amount of detail the issues surrounding full implementations of combining marks in level 3 systems.
Regards, Glenn Adams ========================================================================= Date: Wed, 21 Oct 1992 17:26:02 CDT Reply-To: P.A.Ellison@cen.ex.ac.uk Sender: "TEI-L: Text Encoding Initiative public discussion list" From: P.A.Ellison@cen.ex.ac.uk Subject: 10646 et al. Oh dear! I must be mad joining in this discussion, but here goes. I have been reading the comments, replies and counter-replies about IPA and 10646 for some weeks now, and believe that very many half-truths and probably some untruths are being propounded. I have not seen the latest draft of 10646 -- I understand that no-one will see it until it is published, so my comments relate to the previous version: 1. There is NO definition of a 'character' in 10646, and SC2/WG2 will admit that they do not know what it is! SC18/WG8 has been asking for a definition for some time to delineate between 'glyph' and 'character'. 2. 10646 is no more than yet another interim standard. It may contain the elements that allow for development of a standard with a long life, but 10646 certainly isn't that standard. 3. If it is a character standard, why are any of the ligatures 'fi' 'fl' etc. included -- the 'fi'/'fl' ligatures are purely a typographic convenience. 4. Agreed, the Arabic presentation forms have been moved to a separate section, but why are they there at all? They are only required when outputting to a scripting device, and have no other value. 5. Where are all the Hangul syllables? Great, I can write using the basic alphabet, but including only half the syllables is nothing short of a waste of time. 6. The Japanese have accepted 10646 as an international standard, but have forbidden its use within Japan! Great, where is the standard now? 7. I cannot use 10646 to produce a single technical article without using SGML entities for the Math characters that are missing! SC18/WG8 supplied a list of 300 math/chem characters that are missing (but are included in TR9573).
The first time the list was supplied, SC2/WG2 decided to ignore it! The second time it was sent, they decided to ignore it, but this time to tell WG8 that they would ignore it! They also decided not to give any reason, as that would generate discussion, and they have decided not to discuss anything any longer! Readers will no doubt appreciate that I feel strongly about the 'cock-up' that is 10646. I believe that it will not last for many years, so I would not recommend that people put too much effort into getting systems working. Clearly, there are going to be lots of problems as people try to use it, and ISO are going to want the job done properly, so it won't last for too many years. Paul Ellison (UK member SC18/WG8, & member WG8/SWG on Fonts) ========================================================================= Date: Sat, 24 Oct 1992 13:13:17 CDT Reply-To: Glenn Adams Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Glenn Adams Subject: 10646 et al. In-Reply-To: P.A.Ellison@cen.ex.ac.uk's message of Wed, 21 Oct 1992 17:26:02 CDT <9210220001.AA14172@sapir.metis.com> ----------------------------Original message---------------------------- Date: Wed, 21 Oct 1992 17:26:02 CDT From: P.A.Ellison@cen.ex.ac.uk I have been reading the comments, replies and counter-replies about IPA and 10646 for some weeks now, and believe that very many half-truths and probably some untruths are being propounded. And I see you are adding a few half-truths and untruths to the fray. 1. There is NO definition of a 'character' in 10646, and SC2/WG2 will admit that they do not know what it is! SC18/WG8 has been asking for a definition for some time to delineate between 'glyph' and 'character'. That isn't true. It is defined in section 4.6: "A member of a set of elements used for the organisation, control, or representation of data." I suppose you want more meat, eh? 2. 10646 is no more than yet another interim standard.
Maybe, but I bet someone said EBCDIC was only an interim standard too. Whose crystal ball are you using anyway? I'd like to know: maybe it can also tell us how many coal miners are going to be working in the UK next year. It may contain the elements that allow for development of a standard with a long life, but 10646 certainly isn't that standard. Let's put it this way. It won't do very well in the vacuum that currently surrounds it. More standards are needed to augment its usage. But that doesn't invalidate it. 3. If it is a character standard, why are any of the ligatures 'fi' 'fl' etc. included -- the 'fi'/'fl' ligatures are purely a typographic convenience. Because enough P-members with votes wanted them. [Sure, I agree it's stupid, but, then, who am I to say?] 4. Agreed, the Arabic presentation forms have been moved to a separate section, but why are they there at all? They are only required when outputting to a scripting device, and have no other value. Ditto above. [Also, backward compatibility with certain, shall we say, less than intelligent implementations of Arabic demanded them.] 5. Where are all the Hangul syllables? Great, I can write using the basic alphabet, but including only half the syllables is nothing short of a waste of time. [IMHO *no* - read that as 0, null, zippo, not any, nihil - Hangul syllables should be in the standard. But, if for no other reason, backward compatibility with KSC5601 demanded including at least 2350 of them. Hangul syllables are just ligatures.] 6. The Japanese have accepted 10646 as an international standard, but have forbidden its use within Japan! Great, where is the standard now? Oh? Why did 10 Japanese companies recently introduce Kanji GO Penpoint running 10646 UCS2? 7. I cannot use 10646 to produce a single technical article without using SGML entities for the Math characters that are missing! SC18/WG8 supplied a list of 300 math/chem characters that are missing (but are included in TR9573).
The first time the list was supplied, SC2/WG2 decided to ignore it! The second time it was sent, they decided to ignore it, but this time to tell WG8 that they would ignore it! They also decided not to give any reason, as that would generate discussion, and they have decided not to discuss anything any longer! OK, this is a good gripe. But a few million Burmese, Ethiopians, Tibetans, Sinhalese, Cambodians, and Mongolians probably also have gripes, because none of their scripts are present. They will be, eventually. Everything takes time. Readers will no doubt appreciate that I feel strongly about the 'cock-up' that is 10646. Do you really mean 'cock-up' or 'cook-up'? Yes, it is a venerable stew, which has simmered long enough. Let's chew on it a while before deciding whether to chuck it up. You apparently aren't ready for new tastes. Good Hunting, er, Cooking :-), Glenn Adams ========================================================================= Date: Sat, 24 Oct 1992 15:05:05 CDT Reply-To: "Wendy Plotkin (312) 413-0331" Sender: "TEI-L: Text Encoding Initiative public discussion list" From: "Wendy Plotkin (312) 413-0331" Subject: New TEI Documents Available from Manuscripts Work Group (TR9) ----------------------------Original message---------------------------- The following documents are now available on the TEI-L fileserver: * TEI TR9 M8: Minutes of the Bergen Meeting, 18-20 September 1992 * TEI TR9 M2 FR: Minutes of the Louvain-La-Neuve Meeting (in French) 26-27 October 1991 * TEI TR9 M2 EN: Minutes of the Louvain-La-Neuve Meeting (in English) 26-27 October 1991 * TEI TR9 W4: Diplomatic Transcription of Modern Manuscripts (by Claus Huitfeldt) 8 November 1991 This paper was prepared by Professor Huitfeldt for a meeting of TEI work group and working committee chairs called to integrate the material prepared for TEI P2, the second version of the TEI Guidelines, through November, 1991.
In this paper, Professor Huitfeldt draws upon his experience at the
Wittgenstein Archives to identify features of modern manuscripts an
encoding scheme must consider, including pages and pagination, spacing,
sections and sentences, readability, marginal marks and lines,
underlining, and other features.

Wendy Plotkin

***********************************************************************
These papers may be obtained from the TEI-L fileserver by sending a note
to Listserv@UICVM or Listserv@uicvm.uic.edu with no subject, and the
following message line(s):

   Get TR9M8 XXX
   Get TR9M2EN XXX
   Get TR9W4 YYYY

where "XXX" and "YYYY" are the filetypes available. The most recent
minutes (TR9M8) are available in PostScript (PS), ASCII (Doc), and an
intermediate TEI mark-up scheme (P2X). The other documents are available
only in ASCII format (Doc). Thus, a typical message to Listserv might
read as follows:

   Get TR9M8 PS
   Get TR9W4 Doc

OBTAINING TEI TR9M8 VIA FTP FROM ENGLAND
----------------------------------------
The most recent minutes -- TR9M8 -- may also be obtained by FTP from the
SGML Project at the University of Exeter in England. To obtain them via
anonymous FTP from Exeter, your computer system must be on the Internet.
If it is, you should be able to give the command

   FTP sgml1.ex.ac.uk
or
   FTP 144.173.6.61

This will connect you to the Exeter SGMLbox. In response to the User
prompt, enter "anonymous" (or "ftp"), as with other anonymous FTP
servers. You will be asked for a password: please supply your full
e-mail address. Once connected, type the following commands:

   cd /tei/tech
   get tr9m8.xxx

(where xxx is the filetype as mentioned above; note however that the
filename must be given in all lower-case letters -- p2x, doc, ps)

If you have any trouble receiving these documents, please contact me
directly at U49127@uicvm.uic.edu or U49127@UICVM.

=========================================================================
Date:         Sat, 24 Oct 1992 18:41:38 CDT
Reply-To:     "C. M. Sperberg-McQueen"
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         "C. M. Sperberg-McQueen"
Subject:      new fascicle available: base tag set for prose

        * * * * * * * * * * * * * * * * * *
        *             TEI P2              *
        *   new fascicle now available    *
        *          Chapter 7 (PR)         *
        *      Base Tag Set for Prose     *
        * * * * * * * * * * * * * * * * * *

We are happy to announce that a new fascicle of the second draft of the
TEI Guidelines for Electronic Text Encoding and Interchange is now
available for public comment.

As readers of this list will recall, TEI P2 is being distributed for
comment as a series of fascicles or part-issues, each containing a
complete chapter of P2, as and when the texts were available. (File TEI
ED J8, "Obtaining the Second Version of the TEI Guidelines," has the
details, if you have forgotten.)

The present fascicle contains chapter 7 of P2, which discusses the TEI
base tag set for prose. In the TEI DTDs, base tag sets define the
overall structural tags for a text, those elements which can occur
directly within the <text> element. The base tag set for prose also
defines a standard treatment for front and back matter which can be used
with any base tag set. It should be noted that many tags commonly
associated with prose (paragraphs, notes, quotations, emphatic words and
phrases, and the like) are treated by the TEI not as part of the prose
tag set but as core tags available with any base tag set; they are
treated not here but in chapter 6, which should become available soon.

Other fascicles will be announced as they become available. Thank you
for your patience!

Texts of P2 are being made available in a number of different electronic
formats. These include plain screen-readable text (filetype DOC), LaTeX
(filetype TEX), PostScript (filetype PS) and of course SGML (filetypes
P2X and REF). (Chapter PR is being made available initially only in DOC,
PS, and P2X forms; a LaTeX form will follow in a few days.)
To get electronic copies of this fascicle from the TEI-L fileserver, all
you need do is send an ordinary email note to the address LISTSERV@UICVM
(or listserv@uicvm.uic.edu) containing the line

   GET P2pr xxx

(where xxx is one of the filetypes mentioned above)

The documents you request will be returned to you automatically as
e-mail messages. Beware! Some of the files are quite large, and so may
be delayed. You will also receive an automatic notification that the
file is on its way to you. (If you receive something illegible in a
'Listserv packed format', please contact one of the editors directly to
see about getting you the file in a more useful form.)

The same files are available via anonymous FTP from the SGML Project at
the University of Exeter. (Files may not be available on the Exeter
server until a day or two after this announcement is made; please be
patient.) To access these files, your computer system must be on the
Internet. If it is, you should be able to give the command

   FTP sgml1.ex.ac.uk   [or FTP 144.173.6.61]

When you are connected to the Exeter SGMLbox, type the following
commands:

   cd tei/p2/drafts
   get p2pr.xxx

(where xxx is the filetype as mentioned above; note however that the
filename *must* be given in all lower-case letters)

The files may also be obtained from the Markup-L Listserv fileserver in
Germany, and from Professor Syun Tutiya in Japan. For more details on
these and other sources of TEI information, please order copies of files

   EDJ8 MEMO  (describes how to retrieve electronic copies of TEI P2
               and the various formats in which they are available)
   EDJ9 MEMO  (describes how to request paper copies of TEI P2, for
               those without electronic mail access)

(on the Exeter file server, these may be edj8.doc and edj9.doc)

-C. M. Sperberg-McQueen
 Lou Burnard
 24 October 1992

=========================================================================
Date:         Fri, 30 Oct 1992 13:20:07 CST
Reply-To:     "Wendy Plotkin (312) 413-0331"
Sender:       "TEI-L: Text Encoding Initiative public discussion list"
From:         "Wendy Plotkin (312) 413-0331"
Subject:      Susanne Corpus

[Geoff Sampson of the TEI Linguistic Description work group forwarded
the following information to TEI-L for further distribution.]

                         THE SUSANNE CORPUS
      [Revised announcement including modified access instructions]
                          26 October 1992

Geoffrey Sampson
School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, England
geoffs@uk.ac.susx.cogs

Colleagues needing the use of a grammatically-analysed corpus of English
may like to know that Release 1 of the SUSANNE Corpus is now complete,
and is freely available from the Oxford Text Archive via anonymous ftp
to any machine connected to the Internet. Instructions for retrieving a
copy of the Corpus are given at the end of this announcement.

The SUSANNE Corpus has been created, with the sponsorship of the
Economic and Social Research Council (UK), as part of the process of
developing a comprehensive NLP-oriented taxonomy and annotation scheme
for the (logical and surface) grammar of English. The SUSANNE scheme
attempts to provide a method of representing all aspects of English
grammar which are sufficiently definite to be susceptible of formal
annotation, with the categories and boundaries between categories
specified in sufficient detail that, ideally, two analysts independently
annotating the same text and referring to the same scheme must produce
the same structural analysis.
The SUSANNE scheme may be likened to a "Linnaean taxonomy" of the
grammatical domain: its aim (comparable to that of Linnaeus's
eighteenth-century taxonomy for the domain of botany) is not to identify
categories which are theoretically optimal or which necessarily reflect
the psychological organization of speakers' linguistic competence, but
simply to offer a scheme of categories and ways of applying them that
make it practical for NLP researchers to register everything that occurs
in real-life usage systematically and unambiguously, and for researchers
at different sites to exchange empirical grammatical data without
misunderstandings over local uses of analytic terminology.

The SUSANNE Corpus comprises an approximately 128,000-word subset of the
Brown Corpus of American English, annotated in accordance with the
SUSANNE scheme. The SUSANNE analytic scheme is defined in detail in a
book by myself, ENGLISH FOR THE COMPUTER, forthcoming from Oxford
University Press, and briefly in a documentation file which accompanies
the Corpus. The Chairman of the Analysis and Interpretation Working
Group of the US/EC-sponsored Text Encoding Initiative has proposed the
adoption of the scheme as a recognised TEI standard.

The SUSANNE scheme aims to specify annotation norms for the modern
English language; it does not cover other languages, although it is
hoped that the general principles of the SUSANNE scheme may prove
helpful in developing comparable taxonomies for these. Regrettably,
Release 1 of the SUSANNE Corpus is not a "TEI-conformant" resource,
though aspects of the annotation scheme have been decided in such a way
as to facilitate a move to TEI conformance in later releases. The
working timetable of the Initiative meant that relevant aspects of the
TEI Guidelines were not yet complete at the point when the SUSANNE
Corpus was ready for initial release; delaying this release would have
been unfortunate.
Although the SUSANNE analytic scheme is by now rather tightly defined,
Release 1 of the SUSANNE Corpus undoubtedly still contains errors
despite considerable proof-checking. It is intended to correct these in
later releases; I should be extremely grateful if users discovering
errors would notify me, preferably by post rather than e-mail.

The SUSANNE Corpus consists of 64 data files (each comprising an
annotated version of one Brown text), together with a documentation
file. However, the versions held by the Oxford Text Archive are
compressed, in order to reduce file transfer time, into single files in
two alternative formats, suitable for Unix users and for users who have
access only to a PC. The procedure for retrieving a copy of the Corpus
in either case is as follows.

From a machine on the Internet, type either:

   ftp black.ox.ac.uk

or, since the Archive is not yet in many official name tables:

   ftp 129.67.1.165

When connected, you will be prompted for an account name, to which you
should respond:

   ftp
or:
   anonymous

You will be asked to supply a password, in response to which you should
type your e-mail address. After this is accepted, your first command
should be to move to the directory containing the Text Archive files,
by typing:

   cd ota

To see a list of the files and directories currently available, type:

   ls

All files relating to the SUSANNE Corpus are kept in the directory
"susanne", so your next command should be:

   cd susanne

Apart from a README file containing the instructions which you are
currently reading, this directory contains the two alternative
compressed versions of the SUSANNE Corpus.
To retrieve a copy of the corpus, if you are a Unix user, type:

   get susanne.tar.Z

Having successfully transferred a copy of "susanne.tar.Z" to your home
system, get the material into a usable state by the successive commands:

   uncompress susanne.tar.Z
and:
   tar -xf susanne.tar

If you are not a Unix user, you need to retrieve the other version of
the Corpus, which will be uncompressed using the PKUNZIP software on an
IBM-PC. First, set ftp transfer mode to binary by typing the command:

   bin

at the ftp prompt. Then retrieve the appropriate version of the Corpus
by typing:

   get susanne.zip

Having transferred a copy of the Corpus to your home machine, uncompress
it with the command:

   pkunzip -x susanne.zip

In either case (whether you have followed the Unix or the non-Unix
instructions) you should now have the Corpus split up into its 65 files,
one of which, "SUSANNE.doc", is a text file describing the format and
contents of the 64 data files.

To log out of the ftp connexion, type:

   bye

If you encounter any problems, please send an e-mail message to
archive@black.ox.ac.uk or archive@uk.ac.oxford.vax.
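[Editor's note: the Unix unpacking steps above can be tried out locally
before fetching the real corpus. The sketch below exercises only the
final "tar -xf" step on a throwaway dummy archive ("susanne_demo" and
its contents are invented stand-ins, not the real corpus files); the
LZW decompression step is skipped because compress(1) is not present on
many modern systems, where gunzip can also unpack .Z files.]

```shell
# Illustrative only: build a dummy archive standing in for the
# transferred susanne.tar, then extract it exactly as the
# instructions above describe.
mkdir -p susanne_demo
echo "placeholder documentation" > susanne_demo/SUSANNE.doc
tar -cf susanne_demo.tar susanne_demo   # stand-in for the downloaded file
rm -r susanne_demo                      # pretend only the tar file exists
tar -xf susanne_demo.tar                # the final step from the instructions
cat susanne_demo/SUSANNE.doc
```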