Accelerating Conversion to SGML via the Rainbow Format

David Sklar
Director of Applications
Electronic Book Technologies

*** THE PROBLEM: INEFFICIENCY IN CONVERSION EFFORTS

"Up conversion" -- the translation of a document from a proprietary word-processor (WP) format to an SGML document conforming to a useful DTD -- is one of the thorniest problems an organization faces when it adopts SGML. Up conversion is typically performed by building (or hiring a consultant to build) a custom conversion application via a translation-enabling tool like OmniMark, TagWrite, or FastTag.

The conversion application typically involves two phases: 1) extraction and interpretation of the formatting codes in the WP format, and 2) identification of content and structure.

The second phase is the more sophisticated one, for it involves creating something (structure and true content identification) from "nothing" (WP formats which are typically flat and lacking in content identification). A considerable amount of planning and thought is necessary for implementation of this phase, and it is almost always necessary to custom-build this phase for each organization.

But the first phase -- extraction of the WP formatting codes -- is not very sophisticated, for it is simply a translation. The most difficult part of implementing this phase is becoming an expert in the proprietary WP format itself. This is difficult because WP formats are typically under-documented and subject to change without notice; moreover, they are full of idiosyncrasies that defy common sense.

Obviously, it makes no sense for each SGML adopter to independently become an expert in RTF (or any other WP format), but that redundancy of effort is exactly what's happening in the SGML community today. Hundreds of organizations are redundantly writing scripts to interpret RTF codes, redundantly encountering the same documentation problems, redundantly wrestling with the frustration of working with a proprietary format.
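
The two phases described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the "style|text" input syntax is an invented stand-in for a real WP format, and the names FlatPara, Section, extract_flat, and build_structure are hypothetical. It does not reproduce RTF, the Rainbow DTD, or any particular conversion tool. The point is only that phase one is a mechanical translation that every organization would otherwise rewrite, while phase two encodes a rule (here, "a paragraph styled Heading 1 starts a section") that each organization must choose for itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlatPara:
    """A paragraph as a WP file records it: a style name and text, no structure."""
    style: str
    text: str


@dataclass
class Section:
    """A structural unit recovered in phase two (hypothetical target shape)."""
    title: str
    paras: List[str] = field(default_factory=list)


def extract_flat(toy_lines: List[str]) -> List[FlatPara]:
    """Phase 1: pure translation of 'style|text' lines into flat records.

    The 'style|text' syntax is an invented stand-in for a real WP format.
    """
    flat = []
    for line in toy_lines:
        style, _, text = line.partition("|")
        flat.append(FlatPara(style.strip(), text.strip()))
    return flat


def build_structure(flat: List[FlatPara],
                    heading_style: str = "Heading 1") -> List[Section]:
    """Phase 2: infer sections from style names (an organization-specific rule)."""
    sections: List[Section] = []
    for para in flat:
        if para.style == heading_style:
            # A heading paragraph opens a new section.
            sections.append(Section(para.text))
        else:
            if not sections:
                # Text before the first heading goes into an untitled section.
                sections.append(Section(""))
            sections[-1].paras.append(para.text)
    return sections


if __name__ == "__main__":
    toy = [
        "Heading 1| Introduction",
        "Body| First paragraph.",
        "Heading 1| Scope",
        "Body| Second paragraph.",
    ]
    for section in build_structure(extract_flat(toy)):
        print(section.title, section.paras)
```

In this sketch only extract_flat would need to change when the input format changes; build_structure depends solely on the flat records.
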
The cost of this redundancy -- in terms of money, time, and frustration -- is too much for the SGML community to bear, particularly if the community wishes to continue the growth rate it has enjoyed in the recent past. How can we prevent organizations from redundantly implementing phase one?

Imagine a world in which all up-conversion efforts start from a single stable, well-documented input format instead of a variety of unstable proprietary ones. In that world, an organization implementing up conversion would be able to focus all its energies on phase two; it would not waste time researching sundry proprietary formats, and it would not waste time updating its conversion scripts each time a WP vendor modifies its format.

*** THE SOLUTION: RAINBOW

Electronic Book Technologies, in conjunction with several other key SGML vendors and promoters, has designed a format that is suitable for acting as the starting point for up-conversion efforts. We call that format "Rainbow", because it represents a unification of the wide spectrum of proprietary formats.

The Rainbow format is actually an SGML DTD; this design decision (although not strictly necessary) offers several benefits. It ensures that the format is readily understood by all members of the SGML community, and that existing tools can be used for validation, viewing, analysis, and editing of Rainbow files. It also means that a larger variety of tools can be used for up conversion; in particular, SGML transformation tools can play a role that was previously unavailable to them.

The Rainbow DTD is publicly available, and can be used and modified by organizations and individuals freely. An FTP server has been created to provide a forum for the distribution of Rainbow-related data, and a mailing list has been created to keep interested parties informed of Rainbow-related developments.
To subscribe to the mailing list, send email to rainbow@ebt.com; you will receive a reply with specific information on how to access the FTP server.

As described above, the primary purpose of Rainbow is to insulate up-conversion implementations from the dynamic and eccentric world of proprietary formats. Obviously, this noble idea is not practical unless the SGML community has easy access to "Rainbow Makers" (software programs designed to convert WP-format documents to Rainbow documents). It is our hope that members of the SGML community will share Rainbow Makers that they build, thus helping us eliminate redundant efforts.

To "jump start" the Rainbow-Maker effort, EBT has decided to develop public-domain Rainbow Makers for four key WP formats. Three of them are available now in beta versions: RTF (Microsoft Word), MIF, and Interleaf. The fourth one (WordPerfect) is in the planning stages. Initially, EBT will distribute only executable versions (PC and several UNIX platforms) of these Makers, but when the versions become more stable, EBT will make the source code available. Independently of EBT, another member of the SGML community is contributing a Rainbow Maker for the Ventura format; that Maker will be released in the form of an OmniMark script via the mailing-list mechanism described above.

*** THE SCOPE OF THE RAINBOW FORMAT

Quite early in the Rainbow design effort, I decided to restrict the scope of Rainbow, in order to ensure that we did not replace proprietary complexity with standardized complexity ("out of the frying pan, into the fire"). Rainbow's primary goal is to store information useful for the performance of the second phase of up conversion; thus, Rainbow does not attempt to represent all of the formatting and page-layout information inherent in WP formats. Rainbow does retain all the textual data found in the original WP document, but it retains formatting information only if it is useful for recognition of content and structure.
For example, information on explicit page breaks and column breaks is maintained, because such information could be useful in recognizing the boundaries of high-level structures (e.g. chapters). However, information on the number of columns per page is not retained, because column count is rarely an indication of structure and content.

Thus, Rainbow is not intended for use in driving typesetting and display engines. Moreover, it is not possible to convert a Rainbow document back into the original WP document from which it came. By keeping the Rainbow design simple, I seek to keep the format stable, well-documented, easy to learn and use, and well insulated from developments in the state of word-processor technology.

*** RAINBOW AS A MEANS, NOT AN END

Although Rainbow is an SGML format, it is not appropriate for the permanent representation of data. It contains no more structure or content identification than that found in the original WP document; it can be considered to be "content free". And so it should be, for its purpose is to be the *starting* point for up conversion, not the ending point!

Some members of the SGML community have expressed concern that Rainbow may be abused -- that some organizations will attempt to satisfy "SGML mandates" by converting documents to Rainbow and going no further. But it is important to note that Rainbow is actually a rather late entry in the field of "content-minimal" DTDs. HTML and the OracleBook viewer's native DTD are two examples of content-minimal DTDs that existed long before Rainbow; HTML is the only one of the three that is being used as an "end" instead of a "means". As always, education is the primary weapon against abuse; that is as true in the promotion of SGML as it is in the fight against drugs.

*** FOR MORE INFORMATION...

Join the Rainbow mailing list by sending email to rainbow@ebt.com.
You will receive a reply that includes information on how to access the Rainbow FTP server that contains the Annotated Rainbow DTD (in PostScript format); that document is essential reading if you wish to successfully use or experiment with Rainbow.