International Support mini-HOWTO

			 2.2.2 - June 29 2000
		      Jim Hall, <jhall1@isd.net>

_____________________________________________________________________
0. ABOUT INTERNATIONAL SUPPORT

Most programmers never think about their programs being used by users
in other countries, or that the program would be unusable by someone
who does not speak English.  By the true nature of hackers, programs
are first written to fill a personal need.  To extend a program to
support multiple languages is something that many programmers never
consider.

However, adding international support to your program makes it more
useful.  Now, your program becomes accessible to more people than just
those who speak your native tongue.

How do you add international support to your programs?  This
mini-HOWTO shows you what you need to do to add international support
(also called "internationalization" or "i18n") easily to your
projects.

Internationalization (for the end-user) is provided in traditional DOS
systems in two ways:

1. Keyboard support (KEYB or XKEYB) to add keyboard definitions for
   international keyboards.

2. Code page support (CHCP) to change to a code page that specifically
   supports your language characters, if the default code page does not.

These two methods are not enough for most applications.  A third
method was defined by ANSI C and POSIX to help application developers
write programs that support internationalization:

3. setlocale() to set the locale for your programs.  This affects
   functions such as printf() and strcoll().

But this does not help your program produce messages in a language
that an international user can understand.  To do that, you need a
fourth method:

4. message catalogs

There are several ways to do this (MSGLIB and "Cats" are only two of
them.)  The "Cats" interface is documented here.

_____________________________________________________________________
1. KEYBOARD

(content needed)

_____________________________________________________________________
2. CODE PAGE

(content needed)

_____________________________________________________________________
3. LOCALE

A locale is a set of language and cultural rules.  These cover aspects
such as language for messages, different character sets, lexicographic
conventions, numerical formatting, etc.  A program needs to be able to
determine its locale and act accordingly to be portable to different
cultures.

Locale is defined by ANSI C and POSIX, and is not implemented by
"Cats".  This information is provided only for completeness.

The locale consists of two basic interfaces:

  setlocale() to set your program's idea of its locale

  localeconv() to get information about number formatting

The function setlocale() has the following usage:

  char *
  setlocale (int category, const char *locale);

setlocale() sets the given locale category to the new locale.  It
returns a pointer to a string that names the resulting locale, or NULL
if the request cannot be honored.  This string may be allocated in
static storage.

There are different categories of locale information a program might
need (declared as macros).  Passing one of them as the "category"
argument to setlocale() sets that category to the desired locale:

  LC_ALL for all of the locale.
  
  LC_COLLATE for the functions strcoll() and strxfrm().
  
  LC_CTYPE for the character classification and conversion routines.
  
  LC_MONETARY for localeconv().
  
  LC_NUMERIC for the decimal character.
  
  LC_TIME for strftime().

If the "locale" argument to setlocale() is the empty string (""), the
locale is (on many systems) determined from the environment using the
following rules, in order.  (A NULL argument only queries the current
locale.)

1.  The environment variable LC_ALL, if available

2.  If an environment variable with the same name as one of the
    categories above exists, its value is used for that category.

3.  The environment variable LANG, if available

Be aware that Borland C 3.1 (DOS) only supports the "C" locale, so
invoking setlocale() will have no effect.  Other compilers may or may
not support multiple locales, or may "fake it" like BC31 does.  Check
your compiler's documentation.

Now, let's use setlocale() in a program.  In this example, we have a
program that needs to print the value "1.5" to the user.  In the
English locale, you use a decimal point (".") to separate the integer
from the fractional part of the number.  In another locale (say,
Spanish) you use a comma (",") as the separator.  Making a call to
setlocale() makes this transparent to the program:

  /* lc.c */

  #include <stdio.h>
  #include <stdlib.h>                     /* exit */
  #include <locale.h>                     /* setlocale */
  
  int
  main (void)
  {
    float x;
    x = 1.5;
  
    setlocale (LC_NUMERIC, "");
    printf ("%5.2f\n", x);
  
    exit (0);
  }

On a system that fully supports setlocale(), the output might look
like this:

  C:\>lc
   1.50
  
-or-

  C:\>SET LANG=es
  C:\>lc
   1,50

A program can adopt the user's locale for every category by calling
setlocale (LC_ALL, "") during program initialization.

_____________________________________________________________________
4. MESSAGE CATALOGS

Some time ago, I wrote an implementation of the UNIX catgets()
function based on an in-memory key-value database.  This library is
called "Cats" (message CATalog System), a DOS implementation of
catgets().  "Cats" is available at
http://www.isd.net/jhall1/freedos/cats

For those who don't know about the UNIX catgets() function, you just
use it to return a pointer to a localized string based on a message
number.  All your program's messages are stored in a file, with
message numbers and set numbers.  "Hello world" might be message 1 in
set 1.  Your copyright statement might be message 2 in set 1.
"Failure writing to drive A:" might be message 4 in set 7.

Before you use catgets() you must first open a message catalog file.
You do this with the catopen() function:

  nl_catd
  catopen (char *catalog_file, int flags);

"Cats" assumes environment variables to point to the location of the
message catalogs, and consults them when you call catopen().  If the
"catalog_file" parameter does not contain a directory separator ("\"),
then NLSPATH is the directory in which you keep your message catalogs,
and LANG is the country code abbreviation.  This is similar to the
UNIX implementation.

The "flags" parameter is ignored in this implementation of "Cats".
On UNIX systems, it indicates the type of loading desired.  It should
be either MCLoadBySet or MCLoadAll, and controls whether catopen()
loads only the required set from the catalog into memory on an
as-needed basis or loads the entire catalog into memory.

catopen() returns a message catalog descriptor of type nl_catd on
success.  On failure, it returns -1.  "Cats" can only have one message
catalog open at a time.

Once you have a message catalog available, you use catgets() like
this:

  char *
  catgets (nl_catd catalog, int msg_set, int msg_num,
           char *default_msg);

That is, you tell catgets() to retrieve a message for you from the
message catalog based on a set number and message number.  catgets()
returns a pointer to that string.  If catgets() can't find the
set/message, it returns the default string that you passed it.

The best way to learn how to use "Cats" is to download the library,
and look at the sample programs in src/.

Here's a sample C program to show you how to use it:

  /* fail.c */

  #include <stdio.h>
  #include <stdlib.h>			/* exit */
  #include "catgets.h"			/* catopen/catgets */

  int
  main (void)
  {
    char *s;
    nl_catd cat;			/* catalog descriptor */
  
    cat = catopen ("fail", MCLoadAll);  /* MCLoadAll is ignored */
  
    s = catgets (cat, 7, 4, "Failure writing to drive A:");
    printf ("%s\n", s);
  
    catclose (cat);
    exit (0);
  }

The above sample program would have this execution:

  C:\>fail
  Failure writing to drive A:

-or-

  C:\>SET LANG=es
  C:\>fail
  Incidente que escribe a A:

My message catalogs are plain ASCII text, and look like this:

(File=FAIL.EN)

  1.1:Hello world
  7.4:Failure writing to drive A:

-or-

(File=FAIL.ES)

  1.1:Hola mundo
  7.4:Incidente que escribe a A:

Since message catalogs are plain ASCII files, it is easy for a user
to take one catalog and create a translation that another user can use
*immediately* (i.e. you don't need to "re-compile" the message catalog
before you use it).


*********************************************************************
			      APPENDICES
*********************************************************************

_____________________________________________________________________
A. OTHER SOURCES OF INFORMATION: MSGLIB

Steffen Kaiser's MSGLIB does much the same thing as "Cats", but uses
binary ("compiled") message catalogs.  Steffen writes this about
MSGLIB:

  This is an offspring of the internationalization debate back in 1994
  or 1995.
  
  The current release to be used is located at:
  ftp://ftp-fd.inf.fh-rhein-sieg.de/pub/local/ALPHA/msglib.zip
  
  A non-Alpha release is at:
  ftp://ftp-fd.inf.fh-rhein-sieg.de/pub/local/msglib31.zip
  
  It does support locally stored message strings only (see below for
  details).
  
  MSGLIB is designed to overcome three problems:
  
  1.  where to retrieve the message strings from (msg retriever),
  
  2.  how to tweak the message string so that the function that
      displays a string looks the same regardless of which language is
      chosen (msg interpreter),
  
  3.  how to attach semantics when displaying a message (msg
      visualisor).
  
  Parts 1 and 2 can be independent of each other; part 3 joins both
  together.  That means one could use the msg retriever stand-alone.
  
  The msg retriever currently supports two methods; a third has been
  started, but is postponed.
  
  Method #1 stores all msg strings statically in the program code. This
  was the easiest method to implement.
  
  Method #2 reads the msg strings from a file, which can be a stand-alone
  file, attached to the executable, or even embedded into any data
  file. The "recommended" way (that means the way shown in the examples)
  makes no use of the latter ability.
  
  Method #3 joins methods #1 and #2 together and tries to read a
  requested msg string from a file, on failure use an internally stored
  one -- this would be the same as the HP-style "catgets()".
  
  Currently MSGLIB does support the catgets() function already.
  
  But I never thought that addressing a msg string by a number was a
  good thing.  Instead the user of MSGLIB assigns a symbol to the
  message, like "E_noMem" or "M_red". Regardless of which method of the
  msg retriever is in use, the statement:
  
    char *p;
    
    p = msgLock(E_noMem);
  
  makes sure that 'p' is pointing to the msg string stored in memory. To
  free resources one just calls:
  
    msgUnlock(E_noMem);
  
  -or-
  
    msgUnlockString(p);
  
  Method #2 of the msg retriever also supports that two or more
  languages are packed together into the same binary msg catalogue. When
  the msg retriever scans the msg catalogue the first time, it
  identifies the most fitting language. For lack of another source,
  I chose the country ID as my language ID; however, a language ID
  is just a number and the only thing to change is the function that
  creates this number from the system resources.
  
  One thing not mentioned so far is that msg strings are viewable with
  certain codepages only. So besides the language ID MSGLIB associates a
  codepage ID with each msg string, which must match strictly. (Not
  included with any released version, yet.)
  
  The msg catalogues themselves have been designed to be usable for
  other stuff as well, just like the resource block in MS Windows. The
  msg catalogue consists of a series of individual chunks, which are
  linked together so that a program need not know the internal format
  of each chunk, but can concentrate on picking the ones necessary for
  itself.
  
  The basic idea was to split msg strings into two categories: local and
  global.
  
  Frequently used msg strings are declared as "global" and are
  automatically available to all programs using MSGLIB without having
  to write them into the local msg catalogue. So the programmer can
  concentrate on just the program.
  
  The other three parts of MSGLIB are described in the included
  documentation (Postscript file). Within this doc there is a list of
  all functions and a very short description of them. (I really can't
  remember why the msg retriever is not described there?)
  
  I haven't received much response about MSGLIB, yet, so don't feel
  guilty if you don't like MSGLIB; however, it would be a pity if the
  missing/bad documentation or a "can't get this stuff to work" notion
  prevent an examination or critique.
  
  Why is MSGLIB currently postponed?  I was struck by the problem that
  the command line passed to the message compiler became too short; in
  fact, the last time, I shortened the filenames to the bare minimum to
  keep the compiler working. So I decided to make a command line and
  configuration file parser that takes away this ever-annoying problem
  from me. It is currently found in the pre-release of SUPPL (all those
  cfg*.c files); though it stagnates a bit, too, because I'm currently
  writing all the missing docs besides comments in the source code for
  SUPPL (er, a thing most programmers really hate, I've learnt in the
  past).

Examples of programs using MSGLIB31 are SWSUBST and ASSIGN.

_____________________________________________________________________
B. LANGUAGE/COUNTRY CODES

Steffen Kaiser writes this about language/country codes:

  Defined in the context of internet/WWW/HTML are:
  
  * ISO-639 defines 2-letter codes for languages, though, there are not
    many.  These codes are also used by the locale implementation of
    most traditional Unix systems.
    (http://jargo.itim.mi.cnr.it/documentazione/iso639_codes.html)
  
  * ISO-3166 describes country codes, 2-letter, 3-letter and numerical
    format.
    (http://jargo.itim.mi.cnr.it/documentazione/iso3166_codes.html)
  
  * For bibliographic purposes, a 3-letter code has been invented.  I
    came across these a while back, but cannot remember where; I guess
    it was in the description of the computer index of some large
    library. But I found these: http://www.sil.org/ethnologue/names/
  
    This index is sort of reversed: you need to know the correct and
    full name of the language, and the index maps it to a 3-letter code.
    Yeah, it is interesting how many officially registered variants
    of "German" exist.
  
  The bibliographic codes were invented to be complete and even
  distinguish among (widely known) dialects of a language.
  
  There exist two "officially" used ways to express a language: one
  derives the code from the English word for the language (which can
  then be expressed with 7-bit ASCII symbols), and one derives it from
  the word used by the language itself, transcribed into Latin
  characters in order to be expressible in 7-bit ASCII.

The "Cats" library doesn't really assume anything about the language
code, except that it must be three or fewer letters.  (In short, this
is because the message catalog is located by using the NLSPATH and
LANG environment variables.  The LANG environment variable defines the
language code, which is implemented as the file extension for the
message catalog.)

_____________________________________________________________________
FURTHER READING:

Useful Links regarding software internationalization 

- http://dmoz.org/Computers/Software/Globalization/Internationalization/
  - The Open Directory Project (at the dmoz site), whose goal is to
  produce the most comprehensive directory of the web, by relying on a
  vast army of volunteer editors.

- http://www.lib.ox.ac.uk/internet/news/faq/by_category.internationalization.html
  - Oxford's I18n FAQ

- http://www.unicode.org/ - UNICODE

- http://www.w3.org/International/ - World Wide Web consortium:
  Non-western Character sets, Languages, and Writing Systems

- http://anubis.dkuug.dk/maits/i18n - Standards for
  Internationalization

- http://www.microsoft.com/globaldev - Microsoft's Software
  Globalization Information

- http://www.microsoft.com/win32dev/uiguide/uigui445.htm - Microsoft's
  I18n Guidelines

- http://gatekeeper.dec.com/pub/DEC/DECinfo/DTJ/v5n3/THE_XOPEN_INTERNATIONALIZATIO_01jan1994DTJB03SC.txt
  - THE X/OPEN INTERNATIONALIZATION MODEL

- http://cns-web.bu.edu/pub/djohnson/web_files/i18n/i18n.html -
  Concepts of C/UNIX Internationalization (paper by Dave Johnson, Boston
  University)

- http://java.sun.com:80/products/jdk/1.1/docs/guide/intl/index.html -
  Java I18n info from Sun

- http://www.ibm.com/java/education/globalapps/ - IBM's Java Global
  Application Guide

- http://www.ibm.com/java/education/international-text/index.html -
  More from IBM on internationalized text in Java 1.2

- http://www.digital.com/info/DTJB00/ - Digital Equipment technical
  paper on I18n (Feb. 1994)

- http://www.cis.ohio-state.edu/hypertext/faq/usenet/internationalization/iso-8859-1-charset/faq.html
  - ISO 8859-1 National Character Set FAQ

- http://www.stri.is/TC304/EURO/default.html - Information and
  Communication Technologies European Localization Requirements

- http://www.gnu.org/manual/glibc-2.0.6/html_chapter/libc_19.html -
  GNU manual on setting the locale

- http://www.gnu.org/manual/gettext-0.10.35/text/gettext.txt and
  http://www.gnu.org/manual/gettext/html_node/gettext_44.html - GNU
  manual about 'gettext' (GNU's library for retrieving localized
  strings from a message catalog.)  Lots of info about
  internationalization and localization (including a discussion of
  what the two mean).

- http://www.cs.ruu.nl/wais/html/na-dir/internationalization/programming-faq.html
  - Introduction to i18n programming.  A MUST-READ!

- http://www.leb.net/archives/reader/csi/0039.html - Programming for
  Internationalization FAQ

- http://www.iso.ch/cate/cat.html - ISO Standards 

- http://www.czyborra.com/ - Unicode on Unix 

- http://www.hut.fi/u/jkorpela/chars.html - A tutorial on character
  code issues

- http://www.w3.org/MarkUp/html-spec/charset-harmful.html - Character
  sets considered harmful

- http://www.iro.umontreal.ca/~pinard/recode/ - GNU recode 

_____________________________________________________________________
HISTORY

Version 1.0 was the FreeDOS International Support Mini-HOWTO, but I
have changed this document to be more specific to the "Cats" library.
I will include the other information at the end as "Other Sources of
Information."