Character Sets and Character Encoding

Introduction

A multitiered, multiplatform Business BASIC application cannot be successful without the correct use of character sets and character encoding. Inconsistent character encoding results in strange characters being displayed on the screen, in both GUI and CUI programs. Characters such as the Euro or characters with umlauts are replaced by boxes or other symbols. Graphics characters used for drawing lines and boxes appear instead as letters. Edit keys may no longer function. Data stored to disk may become inconsistent or corrupted. It is therefore essential that Business BASIC developers understand the concepts and issues behind character encoding.

Although this discussion explains the basic concepts of character encoding and makes some general recommendations, it is impossible to present detailed solutions for every possible application. It is left to developers and system administrators to examine their own configuration and apply the concepts where necessary. We begin first with a few definitions.

Glossary of Terms

Character – A letter, number, punctuation mark, or mathematical or monetary symbol, such as 'A' or '5' or '$'. Characters are represented on a computer display by images (called glyphs) provided by a font. For the purposes of this discussion, characters should not be confused with the font image used to represent them, or with the numeric byte value assigned to them by a particular character set.

Character Set – An arbitrary mapping between characters and byte sequences that are used to represent the characters. Different character sets may use different mappings, or in other words, assign different byte values to the same characters, which is the root of all problems with character encoding. Some character sets assign single byte mappings to characters, while others use multiple bytes per character. The greater the number of bytes per character, the greater the number of characters that can be defined by the character set.

Single byte character sets are typically oriented towards specific human languages or groups of languages, since their maximum number range is not great enough to give each possible character a unique byte code. Most single byte character sets are extensions and modifications of ASCII, which substitute unneeded characters with others that the local language requires. Examples of common single byte character sets include US-ASCII, Cp1252, and the ISO-8859 series such as ISO-8859-1 and ISO-8859-15. Multibyte character sets have a much broader range of numbers to choose from, and can therefore support many more characters and languages. Multibyte character sets such as UTF-8 and UTF-16, based on the Unicode standard, are becoming ubiquitous and will probably displace single byte character sets in the next few years.

Character Encoding – This term is sometimes used interchangeably with 'character set', but also refers to the process of or algorithms for translating characters into their byte mappings according to a specific character set. Unicode character sets such as UTF-8 and UTF-16 are commonly referred to as character encodings, since they are not simple character-to-byte-sequence mappings. They instead define a set of rules and mathematical formulas for translating Unicode-specified integer codes into byte sequences that legacy computer applications can understand.

All characters are encoded before being displayed on a computer screen or written to a file on a disk or printed on paper. This encoding may use the computer platform's default character set, or the character set specified by a system environment variable. Unfortunately, different platforms have different default character sets, as well as different ways of specifying character sets to use in their place. Proper character encoding in Business BASIC depends on knowing which character sets meet the needs of the application, and mastering the various techniques which cause those character sets to be used.

Unicode – An all-encompassing character code standard begun in the late 1980s with the goal of providing character mapping support for all known human languages. In its simplest form, Unicode (and the corresponding standard from the International Organization for Standardization, ISO-10646) provides a set of tables that assign integer numbers to characters.

Character encodings based on Unicode that support all the characters defined in Unicode, such as UTF-8 and UTF-16, need more than one byte for most characters: UTF-8 uses one to four bytes per character, and UTF-16 uses two or four. Older single byte character sets, such as US-ASCII and ISO-8859-1, can in some cases be regarded as subsets of Unicode because the Unicode designers attempted to remain backwards compatible if at all possible.
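The difference in byte counts can be seen with a short sketch. It is written in plain Java rather than BBj (an assumption made for illustration, since BBj runs on a Java Virtual Machine), and the class and method names are hypothetical. It reports how many bytes a character occupies in a single byte character set and in two Unicode encodings. UTF-16BE is used instead of UTF-16 so that Java does not prepend a byte order mark to the result.

```java
import java.nio.charset.Charset;

class ByteCounts {
    // Number of bytes the named charset needs to encode the given text.
    static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // 'A', u-umlaut and the Euro sign, written as Unicode escapes so the
        // source file itself does not depend on any particular encoding.
        String[] samples = { "A", "\u00FC", "\u20AC" };
        String[] charsets = { "ISO-8859-15", "UTF-8", "UTF-16BE" };
        for (String s : samples) {
            for (String cs : charsets) {
                System.out.printf("U+%04X in %-11s: %d byte(s)%n",
                        (int) s.charAt(0), cs, byteCount(s, cs));
            }
        }
    }
}
```

In this sketch the byte count for UTF-8 grows with the character (one byte for 'A', more for the umlaut and the Euro), while ISO-8859-15 always needs exactly one byte.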

Some programming languages developed since the mid-1990s, such as Java, make use of the Unicode standard for internal representation of characters and text strings. A Java character string is a collection of Unicode characters.

Encoded String – A sequence of bytes formed by applying the mapping defined by the character set to each character in a string of text. It is critical to realize that an encoded string does not contain characters. It contains only bytes that represent characters, according to a specific character set. If an encoded string is produced by one character set and then decoded (converted) back into text by another different character set, the characters may be altered. Only when encoding/decoding is performed by the same character set can the integrity of the characters be guaranteed.
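A minimal sketch of this round trip follows, written in plain Java as an illustration only; the class name RoundTrip and its helper are hypothetical. It encodes a short German word with one character set and decodes the resulting bytes with either the same or a different character set.

```java
import java.nio.charset.Charset;

class RoundTrip {
    // Encode text to bytes with one charset, then decode those bytes with another.
    static String roundTrip(String text, String encodeWith, String decodeWith) {
        byte[] encoded = text.getBytes(Charset.forName(encodeWith));
        return new String(encoded, Charset.forName(decodeWith));
    }

    public static void main(String[] args) {
        // German greeting containing u-umlaut (U+00FC) and sharp s (U+00DF)
        String original = "Gr\u00FC\u00DFe";
        String same = roundTrip(original, "ISO-8859-15", "ISO-8859-15");
        String mixed = roundTrip(original, "UTF-8", "ISO-8859-15");
        System.out.println("same charset preserved the text:   " + original.equals(same));
        System.out.println("mixed charsets preserved the text: " + original.equals(mixed));
        System.out.println("length before/after mixed decode:  "
                + original.length() + "/" + mixed.length());
    }
}
```

When the two character sets differ, each of the two-byte UTF-8 sequences is decoded as two separate single byte characters, so the text comes back longer and altered.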

Locale – On Linux and UNIX systems, a locale is a configuration setting that specifies country-specific conventions for software behavior, such as the character encoding, the date/time notation, rules for alphabetic sorting, the measurement system, etc. Locale names usually combine a two letter ISO 639-1 language code and a two letter ISO 3166-1 country code, in the format 'll_CC'. For example, en_US for English in the United States, de_DE for German in Germany, de_AT for German used in Austria, pt_BR for Portuguese in Brazil, sv_SE for Swedish in Sweden, fr_FR for French in France, etc. Locales are typically specified as the values of the LANG and LC_ALL environment variables. (See below for more information about using locales with LANG and LC_* environment variables.)

Client – For this discussion, a client is the machine the user interacts with. The client displays the visible parts of an application.

Server – In this discussion, a server is the machine that runs the BBj interpreter, or which reads and writes files from disk.

Causes of Character Encoding Problems

A classic example of a character encoding problem involves the Euro character '€', which is a relatively new invention that represents European currency. Imagine a Business BASIC accounting application running somewhere in Europe on a typical multitiered network: The server is a Linux machine connected to Windows clients. Users are able to enter the '€' currency symbol from their clients just fine, but they discover that in data coming back from the server, the Euro symbol is replaced by a strange box with little dots at each corner. Furthermore, when an important file is transferred from the old server to a newer machine with a more recent Linux distribution, every instance of the Euro character in the file seems to be corrupted. What is going wrong?

This awkward situation could have several causes:

The clients and the server are performing character encoding with different and incompatible character sets. The Windows clients use Cp1252, where the Euro character is mapped to $80$ (hexadecimal 80). The server, a Linux machine, uses ISO-8859-15 where the Euro is mapped to $A4$. The client and the server do not agree on the numeric code that represents a Euro, and therefore cannot properly display or store a Euro character from the other machine.
The old Linux server, using the ISO-8859-15 character set, wrote Euro characters to disk as $A4$. The all-important data file was then copied to the new Linux server, which uses a newer distribution such as SuSE 9.1 or Red Hat 8.0, where the default character encoding has been quietly changed to UTF-8. This Unicode-based character encoding maps the Euro to $E282AC$, and once again there is a disconnect. The bytes stored in the file were encoded with one character set on the old server and are now recovered and decoded by a different, incompatible character set on the new server.
The accounting application itself is flawed, or at least not ready for use in a multiplatform enterprise. Its original authors regarded the value $80$ as synonymous with the Euro character and wrote program logic that specified $80$. They did not foresee environments where the Euro character would have some other value.
Business BASIC itself has deep roots in the single-byte-per-character world. Its functions and verbs expect characters to be represented by single bytes, and multibyte character encodings like UTF-8 are not yet supported. (Correcting this deficiency will not be a trivial undertaking.)
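The first two causes can be reproduced outside of BBj. The following plain Java sketch (the class and method names are hypothetical, chosen for illustration) decodes the byte values from the example with each of the character sets involved and prints the Unicode code point that results.

```java
import java.nio.charset.Charset;

class EuroMismatch {
    // Decode one raw byte value with the named charset and return the
    // resulting character as a "U+XXXX" code point label.
    static String decodeByte(int byteValue, String charsetName) {
        byte[] raw = { (byte) byteValue };
        String decoded = new String(raw, Charset.forName(charsetName));
        return String.format("U+%04X", (int) decoded.charAt(0));
    }

    public static void main(String[] args) {
        // Byte $80$, as produced by a Cp1252 client for the Euro
        System.out.println("0x80 read as Cp1252:      " + decodeByte(0x80, "Cp1252"));
        System.out.println("0x80 read as ISO-8859-15: " + decodeByte(0x80, "ISO-8859-15"));
        // Byte $A4$, as written to disk by the old ISO-8859-15 server
        System.out.println("0xA4 read as ISO-8859-15: " + decodeByte(0xA4, "ISO-8859-15"));
        System.out.println("0xA4 read as UTF-8:       " + decodeByte(0xA4, "UTF-8"));
    }
}
```

Only the matching pairs yield U+20AC, the Euro. Byte $80$ read as ISO-8859-15 becomes the control character U+0080, and byte $A4$ read as UTF-8 is not a valid sequence on its own, so Java substitutes the replacement character U+FFFD.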

Recommendations for Solving and Avoiding Character Encoding Problems

The following recommendations apply to versions of BBj prior to and including BBj 4.02. As the language is improved, these suggestions will be updated accordingly.

Choose the right character set. Select a character set that contains mappings for all the characters that an application and its users will want to see. In the example above, the European users of the accounting system would not be happy with the ISO-8859-1 character set because it has no mapping for the Euro character at all.
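One way to check a candidate character set against the characters an application needs is sketched below in plain Java. This is an illustration under stated assumptions, not a BBj facility: the class name, the helper, and the sample list of required characters are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class CharsetCoverage {
    // True when every character in 'required' has a mapping in the named charset.
    static boolean covers(String charsetName, String required) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(required);
    }

    public static void main(String[] args) {
        // Euro sign, a/o/u with umlauts and sharp s, written as Unicode escapes
        String required = "\u20AC\u00E4\u00F6\u00FC\u00DF";
        for (String name : new String[] { "ISO-8859-1", "ISO-8859-15", "Cp1252" }) {
            System.out.println(name + " covers the required characters: "
                    + covers(name, required));
        }
    }
}
```

With this sample list, ISO-8859-1 is rejected because of the Euro, while ISO-8859-15 and Cp1252 both qualify.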

Avoid multibyte character encodings. As mentioned above, multibyte character encodings such as UTF-8 and UTF-16 are not yet supported by BBj. At this writing, only two Linux distributions, SuSE 9.1 and Red Hat 8.0 and later, including the Fedora Core series, have full support for the UTF character encodings. Using one of these encodings on an earlier distribution may cause subtle problems.

Use the same character set on all clients and servers. Configure each machine to use the same character set. (Some details about how to do this are given below.) A legacy application that insists on a specific mapping for a given character, such as $80$ for the Euro, will function without any problems when all the clients and servers use the same character set.

Avoid reliance on "hard-coded" byte sequences for specific characters. This requires a paradigm shift for Business BASIC programmers, who are used to working with byte encoded strings. It is not safe to assume that any given byte encoded string will always represent a specific character when data is transmitted between different machines or devices that may use different character encodings. A more robust technique in BBj is to handle characters and character strings internally as objects representing Java Strings (which automatically consist of Unicode characters) and then perform explicit encoding when the data is moved between devices. The following code example contains a defined function which demonstrates conversion between Java Strings and the ISO-8859-15, Cp1252 and UTF-8 character sets. Also see Character Encoding for another example of explicit character encoding in BBj.

Example

rem ' UTF-16 representation of euro
euroBytes! = $20AC$

rem ' UTF-16 representation of copyright symbol
copyrightBytes! = $00A9$

rem ' UTF-16 representation of OE diphthong
oeDiphthongBytes! = $0152$

print "Byte Representations of Euro symbol"
bytes! = fnGetBytes!(euroBytes!, "ISO-8859-15")
gosub PRINT_BYTES
bytes! = fnGetBytes!(euroBytes!, "Cp1252")
gosub PRINT_BYTES
bytes! = fnGetBytes!(euroBytes!, "UTF-8")
gosub PRINT_BYTES

print $0A$, "Byte Representations of Copyright symbol"
bytes! = fnGetBytes!(copyrightBytes!, "ISO-8859-15")
gosub PRINT_BYTES
bytes! = fnGetBytes!(copyrightBytes!, "Cp1252")
gosub PRINT_BYTES
bytes! = fnGetBytes!(copyrightBytes!, "UTF-8")
gosub PRINT_BYTES

print $0A$, "Byte Representations of OE Diphthong"
bytes! = fnGetBytes!(oeDiphthongBytes!, "ISO-8859-15")
gosub PRINT_BYTES
bytes! = fnGetBytes!(oeDiphthongBytes!, "Cp1252")
gosub PRINT_BYTES
bytes! = fnGetBytes!(oeDiphthongBytes!, "UTF-8")
gosub PRINT_BYTES
END

PRINT_BYTES:
    len = java.lang.reflect.Array.getLength(bytes!)
    print "bytes length:", len
    print "byte values:",
    for i = 0 to len - 1
        print " 0x" + STR(hta(bin(java.lang.reflect.Array.getByte(bytes!, i), 1))),
    next i
    print $0A$,
return

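rem ' Convert UTF-16 bytes into the bytes of the character set named by newEncoding$
def fnGetBytes!(utf16Bytes!, newEncoding$)
    rem ' ISO-8859-1 maps each byte value to itself, so this recovers the raw bytes
    byteRep! = utf16Bytes!.getBytes("ISO-8859-1")
    rem ' Decode the raw bytes as UTF-16 into a Java String, then encode it again
    return new java.lang.String(byteRep!, "UTF-16").getBytes(newEncoding$)
fnend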

Specifying a Character Encoding on Windows

To specify use of a particular character encoding on a Microsoft Windows machine, add a line like one of the following to the <bbj install dir>/cfg/BBj.properties file:

basis.java.args.BBjServices=-Dfile.encoding=ISO-8859-15
or

basis.java.args.Default=-Dfile.encoding=Cp1252
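Because file.encoding is a Java system property, its value is a Java character set name. Before editing BBj.properties, it may be worth confirming that the Java runtime recognizes the intended name. The plain Java sketch below is one possible check; the class and method names are hypothetical and it is not part of BBj.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetNames {
    // Canonical JVM name for a charset name or alias, or null when the
    // JVM does not recognize it.
    static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return null;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] { "ISO-8859-15", "Cp1252", "no-such-charset" }) {
            System.out.println(name + " -> " + canonicalName(name));
        }
        System.out.println("JVM default charset: " + Charset.defaultCharset().name());
    }
}
```

Note that Cp1252 is an alias: Java reports its canonical name as windows-1252, so either spelling may appear when the encoding is displayed later.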

Specifying a Character Encoding on Linux, UNIX, and Mac

The character encoding used on a Linux or UNIX system depends on the setting of the LC_ALL, LC_CTYPE or LANG environment variables. (At this writing, setting a basis.java.args=-Dfile.encoding= line in the BBj.properties file has no effect.) These three environment variables accept the name of a locale as their value. It is not typically necessary to explicitly set all three variables, but it is important to understand their hierarchy and what they actually do.

The LC_* environment variables, such as LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, etc, each control a specific aspect of software behavior in a given country-specific locale. They exist to provide a finer degree of control over these behaviors. The LC_CTYPE variable controls character encoding. The LANG environment variable provides a setting for each of these various aspects, but individual behavior categories can be overridden by the LC_* variables. The LC_ALL environment variable supersedes both the LC_* and the LANG variables. When BBj is started on a Linux or UNIX machine, the system sets the character encoding by checking the LC_ALL, LC_CTYPE and LANG environment variables in that order. The first one of these variables that contains a valid setting will be used, and the others will be ignored. This means if a particular setting for LANG is ineffective, check for the presence of LC_CTYPE or LC_ALL variables which could be overriding it. If none of these environment variables are set, the system will use the default locale found in the /usr/lib/locale directory.
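The lookup order described above can be sketched as follows. This is plain Java for illustration only, with hypothetical class and method names. It treats any non-empty value as usable and does not check whether the named locale is actually installed, which the real system does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LocalePrecedence {
    // Return the first non-empty value among LC_ALL, LC_CTYPE and LANG,
    // checked in that order, or null when none of them is set.
    static String effectiveCtype(Map<String, String> env) {
        for (String name : new String[] { "LC_ALL", "LC_CTYPE", "LANG" }) {
            String value = env.get(name);
            if (value != null && !value.isEmpty()) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Resolve against the environment of the current process
        System.out.println("this process: " + effectiveCtype(System.getenv()));

        // Resolve against a sample environment where LC_CTYPE overrides LANG
        Map<String, String> sample = new LinkedHashMap<>();
        sample.put("LANG", "de_DE@euro");
        sample.put("LC_CTYPE", "de_DE.UTF-8");
        System.out.println("sample environment: " + effectiveCtype(sample));
    }
}
```

In the sample environment the LANG setting is ignored because LC_CTYPE is present, which is the situation to look for when a LANG setting appears to have no effect.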

Configuring environment variables for BBj can be done in the <bbj install dir>/bin/.envsetup file. Following are some examples:

LC_ALL=en_US.ISO8859-15

In this example a code set modifier, .ISO8859-15, has been appended to the locale name to specify the ISO-8859-15 character set.

LANG=de_DE@euro

The @euro causes the ISO-8859-15 character set to be used, which has a character mapping for the Euro character at $A4$. Setting the de_DE locale without @euro would get the ISO-8859-1 character set, which has no Euro character.

LANG=de_DE.cp1252

Here the code set modifier .cp1252 has been appended to the de_DE locale, specifying that Microsoft's modified version of the ISO-8859-1 character set should be used. This setting for LANG provides $80$ as a mapping for the Euro character, which would solve the difficulty with the accounting system described above. Unfortunately, most Linux systems do not have locales equipped to use the CP1252 character set. Such a new locale would have to be defined before the LANG variable could be assigned to it. See "Defining a New Locale" below for instructions on how to do this.
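The three examples above follow the general pattern ll_CC, optionally followed by a code set introduced by a period and a modifier introduced by @. The plain Java sketch below splits a locale name into those parts. It is an illustration with hypothetical names, and it assumes the parts appear in that order.

```java
class LocaleName {
    // Split a locale name of the form ll_CC[.codeset][@modifier] into
    // { language, country, codeset, modifier }; missing parts are "".
    static String[] parse(String name) {
        String modifier = "";
        String codeset = "";
        String country = "";
        int at = name.indexOf('@');
        if (at >= 0) {
            modifier = name.substring(at + 1);
            name = name.substring(0, at);
        }
        int dot = name.indexOf('.');
        if (dot >= 0) {
            codeset = name.substring(dot + 1);
            name = name.substring(0, dot);
        }
        int underscore = name.indexOf('_');
        if (underscore >= 0) {
            country = name.substring(underscore + 1);
            name = name.substring(0, underscore);
        }
        return new String[] { name, country, codeset, modifier };
    }

    public static void main(String[] args) {
        for (String s : new String[] { "en_US.ISO8859-15", "de_DE@euro", "de_DE.cp1252" }) {
            String[] p = parse(s);
            System.out.println(s + " -> language=" + p[0] + " country=" + p[1]
                    + " codeset=" + p[2] + " modifier=" + p[3]);
        }
    }
}
```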

Displaying the Current Environment Settings

Microsoft Windows:

On a Windows XP system, selecting Start -> Control Panel -> Regional and Language Options displays a dialog that controls various system settings. (This dialog is also available in older versions of Windows via the Control Panel, although its name and path are slightly different.) The dialog does not, however, reveal the specific character encoding that will be used by BBj. Use the BBj info functions described below to determine this.

Linux/UNIX/Mac:

Use the echo command as in echo $LANG or echo $LC_ALL to display the current settings of environment variables that affect character encoding.

The locale command is useful in determining what resources are available for character encoding. The command locale -m displays a list of all the available character sets on a given machine. Use locale charmap to see which character set is currently being used. The locale -a command displays a list of all the available locale definitions. Remember it is likely that not all the character sets have a locale definition that uses them.

Inside BBj:

The info(4, *) and info(7, *) functions show the environment variables set on the server and on the client respectively. In this syntax, the '*' character should be replaced by an integer number from 0 to the maximum number of environment variables minus one. These two functions are typically used inside a loop structure, since it is not possible to know in advance which numbers are linked to which environment variables. The following program demonstrates this technique:

rem ' Show the encoding reported by BBj and by the Java runtime
print "info(1,2)=",info(1,2)
print "file.encoding=",java.lang.System.getProperty("file.encoding")

rem ' Scan the server environment, starting at index 0
rem ' info() raises an error past the last variable, which ends the loop
index=-1
while 1
    index = index + 1
    serverEnv$= info(4,index,err=ClientEnv)
    if serverEnv$(1,4)="LANG" or serverEnv$(1,7)="COUNTRY" or serverEnv$(1,7)="CHARSET" or serverEnv$(1,6)="LC_ALL" then
        print "serverEnv$: ",serverEnv$
    endif
wend

ClientEnv:
rem ' Repeat the scan for the client environment
index=-1
while 1
    index = index + 1
    clientEnv$= info(7,index,err=*stop)
    if clientEnv$(1,4)="LANG" or clientEnv$(1,7)="COUNTRY" or clientEnv$(1,7)="CHARSET" or clientEnv$(1,6)="LC_ALL" then
        print "clientEnv$: ",clientEnv$
    endif
wend

The info(1,2) function shows the character encoding being used on the server:

> print info(1,2)

ISO-8859-15

As of BBj 4.02, the info(1,3) function shows the character encoding being used on the client.

Defining a New Locale

The localedef command can be used to create a new locale definition which associates a country-specific configuration with a particular character set for character encoding. (The following instructions are valid on many Linux distributions. For more information regarding new locale definitions for other UNIX environments, contact the vendor.) Before using the localedef command, use locale -m to make sure the desired character set is installed. As mentioned above, few Linux systems come pre-equipped with a locale that specifies Microsoft's Cp1252 character set, even though most of them have this character set installed. If a need exists to force a Linux machine to do Windows-style Cp1252 character encoding (as in the case of the European accounting system), the following command will create the necessary German locale:

localedef -f CP1252 -i de_DE /usr/lib/locale/de_DE.cp1252

The -f parameter is the name of the character set, -i is the name of the input file and /usr/lib/locale/de_DE.cp1252 represents the directory location and name of the new locale file. After this command is executed successfully, the list shown by the locale -a command will include the new locale name, and setting LANG=de_DE.cp1252 will begin character encoding with the CP1252 character set.

References

A vast amount of information about character encoding is available on the Internet. To learn more about the topics mentioned above, visit these websites:

www.unicode.org and www.unicode.org/unicode/faq. The definitive source for information about Unicode.

www.cs.tut.fi/~jkorpela/chars.html Jukka Korpela's exhaustive tutorial on character code issues, containing more than you want to know.