Character Encoding - BBj

Description

When the BBx language was originally defined, the distinction between characters and bytes could often be ignored. The underlying character set was assumed to be ASCII or extended ASCII (Windows-1252, ISO 8859-1, and later MacRoman). BBx string functions like CHR() and ASC() implicitly assume that a character is a single byte within the range 0 to 255, and substring references treat bytes and characters as interchangeable concepts.

Over time, these legacy single-byte character sets were superseded by multi-byte character sets, most of which were based on the evolving Unicode standard. Internally, Java strings are UTF-16, while the Web is almost entirely UTF-8. In Java 17 and earlier, Java defined the default character encoding on startup based on "the locale and charset of the underlying operating system." With the notable exception of Microsoft Windows, most platforms now define the default character encoding to be UTF-8. Starting in Java 18, Java standardized its default character encoding to be UTF-8 on all platforms, including Microsoft Windows.

Because some applications depend on the assumption that they're running in Microsoft Windows with the Windows-1252 character set, a compatibility flag is available to revert to the legacy character encoding.
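The dependency described above can be illustrated outside of BBj with a few lines of plain Java. This is only a sketch: the class name LegacyBytesDemo and its decode helper are hypothetical, and the byte value is the Windows-1252 encoding of the Euro sign. It shows that a byte written under Windows-1252 is not valid UTF-8 on its own, so code that stores or compares such bytes can behave differently once the default changes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of BBj.
class LegacyBytesDemo {
    // Decode the same raw bytes under an explicitly named charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // 0x80 is how Windows-1252 stores the Euro sign.
        byte[] legacy = { (byte) 0x80 };

        String asCp1252 = decode(legacy, Charset.forName("windows-1252"));
        String asUtf8 = decode(legacy, StandardCharsets.UTF_8);

        // Windows-1252 maps the byte to U+20AC (decimal 8364).
        System.out.println((int) asCp1252.charAt(0));
        // A lone 0x80 is malformed UTF-8, so Java substitutes U+FFFD (decimal 65533).
        System.out.println((int) asUtf8.charAt(0));
    }
}
```

Naming the charset explicitly, as the sketch does, keeps the result independent of whichever default the JVM selects.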

This sample program shows how character encoding works in different environments:

0010 print info(0,0)," ",info(0,1)
0020 print info(1,0)," ",info(1,1)
0030 print "Server character set: ",info(1,2)
0040 print "Client character set: ",info(1,3)
0050 Euro$ = new String($20ac$,"UTF-16")
0060 print "Euro$: ",Euro$
0070 print "len(Euro$):",len(Euro$)
0080 print "hta(Euro$): ",hta(Euro$)
0090 Euro! = Euro$
0100 print "Euro!.length():",Euro!.length()
0110 print "Euro!.codePointAt(0):",Euro!.codePointAt(0)

If we run that program in Microsoft Windows using Java 17, we see the legacy Windows-1252 single-byte character set:

Windows 11 10.0
Java 17.0.12
Server character set: windows-1252
Client character set: windows-1252
Euro$: €
len(Euro$): 1
hta(Euro$): 80
Euro!.length(): 1
Euro!.codePointAt(0): 8364

If we run that program in Microsoft Windows using Java 21, we see UTF-8, the new Java default character set:

Windows 11 10.0
Java 21.0.4
Server character set: UTF-8
Client character set: UTF-8
Euro$: €
len(Euro$): 3
hta(Euro$): E282AC
Euro!.length(): 1
Euro!.codePointAt(0): 8364

This highlights the byte-oriented nature of the BBx string functions. In both runs the Euro$ string variable contains a single character, but Windows-1252 encodes that character in one byte (hex 80) while UTF-8 encodes it in three bytes (hex E282AC). The BBx LEN() function reports the length of the string in bytes, so it returns 1 in the first run and 3 in the second. The BBjString API can be used to examine the string in terms of Unicode characters, as opposed to raw bytes: BBjString::length reports that the string contains one Unicode character, and BBjString::codePointAt reports that the first character of the string is Unicode 8364 (hex 20AC), the Euro symbol (€).
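The same byte-versus-character distinction can be reproduced in plain Java, which may help when reasoning about what LEN() and HTA() will report under each encoding. The following is a minimal sketch, not BBj code: the class EuroBytes and its hex helper are hypothetical names, and the helper only approximates what HTA() shows by hex-dumping the bytes for an explicitly named charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of BBj.
class EuroBytes {
    // Hex dump of a string's bytes under the given charset.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mirrors the sample's Euro$ assignment: the bytes 20 AC read as UTF-16
        // (big-endian is assumed when no byte-order mark is present).
        String euro = new String(new byte[] { 0x20, (byte) 0xAC }, StandardCharsets.UTF_16);

        System.out.println(euro.length());                               // 1
        System.out.println(euro.codePointAt(0));                         // 8364
        System.out.println(hex(euro, StandardCharsets.UTF_8));           // E282AC
        System.out.println(hex(euro, Charset.forName("windows-1252"))); // 80
    }
}
```

The character count and code point are the same in both cases; only the byte representation changes with the encoding.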

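If the application depends on the assumption that the underlying character set is Windows-1252, add the compatibility setting -Dfile.encoding=COMPAT to all BBj Java Args in Enterprise Manager by going to Java Settings and clicking edit. This setting tells Java 18 and later to derive the default character encoding from the operating system's locale, as Java 17 and earlier did.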

After adding that compatibility setting to all BBj Java Args values and restarting BBjServices, the sample program reverts to the original Java 17 behavior:

Windows 11 10.0
Java 21.0.4
Server character set: windows-1252
Client character set: windows-1252
Euro$: €
len(Euro$): 1
hta(Euro$): 80
Euro!.length(): 1
Euro!.codePointAt(0): 8364
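To see how a given JVM resolves its default encoding with and without the flag, a small standalone Java program can report the relevant values. This is a sketch under stated assumptions: the class name DefaultCharsetReport is hypothetical, it runs in its own JVM rather than inside BBjServices, and the native.encoding system property is only defined on Java 17 and later (it prints null on older releases).

```java
import java.nio.charset.Charset;

// Hypothetical demo class; runs in a standalone JVM, not inside BBjServices.
class DefaultCharsetReport {
    // Collect the charset the JVM settled on plus the properties that influence it.
    static String report() {
        return "default charset: " + Charset.defaultCharset().name()
            + "\nfile.encoding:   " + System.getProperty("file.encoding")
            + "\nnative.encoding: " + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it once as `java DefaultCharsetReport.java` and once as `java -Dfile.encoding=COMPAT DefaultCharsetReport.java` on the same machine shows whether the flag changes the default charset in that environment.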

See Also

Character Sets and Character Encoding

JEP 400: UTF-8 by Default

Guide to Character Encoding

For more information about the Unicode standard, see unicode.org