Character Encoding - BBj


BBj strings are composed of 8-bit characters. The local encoding is defined by the host operating system and can be retrieved with the BBj function INFO(1,2). For example, Microsoft Windows in Western locales typically uses the 8-bit character set Cp1252.

Standard Java library classes, such as java.lang.String and java.lang.Character, work with 16-bit Unicode characters. BBj provides conversion methods to convert between BBj strings, encoded in the local 8-bit character set, and Java strings, encoded in 16-bit Unicode. For example:


REM ' Converting between the local character set and Unicode
PRINT INFO(0,0)+" uses the 8-bit character set: "+INFO(1,2)
REM ' The Euro symbol is $80$ in Cp1252
LET E$=$80$
REM ' Convert the local string to Unicode
LET E!=BBjAPI().toUnicode(E$)
REM ' Get the numeric value of the Unicode character
LET E=E!.charAt(0)
PRINT "The Unicode value of $80$ is: '\u"+HTA(BIN(E,2))+"'"
REM ' Java library methods work with Unicode
LET TYPE=java.lang.Character.getType(E)
PRINT "The Euro character is Java character type",TYPE
REM ' Convert Unicode back to the local character set
PRINT "The Local encoding for ",E$," is: ",
PRINT "$"+HTA(BBjAPI().toLocal(E!))+"$"

Sample output:

Windows 2000 uses the 8-bit character set: Cp1252
The Unicode value of $80$ is: '\u20AC'
The Euro character is Java character type 26
The Local encoding for € is: $80$

When passing characters or strings to Java library classes such as java.lang.Character and java.lang.String, it is important to consider the character encoding. The encoding does not matter for methods such as String.length() and String.substring(), which operate on character positions without interpreting the characters themselves. It does matter for methods such as String.toUpperCase() and Character.getType(), which interpret each character as a Unicode value. Use BBjAPI().toUnicode() to get the Unicode equivalent of any BBj string. Note: This method converts any characters that are undefined in the local character set to the Unicode replacement character '\uFFFD'.
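The effect described above can be reproduced on the Java side alone. The following is a minimal sketch, not BBj's actual implementation: it assumes the local character set is Cp1252 (windows-1252), and the class and helper names (LocalToUnicodeSketch, toUnicode, toLocal) are hypothetical stand-ins that only mirror the BBjAPI() method names. It uses standard java.nio.charset calls to show that case conversion works once local bytes are decoded to Unicode, and that a byte undefined in Cp1252 decodes to the replacement character.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class LocalToUnicodeSketch {
    // Assumption: the local 8-bit character set is Cp1252 (windows-1252).
    static final Charset LOCAL = Charset.forName("windows-1252");

    // Decode local 8-bit bytes into a Java (Unicode) string.
    // Bytes with no mapping are replaced by the decoder's default, U+FFFD.
    static String toUnicode(byte[] local) {
        return new String(local, LOCAL);
    }

    // Encode a Java string back into local 8-bit bytes.
    static byte[] toLocal(String unicode) {
        return unicode.getBytes(LOCAL);
    }

    public static void main(String[] args) {
        // $E9$ is a lower-case e with acute accent in Cp1252.
        byte[] lower = { (byte) 0xE9 };
        String upper = toUnicode(lower).toUpperCase(Locale.ROOT);
        System.out.printf("Upper-case local byte: $%02X$%n", toLocal(upper)[0] & 0xFF);

        // $81$ is not assigned in Cp1252.
        byte[] undefined = { (byte) 0x81 };
        System.out.printf("Undefined byte decodes to: U+%04X%n",
                (int) toUnicode(undefined).charAt(0));
    }
}
```

Decoding first matters because String.toUpperCase() consults Unicode case mappings; it has no knowledge of which 8-bit character set the data originally came from.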

Use the BBjAPI().toLocal() method to convert Unicode strings returned from Java library routines to BBj format. Note: This method converts any Unicode characters that are undefined in the local character set to the "?" character ($3F$).
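The lossy direction can be illustrated the same way. This is a hedged Java sketch under the assumption that the local character set is Cp1252; ToLocalSketch, toLocal, and fitsLocal are hypothetical names, not part of BBj. It shows a Unicode character with no Cp1252 equivalent being encoded as "?" ($3F$), and one way to check in advance whether a string would survive the conversion.

```java
import java.nio.charset.Charset;

public class ToLocalSketch {
    // Assumption: the local 8-bit character set is Cp1252 (windows-1252).
    static final Charset LOCAL = Charset.forName("windows-1252");

    // Encode a Unicode string as local bytes. Characters that cannot be
    // mapped are replaced by the charset's default replacement, '?'.
    static byte[] toLocal(String unicode) {
        return unicode.getBytes(LOCAL);
    }

    // Report whether every character has a local equivalent,
    // that is, whether the conversion would be lossless.
    static boolean fitsLocal(String unicode) {
        return LOCAL.newEncoder().canEncode(unicode);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";   // Euro sign, defined in Cp1252 as $80$
        String cjk = "\u4E2D";    // a CJK ideograph, not present in Cp1252

        System.out.printf("Euro -> $%02X$, lossless: %b%n",
                toLocal(euro)[0] & 0xFF, fitsLocal(euro));
        System.out.printf("CJK  -> $%02X$, lossless: %b%n",
                toLocal(cjk)[0] & 0xFF, fitsLocal(cjk));
    }
}
```

Because the substitution is silent, a check like fitsLocal() is useful when a "?" in the result would be indistinguishable from a genuine question mark in the data.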

BBj programs and data files use the local character set exclusively. Conversion between the local character set and Unicode matters only when interacting with Java library routines.

The following program can be used to determine how closely the local 8-bit character set corresponds to Unicode:


PRINT "Characters that differ between ",INFO(1,2)," and Unicode:",'LF'
FOR c=0 TO 255
  LET c$=CHR(c); REM ' Local character in the range $00$..$FF$
  LET u=BBjAPI().toUnicode(c$).charAt(0); REM ' Unicode encoding
  IF u<>c THEN PRINT "$",HTA(c$),"$="+c$+"='\u",HTA(BIN(u,2)),"' ",
NEXT c

Sample output:

Characters that differ between Cp1252 and Unicode:
$80$=€='\u20AC' $81$=�='\uFFFD' $82$=‚='\u201A' $83$=ƒ='\u0192' $84$=„='\u201E'
$85$=…='\u2026' $86$=†='\u2020' $87$=‡='\u2021' $88$=ˆ='\u02C6' $89$=‰='\u2030'
$8A$=Š='\u0160' $8B$=‹='\u2039' $8C$=Œ='\u0152' $8D$=�='\uFFFD' $8E$=Ž='\u017D'
$8F$=?='\uFFFD' $90$=?='\uFFFD' $91$=‘='\u2018' $92$=’='\u2019' $93$=“='\u201C'
$94$=”='\u201D' $95$=•='\u2022' $96$=–='\u2013' $97$=—='\u2014' $98$=˜='\u02DC'
$99$=™='\u2122' $9A$=š='\u0161' $9B$=›='\u203A' $9C$=œ='\u0153' $9D$=?='\uFFFD'
$9E$=ž='\u017E' $9F$=Ÿ='\u0178'

See Also

Character Sets and Character Encoding

For more information about the Unicode standard,