Internationalizing commonly used characters e.g. @, &, $

Date: 18 September 2006
Product/Release: LANSA Integrator V11.0
Abstract: Characters in compiled string literals do not have the same HEX code across all EBCDIC CCSIDs
Submitted By: LANSA Technical Support

Description:

When string literals are used in iSeries LANSA, RPG or C programs, certain characters can be mapped incorrectly, giving undesired results. For example, an @ sign in the source code, say within the email address john@company.com, is output after translation as john§company.com

Suggestion:

If you have source code that contains literal strings, remember that some commonly used characters do not have the same HEX code across all EBCDIC CCSIDs.

The @ character in CCSID 37 is HEX 7C, but the same character in CCSID 273 is HEX B5.
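The difference can be checked from any Java runtime that includes the extended EBCDIC charsets. The following is a minimal sketch, not part of the LANSA product: the class name VariantCharacters, its two helper methods and the candidate character list are illustrative assumptions. Cp037 and Cp273 are the Java names for CCSID 37 and CCSID 273.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class VariantCharacters {

    // Returns the single EBCDIC byte for a character in the named code page, as 0..255.
    static int codePoint(char c, String codePage) {
        byte[] bytes = String.valueOf(c).getBytes(Charset.forName(codePage));
        return bytes[0] & 0xFF;
    }

    // Lists the characters of 'candidates' whose byte value differs between the two code pages.
    static List<Character> variants(String candidates, String pageA, String pageB) {
        List<Character> result = new ArrayList<>();
        for (char c : candidates.toCharArray()) {
            if (codePoint(c, pageA) != codePoint(c, pageB)) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.printf("@ in CCSID 37  : %02X%n", codePoint('@', "Cp037"));
        System.out.printf("@ in CCSID 273 : %02X%n", codePoint('@', "Cp273"));
        // The candidate list is only a sample of characters worth checking.
        System.out.println("Variant characters: "
                + variants("@&$#[]{}|!\\", "Cp037", "Cp273"));
    }
}
```

Running the same check against the CCSIDs used at your own sites shows which characters in your literals are at risk.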

If you have an @ character in your source and the source is compiled using CCSID 37, then the HEX code 7C is included in the compiled object.

If you are creating a literal command string to send to SMTPMailService, the command string is encoded in the target compile CCSID.

SEND FROM ( name@company )

If at runtime you are running in CCSID 273, the conversion of the HEX 7C code to Unicode @ does not work, because HEX 7C is the § character in CCSID 273.

It needs to be HEX B5 to be converted to Unicode @.
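One way to picture the required change is to convert the compiled bytes from the compile CCSID to the job CCSID before they are interpreted. The sketch below only illustrates that idea in Java and is not how SMTPMailService is implemented; the class LiteralTranscode and the method transcode are hypothetical names.

```java
import java.nio.charset.Charset;

public class LiteralTranscode {

    // Re-encodes EBCDIC bytes from the compile-time code page to the run-time code page.
    static byte[] transcode(byte[] compiled, String compilePage, String jobPage) {
        String unicode = new String(compiled, Charset.forName(compilePage));
        return unicode.getBytes(Charset.forName(jobPage));
    }

    public static void main(String[] args) {
        // '@' as it is stored in an object compiled under CCSID 37
        byte[] compiled = { (byte) 0x7C };

        // Convert the stored byte to the job CCSID 273
        byte[] converted = transcode(compiled, "Cp037", "Cp273");
        System.out.printf("%02X%n", converted[0] & 0xFF);   // prints B5

        // Decoding with the job CCSID now yields the intended character
        String decoded = new String(converted, Charset.forName("Cp273"));
        System.out.println(decoded);
    }
}
```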

transport: Client-Encoding-CCSID : 273
transport: Client-Language : DEU
transport: Client-Country : DE

===============================================================================

import java.io.FileOutputStream ;

public class CcsidSimulation // The class name is illustrative
{
    public static void main ( String[] args ) throws Exception
    {
        // The Java compiler converts the ISO8859_1 source string to Unicode characters
        String address = "name@company" ;

        // The Unicode string is converted to EBCDIC CCSID 37 bytes - This simulates the literal compile
        byte[] byteArray = address.getBytes ( "Cp037" ) ;

        // These bytes are converted back to Unicode using EBCDIC CCSID 273 - This simulates the job CCSID
        String address2 = new String ( byteArray, "Cp273" ) ;

        // This simulates the trace output.
        FileOutputStream trace = new FileOutputStream ( "address.txt" ) ;
        trace.write ( address2.getBytes ( "UTF-8" ) ) ;
        trace.close () ;
    }
}

===============================================================================

CRTBNDC PGM(MYPGM) TGTCCSID(*SOURCE)

CRTCMOD MODULE(MYMODULE) TGTCCSID(*SOURCE)

TGTCCSID - Specifies the target coded character set identifier used to describe data stored into the resulting module object.

1 to 65535   A valid CCSID.
*JOB         The current job's CCSID is used.
*SOURCE      The root source file's CCSID is used.

===============================================================================

ILE C/C++ Programmer's Guide Chapter 31. Internationalizing a Program

This chapter describes how to:

  • Create a source physical file with a specific Coded Character Set Identifier (CCSID)
  • Change the CCSID of a member in a source physical file to the CCSID of another member in another source physical file
  • Convert the CCSID for specific source statements in a member

The ILE C/C++ compiler recognizes source code that is written in most single-byte EBCDIC CCSIDs. CCSID 290 is not recognized because it does not have the same code points for the lowercase letters a to z. All of the other EBCDIC CCSIDs do have the same code points for the lowercase letters a to z. String literals can be converted back to CCSID 290 by using the #pragma convert directive. A file with CCSID 290 still compiles because the ILE C/C++ compiler converts the file to CCSID 037 before compiling.

CCSID 905 and 1026 are not recognized because the " character varies on these CCSIDs.

The CRTCMOD/CRTCPPMOD and CRTBNDC/CRTBNDCPP commands do not support the SRCSTMF parameter in a mixed-byte environment.

Double-byte character set (DBCS) source code requires special programming considerations.

Note: You should tag the source physical file with a CCSID value number if the CCSID (determined by the primary language) is other than CCSID 037 (US English).

===============================================================================

Coded Character Set Identifiers

A Coded Character Set Identifier (CCSID) comprises a specific set of an encoding scheme (EBCDIC, ASCII, or 8-bit ASCII), character set identifiers, code page identifiers, and additional coding-related information that uniquely identifies the coded graphic character representation used.

A character set is a collection of graphic characters.

Graphic characters are symbols, such as letters, numbers, and punctuation marks.

A code page is a set of binary identifiers for a group of graphic characters.

Code points are binary values that are assigned to each graphic character, to be used for entering, storing, changing, viewing, or printing information.

Character Data Representation Architecture (CDRA) defines the CCSID values to identify the code points used to represent characters, and to convert the character data as needed to preserve their meanings. 

===============================================================================

Source File Conversions to CCSID

Your ILE C/C++ source program can be made up of more than one source file.

You can have a root source member and multiple secondary source files (such as include files and DDS files).

If any secondary source files are tagged with CCSIDs that are different from the root source member, their contents are automatically converted to the CCSID of the root source member as they are read by the ILE C/C++ compiler.

When a CCSID of 65535 is involved, the CCSID of the source file is assumed as follows:

  • If the primary source physical file has CCSID 65535, the job CCSID is assumed for the source physical file.
  • If the source physical file has CCSID 65535 and the job is CCSID 65535, and the system has non-65535, the system CCSID value is assumed for the source physical file.
  • If the primary source physical file, job, and system have CCSID 65535, then CCSID 037 is assumed.
  • If the secondary file, job, and system CCSID is 65535, then the CCSID of the primary source physical file is assumed, and no conversion takes place.
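The fallback order for the primary source physical file can be expressed as a short function. This is a sketch of the rules stated above for illustration only, not compiler code; the class SourceCcsid and the method effectivePrimaryCcsid are hypothetical names. The secondary file rule is left out because the text describes only the case where the secondary file, job and system are all 65535.

```java
public class SourceCcsid {

    static final int NO_CONVERSION = 65535;   // CCSID 65535
    static final int DEFAULT_CCSID = 37;      // CCSID 037 (US English)

    // Resolves the CCSID assumed for the primary source physical file:
    // file CCSID, then job CCSID, then system CCSID, then 037.
    static int effectivePrimaryCcsid(int fileCcsid, int jobCcsid, int systemCcsid) {
        if (fileCcsid != NO_CONVERSION) {
            return fileCcsid;
        }
        if (jobCcsid != NO_CONVERSION) {
            return jobCcsid;
        }
        if (systemCcsid != NO_CONVERSION) {
            return systemCcsid;
        }
        return DEFAULT_CCSID;
    }

    public static void main(String[] args) {
        System.out.println(effectivePrimaryCcsid(273, 37, 37));          // 273
        System.out.println(effectivePrimaryCcsid(65535, 273, 37));       // 273
        System.out.println(effectivePrimaryCcsid(65535, 65535, 500));    // 500
        System.out.println(effectivePrimaryCcsid(65535, 65535, 65535));  // 37
    }
}
```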

The compiler converts DBCS source files to CCSID 037.

===============================================================================

Creating a Source Physical File with a Coded Character Set Identifier

You specify the character set you want to use with the CCSID parameter when you create a source physical file.

The default for the CCSID parameter is the CCSID of the job.

This figure shows you what happens when you create a program object that has a root source member with CCSID 273 and include files with different CCSIDs.

The ILE C compiler converts the include files to CCSID 273.

The program object is created with the same CCSID as the root source member.

See attached image - Source File CCSID Conversion

Note: Some combinations of the root source member CCSID and the include file CCSID are not supported.

Example:

The following example shows you how to specify CCSID 273 for the source physical file QCSRC in library MYLIB.

To create a source physical file with CCSID 273, type:

CRTSRCPF FILE(MYLIB/QCSRC) CCSID(273)

===============================================================================

Changing the Coded Character Set Identifier (CCSID)

To change the CCSID of the source physical member from one CCSID to another, use the command CPYF with parameter FMTOPT(*MAP) to obtain the copy of the source physical member in another CCSID.

The following example shows you how to change a member in a source file with CCSID 037 to CCSID 273.

Example:

CRTSRCPF FILE(MYLIB/NEWCCSID) CCSID(273)

CPYF FROMFILE(MYLIB/QCPPSRC) TOFILE(MYLIB/NEWCCSID) FROMMBR(HELLO) TOMBR(HELLO) MBROPT(*ADD) FMTOPT(*MAP)

Notes:

  1. The first command creates the source physical file NEWCCSID with CCSID 273.
     
  2. During the copy file operation, the character data in the from-member is converted between the from-file field CCSID and the to-file field CCSID as long as a valid conversion is defined.
     
  3. The HELLO member in the file QCPPSRC is copied to NEWCCSID with CCSID 273. If CCSID 65535 or *HEX is used, it indicates that character data in the fields is treated as bit data and is not converted.

===============================================================================

Converting String Literals in a Source File

You can convert the string literals in a source program from the point that the #pragma convert directive is specified to the end of the program.

The #pragma convert directive specifies the CCSID to use for converting the string literals from that point onward in the program.

The conversion continues until the end of the source or until another #pragma convert directive is specified.

If a CCSID with the value 65535 is specified, the CCSID of the root source member is assumed.

If the source file CCSID value is 65535, CCSID 037 is assumed.

The CCSID of the string literals before conversion is the same CCSID as the root source member.

The CCSID can be either EBCDIC or ASCII.

===============================================================================

Targeting a CCSID

The TGTCCSID parameter allows the compiler to:

  • Process source files from a variety of CCSIDs or code pages (in the case of a source stream file)
  • Target a module CCSID different from that of the root source file, as long as the translation between the source character set and the target module CCSID is installed into the operating system.

Target CCSID (TGTCCSID) is a parameter used with the following ILE C/C++ commands:

  • Create C Module (CRTCMOD)
  • Create C++ Module (CRTCPPMOD)
  • Create Bound C (CRTBNDC)
  • Create Bound C++ (CRTBNDCPP)

===============================================================================

How the ILE C/C++ Compiler Converts a Source File to a Target CCSID

When the TGTCCSID differs from the source file's CCSID, the ILE C compiler converts the source files to the TGTCCSID and then processes them.

This ensures that the target module and all its character data components (for example, listing, string pool) are in the desired TGTCCSID.

You can then develop in one character set and target another.

The argument defaults to the source file's character set so the default behavior is backward compatible (with the exception of 290, 930 and 5026).

Note: C++ language only

C++ converts only the string literals (not the source) to the TGTCCSID.

Providing support for more source character sets increases the NLS usability of the compilers.

CCSIDs 290, 930 and 5026 are now supported.

The TGTCCSID parameter provides solutions to more complex NLS programming issues.

For example, several modules with different module CCSIDs may be compiled from the same source by simply recompiling the source with different TGTCCSID values.

===============================================================================

Literals, Comments, and Identifiers

The TGTCCSID parameter allows you to choose the CCSID of the resulting module.

The module's CCSID identifies the coded character set identifier in which the module's character data is stored, including character data used to describe literals, comments and identifier names described by the source (with the exception of identifier names for CCSIDs 5026, 930 and 290).

For example, if the root source file has a CCSID of 500 and the compiler parameter TGTCCSID default value is not changed (that is, *SOURCE), the behavior is as before with the resulting module CCSID of 500. All string and character literals, both single and wide-character, are as described in the source file's CCSID. Translations may still occur as before for literals, comments and identifiers.

However, if the TGTCCSID parameter is set to 37 and the same source recompiled, the resulting module CCSID is 37; all literals, comments, and identifiers are translated to 37 where needed and stored as such in the module.

Regardless of what CCSID the root source and included headers are, the resulting module is defined by the TGTCCSID, and all of its literals, comments, and identifiers are stored in this CCSID.

===============================================================================

Debug Listing View

Introduction of the TGTCCSID parameter removes the limitation preventing the compilation of source with CCSIDs 5026, 930 or 290 without the loss of DBCS characters in literals and comments.

However, a lesser limitation is introduced for these CCSIDs: when using listing view to debug a module compiled with TGTCCSID equal to CCSID 5026, 930, or 290, substitution characters appear for all characters not compatible with CCSID 37.

===============================================================================

Format Strings

When coding format strings for C run-time I/O functions (for example, printf("%d\n", 1234);) the format string must be compatible with CCSID 037.

When targeting CCSIDs 290, 930, or 5026, which are not CCSID 037 compatible, a #pragma convert(37) is required around the format string literal to ensure that the run-time function processes the format string correctly.

===============================================================================

Valid Target Encoding Schemes

TGTCCSID values are limited to CCSIDs with encoding schemes 1100 or 1301. An error message is issued by the command if any other value is entered.

1100 = EBCDIC, single-byte, No code extension is allowed, Number of States = 1.

1301 = EBCDIC, mixed single-byte and double-byte, using shift-in (SI) and shift-out (SO) code extension method, Number of States = 2. 

===============================================================================