Do you speak Japanese?

March 04, 2001

Not speaking Japanese is just one of the obstacles between you and a successful translation.

Last month we looked at some of the challenges that are part of translating a user interface into different languages. This month we're going to look at the exact format of the translation files in a bit more detail. Then we'll discuss the problems caused by double-byte character mappings. I will also point out a few programming practices to avoid when internationalizing a product.

Code files and translation files

First, I'd like to revisit some definitions from last month. The code file is a source-code file (often a set of C string declarations) containing the strings in a particular language. The translation file is a file that contains the English strings and a column for the strings in the target language. We need a means to convert a source code file into a translation file that can be sent to professional translators. There are really two types of code files. The initial code file is in English, but later on we are going to generate code files in a number of foreign languages.

The English code file is written by hand as the product is developed. It is important that all of the strings are kept in one file, since maintaining the strings sprinkled throughout your application is a messy business. Using a program to replace bits of your source (a process I will describe later) involves risk, so we want to restrict it to files whose sole purpose is to hold translatable strings.

These strings can be accessed from the rest of the program via extern declarations. C provides many ways of doing this, and you may well be working in a different language, so I am not going to dwell overmuch on the exact syntax. We will assume that a header has something of the form:


extern const char *TextWelcomeScreen;
extern const char *TextMainMenu;

The tag TextWelcomeScreen can then be used from anywhere in the program to refer to a particular string. In other implementations, the tag may be an index or an enumerated type. I generally prefix all of the tags for translatable strings with Text, in an effort to avoid conflicts with the names of other variables or functions.

The .c file then contains the definitions of those strings, for example:



const char *TextWelcomeScreen = "Welcome to this Gadget";
const char *TextMainMenu = "Main Menu";

It's important to keep the format of the C file regular, since we are going to need to parse it. So be consistent in your use of spaces and tabs. One approach to parsing is to load the text file into a spreadsheet using the inverted comma as the column separator. If the spreadsheet does not allow you to select the column separator, then replace each const char * with nothing, and replace the inverted commas with whatever separator your spreadsheet does support. In either case you can quite easily come up with a spreadsheet with columns arranged as shown in Table 1.

Table 1: Spreadsheet columns before translation

Tag                 English                  Foreign Language
TextWelcomeScreen   Welcome to this gadget
TextMainMenu        Main Menu

We have now extracted the important information from the file and left behind the C-specific material, such as type declarations and semicolons. The Foreign Language heading can now be replaced with the name of each target language, and a copy of the file sent out to each translator. The tag information is not necessary for the translator, but you will find it useful later for generating the foreign language code file. The tag and English columns should be locked or password-protected if possible, to prevent the translator from accidentally changing them. The file you receive back should look something like Table 2.

Table 2: Spreadsheet columns after translation

Tag                 English                  French
TextWelcomeScreen   Welcome to this gadget   Bienvenue à le gadget
TextMainMenu        Main Menu                Menu Principal

At this stage a number of checks can be made. Did the translator use the same policy for punctuation and capitalization as the English version? Was every English string translated?

When you are happy with the French translation file, it needs to be converted into something that can be compiled into your program. This is a bit trickier than the original conversion from English to translation file. You may not be able to parse the format of the spreadsheet, so it is useful to save the file as a Comma Separated Values (CSV) file, which is a plain text file that most spreadsheets can generate. I like to write my parsers in Perl, which has a lot of powerful string handling abilities. In any case, the output should look like the following:


const char *TextWelcomeScreen = "Bienvenue à le Gadget"; // Welcome to this Gadget
const char *TextMainMenu = "Menu Principal"; // Main Menu

Putting the English at the end of the line as a comment is useful for troubleshooting. If we call this file French.c, we can now change languages by using French.c instead of English.c, and you can easily control this by using different targets in the makefile.
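To make that conversion step concrete, here is a minimal sketch of a translation-file-to-code-file converter. My own tools for this are written in Perl; this C version is only an illustration, and the file names, the three-column layout, and the assumption that no cell contains a comma or an embedded quote are all assumptions. It also leaves out the character substitution discussed next.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[512];
    FILE *in  = fopen("french.csv", "r");   /* translation file saved as CSV */
    FILE *out = fopen("French.c", "w");     /* generated code file */

    if (in == NULL || out == NULL)
        return 1;

    /* Skip the header row: Tag,English,French */
    if (fgets(line, sizeof(line), in) == NULL)
        return 1;

    while (fgets(line, sizeof(line), in) != NULL)
    {
        /* Split the row into its three cells. */
        char *tag     = strtok(line, ",");
        char *english = strtok(NULL, ",");
        char *french  = strtok(NULL, ",\r\n");

        if (tag != NULL && english != NULL && french != NULL)
        {
            /* Emit one definition per string, with the English as a
               trailing comment for troubleshooting. */
            fprintf(out, "const char *%s = \"%s\"; // %s\n",
                    tag, french, english);
        }
    }

    fclose(in);
    fclose(out);
    return 0;
}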

This representation of the French language assumes that your means of output can print any of the characters used by the French translator. The French may contain characters (such as à) which are not available in the font used by your embedded system. I mentioned last month that such characters have to be converted to an octal value in a C string. If we had to represent the letter à with character four of our font, the French.c that is generated should contain the line:


const char *TextWelcomeScreen =
		"Bienvenue \04 le Gadget"; // Welcome to this Gadget

Doing a substitution has one danger. If the character following à happened to be the digit 5, the compiler would see the sequence \045 and interpret that as a single octal value, generating a single byte. The representation we want is a byte with the value 4 followed by the ASCII representation of 5 (which happens to be 35 hex). If we put a space between the 4 and the 5, then each byte is interpreted correctly, but you end up with a space that was not intended, which may affect the appearance. The problem is even worse if we use hexadecimal constants instead of octal, since the first six letters of the alphabet are valid hexadecimal digits.

The very ugly fix that I use is to close the string and open it again. In C, no difference exists between the compiled result of:

char *str = "abcdef";

and

char *str = "abc""def";

When the compiler sees one string end and another start right after it, it simply concatenates them. You do not see this done very often, but in this case it allows us to separate our explicit value for one byte and the following digit. So now we generate the following string:


const char *TextWelcomeScreen =
		"Bienvenue \04"" le Gadget"; // Welcome to this Gadget

If you think this looks a bit ugly, you could get your conversion program to check if the following character is a digit, and only insert the extra inverted commas if they are needed.
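That check might look something like the sketch below. The function name, the choice of à (0xE0 in Latin-1) as the missing character, and the use of font position 4 are assumptions carried over from the example above.

#include <ctype.h>
#include <stdio.h>

/* Write one byte of the translated string into the generated code file.
   The character 0xE0 ('à' in Latin-1) is assumed to be missing from the
   target font and is replaced by font position 4; the string is closed
   and reopened only when the next character is a digit, so the compiler
   cannot absorb that digit into the octal escape. */
void emit_char(FILE *out, unsigned char c, unsigned char next)
{
    if (c == 0xE0)
    {
        if (isdigit(next))
            fprintf(out, "\\04\"\"");
        else
            fprintf(out, "\\04");
    }
    else
    {
        fputc(c, out);
    }
}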

Turning Japanese

Once we move to a language that is not covered by the Latin-1 mapping, we are likely to have a choice of mappings, and it is important that the translator uses the same one consistently. While Japanese is often the most important market that has double-byte requirements, a number of other languages, such as Chinese, are becoming more significant. My own experience is with Japanese, and that is what I discuss in the rest of this column, but before I do I will warn you that this topic involves far more than I will be addressing here. Further information is available from an astonishingly comprehensive book written by Ken Lunde. CJKV Information Processing is the definitive reference for anyone doing development work using double-byte character sets.[1] The CJKV of the title stands for Chinese, Japanese, Korean, and Vietnamese.

The first thing that you need to know about Japanese is that it has three significant character sets. The first two, Katakana and Hiragana, are phonetic and, combined, contain 121 characters. They do not present any significant programming challenges beyond what we have to do for ASCII. Kanji is a Japanese character set that is historically derived from the Chinese character set. Kanji consists of about 5,000 characters and each one represents a meaning, rather than a sound. For example a wavy line represents a river, though most of the interpretations are not so obvious.

Katakana and Hiragana represent the same set of sounds, but are typically used for different purposes. Hiragana is used to phonetically write native Japanese words, while Katakana is used to write foreign words. Either set can represent anything that can be said in the Japanese language. On some restricted interfaces, where Kanji is not possible, one of these sets may suffice. Bear in mind that the user will see this as less than ideal-a bit like an English interface that is implemented using capital letters exclusively. HD44780-controlled text displays often have a Katakana character set available at values above 128. This allows a Japanese interface on a device that can't support Kanji. Apart from this restrictive solution, most applications will require a graphics display to implement a full Japanese solution using Kanji.

The fact that Kanji symbols represent meaning, rather than sound, does not make programming any more difficult because your software just needs to display them and does not really care how they are interpreted. The difficulty that does exist is twofold. The first problem is the complexity of the symbols. A 16-by-16 grid is the smallest size at which many of the symbols can be drawn, which may cause difficulty if the layout of the user interface was originally planned for Latin characters, which can be represented in as few as five by seven pixels. The second difficulty is that the 5,000 characters cannot be uniquely identified in a single byte, and so we are led to double-byte encoding methods. While ASCII has always been the predominant encoding for English, different operating systems use different encodings for some Asian languages. The trend towards Unicode may simplify matters over time, but you may still find a reason to convert between character sets. For example, your embedded system may use a Unicode font, but the translator may provide you with a Microsoft spreadsheet file using a Shift-JIS encoding. Now you need to convert from one to the other. This conversion may be performed as part of the mapping from translation file to code file.

Windows 9x uses the Shift-JIS encoding, but it's also capable of supporting Unicode. Japanese Windows NT is a pure Unicode environment. Your translator may be working in either one. As Microsoft commits newer versions of its OS to Unicode, it appears that Shift-JIS will disappear, so it is preferable to work in Unicode if you have the option. The translator might not realize how significant this is, since they will not see the hexadecimal representation of the strings they edit. The OS may protect the translator from needing to know what mapping their environment uses. The embedded system programmer, however, must know the mapping, because the values encoded into the product have to be compatible with whatever font the product uses.

It might appear that the decision to use Unicode is a simple one. It covers almost every language you are likely to encounter, so you don't have to change encoding methods as you proceed from one language to another. However, covering so many languages leads to a character set that contains almost 40,000 characters. This could consume a few hundred kilobytes of ROM, which is no problem for a desktop OS, but embedded developers may not feel that generous. This means you will only want to convert the subset of Unicode that matches your needs. On one project, in which I was feeling the end of my ROM chips getting uncomfortably close, I chose to limit the set of characters that I actually used. The program that converted translation files to code files maintained a list of the characters used, which enabled me to strip the font of all characters that were not in that set. This saved a considerable amount of memory. While the Japanese Kanji character set contains about 5,000 characters, only about 200 were used on that particular user interface. This method would not work if you wanted to allow character input, since you cannot predict what characters a user might enter, but many simple embedded devices have no need for text input in Japanese or any other language.
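A minimal sketch of how such a list might be kept, assuming 16-bit character codes: the conversion program marks every code it emits, and the font-building step later drops any glyph that was never marked. The function names and the bitmap approach are illustrative, not a description of my actual tool.

/* One bit per possible 16-bit character code: an 8KB table that lives in
   the conversion tool, not in the embedded product. */
static unsigned char used[65536 / 8];

/* Record that a character code appears in some translated string. */
void mark_used(unsigned short code)
{
    used[code / 8] |= (unsigned char)(1u << (code % 8));
}

/* Ask, when building the font, whether a glyph is actually needed. */
int is_used(unsigned short code)
{
    return (used[code / 8] >> (code % 8)) & 1;
}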

The conversion program run on the Japanese translation file differs in a few crucial ways from the single-byte program I described previously. All characters are now read as pairs, and it is important that you do not accidentally miss a byte and end up reading the double-byte pairs on odd boundaries. I tend to leave the string in inverted commas, and point at it with a char *, but since none of the bytes have meaning on their own I encode them all as octal or hexadecimal constants. An array of shorts (assuming your shorts are 16 bits) could work as well.
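The generated output might then take a shape like the following. The byte values here are invented purely for illustration; the point is that every byte is written as an explicit escape, with a trailing English comment as before.

const char *TextMainMenu =
		"\x83\x81\x83\x6a\x83\x85\x81\x5b"; // Main Menu

// Or, as an array of 16-bit values, one element per character:
const unsigned short TextMainMenu16[] =
		{ 0x8381, 0x836a, 0x8385, 0x815b, 0x0000 };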

Once you have the strings in your program, you still need a font that will allow you to display those strings. Rendering the font is another issue in and of itself. My fonts web page is one starting point to look for international fonts that you may be able to use.[2]

Double check the double-bytes

Since I do not read Japanese I make a habit of getting a fax of the translations. I compare the fax to the final interface as a check that my conversions worked. This is one case where fax is preferable to e-mail. You cannot be sure that your version of the e-mail will use the same character encoding as was used by the sender, while the appearance of a fax will not change when transmitted.

Programming issues

Double-byte character sets introduce additional challenges. Strings are no longer terminated by the first null byte encountered, since that null might be the second half of a double-byte, non-null value. So functions such as strlen() have to be altered. If your environment is already double-byte-aware, special versions of the string functions may be available.
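For example, a character-counting replacement for strlen() might look like the following sketch, assuming a Shift-JIS-style encoding in which lead bytes fall in the ranges 0x81 to 0x9F and 0xE0 to 0xEF; adjust the test for whatever encoding your font actually uses.

#include <stddef.h>

/* Returns non-zero if this byte is the first half of a double-byte character. */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xEF);
}

/* Counts characters, not bytes: a lead byte and its trail byte count as one. */
size_t dbcs_strlen(const char *s)
{
    size_t count = 0;

    while (*s != '\0')
    {
        s += is_lead_byte((unsigned char)*s) ? 2 : 1;
        count++;
    }
    return count;
}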

You also need to watch out for any place in your code where you assume that the characters are one byte wide. For example, you may use array indexing to get at a character in the middle of an array of chars, but the nth element of the array is no longer the nth character. For example, the title of this column (which translates as "Do You Speak Japanese?") uses 14 characters, but it would occupy 28 bytes, plus a null terminator.

Pointer arithmetic is no longer equivalent to character arithmetic. If I have two char *s pointing into an ASCII string, I can subtract them to find out how many characters lie between them. This is not true for double-byte strings. First I have to be sure that the char *s point to characters (and not into the middle of a character). Then I have to subtract them and divide the answer by two, and even that only works if every character between them is double-byte.
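If the encoding mixes single- and double-byte characters, as Shift-JIS does, the only safe approach is to walk from one pointer to the other. This sketch reuses the is_lead_byte() test from the previous example and assumes both pointers already sit on character boundaries.

/* Counts the characters between two positions in the same string. */
size_t dbcs_distance(const char *from, const char *to)
{
    size_t count = 0;

    while (from < to)
    {
        from += is_lead_byte((unsigned char)*from) ? 2 : 1;
        count++;
    }
    return count;
}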

Always make sure that buffers are large enough. If a number of strings are concatenated into a buffer, you need to be sure that the buffer cannot overflow. Establishing that an overflow cannot happen in English provides no guarantee that it will not happen in other languages. This is a danger whether the foreign language uses single-byte or double-byte characters.

Another general rule is to avoid having language-specific #ifdefs in your code to handle the layout requirements of different languages. You are much better off with a layout system that is flexible enough to handle strings of varying length. Features such as automatic centering and line wrapping help in this regard, and they avoid the need to maintain a different set of coordinates for each language, which can turn into a serious hassle.
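As a small illustration of that flexibility, centering can be computed from the rendered width of the string rather than from per-language coordinates. The function name and the width callback here are assumptions; any rendering layer that can measure a string will do.

/* Returns the x position that centers the string, given a routine that can
   measure the string's rendered width in pixels for the current font. */
int centered_x(const char *text, int display_width,
               int (*string_width)(const char *))
{
    int w = string_width(text);

    return (w >= display_width) ? 0 : (display_width - w) / 2;
}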

Replacement parameters

Replacement parameters are elements within the string that can be replaced at runtime. The printf function does this for %d (numbers), %s (strings), and a number of other formatting elements. The weakness of this mechanism is that if a translator chooses to order the elements differently, the arguments to printf will be in the wrong order. For example, the statement "%d messages were processed in %d seconds" may be rearranged to become "It took %d seconds to process %d messages" once it is translated. The two numbers have changed places, so a printf statement that fills in the %d replacement parameters will put them in the wrong slots.

For this reason I have always restricted myself to one replacement parameter per string. If you need more flexibility, then you should look at Windows-style message formatting, in which each replaceable parameter is referenced by number, so the string above is represented as "%1 messages were processed in %2 seconds." Handling this in your output routines is a little bit of work, but if you use a lot of replacement parameters it will probably be worth the effort.
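A minimal sketch of such a formatter is shown below, limited to two string parameters. The function name and its signature are assumptions; a real version would also handle numeric parameters and an arbitrary number of them.

#include <stddef.h>
#include <string.h>

/* Copies fmt into out, replacing %1 with arg1 and %2 with arg2, in whatever
   order the translated string places them. Output is truncated rather than
   overflowed if the buffer is too small. */
void format_message(char *out, size_t out_size, const char *fmt,
                    const char *arg1, const char *arg2)
{
    size_t used = 0;

    if (out_size == 0)
        return;

    while (*fmt != '\0' && used + 1 < out_size)
    {
        if (fmt[0] == '%' && (fmt[1] == '1' || fmt[1] == '2'))
        {
            const char *arg = (fmt[1] == '1') ? arg1 : arg2;
            size_t len = strlen(arg);

            if (used + len < out_size)
            {
                memcpy(out + used, arg, len);
                used += len;
            }
            fmt += 2;
        }
        else
        {
            out[used++] = *fmt++;
        }
    }
    out[used] = '\0';
}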

Conclusion

The number of companies that exist solely to provide localization services for desktop applications indicates that interface conversion is a large and wide-ranging topic. In embedded systems the emphasis is a bit different from the desktop. The requirements are often simpler, but the tools and run-time environment often give us little or no support. In some cases, localization issues alone are enough to cause a project to choose a third-party graphics library, which can ease a lot of the pain in this area.

I have just scratched the surface of localization. A range of localization issues exists beyond the strings themselves, such as date formats, currency, and units of measure. Most of these issues have already been addressed on desktop systems. While you may not need a solution as elaborate as those on desktop systems, you can certainly learn from them.[3] Keep an eye out for internationalization issues when you are doing your initial design, so that they do not cause too much pain when you try to retarget your design to a foreign market.

Niall Murphy has been writing software for user interfaces and medical systems for ten years. He is the author of Front Panel: Designing Software for Embedded User Interfaces. Murphy's training and consulting business is based in Galway, Ireland. He welcomes feedback and can be reached at nmurphy@panelsoft.com. Reader feedback to this column can be found at www.panelsoft.com/murphyslaw.

References

1. Lunde, Ken. CJKV Information Processing. Sebastopol, CA: O'Reilly & Associates, 1999.

2. Font and Bitmap Guide: www.panelsoft.com/july99.htm

3. Kano, Nadine. Developing International Software for Windows 95 and Windows NT. Redmond, Washington: Microsoft Press, 1995.
