xslt, i18n, internationalization, multiple languages
1. Multi-lingual support
You can do this in pure XML/XSLT, though by dynamically selecting from a master file that holds all the local-language text, not by dynamically including just the language you need. A very brief example to demonstrate the principles:

    <?xml version="1.0"?>
    <!-- this is XMLNumberData.xml -->
    <XMLNumberData>
      <numberData desc="albedo">0.39</numberData>
      <numberData desc="pi">3.1415926</numberData>
    </XMLNumberData>

    <?xml version="1.0"?>
    <!-- this is LanguageData.xml -->
    <!-- forgive my completely made-up Spanish! -->
    <phrases>
      <phrase key="albedo" xml:lang="en">The albedo of the earth is: </phrase>
      <phrase key="albedo" xml:lang="es">El albedo de la terra es: </phrase>
      <phrase key="pi" xml:lang="en">The value of pi is: </phrase>
      <phrase key="pi" xml:lang="es">El valor de pi es: </phrase>
    </phrases>

    <?xml version="1.0"?>
    <!-- This is LangTest.xsl -->
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" version="1.0" indent="yes"/>
      <xsl:param name="LanguageSelected" select="'en'"/>
      <xsl:variable name="phrases"
          select="document('LanguageData.xml')/phrases"/>

      <xsl:template match="/">
        <Result>
          <xsl:apply-templates select="XMLNumberData/numberData"/>
        </Result>
      </xsl:template>

      <xsl:template match="numberData">
        <xsl:value-of select="concat(' ',$phrases/phrase[@key=current()/@desc and lang($LanguageSelected)],.)"/>
        <!-- note instead of lang($foo) I could've said @xml:lang=$foo -->
      </xsl:template>
    </xsl:stylesheet>

Select the desired language by passing a parameter to the stylesheet, using whatever mechanism your XSL processor allows. The default is 'en'. Example with Saxon:

    saxon XMLNumberData.xml LangTest.xsl LanguageSelected=es

Output:

    <?xml version="1.0" encoding="utf-8" ?>
    <Result> El albedo de la terra es: 0.39 El valor de pi es: 3.1415926</Result>

I would have tried to use <xsl:key/> and key() to make the phrase selection more efficient, but I couldn't get it to work with the separate LanguageData document.
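On the key() problem mentioned at the end of this answer: in XSLT 1.0, key() only searches the document that contains the context node, so a commonly used idiom is to switch context into the second document with xsl:for-each before calling it. What follows is a minimal sketch under that assumption, reusing XMLNumberData.xml and LanguageData.xml from above. The key name phraseByKey and the variable names phraseDoc, desc and value are made up for illustration.

```xml
<?xml version="1.0"?>
<!-- Sketch only: a key-based variant of LangTest.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="LanguageSelected" select="'en'"/>

  <!-- Root node of the phrase document (hypothetical variable name) -->
  <xsl:variable name="phraseDoc" select="document('LanguageData.xml')"/>

  <!-- Index every phrase element by its key attribute -->
  <xsl:key name="phraseByKey" match="phrase" use="@key"/>

  <xsl:template match="/">
    <Result>
      <xsl:apply-templates select="XMLNumberData/numberData"/>
    </Result>
  </xsl:template>

  <xsl:template match="numberData">
    <!-- Remember what we need from the source document
         before the context node changes -->
    <xsl:variable name="desc" select="@desc"/>
    <xsl:variable name="value" select="."/>
    <!-- for-each over a single node: its only job is to make
         LanguageData.xml the current document for key() -->
    <xsl:for-each select="$phraseDoc">
      <xsl:value-of select="concat(' ',
          key('phraseByKey', $desc)[lang($LanguageSelected)],
          $value)"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The key is indexed on @key alone and the language is still filtered with lang(), so the region fallback that lang() provides is kept.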
2. Multi-lingual webpages
My task was to simplify the inclusion of multi-language expressions in the stylesheet body, because

    <xsl:value-of select="concat('',$phrases/phrase[@key=current()/@desc and lang($LanguageSelected)],.)"/>

is complex enough to obscure the stylesheet's main task and algorithm. I used the following approach. In this test I have a common stylesheet, b.xsl, that can be used in two ways:

1) Include it from a simple stylesheet that defines the language, e.g. russian.xsl
2) Use it alone, specifying the lang parameter programmatically.

============ russian.xsl: ============

    <?xml version='1.0' encoding='windows-1251'?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="html" version="4.0" encoding="windows-1251" indent="yes" />
      <xsl:param name="lang" select="'rus'" />
      <xsl:template match="/">
        <xsl:call-template name="Main" />
      </xsl:template>
      <xsl:include href="b.xsl" />
    </xsl:stylesheet>

============= b.xsl =============

    <?xml version='1.0' encoding='windows-1251'?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="html" version="4.0" encoding="windows-1251" indent="yes" />
      <xsl:variable name="msg">
        <xsl:choose>
          <xsl:when test="$lang='eng'">
            <title text="My english title" />
            ...
          </xsl:when>
          <xsl:otherwise>
            <title text="Moy anglijskij zagolovok" />
            ...
          </xsl:otherwise>
        </xsl:choose>
      </xsl:variable>
      <xsl:template match="/" name="Main">
        <xsl:param name="msg" select="$msg" />
        <html>
          <body>
            <!-- $msg is a result tree fragment; a strict XSLT 1.0 processor
                 needs a node-set() extension function to navigate into it -->
            <xsl:value-of select="$msg/title/@text" /><br />
            <xsl:value-of select="$lang" />
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>
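The b.xsl stylesheet above navigates into $msg, which XSLT 1.0 treats as a result tree fragment. Processors that follow the 1.0 Recommendation strictly reject such a path unless the fragment is first converted to a node-set. Below is a hedged sketch of the same idea using the EXSLT exsl:node-set() function, which several XSLT 1.0 processors support (check yours). The variable name msg-rtf and the default 'eng' for lang are assumptions made for this example, not part of the original answer.

```xml
<?xml version="1.0"?>
<!-- Sketch only: a standalone variant of b.xsl for XSLT 1.0 processors
     that implement the EXSLT common module -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="exsl">
  <xsl:output method="html" indent="yes"/>

  <!-- Declared here so the stylesheet also runs on its own;
       the default value is an assumption -->
  <xsl:param name="lang" select="'eng'"/>

  <!-- The message table is built as a result tree fragment ... -->
  <xsl:variable name="msg-rtf">
    <xsl:choose>
      <xsl:when test="$lang='eng'">
        <title text="My english title"/>
      </xsl:when>
      <xsl:otherwise>
        <title text="Moy anglijskij zagolovok"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

  <!-- ... and converted once into a node-set that XPath may navigate -->
  <xsl:variable name="msg" select="exsl:node-set($msg-rtf)"/>

  <xsl:template match="/">
    <html>
      <body>
        <xsl:value-of select="$msg/title/@text"/><br/>
        <xsl:value-of select="$lang"/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

A wrapper such as russian.xsl could then override lang by pulling this file in with xsl:import rather than xsl:include, since two global parameters of the same name and the same import precedence are an error.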
3. How to display an attribute in a language other than English
I solve language problems in the following way: I do not use attributes for anything that needs translation, but I declare e.g.

    <!ELEMENT text (#PCDATA|trans)*>

Then I can write either

    <text>blablabla</text>

or

    <text>blablaaaa<trans lang="cs">asasssa</trans><trans lang="en">sdsdd</trans></text>

Then you can use constructs like this in your XSL:

    <xsl:choose>
      <xsl:when test='trans[@lang=$language]'>
        <xsl:value-of select="trans[@lang=$language]"/>
      </xsl:when>
      <xsl:when test='trans[not(@lang)] and /*/@translang=$language'>
        <xsl:value-of select="trans[not(@lang)]"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="./text()"/>
      </xsl:otherwise>
    </xsl:choose>
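To show where that xsl:choose might live, here is a minimal sketch of a complete stylesheet built around it. It assumes the requested language arrives as a top-level parameter called language (the name used in the fragment above) and that the translatable elements are called text, as in the DTD declaration. The default value 'en', the text output method and the trailing newline are choices made only for this example.

```xml
<?xml version="1.0"?>
<!-- Sketch only: wraps the xsl:choose above in a template for <text> -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Assumed: the caller passes the language; 'en' is an arbitrary default -->
  <xsl:param name="language" select="'en'"/>

  <xsl:template match="text">
    <xsl:choose>
      <!-- 1. an explicit translation for the requested language -->
      <xsl:when test="trans[@lang=$language]">
        <xsl:value-of select="trans[@lang=$language]"/>
      </xsl:when>
      <!-- 2. an unlabelled translation, if the document element
              declares that language in its translang attribute -->
      <xsl:when test="trans[not(@lang)] and /*/@translang=$language">
        <xsl:value-of select="trans[not(@lang)]"/>
      </xsl:when>
      <!-- 3. otherwise the untranslated text; note that value-of
              outputs only the first text node child -->
      <xsl:otherwise>
        <xsl:value-of select="text()"/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```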
4. Internationalization and localization of XSLT output
XML has a facility for communicating language-specific information in a standard way: the xml:lang attribute. You can put it on any element, and it applies to that element and all its descendants until redeclared, just like namespaces.

XSLT/XPath have a function for testing the language that is in effect for a given element, whether it is declared on that element or on one of its ancestors: lang(). Some examples:

    lang('en') will be true if the language for the current node is 'en', 'en-US' or 'en-GB'
    lang('en-US') will be true if the language is 'en-US' but not if just 'en'

The rules for constructing language identifiers are covered in IETF RFC 1766. Most of the time they are a hyphenated combination of an ISO 639-1 language code and an ISO 3166-1 country code, resembling the "programmatic name of the entire locale" as returned by Java's Locale.toString(). There are differences, however, and a locale should not be considered synonymous with the language identifier that you would use in xml:lang.

Anyway, if you had data like:

    <strings>
      <str xml:lang="en" name="Results of database query">Results of database query</str>
      <str xml:lang="de" name="Results of database query">Ergebnis der Datenbankabfrage</str>
      <str xml:lang="en" name="next page">next page</str>
      <str xml:lang="de" name="next page">naechste Seite</str>
    </strings>

And if you obtained a language identifier via an external top-level parameter like before (nice to set a default, of course)...

    <xsl:param name="Lang" select="'en-GB'"/>

then you could achieve maximum flexibility like this:

    <xsl:variable name="StringFile" select="document('strings.xml')"/>
    <xsl:variable name="PrimaryLang" select="substring-before($Lang,'-')"/>
    ...
    <xsl:call-template name="getString">
      <xsl:with-param name="stringName" select="'next page'"/>
    </xsl:call-template>
    ...
    <!--
      given the info needed to produce a set of candidates ($str),
      pick the best of the bunch:
        1. $str[lang($Lang)][1]
        2. $str[lang($PrimaryLang)][1]
        3. $str[1]
        4. if not($str) then issue warning to STDERR
    -->
    <xsl:template name="getString">
      <xsl:param name="stringName"/>
      <xsl:variable name="str" select="$StringFile/strings/str[@name=$stringName]"/>
      <xsl:choose>
        <xsl:when test="$str[lang($Lang)]">
          <xsl:value-of select="$str[lang($Lang)][1]"/>
        </xsl:when>
        <xsl:when test="$str[lang($PrimaryLang)]">
          <xsl:value-of select="$str[lang($PrimaryLang)][1]"/>
        </xsl:when>
        <xsl:when test="$str">
          <xsl:value-of select="$str[1]"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:message terminate="no">
            <xsl:text>Warning: no string named '</xsl:text>
            <xsl:value-of select="$stringName"/>
            <xsl:text>' found.</xsl:text>
          </xsl:message>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

This takes the set of all strings that match the given name as candidates, then picks the best match based on language. It lets you do things like match 'en' strings when 'en-GB' is given as the parameter, and it falls back gracefully on the first candidate if none match the language you want.
If that's not the behavior you desire, replace the last xsl:when with:

    <xsl:when test="$str">
      <xsl:message terminate="no">
        <xsl:text>Warning: at least 1 string named '</xsl:text>
        <xsl:value-of select="$stringName"/>
        <xsl:text>' found, but none matched the language '</xsl:text>
        <xsl:value-of select="$PrimaryLang"/>
        <xsl:text>'.</xsl:text>
      </xsl:message>
    </xsl:when>

Tony Graham offers:

If you do use xml:lang, then you can also do:

    <xsl:template name="gettext">
      <xsl:param name="string-name"/>
      <xsl:choose>
        <xsl:when test="$string-name='next page'">
          <xsl:choose>
            <!-- lang() takes the language as its argument and returns a boolean -->
            <xsl:when test="lang('de')">naechste Seite</xsl:when>
            <xsl:otherwise>next page</xsl:otherwise>
          </xsl:choose>
        </xsl:when>
        <xsl:when test="$string-name='Results of database query'">
          <xsl:choose>
            <xsl:when test="lang('de')">Ergebnis der Datenbankabfrage</xsl:when>
            <!-- Sometimes you do need to select based on subfields -->
            <xsl:when test="lang('zh')">
              <!-- nearest xml:lang in scope, lower-cased -->
              <xsl:variable name="lang"
                  select="translate(ancestor-or-self::*[@xml:lang][1]/@xml:lang,
                                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                                    'abcdefghijklmnopqrstuvwxyz')"/>
              <xsl:choose>
                <xsl:when test="$lang='zh-tw'">Traditional Chinese text</xsl:when>
                <xsl:when test="$lang='zh-cn'">Simplified Chinese text</xsl:when>
              </xsl:choose>
            </xsl:when>
            <!-- You can use the default string as the parameter and output
                 it as the default when no other language matched -->
            <xsl:otherwise><xsl:value-of select="$string-name"/></xsl:otherwise>
          </xsl:choose>
        </xsl:when>
      </xsl:choose>
    </xsl:template>

From your template, all you need is an xsl:call-template with the right parameter value:

    <xsl:template match="/">
      <html>
        <body>
          <h1><xsl:call-template name="gettext">
            <xsl:with-param name="string-name" select="'Results of database query'"/>
          </xsl:call-template></h1>
        </body>
      </html>
    </xsl:template>

You could also do similar things when the language is a parameter to the stylesheet. In fact, you could use a second stylesheet to generate the named template above from the sample data shown in other messages in this thread.

Generating the "zh-tw" and "zh-cn" cases would be tricky if you are using xml:lang. Those cases would not need the interior xsl:choose if the language identifier were a parameter to the stylesheet, since a simple comparison with the language identifier variable would be sufficient.
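The lang() matching rules described at the start of this answer are easy to check for yourself. The stylesheet below is a small sketch for that purpose. The input document in the comment is hypothetical, and the element names doc and para are made up for the example.

```xml
<?xml version="1.0"?>
<!-- Sketch only. Hypothetical input to run it against:

     <doc xml:lang="en">
       <para>inherits en from doc</para>
       <para xml:lang="en-GB">declares en-GB</para>
       <para xml:lang="de">declares de</para>
     </doc>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//para">
      <!-- lang() is evaluated against the current para, so an
           xml:lang inherited from an ancestor counts too -->
      <xsl:value-of select="concat(normalize-space(.),
          ': lang(en)=', lang('en'),
          ', lang(en-GB)=', lang('en-GB'),
          '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Going by the rules above, the first para should report true for lang('en') and false for lang('en-GB'), the second should report true for both, and the third false for both.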
5. Parsing and translating the xml:lang attribute
> the values for this attribute are conforming to
> the two-digit language abbreviations according to ISO 639, but my target
> DTD uses three-digit language strings according to ISO 639-2 (e. g. 'de'
> would be translated into 'ger'). I do have a list of both, but I wonder how
> to technically best achieve the mapping using XSL.

xml:lang values must be RFC 1766 'language tags' ('tag' being a most unfortunate choice of word in an XML context... I prefer 'identifier'). RFC 1766 mandates, essentially, that if the identifier is just 2 characters, or if the 3rd character is '-', then the first 2 characters must be an ISO 639:1988 2-letter language code. The author recently clarified that the intent was to refer to ISO 639:1988 and its successors, so you should be using the most up-to-date list of 2-letter language codes from ISO 639-1.

RFC 1766 does not allow 3-letter codes at all. It was a little short-sighted in this regard and is being revised to address this issue (and the fact that ISO 639-2 codes are far more complete!)

Example usage:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- langCodeMap.xml -->
    <langCodeMap>
      <langCode iso639-1="de" iso639-2="ger"/>
      <langCode iso639-1="en" iso639-2="eng"/>
      ...
    </langCodeMap>

and in the XSLT...

    <xsl:variable name="langCodes"
        select="document('langCodeMap.xml')/langCodeMap/langCode"/>
    <xsl:variable name="langIn" select="Language/@xml:lang"/>
    <LanguageOut xml:lang="{$langCodes[@iso639-1 = $langIn]/@iso639-2}"/>

There are of course various ways to do it; this is just one.
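The lookup above yields an empty xml:lang when the incoming code is not in the map, or when it carries a region subtag such as 'de-AT'. The following sketch shows one way to handle both cases, assuming the same langCodeMap.xml and a source element named Language as in the fragment above. The variable names primary and mapped, and the decision to pass unmapped values through unchanged, are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: maps the primary subtag and warns when no mapping exists -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:variable name="langCodes"
      select="document('langCodeMap.xml')/langCodeMap/langCode"/>
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <xsl:template match="Language">
    <!-- Primary subtag, lower-cased: 'de-AT' gives 'de', 'EN' gives 'en'.
         Appending '-' first makes substring-before work for plain 'de' too. -->
    <xsl:variable name="primary"
        select="translate(substring-before(concat(@xml:lang, '-'), '-'),
                          $upper, $lower)"/>
    <xsl:variable name="mapped"
        select="$langCodes[@iso639-1 = $primary]/@iso639-2"/>
    <xsl:choose>
      <xsl:when test="$mapped">
        <LanguageOut xml:lang="{$mapped}"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:message terminate="no">
          <xsl:text>Warning: no ISO 639-2 code found for '</xsl:text>
          <xsl:value-of select="@xml:lang"/>
          <xsl:text>'.</xsl:text>
        </xsl:message>
        <!-- Assumed policy: pass the original value through unchanged -->
        <LanguageOut xml:lang="{@xml:lang}"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```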
6. Multi-language support
Tokushige Kobayashi references Antenna House:

In the whitepaper, I give an overview of the problems of multilingual formatting. At isogen.com (PDF document), Mr. Kimber reports on 'Using XSL Formatting Objects for Production-Quality Document Printing'.

Eliot adds:

Kobayashi-san has already pointed you at my paper on using XSL-FO for internationalized documents, but here is a quick list of the things covered in that paper:

1. Management of generated text strings. Any text generated by your XSLT process will need to be translated into each language you are processing. The obvious approach is to use a lookup table to map from the base language to the various translations for a particular string. You can do this in XSLT or, as I did, with an external system that can then be integrated with XSLT as well as with other processes (such as editors).

2. Selection of appropriate fonts and font sizes for a given national language. Different national languages may require different fonts, either so that you can simply get the glyphs you need (i.e., from a Unicode font for non-Latin languages) or so that you get the appropriate versions of glyphs (using a Chinese font for Chinese and a Japanese font for Japanese, which both use the same ideographic characters but use different renderings of those characters).

3. Management of writing mode: right-to-left, left-to-right, top-to-bottom, etc. Due to an ambiguity in the current XSL-FO spec, how you manage writing mode globally is different for different implementations (this ambiguity will be corrected in the next release of the XSL-FO specification). The biggest challenge here is probably managing directionality at the detail level within inlines for embedded left-to-right text within right-to-left text (e.g., English within Hebrew). This is a challenge on a good day and is complicated, at least under Windows, by bugs and inconsistencies in Windows' built-in implementation of the Unicode bidirectional algorithm.

4. Ensuring that your FO implementation implements any language- or script-specific glyph composition needed, e.g., Arabic glyph shaping and Thai glyph composition. XSL Formatter and XEP both do Arabic glyph shaping. Thai is a much bigger challenge and, as far as I know, only XSL Formatter does Thai properly today. The key thing here is that it is not sufficient to simply "support Unicode" so that you can map, for example, Thai characters to glyphs in a Thai font: the rendering of Thai text requires context-specific glyph shaping that can only be done by the rendering software.

5. Making sure that your operating system and PDF creation environment are appropriately internationalized (e.g., install all the Windows regional settings and all the Adobe font packs).

6. Implementing language-specific collation rules if you need to do things like generate back-of-the-book indexes or auto-generated glossaries. As far as I know, only Saxon provides a mechanism for integrating custom collators in an XSLT 1.0 environment (XSLT 2.0 provides more features for collation management and customization, so other XSLT 2.0 implementations might have useful features in this area, but I haven't tried any of them).

7. Using or defining language-specific hyphenation rules if you require hyphenation. None of my clients use hyphenation, so I can't comment on the depth of hyphenation support in any of the FO tools, but XSL Formatter and XEP do provide configurable hyphenation facilities.

8. Controlling language- or locale-specific line breaking rules.

9. Understanding additional language- or locale-specific formatting requirements that are not addressed by the current XSL-FO spec. For example, XSL Formatter provides some extensions specifically for Japanese.

You should gain some understanding of how Unicode encodings work and what it means for data to be in UTF-8 vs. UTF-16 vs. native encodings. You should gain some understanding of the relationship between character sets and fonts.

You should acquire editing tools that make it easier to work with non-ASCII data. We have found that the Unipad editor (www.unipad.org) is an excellent tool for working with data in a variety of encodings. I find it to be indispensable for i18n system implementation.

I can say that we at Innodata Isogen have been very successful in helping our clients implement FO-based production systems for technical documents translated into 40+ national languages, including Arabic, Hebrew, Thai, and all the modern Asian languages. The inherent internationalization features of XML and its supporting infrastructure, as well as the quality of the existing FO implementations, have made this not only possible but relatively easy. Something that would have been prohibitively expensive only a couple of years ago is now quite affordable, both in terms of software cost and implementation effort. All of our tough issues have been with very detailed language-specific FO implementation problems, such as issues with processing Thai, rather than with any architectural limitations in the tools we're using. The biggest challenge, and the greatest joy, has been learning the details of all the different national languages we've had to support.
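On the collation point (item 6 in the list above): XSLT 1.0 does define a lang attribute on xsl:sort, although how far a given processor honours it is implementation-dependent, which fits the remark about custom collators. A minimal sketch follows, assuming a hypothetical input of the form <index><term>...</term></index> and a stylesheet parameter named Lang.

```xml
<?xml version="1.0"?>
<!-- Sketch only: language-aware ordering of index terms in XSLT 1.0 -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed: the language of the sort keys is passed in by the caller -->
  <xsl:param name="Lang" select="'en'"/>

  <xsl:template match="/index">
    <index>
      <xsl:for-each select="term">
        <!-- lang is an attribute value template, so it can take the
             parameter; the resulting order depends on the processor's
             collation support for that language -->
        <xsl:sort select="." data-type="text" lang="{$Lang}"/>
        <term><xsl:value-of select="."/></term>
      </xsl:for-each>
    </index>
  </xsl:template>
</xsl:stylesheet>
```

If the processor ignores the lang attribute, the terms still come out sorted, just in the processor's default order, so it is worth checking the output for a language whose alphabet order differs from plain code-point order.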