PHP Localization with TMX Standard

Posted on 27 Feb 2006

in Code

by Nicola Asuni (nicolaasuni)

Foreword

TMX - Translation Memory eXchange
- TM - Translation Memory
- LISA - Localization Industry Standards Association

TMX PHP Bridge
- TMXResourceBundle.php Class
- Testing Class (tmxtest.php)

References

Foreword

One of the main concerns of internationalization is separating the source code from the texts, labels, messages and all other objects tied to the specific language in use. This eases the translation process, since all the resources related to the local language context are clearly identified and kept apart.

In a global market the cost of translating and updating texts (including labels, messages, menu elements and so on) can easily become quite high. This is where the TMX standard helps: it brings reuse, greater consistency and a shorter production cycle to the translation and management of these texts, with the added bonus of cutting development costs.

TMX - Translation Memory eXchange

http://www.lisa.org/tmx/

TMX is an open standard that uses XML for archiving and exchanging Translation Memories (TM). These memories are created with specific translation and localization software called CAT (Computer Aided Translation) software. TMX is the result of a project developed by OSCAR (Open Standards for Container/Content Allowing Re-use), one of the Special Interest Groups of LISA.

The goal of TMX is to provide a neutral format for exchanging data between different translation systems while minimizing or eliminating the loss of critical data. The TMX format is supported by the majority of the translation software on the market today.

The TMX specification is freely available at http://www.lisa.org/tmx/, together with several related links, documents, articles and software tools.

TMX file example

sample_tmx.xml


<?xml version="1.0" ?>
<tmx version="1.4">
	<header
		creationtool="XYZTool"
		creationtoolversion="1.01-023"
		datatype="PlainText"
		segtype="sentence"
		adminlang="en-us"
		srclang="EN"
		o-tmf="ABCTransMem">
	</header>
	<body>
		<tu tuid="hello" datatype="plaintext">
			<tuv xml:lang="en">
				<seg>hello</seg>
			</tuv>
			<tuv xml:lang="it">
				<seg>ciao</seg>
			</tuv>
		</tu>
		<tu tuid="world" datatype="plaintext">
			<tuv xml:lang="en">
				<seg>world</seg>
			</tuv>
			<tuv xml:lang="it">
				<seg>mondo</seg>
			</tuv>
		</tu>
	</body>
</tmx>

where:

tu: translation unit, the parent element of every item to be translated. It can carry a unique identifier (tuid).
tuv: translation unit variant, the element that carries the language code of the translation (xml:lang).
seg: segment, the element that contains the translated text.

TM - Translation Memory

http://www.opentag.com/tm.htm

A Translation Memory (TM), also known as a Translation Database, is a database in which sentences written in a reference language are linked to their translations in one or more languages.

A reference sentence together with its translations is called a translation memory unit (a record of the database).

Applications that use TMs are aids for language translation, intended to improve the quality and efficiency of the human translation process, not to replace it. Whenever a new sentence is entered into the TM application, the application searches for it among the reference sentences in the database and computes a matching value for each candidate.

When the matching value is 100% (exact match), the translation found in the database is assumed to be correct and is used directly to build the translated text. When the matching value is below 100% but above a certain threshold (fuzzy match), the translation found in the database is proposed to a human translator, who judges and possibly corrects it. Sentences whose score falls under the threshold get no proposed translation and have to be translated entirely by hand. New sentences for which a translation has been entered are stored in the database and used in future searches.
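The exact, fuzzy and no-match cases described above can be sketched in a few lines of PHP. This is only an illustration: the function name tmLookup, the 75% default threshold and the use of PHP's similar_text() as the scoring function are assumptions made for this example, and real CAT tools use their own matching algorithms.

```php
<?php
// Hypothetical helper: look up a sentence in a tiny in-memory translation memory.
// $memory maps reference sentences to translations;
// $threshold is the assumed fuzzy-match cutoff, in percent.
function tmLookup(array $memory, string $sentence, float $threshold = 75.0): array
{
    $bestScore = 0.0;
    $bestSource = null;
    foreach ($memory as $source => $translation) {
        // similar_text() writes a similarity percentage into $percent.
        similar_text($sentence, (string) $source, $percent);
        if ($percent > $bestScore) {
            $bestScore = $percent;
            $bestSource = $source;
        }
    }
    if ($bestSource === null || $bestScore < $threshold) {
        // Below the threshold: nothing is proposed, translate by hand.
        return ['type' => 'none', 'score' => $bestScore, 'translation' => null];
    }
    // Identical strings are an exact match; anything else is a fuzzy proposal.
    $type = ((string) $bestSource === $sentence) ? 'exact' : 'fuzzy';
    return ['type' => $type, 'score' => $bestScore, 'translation' => $memory[$bestSource]];
}

$memory = ['hello world' => 'ciao mondo', 'good morning' => 'buongiorno'];
$exact = tmLookup($memory, 'hello world');
$fuzzy = tmLookup($memory, 'hello world!');
$none  = tmLookup($memory, 'xyz');
```

In this sketch an exact match would be used as is, a fuzzy match would be shown to the translator for review, and a 'none' result would leave the sentence to be translated by hand.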

Several software houses offer complex commercial products based on these concepts.

LISA - Localization Industry Standards Association

http://www.lisa.org

Founded in 1990, LISA is the premier non-profit worldwide organization for GILT (Globalization, Internationalization, Localization and Translation). Its members include individuals, businesses, associations and organizations involved in languages, language technologies and language standards.

Over 400 leading IT manufacturers and services providers, along with industry professionals representing corporations with an international business focus, have helped establish LISA's best practice guidelines and language-technology standards for enterprise globalization.

LISA serves as a nexus between the many organizations engaged in helping businesses to become global enterprises. This includes customers, governments, technical and industry-specific standards organizations, research and consulting firms, language technology developers and service providers.

LISA offers services in the form of standards initiatives, Special Interest Groups, conferences and training programs to provide GILT support to businesses.

LISA partners and affiliate groups include the International Organization for Standardization (ISO Liaison Category A Members of TC 37 and TC 46), The World Bank, OASIS, IDEAlliance, AIIM, The Advisory Council (TAC), Fort-Ross, €TTEC, the Japan Technical Communicators Association, the Society of Automotive Engineers (SAE), the European Union, the Canadian Translation Bureau, TermNet, the American Translators Association (ATA), IWIPS, Fédération Internationale des Traducteurs (FIT), Termium, JETRO, the Institute of Translating and Interpreting (ITI), The Unicode Consortium, OpenI18N, and other professional and trade organizations.

LISA members and co-founders include some of the largest and best-known companies in the world, including Adobe, Avaya, Cisco Systems, CLS Communication, EMC, Hewlett Packard, IBM, Innodata Isogen, Fuji Xerox, Microsoft, Oracle, Nokia, Logitech, SAP, Siebel Systems, Standard Chartered Bank, FileNet, LionBridge Technologies, Lucent, Sun Microsystems, WH&P, PeopleSoft, Philips Medical Systems, Rockwell Automation, The RWS Group, Xerox Corporation and Canon Research, among others.

TMX PHP Bridge

PHP arrays provide a useful solution for localization: the textual elements can be extracted from the original source code and isolated as array elements. An array in PHP is actually an ordered map, a type that associates values to keys. This type is optimized in several ways, so you can use it as a real array, or a list (vector), hashtable (which is an implementation of a map), dictionary, collection, stack, queue and probably more. This solution offers several advantages to the programmer but can become very complicated for the translator, especially in terms of reusability of the translation.
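As a minimal sketch of the array-based approach, the texts can be kept in one array per language and looked up by key. The $resources layout and the translate() helper below are hypothetical names chosen for this illustration; they are not part of the TMX PHP Bridge.

```php
<?php
// Hypothetical resource arrays: one per language, indexed by a message key.
$resources = [
    'en' => ['hello' => 'hello', 'world' => 'world'],
    'it' => ['hello' => 'ciao', 'world' => 'mondo'],
];

// Hypothetical helper: return the text for $key in $lang,
// falling back to the key itself when no translation exists.
function translate(array $resources, string $lang, string $key): string
{
    return $resources[$lang][$key] ?? $key;
}

// Prints "ciao mondo".
echo translate($resources, 'it', 'hello') . ' ' . translate($resources, 'it', 'world') . "\n";
```

The lookup is convenient for the programmer, but the translations live inside PHP source files, which is what makes them awkward for a translator to edit and reuse.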

A better option consists of the archiving of the textual resources in the exchange format TMX (XML file). This enables the translators to export and import the translations to and from their preferred translation tools (there are several compatible with TMX) in a way completely independent from the programming language utilized.

The best way to implement the TMX standard in PHP applications consists of creating a bridge class that reads data directly from XML files complying with the TMX standard and fills a PHP array (PHP array ==> TMX file <== translation program). This lets us take advantage of all the features of PHP arrays and simplifies the exchange with external TMX applications.

The disadvantages of this technique are mainly related to the time and the memory necessary to load the entire TMX file.

To keep the explanation simple, we will consider only the TMX elements needed to translate a simple text (see sample_tmx.xml):

tu: translation unit, the parent element of every item to be translated. It can carry a unique identifier (tuid).
tuv: translation unit variant, the element that carries the language code of the translation (xml:lang).
seg: segment, the element that contains the translated text.

for example:


	<tu tuid="hello" datatype="plaintext">
		<tuv xml:lang="en">
			<seg>hello</seg>
		</tuv>
		<tuv xml:lang="it">
			<seg>ciao</seg>
		</tuv>
	</tu>

TMXResourceBundle.php Class

With the constructor we specify the name and path of the file in TMX format that contains the translations and the ISO code of the reference language.

Inside the class constructor we define our TMX parser using the PHP XML Parser functions, which are enabled by default and provide an expat-style, event-driven interface. These functions let you create non-validating XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The methods startElementHandler and endElementHandler work as a pair: the first sets some state variables when it meets the opening tag of a TMX element, and the second resets them when the corresponding closing tag occurs. The state variables hold the information related to the TMX element being processed.

The segContentHandler method processes the character data that appears between elements in the XML document. If, while processing the content of a seg node, the xml:lang attribute of the parent tuv node equals the $language parameter passed to the constructor, segContentHandler stores the string in the resource array, using the tuid attribute of the enclosing tu node as the index.

At the end of the parsing process, the resource array contains all the translations in the desired language, indexed by their identifiers (tuid).
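The parsing steps described above can be sketched roughly as follows. This is not the TMXResourceBundle source from the project: the class name TmxArraySketch and its property names are made up for illustration, and, unlike the class described here, the constructor is assumed to take the TMX document as a string (for a file, a caller could pass the result of file_get_contents()). Only the handler names and the $language parameter follow the description in the text.

```php
<?php
// Illustrative sketch only: NOT the TMXResourceBundle source from the project.
class TmxArraySketch
{
    public $resource = [];       // translations indexed by tuid

    private $language;           // language code to extract, e.g. "it"
    private $currentKey = null;  // tuid of the <tu> being processed
    private $currentLang = null; // xml:lang of the <tuv> being processed
    private $inSeg = false;      // true while inside a <seg> element

    public function __construct(string $tmxXml, string $language)
    {
        $this->language = strtolower($language);
        $parser = xml_parser_create();
        // Keep element and attribute names as written (no upper-casing).
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
        xml_set_element_handler(
            $parser,
            [$this, 'startElementHandler'],
            [$this, 'endElementHandler']
        );
        xml_set_character_data_handler($parser, [$this, 'segContentHandler']);
        if (!xml_parse($parser, $tmxXml, true)) {
            throw new RuntimeException((string) xml_error_string(xml_get_error_code($parser)));
        }
    }

    // Sets the state variables when an element opens.
    public function startElementHandler($parser, $name, $attribs)
    {
        if ($name === 'tu') {
            $this->currentKey = $attribs['tuid'] ?? null;
        } elseif ($name === 'tuv') {
            $this->currentLang = isset($attribs['xml:lang'])
                ? strtolower($attribs['xml:lang'])
                : null;
        } elseif ($name === 'seg') {
            $this->inSeg = true;
        }
    }

    // Resets the state variables when an element closes.
    public function endElementHandler($parser, $name)
    {
        if ($name === 'tu') {
            $this->currentKey = null;
        } elseif ($name === 'tuv') {
            $this->currentLang = null;
        } elseif ($name === 'seg') {
            $this->inSeg = false;
        }
    }

    // Stores <seg> text for the requested language under the current tuid.
    public function segContentHandler($parser, $data)
    {
        if ($this->inSeg && $this->currentKey !== null
            && $this->currentLang === $this->language) {
            // Character data may arrive in several chunks, so append.
            $this->resource[$this->currentKey] =
                ($this->resource[$this->currentKey] ?? '') . $data;
        }
    }
}
```

Appending in segContentHandler is a deliberate choice in this sketch, because the parser is free to deliver the text of a single seg element in more than one call.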

Source Code

(download the full project from Sourceforge).

Testing Class (tmxtest.php)

This test shows how to instantiate the TMXResourceBundle class with the example file sample_tmx.xml. In this example the language code (it = Italian) is specified explicitly, but it can also be obtained from the locale information.
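One possible way to obtain the language code from locale information is to take the leading language part of a locale string such as it_IT.UTF-8. The helper languageFromLocale() and its 'en' default below are assumptions made for this sketch and are not part of tmxtest.php.

```php
<?php
// Hypothetical helper: extract a two- or three-letter language code from a
// locale string such as "it_IT.UTF-8" or "en-US". Locale names without a
// language part (for example "C" or "POSIX") yield $default.
function languageFromLocale(string $locale, string $default = 'en'): string
{
    if (preg_match('/^([A-Za-z]{2,3})(?:[_.@-]|$)/', $locale, $m)) {
        return strtolower($m[1]);
    }
    return $default;
}

// Passing "0" to setlocale() returns the current setting without changing it.
$language = languageFromLocale((string) setlocale(LC_CTYPE, '0'));
```

The resulting $language value could then be passed to the bundle in place of the hard-coded 'it'.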

Source Code

References

Asuni N, "Java Localization with TMX standard" [online] 2004-10-14, http://evolt.org/Java-Localization-with-TMX-standard.

Asuni N, "TMXResourceBundle - TMX PHP Bridge" [online] 2005-01-08, http://tmxphpbridge.sourceforge.net.

Asuni N, "TMXResourceBundle - TMX Java Bridge" [online] 2005-01-08, http://tmxjavabridge.sourceforge.net.

Itagaki M, "Use XML as a Java Localization Solution" [online] 2000-11-10, http://www.ftponline.com/javapro/archives/mi0011/default.asp.

O'Conner J, "Java Internationalization: Localization with ResourceBundles" [online] 1998-10-01, http://java.sun.com/developer/technicalArticles/Intl/ResourceBundles/.

OSCAR - LISA, "TMX - Translation Memory eXchange" [online] 2004-10-01, http://www.lisa.org/standards/tmx.

OSCAR - LISA, "TMX 1.4b Specification" [online] 2005-03-26, http://www.lisa.org/standards/tmx/tmx.html.

W3C, "Extensible Markup Language (XML)" [online] 2005-08-02, http://www.w3.org/XML/.

Nicola Asuni is the founder and president of Tecnick.com S.r.l., a leading provider of award-winning Web Software.
He has been a freelance programmer since 1993 and has actively contributed to several web-related Open-Source Projects.
He is the founder of the Technick.net website, since 1998 the largest connector and cable pinout archive on the web.
He is also a member and co-founder of Java User Group Sardegna Onlus, and a member of GULCh - Gruppo Utenti Linux Cagliari.

For a complete Curriculum Vitae please browse: http://nicolaasuni.tecnick.com
