Localization
This article shows how collaboration between software developers, UI designers and translators is essential for successful localization, and how (and why) you should go about organizing such a team. The first article shows some common mistakes software developers tend to make unintentionally.
by Miloš Průdek prudek@bvx.cz
Defining localization
According to Wikipedia, localization is the process of translation of messages which a program presents to a user into various languages. Sounds simple enough. Newcomers to localization often use the following approach: export all strings from the finished application, send the strings to a bunch of translators who do not know each other, wait one or two months (you can do a lot of work on another project while the translators toil away), receive the translated texts, pay ridiculous sums for the translations, import and recompile, and pronto! Instant translation, right?
Wrong. Such an approach results in horribly distorted texts, sometimes outright wrong sentences and many other problems. Sadly, such an approach is quite often used in practice. It degrades the software as a whole and the result has to be thrown away or repaired, incurring expenses that equal or exceed the original price of the translation.
If you do not know any language well other than your mother tongue, you simply cannot imagine the range of problems that occur in translation. For instance, two translators can produce totally different text (different words) although the meaning stays the same. True localization requires collaboration between software developers and translators, very close and very intensive collaboration. Or, if you work for a large company, it probably employs UI developers and these UI guys and gals can act as an interface between application developers and translators.
Time flies...
And you need to have your localization ready yesterday. No time to help translators. All of your application's messages are clearly written and present no problem to a qualified professional translator... sentences like "Time flies" are unambiguous. Or are they? Let's see:
Time flies: A "fly" is an insect. "Time flies" are insects that travel through time without using a time machine (here "time" functions as an adjective – use of a noun as an adjective is common in English).
Time flies: I want you to time my flies! Please measure the time it takes them to get from point A to point B! Well, okay, but what flies are we talking about? Perhaps I'm an entomologist and I'm asking you to time how long it takes my flies to get from A to B. Or I'm a baseball trainer and I want you to time how long the flies [=fly balls] hit by members of the team take to get from home-plate to some point in the field. (Here "time" functions as a verb.)
Time flies: You say “Time flies...” to express your relative perception of how fast you feel time is passing. (Here time functions as a noun.)
Note: You might feel that "Time flies" is as unlikely a sentence to appear in a technical text as possible. I used this short sentence purposely because of titles (column and row headings) and instructions displayed in software, which are often very short, as you will see later in this article.
Of course, the third meaning is the one most likely to be correct. But you can't be sure without contextual remarks or background information – a few explanatory words that will not go into the translation itself but will enable the translator to choose the right vocabulary and grammar in the target language. Providing background information is expensive in terms of time required to type the information and even to discover what type of information is required. Nonetheless, background information is necessary. Let me show you what kind and what amount of context is needed with a couple of examples.
[Note: The context generally is made up not of explanatory words but lived structures that (a) our lives are full of; and (b) a translator is, should be, has to be implicitly familiar with. Recognition of this context is, for the most part, a non-verbal, intuitive matter. Nonetheless, explanatory notes can provide considerable help to the translator.]
Exporting text
A simple text export is a flat file that contains all text messages used in your program, such as prompts, database column names, descriptions, window titles. Look at this example:
resolution
number of <item>
number of tables
from
to
April
image counter
stretching center
enter new password
Do you want it to be created?
<item> number (where <item> is one of [Operator, Table, Chair])
Rocker
Process sequences
time flies
Can you guess how many of the above lines can be translated without background information? Let's look at them one by one:
| Source word | Problem | Solution |
| --- | --- | --- |
| Resolution | At least two possible meanings: 1. resolution is a decision; 2. resolution is the number of dots per square inch on a computer screen. Any reasonable dictionary lists at least 3 more meanings. | Add a context remark such as "number of dots on a screen", unless the translator is without a doubt aware of the context because he knows that the text is about computer screens. Note that simply saying it is "computer context" is not enough. Your software could ask for the user's resolution (decision) of any issue it is facing. |
| number of <item> | Calculated sentence, where <item> is a variable. The translator does not know the content of this variable. In many languages, nouns such as "table" or "ball" have gender (feminine, masculine or, in some languages, neuter), and words are inflected differently according to gender. The translator does not know the gender of <item>. Therefore, he does not know how to inflect <item> and he also cannot inflect the expression "number of" if the target language requires this. The result is an incorrect translation. | You should spell out all source sentences: number of tables, number of balls etc. and have all of them translated. |
| number of tables | The previous example's solution said this source should be OK, no context needed. Well, some languages can use a different word for table as "furniture" and table as "list of data", and these two words can have different gender. (In German, for example, they do: "table" as furniture is masculine; "table" as a list is feminine.) | Add a context remark such as "table = piece of furniture". Always check in a dictionary if the word to be translated is a homonym. |
| From ... to... | Standalone prepositions are impossible to translate. Period. | Spell out the whole sentence, either in the source text or in context remarks. |
| April | Capitalization rules differ from language to language. In English, all month names are capitalized. In other languages, they are not. | You need to specify where this word appears: in a column name, or in the middle of a sentence, since standing alone as a column name it would always be capitalized, whereas in some languages in the middle of a sentence it wouldn't. |
| image counter | Let's say that each of these two words has only one possible meaning (which is not the case). When used together, however, it's not clear whether the two words describe "a counter that counts images" or "a counter that looks like an image", i.e. it looks like an image rather than like text. Translation of these two cases may be different. At least they are different in my mother tongue, Czech. | You need to explain the exact meaning in a context remark. |
| stretching center | Possibly: "The center is currently being stretched", i.e. the process of stretching is being talked about, your software says it is busy stretching the center. Or, the location of the stretching is being talked about, i.e. "The center is being stretched" (not some other location). Or, "This is the center of the stretching", i.e. it is not a center for some other operation. Or perhaps the phrase refers to "a gym where you can exercise stretching". | The fourth possible meaning is somewhat stretched (pun intended), given the context, but the first three are perfectly possible and it is improbable that they can all be translated by the same phrase in the target language. (What we have here is that in English placing two words in this manner next to each other can express a variety of different meanings, depending on the syntactic and semantic relationship between the words. In another language it is highly unlikely that these same relationships can be expressed in the same way. The translator has to know which meaning is intended in order to be able to translate the phrase correctly.) |
| enter new password | No problem here. | This is a whole sentence, no ambiguity here. |
| Do you want it to be created? | What is to be created here? Nouns have genders in some languages, and the word "created" needs (in some languages) to be inflected according to the gender of "it". | You need to tell the translator what "it" is. |
| <item> number (where <item> is one of [Operator, Table, Chair]) | Calculated sentence, where the programmer mandated a fixed word order: <item> is first, "number" is second. Unfortunately, some languages require a reversed word order or <item> inflection or both. | You should spell out all source sentences: Operator number, Table number, Chair number. |
| Process sequences | Is this an order? "You must process the sequences now!" Or is "process" an adjective that describes the type of sequences referred to? | You need to clarify this in your context remarks. |
You can see that of all these sentences and sentence fragments, only one (perhaps) can be translated without context. All the rest require supporting documentation to be translated correctly.
It is difficult, or downright impossible, to provide perfect contextual remarks without feedback from the translator. Therefore you should provide all the context you can think of, and tell your translators that they are free to ask for more context at any time.
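One way to keep context remarks attached to the strings they describe is to export records instead of bare lines. The following is only a minimal sketch of that idea, not a standard file format: the `ExportEntry` class, its field names and the `#.` remark prefix are assumptions made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ExportEntry:
    """One translatable string plus the remarks that help the translator."""
    source: str          # text to be translated
    context: str = ""    # remark for the translator; never shown to the user
    location: str = ""   # where the string appears (hypothetical field)


def render_export(entries):
    """Render entries as a flat text file; remark lines start with '#.'."""
    lines = []
    for entry in entries:
        if entry.context:
            lines.append(f"#. context: {entry.context}")
        if entry.location:
            lines.append(f"#. location: {entry.location}")
        lines.append(entry.source)
        lines.append("")  # blank line separates entries
    return "\n".join(lines)


entries = [
    ExportEntry("resolution", "number of dots on a screen", "column heading"),
    ExportEntry("number of tables", "table = piece of furniture"),
    ExportEntry("enter new password"),
]
print(render_export(entries))
```

Because the remarks travel in the same file as the strings, a translator who asks for more context can be answered by editing one entry rather than a separate document.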
Writing all this supportive text is time-consuming. Alternatively, you could provide screenshots and no other explanation, but in such a case the translator would spend quite some time reading the screenshots and trying to understand the workflow (and you will need to reimburse him or her for the effort) and he can still make mistakes. It's better to provide both context and screenshots.
Perhaps the best approach is if the translator can actually use the software or website in question. You will have to budget for the time the translator will spend coming to understand the software, but you will not need to provide much context. There are disadvantages in this approach: to investigate a website thoroughly, the translator must have a good and affordable internet connection, which very often is not the case. For locally-installed software (not web-based), producers are rarely willing to provide the translator with a copy of the localized software, even when the translator specifically asks for it.
Calculated sentences
The table at the end of the previous installment introduced a popular programming technique that I call "calculated sentences": these are sentences (or sentence forms) that consist of some fixed, static text, and one or more variables. As a software developer, you have been trained to algorithmize everything that seems to lend itself to algorithmization. You do this to save work, both for yourself and your client. It's a noble goal, no question about that. The problem is that when you construct text sentences, an algorithm that works linguistically in your language will probably fail in most other languages. Even if you check that your procedure works for, say, English, Spanish and German, it may fail for Slavic languages, or some other language group, or a specific language. In my experience (for translation from English or German to Czech or another Slavic language) the likelihood of failure is actually well over 70%.
Some structures look like very enticing, attractive subjects for "calculated sentence" algorithmization. Look at the following example:
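The difference between a calculated sentence and spelled-out sentences can be sketched as follows. The message keys and the `MESSAGES` dictionary are hypothetical names chosen for this illustration; a real project would use whatever resource format its toolchain provides.

```python
ITEMS = ["tables", "balls"]


def calculated(item):
    """Calculated sentence: fixed text glued to a variable at run time."""
    return f"number of {item}"


# Spelled-out alternative: every complete sentence is its own entry, so the
# translator sees each one as a whole and can inflect it as needed.
MESSAGES = {
    "count.tables": "number of tables",
    "count.balls": "number of balls",
}


def spelled_out(item):
    """Look up the complete sentence instead of assembling it."""
    return MESSAGES[f"count.{item}"]


for item in ITEMS:
    # In English both approaches give the same text; only the second one
    # hands the translator complete sentences to work with.
    assert calculated(item) == spelled_out(item)
    print(spelled_out(item))
```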
1 table
2 tables
3 tables
4 tables
5 tables
etc.
You (as the creator of the software) do not know the number of tables, it could be 1, 2, or 2034 tables: that is the advantage of the use of variables, you don't need to know the number of tables to write the program. Perhaps you know that foreign languages create plurals in different ways than, say, English. Perhaps you have also been told that for the target language there are about 12 ways of creating a plural, depending on the noun that is being counted (i.e. the syntactic form for building the plural of "table" can be different from that for "chair" in the target language, whereas it is the same in English – addition of –s).
Note: It's generally thought that English is simple here. In the overwhelming number of cases the plural is formed by addition of –s, but there are common exceptions - for example, for some animals: mouse-mice and louse-lice (but not house-hice, so one couldn't form a general rule: a noun ending in –ouse has –ice as a plural), for others the plural is identical with the singular: fish, deer, moose. Then there are words that take Latin or Greek plurals, and words ending in sibilants (s sounds) take –es, not –s. Other languages (German, for example) are far more complicated here.
So you ask the translator to provide a translation of "table" and a translation for "tables" and you get "stul" and "stoly" (that's in Czech). Then you can write the following code:
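word_singular, word_plural = "stul", "stoly"
for x in range(1, 6):
    # Naive rule borrowed from English: singular for 1, plural otherwise.
    if x == 1:
        print(x, word_singular)
    else:
        print(x, word_plural)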
This results in the following output:
1 stul
2 stoly
3 stoly
4 stoly
5 stoly
Now let's say that, unfortunately, you've never heard that your target language actually has not one, but two different plural forms for any specific noun, such as "table". One plural form is used when there are 2, 3 or 4 items. The other plural form is used when there are 5 or more items. For Czech, the other plural form is "stolu". So now you have a program that works for numbers 1-4, and fails for any other number.
If you extrapolate this, you can have an unlimited number of plural forms for a single word. Still, you must provide the output "<x> tables" without having to purchase 1000 or 1,000,000 translations. There are two possible solutions: Firstly, you could have a branch for Czech that prints one plural for numbers 2, 3 and 4, and the other plural for numbers 5 and above, and similar branches for languages with similar issues. Secondly, you could work around the whole problem by rewriting your original sentence as follows: "number of tables: <x>". This second solution works for Czech, Slovak, Russian and probably any other Slavic language. Probably. As far as I know. Does it work for other languages? I have no idea. You would need to talk to your translator or his agency. For example, there might be a language in which 1 (and/or 0) cannot be referred to by the same expression "number of" that is used for 2 and up. In everyday language they would use a different term for our "number of" when the number (in our sense) is 0 or 1.
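Both solutions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a general pluralization routine: the function names are invented for the example, the Czech forms are spelled without diacritics as elsewhere in this article, only whole numbers are handled, and sending 0 to the second plural form is a choice made here that the text above does not cover.

```python
def czech_form(n, one, few, many):
    """Pick the noun form for a whole number n, following the rule above."""
    if n == 1:
        return one
    if 2 <= n <= 4:
        return few
    return many  # 5 and above; this sketch also sends 0 here


def count_phrase(n):
    """First solution: a Czech-specific branch for '<x> tables'."""
    return f"{n} {czech_form(n, 'stul', 'stoly', 'stolu')}"


def label_phrase(n, label="number of tables"):
    """Second solution: 'number of tables: <x>' needs no plural logic.

    The label is left in English as a placeholder; the translator would
    supply the target-language wording.
    """
    return f"{label}: {n}"


for n in (1, 2, 4, 5, 2034):
    print(count_phrase(n), "|", label_phrase(n))
```

The first function has to be rewritten for every language with its own plural rule, while the second only needs one translated label per language, which is why the rewritten sentence is usually the cheaper route where the target language allows it.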
Translation discounts
Calculated sentences described in the previous chapter are intended to save translation costs. I've shown that they can bring disastrous results. Does it mean that you have to pay through the nose if you need translation of 30 sentences that only differ in one word? Not at all. For more than a decade, Computer Aided Translation software has been widely used by the translation community. CAT software is not machine translation; it's merely a database of the whole text of the translation, which is indexed and automatically searched. Let's say you have a sentence like this one:
To open the Properties window, please click the Properties button.
Translated into a foreign language, it would look symbolically like this:
aa bbbb ccc DDDDDD eeeeee, ffffff gggg hhh KKKKKKKK jjjjj.
A few paragraphs later, the translator encounters a very similar sentence:
To open the TASKS window, please click the TASKS button.
The CAT software will automatically step in, find the closest sentence, and present it to the translator as a possible translation:
aa bbbb ccc DDDDDD eeeeee, ffffff gggg hhh KKKKKKKK jjjjj.
The software will even highlight "DDDDDD" and "KKKKKKKK" to tell the translator that these words are the only ones that need changing. To change the sentence offered by the CAT software to the translation of the new sentence, only a fraction of the time required for normal translation is needed. Consequently, translators offer a discount if there is a significant number of similar sentences (more than 10%) in the whole translation job. Generally, you will be offered a 60-80% discount for sentences with a similarity rate of 95-99%, a 30-60% discount for sentences with a similarity rate of 50-94%, and no discount for sentences with a similarity rate below 50%. Discounts lag behind similarity rates simply because not only the words that are different must be translated, but also the surrounding words may need to be edited due to inflectional and similar issues, and it takes quite some time to do the editing. So much time in fact that if there is less than 50% similarity it is actually better to translate from scratch, i.e. not using the sentence suggested by CAT software.
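The matching step can be imitated with Python's standard `difflib` module. This is only a rough stand-in: real CAT tools use their own matching algorithms and similarity scores, and the `memory` list, `similarity` and `changed_words` are names invented for this sketch.

```python
import difflib

# Previously translated source sentences (a tiny stand-in for a real database).
memory = [
    "To open the Properties window, please click the Properties button.",
    "Enter new password.",
]
new_sentence = "To open the TASKS window, please click the TASKS button."


def similarity(a, b):
    """Word-level similarity between 0.0 and 1.0 (difflib's ratio)."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


def changed_words(old, new):
    """Words of the old sentence that differ from the new sentence."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    changed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.extend(old_words[i1:i2])
    return changed


# Find the stored sentence closest to the new one, as a CAT tool would.
best = max(memory, key=lambda stored: similarity(stored, new_sentence))
print(f"closest stored sentence: {best}")
print(f"similarity: {similarity(best, new_sentence):.0%}")
print(f"words to change: {changed_words(best, new_sentence)}")
```

Here the two occurrences of "Properties" are reported as the words to change, which corresponds to the highlighted "DDDDDD" and "KKKKKKKK" in the translated sentence.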
Some sentences in your source may be identical; this is called 100% match in CAT terminology. You should not expect to get such repeated sentences translated for free. Be ready to pay about 10% of the normal price for the translation of totally identical sentences, because the translator will then check whether the translation should really be identical – which is not always the case. Moreover, if you change translators in the middle of the job – a perfectly reasonable thing to do since the CAT translation memory will guarantee vocabulary consistency – and your former translator translated some sentences incorrectly and your new translator recognizes this when they show up as proposals for a 100% match, he may be tempted not to correct the mistakes if he is not paid for a 100% match.
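To see what these bands mean for a budget, here is a small worked estimate. Everything numeric beyond the bands themselves is an assumption: the discounts for the two fuzzy bands are taken at the midpoint of the quoted ranges, the per-word price and the sample job are invented, and the sketch ignores the condition that similar sentences must make up a significant share of the job.

```python
def discount_for(similarity_percent):
    """Return an assumed discount (0.0 to 1.0) for a similarity percentage."""
    if similarity_percent >= 100:
        return 0.90   # identical sentence: pay about 10% of the normal price
    if similarity_percent >= 95:
        return 0.70   # assumed midpoint of the 60-80% band
    if similarity_percent >= 50:
        return 0.45   # assumed midpoint of the 30-60% band
    return 0.0        # below 50%: translated from scratch, no discount


def job_price(segments, price_per_word):
    """Sum the price over (word_count, similarity_percent) pairs."""
    total = 0.0
    for word_count, similarity_percent in segments:
        full_price = word_count * price_per_word
        total += full_price * (1 - discount_for(similarity_percent))
    return total


# Hypothetical job: four sentences of 10 words each at 0.10 per word.
segments = [(10, 100), (10, 97), (10, 80), (10, 30)]
print(f"with discounts:    {job_price(segments, 0.10):.2f}")
print(f"without discounts: {sum(w for w, _ in segments) * 0.10:.2f}")
```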
Vocabulary consistency is another important point, particularly in technical translations – it's not infrequent for a single translator to use different translations for the same term in the course of a translation: for example, the term in German "symmetrisches Paar" can be translated either as "symmetric pair" or "balanced pair" with the same meaning and it's easy to forget which term one has already used, if one is familiar with both. It's better to remain with one term throughout a text, especially since the creator(s) of a text often have a preference for one term over an equivalent one. CAT software solves this problem by letting the translator create a vocabulary list that is specific for the given source text. Such a vocabulary list contains pairs like "symmetrisches Paar" = "balanced pair". Whenever the translator (and the CAT software which is silently watching in the background) encounters "symmetrisches Paar" anywhere in a sentence, CAT steps in and highlights "symmetrisches Paar" in the source text, and it also politely offers "balanced pair" as a translation. The translator can press a hotkey to paste the "balanced pair"; in other words, he does not have to retype it. The CAT software intervention is very unobtrusive and the translator can always turn down the software suggestion by simply ignoring it - he does not have to click "Cancel"...
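The vocabulary list itself is a simple structure. The sketch below shows the lookup in its most naive form; `GLOSSARY` and `suggest_terms` are hypothetical names, the German sample sentence was written for this example, and plain substring matching is an assumption that will miss inflected forms of a term, which real CAT tools handle in their own ways.

```python
# Source term -> preferred translation, as agreed for this particular text.
GLOSSARY = {
    "symmetrisches Paar": "balanced pair",
}


def suggest_terms(source_sentence, glossary=GLOSSARY):
    """Return (source term, preferred translation) pairs found in a sentence."""
    lowered = source_sentence.lower()
    return [
        (term, translation)
        for term, translation in glossary.items()
        if term.lower() in lowered
    ]


sentence = "Ein symmetrisches Paar wird verwendet."
for term, translation in suggest_terms(sentence):
    # The translator is free to ignore the suggestion.
    print(f'found "{term}", suggested translation: "{translation}"')
```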
CAT tools bring the additional benefit that I touched upon in the previous paragraph: if the translator delivers the flat file translation database (the so-called "translation memory") with the resulting translation text, you will be able to change translators on the fly, or add more people to the translation team, without compromising the consistency of the translation vocabulary.
To sum up, CAT provides most of the savings that you would have achieved by using "calculated sentences" without compromising translation quality.
Allocated space and line feeds
You cannot predict the length of a translated text if you do not know the language yourself, and even if you do know the language a precise prediction is not easy. If translation to several languages is needed, even an informed guess is unlikely, because the length will be different in each language. These differences are sometimes unexpectedly large – two or three times the length of the original text (or correspondingly shorter, depending on which direction one is going in). It's surprising how many localization projects are submitted with no regard for this.
The translator can often abbreviate his translation if the length requirements are conveyed to him – which rarely happens, however. Occasionally, abbreviation may be impossible: your two word expression with a text-cell length of 9 characters may need to be translated into four words, and no one can abbreviate four words in the space of 9 characters. You should be ready to change your layout if there is no other linguistic solution.
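Length limits are easy to check automatically once they are written down. The following sketch assumes that each string has a key and that limits are stored per key; both dictionaries, the keys and the placeholder translations are invented for the example. It also assumes that one character takes one unit of space, which does not hold for proportional fonts or for every script.

```python
def find_overflows(translations, limits):
    """Return (key, actual length, limit) for each translation that is too long."""
    overflows = []
    for key, text in translations.items():
        limit = limits.get(key)
        if limit is not None and len(text) > limit:
            overflows.append((key, len(text), limit))
    return overflows


# Space available for each string, in characters (hypothetical values).
limits = {"heading.counter": 9, "button.confirm": 12}

# Placeholder translations standing in for real target-language text.
translations = {
    "heading.counter": "xxxx xx xxxxx xxxx",   # four words, 18 characters
    "button.confirm": "xxxx",
}

for key, length, limit in find_overflows(translations, limits):
    print(f"{key}: {length} characters, but only {limit} fit")
```

Running such a check when translations are delivered tells you early which layouts need to change and which strings the translator should be asked to abbreviate.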
The source text sometimes contains line feeds (hard line-breaks). This can be accommodated in the translation, but bear in mind that the word order will almost certainly be different; furthermore, each line must provide sufficient space. What you originally wrote like this:
xxx xxxxx xxxxx
xxxxxxxxx xx
xxx xxxxx xxxxxx xxxx xxx.
… might wind up looking like this when translated (and you might not like the new appearance, or the text might run over a figure):
xx xxxxxxxxxxxxxxxxxxxxx
xxx xxxx xxxx xxxxx xxxxx
xxx.
Symbols and proper names
Symbols or pictograms are widely believed to be internationally comprehensible. This is true provided that you think carefully when choosing them. You can choose a misleading symbol. For instance, it's a safe bet to use a question mark symbol for anything related to "question", although I imagine that there are languages which use something other than a question mark to mark question sentences. On the other hand, you can act in good faith and still choose incorrectly. Some pictures could have no meaning abroad, because the thing depicted does not exist there. Want an example? A picture of a piggy bank is clear in a Euro-American context, but do all kids throughout the world save money inside a pig? Might this even be insulting to a Muslim, for instance?
Could you use a dollar note to illustrate money? Probably, since the "$" symbol is internationally recognizable, but it could also be interpreted to mean "American Dollars" or even "Foreign Exchange" instead of simply "money".
Try to remove proper names (company brand names) that are used in your source language to represent a class of products. For example, in American English, Band-Aid is used to refer to adhesive patches for covering wounds in general, yet it is a brand name virtually unknown in most countries of the world.
Chained translation
It is often cheaper to translate from English than from any other language, simply because in any language there are more translators who know English than who know any other language. This can tempt you to chain translations and lead to a disaster. A practical example will clarify this:
Suppose your source language is Finnish, and you need translations to English and German, but also to Arabic, Chinese, Japanese etc. It is either impossible to find a Finnish/Chinese translator, or, if you can find such a translator, he is prohibitively expensive. Translators from English to Chinese are, on the other hand, plentiful and much cheaper. Therefore you decide to cut costs by first translating the whole text into English, and then using English as the source text.
You have all the background information (contextual remarks) in your source text, in Finnish. Your Finnish-English translator uses these remarks to create a perfect translation to English, but he does not translate the background information. Will you get good results from the translators who then use the perfect English translation? Of course not. This is obvious once you realize that you just transported the problem we've been discussing here (lack of background information) "one language up".
Even if the Finnish-English translator translates all the background information, it is still possible that all other translations will have somewhat substandard quality or errors, unless you make sure that the source translation (the Finnish-English translation) is of the utmost quality, triple-checked and proofread and based on consultation with the software developers.
Conclusion
The examples above are just that: examples of what can go wrong with translations based on seemingly innocuous input. I tried to list everything that I have ever faced in my work as a translator and localizer. But I only know three languages well and have a passing knowledge of two others, all of which are European languages. What it boils down to is that you should constantly talk to your translators to help them deliver the best possible result, and you should budget both funds and time for that.
One approach to achieve a high-quality localization would be to provide a detailed context for each sentence. The amount of this context may and most likely will exceed the amount of translated text. You should offer to pay a premium to translators for studying this context. Furthermore, even if you provide contextual information for all messages, you should also provide all screens and windows that appear in your program or website as a reference.
Sadly, quite a few translation agencies and freelance translators market localization but when you provide them with insufficient context, they do not have the guts to ask you for more material, perhaps fearing that you might consider them incompetent. One fundamental problem here is that usually the authors of texts are simply unaware of the difficulties involved with translation. Or even worse, knowing that the result will come to haunt you later on, the translators find it more profitable to sell a quick and dirty translation and hope that you will come months later to have it translated again and properly, for much more money.
And one last, often overlooked, fact: translations tend to uncover errors in the original text - typos, terminology inconsistencies, and even text that is confused or difficult (or impossible) to comprehend. Most translators will deal with these kinds of problems without telling you for fear of hurting your feelings and consequently losing any chance of getting more jobs. This is not good for you or your customers. The translators will correct typos, but instead of correcting terminology inconsistencies and unclear text, they are likely to transport these problems into the translation. Deep down inside they are bothered by having to translate a low quality text into a low quality result, and they will be happy to tell you about these issues in your text – probably at no extra charge. If you are smart, you will tell them unambiguously that you want all the feedback they can provide. Opening all communication channels is the key to successful localization.
Miloš Průdek (prudek@bvx.cz) is an English-Czech translator and localization specialist.
