Ideally, you should not need to edit the MT output heavily, because that partly defeats the purpose of using MT. How much post-editing is required depends on the quality of the raw MT output and the level of quality you need. Post-editing fills the gap between the two.
To make post-editing fruitful, sit down with your language service provider (LSP) so that everyone is on the same page. Your LSP can also give you structured feedback, which will help you understand the types of errors creeping into the output and improve the engine over time.
In general, your LSP will offer two levels of post-editing: light, and heavy (or full).
Light post-editing is used when the raw output is not of very poor quality or when “good enough” quality will do. In this approach, the purpose is to i
Around the world, people have evolved their own conventions or practices to do things: whether it is how they number their streets or how they address each other. For a company entering a new market, not getting these small things right is the easiest way to be singled out as a “foreigner” who does not know their way about the local market and perhaps does not care either.
You don’t want to be that company. You want to get everything right. Internationalization (i18n) is how you do it.
Date/Time
08-06-20
Look at the numbers in the date above. There are several ways to read it, and each of them is correct somewhere in the world.
For instance, if you read the above date as 8 June, 2020, it’d be absolutely correct in India. In the US, it’d be read as 6 August, 2020, while in Japan it’d be 20 June 2008.
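To make the ambiguity concrete, here is a minimal Python sketch that parses the same string under the three readings described above. The `CONVENTIONS` table and the `interpret` function are hypothetical names made up for this illustration; only `datetime.strptime` comes from the standard library.

```python
from datetime import datetime

# Hypothetical table: one strptime pattern per reading convention.
CONVENTIONS = {
    "day-month-year (India)": "%d-%m-%y",
    "month-day-year (US)": "%m-%d-%y",
    "year-month-day (Japan)": "%y-%m-%d",
}


def interpret(raw):
    """Parse the same string under each convention and return ISO dates."""
    return {
        label: datetime.strptime(raw, pattern).date().isoformat()
        for label, pattern in CONVENTIONS.items()
    }


for label, iso_date in interpret("08-06-20").items():
    print(f"{label}: {iso_date}")
```

The same eight characters produce three different dates, which is why a date should be stored in one unambiguous form and formatted for the user's locale only when it is displayed.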
Not only does the order of day, month, and year vary, but so do the separators between them. This is how dates are written out in the following locales, and there could be more variations as you add more countries:
2020-08-06 (Canada)
08/06/20 (Brazil)
06/08/2020 (UAE, Ireland)
08.06.20 (Germany)
Time is denoted in 12-hour clocks in some places and in 24-hour clocks in others. It’s also written with different separators.
Numbers
Numbers are also written in varying formats. Take a look at the following:
123,456,789.00 (US)
123.456.789,00 (Germany)
123 456 789,00 (France)
123’456’789.00 (Switzerland)
12,34,56,789.00 (India)
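The list above can be reproduced with a small Python sketch. The `STYLES` table and the two helper functions are hypothetical and hand-rolled for illustration, and the sketch assumes non-negative values. In production you would normally rely on a locale library rather than maintaining such a table yourself.

```python
# Hypothetical style table: (group separator, decimal separator, Indian grouping?)
STYLES = {
    "US": (",", ".", False),
    "Germany": (".", ",", False),
    "France": (" ", ",", False),
    "Switzerland": ("’", ".", False),
    "India": (",", ".", True),
}


def group_digits(digits, indian=False):
    """Split a digit string into groups, working from right to left."""
    groups = [digits[-3:]]        # the rightmost group always has three digits
    rest = digits[:-3]
    size = 2 if indian else 3     # Indian grouping uses pairs after the first three
    while rest:
        groups.insert(0, rest[-size:])
        rest = rest[:-size]
    return groups


def format_number(value, style):
    """Format a non-negative number with two decimals in the given style."""
    group_sep, decimal_sep, indian = STYLES[style]
    whole, fraction = f"{value:.2f}".split(".")
    return group_sep.join(group_digits(whole, indian)) + decimal_sep + fraction


for style in STYLES:
    print(f"{format_number(123456789, style)} ({style})")
```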
Currency
Not only are the currencies different in various countries, but so are their display formats. For instance, Germany uses the currency symbol after the amount.
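As a rough illustration of symbol placement, the Python sketch below formats the same amount for two locales. The `CURRENCY_RULES` table and the `format_currency` function are hypothetical, cover only two locales, and ignore finer points such as non-breaking spaces.

```python
# Hypothetical display rules: (pattern, group separator, decimal separator).
CURRENCY_RULES = {
    "en-US": ("${amount}", ",", "."),
    "de-DE": ("{amount} €", ".", ","),
}


def format_currency(value, locale_tag):
    """Format a monetary amount using the display rule for a locale."""
    pattern, group_sep, decimal_sep = CURRENCY_RULES[locale_tag]
    us_style = f"{value:,.2f}"    # e.g. "1,234.56"
    # Swap both separators in a single pass so they cannot overwrite each other.
    swapped = us_style.translate(str.maketrans({",": group_sep, ".": decimal_sep}))
    return pattern.format(amount=swapped)


print(format_currency(1234.56, "en-US"))
print(format_currency(1234.56, "de-DE"))
```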
Mail address
Many countries follow their own address formats. And the minutest details matter. Given below is the address format for the United Kingdom:
Line 1: First name + last name
Line 2: Street name
Line 3: Postal town/city
Line 4: County (not always needed)
Line 5: Postal code
Line 6: Country name
However, this will vary widely from country to country.
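One way to handle this is to keep a formatting rule per country instead of one fixed form. The Python sketch below covers only the UK layout listed above. The function name `format_uk_address` and the sample address are made up for illustration.

```python
def format_uk_address(name, street, town, postcode, county=None,
                      country="United Kingdom"):
    """Lay out an address in the UK line order, skipping an empty county."""
    lines = [name, street, town, county, postcode, country]
    return "\n".join(line for line in lines if line)


# Made-up sample data; the optional county line is left out here.
print(format_uk_address("Jane Doe", "10 Example Street", "Exampletown", "EX1 2MP"))
```

Other countries would need their own functions or templates, since the order of the lines and the required fields differ.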
Getting date, time, number, and mail address formats right may not sound like a big deal, but imagine an e-commerce website not being able to deliver goods to its customers, because it did not record their addresses right. Many a time, if your forms are not internationalized, the customer may not even be able to place the order.
Or, if your prices are displayed in the wrong format, your products may seem outlandishly expensive or far too cheap, depending on which decimal separator is used.
Sort and search
If your app has a lot of sort and search functions built into it, make sure that users anywhere in the world can use them and get the results they expect. Some may type in capital letters, while others may enter search terms in more than one language. Some users may prefer to navigate through your app by voice, too.
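A simple first step toward forgiving search is to compare normalized keys rather than raw strings. The sketch below uses Python's standard `unicodedata` module. The function names `search_key` and `matches` are hypothetical, and ignoring accents is an assumption that suits some languages but not all of them.

```python
import unicodedata


def search_key(text):
    """Reduce text to a form that ignores letter case and accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (accents) that the decomposition split off.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query, candidate):
    """Return True if the query occurs in the candidate, ignoring case and accents."""
    return search_key(query) in search_key(candidate)


print(matches("CAFE", "Café au lait"))
print(search_key("STRASSE") == search_key("Straße"))
```

Sorting correctly for each locale is a separate problem that usually calls for a collation library.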
Hard-coded strings
Do not leave content in source code. If strings are hard-coded into the source code, it can be difficult to extract them for translation and localization. Remember that sometimes you will have to extract content from thousands of lines of code and then have to do this for every language. It will create many source code branches and result in a waste of the developer’s time.
Hard-coded strings should be externalized to a resource file and keys should be used in the code to avoid these issues.
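Here is a minimal sketch of what externalized strings can look like in Python. The keys, the inline JSON, and the `translate` function are all hypothetical. In a real project each language would normally live in its own resource file that translators can work on without touching the code.

```python
import json

# Stand-in for per-language resource files; inlined to keep the sketch self-contained.
RESOURCES = json.loads("""
{
  "en": {"greeting": "Hello, {name}!", "cancel": "Cancel"},
  "de": {"greeting": "Hallo, {name}!", "cancel": "Abbrechen"}
}
""")


def translate(key, locale, default_locale="en", **params):
    """Look up a string by key, falling back to the default language."""
    table = RESOURCES.get(locale, {})
    template = table.get(key) or RESOURCES[default_locale][key]
    return template.format(**params)


print(translate("greeting", "de", name="Ana"))
print(translate("cancel", "fr"))   # no French table yet, so this falls back to English
```

The code only ever refers to keys such as `cancel`, so adding a language means adding a resource table rather than creating a new branch of the source.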
Unicode or double-byte or multi-byte character support
In older encodings for languages written in the Latin script, one character is represented by one byte, which is how computers store it. However, languages such as Chinese, Japanese, and Korean (CJK) have far too many characters for a single byte and need two or more bytes per character.
If you’d like to offer many languages on your website, you must support multi-byte encodings, ideally Unicode (for example, UTF-8), which covers all of these scripts in a single encoding.
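You can see the difference between characters and bytes with a few lines of Python. The `byte_lengths` helper is a hypothetical name for this sketch, and `gbk` stands in here as one example of a legacy double-byte Chinese encoding.

```python
def byte_lengths(char, encodings=("utf-8", "gbk")):
    """Return how many bytes a single character needs in each encoding."""
    return {encoding: len(char.encode(encoding)) for encoding in encodings}


for char in ("A", "中"):
    # len(char) counts characters, not bytes, so it is 1 in both cases.
    print(char, len(char), byte_lengths(char))
```

Code that assumes one byte per character, for example when truncating text or sizing a database column, will break on such input.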
Bi-di support or right-to-left support
Some languages, such as Chinese, can be written either vertically or horizontally. Middle Eastern languages such as Arabic and Hebrew are written from right to left. In all these cases, users must be able to input text easily in their languages.
Punctuation marks should appear in the right places, numbers must run in the correct direction within the text (in Arabic and Hebrew, numbers still run from left to right), and visuals and user interface elements must be placed appropriately.
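As a starting point, an app can guess the base direction of a piece of text from its first strongly directional character. The sketch below uses the bidirectional classes exposed by Python's `unicodedata` module. The function name `base_direction` is hypothetical, and this heuristic is a simplification, not a full implementation of the Unicode bidirectional algorithm.

```python
import unicodedata

# Bidirectional classes of strongly right-to-left characters (Hebrew, Arabic).
RTL_CLASSES = {"R", "AL"}


def base_direction(text):
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "ltr"   # digits, spaces, and punctuation alone decide nothing


print(base_direction("Hello"))
print(base_direction("مرحبا"))
print(base_direction("123 שלום"))
```

Digits and spaces are skipped because they have no strong direction of their own, so the third string is still treated as right-to-left.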
Embedding text in images
This is a strict no-no, just as placing content in the source code is. It causes a lot of pain for developers and translators alike. If text is hard-coded into an image, translators may not see it, and as a result it may appear in the source language on translated websites and apps. It makes for a jarring user experience.
GUI elements
Not all languages take up the same amount of space, whether because of the size of their characters or because of word length. German, for example, takes up to 30% more space than English. The English “cancel” translates into “Abbrechen”, which is three letters longer than the English word. Hence, plan for text expansion.
In Indian languages, there are combining characters, and these sometimes break apart in the graphical user interface (GUI). Thai does not use spaces or other delimiters between words. For example, the Thai translation of ‘writing’, การเขียน, might be regarded as a single word (kānkhīan) or as two (kān khīan). Hence, it’s often difficult to break up such text without losing meaning.
Your i18n manager must be able to take care of all such issues, as well as any others that may be specific to different languages.
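The Python sketch below shows why naive text handling trips over these scripts. The helper `safe_to_cut` is a hypothetical and deliberately simplified check; real applications would normally use a library that implements Unicode text segmentation.

```python
import unicodedata

hindi = "\u0915\u093F"   # the syllable "ki": letter KA followed by vowel sign I
thai = "การเขียน"         # 'writing', as in the example above


def safe_to_cut(text, index):
    """Return False if cutting before text[index] would detach a combining mark."""
    return not unicodedata.category(text[index]).startswith("M")


print(len(hindi))               # one visible unit, but two code points
print(safe_to_cut(hindi, 1))    # cutting here would separate the vowel sign
print(thai.split())             # no spaces, so splitting on whitespace finds one chunk
```

Truncating or wrapping text by counting code points can therefore split a character in Hindi, and splitting on spaces cannot find word boundaries in Thai.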
nform the reader, not create flawless content. Generally, the types of errors that need to be corrected in this approach are lexical and syntactical. Lexical errors refer to wrong word usage, while errors in syntax refer to faulty sentence structure.
In heavy or full post-editing, the goal is to equal human-quality translation. As such, apart from the above-mentioned error types – lexical and syntactic – errors of style and fluency, along with other less obvious errors, are also corrected.