Machine Translation: The Basics of Quality Estimation

Before we go into more detail on QE, it is worth clarifying the difference between evaluation and estimation. You can evaluate the quality of MT output in two main ways: human evaluation (a person checks the translation and provides feedback) and automatic evaluation (various methods can score the translation quality without human intervention).

Traditionally, automatically evaluating the quality of any given MT output has required a reference translation created by a human translator. The differences and similarities between the MT output and the reference translation can then be turned into a score to determine the quality of said output. This is the approach followed by certain methods, such as BLEU and NIST.
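
To make the reference-based approach concrete, here is a minimal sketch that scores an MT output against a human reference with BLEU, using the sacrebleu library (the post names the metric, not this particular tool; the sentences are illustrative placeholders):

```python
# Reference-based automatic evaluation: the MT output is scored against a
# human reference translation. Sentences here are illustrative placeholders.
import sacrebleu

hypotheses = ["the cat sits on the mat"]           # MT output, one segment
references = [["the cat is sitting on the mat"]]   # one human reference per segment

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # higher means closer to the reference
```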

The main differentiator of quality estimation is that it does not require a human reference translation.

QE is a prediction of quality based on certain features. These features can be, for example, the number of noun or prepositional phrases in the source and target (and the difference between them), the number of named entities (names of places, people, companies, etc.), and many more. Using these features and machine learning techniques, a QE model can be trained to produce a score that estimates the translation quality.
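
As an illustration of this idea, the sketch below trains a small regression model on hand-crafted source/target features to predict a quality score. The feature values, labels, and choice of algorithm (a random forest from scikit-learn) are assumptions for the example; the post does not specify a particular model:

```python
# Feature-based QE: counts extracted from source and target segments are used
# to train a regressor that predicts a quality score without any reference.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each row: [noun phrases in source, noun phrases in target,
#            named entities in source, named entities in target,
#            source length, target length]  -- illustrative values only
X_train = np.array([
    [4, 4, 2, 2, 12, 13],
    [6, 3, 1, 0, 20, 11],
    [5, 5, 3, 3, 15, 16],
])
# Quality labels, e.g. derived from human judgments or post-editing effort
y_train = np.array([0.90, 0.30, 0.85])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict the quality of a new segment from its features alone --
# no reference translation is needed at prediction time.
new_segment = np.array([[5, 4, 2, 1, 14, 12]])
print(model.predict(new_segment))
```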

At eBay, we use MT to translate search queries, item titles, and item descriptions; an earlier post discussed our work with search queries. To train our MT systems, we work with vendors that help us post-edit content. Given the challenging nature of our content (user-generated, a wide diversity of categories, millions of listings, etc.), a method to estimate the level of effort required for post-editing adds real value. QE can provide this information in an automated manner. For example, you can estimate how many segments have such a low-quality translation that they could simply be discarded instead of post-edited.
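
As a sketch of that triage step, the snippet below splits segments into post-editing candidates and likely discards based on a QE score threshold; the scores and threshold are made up for illustration:

```python
# Triage by estimated quality: segments scoring below a threshold are flagged
# for discarding (or re-translation) instead of post-editing.
def split_by_quality(segments, qe_scores, threshold=0.4):
    post_edit, discard = [], []
    for segment, score in zip(segments, qe_scores):
        (post_edit if score >= threshold else discard).append((segment, score))
    return post_edit, discard

segments = ["Item title A", "Item description B", "Item title C"]
qe_scores = [0.82, 0.31, 0.55]  # produced by a QE model; values are illustrative
to_post_edit, to_discard = split_by_quality(segments, qe_scores)
print(f"{len(to_post_edit)} segments to post-edit, {len(to_discard)} to discard")
```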

So, what can you do with the help of QE? First and foremost, you can estimate the quality of translations at the segment and file level. Segment-level scores can help you target post-editing, focusing only on content that is worth post-editing. You can also estimate post-editing effort and time; it is fairly safe to assume that segments with a low quality score take more time to post-edit. It is also possible to compare MT systems based on QE scores and see which one performs better. This last application is especially helpful when you are deciding which engine to use, or whether a new version of an engine performs better than its predecessor.
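
For the system-comparison use case, a simple approach is to run both engines' outputs through the same QE model and compare the average segment scores, as in this sketch (the scores are illustrative):

```python
# Compare two MT engines by the average of their segment-level QE scores.
from statistics import mean

engine_a_scores = [0.71, 0.64, 0.80, 0.55]  # QE scores for engine A's output
engine_b_scores = [0.68, 0.75, 0.83, 0.60]  # QE scores for engine B's output

avg_a, avg_b = mean(engine_a_scores), mean(engine_b_scores)
print(f"Engine A: {avg_a:.2f}  Engine B: {avg_b:.2f}")
print("Prefer", "A" if avg_a >= avg_b else "B", "based on estimated quality")
```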
