Human Evaluation of Machine Translation


Machine translation (MT) evaluation is essential in machine translation development. This is key to determining the effectiveness of the existing MT system, estimating the level of required post-editing, negotiating the price, and setting reasonable expectations. Machine translation output can be evaluated automatically, using methods like BLEU and NIST, or by human judges.

The automatic metrics use one or more human reference translations, which are considered the gold standard of translation quality. The difficulty lies in the fact that there may be many alternative correct translations for a single source segment.

Human evaluation, however, also has a number of disadvantages. Primarily, it is a costly and time-consuming process. Human judgment is also subjective in nature, so it is difficult to achieve a high level of intra-rater (consistency of the same human judge) and inter-rater (consistency across multiple judges) agreement. In addition, there are no standardized metrics and approaches to human evaluation.

Let us explore the most commonly used types of human evaluation.


Judges rate translations based on a predetermined scale. For example, a scale from 1 to 5 can be used, where 1 is the lowest and 5 is the highest score. One of the challenges of this approach is establishing a clear description of each value in the scale and the exact differences between the levels of quality. Even if human judges have explicit evaluation guidelines, they still find it difficult to assign numerical values to the quality of the translation (Koehn & Monz, 2006).

The two main dimensions or metrics used in this type of evaluation are adequacy and fluency.


Adequacy, according to the Linguistic Data Consortium, is defined as “how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation.” The annotators must be bilingual in both the source and target language in order to judge whether the information is preserved across translation.

A typical scale used to measure adequacy is based on the question “How much meaning is preserved?”

5: all meaning
4: most meaning
3: some meaning
2: little meaning
1: none


Fluency refers to the target only, without taking the source into account; criteria are grammar, spelling, choice of words, and style. A typical scale used to measure fluency is based on the question “Is the language in the output fluent?”

5: flawless
4: good
3: non-native
2: disfluent
1: incomprehensible


Judges are presented with two or more translations (usually from different MT systems) and are required to choose the best option. This task can be confusing when the ranked segments are nearly identical or contain difficult-to-compare errors. The judges must decide which errors have greater impact on the quality of the translation (Denkowski & Lavie, 2010). On the other hand, it is often easier for human judges to rank systems than to assign absolute scores (Vilar et al., 2007). This is because it is difficult to quantify the quality of the translation.

Error Analysis

Human judges identify and classify errors in MT output. Classification of errors might depend on the specific language and content type. Some examples of error classes are “missing words,” “incorrect word order,” “added words,” “wrong agreement,” “wrong part of speech,” and so on. It is useful to have reference translations in order to classify errors; however, as mentioned above, there may be several correct ways to translate the same source segment. Accordingly, reference translations should be used with care.

When evaluating the quality of eBay MT systems, we use all the aforementioned methods. However, our metrics can vary in the provision of micro-level details about some areas specific to eBay content. As a result, one of the evaluation criteria is to identify whether brand names and product names (the main noun or noun phrase identifying an item) were translated correctly. This information can help in identifying the problem areas of MT and focusing on the enhancement of that particular area.

Some types of human evaluation, such as error analysis, can only be conducted by professional linguists, while other types of judgment can be performed by annotators who are not linguistically trained.

Is there a way to cut the cost of human evaluation? Yes, but unfortunately, low-budget crowdsourcing evaluations tend to produce unreliable results. How then can we save money without compromising the validity of our findings?

  • Start with a pilot test — a process of trying out your evaluation on a small data set. This can reveal critical flaws in your metrics, such as ambiguous questions or instructions.
  • Monitor response patterns to remove judges whose answers are outside the expected range.
  • Use dynamic judgments — a feature that allows fewer judgments on the segments where annotators agree, and more judgments on segments with a high inter-rater disagreement.
  • Use professional judgments that are randomly inserted throughout your evaluation job. Pre-labeled professional judgments will allow for the removal of judges with poor performance.

Human evaluation of machine translation quality is still very important, even though there is no clear consensus on the best method. It is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment.

If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.