Word-level machine translation for bag-of-words text analysis: Cheap, fast, and surprisingly good | Amsterdam University Press Journals Online
Volume 5, Issue 2
  • E-ISSN: 2665-9085


The quality of automated machine translation is rapidly approaching that of professional human translation. However, the best methods remain costly in money, computational resources, or time, particularly when applied to large volumes of text. In contrast, word-level translation is both free and fast: it simply maps each word in a source language deterministically to a word in a target language. This paper demonstrates that high-quality word-level translation dictionaries can be generated cheaply and easily, and that they produce translations that can serve reliably as inputs into some of the most common automated text analysis methods. It advances the field on two fronts: it assesses different techniques for creating word-level translation dictionaries, and it systematically compares word-level translations against those produced by either state-of-the-art neural machine translation or professional human translation. Comparisons are performed for three common text analysis tasks (sentiment analysis, dictionary-based content analysis, and topic modeling) across a total of eleven different source languages and two target languages (English and French). Across all languages and tasks, word-level dictionaries perform sufficiently well to make them an attractive alternative when resource constraints make neural machine translation inaccessible. The translation dictionaries, as well as the code used to generate and validate them, are available on GitHub.
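To make the idea of deterministic word-level translation concrete, the following is a minimal sketch in Python and not the authors' implementation. The dictionary `TOY_DICT`, its German-to-English entries, and the helper names `tokenize`, `translate_word_level`, and `bag_of_words` are all hypothetical and chosen only for illustration. The choice to pass unknown words through unchanged is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical toy German-to-English dictionary (illustrative entries only,
# not taken from the published translation dictionaries).
TOY_DICT = {
    "die": "the",
    "regierung": "government",
    "ist": "is",
    "gut": "good",
}


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def translate_word_level(text, dictionary):
    """Map each token to its single dictionary entry.

    Assumption: tokens missing from the dictionary are kept unchanged.
    """
    return [dictionary.get(token, token) for token in tokenize(text)]


def bag_of_words(tokens):
    """Count token occurrences; word order is discarded."""
    return Counter(tokens)


tokens = translate_word_level(
    "Die Regierung ist gut, die Opposition nicht.", TOY_DICT
)
bow = bag_of_words(tokens)
```

Because bag-of-words methods ignore word order, the translated token counts in `bow` are all that a downstream sentiment, dictionary, or topic model would consume. This is why a one-to-one word mapping can be an adequate input despite producing ungrammatical sentences.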


  • Article Type: Research Article
Keyword(s): computational social science; machine translation; text-as-data; word embeddings