Greasing the wheels for comparative communication research: Supervised text classification for multilingual corpora
Volume 3, Issue 3 (2021)
E-ISSN: 2665-9085

Abstract

Employing supervised machine learning for text classification is already a resource-intensive endeavor in a monolingual setting. When a multilingual corpus has to be classified, however, the cost of producing the required annotated documents quickly exceeds even generous time and financial constraints. We show how tools such as automated annotation and machine translation can be employed not only efficiently but also effectively to classify a multilingual corpus with supervised machine learning. Our findings demonstrate that good results can already be achieved with the machine translation of about 250 to 350 documents per category and language, combined with a dictionary in just one language, which we consider a realistic scenario for many projects. The methodological strategy is applied to study migration frames in seven languages (news discourse in seven European countries) and is discussed and evaluated with regard to its usability in comparative communication research.
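
The abstract describes a pipeline that combines machine translation, dictionary-based automated annotation, and supervised classification. The following is a minimal, hypothetical sketch of such a pipeline in Python with scikit-learn (one plausible toolkit, not necessarily the one used in the study); the example texts, the migration_terms dictionary, and the annotate helper are illustrative assumptions rather than material from the article.

```python
# Minimal sketch (not the authors' code) of the workflow described above:
# documents are assumed to be machine-translated into English, a small
# English keyword dictionary stands in for the single-language dictionary,
# and the resulting labels train a supervised classifier. All texts, terms,
# and helper names are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Machine-translated documents (placeholder examples).
documents = [
    "The government announced new border controls for asylum seekers.",
    "Local volunteers organised language courses for refugees.",
    "Unemployment figures rose again in the last quarter.",
    "A charity opened a shelter for newly arrived migrants.",
]

# Single-language (English) dictionary used for automated annotation.
migration_terms = {"asylum", "refugee", "refugees", "migrant", "migrants", "border"}

def annotate(text: str) -> int:
    """Label a document 1 if it contains any dictionary term, else 0."""
    tokens = {token.strip(".,;:").lower() for token in text.split()}
    return int(bool(tokens & migration_terms))

# Automated annotation replaces (part of) the manual coding effort.
labels = [annotate(doc) for doc in documents]

# Supervised classifier trained on the automatically annotated, translated texts.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(documents, labels)

print(classifier.predict(["New arrivals were housed at the border camp."]))
```

In a real application, per-language classification performance would be validated against manually annotated material before the labels are trusted; that step is omitted in this sketch.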

doi: 10.5117/CCR2021.3.001.LIND
Published online: 2021-10-01
