Machine Translation as an Underrated Ingredient? Solving Classification Tasks with Large Language Models for Comparative Research

Akos Mate; Miklós Sebők; Lukasz Wordliczek; Dariusz Stolicki; Ádám Feldmann

doi:10.5117/CCR2023.2.6.MATE

E-ISSN: 2665-9085

Authors: Akos Mate¹, Miklós Sebők, Lukasz Wordliczek, Dariusz Stolicki & Ádám Feldmann

¹ Center for Social Science, Budapest
Publisher: Amsterdam University Press
Source: Computational Communication Research, Volume 5, Issue 2, Jan 2023, p. 1
DOI: https://doi.org/10.5117/CCR2023.2.6.MATE
Language: English

Abstract

While large language models have revolutionised computational text analysis methods, the field is still tilted towards English language resources. Although pre-trained models exist for some smaller languages, coverage is far from universal, and pre-training large language models is an expensive and complicated task. This uneven language coverage limits the geographical and linguistic scope of comparative social research. We propose a solution that sidesteps these issues by leveraging transfer learning and open-source machine translation. We use English as a bridge language between Hungarian and Polish bills and laws to solve a classification task related to the Comparative Agendas Project (CAP) coding scheme. Using the Hungarian corpus as training data for model fine-tuning, we categorise the Polish laws into 20 CAP categories. In doing so, we compare the performance of Transformer-based deep learning models (monolingual, such as BERT, and multilingual, such as XLM-RoBERTa) and machine learning algorithms (e.g., SVM). Results show that the fine-tuned large language models outperform the traditional supervised learning benchmarks but are themselves surpassed by the machine translation approach. Overall, the proposed solution demonstrates a viable option for applying a transfer learning framework to low-resource languages and achieving state-of-the-art results without requiring expensive pre-training.
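
The bridge-language idea in the abstract can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the glossaries `HU_EN` and `PL_EN` and the function `translate` are hypothetical stand-ins for an open-source machine translation system, the small `NaiveBayes` class stands in for a fine-tuned language model, and the example texts and the two labels are invented for illustration. Only the overall flow follows the abstract: translate both corpora into English, train on the labelled Hungarian side, and classify the Polish side.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy glossaries standing in for a real machine translation system.
HU_EN = {"kórház": "hospital", "egészségügy": "health",
         "adó": "tax", "költségvetés": "budget"}
PL_EN = {"szpital": "hospital", "zdrowie": "health",
         "podatek": "tax", "budżet": "budget"}


def translate(text, glossary):
    """Placeholder for an MT call: word-by-word glossary lookup into English."""
    return " ".join(glossary.get(tok, tok) for tok in text.lower().split())


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing.

    Assumption: this replaces the fine-tuned Transformer purely to keep
    the sketch self-contained and runnable without external libraries.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            s = math.log(self.class_counts[label] / total)
            for tok in text.split():
                if tok in self.vocab:  # ignore words unseen in training
                    s += math.log((counts[tok] + 1) / denom)
            return s

        return max(self.class_counts, key=score)


# Labelled Hungarian training documents (invented examples).
hu_train = [("kórház egészségügy", "Health"),
            ("adó költségvetés", "Macroeconomics")]
# Unlabelled Polish documents to classify (invented examples).
pl_docs = ["szpital zdrowie", "podatek budżet"]

# Step 1: move both corpora into the bridge language (English).
train_texts = [translate(text, HU_EN) for text, _ in hu_train]
train_labels = [label for _, label in hu_train]
test_texts = [translate(text, PL_EN) for text in pl_docs]

# Step 2: train on translated Hungarian, predict on translated Polish.
model = NaiveBayes().fit(train_texts, train_labels)
predictions = [model.predict(text) for text in test_texts]
print(predictions)
```

Because both corpora end up in the same language, the classifier never has to share vocabulary across Hungarian and Polish directly; in a realistic setting the glossary lookup would be replaced by a translation model and the classifier by a fine-tuned English-language model.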

Article Type: Research Article

Keyword(s): classification; Comparative Agendas Project; deep learning; machine learning; natural language processing; policy topics

