Word Embedding Enrichment for Dictionary Construction: An Example of Incivility in Cantonese

Hai Liang; Yee Man Margaret Ng; Nathan L.T. Tsang

doi:10.5117/CCR2023.1.10.LIAN

E-ISSN: 2665-9085

Authors: Hai Liang¹, Yee Man Margaret Ng² & Nathan L.T. Tsang³

¹ The Chinese University of Hong Kong ² Department of Journalism, University of Illinois Urbana-Champaign ³ Department of Sociology, The University of Southern California
Publisher: Amsterdam University Press
Source: Computational Communication Research, Volume 5, Issue 1, Jan 2023, p. 1
DOI: https://doi.org/10.5117/CCR2023.1.10.LIAN
Language: English

Abstract

Dictionary-based methods remain valuable for measuring concepts in text, even though supervised machine learning is now widely used in communication research. The present study proposes a semi-automatic, easily implemented method for building and enriching dictionaries based on word embeddings. As an example, we create a dictionary of political incivility that contains vulgarity and name-calling words in Cantonese. The study shows that dictionary-based classification outperforms supervised machine learning methods, including deep neural network models. Furthermore, a small number of random seed words can generate a highly accurate dictionary. However, the uncivil content detected is only weakly correlated with perceptions of incivility, as we demonstrate in a population-based survey experiment. The strengths and limitations of dictionary-based methods are discussed.
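
The abstract summarizes the approach without giving the procedure. As a rough illustration only, one common way to enrich a seed dictionary with word embeddings is to rank the vocabulary by cosine similarity to the seed words and pass the top candidates to a human reviewer. The sketch below assumes that setup; the function name `enrich_dictionary`, the `top_k` and `min_sim` parameters, and the toy vectors with placeholder tokens are all hypothetical and are not taken from the article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def enrich_dictionary(seeds, embeddings, top_k=2, min_sim=0.8):
    """Rank non-seed words by their best cosine similarity to any seed word.

    Hypothetical helper: the cutoffs are illustrative assumptions, and the
    returned candidates are meant for manual review, not automatic inclusion.
    """
    seed_vectors = [embeddings[s] for s in seeds if s in embeddings]
    if not seed_vectors:
        return []
    candidates = {}
    for word, vec in embeddings.items():
        if word in seeds:
            continue
        best = max(cosine(vec, sv) for sv in seed_vectors)
        if best >= min_sim:
            candidates[word] = best
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Toy 3-dimensional vectors with placeholder tokens (not real data).
embeddings = {
    "insult_a": [0.90, 0.10, 0.00],
    "insult_b": [0.80, 0.20, 0.10],
    "insult_c": [0.85, 0.15, 0.05],
    "policy": [0.00, 0.90, 0.40],
    "budget": [0.10, 0.80, 0.50],
}
seeds = {"insult_a"}

candidates = enrich_dictionary(seeds, embeddings)
for word, similarity in candidates:
    print(f"{word}\t{similarity:.3f}")
```

In this toy setup the two placeholder tokens close to the seed are returned as candidates, while the unrelated tokens fall below the similarity cutoff. A real pipeline would load trained embeddings and keep the human review step that makes the method semi-automatic.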




Article Type: Research Article

Keyword(s): Cantonese; dictionary construction; machine learning; political incivility; swearing

