2004
Volume 8, Issue 1
  • E-ISSN: 2665-9085

Abstract

This paper presents a novel approach to classifying news articles by topic using only their URLs, addressing growing challenges in accessing article text due to paywalls and scraping restrictions. By fine-tuning a DistilBERT transformer model on URL data alone, I demonstrate topic classification performance that matches or exceeds traditional approaches requiring article text. Across three benchmark datasets spanning multiple languages and over 660,000 articles from more than 11,000 news domains, this URL-based topic classifier achieved superior F1 scores compared to both conventional machine learning methods and existing URL-based techniques. While this method requires more computational resources than simpler topic classification approaches, it dramatically reduces data collection requirements, offering researchers a practical alternative when text access is limited. These findings suggest that news article URLs contain richer semantic information than previously recognized, opening new possibilities for large-scale news content analysis in increasingly restrictive digital environments.
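As a rough illustration of why a URL can carry topical signal, the sketch below turns a news URL into a short, text-like string that a transformer tokenizer could consume. The function name `url_to_text`, the example URL, and the normalization steps (dropping `www.`, stripping common file extensions, splitting on punctuation) are assumptions made for illustration; the abstract does not say how, or whether, URLs are preprocessed before fine-tuning.

```python
import re
from urllib.parse import urlsplit, unquote


def url_to_text(url: str) -> str:
    """Convert a URL into a lowercase, space-separated string.

    Hypothetical preprocessing step, not the paper's documented pipeline.
    """
    parts = urlsplit(url)

    # Keep the domain, since the outlet itself may hint at the topic.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]

    # Decode percent-escapes and drop a trailing file extension, if any.
    path = unquote(parts.path)
    path = re.sub(r"\.(html?|php|aspx?)$", "", path)

    # Split on any run of non-word characters or underscores.
    # \W is Unicode-aware in Python 3, so non-English slugs survive.
    tokens = [t for t in re.split(r"[\W_]+", path) if t]

    return " ".join([host] + tokens).lower()


example = "https://www.example.com/politics/2024/05/senate-passes-budget-bill.html"
print(url_to_text(example))
```

In this made-up example the section name (`politics`) and the headline slug both survive, which is the kind of semantic content the abstract argues URLs contain. Tokenization and the DistilBERT fine-tuning itself are left out of the sketch.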

/content/journals/10.5117/CCR2026.1.1.HAGA
2026-01-01
2026-02-09