150 years of written Dutch: The construction of the Dutch Corpus of Contemporary and Late Modern Periodicals

Jozefien Piersoul; Robbert De Troij; Freek Van de Velde

doi:10.5117/NEDTAA2021.3.002.PIER

ISSN: 1384-5845
E-ISSN: 2352-1171

Authors: Jozefien Piersoul¹, Robbert De Troij², Freek Van de Velde³

Affiliations: ¹ KU Leuven ² KU Leuven & Radboud University ³ KU Leuven
Publisher: Amsterdam University Press
Source: Nederlandse Taalkunde, Volume 26, Issue 3, Dec 2021, p. 339 - 362
DOI: https://doi.org/10.5117/NEDTAA2021.3.002.PIER
Language: English

Abstract

In this article, we present a new corpus spanning 163 years of written Dutch. This Dutch Corpus of Contemporary and Late Modern Periodicals (Dutch C-CLAMP) comprises 47,738 part-of-speech-tagged articles published in Dutch periodicals from 1837 until 1999, totaling approximately 200 million tokens. We explain the measures we took to overcome the shortcomings of existing corpora of historical Dutch covering the same period. We provide a detailed description of how the corpus has been compiled and enriched, covering text markup, preprocessing of the data (including foreign-language recognition and spelling normalization), and the enrichment of both the textual data and the metadata on the authors of the corpus files. We also carry out two case studies to illustrate the reliability of the corpus.


2021-12-01




Keyword(s): corpus compilation; cultural periodicals; historical linguistics; Late-Modern Dutch; part-of-speech tagger
