2004
Volume 26, Issue 3
  • ISSN: 1384-5845
  • E-ISSN: 2352-1171

Abstract

In this article, we present a new corpus spanning 163 years of written Dutch. This corpus comprises 47,738 part-of-speech-tagged articles published in Dutch periodicals from 1837 until 1999, totaling approximately 200 million tokens. We explain the measures we took to overcome the shortcomings of existing corpora of historical Dutch covering the same period. We provide a detailed description of how the corpus has been compiled and enriched. Several aspects are covered: text markup; preprocessing of the data, including foreign-language recognition and spelling normalization; and the enrichment of both the textual data and the metadata on the authors of the corpus files. We also carry out two case studies to illustrate the reliability of the corpus.

/content/journals/10.5117/NEDTAA2021.3.002.PIER
2021-12-01
2022-01-20
