Een corpus waar alle constructies in gevonden zouden moeten kunnen worden? Corpusonderzoek met behulp van automatisch gegenereerde syntactische annotatie

Jelke Bloem

doi:10.5117/NEDTAA2020.1.003.BLOE

ISSN: 1384-5845
E-ISSN: 2352-1171

Publisher: Amsterdam University Press
Source: Nederlandse Taalkunde, Volume 25, Issue 1, Apr 2020, p. 39 - 71
DOI: https://doi.org/10.5117/NEDTAA2020.1.003.BLOE
Language: English
Published online: 01 Apr 2020

Abstract

In this contribution, I discuss the use of automatic syntactic annotation in Dutch corpus research, using a case study of five-verb clusters. Large amounts of text can be annotated automatically, but the parser makes mistakes, while correct annotation is very important in linguistic research. How much of a problem is this, and how can we learn about the extent of these parsing mistakes? There are several approaches to evaluating the quality of automatic annotation for specific research questions. I demonstrate these approaches for the case study at hand, which will help us to make claims based on automatically annotated corpus data with greater confidence.




Article Type: Research Article

Keyword(s): automatic annotation; corpus linguistics; evaluation; verb clusters
