2004
Volume 25, Issue 1
  • ISSN: 1384-5845
  • E-ISSN: 2352-1171

Abstract

In this contribution, I discuss the use of automatic syntactic annotation in Dutch corpus research, using a case study of five-verb clusters. Large amounts of text can be annotated automatically, but parsers make mistakes, and linguistic research depends on correct annotation. How serious is this problem, and how can we determine the extent of these parsing mistakes? There are several approaches to evaluating the quality of automatic annotation for a specific research question. I demonstrate these approaches for the case study at hand, so that claims based on automatically annotated corpus data can be made with greater confidence.

/content/journals/10.5117/NEDTAA2020.1.003.BLOE
2020-04-01
2022-05-21

  • Article Type: Research Article
Keyword(s): automatic annotation; corpus linguistics; evaluation; verb clusters