2004
Volume 25, Issue 1
  • ISSN: 1384-5845
  • E-ISSN: 2352-1171

Abstract

Corpora are a useful and important source of evidence for linguistic research, but they are not the only kind of evidence, have no special status as evidence, and have their limitations. Recent user-friendly applications such as GrETEL make it very easy to search large, richly annotated corpora on the basis of an example sentence, without knowledge of a query language or of the exact nature of the linguistic annotations. It is therefore tempting to use these applications intensively. That is fine, but it is also dangerous in some respects: in many cases, interpreting the results correctly requires the researcher to be aware of the precise nature of the linguistic annotations and of the way in which the user-friendly interface generates a query from the example sentence. I illustrate this with several examples. I also sketch some methods for avoiding or mitigating these dangers and argue that the applications should support these methods as well, in as user-friendly a manner as possible.

/content/journals/10.5117/NEDTAA2020.1.002.ODIJ
2020-04-01
2021-12-07
  • Article Type: Research Article
  • Keyword(s): corpus analysis; corpus applications; reliability of corpus analysis