A Weakly Supervised and Deep Learning Method for an Additive Topic Analysis of Large Corpora | Amsterdam University Press Journals Online
2004
Volume 3, Issue 1
  • E-ISSN: 2665-9085

Abstract

The collaborative effort of theory-driven content analysis can benefit significantly from the use of topic analysis methods, which allow researchers to add more categories while developing or testing a theory. This additive approach enables the reuse of previous efforts of analysis or even the merging of separate research projects, thereby making these methods more accessible and increasing the discipline’s ability to create and share content analysis capabilities. This paper proposes a weakly supervised topic analysis method that uses both a low-cost unsupervised method to compile a training set and supervised deep learning as an additive and accurate text classification method. We test the validity of the method, specifically its additivity, by comparing the results of the method after adding 200 categories to an initial number of 450. We show that the suggested method provides a foundation for a low-cost solution for large-scale topic analysis.
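
The two-stage idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the keyword seed lists stand in for the unsupervised step that compiles a training set, and a simple per-category word-count scorer stands in for the supervised deep learning classifier. The sketch shows only the additivity property: because each category is trained independently, adding a new category leaves the predictions of the existing ones unchanged.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately crude placeholder)."""
    return text.lower().split()


def weak_label(docs, seeds):
    """Hypothetical unsupervised step: a document receives a category
    label if it contains at least one of that category's seed words."""
    return [
        {cat for cat, words in seeds.items() if words & set(tokenize(doc))}
        for doc in docs
    ]


def train_category(docs, labels, cat):
    """Train one independent binary scorer for `cat`.
    Each word's weight is (#positive docs containing it) - (#negative docs)."""
    weights = Counter()
    for doc, doc_labels in zip(docs, labels):
        sign = 1 if cat in doc_labels else -1
        for tok in set(tokenize(doc)):
            weights[tok] += sign
    return weights


def predict(models, text):
    """Return every category whose summed word weights are positive."""
    toks = set(tokenize(text))
    return {cat for cat, w in models.items() if sum(w[t] for t in toks) > 0}


def add_category(models, docs, cat, seed_words):
    """Additive step: train a scorer for the new category only.
    Existing scorers in `models` are not retrained or modified."""
    labels = weak_label(docs, {cat: seed_words})
    models[cat] = train_category(docs, labels, cat)
    return models


# Toy corpus and seed lists (invented for this example).
docs = [
    "the central bank raised interest rates",
    "inflation and interest rates worry markets",
    "the striker scored a late goal",
    "the team won the match with a late goal",
    "parliament passed the election law",
]
seeds = {"economy": {"inflation", "rates"}, "sports": {"goal", "match"}}

labels = weak_label(docs, seeds)
models = {cat: train_category(docs, labels, cat) for cat in seeds}

before = predict(models, "interest rates fell")
add_category(models, docs, "politics", {"parliament", "election"})
after = predict(models, "interest rates fell")

print(before == after)  # existing predictions are unaffected by the new category
```

In this toy setting a new category costs one extra seed list and one extra independent scorer, which is the sense in which the approach is additive; a neural classifier with independent per-category outputs could be swapped in for `train_category` without changing the surrounding logic.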

/content/journals/10.5117/CCR2021.1.002.FOGE
2021-03-01
2024-04-19
