A Weakly Supervised and Deep Learning Method for an Additive Topic Analysis of Large Corpora

Yair Fogel-Dror; Shaul R. Shenhav; Tamir Sheafer

doi:10.5117/CCR2021.1.002.FOGE

E-ISSN: 2665-9085

Publisher: Amsterdam University Press
Source: Computational Communication Research, Volume 3, Issue 1, Mar 2021, p. 29 - 59
DOI: https://doi.org/10.5117/CCR2021.1.002.FOGE
Language: English
Published online: 01 Mar 2021

Abstract

The collaborative effort of theory-driven content analysis can benefit significantly from the use of topic analysis methods, which allow researchers to add more categories while developing or testing a theory. This additive approach enables the reuse of previous efforts of analysis or even the merging of separate research projects, thereby making these methods more accessible and increasing the discipline’s ability to create and share content analysis capabilities. This paper proposes a weakly supervised topic analysis method that uses both a low-cost unsupervised method to compile a training set and supervised deep learning as an additive and accurate text classification method. We test the validity of the method, specifically its additivity, by comparing the results of the method after adding 200 categories to an initial number of 450. We show that the suggested method provides a foundation for a low-cost solution for large-scale topic analysis.
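
The two-stage idea in the abstract (a cheap labeling step that compiles a noisy training set, followed by a supervised classifier to which categories can be added) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the seed-keyword labeler, the toy documents, the names `SEEDS`, `weak_labels`, `train_binary`, and `add_category`, and the use of a small logistic regression in place of a deep network are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
from collections import Counter

# Toy corpus and seed keywords: both are hypothetical, for illustration only.
DOCS = [
    "parliament debates the budget and new tax rules",
    "inflation pushes the budget deficit higher",
    "the hospital hired a new doctor",
    "vaccine rollout reaches every clinic and hospital",
    "the team won the football match",
]
SEEDS = {
    "economy": {"tax", "budget", "inflation"},
    "health": {"hospital", "vaccine", "doctor"},
}

def tokenize(text):
    return text.lower().split()

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def weak_labels(docs, seeds):
    """Noisy 0/1 label per document: 1 if any seed keyword occurs in it."""
    return [1 if seeds & set(tokenize(d)) else 0 for d in docs]

def train_binary(docs, labels, epochs=200, lr=0.5):
    """Bag-of-words logistic regression trained by SGD.

    A stand-in for the supervised deep model; one binary model per category.
    """
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            feats = Counter(tokenize(doc))
            z = bias + sum(weights[w] * c for w, c in feats.items())
            grad = sigmoid(z) - y
            bias -= lr * grad
            for w, c in feats.items():
                weights[w] -= lr * grad * c
    return weights, bias

def predict(model, text):
    """Probability that `text` belongs to the model's category."""
    weights, bias = model
    feats = Counter(tokenize(text))
    return sigmoid(bias + sum(weights[w] * c for w, c in feats.items()))

def add_category(models, name, seeds, docs):
    """Add a category by training one new model; existing models are untouched."""
    models[name] = train_binary(docs, weak_labels(docs, seeds))

# Stage 1 + 2 for the initial categories.
models = {c: train_binary(DOCS, weak_labels(DOCS, s)) for c, s in SEEDS.items()}
```

Because each category in this sketch has its own independent binary model, calling `add_category(models, "sports", {"football", "match"}, DOCS)` leaves the scores of the existing categories unchanged. That is one simple way to obtain additivity; the paper's own architecture may achieve it differently.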

Article Type: Research Article

Keyword(s): computational content analysis; deep learning; natural language processing; topic analysis; weak supervision
