Volume 4, Number 1 (2022) • E-ISSN: 2665-9085

Abstract

Political activity on social media presents a data-rich window into political behavior, but the vast amount of data means that almost all content analyses of social media require a data labeling step. However, most automated machine classification methods ignore the multimodality of posted content, focusing either on text or images. State-of-the-art vision-and-language models are unusable for most political science research: they require all observations to have both image and text and require computationally expensive pretraining. This paper proposes a novel vision-and-language framework called multimodal representations using modality translation (MARMOT). MARMOT presents two methodological contributions: it can construct representations for observations missing image or text, and it replaces the computationally expensive pretraining with modality translation. MARMOT outperforms an ensemble text-only classifier in 19 of 20 categories in multilabel classifications of tweets reporting election incidents during the 2016 U.S. general election. Moreover, MARMOT shows significant improvements over the results of benchmark multimodal models on the Hateful Memes dataset, improving the best result set by VisualBERT in terms of accuracy from 0.6473 to 0.6760 and area under the receiver operating characteristic curve (AUC) from 0.7141 to 0.7530. The GitHub repository for MARMOT can be found at github.com/patrickywu/MARMOT.
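
To make the abstract’s two design ideas concrete, the sketch below is a minimal, hypothetical illustration (plain PyTorch), not the authors’ implementation: image features are linearly “translated” into the text embedding space before a shared Transformer encoder, and the forward pass accepts observations that are missing either the image or the text. All class names, layer sizes, and the image-feature interface are illustrative assumptions; the actual MARMOT code is in the linked GitHub repository.

```python
# Minimal, illustrative sketch only -- NOT the MARMOT implementation.
# Idea 1 ("modality translation"): project image features into the text embedding space.
# Idea 2: accept observations missing either the image or the text.
import torch
import torch.nn as nn

class ToyModalityTranslationClassifier(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, img_feat_dim=2048, n_labels=20):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)        # token embeddings
        self.translate = nn.Linear(img_feat_dim, d_model)          # hypothetical image -> text-space translation
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # shared encoder over the joint sequence
        self.head = nn.Linear(d_model, n_labels)                   # per-label logits for multilabel output

    def forward(self, token_ids=None, image_features=None):
        parts = []
        if token_ids is not None:                                  # text modality present
            parts.append(self.text_embed(token_ids))
        if image_features is not None:                             # image modality present
            parts.append(self.translate(image_features))
        assert parts, "need at least one modality"
        h = self.encoder(torch.cat(parts, dim=1))                  # contextualize one joint sequence
        return self.head(h.mean(dim=1))                            # mean-pool, then classify

model = ToyModalityTranslationClassifier()
text_only  = model(token_ids=torch.randint(0, 30522, (1, 12)))
image_only = model(image_features=torch.randn(1, 5, 2048))
both       = model(token_ids=torch.randint(0, 30522, (1, 12)),
                   image_features=torch.randn(1, 5, 2048))
```

For a 20-category multilabel task like the election-incident classification described in the abstract, such a head would be trained with an independent sigmoid per label (e.g., BCEWithLogitsLoss) rather than a softmax over categories.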

References

  1. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv: 1607.06450 [stat.ML].
  2. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv: 1409.0473 [cs.CL].
  3. Barberá, P., Casas, A., Nagler, J., Egan, P. J., Bonneau, R., Jost, J. T., & Tucker, J. A. (2019). Who leads? Who follows? Measuring issue attention and agenda setting by legislators and the mass public using social media data. American Political Science Review, 113(4), 883–901. https://doi.org/10.1017/s0003055419000352
  4. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
  5. Bloem, P. (2019). Transformers from scratch. http://www.peterbloem.nl/blog/transformers
  6. Casas, A., & Webb Williams, N. (2019). Images that matter: Online protests and the mobilizing role of pictures. Political Research Quarterly, 72(2), 360–375. https://doi.org/10.1177/1065912918786805
  7. Chang, C., & Masterson, M. (2020). Using word order in political text classification with long short-term memory models. Political Analysis, 28(3), 395–411. https://doi.org/10.1017/pan.2019.46
  8. Chilamkurthy, S. (2017). Transfer learning for computer vision tutorial. https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#transfer-learning-for-computer-vision-tutorial
  9. Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated hate speech detection and the problem of offensive language. arXiv: 1703.04009 [cs.CL].
  10. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. https://doi.org/10.1109/CVPR.2009.5206848
  11. Desai, K., & Johnson, J. (2021). VirTex: Learning visual representations from textual annotations. arXiv: 2006.06666 [cs.CV].
  12. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv: 1810.04805 [cs.CL].
  13. Fong, C., & Tyler, M. (2020). Machine learning predictions as regression covariates. Political Analysis, 1–18. https://doi.org/10.1017/pan.2020.38
  14. Hastie, T., Tibshirani, R., & Friedman, J. (2000). The elements of statistical learning: Data mining, inference, and prediction. Springer.
  15. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv: 1512.03385 [cs.CV].
  16. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  17. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get M for free. arXiv: 1704.00109 [cs.LG].
  18. Kiela, D., Bhooshan, S., Firooz, H., & Testuggine, D. (2019). Supervised multimodal bitransformers for classifying images and text. arXiv: 1909.02950 [cs.CL].
  19. Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., & Testuggine, D. (2020). The Hateful Memes challenge: Detecting hate speech in multimodal memes. arXiv: 2005.04790 [cs.AI].
  20. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv: 1412.6980 [cs.LG].
  21. Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv: 1612.01474 [stat.ML].
  22. Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, II-1188–II-1196. https://doi.org/10.5555/3044805.3045025
  23. Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., & Chang, K.-W. (2019). VisualBERT: A simple and performant baseline for vision and language. arXiv: 1908.03557 [cs.CV].
  24. Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., & Dollár, P. (2014). Microsoft COCO: Common objects in context. arXiv: 1405.0312 [cs.CV].
  25. Liu, K., Li, Y., Xu, N., & Natarajan, P. (2018). Learn to combine modalities in multimodal deep learning. arXiv: 1805.11730 [stat.ML].
  26. Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv: 1908.02265 [cs.CV].
  27. MacAvaney, S., Yao, H.-R., Yang, E., Russell, K., Goharian, N., & Frieder, O. (2019). Hate speech detection: Challenges and solutions. PLOS ONE, 14(8), 1–16. https://doi.org/10.1371/journal.pone.0221152
  28. Mebane, W. R., Jr., Wu, P. Y., Woods, L., Klaver, J., Pineda, A., & Miller, B. (2018). Observing election incidents in the United States via Twitter: Does who observes matter? [Working Paper]. http://websites.umich.edu/~wmebane/mw18B.pdf
  29. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, 3111–3119. https://doi.org/10.5555/2999792.2999959
  30. Mogadala, A. (2015). Polylingual multimodal learning. ECML PKDD Doctoral Consortium. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.722.4854&rep=rep1&type=pdf
  31. Pan, J., & Siegel, A. A. (2020). How Saudi crackdowns fail to silence online dissent. American Political Science Review, 114(1), 109–125. https://doi.org/10.1017/S0003055419000650
  32. Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359. https://doi.org/10.1109/TKDE.2009.191
  33. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. https://doi.org/10.3115/v1/d14-1162
  34. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv: 1506.01497 [cs.CV].
  35. Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., & Goel, V. (2016). Self-critical sequence training for image captioning. arXiv: 1612.00563 [cs.LG].
  36. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
  37. Rush, A. (2018). The annotated transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
  38. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. arXiv: 1801.04381 [cs.CV].
  39. Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556–2565. https://doi.org/10.18653/v1/P18-1238
  40. Siegel, A. A., & Badaan, V. (2020). #No2Sectarianism: Experimental approaches to reducing sectarian hate speech online. American Political Science Review, 114(3), 837–855. https://doi.org/10.1017/S0003055420000283
  41. Siegel, A. A., Nikitin, E., Barberá, P., Sterling, J., Pullen, B., Bonneau, R., Nagler, J., & Tucker, J. A. (2021). Trumping hate on Twitter? Online hate speech in the 2016 U.S. election campaign and its aftermath. Quarterly Journal of Political Science, 16(1), 71–104. https://doi.org/10.1561/100.00019045
  42. Singh, A., Goswami, V., & Parikh, D. (2020). Are we pretraining it right? Digging deeper into visio-linguistic pretraining. arXiv: 2004.08744 [cs.CV].
  43. Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., & Dai, J. (2019). VL-BERT: Pre-training of generic visual-linguistic representations. arXiv: 1908.08530 [cs.CV].
  44. Sun, C., Myers, A., Vondrick, C., Murphy, K., & Schmid, C. (2019). VideoBERT: A joint model for video and language representation learning. arXiv: 1904.01766 [cs.CV].
  45. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception architecture for computer vision. arXiv: 1512.00567 [cs.CV].
  46. Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. arXiv: 1908.07490 [cs.CL].
  47. Terechshenko, Z., Linder, F., Padmakumar, V., Liu, M., Nagler, J., Tucker, J. A., & Bonneau, R. (2021). A comparison of methods in political science text classification: Transfer learning language models for politics [Working Paper].
  48. Tian, L., Zheng, D., & Zhu, C. (2013). Image classification based on the combination of text features and visual features. International Journal of Intelligent Systems, 28(3), 242–256. https://doi.org/10.1002/int.21567
  49. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv: 1706.03762 [cs.CL].
  50. Vig, J. (2019). A multiscale visualization of attention in the transformer model. arXiv: 1906.05714 [cs.HC].
  51. Wang, S., McCormick, T. H., & Leek, J. T. (2020). Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48), 30266–30275. https://doi.org/10.1073/pnas.2001238117
  52. Wang, W., Tran, D., & Feiszli, M. (2019). What makes training multi-modal classification networks hard? arXiv: 1905.12681 [cs.CV].
  53. Webb Williams, N., Casas, A., & Wilkerson, J. D. (2020). Images as data for social science research: An introduction to convolutional neural nets for image classification. Cambridge University Press. https://doi.org/10.1017/9781108860741
  54. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., & Brew, J. (2019). HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv: 1910.03771 [cs.CL].
  55. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., … Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv: 1609.08144 [cs.CL].
  56. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. arXiv: 1502.03044 [cs.LG].
  57. Zahavy, T., Magnani, A., Krishnan, A., & Mannor, S. (2016). Is a picture worth a thousand words? A deep multi-modal fusion architecture for product classification in e-commerce. arXiv: 1611.09534 [cs.CV].
  58. Zhang, H., & Pan, J. (2019). CASM: A deep-learning approach for identifying collective action events with text and image data from social media. Sociological Methodology, 49(1), 1–57. https://doi.org/10.1177/0081175019860244