Abstract
The social science toolkit for computational text analysis is still very much in the making. We know surprisingly little about how to produce valid insights from large amounts of multilingual text for comparative social science research. In this paper, we test several recent innovations from deep transfer learning to help advance the computational toolkit for social science research in multilingual settings. We investigate the extent to which prior language and task knowledge stored in the parameters of modern language models is useful for enabling multilingual research; we investigate the extent to which these algorithms can be fruitfully combined with machine translation; and we investigate whether these methods are accurate, practical, and valid in multilingual settings – three essential conditions for lowering the language barrier in practice. We use two datasets with texts in 12 languages from 27 countries for our investigation. Our analysis shows that, based on these innovations, supervised machine learning can produce substantively meaningful outputs. Our BERT-NLI model, trained on only 674 or 1,674 texts in only one or two languages, can validly predict political party families’ stances towards immigration in eight other languages and ten other countries.
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 1-28 |
| Number of pages | 28 |
| Journal | Computational Communication Research |
| Volume | 5 |
| Issue number | 2 |
| DOIs | |
| Publication status | Published - Jan 2023 |
Keywords
- text-as-data
- machine learning
- multilingualism
- computational social sciences
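The BERT-NLI approach mentioned in the abstract recasts stance classification as natural language inference: each candidate stance is turned into a hypothesis sentence, and the model picks the stance whose hypothesis the text most strongly entails. A minimal sketch of that idea follows; the hypothesis template, stance labels, and the toy keyword scorer are illustrative assumptions standing in for the paper's actual fine-tuned multilingual BERT-NLI model.

```python
# Sketch of NLI-based stance classification. Assumption: the template wording,
# stance labels, and toy_scorer below are illustrative placeholders, not the
# paper's actual model or prompts.

def build_hypotheses(topic, stances):
    """Turn each candidate stance label into an NLI hypothesis sentence."""
    return {s: f"The author of this text is {s} {topic}." for s in stances}

def classify(text, hypotheses, entailment_scorer):
    """Return the stance whose hypothesis the text (premise) entails most."""
    scores = {s: entailment_scorer(text, h) for s, h in hypotheses.items()}
    return max(scores, key=scores.get)

# Placeholder scorer: a real pipeline would query a multilingual BERT-NLI
# model for the entailment probability of (premise, hypothesis).
def toy_scorer(premise, hypothesis):
    key = "sympathetic" if "welcome" in premise.lower() else "opposed to"
    return 1.0 if key in hypothesis else 0.0

hyps = build_hypotheses("immigration", ["sympathetic towards", "opposed to"])
print(classify("We should welcome refugees.", hyps, toy_scorer))
```

Because the label set lives in the hypotheses rather than in a fixed output layer, the same trained model can be reused for new stances or new languages, which is what makes the cross-lingual transfer reported in the abstract possible.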