Crosslinguistic Influences between Language Varieties

Project description

Objective: To study the special characteristics of non-native language, native language and translated language.

Researchers: Gili Goldin, Ella Rabinovich and Shuly Wintner. Collaborating with Yulia Tsvetkov (LTI, CMU), Nathan Schneider (Georgetown) and Noah Smith (University of Washington).

Status: Complete

Funding: BSF grant no. 2017699, United States National Science Foundation grant no. 1813153.

Abstract

Most people in the world today are multilingual. Multilingualism is a gradual phenomenon: it ranges from language learners at various levels of competence, through highly fluent, advanced nonnative speakers, to native speakers who have also mastered other languages, and finally to translators. Previous research has extensively examined text from second language learners who have not yet achieved fluency. This project focused on text produced by nonnative but highly fluent speakers. Even without the grammatical errors characteristic of learner language that are readily apparent to native speakers, fluent but nonnative language differs subtly from native, monolingual language in the frequencies of certain concepts, constructions, and collocations. This raises the possibility that language technologies – typically trained on "standard" native language – are systematically biased in ways that render them less useful for the majority of users.

In this project we developed computational methods that examined large datasets of language varieties in order to detect the subtle influences of the native language on a foreign language. We further developed natural language processing models that address a much larger portion of the world’s population – in particular, nonnative speakers – thereby helping to detect and mitigate bias inherent in current methods. We developed state-of-the-art techniques for native language identification of fluent authors that facilitate potential applications in language learning, cybersecurity, geolocation, personalization, and more. Some of the methods we proposed shed light on entity portrayals in narratives and detect veiled biases in texts, which makes them applicable to social science research. We also touched upon theoretical questions in multilingualism, such as the overuse of cognates by nonnative speakers, thereby improving our understanding of cognition in the bilingual mind. The project openly shares implementations and data, and includes educational activities that bring research into the classroom and well beyond the university.

Resources

If you are using the following resources, please cite Rabinovich et al. (2018) and/or Goldin et al. (2018).

The Reddit-L2 corpus (8GB)
Reddit-L2 corpus cleanup code
Reddit-L2 chunks as in Goldin et al. (2018). See the readme file.
Same dataset, where the data of all authors with the same L1 constitute one file. See the readme file.

Publications

Sachin Kumar, Antonios Anastasopoulos, Shuly Wintner and Yulia Tsvetkov. Machine Translation into Low-resource Language Varieties. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 110--121, August 2021. 📖
Sachin Kumar, Shuly Wintner, Noah A. Smith and Yulia Tsvetkov. Topics to Avoid: Demoting Latent Confounds in Text Classification. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4153--4163, November 2019. 📖
Anjalie Field, Doron Kliger, Shuly Wintner, Jennifer Pan, Dan Jurafsky and Yulia Tsvetkov. Framing and Agenda-setting in Russian News: a Computational Analysis of Intricate Political Strategies. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3570--3580, October 2018. 📖
Ella Rabinovich, Yulia Tsvetkov and Shuly Wintner. Native Language Cognate Effects on Second Language Lexical Choice. Transactions of the Association for Computational Linguistics 6:329--342, 2018. 📖
Gili Goldin, Ella Rabinovich and Shuly Wintner. Native Language Identification with User Generated Content. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 3591--3601, Brussels, Belgium, November 2018. 📖
