The Knesset Corpus

Project description

Objective: Collect, maintain, organize and annotate the Hebrew protocols of the Knesset (Israeli parliament).

Researchers: Gili Goldin, Ella Rabinovich, Shuly Wintner. In collaboration with Noam Ordan and Nick Howell (IAHLT).

Status: Complete

Funding: Israeli Ministry of Science and Technology grant no. 3-17990.

Abstract

We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament in the last three decades. Sentences are annotated with morpho-syntactic information and named entities, and are associated with detailed meta-information reflecting demographic and political properties of the speakers, based on a large database of parliament members and factions that we compiled. We discuss the structure and composition of the corpus and the various processing steps we applied to it. To demonstrate the utility of this novel dataset we present two use cases. We show that the corpus can be used to examine historical developments in the style of political discussions by showing a reduction in lexical richness in the proceedings over time. We also investigate some differences between the styles of male and female speakers. These use cases exemplify the potential of the corpus to shed light on important trends in the Israeli society, supporting research in linguistics, political science, communication, law, etc.

Resources

The datasets are available on HuggingFace. Other resources are available on GitHub.

Publications

Gili Goldin, Nick Howell, Noam Ordan, Ella Rabinovich and Shuly Wintner. The Knesset corpus: an annotated corpus of Hebrew parliamentary proceedings. Language Resources and Evaluation 59:2973–3004, 2025.
