Objective: To collect, maintain, organize, and annotate written Hebrew corpora, and to design and implement a finite-state, rule-based morphological analyzer for Hebrew that can be easily extended and improved.
Researchers: Shlomo Yona, Shuly Wintner. In collaboration with Alon Itai (Technion)
Status: Complete
Funding: Israeli Ministry of Science and Technology, as part of the Knowledge Center for Hebrew Language Telecommunication.
This project has two goals. The first is to maintain a collection of written Hebrew texts, taken mostly from newspapers, organize them structurally using XML, and annotate them morphologically (syntactic annotation may follow in the future). We currently have over 2500 newspaper articles (mostly from HaAretz, Maariv and Yediot) and over 40,000 short newswire articles from Arutz 7, totalling over one million word tokens.
The texts are annotated morphologically using an automatic morphological analyzer; two versions of the corpus exist: one in which each word is assigned all its analyses, and another in which morphological ambiguity is resolved. We are currently working on a manually annotated subset of the corpus, whose analyses are verified. The articles are represented in XML, using dedicated schemas that we have designed.
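The relation between the two corpus versions can be sketched in a few lines of Python. This is only an illustration: the element names (`article`, `sentence`, `token`, `analysis`), the attributes, and the `score` values are hypothetical and do not reproduce the project's actual schemas, and picking the highest-scoring analysis is a stand-in for whatever disambiguation method is really used.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names, attributes, and scores are made up
# for illustration; they are NOT the project's actual XML schema.
AMBIGUOUS = """
<article id="a1">
  <sentence id="s1">
    <token surface="ספר">
      <analysis lemma="sefer" pos="noun" score="0.6"/>
      <analysis lemma="safar" pos="verb" score="0.3"/>
      <analysis lemma="sapar" pos="noun" score="0.1"/>
    </token>
  </sentence>
</article>
"""


def disambiguate(root):
    """Keep only the highest-scoring analysis of every token (in place)."""
    for token in root.iter("token"):
        analyses = token.findall("analysis")
        if not analyses:
            continue
        best = max(analyses, key=lambda a: float(a.get("score", "0")))
        for analysis in analyses:
            if analysis is not best:
                token.remove(analysis)
    return root


def analyses_per_token(root):
    """Return (surface form, list of lemmas) for every token."""
    return [
        (t.get("surface"), [a.get("lemma") for a in t.findall("analysis")])
        for t in root.iter("token")
    ]


# Version 1: every token carries all of its analyses.
full = ET.fromstring(AMBIGUOUS)
# Version 2: morphological ambiguity resolved to a single analysis.
resolved = disambiguate(ET.fromstring(AMBIGUOUS))

print(analyses_per_token(full))
print(analyses_per_token(resolved))
```

Keeping both versions in the same markup means that tools written for the fully ambiguous corpus can read the disambiguated one without changes.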
The second goal is to develop a finite-state morphological analyzer and generator for Hebrew. We concentrate on inflectional morphology for two reasons: much of the inflectional morphology of Semitic languages can be modeled naturally with finite-state operations, and inflectional morphology is sufficient for most practical applications.
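To give a feel for what finite-state analysis of inflection looks like, here is a minimal toy transducer in Python. It is a sketch under strong assumptions, not the project's grammar: it works on transliterated forms, covers only masculine nouns that take the plural suffix -im without any stem change (sus "horse", tik "bag"), and the `FST` class, the `build_noun_analyzer` helper, and the tag names are all invented for this example.

```python
class FST:
    """A tiny nondeterministic finite-state transducer (illustrative only)."""

    def __init__(self):
        self.arcs = {}      # state -> list of (input char, output string, next state)
        self.finals = {}    # accepting state -> output emitted on acceptance
        self.n_states = 1   # state 0 is the start state

    def new_state(self):
        state = self.n_states
        self.n_states += 1
        return state

    def add_path(self, src, inp, out, dst=None):
        """Add a chain of arcs reading `inp`; all of `out` is emitted on the first arc."""
        state = src
        for i, ch in enumerate(inp):
            last = i == len(inp) - 1
            nxt = dst if (last and dst is not None) else self.new_state()
            self.arcs.setdefault(state, []).append((ch, out if i == 0 else "", nxt))
            state = nxt
        return state

    def analyze(self, word):
        """Return every lexical string the transducer pairs with `word`."""
        results = []
        stack = [(0, 0, "")]  # (state, position in word, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(word):
                if state in self.finals:
                    results.append(out + self.finals[state])
                continue
            for ch, emitted, nxt in self.arcs.get(state, ()):
                if ch == word[pos]:
                    stack.append((nxt, pos + 1, out + emitted))
        return sorted(results)


def build_noun_analyzer(stems):
    """Stems followed by an empty suffix (singular) or by 'im' (plural)."""
    fst = FST()
    stem_end = fst.new_state()
    for stem in stems:
        fst.add_path(0, stem, stem, dst=stem_end)
    fst.finals[stem_end] = "+Noun+Masc+Sg"        # bare stem: singular
    plural_end = fst.add_path(stem_end, "im", "+Noun+Masc+Pl")
    fst.finals[plural_end] = ""                   # tags already emitted on the suffix
    return fst


analyzer = build_noun_analyzer(["sus", "tik"])
print(analyzer.analyze("susim"))
print(analyzer.analyze("tik"))
print(analyzer.analyze("susot"))
```

Because each arc pairs an input symbol with an output string, the same network could in principle be traversed in the opposite direction to generate surface forms from lexical descriptions, which is why a single finite-state grammar can serve as both analyzer and generator.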
All corpora were available from the Knowledge Center for Processing Hebrew, which unfortunately no longer exists. Some of them, along with other resources for Hebrew, are distributed by Elazar Gershuni.
The morphological grammar was also available from the Knowledge Center for Processing Hebrew; the manually curated Hebrew lexicon that the analyzer relies on is still available.
Shlomo Yona and Shuly Wintner. A finite-state morphological grammar of Hebrew. Natural Language Engineering 14(2):173-190, April 2008.
Alon Itai and Shuly Wintner. Language resources for Hebrew. Language Resources and Evaluation 42(1):75-98, March 2008.
Alon Itai, Shuly Wintner and Shlomo Yona. A Computational Lexicon of Contemporary Hebrew. In Proceedings of LREC-2006, Genoa, Italy, May 2006.
Shlomo Yona and Shuly Wintner. A finite-state morphological grammar of Hebrew. In Proceedings of the ACL-2005 Workshop on Computational Approaches to Semitic Languages, Ann Arbor, MI, June 2005.
Shuly Wintner and Shlomo Yona. Resources for Processing Hebrew. In Proceedings of the MT Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September 2003.