Objective: To define and classify multi-word expressions in Hebrew; develop a methodology for their lexical representation; incorporate them in an existing lexicon and a morphological processing system based upon it; and develop techniques for automatic acquisition of MWEs from corpora.
Researchers: Hassan Al-Haj, Yulia Tsvetkov, Hanna Fadida (Technion), Kayla Jacobs (Technion) and Shuly Wintner. Joint project with Alon Itai at the Technion.
Status: Complete
Funding: ISF (grant 1269/07).
Multi-word expressions (MWE) are lexical words consisting of more than a single orthographic word. Semantically, their meaning is non-compositional (i.e., it cannot be established from the meanings of their components); syntactically, they may function as words or as phrases; morphologically, their behavior is often idiosyncratic; and orthographically, they are written with intervening spaces. Many MWE are named entities.
The identification of MWE is an important task for a variety of NLP applications, ranging from information retrieval and building ontologies to machine translation. MWE are a challenge for computational processing of natural languages because they combine properties of words and phrases, and because phonological, morphological and orthographic processes apply to them differently than to ordinary tokens. In Hebrew, this challenge is especially acute due to the complex morphology and orthography of the language: morphological and orthographic processes in Hebrew apply to MWE in unique ways, complicating both morphological processing and automatic extraction of MWE.
We will develop theories and techniques for representing, analyzing and acquiring Hebrew MWE. Specifically, we will:
Develop an architecture for lexical specification of MWE in Hebrew, and extend an existing lexicon of the language with capabilities to store MWE;
Develop techniques for morphological processing of MWE in Hebrew, and extend an existing morphological processor (analyzer/generator) with capabilities to process MWE;
Develop techniques to extract MWE from monolingual and bilingual corpora, and populate the lexicon with automatically acquired MWE;
Evaluate the quality of the tools using state-of-the-art evaluation measures, and investigate their applicability to other languages with complex morphology and orthography, notably Arabic.
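To make the corpus-extraction goal concrete, the following is a minimal sketch of one common baseline for finding MWE candidates in a monolingual corpus: ranking adjacent word pairs by pointwise mutual information (PMI). It is an illustration only and not necessarily the method used in this project; the function name `rank_bigrams_by_pmi`, the `min_count` threshold and the transliterated toy corpus are all assumptions introduced here.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(sentences, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    `sentences` is a list of token lists. Pairs seen fewer than
    `min_count` times are skipped, since PMI is unreliable for rare pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))

    # Highest PMI first: these pairs co-occur more often than chance predicts.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus of transliterated Hebrew tokens.
# "beit sefer" (literally "house of book") means "school".
toy_corpus = [
    ["beit", "sefer", "gadol"],
    ["beit", "sefer", "chadash"],
    ["sefer", "gadol"],
    ["sefer", "chadash"],
    ["bayit", "gadol"],
    ["bayit", "chadash"],
]

ranked = rank_bigrams_by_pmi(toy_corpus)
for pair, score in ranked:
    print(pair, round(score, 2))
```

On this toy corpus the pair ("beit", "sefer") ranks first. A surface-form count of this kind treats "beit" and "bayit" as unrelated tokens, which hints at why morphologically rich languages such as Hebrew call for morphology-aware extraction rather than plain string statistics.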
Kayla Jacobs, Alon Itai and Shuly Wintner. Acronyms: Identification, Expansion and Disambiguation. Annals of Mathematics and Artificial Intelligence 88(5-6):517-532, 2020.
Livnat Herzig Sheinfux, Tali Arad Greshler, Nurit Melnik and Shuly Wintner. Verbal multiword expressions: Idiomaticity and flexibility. In Yannick Parmentier and Jakub Waszczuk (eds.), Representation and parsing of multiword expressions: Current trends, chapter 2, pages 35-68. Berlin: Language Science Press, 2019.
Hanna Fadida, Alon Itai and Shuly Wintner. A Hebrew Verb-Complement Dictionary. Language Resources and Evaluation 48(2):249-278, June 2014.
Hassan Al-Haj, Alon Itai and Shuly Wintner. Lexical Representation of Multiword Expressions in Morphologically-complex Languages. International Journal of Lexicography 27(2):130-170, June 2014.
Yulia Tsvetkov and Shuly Wintner. Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources. Computational Linguistics 40(2):449-468, June 2014.
Yulia Tsvetkov and Shuly Wintner. Extraction of Multi-word Expressions from Small Parallel Corpora. Natural Language Engineering 18(4):549-573, October 2012.
Yulia Tsvetkov and Shuly Wintner. Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 836-845, Edinburgh, Scotland, July 2011.
Yulia Tsvetkov and Shuly Wintner. Extraction of Multi-word Expressions from Small Parallel Corpora. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 1256-1264, Beijing, August 2010.
Hassan Al-Haj and Shuly Wintner. Identifying Multi-word Expressions by Leveraging Morphological and Syntactic Idiosyncrasy. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 10-18, Beijing, August 2010.
Yulia Tsvetkov and Shuly Wintner. Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pages 3389-3392, Malta, May 2010.