RECOVERING OLD ROMANIAN LEMMATA
Authors: Gifu Daniela
2017 | Volume 3 | DOI: 10.12753/2066-026X-17-176 | Pages: 19-24
Abstract
This study describes a method for exploring old Romanian words collected from the written press between 1829 and 2015. The material comprises three independent newspaper collections, developed and structured to correspond to Moldavia, Wallachia, and Transylvania. The purpose of this research is to record the chronology of the old Romanian words identified in the diachronic corpus that we have built over the last three years, called RODICA (ROmanian DIachronic Corpus with Annotations), in order to develop a learning technology based on language evolution. We used quotations extracted from eDTLR (the Dictionary Thesaurus of the Romanian Language in electronic form), a resource with an important role in eLearning applications. Recovering the lemmata of old, obsolete word forms from eDTLR quotations is a goal that the NLP group from Iasi began working on a few years ago, and this work contributes significantly to the development of a diachronic POS-tagger. Such a tagger is very important for various comparative language analyses in eLearning. However, an important problem arises: the corpus contains only quotations, not complete sentences. A quotation may be attached either to the title-word being searched for or to a different title-word. We consider that the language model we used could be useful both for applied objectives (enabling effective language similarity analysis using statistical methods) and for scientific objectives (exploring the nature of related languages).
Keywords
historical corpus, words' chronology, old lemmata, written press, diachronic POS-tagger