Language Resources

Developing language technology products and applications requires basic linguistic resources (text and speech corpora, lexicons and knowledge bases) and development tools (morphological and syntactic analysers, word sense disambiguators, corpus processing tools, lemmatisers, integrated tool environments, etc.).

We have more than 25 years of experience in creating basic linguistic resources of this kind, and we maintain several reference corpora, lexicons and knowledge bases. These underpin the development of tools that go beyond superficial analysis and address the deeper content of the sentence: its meaning.

Among our reference corpora, the main resource is the EPEC corpus (Euskararen Prozesamendurako Erreferentziazko Corpusa, Reference Corpus for the Processing of Basque), which contains 300,000 words tagged at several linguistic levels: morphological, syntactic and semantic. We also have a terminology corpus of around 18 million words (Garaterm), a corpus annotated with temporal expressions (EusTimeBank), a corpus annotated with verbal multiword expressions (Parseme) and a corpus annotated with discourse units and relations (RST Treebank).

With regard to knowledge bases and lexical databases, we have EDBL, the general lexical database for Basque; Basque WordNet, which was built from the English WordNet following the expand approach; BVI, the Basque Verb Index, which records the arguments and semantic roles of verbs; and Konbitzul, an online database of verb-noun multiword expressions in Spanish and Basque.

As for speech databases and tools, in recent years we have developed many voice-based resources for different purposes. On the one hand, we have databases for emotional speech synthesis (EmodB_EU1, 2 and 3), synthesis and voice conversion (AhoSpeaker) and bilingual speech synthesis (Ahosyn), as well as SpeechDat-like databases (MDB600-EU and FDB1060-EU), alaryngeal speech recordings and even an ethnographic database (Bizkaifon). On the other hand, we have voice processing tools such as a voice detection algorithm and a pitch detection algorithm; speech synthesis tools for Windows, Android and web systems, using standard Basque or local varieties (Iparrahotsa); a Basque speech recognizer; and a public voice bank (ZureTTS).

 
