DeepKnowledge (PID2021-127777OB-C21) project funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe"
(2022 - 2025)
Because language is the most efficient system for exchanging information, Natural Language Processing (NLP) is one of the most important technologies of the current digital transformation. Understanding language is crucial for the success of text analytics and information access applications, which depend on the quality of the underlying text-processing tools. In recent years, the NLP community has contributed to the emergence of powerful new deep learning techniques and tools that are revolutionizing the approach to Language Technology (LT) tasks. NLP is moving from a methodology in which a pipeline of multiple modules was the typical way to implement solutions to architectures based on complex neural networks trained on vast amounts of text data. Thanks to these advances, the NLP community is currently engaged in a paradigm shift driven by the production and exploitation of large pre-trained transformer-based language models. As a result, many in industry have started deploying large pre-trained neural language models in production. Compared to previous work, results have improved so much that systems are claimed to reach human-level performance on laboratory benchmarks for some difficult language understanding tasks.
Despite their impressive capabilities, large pre-trained language models come with severe drawbacks. We currently have no clear understanding of how they work, when they fail, what emergent properties they may present, or which novel ways of exploiting them can help to improve the state of the art in NLP. It is important to understand the limitations of these models. Some authors call them foundation models to underscore their critically central yet incomplete character. Tackling these questions requires extensive multidisciplinary collaboration and research.
This paradigm shift means that we have only just started to scratch the surface of the new possibilities offered by these large pre-trained language models. DeepKnowledge will pre-train language models for the official languages of Spain so that novel techniques can be applied to them to extract more precise and generalizable knowledge.
Organization: Ministerio de Ciencia e Innovación (MCIN)
Main researchers: Rodrigo Agerri, German Rigau
Rodrigo Agerri, Izaskun Aldezabal, Olatz Ansa, Unai Atutxa, Gorka Azkune, Jeremy Barnes, Ander Barrena, Jon Ander Campos, Izaskun Etxeberria, Joseba Fernandez de Landa, Iker García, Itziar Gonzalez-Dios, Mikel Iruskieta, Oier López de Lacalle, German Rigau, Oscar Sainz, Ander Salaberria, Aitor Soroa, Olia Toporkov, Suna Şeyma Uçar