Text analysis

Natural language analysis tools are software modules that perform linguistic analyses of text at different levels. These tools are essential components of any natural language processing (NLP) software that analyzes text, and text mining software is generally built by combining basic linguistic modules into complex pipelines.

The HiTZ center has a long tradition of building analysis tools...

Demos

Demo of the English NLP pipeline

Just paste in any English text and see which entities, events, and other annotations are added automatically. The result is represented in the NAF format.

Demo of the Spanish NLP pipeline

Just copy in any Spanish text and see what entities and other annotations are added automatically. The result is represented in the NAF format.

Eustagger

Basque lemmatizer and morphosyntactic analyzer

Xuxen

Basque spelling corrector on-line

Contracts

Projects

All HiTZ projects

Patents

MALTIXA

Resources

Publications

Eneko Agirre

Cross-Lingual Word Embeddings (Book Review) (2020)

Computational Linguistics 46 (1), 245-248. (https://doi.org/10.1162/COLI_r_00372)

Jose R. Pichel, Pablo Gamallo, Iñaki Alegria, Marco Neves

A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity (2020)

Journal of Quantitative Linguistics. DOI 10.1080/09296174.2020.1732177

Uxoa Iñurrieta

Identification and translation of verb+noun multiword expressions: a Spanish-Basque study (2020)

Procesamiento del Lenguaje Natural, 64, pp. 123-126.

Pablo Gamallo, José Ramom Pichel and Iñaki Alegria

Measuring Language Distance of Isolated European Languages (2020)

MDPI Information 2020, 11(4), 181 doi:10.3390/info11040181

Kepa Bengoetxea, Itziar Gonzalez-Dios, Amaia Aguirregoitia

AzterTest: Open source linguistic and stylistic analysis tool (2020)

Procesamiento del Lenguaje Natural, 64, 61-68. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6196

Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre

Give your Text Representation Models some Love: the Case for Basque (2020)

Proceedings of LREC. Also available at arxiv https://arxiv.org/pdf/2004.00033.pdf

Begoña Altuna, María Jesús Aranzabe, Arantza Díaz de Ilarraza

EusTimeML: A mark-up language for temporal information in Basque (2020)

Research in Corpus Linguistics 8: 86-104. ISSN 2243-4712. Asociación Española de Lingüística de Corpus (AELINCO) DOI 10.32714/ricl.08.01.06

Itziar Aduriz, Jose Mari Arriola, Xabier Artola, Zuhaitz Beloki, Nerea Ezeiza, Koldo Gojenola

Morfeus+: Word Parsing in Basque beyond Morphological Segmentation (2020)

WORD STRUCTURE 13.3, 283-315

Itziar Aduriz, Jose Mari Arriola

Testu-corpusen informazio morfosintaktikoaren etiketatze automatikoa hizkuntz ezagutzan oinarrituz: zenbait arazo, hainbat erronka (2020)

Fontes Linguae Vasconum 50 urte: ekarpen berriak euskararen ikerketari / Nuevas aportaciones al estudio de la lengua vasca. (in press)

Elena Zotova, Rodrigo Agerri, Manuel Nuñez and German Rigau

Multilingual Stance Detection in Tweets: The Catalonia Independence Corpus (2020)

Language Resources and Evaluation Conference (LREC 2020)

José Ramom Pichel, Pablo Gamallo, Marco Neves & Iñaki Alegria

Distância diacrónica automática entre variantes diatópicas do português e do espanhol (2020)

Linguamática, Vol. 12 N. 1, 117–126 ISSN: 1647–0818

Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining (2020)

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Pages 255-262

Uxoa Inurrieta, Itziar Aduriz, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola

Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification. (2020)

PLoS ONE 15(8): e0237767. https://doi.org/10.1371/journal.pone.0237767

Amaia Aguirregoitia Martinez, Kepa Bengoetxea Kortazar, Itziar Gonzalez-Dios

Are CLIL texts too complicated? A computational analysis of their linguistic characteristics (2020)

Journal of Immersion and Content-Based Language Education (Available online)

Itziar Aduriz, Jose Mari Arriola

Testu-corpusen informazio morfosintaktikoaren etiketatze automatikoa hizkuntz ezagutzan oinarrituz: zenbait arazo, hainbat erronka (2020)

Fontes Linguae Vasconum 50 urte. Ekarpen berriak euskararen ikerketari / Nuevas aportaciones al estudio de la lengua vasca.

Jose Ramom Pichel Campos

Medidas de distância entre línguas baseadas em corpus (2020)

International thesis. Collection of articles.

Mikel Artetxe, Gorka Labaka, Eneko Agirre

Translation Artifacts in Cross-lingual Transfer Learning (2020)

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). (Pages 7674–7684).

Mikel Iruskieta, Arantxa Otegi, Larraitz Uria, Arantza Diaz de Ilarraza, Amaia Artolazabal

Zer i(ra)kas dezakegu geure corpusekin "jolastuz"? (2019)

Traineru bete lagun: Iñaki Gaminde omenduz. UPV/EHU. pp. 35-66.

Jon Alkorta, Koldo Gojenola, Mikel Iruskieta

Sentimenduen tratamendu konputazionalerantz: gramatika maila ezberdinetako sentimendu balentzia aldatzaileen bila (2019)

Olatz Arbelaitz, Urtzi Etxeberria, Ainhoa Latatu, Miren Josu Ormaetxebarria (eds.), III. Ikergazte. Nazioarteko Ikerketa Euskaraz, Giza Zientziak eta Artea (vol. 1), 39-46. Udako Euskal Unibertsitatea (UEU). Bilbo.

Joseba Fernandez de Landa, Rodrigo Agerri, Iñaki Alegria

Large Scale Linguistic Processing of Tweets to Understand Social Interactions among Speakers of Less Resourced Languages: The Basque Case (2019)

MDPI: Information: Vol. 10, 6. 212. doi: 10.3390/info10060212 https://www.mdpi.com/2078-2489/10/6/212

Y Yaghoobzadeh, K Kann, TJ Hazen, E Agirre, H Schütze

Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings (2019)

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Aitziber Atutxa, Kepa Bengoetxea, Arantza Diaz de Ilarraza, Mikel Iruskieta

Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool (2019)

PLoS ONE 14(9): e0221639

José Ramom Pichel, Pablo Gamallo, Iñaki Alegria

Cross-lingual Diachronic Distance: Application to Portuguese and Spanish (2019)

Procesamiento del Lenguaje Natural, no. 63, September 2019, pp. 77-84

José Ramom Pichel, Pablo Gamallo, Iñaki Alegria

Measuring diachronic language distance using perplexity. Application to English, Portuguese and Spanish. (2019)

Natural Language Engineering

Ainara Estarrona, Izaskun Etxeberria, Ander Soraluze, Manuel Padilla-Moyano

Spelling Normalisation of Basque Historical Texts (2019)

Procesamiento del Lenguaje Natural, vol. 63, pp. 59-66

Jose Mari Arriola, Izaskun Aldezabal, Ainara Estarrona

A modular grammar-helping tool for Basque: work in progress (2019)

NoDaLiDa2019, Turku, Finland

Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, Eneko Agirre

Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation (2018)

Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 282–291. Brussels, Belgium, October 31 - November 1, 2018. Best paper award

Zuhaitz Beloki, Xabier Artola and Aitor Soroa

A scalable architecture for data-intensive natural language processing (2017)

Natural Language Engineering, 1-23. doi:10.1017/S1351324917000092.

Itziar Aduriz, Iñaki Alegria, Olatz Arregi, Arantza Diaz de Ilarraza, Kepa Sarasola

Hizkuntza-teknologia “Datu Handien” garaian: programa bilatzaileak, itzultzaileak… (2017)

Senez, 48, pp. 191-200. ISSN: 1132-2152. 2017 https://eizie.eus/eu/argitalpenak/senez/20171102/aurkezpena/datuhandiak

Arantxa Otegi, Nerea Ezeiza, Iakes Goenaga, Gorka Labaka

A Modular Chain of NLP Tools for Basque (2016)

Proceedings of the 19th International Conference on Text, Speech and Dialogue, TSD 2016, Brno, Czech Republic, Lecture Notes in Computer Science, vol. 9924, pp. 93-100, Springer. ISBN 978-3-319-45509-9. DOI 10.1007/978-3-319-45510-5_11

Uxoa Iñurrieta, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola, Itziar Aduriz, John Carroll

Using Linguistic Data for English and Spanish Verb-Noun Combination Identification (2016)

Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers pages 857–867, Osaka, Japan, December 11-17 2016. ISBN: 978-4-87974-702-0.

Rodrigo Agerri, Xabier Artola, Zuhaitz Beloki, German Rigau, Aitor Soroa

Big data for Natural Language Processing: A streaming approach (2015)

Knowledge-Based Systems. http://dx.doi.org/10.1016/j.knosys.2014.11.007. Vol.79, pages 36-42.

Xabier Artola, Zuhaitz Beloki, Aitor Soroa

A stream computing approach towards scalable NLP (2014)

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland. ISBN: 978-2-9517408-8-4

Rodrigo Agerri, Josu Bermudez, German Rigau

IXA pipeline: Efficient and Ready to Use Multilingual NLP tools. (2014)

LREC 2014: 3823-3828. ISBN 978-2-9517408-8-4

All HiTZ publications
