HiTZ Language Technology Webinar Series
The HiTZ center hosts a webinar series on Language Technology with talks by key researchers in the field. Please find below a list of upcoming webinars, as well as recordings of past webinars.
If you want to attend the next webinar, please complete this form and you will receive the link.
If you want to receive information about the next webinars, please complete this form.
You can watch past webinars here.
Semi-supervised Learning for Low-resource Multilingual and Multimodal Speech Processing with Machine Speech Chain (May 5, 2022, 15:00 CET)
Summary: The development of advanced spoken language technologies based on automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has enabled computers to either learn how to listen or speak. Many applications and services are now available but still support fewer than 100 languages. Nearly 7000 living languages that are spoken by 350 million people remain uncovered. This is because the construction is commonly done based on machine learning trained in a supervised fashion where a large amount of paired speech and corresponding transcription is required. In this talk, we will introduce a semi-supervised learning mechanism based on a machine speech chain framework. First, we describe the primary machine speech chain architecture that learns not only to listen or speak but also to listen while speaking. The framework enables ASR and TTS to teach each other given unpaired data. After that, we describe the use of machine speech chain for code-switching and cross-lingual ASR and TTS of several languages, including low-resourced ethnic languages. Finally, we describe the recent multimodal machine chain that mimics overall human communication to listen while speaking and visualizing. With the support of image captioning and production models, the framework enables ASR and TTS to improve their performance using an image-only dataset.
Bio: Sakriani Sakti is currently an associate professor at Japan Advanced Institute of Science and Technology (JAIST), Japan, adjunct associate professor at Nara Institute of Science and Technology (NAIST), Japan, visiting research scientist at RIKEN Center for Advanced Intelligent Project (RIKEN AIP), Japan, and adjunct professor at the University of Indonesia. She received the DAAD-Siemens Program Asia 21st Century Award in 2000 to study Communication Technology at the University of Ulm, Germany, and received her MSc degree in 2002. During her thesis work, she worked with the Speech Understanding Department, DaimlerChrysler Research Center, Ulm, Germany. She then worked as a researcher at ATR Spoken Language Communication (SLC) Laboratories, Japan, in 2003-2009, and NICT SLC Groups, Japan, in 2006-2011, which established multilingual speech recognition for speech-to-speech translation. While working with ATR and NICT, Japan, she continued her study (2005-2008) with the Dialog Systems Group, University of Ulm, Germany, and received her Ph.D. degree in 2008. She was actively involved in international collaboration activities such as the Asian Pacific Telecommunity Project (2003-2007) and various speech-to-speech translation research projects, including A-STAR and U-STAR (2006-2011). In 2011-2017, she was an assistant professor at the Augmented Human Communication Laboratory, NAIST, Japan. She also served as a visiting scientific researcher at INRIA Paris-Rocquencourt, France, in 2015-2016, under the JSPS Strategic Young Researcher Overseas Visits Program for Accelerating Brain Circulation. In 2018–2021, she was a research associate professor at NAIST and a research scientist at RIKEN, Center for Advanced Intelligent Project AIP, Japan. Currently, she is an associate professor at JAIST, adjunct associate professor at NAIST, visiting research scientist at RIKEN AIP, and adjunct professor at the University of Indonesia. She is a member of JNS, SFN, ASJ, ISCA, IEICE, and IEEE.
Furthermore, she is currently a committee member of IEEE SLTC (2021-2023) and an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020-2023). She was a board member of Spoken Language Technologies for Under-resourced languages (SLTU) and the general chair of SLTU2016. She was also the general chair of the "Digital Revolution for Under-resourced Languages (DigRevURL)" Workshop as the Interspeech Special Session in 2017 and DigRevURL Asia in 2019. She was also on the organizing committee of the Zero Resource Speech Challenge 2019 and 2020. She was also involved in creating the joint ELRA and ISCA Special Interest Group on Under-resourced Languages (SIGUL) and has served on the SIGUL Board since 2018. Last year, in collaboration with UNESCO and ELRA, she was also on the organizing committee of the International Conference "Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide". Her research interests lie in deep learning and graphical model frameworks, statistical pattern recognition, zero-resourced speech technology, multilingual speech recognition and synthesis, spoken language translation, social-affective dialog systems, and cognitive communication.
Summary: The fundamental issue underlying natural language understanding is that of semantics – there is a need to move toward understanding natural language at an appropriate level of abstraction in order to support natural language understanding and communication with computers. Machine Learning has become ubiquitous in our attempt to induce semantic representations of natural language and support decisions that depend on it; however, while we have made significant progress over the last few years, it has focused on classification tasks for which we have large amounts of annotated data. Supporting high-level decisions that depend on natural language understanding is still beyond our capabilities, partly since most of these tasks are very sparse and generating supervision signals for them does not scale. I will discuss some of the challenges underlying reasoning – making natural language understanding decisions that depend on multiple, interdependent models – and exemplify it mostly using the domain of Reasoning about Time, as it is expressed in natural language.
Bio: Dan Roth is the Eduardo D. Glandt Distinguished Professor at the Department of Computer and Information Science, University of Pennsylvania, lead of NLP Science at Amazon AWS AI, and a Fellow of the AAAS, the ACM, AAAI, and the ACL. In 2017, Roth was awarded the John McCarthy Award, the highest award the AI community gives to mid-career AI researchers. Roth was recognized “for major conceptual and theoretical advances in the modeling of natural language understanding, machine learning, and reasoning.” Roth has published broadly in machine learning, natural language processing, knowledge representation and reasoning, and learning theory, and has developed advanced machine learning based tools for natural language applications that are being used widely. Roth was the Editor-in-Chief of the Journal of Artificial Intelligence Research (JAIR) and a program chair of AAAI, ACL, and CoNLL.
Roth has been involved in several startups; most recently he was a co-founder and chief scientist of NexLP, a startup that leverages the latest advances in Natural Language Processing (NLP), Cognitive Analytics, and Machine Learning in the legal and compliance domains. NexLP was acquired by Reveal in 2020. Prof. Roth received his B.A. summa cum laude in Mathematics from the Technion, Israel, and his Ph.D. in Computer Science from Harvard University in 1995.
Recent years have seen considerable progress in the deployment of 'intelligent' communicative agents such as Apple's Siri and Amazon’s Alexa. However, effective speech-based human-robot dialogue is less well developed; not only do the fields of robotics and spoken language technology present their own special problems, but their combination raises an additional set of issues. In particular, there appears to be a large gap between the formulaic behaviour that typifies contemporary spoken language dialogue systems and the rich and flexible nature of human-human conversation. As a consequence, we still seem to be some distance away from creating Autonomous Social Agents such as robots that are truly capable of conversing effectively with their human counterparts in real world situations. This talk will address these issues and will argue that we need to go far beyond our current capabilities and understanding if we are to move from developing robots that simply talk and listen to evolving intelligent communicative machines that are capable of entering into effective cooperative relationships with human beings.
Bio: Prof. Moore has over 40 years’ experience in Speech Technology R&D and, although an engineer by training, much of his research has been based on insights from human speech perception and production. As Head of the UK Government's Speech Research Unit from 1985 to 1999, he was responsible for the development of the Aurix range of speech technology products and the subsequent formation of 20/20 Speech Ltd. Since 2004 he has been Professor of Spoken Language Processing at the University of Sheffield, and also holds Visiting Chairs at Bristol Robotics Laboratory and University College London Psychology & Language Sciences. He was President of the European/International Speech Communication Association from 1997 to 2001, General Chair for INTERSPEECH-2009 and ISCA Distinguished Lecturer during 2014-15. In 2017 he organised the first international workshop on ‘Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR)’. Prof. Moore is the current Editor-in-Chief of Computer Speech & Language. In 2016 he was awarded the LREC Antonio Zampolli Prize for "Outstanding Contributions to the Advancement of Language Resources & Language Technology Evaluation within Human Language Technologies", and in 2020 he was given the International Speech Communication Association Special Service Medal for "service in the establishment, leadership and international growth of ISCA".
Abstract: Speech recognition is the mapping of a continuous, highly variable speech signal onto discrete, abstract representations. The question of how speech is represented and processed in the human brain and in automatic speech recognition (ASR) systems, although crucial in both the field of human speech processing and the field of automatic speech processing, has historically been investigated in the two fields separately. This webinar will discuss how comparisons between humans and deep neural network (DNN)-based ASRs, and cross-fertilization of the two research fields, can provide valuable insights into the way humans process speech and improve ASR technology. Specifically, it will present results of several experiments carried out on both human listeners and DNN-based ASR systems on the representation of speech in human listeners and DNNs and on lexically-guided perceptual learning, i.e., the ability to adapt a sound category on the basis of new incoming information, resulting in improved processing of subsequent information. It will explain how listeners adapt to the speech of new speakers, and will present the results of a lexically-guided perceptual learning study carried out on a DNN-based ASR system, similar to the human experiments. In order to investigate the speech representations and adaptation processes in the DNN-based ASR systems, activations in the hidden layers of the DNN were visualized. These visualizations revealed that DNNs use speech representations that are similar to those used by human listeners, without being explicitly taught to do so, and showed an adaptation of the phoneme categories similar to what is assumed to happen in the human brain.
Bio: Odette Scharenborg is an Associate Professor and Delft Technology Fellow at Delft University of Technology working on automatic speech processing. She has an interdisciplinary background in automatic speech recognition and psycholinguistics, and uses knowledge of how humans process speech to develop inclusive automatic speech recognition systems that are able to recognise speech from everyone, irrespective of how they speak or the language they speak. Since 2017, she has been on the Board of the International Speech Communication Association, and currently serves as Vice-President. Since 2018, she has been on the IEEE Speech and Language Processing Technical Committee, and she is a Senior Associate Editor of IEEE Signal Processing Letters.
Summary: Researchers in NLP increasingly frame and discuss research results in ways that serve to deemphasize the field's successes, at least in part in an effort to combat the field's widespread hype. Though well-meaning, this often yields misleading or even false claims about the limits of our best technology. This is a problem, and it may be more serious than it looks: It harms our credibility in ways that can make it harder to mitigate present-day harms from NLP deployments, like those involving discriminatory systems for content moderation or resume screening. It also limits our ability to prepare for the potentially enormous impacts of more distant future advances. This talk urges researchers to be careful about these claims and suggests some research directions and communication strategies that will make it easier to avoid or rebut them.
Bio: Sam Bowman has been on the faculty at NYU since 2016, when he completed his PhD with Chris Manning and Chris Potts at Stanford. At NYU, he is a member of the Center for Data Science, the Department of Linguistics, and Courant Institute's Department of Computer Science. His research focuses on data, evaluation techniques, and modeling techniques for sentence and paragraph understanding in natural language processing, and on applications of machine learning to scientific questions in linguistic syntax and semantics. He is the senior organizer behind the GLUE and SuperGLUE benchmark competitions and he has received a 2015 EMNLP Best Resource Paper Award, a 2019 *SEM Best Paper Award, a 2017 Google Faculty Research Award, and a 2021 NSF CAREER award.
Task descriptions are ubiquitous in human learning. They are usually accompanied by a few examples, but there is little human learning that is based on examples only. In contrast, the typical learning setup for NLP tasks lacks task descriptions and is supervised with 100s or 1000s and often many more examples. This webinar will introduce Pattern-Exploiting Training (PET), an approach to learning that mimics human learning in that it leverages task descriptions in few-shot settings. PET is built on top of a pretrained language model that "understands" the task description, especially after fine-tuning, resulting in excellent performance compared to other few-shot methods. In particular, a model trained with PET outperforms GPT-3 even though it has 99.9% fewer parameters. The idea of task descriptions can also be applied to reducing bias in text generated by language models. Instructing a model to reveal and reduce its biases is remarkably effective as will be demonstrated in an evaluation on several benchmarks. This may contribute in the future to a fairer and more inclusive NLP.
Summary: The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.
Bio: Desmond is an Assistant Professor at the University of Copenhagen. His primary research interests are multimodal and multilingual machine learning and he was involved in the creation of the Multi30K, How2, and MaRVL datasets. His work received an Area Chair Favourite paper at COLING 2018 and the Best Long Paper Award at EMNLP 2021. He co-organised the Multimodal Machine Translation Shared Task from 2016–2018, the 2018 Frederick Jelinek Memorial Workshop on Grounded Sequence-to-Sequence Learning, the How2 Challenge Workshop at ICML 2019, and the Workshop on Multilingual Multimodal Learning at ACL 2022.
Watch 2020-2021 webinar series here.