
Thesis defence: Mamady NABE

Thesis defence

On 14 March 2023

COSMO-Onset: a Bayesian, neurally inspired model of speech perception combining bottom-up envelope processing and top-down predictions for syllabic segmentation

Neurocognitive speech perceptual processing is classically conceived as a hierarchy of computations, typically including acoustic or multi-sensory feature extraction, pre-lexical categorization, lexical access, prosodic and syntactic integration, up to final comprehension stages. It is increasingly considered that neural communication within and across these various stages is based on synchronization processes and operates thanks to chunking and selection mechanisms exploiting neural oscillatory dynamics at various frequencies.

In contrast to classical models of speech perception such as the TRACE or SHORTLIST models, which achieve segmentation solely through the decoding of the spectro-temporal content of the speech input, recent neuroscience research in speech perception advocates for a clear separation between two processing pathways: a decoding pathway and a temporal control pathway. The latter proposal has given rise to several neuro-computational models which, for segmentation, rely solely on the processing of the acoustic envelope, enabling syllabic rhythm tracking from the speech signal. In this sense, they are entirely “bottom-up” segmentation models.

However, several studies have shown that reliable speech perception cannot be achieved through bottom-up processes alone. For instance, clear evidence for the role of top-down temporal predictions has been provided by Aubanel and Schwartz (2020). Their study showed that speech sequences embedded in noise were better processed and understood by listeners when they were presented in their natural, irregular timing than when their timing was made isochronous, without changing their spectro-temporal content. The strong benefit in intelligibility displayed by natural syllabic timing, both in English and in French, was interpreted by the authors as evidence for the role of top-down temporal predictions for syllabic parsing.

The objective of the present thesis is to address the question of the fusion of bottom-up and top-down processes for speech syllabic segmentation. Our contribution is the COSMO-Onset model, a Bayesian hierarchical model of speech perception that includes a speech segmentation module with an original top-down mechanism for syllabic onset prediction, relying on lexical temporal knowledge. We use the model to explore the respective roles of bottom-up envelope processing and top-down linguistic predictions, and how they can be efficiently combined for syllabic segmentation. In a first set of experiments on simplified, synthetic stimuli, we show that while purely bottom-up onset detection is sufficient for word recognition in nominal conditions, top-down prediction of syllabic onset events allows the model to overcome challenging adverse conditions, such as when the acoustic envelope is degraded, leading to either spurious or missing onset events in the sensory signal. In a second set of experiments on real speech stimuli from the Aubanel and Schwartz (2020) experiment, we show that the COSMO-Onset model successfully accounts for the complementary roles of isochrony and naturalness in speech perception in noise.

Jury members:
Noël NGUYEN - Aix-Marseille Université - Reviewer
Frédéric BIMBOT - CNRS - Reviewer
Okko RÄSÄNEN - Tampere Université - Examiner
Itsaso OLASAGASTI - Université de Genève - Examiner
Laurent GIRIN - Grenoble-INP - Examiner
Julien DIARD - CNRS - Thesis supervisor
Jean-Luc SCHWARTZ - CNRS - Thesis co-supervisor

Keywords: Speech segmentation, Top-Down, Neural oscillations, Bayesian modeling, Temporal aspects of speech






01/11/2019 - 14/03/2023

Submitted on 20 November 2023

Updated on 20 November 2023