Pesquisa | Portal Regional da BVS

Advancing Stuttering Detection via Data Augmentation, Class-Balanced Loss and Multi-Contextual Deep Learning.

Sheikh, Shakeel A; Sahidullah, Md; Hirsch, Fabrice; Ouni, Slim.

IEEE J Biomed Health Inform ; 27(5): 2553-2564, 2023 05.

Artigo em Inglês | MEDLINE | ID: mdl-37027629

RESUMO

Stuttering is a neuro-developmental speech impairment characterized by uncontrolled utterances (interjections) and core behaviors (blocks, repetitions, and prolongations), and is caused by the failure of speech sensorimotors. Due to its complex nature, stuttering detection (SD) is a difficult task. If detected at an early stage, it could facilitate speech therapists to observe and rectify the speech patterns of persons who stutter (PWS). The stuttered speech of PWS is usually available in limited amounts and is highly imbalanced. To this end, we address the class imbalance problem in the SD domain via a multi-branching (MB) scheme and by weighting the contribution of classes in the overall loss function, resulting in a huge improvement in stuttering classes on the SEP-28 k dataset over the baseline (StutterNet). To tackle data scarcity, we investigate the effectiveness of data augmentation on top of a multi-branched training scheme. The augmented training outperforms the MB StutterNet (clean) by a relative margin of 4.18% in macro F1-score ( F1). In addition, we propose a multi-contextual (MC) StutterNet, which exploits different contexts of the stuttered speech, resulting in an overall improvement of 4.48% in F1 over the single context based MB StutterNet. Finally, we have shown that applying data augmentation in the cross-corpora scenario can improve the overall SD performance by a relative margin of 13.23% in F1 over the clean training.

Assuntos

Aprendizado Profundo , Gagueira , Humanos , Gagueira/diagnóstico , Fala

Learning emotions latent representation with CVAE for text-driven expressive audiovisual speech synthesis.

Dahmani, Sara; Colotte, Vincent; Girard, Valérian; Ouni, Slim.

Neural Netw ; 141: 315-329, 2021 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-33957381

RESUMO

Great improvement has been made in the field of expressive audiovisual Text-to-Speech synthesis (EAVTTS) thanks to deep learning techniques. However, generating realistic speech is still an open issue and researchers in this area have been focusing lately on controlling the speech variability. In this paper, we use different neural architectures to synthesize emotional speech. We study the application of unsupervised learning techniques for emotional speech modeling as well as methods for restructuring emotions representation to make it continuous and more flexible. This manipulation of the emotional representation should allow us to generate new styles of speech by mixing emotions. We first present our expressive audiovisual corpus. We validate the emotional content of this corpus with three perceptual experiments using acoustic only, visual only and audiovisual stimuli. After that, we analyze the performance of a fully connected neural network in learning characteristics specific to different emotions for the phone duration aspect and the acoustic and visual modalities. We also study the contribution of a joint and separate training of the acoustic and visual modalities in the quality of the generated synthetic speech. In the second part of this paper, we use a conditional variational auto-encoder (CVAE) architecture to learn a latent representation of emotions. We applied this method in an unsupervised manner to generate features of expressive speech. We used a probabilistic metric to compute the overlapping degree between emotions latent clusters to choose the best parameters for the CVAE. By manipulating the latent vectors, we were able to generate nuances of a given emotion and to generate new emotions that do not exist in our database. For these new emotions, we obtain a coherent articulation. We conducted four perceptual experiments to evaluate our findings.

Assuntos

Emoções , Percepção da Fala , Fala , Aprendizagem , Redes Neurais de Computação

Is markerless acquisition technique adequate for speech production?

Ouni, Slim; Dahmani, Sara.

J Acoust Soc Am ; 139(6): EL234, 2016 06.

Artigo em Inglês | MEDLINE | ID: mdl-27369178

RESUMO

In this study, the precision of markerless acquisition techniques have been assessed when used to acquire articulatory data for speech production studies. Two different markerless systems have been evaluated and compared to a marker-based one. The main finding is that both markerless systems provide a reasonable result during normal speech and the quality is uneven during fast articulated speech. The quality of the data is dependent on the temporal resolution of the markerless system.

Assuntos

Face/fisiologia , Imageamento Tridimensional/instrumentação , Acústica da Fala , Transdutores , Fenômenos Biomecânicos , Desenho de Equipamento , Humanos , Imageamento Tridimensional/métodos , Masculino , Movimento (Física) , Reprodutibilidade dos Testes , Processamento de Sinais Assistido por Computador , Software , Fatores de Tempo

An episodic memory-based solution for the acoustic-to-articulatory inversion problem.

Demange, Sébastien; Ouni, Slim.

J Acoust Soc Am ; 133(5): 2921-30, 2013 May.

Artigo em Inglês | MEDLINE | ID: mdl-23654397

RESUMO

This paper presents an acoustic-to-articulatory inversion method based on an episodic memory. An episodic memory is an interesting model for two reasons. First, it does not rely on any assumptions about the mapping function but rather it relies on real synchronized acoustic and articulatory data streams. Second, the memory inherently represents the real articulatory dynamics as observed. It is argued that the computational models of episodic memory, as they are usually designed, cannot provide a satisfying solution for the acoustic-to-articulatory inversion problem due to the insufficient quantity of training data. Therefore, an episodic memory is proposed, called generative episodic memory (G-Mem), which is able to produce articulatory trajectories that do not belong to the set of episodes the memory is based on. The generative episodic memory is evaluated using two electromagnetic articulography corpora: one for English and one for French. Comparisons with a codebook-based method and with a classical episodic memory (which is termed concatenative episodic memory) are presented in order to evaluate the proposed generative episodic memory in terms of both its modeling of articulatory dynamics and its generalization capabilities. The results show the effectiveness of the method where an overall root-mean-square error of 1.65 mm and a correlation of 0.71 are obtained for the G-Mem method. They are comparable to those of methods recently proposed.

Assuntos

Memória Episódica , Fonética , Processamento de Sinais Assistido por Computador , Acústica da Fala , Medida da Produção da Fala/métodos , Qualidade da Voz , Algoritmos , Inteligência Artificial , Simulação por Computador , Humanos , Modelos Teóricos , Fatores de Tempo

Estimating the control parameters of an articulatory model from electromagnetic articulograph data.

Toutios, Asterios; Ouni, Slim; Laprie, Yves.

J Acoust Soc Am ; 129(5): 3245-57, 2011 May.

Artigo em Inglês | MEDLINE | ID: mdl-21568426

RESUMO

Finding the control parameters of an articulatory model that result in given acoustics is an important problem in speech research. However, one should also be able to derive the same parameters from measured articulatory data. In this paper, a method to estimate the control parameters of the the model by Maeda from electromagnetic articulography (EMA) data, which allows the derivation of full sagittal vocal tract slices from sparse flesh-point information, is presented. First, the articulatory grid system involved in the model's definition is adapted to the speaker involved in the experiment, and EMA data are registered to it automatically. Then, articulatory variables that correspond to measurements defined by Maeda on the grid are extracted. An initial solution for the articulatory control parameters is found by a least-squares method, under constraints ensuring vocal tract shape naturalness. Dynamic smoothness of the parameter trajectories is then imposed by a variational regularization method. Generated vocal tract slices for vowels are compared with slices appearing in magnetic resonance images of the same speaker or found in the literature. Formants synthesized on the basis of these generated slices are adequately close to those tracked in real speech recorded concurrently with EMA.

Assuntos

Incisivo/fisiologia , Laringe/fisiologia , Lábio/fisiologia , Imageamento por Ressonância Magnética/métodos , Modelos Biológicos , Fonação/fisiologia , Língua/fisiologia , Antropometria , Orelha , Feminino , Humanos , Nariz , Palato/fisiologia , Faringe/fisiologia

Incorporation of phonetic constraints in acoustic-to-articulatory inversion.

Potard, Blaise; Laprie, Yves; Ouni, Slim.

J Acoust Soc Am ; 123(4): 2310-23, 2008 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-18397035

RESUMO

This study investigates the use of constraints upon articulatory parameters in the context of acoustic-to-articulatory inversion. These speaker independent constraints, referred to as phonetic constraints, were derived from standard phonetic knowledge for French vowels and express authorized domains for one or several articulatory parameters. They were experimented on in an existing inversion framework that utilizes Maeda's articulatory model and a hypercubic articulatory-acoustic table. Phonetic constraints give rise to a phonetic score rendering the phonetic consistency of vocal tract shapes recovered by inversion. Inversion has been applied to vowels articulated by a speaker whose corresponding x-ray images are also available. Constraints were evaluated by measuring the distance between vocal tract shapes recovered through inversion to real vocal tract shapes obtained from x-ray images, by investigating the spreading of inverse solutions in terms of place of articulation and constriction degree, and finally by studying the articulatory variability. Results show that these constraints capture interdependencies and synergies between speech articulators and favor vocal tract shapes close to those realized by the human speaker. In addition, this study also provides how acoustic-to-articulatory inversion can be used to explore acoustical and compensatory articulatory properties of an articulatory model.

Assuntos

Acústica , Fonética , Percepção da Fala , Humanos , Medida da Produção da Fala

Modeling the articulatory space using a hypercube codebook for acoustic-to-articulatory inversion.

Ouni, Slim; Laprie, Yves.

J Acoust Soc Am ; 118(1): 444-60, 2005 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-16119364

RESUMO

Acoustic-to-articulatory inversion is a difficult problem mainly because of the nonlinearity between the articulatory and acoustic spaces and the nonuniqueness of this relationship. To resolve this problem, we have developed an inversion method that provides a complete description of the possible solutions without excessive constraints and retrieves realistic temporal dynamics of the vocal tract shapes. We present an adaptive sampling algorithm to ensure that the acoustical resolution is almost independent of the region in the articulatory space under consideration. This leads to a codebook that is organized in the form of a hierarchy of hypercubes, and ensures that, within each hypercube, the articulatory-to-acoustic mapping can be approximated by means of a linear transform. The inversion procedure retrieves articulatory vectors corresponding to acoustic entries from the hypercube codebook. A nonlinear smoothing algorithm together with a regularization technique is then used to recover the best articulatory trajectory. The inversion ensures that inverse articulatory parameters generate original formant trajectories with high precision and a realistic sequence of the vocal tract shapes.

Assuntos

Modelos Biológicos , Boca/fisiologia , Fonética , Fala/fisiologia , Prega Vocal/fisiologia , Algoritmos , Humanos , Modelos Lineares

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA