1.
IEEE J Transl Eng Health Med ; 12: 382-389, 2024.
Article in English | MEDLINE | ID: mdl-38606392

ABSTRACT

Acoustic features extracted from speech can help with the diagnosis of neurological diseases and monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step needed prior to extracting acoustic features. Machine learning techniques could be used to automate speech segmentation via automatic speech recognition (ASR) and sequence-to-sequence alignment. While state-of-the-art ASR models achieve good performance on healthy speech, their performance drops significantly when evaluated on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance for dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to increase ASR performance on dysarthric speech: (1) healthy individuals varying their speaking rate and loudness (as is often done in assessments of pathological speech); (2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best-performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, a mean intersection-over-union of 89.2% was obtained against manual parsing (ground truth). It can be concluded that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, for speech segmentation.
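The segmentation accuracy above is scored by temporal intersection-over-union between predicted and manually parsed word boundaries. A minimal sketch of that metric (generic code, not the study's evaluation pipeline; the one-to-one word alignment is an assumption):

```python
def interval_iou(pred, truth):
    """IoU of two (start, end) time intervals, in seconds."""
    start = max(pred[0], truth[0])
    end = min(pred[1], truth[1])
    intersection = max(0.0, end - start)
    # Union = sum of lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0

def mean_word_iou(pred_words, truth_words):
    """Mean IoU over word intervals, assuming a 1:1 alignment."""
    scores = [interval_iou(p, t) for p, t in zip(pred_words, truth_words)]
    return sum(scores) / len(scores)
```

For example, two predicted words each shifted by 0.1 s against ground truth, `mean_word_iou([(0.0, 0.5), (0.6, 1.0)], [(0.0, 0.4), (0.5, 1.0)])`, score a mean IoU of 0.8.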


Subjects
Speech Perception , Speech , Humans , Speech Recognition Software , Dysarthria/diagnosis , Speech Disorders
2.
Digit Biomark ; 7(1): 7-17, 2023.
Article in English | MEDLINE | ID: mdl-37205279

ABSTRACT

Introduction: Kinematic analyses have recently revealed a strong potential to contribute to the assessment of neurological diseases. However, the validation of home-based kinematic assessments using consumer-grade video technology has yet to be performed. In line with best practices for digital biomarker development, we sought to validate webcam-based kinematic assessment against established, laboratory-based recording gold standards. We hypothesized that webcam-based kinematics would possess psychometric properties comparable to those obtained using the laboratory-based gold standards. Methods: We collected data from 21 healthy participants who repeated the phrase "buy Bobby a puppy" (BBP) at four different combinations of speaking rate and volume: Slow, Normal, Loud, and Fast. We recorded these samples twice back-to-back, simultaneously using (1) an electromagnetic articulography ("EMA"; NDI Wave) system, (2) a 3D camera (Intel RealSense), and (3) a 2D webcam for video recording via an in-house developed app. We focused on the extraction of kinematic features in this study, given their demonstrated value in detecting neurological impairments. We specifically extracted measures of speed/acceleration, range of motion (ROM), variability, and symmetry using the movements of the center of the lower lip during these tasks. Using these kinematic features, we derived measures of (1) agreement between recording methods, (2) test-retest reliability of each method, and (3) the validity of webcam recordings to capture expected changes in kinematics as a result of different speech conditions. Results: Kinematics measured using the webcam demonstrated good agreement with both the RealSense and EMA (ICC-A values often ≥0.70). Test-retest reliability, measured using the absolute agreement (2,1) formulation of the intraclass correlation coefficient (i.e., ICC-A), was often "moderate" to "strong" (i.e., ≥0.70) and similar between the webcam and EMA-based kinematic features. Finally, the webcam kinematics were typically as sensitive to differences in speech tasks as EMA and the 3D camera gold standards. Discussion and Conclusions: Our results suggested that webcam recordings display good psychometric properties, comparable to laboratory-based gold standards. This work paves the way for a large-scale clinical validation to continue the development of these promising technologies for the assessment of neurological diseases via home-based methods.
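The agreement and test-retest analyses above use the absolute-agreement, single-rater form of the intraclass correlation coefficient, ICC(2,1) (the "ICC-A" in the abstract). A minimal NumPy sketch of that formula from a two-way ANOVA decomposition (generic code, not the study's analysis scripts):

```python
import numpy as np

def icc_a1(data):
    """ICC(2,1), two-way random, absolute agreement, single measure.

    data: (n_subjects x k_raters) array, e.g. one kinematic feature
    measured on each subject by each recording method.
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    # Sums of squares for subjects (rows), raters (columns), and residual.
    ssr = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ssc = n * ((data.mean(axis=0) - grand) ** 2).sum()
    sse = ((data - grand) ** 2).sum() - ssr - ssc
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    # McGraw & Wong ICC(A,1) formula.
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

With perfectly agreeing raters, e.g. `icc_a1([[1, 1], [2, 2], [3, 3]])`, the coefficient is 1.0; systematic offsets or noise between raters pull it toward 0.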

3.
J Speech Lang Hear Res ; 66(8S): 3151-3165, 2023 08 17.
Article in English | MEDLINE | ID: mdl-36989177

ABSTRACT

PURPOSE: This study sought to determine whether clinically interpretable kinematic features extracted automatically from three-dimensional (3D) videos were correlated with corresponding perceptual clinical orofacial ratings in individuals with orofacial impairments due to neurological disorders. METHOD: Forty-five participants (19 diagnosed with motor neuron diseases [MNDs] and 26 poststroke) performed two nonspeech tasks (mouth opening and lip spreading) and one speech task (repetition of the sentence "Buy Bobby a Puppy") while being video-recorded in a standardized lab setting. The color video recordings of participants were assessed by an expert clinician, a speech-language pathologist, on the severity of three orofacial measures: symmetry, range of motion (ROM), and speed. Clinically interpretable 3D kinematic features, linked to symmetry, ROM, and speed, were automatically extracted from the video recordings using a deep facial landmark detection and tracking algorithm for each of the three tasks. Spearman correlations were used to identify features that were significantly correlated (p < .05) with their corresponding clinical scores. Clinically significant kinematic features were then used in subsequent multivariate regression models to predict the overall orofacial impairment severity score. RESULTS: Several kinematic features extracted from 3D video recordings were associated with their corresponding perceptual clinical scores, indicating clinical validity of these automatically derived measures. Different patterns of significant features were observed between the MND and poststroke groups; these differences were aligned with clinical expectations in both cases. CONCLUSIONS: The results show that kinematic features extracted automatically from simple clinical tasks can capture characteristics used by clinicians during assessments. These findings support the clinical validity of video-based automatic extraction of kinematic features.
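ROM- and speed-linked features like those described above can be computed directly from a tracked landmark trajectory such as the center of the lower lip. A hypothetical sketch (the array layout, millimetre units, and function name are illustrative assumptions, not the study's implementation):

```python
import numpy as np

def kinematic_features(xyz, fps):
    """Simple ROM and speed features from a 3D landmark trajectory.

    xyz: (T, 3) array of landmark positions in mm, one row per frame.
    fps: video frame rate in frames per second.
    """
    xyz = np.asarray(xyz, dtype=float)
    # Vertical range of motion: peak-to-peak excursion along the y axis.
    rom = xyz[:, 1].max() - xyz[:, 1].min()
    # Frame-to-frame velocity in mm/s, then instantaneous speed.
    vel = np.diff(xyz, axis=0) * fps
    speed = np.linalg.norm(vel, axis=1)
    return {
        "rom_mm": rom,
        "mean_speed_mm_s": speed.mean(),
        "peak_speed_mm_s": speed.max(),
    }
```

For a landmark moving vertically at a constant 10 mm/s sampled at 100 fps, the sketch reports a mean and peak speed of 10 mm/s and a ROM equal to the total excursion.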


Subjects
Nervous System Diseases , Speech , Animals , Dogs , Speech/physiology , Algorithms , Biomechanical Phenomena/physiology