Results 1 - 20 of 35
1.
Crit Care ; 27(1): 277, 2023 07 10.
Article in English | MEDLINE | ID: mdl-37430313

ABSTRACT

OBJECTIVES: To evaluate the effectiveness of speech/phrase recognition software in critically ill patients with speech impairments. DESIGN: Prospective study. SETTING: Tertiary hospital critical care unit in the northwest of England. PARTICIPANTS: 14 patients with tracheostomies, 3 female and 11 male. MAIN OUTCOME MEASURES: Evaluation of dynamic time warping (DTW) and deep neural network (DNN) methods in a speech/phrase recognition application. Using the speech/phrase recognition app for the voice impaired (SRAVI), patients attempted to mouth various supported phrases, with recordings evaluated by both DNN and DTW processing methods. A trio of potential recognition phrases was then displayed on the screen, ranked from first to third in order of likelihood. RESULTS: A total of 616 patient recordings were taken, of which 516 contained identifiable phrases. The overall results revealed a total recognition accuracy across all three ranks of 86% using the DNN method. The rank 1 recognition accuracy of the DNN method was 75%. The DTW method had a total recognition accuracy of 74%, with a rank 1 accuracy of 48%. CONCLUSION: This feasibility evaluation of a novel speech/phrase recognition app, SRAVI, demonstrated a good correlation between spoken phrases and app recognition. This suggests that speech/phrase recognition technology could be a therapeutic option to bridge the gap in communication in critically ill patients. WHAT IS ALREADY KNOWN ABOUT THIS TOPIC: Communication can be attempted using visual charts, eye gaze boards, alphabet boards, speech/phrase reading, gestures and speaking valves in critically ill patients with speech impairments. WHAT THIS STUDY ADDS: Deep neural network and dynamic time warping methods can be used to analyse lip movements and identify intended phrases. HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE AND POLICY: Our study shows that speech/phrase recognition software has a role to play in bridging the communication gap in speech impairment.
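A minimal sketch of the kind of DTW-based template matching evaluated above, returning a ranked trio of candidate phrases; the feature representation, phrase set and dimensions are hypothetical stand-ins, not SRAVI's actual pipeline.

```python
# Sketch only (not the SRAVI implementation): rank candidate phrases by dynamic
# time warping (DTW) distance between a recorded lip-movement feature sequence
# and stored phrase templates, returning the top three matches.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic DTW with Euclidean frame cost; a, b are (frames, features)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def rank_phrases(recording, templates, k=3):
    """templates: dict phrase -> feature sequence; returns the k best phrases."""
    scores = {p: dtw_distance(recording, t) for p, t in templates.items()}
    return sorted(scores, key=scores.get)[:k]

# Hypothetical usage with random stand-in features (5-dim mouth-shape vectors).
rng = np.random.default_rng(0)
templates = {p: rng.normal(size=(30, 5)) for p in ["I am in pain", "Suction please", "Water"]}
print(rank_phrases(rng.normal(size=(28, 5)), templates))
```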


Subject(s)
Mobile Applications , Speech , Humans , Female , Male , Feasibility Studies , Critical Illness/therapy , Prospective Studies
2.
Int J Speech Technol ; 26(1): 163-184, 2023.
Article in English | MEDLINE | ID: mdl-37008883

ABSTRACT

Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine whether visible speech cues in video alone can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels produced by multiple male and female talkers. Via a frame-by-frame image-warping based video generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to videos of plain speech to synthesize clear-speech videos. We evaluate the generated videos using a robust, state-of-the-art AI lip reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract relevant visual cues for video modifications across speech styles, and have achieved enhanced intelligibility for AI; (2) this work suggests that universal, talker-independent clear-speech features may be utilized to modify any talker's visual speech style; (3) we introduce the "displacement factor" as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the high-definition generated videos make them ideal candidates for human-centric intelligibility and perceptual training studies.
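As a rough illustration of the "displacement factor" idea, the sketch below scales a plain-to-clear landmark displacement field by a controllable parameter; the landmark arrays and the blending rule are assumptions, not the authors' warping method.

```python
# Illustrative sketch only: apply a "displacement factor" alpha to scale a
# plain-to-clear landmark displacement field, the quantity used to drive
# frame-by-frame image warping. Landmark arrays here are hypothetical.
import numpy as np

def warp_targets(plain_pts: np.ndarray, clear_delta: np.ndarray, alpha: float) -> np.ndarray:
    """plain_pts: (frames, landmarks, 2); clear_delta: mean clear-speech
    displacement per landmark, (landmarks, 2); alpha scales the modification."""
    return plain_pts + alpha * clear_delta[None, :, :]

rng = np.random.default_rng(1)
plain = rng.uniform(0, 255, size=(10, 20, 2))            # 10 frames, 20 lip landmarks
delta = rng.normal(scale=2.0, size=(20, 2))              # extracted clear-speech cue
targets_half = warp_targets(plain, delta, alpha=0.5)     # partial enhancement
targets_full = warp_targets(plain, delta, alpha=1.0)     # full enhancement
print(targets_half.shape, targets_full.shape)
```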

3.
Sensors (Basel) ; 23(4)2023 Feb 10.
Article in English | MEDLINE | ID: mdl-36850569

ABSTRACT

Speech is the most spontaneous and natural means of communication. Speech is also becoming the preferred modality for interacting with mobile or fixed electronic devices. However, speech interfaces have drawbacks, including a lack of user privacy; non-inclusivity for certain users; poor robustness in noisy conditions; and the difficulty of creating complex man-machine interfaces. To help address these problems, the Special Issue "Future Speech Interfaces with Sensors and Machine Intelligence" assembles eleven contributions covering multimodal and silent speech interfaces; lip reading applications; novel sensors for speech interfaces; and enhanced speech inclusivity tools for future speech interfaces. Short summaries of the articles are presented, followed by an overall evaluation. The success of this Special Issue has led to its being re-issued as "Future Speech Interfaces with Sensors and Machine Intelligence-II" with a deadline in March of 2023.


Subject(s)
Communication , Speech , Humans , Artificial Intelligence , Electronics , Privacy
4.
Sensors (Basel) ; 23(4)2023 Feb 11.
Article in English | MEDLINE | ID: mdl-36850648

ABSTRACT

The current accuracy of speech recognition can reach over 97% on different datasets, but in noisy environments it is greatly reduced. Improving speech recognition performance in noisy environments is a challenging task. Because visual information is not affected by noise, researchers often use lip information to help improve speech recognition performance. This makes the performance of lip recognition and the effect of cross-modal fusion particularly important. In this paper, we try to improve the accuracy of speech recognition in noisy environments by improving the lip-reading performance and the cross-modal fusion effect. First, because the same lip movements can correspond to multiple meanings, we constructed a one-to-many mapping relationship model between lips and speech, allowing the lip-reading model to consider which articulations are represented by the input lip movements. Audio representations are also preserved by modeling the inter-relationships between paired audiovisual representations. At the inference stage, the preserved audio representations can be extracted from memory by the learned inter-relationships using only video input. Second, a joint cross-fusion model using the attention mechanism can effectively exploit complementary intermodal relationships; the model calculates cross-attention weights on the basis of the correlations between joint feature representations and individual modalities. Lastly, our proposed model achieved a 4.0% reduction in WER in a -15 dB SNR environment compared to the baseline method, and a 10.1% reduction in WER compared to speech recognition alone. The experimental results show that our method can achieve a significant improvement over speech recognition models in different noise environments.
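For reference, the word error rate (WER) figures quoted above are conventionally computed as a word-level edit distance normalised by the reference length; a minimal implementation of that metric (not the paper's model) is sketched below.

```python
# Word error rate (WER): Levenshtein distance between reference and hypothesis
# word sequences divided by the reference length. Minimal reference code.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("bin blue at f two now", "bin blue at f too now"))  # 1 substitution -> ~0.17
```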


Subject(s)
Lipreading , Speech Perception , Humans , Speech , Learning , Lip
5.
Sensors (Basel) ; 23(4)2023 Feb 17.
Article in English | MEDLINE | ID: mdl-36850882

ABSTRACT

Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be used as a very important part of modern human-computer interaction systems. Currently, audio and video modalities are easily accessible by sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora, LRW and AUTSL, and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved AVSR accuracy for the LRW dataset equal to 98.76% and gesture recognition rate for the AUTSL dataset equal to 98.56%. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures by sensors of mobile devices.
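The sketch below contrasts two of the three fusion approaches named in the abstract, feature-level and prediction-level fusion, in schematic form; the toy classifier and dimensions are invented, and model-level fusion is omitted.

```python
# Illustrative contrast (not the authors' architecture): feature-level fusion
# concatenates modality embeddings before a classifier, while prediction-level
# fusion combines per-modality class probabilities.
import numpy as np

def feature_level_fusion(audio_emb, video_emb, classifier):
    return classifier(np.concatenate([audio_emb, video_emb]))

def prediction_level_fusion(audio_probs, video_probs, w=0.5):
    return w * audio_probs + (1 - w) * video_probs

rng = np.random.default_rng(3)
toy_classifier = lambda x: np.exp(x[:5]) / np.exp(x[:5]).sum()   # stand-in classifier
print(feature_level_fusion(rng.normal(size=16), rng.normal(size=16), toy_classifier))
print(prediction_level_fusion(np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.4, 0.1])))
```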


Subject(s)
Gestures , Speech , Humans , Computers, Handheld , Acoustics , Computer Systems
6.
Small ; 19(17): e2205058, 2023 04.
Article in English | MEDLINE | ID: mdl-36703524

ABSTRACT

Lip-reading provides an effective speech communication interface for people with voice disorders and for intuitive human-machine interactions. Existing systems are generally challenged by bulkiness, obtrusiveness, and poor robustness against environmental interferences. The lack of a truly natural and unobtrusive system for converting lip movements to speech precludes the continuous use and wide-scale deployment of such devices. Here, the design of a hardware-software architecture to capture, analyze, and interpret lip movements associated with either normal or silent speech is presented. The system can recognize different and similar visemes. It is robust in a noisy or dark environment. Self-adhesive, skin-conformable, and semi-transparent dry electrodes are developed to track high-fidelity speech-relevant electromyogram signals without impeding daily activities. The resulting skin-like sensors can form seamless contact with the curvilinear and dynamic surfaces of the skin, which is crucial for a high signal-to-noise ratio and minimal interference. Machine learning algorithms are employed to decode electromyogram signals and convert them to spoken words. Finally, the applications of the developed lip-reading system in augmented reality and medical service are demonstrated, which illustrate the great potential in immersive interaction and healthcare applications.


Subject(s)
Movement , Skin , Humans , Electromyography/methods , Electrodes , Machine Learning
7.
J Imaging ; 8(10)2022 Sep 21.
Article in English | MEDLINE | ID: mdl-36286349

ABSTRACT

Despite the success of hand-crafted features in computer vision for many years, nowadays they have been replaced by end-to-end learnable features extracted from deep convolutional neural networks (CNNs). Whilst CNNs can learn robust features directly from image pixels, they require large numbers of samples and extreme augmentations. In contrast, hand-crafted features, like SIFT, exhibit several interesting properties, such as providing local rotation invariance. In this work, a novel scheme combining the strengths of SIFT descriptors with CNNs, namely SIFT-CNN, is presented. Given a single-channel image, one SIFT descriptor is computed for every pixel, and thus every pixel is represented as an M-dimensional histogram, which ultimately results in an M-channel image. Thus, the SIFT image is generated from the SIFT descriptors for all the pixels in a single-channel image, while the original spatial size is preserved. Next, a CNN is trained to utilize these M-channel images as inputs by operating directly on the multiscale SIFT images with regular convolution processes. Since these images incorporate spatial relations between the histograms of the SIFT descriptors, the CNN is guided to learn features from local gradient information of images that otherwise could be neglected. In this manner, the SIFT-CNN implicitly acquires a local rotation invariance property, which is desired for problems where local areas within the image can be rotated without affecting the overall classification result of the respective image. Such problems include indirect immunofluorescence (IIF) cell image classification, ground-based all-sky image cloud classification and human lip-reading classification. The results for the popular datasets related to these three problems indicate that the proposed SIFT-CNN can improve performance and surpass the corresponding CNNs trained directly on pixel values in various challenging tasks, owing to its robustness to local rotations. Our findings highlight the importance of the input image representation in the overall efficiency of a data-driven system.
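A simplified, SIFT-like illustration of the descriptor-image idea: a local gradient-orientation histogram is computed around every pixel and stacked into an M-channel image of the original spatial size. This is not OpenCV's SIFT and not the authors' code, just a sketch of the representation a CNN would consume.

```python
# SIFT-like sketch: build an M-channel image by computing a local
# gradient-orientation histogram around every pixel of a single-channel image,
# preserving the spatial size so a CNN can operate on it directly.
import numpy as np

def orientation_histogram_image(gray: np.ndarray, bins: int = 8, radius: int = 4) -> np.ndarray:
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    h, w = gray.shape
    out = np.zeros((h, w, bins))
    pad_m = np.pad(mag, radius)
    pad_b = np.pad((ang / (2 * np.pi) * bins).astype(int) % bins, radius)
    for y in range(h):
        for x in range(w):
            m = pad_m[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            b = pad_b[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            out[y, x] = np.bincount(b.ravel(), weights=m.ravel(), minlength=bins)
    return out  # (H, W, bins): an M-channel "descriptor image"

gray = np.random.randint(0, 256, (32, 32)).astype(float)   # stand-in image
print(orientation_histogram_image(gray).shape)              # (32, 32, 8)
```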

8.
J Neurosci ; 42(31): 6108-6120, 2022 08 03.
Article in English | MEDLINE | ID: mdl-35760528

ABSTRACT

Speech perception in noisy environments is enhanced by seeing facial movements of communication partners. However, the neural mechanisms by which audio and visual speech are combined are not fully understood. We explore MEG phase-locking to auditory and visual signals in MEG recordings from 14 human participants (6 females, 8 males) who reported words from single spoken sentences. We manipulated the acoustic clarity and visual speech signals such that critical speech information is present in auditory, visual, or both modalities. MEG coherence analysis revealed that both auditory and visual speech envelopes (auditory amplitude modulations and lip aperture changes) were phase-locked to 2-6 Hz brain responses in auditory and visual cortex, consistent with entrainment to syllable-rate components. Partial coherence analysis was used to separate neural responses to correlated audio-visual signals and showed non-zero phase-locking to auditory envelope in occipital cortex during audio-visual (AV) speech. Furthermore, phase-locking to auditory signals in visual cortex was enhanced for AV speech compared with audio-only speech that was matched for intelligibility. Conversely, auditory regions of the superior temporal gyrus did not show above-chance partial coherence with visual speech signals during AV conditions but did show partial coherence in visual-only conditions. Hence, visual speech enabled stronger phase-locking to auditory signals in visual areas, whereas phase-locking of visual speech in auditory regions only occurred during silent lip-reading. Differences in these cross-modal interactions between auditory and visual speech signals are interpreted in line with cross-modal predictive mechanisms during speech perception. SIGNIFICANCE STATEMENT Verbal communication in noisy environments is challenging, especially for hearing-impaired individuals. Seeing facial movements of communication partners improves speech perception when auditory signals are degraded or absent. The neural mechanisms supporting lip-reading or audio-visual benefit are not fully understood. Using MEG recordings and partial coherence analysis, we show that speech information is used differently in brain regions that respond to auditory and visual speech. While visual areas use visual speech to improve phase-locking to auditory speech signals, auditory areas do not show phase-locking to visual speech unless auditory speech is absent and visual speech is used to substitute for missing auditory signals. These findings highlight brain processes that combine visual and auditory signals to support speech understanding.
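A hedged sketch of partial coherence, the analysis used above: coherence between a neural signal and the auditory envelope after removing the linear contribution of a correlated lip-aperture regressor. The signals are synthetic stand-ins for MEG data, and the band averaging is illustrative only.

```python
# Partial coherence sketch on synthetic signals: x ~ "neural signal",
# y ~ "auditory envelope", z ~ "lip aperture" correlated with y.
import numpy as np
from scipy.signal import csd

fs, n = 200, 20 * 200
rng = np.random.default_rng(4)
z = rng.normal(size=n)                          # "lip aperture"
y = 0.7 * z + 0.7 * rng.normal(size=n)          # "auditory envelope", correlated with z
x = 0.5 * y + 0.5 * z + rng.normal(size=n)      # "neural signal"

def spectra(a, b):
    f, s = csd(a, b, fs=fs, nperseg=256)
    return f, s

f, Sxy = spectra(x, y)
_, Sxx = spectra(x, x)
_, Syy = spectra(y, y)
_, Sxz = spectra(x, z)
_, Szy = spectra(z, y)
_, Szz = spectra(z, z)

coh = np.abs(Sxy) ** 2 / (Sxx.real * Syy.real)                       # ordinary coherence
num = Sxy - Sxz * Szy / Szz                                          # remove z's contribution
part = np.abs(num) ** 2 / ((Sxx - np.abs(Sxz) ** 2 / Szz).real *
                           (Syy - np.abs(Szy) ** 2 / Szz).real)      # partial coherence
band = (f >= 2) & (f <= 6)                                           # syllable-rate band
print(f"2-6 Hz coherence {coh[band].mean():.2f} vs partial coherence {part[band].mean():.2f}")
```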


Subject(s)
Auditory Cortex , Speech Perception , Visual Cortex , Acoustic Stimulation , Auditory Cortex/physiology , Auditory Perception , Female , Humans , Lipreading , Male , Speech/physiology , Speech Perception/physiology , Visual Cortex/physiology , Visual Perception/physiology
9.
eNeuro ; 9(3)2022.
Article in English | MEDLINE | ID: mdl-35728955

ABSTRACT

Speech is an intrinsically multisensory signal, and seeing the speaker's lips forms a cornerstone of communication in acoustically impoverished environments. Still, it remains unclear how the brain exploits visual speech for comprehension. Previous work debated whether lip signals are mainly processed along the auditory pathways or whether the visual system directly implements speech-related processes. To probe this, we systematically characterized dynamic representations of multiple acoustic and visual speech-derived features in source localized MEG recordings that were obtained while participants listened to speech or viewed silent speech. Using a mutual-information framework we provide a comprehensive assessment of how well temporal and occipital cortices reflect the physically presented signals and unique aspects of acoustic features that were physically absent but may be critical for comprehension. Our results demonstrate that both cortices feature a functionally specific form of multisensory restoration: during lip reading, they reflect unheard acoustic features, independent of co-existing representations of the visible lip movements. This restoration emphasizes the unheard pitch signature in occipital cortex and the speech envelope in temporal cortex and is predictive of lip-reading performance. These findings suggest that when seeing the speaker's lips, the brain engages both visual and auditory pathways to support comprehension by exploiting multisensory correspondences between lip movements and spectro-temporal acoustic cues.
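As a pointer to what a mutual-information framework quantifies, the sketch below estimates histogram-based mutual information between a toy neural response and a speech feature; the binning, signals and sample size are arbitrary choices, not the study's estimator.

```python
# Minimal histogram-based mutual information estimate between a (synthetic)
# neural response and a speech-derived feature.
import numpy as np

def mutual_information(x, y, bins=16):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x[:, None] * p_y[None, :])[nz]))

rng = np.random.default_rng(6)
feature = rng.normal(size=5000)                          # e.g., pitch or envelope samples
response = 0.6 * feature + 0.8 * rng.normal(size=5000)   # correlated stand-in "MEG" signal
print(f"MI estimate: {mutual_information(response, feature):.2f} bits")
```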


Subject(s)
Lipreading , Speech Perception , Acoustic Stimulation , Acoustics , Humans , Speech
10.
Sensors (Basel) ; 22(8)2022 Apr 12.
Article in English | MEDLINE | ID: mdl-35458932

ABSTRACT

Deep learning technology has encouraged research on noise-robust automatic speech recognition (ASR). The combination of cloud computing technologies and artificial intelligence has significantly improved the performance of open cloud-based speech recognition application programming interfaces (OCSR APIs). Noise-robust ASRs for application in different environments are being developed. This study proposes noise-robust OCSR APIs based on an end-to-end lip-reading architecture for practical applications in various environments. Several OCSR APIs, including Google, Microsoft, Amazon, and Naver, were evaluated using the Google Voice Command Dataset v2 to obtain the optimum performance. Based on performance, the Microsoft API was integrated with Google's trained word2vec model to enhance the keywords with more complete semantic information. The extracted word vector was integrated with the proposed lip-reading architecture for audio-visual speech recognition. Three forms of convolutional neural networks (3D CNN, 3D dense connection CNN, and multilayer 3D CNN) were used in the proposed lip-reading architecture. Vectors extracted from API and vision were classified after concatenation. The proposed architecture enhanced the OCSR API average accuracy rate by 14.42% using standard ASR evaluation measures along with the signal-to-noise ratio. The proposed model exhibits improved performance in various noise settings, increasing the dependability of OCSR APIs for practical applications.
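A minimal sketch, assuming PyTorch, of the general shape of such an architecture: a small 3D-CNN visual front-end whose output is concatenated with an external word-embedding vector (e.g., from word2vec) before classification. Layer sizes, vocabulary size and shapes are placeholders, not the paper's configuration.

```python
# 3D-CNN lip front-end fused with an external word-embedding vector by
# concatenation before a linear classifier (schematic only).
import torch
import torch.nn as nn

class LipFrontEnd(nn.Module):
    def __init__(self, embed_dim=300, n_classes=35):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32 + embed_dim, n_classes)

    def forward(self, clip, word_vec):
        v = self.conv(clip).flatten(1)               # (N, 32) visual vector
        return self.fc(torch.cat([v, word_vec], 1))  # fuse with word-embedding vector

clip = torch.randn(2, 1, 16, 64, 64)                 # batch, channel, frames, H, W
word_vec = torch.randn(2, 300)                       # e.g., word2vec keyword vectors
print(LipFrontEnd()(clip, word_vec).shape)           # torch.Size([2, 35])
```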


Subject(s)
Artificial Intelligence , Speech , Cloud Computing , Neural Networks, Computer , Speech Recognition Software
11.
Int J Dev Disabil ; 68(1): 47-55, 2022.
Article in English | MEDLINE | ID: mdl-35173963

ABSTRACT

A weaker McGurk effect is observed in individuals with autism spectrum disorder (ASD); weaker integration is considered to be the key to understanding how low-order atypical processing leads to their maladaptive social behaviors. However, the mechanism for this weaker McGurk effect has not been fully understood. Here, we investigated (1) whether the weaker McGurk effect in individuals with high autistic traits is caused by poor lip-reading ability and (2) whether the hearing environment modifies the weaker McGurk effect in individuals with high autistic traits. To confirm them, we conducted two analogue studies among university students, based on the dimensional model of ASD. Results showed that individuals with high autistic traits have intact lip-reading ability as well as abilities to listen and recognize audiovisual congruent speech (Experiment 1). Furthermore, a weaker McGurk effect in individuals with high autistic traits, which appear under the without-noise condition, would disappear under the high noise condition (Experiments 1 and 2). Our findings suggest that high background noise might shift weight on the visual cue, thereby increasing the strength of the McGurk effect among individuals with high autistic traits.

12.
Sensors (Basel) ; 22(3)2022 Feb 02.
Article in English | MEDLINE | ID: mdl-35161879

ABSTRACT

Automatic feature extraction from images of speech articulators is currently achieved by detecting edges. Here, we investigate the use of pose-estimation deep neural nets with transfer learning to perform markerless estimation of speech articulator keypoints using only a few hundred hand-labelled images as training input. Midsagittal ultrasound images of the tongue, jaw, and hyoid and camera images of the lips were hand-labelled with keypoints, trained using DeepLabCut and evaluated on unseen speakers and systems. Tongue surface contours interpolated from estimated and hand-labelled keypoints produced an average mean sum of distances (MSD) of 0.93, s.d. 0.46 mm, compared with 0.96, s.d. 0.39 mm, for two human labellers, and 2.3, s.d. 1.5 mm, for the best-performing edge detection algorithm. A pilot set of simultaneous electromagnetic articulography (EMA) and ultrasound recordings demonstrated partial correlation between three physical sensor positions and the corresponding estimated keypoints and requires further investigation. The accuracy of estimating lip aperture from camera video was high, with a mean MSD of 0.70, s.d. 0.56 mm, compared with 0.57, s.d. 0.48 mm, for two human labellers. DeepLabCut was found to be a fast, accurate and fully automatic method of providing unique kinematic data for the tongue, hyoid, jaw, and lips.
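For context, a mean-sum-of-distances (MSD) style comparison between an estimated and a hand-labelled contour can be sketched as below, using the common bidirectional nearest-neighbour definition; the paper's exact formulation and the toy contours here may differ.

```python
# MSD-style contour comparison: average nearest-neighbour distance between two
# contours, computed in both directions (a common definition; not necessarily
# the paper's exact metric).
import numpy as np

def msd(contour_a: np.ndarray, contour_b: np.ndarray) -> float:
    """contour_a, contour_b: (N, 2) and (M, 2) arrays of (x, y) points in mm."""
    d = np.linalg.norm(contour_a[:, None, :] - contour_b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

t = np.linspace(0, np.pi, 50)
estimated = np.column_stack([np.cos(t), np.sin(t)]) * 30             # toy tongue contour (mm)
hand_labelled = np.column_stack([np.cos(t), np.sin(t)]) * 30 + 0.8   # contour shifted ~1 mm
print(f"MSD = {msd(estimated, hand_labelled):.2f} mm")
```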


Subject(s)
Dental Articulators , Speech , Biomechanical Phenomena , Humans , Jaw/diagnostic imaging , Lip/diagnostic imaging , Tongue/diagnostic imaging
13.
Front Artif Intell ; 5: 1070964, 2022.
Article in English | MEDLINE | ID: mdl-36714203

ABSTRACT

Unlike the conventional frame-based camera, the event-based camera detects changes in the brightness value of each pixel over time. This research addresses lip-reading as a new application of the event-based camera. This paper proposes event-camera-based lip-reading for isolated single-sound recognition. The proposed method consists of imaging from event data, face and facial feature point detection, and recognition using a Temporal Convolutional Network. Furthermore, this paper proposes a method that combines the two modalities of the frame-based camera and the event-based camera. In order to evaluate the proposed method, utterance scenes of 15 Japanese consonants from 20 speakers were collected using an event-based camera and a video camera, and an original dataset was constructed. Several experiments were conducted by generating images at multiple frame rates from the event-based camera. The highest recognition accuracy was obtained with images generated from the event-based camera at 60 fps. Moreover, it was confirmed that combining the two modalities yields higher recognition accuracy than a single modality.
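A rough sketch of the event-to-frame step described above: events are accumulated into frames at a chosen frame rate. The event format (timestamp, x, y, polarity), resolution and accumulation rule are assumptions, not the authors' imaging method.

```python
# Accumulate an event stream into frames at a chosen frame rate (e.g., 60 fps).
import numpy as np

def events_to_frames(events, height, width, fps, duration_s):
    """events: array of (t_seconds, x, y, polarity in {-1, +1}) rows."""
    n_frames = int(duration_s * fps)
    frames = np.zeros((n_frames, height, width), dtype=np.int32)
    idx = np.minimum((events[:, 0] * fps).astype(int), n_frames - 1)
    for (t, x, y, p), f in zip(events, idx):
        frames[f, int(y), int(x)] += int(p)          # accumulate brightness changes
    return frames

rng = np.random.default_rng(5)
ev = np.column_stack([rng.uniform(0, 1.0, 5000),     # timestamps over 1 s
                      rng.integers(0, 128, 5000),    # x coordinate
                      rng.integers(0, 96, 5000),     # y coordinate
                      rng.choice([-1, 1], 5000)])    # polarity
print(events_to_frames(ev, 96, 128, fps=60, duration_s=1.0).shape)   # (60, 96, 128)
```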

14.
Sensors (Basel) ; 21(23)2021 Nov 26.
Article in English | MEDLINE | ID: mdl-34883888

ABSTRACT

As an alternative approach, viseme-based lipreading systems have demonstrated promising performance in decoding videos of people uttering entire sentences. However, the overall performance of such systems has been significantly affected by the efficiency of the conversion of visemes to words during the lipreading process. As shown in the literature, this issue has become a bottleneck of such systems, where performance can decrease dramatically from a high classification accuracy of visemes (e.g., over 90%) to a comparatively very low classification accuracy of words (e.g., only just over 60%). The underlying cause of this phenomenon is that roughly half of the words in the English language are homophemes, i.e., a set of visemes can map to multiple words, e.g., "time" and "some". In this paper, aiming to tackle this issue, a deep learning network model with an attention-based Gated Recurrent Unit is proposed for efficient viseme-to-word conversion and compared against three other approaches. The proposed approach features strong robustness, high efficiency, and short execution time. The approach has been verified with analysis and practical experiments of predicting sentences from the benchmark LRS2 and LRS3 datasets. The main contributions of the paper are as follows: (1) a model is developed that is effective in converting visemes to words, discriminates between homopheme words, and is robust to incorrectly classified visemes; (2) the proposed model uses few parameters and, therefore, requires little overhead and time to train and execute; and (3) an improved performance in predicting spoken sentences from the LRS2 dataset, with an attained word accuracy rate of 79.6%, an improvement of 15.0% compared with state-of-the-art approaches.
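A hedged sketch of an attention-based GRU mapping a viseme sequence to word logits, illustrating the viseme-to-word conversion discussed above; the viseme inventory, vocabulary size and layer widths are invented, and this is not the published model.

```python
# Attention-pooled GRU over a viseme ID sequence, producing a word distribution.
import torch
import torch.nn as nn

class VisemeToWord(nn.Module):
    def __init__(self, n_visemes=14, n_words=500, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_visemes, 32)
        self.gru = nn.GRU(32, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, n_words)

    def forward(self, visemes):                      # visemes: (batch, seq) int ids
        h, _ = self.gru(self.embed(visemes))         # (batch, seq, hidden)
        w = torch.softmax(self.attn(h), dim=1)       # attention weights over time
        context = (w * h).sum(dim=1)                 # attention-pooled context vector
        return self.out(context)                     # word logits

logits = VisemeToWord()(torch.randint(0, 14, (4, 10)))
print(logits.shape)                                  # torch.Size([4, 500])
```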


Subject(s)
Language , Lipreading , Humans
15.
J Imaging ; 7(5)2021 May 20.
Article in English | MEDLINE | ID: mdl-34460687

ABSTRACT

Lip reading (LR) is the task of predicting the speech utilizing only the visual information of the speaker. In this work, for the first time, the benefits of alternating between spatiotemporal and spatial convolutions for learning effective features from LR sequences are studied. In this context, a new learnable module named ALSOS (Alternating Spatiotemporal and Spatial Convolutions) is introduced into the proposed LR system. The ALSOS module consists of spatiotemporal (3D) and spatial (2D) convolutions along with two conversion components (3D-to-2D and 2D-to-3D) providing a sequence-to-sequence mapping. The designed LR system utilizes the ALSOS module in-between ResNet blocks, as well as Temporal Convolutional Networks (TCNs) in the backend for classification. The whole framework is composed of feedforward convolutional and residual layers and can be trained end-to-end directly from the image sequences in the word-level LR problem. The ALSOS module can capture spatiotemporal dynamics and can be advantageous for LR when combined with the ResNet topology. Experiments with different combinations of ALSOS with ResNet are performed on a Greek-language dataset simulating a medical support application scenario and on the popular large-scale LRW-500 dataset of English words. Results indicate that the proposed ALSOS module can improve the performance of an LR system. Overall, the insertion of the ALSOS module into the ResNet architecture obtained higher classification accuracy, since it incorporates the contribution of temporal information captured at different spatial scales of the framework.
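A minimal sketch of the alternating idea behind ALSOS as described above: a 3D convolution, a 3D-to-2D conversion folding time into the batch dimension, a 2D convolution, and a 2D-to-3D conversion back. Channel counts and the absence of the surrounding ResNet context are simplifications, not the published module.

```python
# Alternating spatiotemporal (3D) and spatial (2D) convolutions with explicit
# 3D-to-2D and 2D-to-3D conversions (schematic only).
import torch
import torch.nn as nn

class AlternatingBlock(nn.Module):
    def __init__(self, c_in=16, c_mid=32, c_out=16):
        super().__init__()
        self.conv3d = nn.Conv3d(c_in, c_mid, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(c_mid, c_out, kernel_size=3, padding=1)

    def forward(self, x):                                       # x: (N, C, T, H, W)
        x = torch.relu(self.conv3d(x))                          # spatiotemporal features
        n, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)    # 3D -> 2D (time into batch)
        x = torch.relu(self.conv2d(x))                          # per-frame spatial features
        x = x.reshape(n, t, -1, h, w).permute(0, 2, 1, 3, 4)    # 2D -> 3D
        return x

print(AlternatingBlock()(torch.randn(2, 16, 8, 32, 32)).shape)  # torch.Size([2, 16, 8, 32, 32])
```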

16.
Med Teach ; 43(11): 1333-1334, 2021 11.
Article in English | MEDLINE | ID: mdl-33108745

ABSTRACT

Due to the widening access to medicine scheme, students with disabilities are entering medicine. Hearing-impaired students are an important subcategory of medical students, whose specific learning challenges with respect to medicine are poorly explored in the literature. We feel that this topic is particularly important and relevant given the current COVID-19 pandemic, which has led to the widespread use of surgical masks, thereby posing a barrier to hearing, communication and education for hearing-impaired medical students. Therefore, the medical education of these students is of even greater importance as the pandemic continues. This personal view details the experiences of a current hearing-impaired medical student in the United Kingdom, with key learning points for medical educators who may require insight into hearing loss and how to tailor their teaching techniques accordingly.


Subject(s)
COVID-19 , Hearing Loss , Students, Medical , Humans , Pandemics , SARS-CoV-2
17.
Entropy (Basel) ; 22(12)2020 Dec 03.
Article in English | MEDLINE | ID: mdl-33279914

ABSTRACT

Extraction of relevant lip features is of continuing interest in the visual speech domain. Using end-to-end feature extraction can produce good results, but at the cost of the results being difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric glimpse-based psychological research into facial barcodes, and demonstrate that these simple, easy to extract 3D geometric features (produced using Gabor-based image patches), can successfully be used for speech recognition with LSTM-based machine learning. This approach can successfully extract low dimensionality lip parameters with a minimum of processing. One key difference between using these Gabor-based features and using other features such as traditional DCT, or the current fashion for CNN features is that these are human-centric features that can be visualised and analysed by humans. This means that it is easier to explain and visualise the results. They can also be used for reliable speech recognition, as demonstrated using the Grid corpus. Results for overlapping speakers using our lightweight system gave a recognition rate of over 82%, which compares well to less explainable features in the literature.
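An illustrative sketch, not the paper's pipeline: a Gabor kernel built with NumPy filters a mouth region of interest, and the thresholded response is reduced to a low-dimensional geometric feature vector of the sort that could feed an LSTM; the kernel parameters and feature choices are assumptions.

```python
# Gabor-filter a lip ROI and reduce the response to a small geometric feature
# vector (width, height, mean energy) per frame.
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lam=6.0):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def lip_features(roi: np.ndarray) -> np.ndarray:
    resp = np.abs(convolve2d(roi, gabor_kernel(), mode="same"))
    mask = resp > resp.mean() + resp.std()           # crude "barcode" of strong responses
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return np.zeros(3)
    return np.array([xs.ptp(), ys.ptp(), resp.mean()])   # width, height, energy

roi = np.random.rand(48, 64)                         # stand-in mouth region
print(lip_features(roi))                             # 3-D geometric feature per frame
```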

18.
Elife ; 9, 2020 08 24.
Article in English | MEDLINE | ID: mdl-32831168

ABSTRACT

Visual speech carried by lip movements is an integral part of communication. Yet, it remains unclear in how far visual and acoustic speech comprehension are mediated by the same brain regions. Using multivariate classification of full-brain MEG data, we first probed where the brain represents acoustically and visually conveyed word identities. We then tested where these sensory-driven representations are predictive of participants' trial-wise comprehension. The comprehension-relevant representations of auditory and visual speech converged only in anterior angular and inferior frontal regions and were spatially dissociated from those representations that best reflected the sensory-driven word identity. These results provide a neural explanation for the behavioural dissociation of acoustic and visual speech comprehension and suggest that cerebral representations encoding word identities may be more modality-specific than often upheld.


Subject(s)
Brain/anatomy & histology , Brain/physiology , Phonetics , Speech , Acoustic Stimulation/methods , Adolescent , Adult , Brain Mapping , Female , Humans , Photic Stimulation , Reading , Speech Perception , Young Adult
19.
Otolaryngol Head Neck Surg ; 163(4): 771-777, 2020 10.
Article in English | MEDLINE | ID: mdl-32453650

ABSTRACT

OBJECTIVES: To compare speech perception (SP) in noise for normal-hearing (NH) individuals and individuals with hearing loss (IWHL) and to demonstrate improvements in SP with use of a visual speech recognition program (VSRP). STUDY DESIGN: Single-institution prospective study. SETTING: Tertiary referral center. SUBJECTS AND METHODS: Eleven NH and 9 IWHL participants were seated in a sound-isolated booth facing a speaker through a window. In non-VSRP conditions, SP was evaluated on 40 Bamford-Kowal-Bench speech-in-noise test (BKB-SIN) sentences presented by the speaker at 50 A-weighted decibels (dBA) with multiperson babble noise presented from 50 to 75 dBA. SP was defined as the percentage of words correctly identified. In VSRP conditions, an infrared camera was used to track 35 points around the speaker's lips during speech in real time. Lip movement data were translated into speech-text via an in-house developed neural network-based VSRP. SP was evaluated similarly to the non-VSRP condition on 42 BKB-SIN sentences, with the addition of the VSRP output presented on a screen to the listener. RESULTS: In high-noise conditions (70-75 dBA) without VSRP, NH listeners achieved significantly higher speech perception than IWHL listeners (38.7% vs 25.0%, P = .02). NH listeners were significantly more accurate with VSRP than without VSRP (75.5% vs 38.7%, P < .0001), as were IWHL listeners (70.4% vs 25.0%, P < .0001). With VSRP, no significant difference in SP was observed between NH and IWHL listeners (75.5% vs 70.4%, P = .15). CONCLUSIONS: The VSRP significantly increased speech perception in high-noise conditions for NH and IWHL participants and eliminated the difference in SP accuracy between NH and IWHL listeners.


Subject(s)
Artificial Intelligence , Hearing Loss/rehabilitation , Noise , Speech Perception , Visual Perception , Adult , Case-Control Studies , Hearing Loss/physiopathology , Humans , Middle Aged , Prospective Studies , Sound Spectrography , Speech Perception/physiology , Visual Perception/physiology , Young Adult
20.
J Neurosci ; 40(5): 1053-1065, 2020 01 29.
Article in English | MEDLINE | ID: mdl-31889007

ABSTRACT

Lip-reading is crucial for understanding speech in challenging conditions. But how the brain extracts meaning from silent, visual speech is still under debate. Lip-reading in silence activates the auditory cortices, but it is not known whether such activation reflects immediate synthesis of the corresponding auditory stimulus or imagery of unrelated sounds. To disentangle these possibilities, we used magnetoencephalography to evaluate how cortical activity in 28 healthy adult humans (17 females) entrained to the auditory speech envelope and lip movements (mouth opening) when listening to a spoken story without visual input (audio-only), and when seeing a silent video of a speaker articulating another story (video-only). In video-only, auditory cortical activity entrained to the absent auditory signal at frequencies <1 Hz more than to the seen lip movements. This entrainment process was characterized by an auditory-speech-to-brain delay of ∼70 ms in the left hemisphere, compared with ∼20 ms in audio-only. Entrainment to mouth opening was found in the right angular gyrus at <1 Hz, and in early visual cortices at 1-8 Hz. These findings demonstrate that the brain can use a silent lip-read signal to synthesize a coarse-grained auditory speech representation in early auditory cortices. Our data indicate the following underlying oscillatory mechanism: seeing lip movements first modulates neuronal activity in early visual cortices at frequencies that match articulatory lip movements; the right angular gyrus then extracts slower features of lip movements, mapping them onto the corresponding speech sound features; this information is fed to auditory cortices, most likely facilitating speech parsing. SIGNIFICANCE STATEMENT Lip-reading consists of decoding speech based on visual information derived from observation of a speaker's articulatory facial gestures. Lip-reading is known to improve auditory speech understanding, especially when speech is degraded. Interestingly, lip-reading in silence still activates the auditory cortices, even when participants do not know what the absent auditory signal should be. However, it was uncertain what such activation reflected. Here, using magnetoencephalographic recordings, we demonstrate that it reflects fast synthesis of the auditory stimulus rather than mental imagery of unrelated speech or non-speech sounds. Our results also shed light on the oscillatory dynamics underlying lip-reading.
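A toy sketch of the kind of measurement behind an auditory-speech-to-brain delay: a Hilbert envelope is extracted from a speech-like signal, and the lag at which a synthetic, delayed "cortical" signal best correlates with it is estimated. Real MEG entrainment analysis involves source localisation, filtering and statistics not shown here.

```python
# Estimate the lag between a speech envelope and a delayed, noisy copy of it.
import numpy as np
from scipy.signal import hilbert

fs = 1000
t = np.arange(0, 10, 1 / fs)
speech = np.sin(2 * np.pi * 3 * t) * np.random.randn(len(t))       # toy audio signal
envelope = np.abs(hilbert(speech))                                  # speech envelope

delay_ms = 70
brain = np.roll(envelope, int(fs * delay_ms / 1000)) + 0.5 * np.random.randn(len(t))

lags = np.arange(0, 200)                                            # candidate lags, 0-200 ms
corrs = [np.corrcoef(envelope[:-lag or None], brain[lag:])[0, 1] for lag in lags]
print(f"estimated delay: {lags[int(np.argmax(corrs))]} ms")         # ~70 ms
```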


Subject(s)
Auditory Cortex/physiology , Lipreading , Speech Perception/physiology , Acoustic Stimulation , Female , Humans , Magnetoencephalography , Male , Pattern Recognition, Visual/physiology , Sound Spectrography , Young Adult