Results 1 - 11 of 11
1.
Article in English | MEDLINE | ID: mdl-36159631

ABSTRACT

Current hearing aids are limited in their ability to perform speech enhancement that is specifically optimized for spatial sound sources. In this study, we therefore propose an approach for the spatial detection of speech based on sound source localization and blind optimization of speech enhancement for binaural hearing aids. We combine an estimator for the direction of arrival (DOA), which offers high spatial resolution but no specialization to speech, with a measure of speech quality that has low spatial resolution and is obtained after directional filtering. The DOA estimator provides sound source probabilities in the frontal horizontal plane. The measure of speech quality is based on phoneme representations obtained from a deep neural network that is part of a hybrid automatic speech recognition (ASR) system. Three ASR-based speech quality measures (ASQMs) are explored: entropy, mean temporal distance (M-Measure), and matched phoneme (MaP) filtering. We tested the approach in four acoustic scenes with one speaker and either a localized or a diffuse noise source at various signal-to-noise ratios (SNRs) in anechoic or reverberant conditions. The effects of incorrect spatial filtering and of noise were analyzed. We show that two of the three ASQMs (M-Measure and MaP filtering) reliably identify the speech target in different conditions. The system is not adapted to the environment and requires neither a priori information about the acoustic scene nor a reference signal to estimate the quality of the enhanced speech signal. Nevertheless, our approach performs well in all tested acoustic scenes and at varying SNRs, and it reliably detects incorrect spatial filtering angles.
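As a rough illustration of how ASR-based speech quality measures of this kind can be derived from a phoneme posteriorgram, the following Python sketch computes a mean per-frame entropy and an M-Measure-style mean temporal distance. The distance metric, lag range, and the random posteriorgram stand-in are assumptions made for illustration; the paper's exact definitions (and its MaP filtering measure) may differ.

```python
import numpy as np

def frame_entropy(posteriors, eps=1e-12):
    """Mean per-frame entropy of a posteriorgram (frames x classes).
    Lower entropy suggests more confident, speech-like phoneme activity."""
    p = np.clip(posteriors, eps, 1.0)
    p = p / p.sum(axis=1, keepdims=True)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

def mean_temporal_distance(posteriors, lags=range(1, 50), eps=1e-12):
    """M-Measure-style score: average symmetric KL divergence between
    posterior vectors separated by increasing time lags."""
    p = np.clip(posteriors, eps, 1.0)
    p = p / p.sum(axis=1, keepdims=True)
    dists = []
    for lag in lags:
        a, b = p[:-lag], p[lag:]
        skl = np.sum((a - b) * (np.log(a) - np.log(b)), axis=1)
        dists.append(np.mean(skl))
    return float(np.mean(dists))

# Toy usage: 200 frames, 40 phoneme classes (random stand-in for DNN output)
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40), size=200)
print(frame_entropy(post), mean_temporal_distance(post))
```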

2.
J Acoust Soc Am ; 151(4): 2636, 2022 04.
Article in English | MEDLINE | ID: mdl-35461479

ABSTRACT

When confronted with unfamiliar or novel forms of speech, listeners' word recognition performance is known to improve with exposure, but data are lacking on the fine-grained time course of adaptation. The current study aims to fill this gap by investigating the time course of adaptation to several different types of distorted speech. Keyword scores as a function of sentence position in a block of 30 sentences were measured in response to eight forms of distorted speech. Listeners recognised twice as many words in the final sentence as in the initial sentence, with around half of the gain appearing in the first three sentences, followed by gradual gains over the rest of the block. Rapid adaptation was apparent for most of the eight distortion types tested, with differences mainly in the gradual phase. Adaptation to sine-wave speech improved if listeners had heard other types of distortion prior to exposure, but no similar facilitation occurred for the other types of distortion. Rapid adaptation is unlikely to be due to procedural learning, since listeners had been familiarised with the task and sentence format through exposure to undistorted speech. The mechanisms that underlie rapid adaptation remain unclear.
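The reported time course (roughly half of the gain within the first three sentences, then gradual improvement) can be summarized by fitting a saturating curve to keyword scores per sentence position. The exponential form, parameter names, and synthetic data below are illustrative assumptions, not the analysis used in the study.

```python
import numpy as np
from scipy.optimize import curve_fit

def adaptation_curve(n, final, gain, tau):
    """Keyword score vs. sentence position: rapid early growth that
    saturates, with a time constant tau expressed in sentences."""
    return final - gain * np.exp(-n / tau)

# Synthetic block of 30 sentences: scores roughly double from ~0.35 to ~0.70
positions = np.arange(1, 31)
scores = 0.70 - 0.35 * np.exp(-positions / 3.0)
scores += np.random.default_rng(1).normal(0, 0.03, size=30)

params, _ = curve_fit(adaptation_curve, positions, scores, p0=[0.7, 0.3, 2.0])
final, gain, tau = params
print(f"asymptote={final:.2f}, total gain={gain:.2f}, tau={tau:.1f} sentences")
```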


Subject(s)
Speech Perception; Hearing/physiology; Language; Noise; Speech; Speech Perception/physiology
3.
J Acoust Soc Am ; 151(3): 1417, 2022 03.
Article in English | MEDLINE | ID: mdl-35364918

ABSTRACT

Automatic speech recognition (ASR) has made major progress based on deep machine learning, which has motivated the use of deep neural networks (DNNs) as perception models, specifically for predicting human speech recognition (HSR). This study investigates whether a modeling approach based on a DNN serving as a phoneme classifier [Spille, Ewert, Kollmeier, and Meyer (2018). Comput. Speech Lang. 48, 51-66] can predict HSR for subjects with different degrees of hearing loss when listening to speech embedded in different complex noises. The eight noise signals range from simple stationary noise to a single competing talker and are added to matrix sentences, which are presented to 20 hearing-impaired (HI) listeners (categorized into three groups with different types of age-related hearing loss) to measure their speech recognition threshold (SRT), i.e., the signal-to-noise ratio at which 50% of the words are recognized. These SRTs are compared to predictions obtained from the ASR-based model using degraded feature representations that take into account the individual hearing loss of the participants as captured by a pure-tone audiogram. Additionally, SRTs obtained from eight normal-hearing (NH) listeners are analyzed. For the NH subjects and the three groups of HI listeners, the average SRT prediction error is below 2 dB, which is lower than the errors of the baseline models.
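A minimal sketch of the SRT definition used here: fit a psychometric function to word recognition rates measured at several SNRs and read off the SNR at 50%. The logistic form and the example data points are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr, srt, slope):
    """Logistic word-recognition function; 'srt' is the SNR at 50% recognition."""
    return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

# Hypothetical measurement: word recognition rate at a few SNRs
snrs = np.array([-12.0, -9.0, -6.0, -3.0, 0.0])
rates = np.array([0.12, 0.31, 0.55, 0.81, 0.94])

(srt, slope), _ = curve_fit(psychometric, snrs, rates, p0=[-6.0, 0.5])
print(f"Estimated SRT: {srt:.1f} dB SNR (slope {slope:.2f} per dB)")
```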


Subject(s)
Deep Learning; Presbycusis; Speech Perception; Hearing/physiology; Humans; Speech; Speech Perception/physiology
4.
Trends Hear ; 24: 2331216520970011, 2020.
Article in English | MEDLINE | ID: mdl-33272109

ABSTRACT

Speech audiometry in noise based on sentence tests is an important diagnostic tool for assessing listeners' speech recognition threshold (SRT), i.e., the signal-to-noise ratio corresponding to 50% intelligibility. The clinical standard measurement procedure requires a professional experimenter to record and evaluate the responses (expert-conducted speech audiometry). The use of automatic speech recognition enables self-conducted measurements with an easy-to-use speech-based interface. This article compares self-conducted SRT measurements using smart speakers with expert-conducted laboratory measurements. With smart speakers, there is no control over the absolute presentation level or the room acoustics, and the automated response logging can introduce errors. We investigate the differences between highly controlled laboratory measurements and smart speaker-based tests for young normal-hearing (NH) listeners as well as for elderly NH, mildly hearing-impaired, and moderately hearing-impaired listeners in rooms with low, medium, and high reverberation. For the smart speaker setup, we observe an overall bias in the SRT result that depends on the hearing loss. The bias ranges from +0.7 dB for elderly moderately hearing-impaired listeners to +2.2 dB for young NH listeners. The intrasubject standard deviation is close to the clinical standard deviation (0.57/0.69 dB for the young/elderly NH listeners, compared with 0.5 dB observed for clinical tests, and 0.93/1.09 dB for the mildly/moderately hearing-impaired listeners, compared with 0.9 dB). For detecting a clinically elevated SRT, the speech-based test achieves an area under the curve of 0.95 and therefore seems promising for complementing clinical measurements.
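To illustrate how such a detection performance can be quantified, the sketch below computes an area under the ROC curve from hypothetical smart-speaker SRTs and clinical labels; the numbers and the use of the raw SRT as the decision score are illustrative assumptions, not data from the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical smart-speaker SRTs (dB SNR) and clinical labels
# (1 = clinically elevated SRT, 0 = normal); purely illustrative numbers.
srt_smart_speaker = np.array([-7.1, -6.8, -5.9, -4.3, -4.5, -4.2, -3.8, -2.9])
clinically_elevated = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A higher (less negative) SRT indicates an elevated threshold,
# so the measured SRT itself can serve as the decision score.
auc = roc_auc_score(clinically_elevated, srt_smart_speaker)
print(f"AUC for detecting an elevated SRT: {auc:.2f}")
```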


Subject(s)
Hearing Loss; Speech Perception; Aged; Audiometry, Speech; Auditory Threshold; Hearing; Hearing Loss/diagnosis; Humans; Noise
5.
Eur J Neurosci ; 51(5): 1234-1241, 2020 03.
Article in English | MEDLINE | ID: mdl-29205588

ABSTRACT

Previous research has shown that it is possible to predict which speaker is attended in a multispeaker scene by analyzing a listener's electroencephalography (EEG) activity. In this study, existing linear models that learn the mapping from neural activity to the attended speech envelope are replaced by a non-linear neural network (NN). The proposed architecture takes into account the temporal context of the estimated envelope and is evaluated using EEG data obtained from 20 normal-hearing listeners who focused on one speaker in a two-speaker setting. The network is optimized with respect to the frequency range and the temporal segmentation of the EEG input, as well as the cost function used to estimate the model parameters. To identify the salient cues involved in auditory attention, a relevance algorithm is applied that highlights the electrode signals most important for attention decoding. In contrast to linear approaches, the NN profits from a wider EEG frequency range (1-32 Hz) and achieves a performance seven times higher than the linear baseline. Relevant EEG activations at physiologically plausible locations were found 170 ms after the speech stimulus. This was not observed when the model was trained on the unattended speaker. Our findings therefore indicate that non-linear NNs can provide insight into physiological processes by analyzing EEG activity.
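A minimal sketch of the general idea (not the architecture used in the study): a small non-linear regressor maps time-lagged EEG samples to the attended speech envelope and is evaluated by the correlation between reconstructed and actual envelopes. The lag length, network size, and random stand-in data are assumptions; with random data the resulting correlation is of course near zero.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_lagged(eeg, n_lags):
    """Stack n_lags past samples of every channel into one feature vector
    per time point (temporal context for the decoder)."""
    frames = [eeg[i : len(eeg) - n_lags + i + 1] for i in range(n_lags)]
    return np.hstack(frames)

# Synthetic stand-in data: 10 EEG channels and an attended-speech envelope
rng = np.random.default_rng(0)
n_samples, n_channels, n_lags = 2000, 10, 16
envelope = np.abs(rng.normal(size=n_samples + n_lags - 1))
eeg = rng.normal(size=(n_samples + n_lags - 1, n_channels))

X = make_lagged(eeg, n_lags)                  # (n_samples, n_channels * n_lags)
y = envelope[n_lags - 1 :]                    # align target with the latest lag
decoder = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
decoder.fit(X[:1500], y[:1500])
r = np.corrcoef(decoder.predict(X[1500:]), y[1500:])[0, 1]
print(f"Reconstruction correlation on held-out data: {r:.2f}")
```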


Subject(s)
Speech Perception; Speech; Acoustic Stimulation; Electroencephalography; Machine Learning
6.
J Speech Lang Hear Res ; 62(1): 177-189, 2019 01 30.
Article in English | MEDLINE | ID: mdl-30534994

ABSTRACT

Purpose: For elderly listeners, listening to one voice surrounded by other voices is more challenging than it is for young listeners. This could be caused by a reduced ability to use acoustic cues, such as slight differences in onset time, for the segregation of concurrent speech signals. Here, we study whether the ability to benefit from onset asynchrony differs between young (18-33 years) and elderly (55-74 years) listeners. Method: We investigated young (normal-hearing, N = 20) and elderly (mildly hearing-impaired, N = 26) listeners' ability to segregate two vowels with onset asynchronies ranging from 20 to 100 ms. Behavioral measures were complemented by a specific event-related brain potential component, the object-related negativity, indicating the perception of two distinct auditory objects. Results: Elderly listeners' behavioral performance (identification accuracy for the two vowels) was considerably poorer than young listeners'. However, both age groups showed the same amount of improvement with increasing onset asynchrony. Object-related negativity amplitude also increased similarly in both age groups. Conclusion: Both age groups benefit to a similar extent from onset asynchrony as a cue for concurrent speech segregation, during both active (behavioral measurement) and passive (electroencephalographic measurement) listening.
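For illustration only, the sketch below generates a crude two-vowel stimulus in which the second vowel starts 100 ms after the first. The vowel synthesis method, formant values, and fundamental frequencies are assumptions and not the stimuli used in the study.

```python
import numpy as np

def harmonic_vowel(f0, formants, duration, fs=16000):
    """Very rough vowel-like tone: harmonics of f0 weighted by their proximity
    to a few formant frequencies (illustrative, not a real vowel synthesizer)."""
    t = np.arange(int(duration * fs)) / fs
    signal = np.zeros_like(t)
    for k in range(1, int(4000 / f0)):
        weight = sum(np.exp(-((k * f0 - f) ** 2) / (2 * 150.0**2)) for f in formants)
        signal += weight * np.sin(2 * np.pi * k * f0 * t)
    return signal / np.max(np.abs(signal))

fs = 16000
asynchrony_ms = 100                                     # onset asynchrony in ms
v1 = harmonic_vowel(100, [730, 1090, 2440], 0.4, fs)    # /a/-like vowel, F0 = 100 Hz
v2 = harmonic_vowel(140, [270, 2290, 3010], 0.4, fs)    # /i/-like vowel, F0 = 140 Hz

delay = int(asynchrony_ms / 1000 * fs)
mix = np.zeros(delay + len(v2))
mix[: len(v1)] += v1
mix[delay:] += v2                                       # second vowel starts 100 ms later
```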


Subject(s)
Speech Acoustics; Speech Perception/physiology; Adult; Age Factors; Aged; Analysis of Variance; Audiometry; Auditory Threshold; Cues; Electroencephalography; Female; Humans; Male; Middle Aged; Young Adult
7.
Hear Res ; 359: 40-49, 2018 03.
Article in English | MEDLINE | ID: mdl-29373159

ABSTRACT

The effort required to listen to and understand noisy speech is an important factor in the evaluation of noise reduction schemes. This paper introduces a model for Listening Effort prediction from Acoustic Parameters (LEAP). The model is based on methods from automatic speech recognition, specifically on performance measures that quantify the degradation of phoneme posteriorgrams produced by a deep neural network: noise or artifacts introduced by speech enhancement often result in a temporal smearing of phoneme representations, which is measured by comparing phoneme vectors. This procedure does not require a priori knowledge about the processed speech and is therefore single-ended. The proposed model was evaluated using three datasets of noisy speech signals with listening effort ratings obtained from normal-hearing and hearing-impaired subjects. The prediction quality was compared to several baseline models: the ITU-T standard P.563 and the American National Standard ANIQUE+ for single-ended speech quality assessment, and a single-ended SNR estimator. In all three datasets, the proposed model achieved clearly better prediction accuracies than the baseline models; correlations with subjective ratings were above 0.9. So far, the model has been trained on the specific noise types used in the evaluation. Future work will address this limitation by training the model on a variety of noise types in a multi-condition way, in order to make it generalize to unknown noise types.
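A minimal sketch of the underlying idea, assuming a cosine-similarity smearing measure, synthetic posteriorgrams, and made-up effort ratings (the LEAP model's actual performance measures and rating scale may differ): posteriors are artificially smeared over time, scored, and correlated with hypothetical listening-effort ratings.

```python
import numpy as np

def smearing_score(posteriors, lag=10):
    """Illustrative single-ended smearing measure: mean cosine similarity
    between phoneme posterior vectors 'lag' frames apart. Noise and
    enhancement artifacts tend to blur posteriors over time, raising it."""
    a, b = posteriors[:-lag], posteriors[lag:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
clean = rng.dirichlet(np.ones(40), size=300)     # sharp, fast-changing posteriors

# Simulate increasing temporal smearing by moving-average filtering over time
scores = []
for width in (1, 5, 10, 20, 40):
    kernel = np.ones(width) / width
    smeared = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, clean)
    scores.append(smearing_score(smeared))

# Hypothetical mean listening-effort ratings for the same five conditions
ratings = np.array([2.0, 3.5, 5.0, 7.5, 9.5])
print("smearing scores:", np.round(scores, 2))
print("correlation with effort:", np.round(np.corrcoef(scores, ratings)[0, 1], 2))
```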


Subject(s)
Attention; Deep Learning; Hearing Disorders/psychology; Models, Psychological; Noise/adverse effects; Perceptual Masking; Persons With Hearing Impairments/psychology; Speech Perception; Acoustic Stimulation; Adult; Aged; Audiometry, Speech; Auditory Pathways/physiopathology; Case-Control Studies; Female; Hearing; Hearing Disorders/diagnosis; Hearing Disorders/physiopathology; Humans; Male; Middle Aged; Young Adult
8.
Trends Hear ; 20, 2016 09 07.
Article in English | MEDLINE | ID: mdl-27604782

ABSTRACT

To characterize an individual patient's hearing impairment as obtained with the matrix sentence recognition test, the simulation Framework for Auditory Discrimination Experiments (FADE) is extended here using the Attenuation and Distortion (A+D) approach by Plomp as a blueprint for setting the individual processing parameters. FADE has been shown to predict the outcome of both speech recognition tests and psychoacoustic experiments based on simulations with an automatic speech recognition system, requiring only few assumptions. It builds on the closed-set matrix sentence recognition test, which is advantageous for testing individual speech recognition in a way that is comparable across languages. Individual predictions of speech recognition thresholds in stationary and in fluctuating noise were derived using the audiogram and an estimate of the internal level uncertainty; these two quantities model the individual Plomp curves, which were fitted to the data with the Attenuation (A) and Distortion (D) parameters of the Plomp approach. The "typical" audiogram shapes from Bisgaard et al., with or without a "typical" level uncertainty, and the individual data were used for individual predictions. As a result, the individualization of the level uncertainty was found to be more important than the exact shape of the individual audiogram for accurately modeling the outcome of the German Matrix test in stationary or fluctuating noise for listeners with hearing impairment. In terms of prediction accuracy, the individualized approach also outperforms the (modified) Speech Intelligibility Index approach, which is based on the individual threshold data only.
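A hypothetical sketch of the A+D idea applied to ASR features (not the actual FADE implementation): band energies below an audiogram-like threshold are removed (attenuation), and random level jitter models the internal level uncertainty (distortion). Feature dimensions, thresholds, and the jitter magnitude are assumptions.

```python
import numpy as np

def degrade_features(log_mel, band_thresholds_db, level_uncertainty_db, rng):
    """Hypothetical FADE-style feature degradation: attenuate each frequency
    band according to an audiogram-like threshold (A component) and add
    random level jitter as internal level uncertainty (D component)."""
    # Clip band energies that fall below the elevated hearing threshold
    attenuated = np.maximum(log_mel, band_thresholds_db[np.newaxis, :])
    # Superimpose Gaussian level uncertainty on every time-frequency bin
    jitter = rng.normal(0.0, level_uncertainty_db, size=log_mel.shape)
    return attenuated + jitter

rng = np.random.default_rng(0)
log_mel = rng.normal(40.0, 10.0, size=(100, 31))   # frames x mel bands, in dB
thresholds = np.linspace(20.0, 60.0, 31)           # sloping high-frequency loss
degraded = degrade_features(log_mel, thresholds, level_uncertainty_db=7.0, rng=rng)
```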


Subject(s)
Auditory Threshold; Hearing Loss; Noise; Speech Perception; Humans; Power, Psychological; Speech Intelligibility
9.
J Acoust Soc Am ; 131(5): 4134-51, 2012 May.
Article in English | MEDLINE | ID: mdl-22559385

ABSTRACT

In an attempt to increase the robustness of automatic speech recognition (ASR) systems, a feature extraction scheme is proposed that takes spectro-temporal modulation frequencies (MF) into account. This physiologically inspired approach uses a two-dimensional filter bank based on Gabor filters, which limits the redundant information between feature components and also results in physically interpretable features. Robustness against extrinsic variation (different types of additive noise) and intrinsic variability (arising from changes in speaking rate, effort, and style) is quantified in a series of recognition experiments. The results are compared to reference ASR systems using Mel-frequency cepstral coefficients (MFCCs), MFCCs with cepstral mean subtraction (CMS), and RASTA-PLP features. Gabor features are shown to be more robust against extrinsic variation than the baseline systems without CMS, with relative improvements of 28% and 16% for two training conditions (using only clean training samples or a mixture of noisy and clean utterances, respectively). When used in a state-of-the-art system, improvements of 14% are observed when spectro-temporal features are concatenated with MFCCs, indicating the complementarity of those feature types. An analysis of the importance of specific MF shows that temporal MF up to 25 Hz and spectral MF up to 0.25 cycles/channel are beneficial for ASR.
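A minimal sketch of a single spectro-temporal Gabor filter applied to a log-mel spectrogram; the filter sizes, modulation frequencies, and Hann envelope below are illustrative assumptions rather than the paper's exact filter bank.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_2d(omega_t, omega_f, size_t=25, size_f=9):
    """2-D Gabor filter tuned to a temporal modulation frequency omega_t
    (cycles/frame) and a spectral modulation frequency omega_f (cycles/channel)."""
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    T, F = np.meshgrid(t, f, indexing="ij")
    carrier = np.cos(2 * np.pi * (omega_t * T + omega_f * F))
    envelope = np.hanning(size_t)[:, None] * np.hanning(size_f)[None, :]
    return carrier * envelope

# Apply one filter of an (assumed) Gabor filter bank to a log-mel spectrogram
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(200, 31))           # frames x mel channels (stand-in)
filt = gabor_2d(omega_t=0.06, omega_f=0.12)    # ~6 Hz temporal MF at 100 frames/s
features = convolve2d(log_mel, filt, mode="same")
```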


Subject(s)
Speech Acoustics; Speech Recognition Software/standards; Algorithms; Noise; Sound Spectrography
10.
J Acoust Soc Am ; 129(1): 388-403, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21303019

ABSTRACT

The aim of this study is, first, to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system, with special focus on intrinsic variations of speech such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated whether the most common ASR features contain all the information required to recognize speech in noisy environments, by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which serves as an estimate of the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and in ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.
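The human-machine gap in terms of the SNR can be illustrated by interpolating, for each system, the SNR at which a common target accuracy is reached and taking the difference. The recognition scores below are hypothetical and chosen only to demonstrate the computation.

```python
import numpy as np

def snr_at_accuracy(snrs, accuracies, target):
    """Interpolate the SNR at which a recognizer reaches 'target' accuracy
    (assumes accuracies increase monotonically with SNR)."""
    return float(np.interp(target, accuracies, snrs))

# Hypothetical phoneme recognition scores vs. SNR for humans and an ASR system
snrs = np.array([-10.0, -5.0, 0.0, 5.0, 10.0, 15.0, 20.0])
hsr = np.array([0.45, 0.62, 0.75, 0.84, 0.90, 0.93, 0.95])
asr = np.array([0.18, 0.28, 0.42, 0.58, 0.70, 0.80, 0.87])

gap = snr_at_accuracy(snrs, asr, 0.75) - snr_at_accuracy(snrs, hsr, 0.75)
print(f"Human-machine gap at 75% accuracy: {gap:.1f} dB SNR")
```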


Subject(s)
Pattern Recognition, Automated; Pattern Recognition, Physiological; Phonetics; Speech Acoustics; Speech Perception; Acoustic Stimulation; Audiometry, Pure-Tone; Audiometry, Speech; Auditory Threshold; Cues; Female; Humans; Male; Neural Networks, Computer; Noise/adverse effects; Perceptual Masking; Recognition, Psychology; Speech Recognition Software; Time Factors; Time Perception
11.
J Acoust Soc Am ; 128(5): 3126-41, 2010 Nov.
Article in English | MEDLINE | ID: mdl-21110608

ABSTRACT

The influence of different sources of speech-intrinsic variation (speaking rate, effort, style, and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for man-machine comparisons. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results across intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor of recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on the transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place of articulation is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognition (ASR) systems.
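The transmitted-information analysis mentioned above follows the classic idea of computing the mutual information between presented and perceived feature categories from a confusion matrix. Below is a minimal sketch using a toy voicing confusion matrix; the counts are illustrative, not data from the study.

```python
import numpy as np

def transmitted_information(confusions):
    """Relative transmitted information from a confusion matrix of counts:
    mutual information between stimulus and response categories,
    normalized by the stimulus entropy."""
    p = confusions / confusions.sum()
    px = p.sum(axis=1, keepdims=True)          # stimulus marginal
    py = p.sum(axis=0, keepdims=True)          # response marginal
    nz = p > 0
    mi = np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz]))
    hx = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return mi / hx

# Toy voicing-feature confusion matrix (rows: presented, cols: responded)
voicing = np.array([[80, 20],
                    [15, 85]], dtype=float)
print(f"Relative transmitted information (voicing): {transmitted_information(voicing):.2f}")
```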


Subject(s)
Phonation/physiology; Phonetics; Speech Intelligibility/physiology; Speech Perception/physiology; Acoustic Stimulation/methods; Adolescent; Adult; Cues; Databases, Factual; Female; Humans; Language; Male; Models, Biological; Noise; Perceptual Masking/physiology; Recognition, Psychology/physiology; Young Adult