Results 1 - 9 of 9
1.
J Acoust Soc Am ; 155(3): 1916-1927, 2024 Mar 01.
Article in English | MEDLINE | ID: mdl-38456734

ABSTRACT

Speech quality is one of the main foci of speech-related research, where it is frequently studied alongside speech intelligibility, another essential measure. However, whereas band-level perceptual speech intelligibility has been studied frequently, band-level speech quality has not been thoroughly analyzed. In this paper, an approach inspired by Multiple Stimuli With Hidden Reference and Anchor (MUSHRA) is proposed to study the robustness of individual frequency bands to noise, with perceptual speech quality as the measure. Speech signals were filtered into thirty-two frequency bands, with corrupting real-world noise applied at different signal-to-noise ratios. Robustness-to-noise indices of individual frequency bands were calculated from the human-rated perceptual quality scores assigned to the reconstructed noisy speech signals. Trends in the results suggest that the mid-frequency region is less robust to noise in terms of perceptual speech quality. These findings suggest that future research aiming to improve speech quality should pay more attention to the mid-frequency region of speech signals.


Subjects
Speech Perception, Humans, Perceptual Masking, Noise/adverse effects, Speech Intelligibility, Speech Acoustics
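
The abstract above describes adding noise at different signal-to-noise ratios. As a minimal sketch of that one step, assuming the SNR is defined as the ratio of mean signal power to mean noise power, the noise can be rescaled before mixing. The function name `mix_at_snr` and the sinusoid standing in for a speech band are hypothetical; the paper's filterbank and noise recordings are not given here.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it.

    Assumes `noise` is at least as long as `speech`.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain g so that 10*log10(speech_power / (g**2 * noise_power)) == snr_db
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
# Hypothetical stand-in for one band of a speech signal (1 s at 16 kHz)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

noisy, scaled = mix_at_snr(speech, noise, snr_db=5.0)
achieved_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled ** 2))
```

Calling the function once per band and per target SNR would produce the kind of stimulus grid the abstract implies, though the exact procedure used in the study may differ.
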
2.
J Acoust Soc Am ; 148(5): 3348, 2020 Nov.
Article in English | MEDLINE | ID: mdl-33261399

ABSTRACT

Objective metrics, such as the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and signal-to-distortion ratio (SDR), are often used for evaluating speech. These metrics are intrusive since they require a reference (clean) speech signal to complete the evaluation. The need for a reference signal reduces the practicality of these metrics, since a clean reference signal is not typically available during real-world testing. In this paper, a two-stage approach is presented that estimates the objective score of these intrusive metrics in a non-intrusive manner, which enables testing in real-world environments. More specifically, objective score estimation is treated as a machine-learning problem, and the use of speech-enhancement residuals and convolutional long short-term memory (SER-CL) networks is proposed to blindly estimate the objective scores (i.e., PESQ, STOI, and SDR) of various speech signals. The approach is evaluated in simulated and real environments that contain different combinations of noise and reverberation. The results reveal that the proposed approach is a reasonable alternative for evaluating speech, where it performs well in terms of accuracy and correlation. The proposed approach also outperforms comparison approaches in several environments.


Subjects
Speech Intelligibility, Speech Perception, Short-Term Memory, Noise/adverse effects, Signal-to-Noise Ratio
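
To illustrate why the metrics named in this abstract are called intrusive, here is a deliberately simplified signal-to-distortion ratio. It is an assumption-laden sketch, not the full SDR used in separation toolkits (which also involves projection steps), and the function name `simple_sdr_db` is hypothetical. The point is only that the clean reference appears explicitly in the formula.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over the energy of (reference - estimate).

    This requires the clean reference signal, which is what makes the metric intrusive.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    distortion = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(distortion ** 2))

# Toy signals: the estimate is the reference attenuated by 10 percent
reference = np.array([1.0, -1.0, 1.0, -1.0])
estimate = 0.9 * reference
sdr_db = simple_sdr_db(reference, estimate)
```

A non-intrusive estimator, as proposed in the abstract, would have to predict a value like `sdr_db` from the degraded signal alone.
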
3.
J Acoust Soc Am ; 141(6): 4668, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28679243

ABSTRACT

Time-frequency masking is a common solution for the single-channel source separation (SCSS) problem, where the goal is to find a time-frequency mask that separates the underlying sources from an observed mixture. An estimated mask is then applied to the mixed signal to extract the desired signal. During signal reconstruction, the time-frequency-masked spectral amplitude is combined with the mixture phase. This article considers the impact of replacing the mixture spectral phase with an estimated clean spectral phase, combined with a magnitude spectrum estimated using a conventional model-based approach. As the proposed phase estimator requires an estimate of the fundamental frequency of the underlying signal from the mixture, a robust pitch estimator is proposed. The upper-bound clean phase results show the potential of phase-aware processing in single-channel source separation. The experiments also demonstrate that replacing the mixture phase with the estimated clean spectral phase consistently improves perceptual speech quality, predicted speech intelligibility, and source separation performance across all signal-to-noise ratios and noise scenarios.


Subjects
Acoustics, Noise/adverse effects, Computer-Assisted Signal Processing, Speech Acoustics, Speech Production Measurement/methods, Voice Quality, Female, Fourier Analysis, Humans, Male, Signal-to-Noise Ratio, Sound Spectrography, Time Factors
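
The reconstruction step described in this abstract (masked amplitude combined with a phase) can be sketched on a single spectral frame. The example below is a toy under stated assumptions: random complex vectors stand in for the clean and noise spectra, an oracle amplitude mask is used, and the "clean phase" is the true one, which corresponds to the upper-bound condition rather than the paper's phase estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins = 257  # hypothetical number of frequency bins in one frame

# Random complex spectra standing in for clean speech S and noise N
S = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)
N = 0.5 * (rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins))
Y = S + N  # observed mixture

def reconstruct(magnitude, phase):
    """Combine a magnitude spectrum with a phase spectrum into a complex spectrum."""
    return magnitude * np.exp(1j * phase)

# Oracle amplitude mask: recovers |S| exactly when applied to |Y|
mask = np.abs(S) / np.abs(Y)
masked_mag = mask * np.abs(Y)

est_mixture_phase = reconstruct(masked_mag, np.angle(Y))  # conventional choice
est_clean_phase = reconstruct(masked_mag, np.angle(S))    # upper-bound choice

err_mixture = np.mean(np.abs(S - est_mixture_phase) ** 2)
err_clean = np.mean(np.abs(S - est_clean_phase) ** 2)
```

Even with a perfect magnitude, `err_mixture` stays well above `err_clean`, which is the gap a phase estimator tries to close.
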
4.
IEEE/ACM Trans Audio Speech Lang Process ; 25(7): 1492-1501, 2017 Jul.
Article in English | MEDLINE | ID: mdl-30112422

ABSTRACT

In real-world situations, speech is masked by both background noise and reverberation, which negatively affect perceptual quality and intelligibility. In this paper, we address monaural speech separation in reverberant and noisy environments. We perform dereverberation and denoising using supervised learning with a deep neural network. Specifically, we enhance the magnitude and phase by performing separation with an estimate of the complex ideal ratio mask. We define the complex ideal ratio mask so that direct speech results after the mask is applied to reverberant and noisy speech. Our approach is evaluated using simulated and real room impulse responses, and with background noises. The proposed approach improves objective speech quality and intelligibility significantly. Evaluations and comparisons show that it outperforms related methods in many reverberant and noisy environments.
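
The observation model implied by this abstract (direct speech degraded by reverberation and additive noise) can be sketched as a convolution plus a noise term. Everything below is a toy assumption: the white-noise "speech", the exponentially decaying synthetic impulse response, and the noise level are hypothetical stand-ins for the simulated and real room impulse responses mentioned in the abstract.

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 16000
direct = rng.standard_normal(fs)  # stand-in for one second of direct-path speech

# Toy room impulse response: unit direct path followed by a decaying random tail
rir_len = 4000
rir = 0.3 * rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / 800.0)
rir[0] = 1.0

# Reverberant speech is the direct signal convolved with the impulse response
reverberant = np.convolve(direct, rir)[: len(direct)]

# Additive background noise on top of the reverberant signal
noise = 0.1 * rng.standard_normal(len(direct))
observed = reverberant + noise
```

In this framing, a separation system receives `observed` and is trained toward `direct`, so one mask has to undo both the reverberant tail and the noise.
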

5.
IEEE/ACM Trans Audio Speech Lang Process ; 24(3): 483-492, 2016 Mar.
Article in English | MEDLINE | ID: mdl-27069955

ABSTRACT

Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This has been done because the phase spectrum was believed to be unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider enhancing both the magnitude and phase spectra. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and in a listening test where subjects prefer the proposed approach at a rate of at least 69%.
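
The real and imaginary components mentioned in this abstract can be written out explicitly. The sketch below assumes the complex mask M is defined by M * Y = S, where Y and S are the noisy and clean STFT values, so that M = S / Y; the function name is hypothetical, and any compression or scaling applied before network training is not covered.

```python
import numpy as np

def complex_ideal_ratio_mask(Y, S):
    """Return real and imaginary parts of M = S / Y, computed component-wise.

    M = S * conj(Y) / |Y|^2, expanded into real arithmetic.
    """
    denom = Y.real ** 2 + Y.imag ** 2
    m_real = (Y.real * S.real + Y.imag * S.imag) / denom
    m_imag = (Y.real * S.imag - Y.imag * S.real) / denom
    return m_real, m_imag

rng = np.random.default_rng(4)
shape = (129, 20)  # hypothetical (frequency bins, frames)
S = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
Y = S + 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

m_real, m_imag = complex_ideal_ratio_mask(Y, S)
recovered = (m_real + 1j * m_imag) * Y  # applying the mask restores the clean STFT
```

Because the mask is complex, multiplying by it rotates the phase as well as scaling the magnitude, which a real-valued ratio mask cannot do.
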

6.
J Acoust Soc Am ; 138(3): 1399-407, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26428778

ABSTRACT

As a means of speech separation, time-frequency masking applies a gain function to the time-frequency representation of noisy speech. On the other hand, nonnegative matrix factorization (NMF) addresses separation by linearly combining basis vectors from speech and noise models to approximate noisy speech. This paper presents an approach for improving the perceptual quality of speech separated from background noise at low signal-to-noise ratios. An ideal ratio mask is estimated, which separates speech from noise with reasonable sound quality. A deep neural network then approximates clean speech by estimating activation weights from the ratio-masked speech, where the weights linearly combine elements from an NMF speech model. Systematic comparisons using objective metrics, including the perceptual evaluation of speech quality, show that the proposed algorithm achieves higher speech quality than related masking and NMF methods. In addition, a listening test was performed, and its results show that the output of the proposed algorithm is preferred over the comparison systems in terms of speech quality.


Subjects
Speech Perception/physiology, Adult, Algorithms, Female, Humans, Male, Biological Models, Nerve Net/physiology, Noise, Transportation Noise, Perceptual Masking/physiology, Signal-to-Noise Ratio, Sound Spectrography/methods, Speech Acoustics, Young Adult
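
The two stages in this abstract (ratio masking, then a linear combination of NMF speech basis vectors) can be sketched with toy data. Several assumptions are made loudly here: the ratio mask uses one common definition, the basis `W` is random rather than trained on speech, and the activation weights are found with standard multiplicative updates as a stand-in for the deep neural network the paper uses to estimate them.

```python
import numpy as np

rng = np.random.default_rng(3)
n_bins, n_frames, n_basis = 64, 40, 8  # hypothetical sizes

# Hypothetical nonnegative speech basis and toy magnitude spectrograms
W = rng.random((n_bins, n_basis))
speech_mag = W @ rng.random((n_basis, n_frames))
noise_mag = rng.random((n_bins, n_frames))
noisy_mag = np.sqrt(speech_mag ** 2 + noise_mag ** 2)  # assumes powers add

# One common ideal ratio mask definition, perturbed to mimic an estimated mask
irm = speech_mag / noisy_mag
est_mask = np.clip(irm + 0.1 * rng.standard_normal(irm.shape), 0.0, 1.0)
masked_mag = est_mask * noisy_mag

# Fit activations H with W fixed (multiplicative updates, Euclidean cost).
# This replaces the paper's network-based weight estimation for illustration only.
H = rng.random((n_basis, n_frames))
eps = 1e-12
errors = []
for _ in range(100):
    errors.append(np.linalg.norm(masked_mag - W @ H))
    H *= (W.T @ masked_mag) / (W.T @ W @ H + eps)

reconstructed = W @ H  # speech approximated as a combination of basis vectors
errors.append(np.linalg.norm(masked_mag - reconstructed))
```

Constraining the output to the span of a speech basis is what lets the second stage clean up artifacts that the mask alone leaves behind.
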
7.
J Acoust Soc Am ; 136(2): 892-902, 2014 Aug.
Article in English | MEDLINE | ID: mdl-25096123

ABSTRACT

This study proposes an approach to improve the perceptual quality of speech separated by binary masking through the use of reconstruction in the time-frequency domain. Non-negative matrix factorization and sparse reconstruction approaches are investigated, both using a linear combination of basis vectors to represent a signal. In this approach, the short-time Fourier transform (STFT) of separated speech is represented as a linear combination of STFTs from a clean speech dictionary. Binary masking for separation is performed using deep neural networks or Bayesian classifiers. The perceptual evaluation of speech quality, which is a standard objective speech quality measure, is used to evaluate the performance of the proposed approach. The results show that the proposed techniques improve the perceptual quality of binary masked speech, and outperform traditional time-frequency reconstruction approaches.


Subjects
Noise/adverse effects, Perceptual Masking, Computer-Assisted Signal Processing, Speech Acoustics, Speech Intelligibility, Speech Perception, Voice Quality, Acoustic Stimulation, Algorithms, Speech Audiometry, Bayes Theorem, Fourier Analysis, Humans, Linear Models, Computer Neural Networks, Speech Production Measurement, Time Factors
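
Binary masking, the separation step named in this abstract, keeps a time-frequency unit only when speech dominates it. The sketch below computes an oracle (ideal) binary mask from known speech and noise magnitudes, assuming a local SNR threshold of 0 dB; the function name and threshold parameter `lc_db` are hypothetical, and in the paper the mask is predicted by deep neural networks or Bayesian classifiers rather than computed from oracle signals.

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """Return 1.0 where the local SNR exceeds `lc_db` decibels, else 0.0."""
    local_snr_db = 20.0 * np.log10(speech_mag / noise_mag)
    return (local_snr_db > lc_db).astype(float)

# Toy 2x2 magnitude spectrograms (rows: frequency bins, columns: frames)
speech_mag = np.array([[2.0, 0.5],
                       [1.0, 4.0]])
noise_mag = np.array([[1.0, 1.0],
                      [2.0, 1.0]])

mask = ideal_binary_mask(speech_mag, noise_mag)
masked_mag = mask * np.sqrt(speech_mag ** 2 + noise_mag ** 2)  # assumes powers add
```

The hard zeros in `masked_mag` are the kind of gaps that the reconstruction stage in the abstract is meant to fill in from a clean speech dictionary.
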
...