1.
IEEE Trans Pattern Anal Mach Intell; 46(6): 4234-4245, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38241115

ABSTRACT

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure, introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules that enhance the capacity of the prior from text and reduce the complexity of the posterior from speech: phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves a -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p >> 0.05, demonstrating no statistically significant difference from human recordings for the first time.


Subjects
Algorithms, Humans, Signal Processing, Computer-Assisted, Speech/physiology, Natural Language Processing, Databases, Factual, Sound Spectrography/methods
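
The judging guideline lends itself to a short illustration. The sketch below shows, with made-up listener ratings rather than the paper's data, how a CMOS score and the Wilcoxon signed-rank test decide whether a system is statistically indistinguishable from human recordings.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical per-utterance comparison ratings on a -3..+3 CMOS scale
# (positive favors the TTS output, negative favors the human recording).
ratings = rng.normal(loc=-0.01, scale=0.5, size=100)

cmos = float(ratings.mean())        # comparative mean opinion score
stat, p = wilcoxon(ratings)         # H0: ratings are symmetric about zero
print(f"CMOS = {cmos:+.3f}, Wilcoxon signed-rank p = {p:.3f}")
# Under the paper's guideline, p >> 0.05 means the difference from human
# recordings is not statistically significant.
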
2.
Front Psychiatry; 13: 885120, 2022.
Article in English | MEDLINE | ID: mdl-35573327

ABSTRACT

Electroencephalography (EEG) is one of the most widely used biosignal-capturing technologies for investigating brain activities, cognitive diseases, and affective disorders. To understand the underlying principles of brain activities and affective disorders using EEG data, one fundamental task is to accurately identify emotions from EEG signals, which has attracted huge attention in the field of affective computing. To improve the accuracy and effectiveness of emotion recognition based on EEG data, previous studies have developed numerous feature extraction methods and classifiers. Among them, ensemble empirical mode decomposition (EEMD) is an efficient signal decomposition technique for extracting EEG features: it alleviates the mode-mixing problem by adding white noise to the source signal. However, some issues remain when applying this method to recognition tasks. Because the added noise cannot be filtered out completely, spurious modes are generated by the residual noise. It is therefore crucial to perform intrinsic mode function (IMF) selection to find the most valuable IMF components, those that actually represent brain activities. Furthermore, the number of decomposed IMFs varies across source signals, so unifying feature dimensions requires a better solution. To solve these issues, we propose a novel framework, named DEEMD-SPP, to identify emotions from EEG signals, based on the combination of denoising ensemble empirical mode decomposition (DEEMD) and a Spatial Pyramid Pooling Network (SPP-Net). First, DEEMD decomposes the EEG signals, effectively eliminating residual noise in the IMFs and selecting the most valuable ones. Second, time-domain and frequency-domain features are extracted from the selected IMFs. Finally, SPP-Net serves as the classifier, transforming variable-sized feature maps into fixed-sized feature vectors through its pyramid pooling layer. The experimental results demonstrate that the proposed DEEMD-SPP framework effectively reduces the effect of the added white noise, accurately extracts EEG features, and significantly improves the performance of emotion recognition.
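
The fixed-dimension step is easy to illustrate. Below is a minimal NumPy rendering of the pyramid-pooling idea the abstract attributes to SPP-Net, not the paper's implementation; the pooling levels (1, 2, 4) and the use of max pooling are illustrative assumptions.

import numpy as np

def pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (channels, length) map at each grid level and concatenate,
    giving a fixed-length vector regardless of the input length."""
    pooled = []
    for level in levels:
        for chunk in np.array_split(feature_map, level, axis=1):
            pooled.append(chunk.max(axis=1))
    return np.concatenate(pooled)   # length = channels * sum(levels)

short = np.random.rand(8, 5)    # features from a signal with few IMFs
long_ = np.random.rand(8, 13)   # features from a signal with many IMFs
print(pyramid_pool(short).shape, pyramid_pool(long_).shape)  # both (56,)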

3.
Front Neurosci; 15: 689791, 2021.
Article in English | MEDLINE | ID: mdl-34335165

ABSTRACT

Recently, emotion classification from electroencephalogram (EEG) data has attracted much attention. Because EEG is an unsteady, rapidly changing voltage signal, the features extracted from it usually change dramatically, whereas emotion states change gradually; most existing feature extraction approaches ignore this mismatch. Microstate analysis can capture important spatio-temporal properties of EEG signals while reducing the fast-changing signal to a sequence of prototypical topographical maps. Although microstate analysis has been widely used to study brain function, few studies have applied it to how the brain responds to emotional auditory stimuli. In this study, we propose a novel feature extraction method based on EEG microstates for emotion recognition. Determining the optimal number of microstates automatically is a challenge when applying microstate analysis to emotion, so we propose dual-threshold-based atomize and agglomerate hierarchical clustering (DTAAHC) to determine the optimal number of microstate classes automatically. Using the proposed method to model the temporal dynamics of the auditory emotion process, we extracted microstate characteristics as novel temporospatial features to improve the performance of emotion recognition from EEG signals. We evaluated the method on two datasets. For the public music-evoked EEG Dataset for Emotion Analysis using Physiological signals, microstate analysis identified 10 microstates that together explained around 86% of the data at global field power peaks; emotion recognition using microstate sequence characteristics as features reached 75.8% accuracy for valence and 77.1% for arousal, outperforming the feature sets of previous studies. For the speech-evoked EEG dataset, microstate analysis identified nine microstates that together explained around 85% of the data; accuracy reached 74.2% for valence and 72.3% for arousal. These results indicate that microstate characteristics can effectively improve the performance of emotion recognition from EEG signals.
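
As a rough illustration of the microstate pipeline the method builds on, the sketch below clusters EEG topographies at global field power (GFP) peaks into prototype maps and back-fits them to label every sample. K-means stands in for the paper's DTAAHC (which additionally chooses the number of classes automatically), and all data here is synthetic.

import numpy as np
from scipy.signal import find_peaks
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
eeg = rng.standard_normal((32, 2000))    # hypothetical (channels, samples) EEG

gfp = eeg.std(axis=0)                    # global field power at each sample
peaks, _ = find_peaks(gfp)
maps_at_peaks = eeg[:, peaks].T          # one topography per GFP peak

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(maps_at_peaks)

def backfit(eeg, prototypes):
    """Label each sample with its most correlated prototype map (polarity
    ignored, as is usual in microstate analysis)."""
    x = eeg - eeg.mean(axis=0)
    x = x / np.linalg.norm(x, axis=0)
    p = prototypes - prototypes.mean(axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    return np.abs(p @ x).argmax(axis=0)

labels = backfit(eeg, km.cluster_centers_)
print(labels[:20])    # the microstate sequence that features are derived from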

4.
Neural Netw; 143: 250-260, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34157649

ABSTRACT

Advances in end-to-end TTS have shown that synthesized speech prosody can be controlled by conditioning the decoder on prosody attribute labels. However, annotating the prosody patterns of a large training set quantitatively is both time-consuming and expensive. To use unannotated data, the variational autoencoder (VAE) has been proposed to model each prosody attribute as a random variable in the latent space. The VAE is an unsupervised approach, and its latent variables are in general correlated with one another. For more effective and direct control of speech prosody along each attribute dimension, it is highly desirable to disentangle these correlated latent variables. Additionally, being able to interpret the disentangled attributes as speech perceptual cues is useful for designing more efficient prosody control in TTS. In this paper, we propose two attribute separation schemes: (1) using three separate VAEs to model the distinct real-valued prosodic features, i.e., F0, energy, and duration; and (2) minimizing the mutual information between different prosody attributes to remove their mutual correlations and facilitate more direct prosody control. Experimental results confirm that the two schemes make individual prosody attributes more interpretable and direct TTS prosody control more effective. The improvements are measured objectively by F0 Frame Error (FFE) and subjectively with MOS and A/B comparison listening tests. t-SNE scatter plots also show the correlations between prosody attributes, which are well disentangled by minimizing their mutual information. Synthesized TTS samples can be found at https://xiaochunan.github.io/prosody/index.html.


Subjects
Speech Perception, Speech, Cues
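
Scheme (2) can be sketched concretely. The paper minimizes mutual information between prosody latents; the snippet below uses a cross-covariance penalty as a simplified stand-in (zero cross-covariance is necessary, but not sufficient, for zero mutual information), with random arrays as placeholders for batches of VAE latent codes.

import numpy as np

def cross_cov_penalty(z_a, z_b):
    """Squared Frobenius norm of the cross-covariance of two latent batches."""
    za = z_a - z_a.mean(axis=0)
    zb = z_b - z_b.mean(axis=0)
    cov = za.T @ zb / (len(z_a) - 1)
    return float(np.sum(cov ** 2))

rng = np.random.default_rng(0)
z_f0, z_energy, z_dur = (rng.standard_normal((64, 8)) for _ in range(3))

# One penalty per attribute pair, added to the usual VAE training objective.
penalty = (cross_cov_penalty(z_f0, z_energy)
           + cross_cov_penalty(z_f0, z_dur)
           + cross_cov_penalty(z_energy, z_dur))
print(f"decorrelation penalty: {penalty:.4f}")
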
5.
Neural Netw; 140: 223-236, 2021 Aug.
Article in English | MEDLINE | ID: mdl-33780874

ABSTRACT

In this paper, we propose a cycle-consistent-network-based end-to-end TTS model for speaking style transfer, covering intra-speaker, inter-speaker, and unseen-speaker style transfer in both parallel and unparallel settings. The approach is built upon a multi-speaker variational autoencoder (VAE) TTS model. Such a model is usually trained in a paired manner, meaning the reference speech is fully paired with the output in speaker identity, text, and style. To achieve better quality for style transfer, which in most cases is unpaired, we augment the model with an unpaired path that has a separate variational style encoder. The unpaired path takes an unpaired reference speech as input and yields an unpaired output. This output, which lacks a direct ground-truth target, is constrained by a carefully designed cycle-consistent network: the unpaired output of the forward transfer is fed back into the model as an unpaired reference input, and the backward transfer is expected to reproduce the original unpaired reference speech. An ablation study shows the effectiveness of the unpaired path, the separate style encoders, and the cycle-consistent network. The final evaluation demonstrates that the proposed approach significantly outperforms Global Style Token (GST) and VAE-based systems across all six style-transfer categories, in terms of naturalness, speech quality, speaker-identity similarity, and speaking-style similarity.


Subjects
Machine Learning, Speech Recognition Software, Writing
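
The cycle-consistency constraint reduces to a simple loss structure, sketched below with a dummy stand-in for the TTS model; in the paper the model is a multi-speaker VAE TTS with a separate variational style encoder, and the synthesize function, shapes, and data here are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def synthesize(text, speaker, style_ref):
    """Dummy stand-in for the VAE TTS model: returns a fake mel spectrogram
    that weakly depends on the style reference."""
    base = rng.standard_normal((80, 100))
    return base + 0.1 * style_ref.mean()

ref = rng.standard_normal((80, 100))         # unpaired reference speech

# Forward transfer: the unpaired reference styles some target text.
forward_out = synthesize("target text", speaker="A", style_ref=ref)

# Backward transfer: the forward output is fed back as the style reference;
# its output should reconstruct the original reference speech.
backward_out = synthesize("reference text", speaker="B", style_ref=forward_out)

cycle_loss = float(np.mean((backward_out - ref) ** 2))  # L2 cycle penalty
print(f"cycle-consistency loss: {cycle_loss:.3f}")
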
6.
J Acoust Soc Am; 121(5 Pt 1): 2936-45, 2007 May.
Article in English | MEDLINE | ID: mdl-17550191

ABSTRACT

This paper studies automatic tone recognition in continuous Cantonese speech. Cantonese is a major Chinese dialect known for being rich in tones, and tone information serves as a useful knowledge source for automatic speech recognition of Cantonese. Cantonese tone recognition is difficult because the tones have similar pitch-contour shapes and are differentiated mainly by their relative pitch heights. In natural speech, the pitch level of a tone may shift up and down, and the F0 ranges of different tones overlap, making them acoustically indistinguishable within the domain of a single syllable. Our study shows that relative pitch heights are largely preserved between neighboring tones. A novel method of supratone modeling is proposed for Cantonese tone recognition, in which each supratone model characterizes the F0 contour of two or three tones in succession. The tone sequence of a continuous utterance is formed as an overlapped concatenation of supratone units, and the most likely tone sequence is determined under phonological constraints on syllable-tone combinations. The proposed method attains an accuracy of 74.68% in speaker-independent tone recognition experiments; in particular, confusion among tones with similar contour shapes is greatly reduced.


Subjects
Language, Pitch Perception, Speech Perception, Female, Humans, Male, Recognition, Psychology, Sound Spectrography, Speech Acoustics, Speech Discrimination Tests
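
The overlapped supratone decoding admits a compact dynamic-programming sketch. Below, each unit scores two adjacent tones jointly and consecutive units share a tone; the scoring function is a made-up stand-in for the paper's supratone F0-contour models, and a real decoder would also apply the phonological syllable-tone constraints the abstract mentions.

import numpy as np

TONES = range(6)  # Cantonese is conventionally described with six tones

def supratone_score(f0_pair, t1, t2):
    """Stand-in log-score for two adjacent syllables' F0 under tones (t1, t2)."""
    rng = np.random.default_rng(7 * t1 + t2)
    return float(rng.normal() - 0.01 * abs(f0_pair.mean() - 200.0))

def decode(f0_syllables):
    """Best tone sequence; consecutive supratone units overlap by one tone."""
    best = {t: (0.0, [t]) for t in TONES}   # best (score, sequence) per last tone
    for i in range(len(f0_syllables) - 1):
        pair = np.concatenate([f0_syllables[i], f0_syllables[i + 1]])
        best = {t2: max((s + supratone_score(pair, t1, t2), seq + [t2])
                        for t1, (s, seq) in best.items())
                for t2 in TONES}
    return max(best.values())

f0 = [np.random.default_rng(i).normal(200, 20, size=10) for i in range(4)]
score, tones = decode(f0)
print(f"best tone sequence: {tones} (score {score:.2f})")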