1.
Proc Natl Acad Sci U S A ; 92(22): 10046-51, 1995 Oct 24.
Article in English | MEDLINE | ID: mdl-7479724

ABSTRACT

Research in speech recognition and synthesis over the past several decades has brought speech technology to the point where it is being used in "real-world" applications. Despite this progress, however, the perception remains that the current technology is not flexible enough to allow easy voice communication with machines. The focus of speech research is now on producing systems that are accurate and robust but that do not impose unnecessary constraints on the user. This paper takes a critical look at the shortcomings of current speech recognition and synthesis algorithms, discusses the technical challenges facing research, and examines the new directions that research in speech recognition and synthesis must take to form the basis of new solutions suitable for supporting a wide range of applications.


Subject(s)
Communication Aids for Disabled; Communication; Speech; Voice; Automation; Humans; Pattern Recognition, Automated; Technology/trends; User-Computer Interface
2.
J Acoust Soc Am ; 85(4): 1726-40, 1989 Apr.
Article in English | MEDLINE | ID: mdl-2708689

ABSTRACT

The perception of subphonemic differences between vowels was investigated using multidimensional scaling techniques. Three experiments were conducted with natural-sounding synthetic stimuli generated by linear predictive coding (LPC) formant synthesizers. In the first experiment, vowel sets near the pairs /i–ɪ/, /ɛ–æ/, or /u–ʊ/ were synthesized, each containing 11 vowels. Listeners judged the dissimilarities between all pairs of vowels within a set several times. These perceptual differences were mapped into distances between the vowels in an n-dimensional space using two-way multidimensional scaling. Results for each vowel set showed that the physical stimulus space, which was specified by the two parameters F1 and F2, was always mapped into a two-dimensional perceptual space. The best metric for modeling the perceptual distances was the Euclidean distance between F1 and F2 in barks. The second experiment investigated the perception of the same vowels as in the first experiment, but embedded in a consonantal context. Following the same procedures as in experiment 1, listeners' perception of the /bV/ dissimilarities was not different from their perception of the isolated-vowel dissimilarities. The third experiment investigated dissimilarity judgments for the three vowels /æ–ɑ–ʌ/, located symmetrically in the F1 × F2 vowel space. While the perceptual space was again two dimensional, an influence of phonetic identity on vowel difference judgments was observed. Implications for determining metrics for subphonemic vowel differences using multidimensional scaling are discussed.
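As a concrete illustration of the metric the study settled on, the sketch below computes the Euclidean distance between two vowels in the Bark-scaled F1 × F2 plane. The Hz-to-Bark conversion uses Traunmüller's (1990) approximation, which is one common choice and not necessarily the formula used in the paper; the formant values are typical textbook figures for /i/ and /ɪ/, not the study's stimuli.

```python
def hz_to_bark(f_hz: float) -> float:
    """One common Hz-to-Bark approximation (Traunmueller 1990)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def vowel_distance(v1: tuple, v2: tuple) -> float:
    """Euclidean distance between two (F1, F2) pairs, measured in barks."""
    d1 = hz_to_bark(v1[0]) - hz_to_bark(v2[0])
    d2 = hz_to_bark(v1[1]) - hz_to_bark(v2[1])
    return (d1 * d1 + d2 * d2) ** 0.5

# Illustrative formant values (roughly /i/ vs. /I/), not taken from the paper:
print(vowel_distance((270.0, 2290.0), (390.0, 1990.0)))
```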


Subject(s)
Phonetics; Sound Spectrography; Speech Perception; Adult; Attention; Female; Humans; Pitch Perception
3.
Ann N Y Acad Sci ; 405: 18-32, 1983.
Article in English | MEDLINE | ID: mdl-6575642

ABSTRACT

Speech is a highly redundant signal. The redundant nature of speech is important for providing reliable communication over air pathways, but a large part of this redundancy is useless for speech communication over digital channels. Speech coding aims at minimizing the information rate needed to reproduce a speech signal with specified fidelity. In this paper, we discuss factors that influence the design of efficient speech coders. The encoding and decoding processes invariably introduce error (noise and distortion) into the speech signal. The inability of the human ear to hear certain kinds of distortion in the speech signal plays a crucial role in producing high-quality speech at low bit rates. The physical difference between the waveform of a given speech signal and that of its coded replica generally does not tell us much about the subjective quality of the coded signal. A signal-to-noise ratio as small as 10 dB can be tolerated in the coded signal, provided the errors are distributed in both the time and frequency domains where they are least audible. Recent work on auditory masking has provided new insights for optimizing the performance of speech coders. This paper reviews that work and discusses new speech coding methods that attempt to maximize the perceptual similarity between the original speech signal and its coded replica. These new methods make it possible to reproduce speech signals at very low bit rates with little or no audible distortion.
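One concrete device from this line of work, used in later perceptually weighted coders, is to shape the coding noise with a filter W(z) = A(z)/A(z/gamma), where A(z) is the LPC inverse filter, so the noise is pushed under the high-energy formant regions where masking hides it. The sketch below shows only the bandwidth-expansion step that builds A(z/gamma); it is a generic illustration of that idea, not the specific coder design reviewed in the paper.

```python
import numpy as np

def bandwidth_expand(a: np.ndarray, gamma: float) -> np.ndarray:
    """Map A(z) with coefficients a_k to A(z/gamma): a_k -> a_k * gamma**k."""
    return a * gamma ** np.arange(len(a))

# Illustrative LPC polynomial coefficients [1, a1, ..., ap]:
a = np.array([1.0, -1.6, 0.9])
num = a                          # numerator   A(z)
den = bandwidth_expand(a, 0.8)   # denominator A(z/0.8), formant peaks broadened
# Filtering the quantization error through num/den (e.g., scipy.signal.lfilter)
# concentrates the audible noise where the speech spectrum can mask it.
```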


Subject(s)
Speech Acoustics; Speech; Vocal Cords/physiology; Cochlea/physiology; Computers; Humans; Models, Neurological; Phonetics; Sound Spectrography; Speech Perception/physiology
4.
J Acoust Soc Am ; 66(5): 1325-32, 1979 Nov.
Article in English | MEDLINE | ID: mdl-500970

ABSTRACT

A recent study [Olive and Spickenagel, J. Acoust. Soc. Am. 59, 993-996 (1976)] showed that area parameters derived from linear prediction analysis can be linearly interpolated between dyad boundaries with very little distortion in the resulting synthesized speech. The success of area-parameter interpolation raises a question: can other acoustic parameters, such as the power spectrum of the speech waveform, be similarly interpolated? The spectrum is of special interest because speech can be synthesized in real time from spectral parameters on a programmable digital filter. To study this question, a speech analysis-synthesis system using spectral parameters (samples of power spectra at different frequencies) was simulated. These parameters were determined from the speech signal at every dyad boundary and interpolated for intermediate values. Dyad boundaries (representing the limits of transition regions between phonemes) were determined manually. Informal listening tests comparing synthetic speech with and without linear interpolation showed slight degradation in the interpolated speech. This degradation is significantly reduced by using an additional point within the dyad boundaries for interpolation.
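A minimal sketch of the interpolation scheme: parameters are stored only at dyad-boundary frames and regenerated linearly in between. The frame indices and spectral-sample values below are invented for illustration; in the paper the parameters are samples of power spectra measured at each dyad boundary.

```python
import numpy as np

boundary_frames = np.array([0, 12, 30])   # dyad boundaries (frame indices)
boundary_params = np.array([              # one row of spectral samples per boundary
    [0.9, 0.4, 0.1],
    [0.5, 0.7, 0.2],
    [0.2, 0.3, 0.8],
])

def interpolate_params(frame: float) -> np.ndarray:
    """Linearly interpolate each spectral sample at an intermediate frame."""
    return np.array([np.interp(frame, boundary_frames, boundary_params[:, j])
                     for j in range(boundary_params.shape[1])])

print(interpolate_params(6))   # halfway between the boundaries at frames 0 and 12
```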


Subject(s)
Speech; Acoustics; Hearing Tests; Humans
5.
J Acoust Soc Am ; 64(5): 1310-8, 1978 Nov.
Article in English | MEDLINE | ID: mdl-744832

ABSTRACT

Speech analysis and synthesis by linear prediction is based on the assumption that the short-time spectral envelope of speech can be represented by a number of poles. An all-pole representation does not provide an accurate description of speech spectra, particularly for nasals and nasalized sounds. This paper presents a method for characterizing speech in terms of the parameters of a pole-zero model. In this method, an impulse response representing the composite filtering action of the glottal wave, the vocal tract, the radiation, and the speech recording system is first constructed from the speech signal. This impulse response is obtained by performing several stages of all-pole LPC analysis. The pole-zero parameters are determined from the impulse response by solving a set of simultaneous linear equations. The method, being noniterative, is very suitable for automatic analysis of speech. The method has been applied to real speech data and the results show that the speech spectra derived from the pole-zero model agree very closely with the actual spectra derived by direct Fourier analysis.
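The "set of simultaneous linear equations" step can be illustrated with the classic Prony-style formulation: for an impulse response h[n] generated by B(z)/A(z) with numerator order q and denominator order p, the tail samples satisfy h[n] = -(a_1 h[n-1] + ... + a_p h[n-p]) for n > q, which is linear in the a_k. The sketch below solves that system in the least-squares sense; it is an assumed reconstruction of the general technique, and the paper's specific construction of h[n] from staged all-pole LPC analyses is not reproduced.

```python
import numpy as np

def fit_pole_zero(h: np.ndarray, p: int, q: int):
    """Noniterative pole-zero fit: estimate A(z) = 1 + sum a_k z^-k and B(z) from h."""
    n = np.arange(max(q + 1, p), len(h))                 # tail samples, n > q
    H = np.column_stack([h[n - 1 - k] for k in range(p)])
    a_tail, *_ = np.linalg.lstsq(H, -h[n], rcond=None)   # solve H @ a = -h[n]
    a = np.concatenate(([1.0], a_tail))
    # Numerator coefficients follow from b[m] = sum_k a_k h[m-k], m = 0..q.
    b = np.array([sum(a[k] * h[m - k] for k in range(min(m, p) + 1))
                  for m in range(q + 1)])
    return a, b
```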


Subject(s)
Acoustics; Models, Theoretical; Speech; Mathematics; Speech/physiology; Speech Perception; Voice Quality
6.
J Acoust Soc Am ; 63(5): 1535-53, 1978 May.
Article in English | MEDLINE | ID: mdl-690333

ABSTRACT

We present numerical methods for studying the relationship between the shape of the vocal tract and its acoustic output. For a stationary vocal tract, the articulatory-acoustic relationship can be represented as a multidimensional function of a multidimensional argument, y = f(x), where x and y are vectors describing the vocal-tract shape and the resulting acoustic output, respectively. Assuming that y can be computed for any x, we develop a procedure for inverting f(x). Inversion by computer sorting consists of computing y for many values of x and sorting the resulting (y, x) pairs into a convenient order according to y; the x for a given y is then obtained by looking up y in the sorted data. An application of this method to determining the parameters of an articulatory model corresponding to a given set of formant frequencies is presented. A method is also described for finding articulatory regions (fibers) that map into a single point in the acoustic space. The local nature of f(x) is determined by linearization in a small neighborhood; larger regions are explored by extending the linear neighborhoods in small steps. This method was applied to the study of compensatory articulation. Sounds produced by various articulations along a fiber were synthesized and compared in informal listening tests. These tests show that, in many cases of interest, a given sound can be produced by many different vocal-tract shapes.
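A toy rendering of inversion by computer sorting: tabulate y = f(x) for many sampled x, then recover an x for a target y by lookup in the table. The forward model below is an invented stand-in (the paper uses an articulatory model producing formant frequencies), and a nearest-neighbour search stands in for the sorted-table lookup.

```python
import numpy as np

def forward_model(x: np.ndarray) -> np.ndarray:
    """Invented stand-in for the articulatory-to-acoustic map y = f(x)."""
    return np.array([500.0 + 300.0 * x[0] - 100.0 * x[1],
                     1500.0 + 400.0 * x[1] + 200.0 * x[0] * x[1]])

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=(10_000, 2))    # sampled "vocal-tract shapes"
ys = np.array([forward_model(x) for x in xs])    # their acoustic outputs

def invert(y_target: np.ndarray) -> np.ndarray:
    """Return the tabulated x whose output lies closest to y_target."""
    return xs[np.argmin(np.sum((ys - y_target) ** 2, axis=1))]

# All tabulated x whose outputs fall near a given y approximate a "fiber":
# distinct vocal-tract shapes producing (nearly) the same sound.
print(invert(np.array([600.0, 1600.0])))
```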


Subject(s)
Speech Production Measurement/methods; Speech/physiology; Acoustics; Computers; Glottis/physiology; Humans; Lip/physiology; Models, Biological; Phonetics