1.
J Voice ; 2023 Nov 10.
Article in English | MEDLINE | ID: mdl-37953088

ABSTRACT

Auditory-perceptual assessment is widely used in clinical and pedagogical practice for speech and singing voice, yet several studies have shown poor intra- and inter-rater reliability in both clinical and singing voice contexts. Recent advances in artificial intelligence and machine learning offer models for automated classification that have demonstrated discriminatory power in both pathological and healthy voice. This study develops and tests an XGBoost decision-tree-based machine learning classifier for automated vocal mode classification in healthy singing voice. Classification models trained on mel-frequency cepstrum coefficients (MFCCs), MFCC-Zero-Time Windowing, glottal features, voice quality features, and α-ratios achieved an average F1-score of 92% in distinguishing metallic from non-metallic singing for male singers and 87% for female singers. The models distinguished vocal modes with average F1-scores of 70% and 69% for male and female samples, respectively. Model performance was compared with human auditory-perceptual assessments of 64 corresponding samples performed by 41 professional singers; on task-matched problems, the model approached but did not exceed the performance of the human assessors. The XGBoost gains observed across the tested features show that the most important attributes for these classification problems were MFCCs and α-ratios between high- and low-frequency energy: models trained on only these features performed not statistically significantly differently from the best tested models. The best automated models in this study do not yet match human auditory-perceptual discrimination, but they improve on previously reported average F1-scores for automated classification in singing voice.
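
A minimal sketch of the kind of pipeline the abstract describes: an XGBoost classifier trained on per-sample MFCC summaries and an α-ratio-style feature. The feature extraction, parameter values, and function names are illustrative assumptions, not the authors' implementation; callers must supply their own labeled WAV files.

```python
import numpy as np
import librosa
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def acoustic_features(path, sr=16000):
    """MFCC means plus a crude alpha-ratio (log high/low band energy)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    power = np.abs(librosa.stft(y)) ** 2
    freqs = librosa.fft_frequencies(sr=sr)
    alpha = power[freqs > 1000].sum() / (power[freqs <= 1000].sum() + 1e-12)
    return np.concatenate([mfcc, [np.log(alpha)]])

def metallic_vs_nonmetallic_f1(wav_paths, labels):
    """Cross-validated macro F1 for a binary metallic/non-metallic task."""
    X = np.stack([acoustic_features(p) for p in wav_paths])
    y = np.asarray(labels)  # 1 = metallic, 0 = non-metallic (placeholder coding)
    clf = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
    return cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean()
```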

2.
Lang Speech ; 66(3): 564-605, 2023 Sep.
Article in English | MEDLINE | ID: mdl-36000386

ABSTRACT

We present an implementation of DIANA, a computational model of spoken word recognition, to model responses collected in the Massive Auditory Lexical Decision (MALD) project. DIANA is an end-to-end model with activation and decision components: it takes the acoustic signal as input, activates internal word representations, and outputs lexicality judgments and estimated response latencies. Simulation 1 presents the process of creating the acoustic models DIANA requires to analyze novel speech input. Simulation 2 investigates DIANA's performance in determining whether the input signal is a word present in the lexicon or a pseudoword. In Simulation 3, we generate estimates of response latency and correlate them with general tendencies in participant responses in the MALD data. We find that DIANA performs fairly well in free word recognition and lexical decision. However, the current approach to estimating response latency yields estimates opposite to those found in the behavioral data. We discuss these findings and offer suggestions as to what a contemporary model of spoken word recognition should be able to do.
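
An illustration of the kind of check Simulation 3 performs: correlating model-estimated response latencies with observed per-item latencies. The numbers below are invented placeholders, not MALD data or DIANA output.

```python
import numpy as np
from scipy.stats import spearmanr

model_rt = np.array([612.0, 540.0, 705.0, 660.0])     # hypothetical model estimates (ms)
observed_rt = np.array([842.0, 790.0, 980.0, 915.0])  # hypothetical mean item RTs (ms)

rho, p = spearmanr(model_rt, observed_rt)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```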


Subject(s)
Speech Perception , Speech , Humans , Reaction Time , Computer Simulation , Speech Perception/physiology , Acoustics
3.
Brain Sci ; 12(5)2022 May 23.
Article in English | MEDLINE | ID: mdl-35625067

ABSTRACT

This article presents DIANA, a new, process-oriented model of human auditory word recognition, which takes the acoustic signal as input and can produce word identifications, lexicality decisions, and reaction times as output. This makes it possible to compare its output with human listeners' behavior in psycholinguistic experiments. DIANA differs from existing models in that it takes more of the available neuro-physiological evidence on speech processing into account. For instance, DIANA accounts for the effect of ambiguity in the acoustic signal on reaction times following the Hick-Hyman law, and it interprets the acoustic signal in terms of spectro-temporal receptive fields, which are attested in the human superior temporal gyrus, rather than in terms of abstract phonological units. The model consists of three components: activation, decision, and execution. The activation and decision components are described in detail, both at the conceptual level (in the running text) and at the computational level (in the Appendices). While the activation component is independent of the listener's task, the functioning of the decision component depends on this task. The article also describes how DIANA could be improved in the future to more closely resemble the behavior of human listeners.
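
A toy illustration, not the DIANA implementation, of the Hick-Hyman-style relation mentioned above: decision time grows with the uncertainty over currently active word candidates. The activation values and the constants a and b are made-up numbers.

```python
import numpy as np

def hick_hyman_rt(activations, a=300.0, b=150.0):
    """RT (ms) = a + b * H, where H is the entropy (in bits) of the
    normalized activations over competing word hypotheses."""
    p = np.asarray(activations, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return a + b * (-np.sum(p * np.log2(p)))

print(hick_hyman_rt([0.9, 0.05, 0.05]))  # little ambiguity -> fast decision
print(hick_hyman_rt([0.4, 0.3, 0.3]))    # more ambiguity  -> slower decision
```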

4.
Front Psychol ; 12: 720017, 2021.
Article in English | MEDLINE | ID: mdl-34539520

ABSTRACT

A growing body of work in psycholinguistics suggests that morphological relations between word forms affect the processing of complex words. Previous studies have usually focused on a particular type of paradigmatic relation, for example the relation between paradigm members, or the relation between alternative forms filling a particular paradigm cell. However, potential interactions between different types of paradigmatic relations have remained relatively unexplored. This paper presents two corpus studies of variable plurals in Dutch to test hypotheses about potentially interacting paradigmatic effects. The first study shows that generalization across noun paradigms predicts the distribution of plural variants, and that this effect is diminished for paradigms in which the plural variants are more likely to have a strong representation in the mental lexicon. The second study demonstrates that the pronunciation of a target plural variant is affected by coactivation of the alternative variant, resulting in shorter segmental durations. This effect is dependent on the representational strength of the alternative plural variant. In sum, by exploring interactions between different types of paradigmatic relations, this paper provides evidence that storage of morphologically complex words may affect the role of generalization and coactivation during production.
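
A hedged sketch of the style of analysis in the first corpus study: predicting which plural variant a noun takes from a generalization score computed over similar paradigms, moderated by the lexical strength of the stored plural. The data, the predictor names, and the use of sklearn are all illustrative assumptions, not the paper's dataset or statistical model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [generalization score over similar paradigms, log plural frequency]
X = np.array([[0.8, 2.5], [0.2, 5.8], [0.7, 3.2], [0.6, 4.0],
              [0.3, 6.2], [0.1, 4.5], [0.9, 2.1], [0.4, 5.0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # 1 = -en plural observed, 0 = -s plural

clf = LogisticRegression().fit(X, y)
print("coefficients (generalization, frequency):", clf.coef_)
```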

5.
Behav Res Methods ; 53(2): 744-756, 2021 04.
Article in English | MEDLINE | ID: mdl-32869139

ABSTRACT

Despite advances in automatic speech recognition (ASR), human input is still essential for producing research-grade segmentations of speech data. Conventional approaches to manual segmentation are very labor-intensive. We introduce POnSS, a browser-based system specialized for segmenting the onsets and offsets of words, which combines aspects of ASR with limited human input. In developing POnSS, we identified several sub-tasks of segmentation and implemented each as a separate interface for annotators, to streamline their task as much as possible. We evaluated segmentations made with POnSS against a baseline of segmentations of the same data made conventionally in Praat. POnSS achieved reliability comparable to segmentation in Praat but required 23% less annotator time. Because of this greater efficiency without sacrificing reliability, POnSS represents a distinct methodological advance for the segmentation of speech data.
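
A rough sketch of the evaluation logic described above: compare word onset/offset boundaries produced by a new tool against a Praat baseline on the same items. The times below are invented example values in seconds, not data from the study.

```python
import numpy as np

ponss = np.array([[0.120, 0.480], [0.510, 0.955], [1.010, 1.430]])  # [onset, offset]
praat = np.array([[0.115, 0.470], [0.520, 0.950], [1.000, 1.445]])

abs_diff_ms = np.abs(ponss - praat) * 1000
print("mean |boundary difference|: %.1f ms" % abs_diff_ms.mean())
print("share within 20 ms: %.0f%%" % (100 * (abs_diff_ms <= 20).mean()))
```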


Subject(s)
Image Processing, Computer-Assisted , Speech , Humans , Reproducibility of Results
6.
Psychol Rev ; 127(2): 281-304, 2020 03.
Article in English | MEDLINE | ID: mdl-31886696

ABSTRACT

That speakers can vary their speaking rate is evident, but how they accomplish this has hardly been studied. Consider this analogy: when walking, speed can be increased continuously, within limits, but to speed up further, humans must run. Are there multiple qualitatively distinct speech "gaits" that resemble walking and running? Or is control achieved by continuous modulation of a single gait? This study investigates these possibilities through simulations with a new connectionist computational model of the cognitive process of speech production, EPONA, which borrows from Dell, Burger, and Svec's (1997) model. The model has parameters that can be adjusted to fit the temporal characteristics of speech at different speaking rates. We trained the model on a corpus of disyllabic Dutch words produced at different speaking rates. During training, different clusters of parameter values (regimes) were identified for different speaking rates. In a one-gait system, the regimes used to achieve fast and slow speech are qualitatively similar but quantitatively different. In a multiple-gait system, there is no linear relationship between the parameter settings associated with each gait, so moving from slow to fast speech requires an abrupt shift in parameter values. After training, the model achieved good fits at all three speaking rates. The parameter settings associated with each speaking rate were not linearly related, suggesting the presence of cognitive gaits. Thus, we provide the first computationally explicit account of the ability to modulate the speech production system to achieve different speaking styles.
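
A sketch of the "gaits" diagnostic hinted at above: if a single gait is modulated continuously, parameter settings for slow, normal, and fast speech should be approximately linearly related, whereas a large extrapolation error suggests distinct regimes. The parameter vectors below are invented placeholders, not fitted EPONA parameters.

```python
import numpy as np

params = {
    "slow":   np.array([0.42, 0.91, 0.15, 1.30]),
    "normal": np.array([0.55, 0.80, 0.22, 1.10]),
    "fast":   np.array([0.30, 0.98, 0.60, 0.40]),
}

# Predict the fast-rate settings by linear extrapolation from slow -> normal.
predicted_fast = params["normal"] + (params["normal"] - params["slow"])
error = np.linalg.norm(predicted_fast - params["fast"])
print("linear-extrapolation error:", round(float(error), 3))
```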


Subject(s)
Executive Function , Models, Theoretical , Neural Networks, Computer , Psycholinguistics , Speech , Humans
7.
J Acoust Soc Am ; 145(2): EL161, 2019 02.
Article in English | MEDLINE | ID: mdl-30823812

ABSTRACT

Many psycholinguistic models of speech sequence planning make claims about the onset and offset times of planning units, such as words, syllables, and phonemes. These predictions typically go untested, however, because psycholinguists have assumed that the temporal dynamics of the speech signal are a poor index of the temporal dynamics of the underlying speech planning process. This article argues that the problem is tractable, and it presents and validates two simple metrics that derive planning-unit onset and offset times from the acoustic signal and articulatographic data.
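
A simplified, assumed analogue of deriving unit onset and offset times from the acoustic signal alone (the article's actual metrics also use articulatographic data): threshold a short-time energy contour and read off the first and last frames above the gate. The synthetic signal and the 10% gate are illustrative choices.

```python
import numpy as np
import librosa

# Synthetic stand-in signal: 0.25 s silence, 0.5 s of a 220 Hz tone, 0.25 s silence.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(int(0.5 * sr)) / sr)
y = np.concatenate([np.zeros(int(0.25 * sr)), tone, np.zeros(int(0.25 * sr))])

rms = librosa.feature.rms(y=y, frame_length=400, hop_length=160)[0]
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=160)
active = rms > 0.1 * rms.max()                            # crude energy gate
onset = times[active.argmax()]                            # first frame above gate
offset = times[len(active) - 1 - active[::-1].argmax()]   # last frame above gate
print(f"onset = {onset:.3f} s, offset = {offset:.3f} s")
```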

8.
PLoS One ; 10(10): e0140732, 2015.
Article in English | MEDLINE | ID: mdl-26489021

ABSTRACT

In this paper we introduce MCA-NMF, a computational model of the acquisition of multimodal concepts by an agent grounded in its environment. More precisely, our model finds patterns in multimodal sensor input that characterize associations across modalities (speech utterances, images, and motion). We propose this computational model as an answer to the question of how a certain class of concepts can be learnt, and the model also provides a way of defining that class of plausibly learnable concepts. We explain why the multimodal nature of perception is essential both to reduce the ambiguity of learnt concepts and to communicate about them through speech. We then present a set of experiments that demonstrate the learning of such concepts from real non-symbolic data consisting of speech sounds, images, and motions. Finally, we consider structure in perceptual signals and demonstrate that detailed knowledge of this structure, which we call compositional understanding, can emerge from, rather than being a prerequisite of, global understanding. An open-source implementation of the MCA-NMF learner, as well as scripts and the associated experimental data needed to reproduce the experiments, is publicly available.
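
A minimal sketch of the non-negative matrix factorization idea behind the model: stack feature vectors from several modalities per observation and factor the resulting matrix into non-negative components shared across modalities. sklearn's NMF stands in for the paper's learner, and the data are random placeholders rather than the released experimental data.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
speech = rng.random((50, 30))   # e.g., histogram-of-acoustic-events features
image = rng.random((50, 20))    # e.g., visual descriptors
motion = rng.random((50, 10))   # e.g., motion descriptors
V = np.hstack([speech, image, motion])   # one multimodal row per observation

model = NMF(n_components=5, init="nndsvda", max_iter=500)
W = model.fit_transform(V)   # per-observation activation of learned "concepts"
H = model.components_        # cross-modal patterns defining each concept
print(W.shape, H.shape)
```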


Subject(s)
Association Learning/physiology , Cognition/physiology , Computer Simulation , Algorithms , Humans , Multimodal Imaging , Pattern Recognition, Visual/physiology , Speech/physiology
9.
PLoS One ; 10(7): e0132245, 2015.
Article in English | MEDLINE | ID: mdl-26218504

ABSTRACT

During language acquisition, infants frequently encounter ambient noise. We present a computational model to address whether specific acoustic processing abilities are necessary to detect known words in moderate noise, an ability attested experimentally in infants. The model implements a general-purpose speech encoding and word detection procedure. Importantly, the model contains no dedicated processes for removing or cancelling out ambient noise, yet it can replicate the patterns of results obtained in several infant experiments. In addition to noise, we also addressed the role of previous experience with particular target words: does the frequency of a word matter, and does it matter whether that word has been spoken by one or by multiple speakers? The simulation results show that both factors affect noise robustness. We also investigated how robust word detection is to changes in speaker identity by comparing words spoken by known versus unknown speakers during the simulated test. This factor interacted with both noise level and past experience: increased exposure helps only when a familiar speaker provides the test material, while added variability proved helpful only when encountering an unknown speaker. Finally, we addressed whether infants need to recognise specific words, or whether a more parsimonious explanation of infant behaviour, which we refer to as matching, is sufficient. Recognition involves a focus of attention on a specific target word, while matching only requires finding the best correspondence between the acoustic input and a known pattern in memory. Attending to a specific target word proves to be more noise-robust, but a general word-matching procedure can be sufficient to simulate experimental data from young infants. A change from acoustic matching to targeted recognition thus provides an explanation of the improvements observed in infants around their first birthday. In summary, we present a computational model incorporating only the processes infants might employ when hearing words in noise. Our findings show that a parsimonious interpretation of behaviour is sufficient, and we offer a formal account of emerging abilities.
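
A sketch of the "matching" account described above: score an incoming acoustic stretch against stored word patterns and pick the best correspondence. DTW over MFCC frames is an assumed stand-in for the model's actual encoding and matching procedure, and the audio here is random placeholder data.

```python
import numpy as np
import librosa

def mfcc(y, sr=16000):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

def match_cost(test, template, sr=16000):
    """Lower cost = better match (DTW over MFCC frames)."""
    D, _ = librosa.sequence.dtw(X=mfcc(test, sr), Y=mfcc(template, sr),
                                metric="euclidean")
    return D[-1, -1] / D.shape[0]

sr = 16000
rng = np.random.default_rng(1)
templates = {"ball": rng.standard_normal(sr // 2),   # placeholder stored patterns
             "dog": rng.standard_normal(sr // 2)}
y_test = templates["ball"] + 0.3 * rng.standard_normal(sr // 2)  # noisy token
best = min(templates, key=lambda w: match_cost(y_test, templates[w]))
print("best match:", best)
```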


Subject(s)
Child Language , Computer Simulation , Models, Biological , Speech Perception/physiology , Female , Humans , Infant , Male
10.
Front Psychol ; 4: 676, 2013.
Article in English | MEDLINE | ID: mdl-24109461

ABSTRACT

In this paper we use a computational model to investigate four assumptions that are tacitly present in interpreting the results of studies on infants' speech processing abilities using the Headturn Preference Procedure (HPP): (1) behavioral differences originate in different processing; (2) processing involves some form of recognition; (3) words are segmented from connected speech; and (4) differences between infants should not affect overall results. In addition, we investigate the impact of two potentially important aspects of the design and execution of the experiments: (a) the specific voices used in the two parts of HPP experiments (familiarization and test) and (b) the experimenter's criterion for what counts as a sufficient headturn angle. The model is designed to maximize cognitive plausibility: it takes real speech as input, and it contains a module that converts the output of internal speech processing and recognition into headturns that can yield real-time listening preference measurements. Internal processing is based on distributed episodic representations combined with a matching procedure built on the assumption that complex episodes can be decomposed as positive weighted sums of simpler constituents. Model simulations show that the first two assumptions hold under two different definitions of recognition. However, explicit segmentation is not necessary to simulate the behaviors observed in infant studies. Differences in attention span between infants can affect the outcomes of an experiment, and the same holds for the experimenter's decision criterion. The speakers used in experiments affect outcomes in complex ways that require further investigation. The paper ends with recommendations for future studies using the HPP.
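
A tiny illustration of the matching assumption above: an observed episode is approximated as a positive weighted sum of stored constituent episodes, here via non-negative least squares. The vectors are invented toy values, not the model's episodic representations.

```python
import numpy as np
from scipy.optimize import nnls

# Columns are stored constituent episodes; `episode` is an incoming observation.
constituents = np.array([[1.0, 0.0, 0.2],
                         [0.0, 1.0, 0.2],
                         [0.5, 0.5, 0.2]])
episode = np.array([0.90, 0.40, 0.65])

weights, residual = nnls(constituents, episode)
print("non-negative weights:", np.round(weights, 3), " residual:", round(float(residual), 3))
```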

11.
Logoped Phoniatr Vocol ; 36(4): 168-74, 2011 Dec.
Article in English | MEDLINE | ID: mdl-21864051

ABSTRACT

OBJECTIVE: To investigate the applicability of neural network feature analysis of nasalance in speech for assessing hypernasality in the speech of patients treated for oral or oropharyngeal cancer. PATIENTS AND METHODS: Speech recordings of 51 patients and of 18 control speakers were evaluated with regard to hypernasality, articulation, intelligibility, and patient-reported speech outcome. Feature analysis of nasalance was performed on /a/, /i/, and /u/ and on the entire stretch of speech. RESULTS: Nasalance distinguished significantly between patients and controls. Nasalance in /a/ and /i/ best predicted intelligibility, nasalance in /a/ best predicted articulation, and nasalance in /i/ and /u/ best predicted hypernasality. CONCLUSION: Feature analysis of nasalance in oral or oropharyngeal cancer patients is feasible; prediction of the subjective parameters ranges from moderate to poor.
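
A hedged sketch of the prediction step only: mapping nasalance features extracted for /a/, /i/, and /u/ to a perceptual rating with a small neural-network regressor. The data, the rating scale, and the choice of sklearn's MLPRegressor are illustrative assumptions, not the study's feature analysis.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((60, 3))                                # nasalance in /a/, /i/, /u/
ratings = 1 + 6 * X[:, 1] + rng.normal(0, 0.5, 60)     # simulated 7-point ratings

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
print("cross-validated R^2:",
      cross_val_score(model, X, ratings, cv=5, scoring="r2").mean())
```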


Subject(s)
Mouth Neoplasms/therapy , Neural Networks, Computer , Oropharyngeal Neoplasms/therapy , Phonation , Signal Processing, Computer-Assisted , Speech Production Measurement , Voice Quality , Adult , Aged , Case-Control Studies , Feasibility Studies , Female , Humans , Male , Middle Aged , Mouth Neoplasms/physiopathology , Netherlands , Oropharyngeal Neoplasms/physiopathology , Phonetics , Sound Spectrography , Speech Intelligibility , Treatment Outcome , Young Adult
12.
J Acoust Soc Am ; 126(6): 3227-35, 2009 Dec.
Article in English | MEDLINE | ID: mdl-20000936

ABSTRACT

Articulatory and acoustic reduction can manifest itself in both the temporal and the spectral domain. This study introduces a measure of spectral reduction based on the speech decoding techniques commonly used in automatic speech recognizers. Using data for four frequent Dutch affixes from a large corpus of spontaneous face-to-face conversations, it builds on an earlier study of the effects of lexical frequency on durational reduction in spoken Dutch [Pluymaekers, M. et al. (2005). J. Acoust. Soc. Am. 118, 2561-2569] and compares the proposed measure of spectral reduction with duration as a measure of reduction. The results suggest that the spectral reduction scores capture aspects of reduction other than duration. While duration can, albeit to a moderate degree, be predicted by a number of linguistically motivated variables (such as word frequency, segmental context, and speech rate), the spectral reduction scores cannot. This suggests that the spectral reduction scores capture information that is not directly accounted for by the linguistically motivated variables. The results also show that the spectral reduction scores are able to predict a substantial amount of the variation in duration that the linguistically motivated variables do not account for.
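
A sketch of the comparison described in the final sentence: regress duration on linguistically motivated predictors, then ask whether a spectral reduction score explains the leftover variation. All data below are simulated placeholders, and the simple ordinary-least-squares setup is an assumption rather than the paper's statistical model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 200
log_word_freq = rng.normal(8, 2, n)
speech_rate = rng.normal(5, 1, n)
spectral_reduction = rng.random(n)
duration = (0.40 - 0.01 * log_word_freq - 0.02 * speech_rate
            - 0.05 * spectral_reduction + rng.normal(0, 0.01, n))

# Step 1: regress duration on the linguistic predictors and keep the residuals.
ling = np.column_stack([log_word_freq, speech_rate])
residual = duration - LinearRegression().fit(ling, duration).predict(ling)

# Step 2: does the spectral reduction score predict the residual variation?
sr_feature = spectral_reduction.reshape(-1, 1)
extra = LinearRegression().fit(sr_feature, residual)
print("R^2 of spectral reduction on residual duration:",
      round(extra.score(sr_feature, residual), 3))
```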


Subject(s)
Acoustics , Speech , Algorithms , Databases as Topic , Humans , Interpersonal Relations , Language , Linguistics , Models, Theoretical , Phonetics , Speech Acoustics , Time Factors
13.
Folia Phoniatr Logop ; 61(3): 180-7, 2009.
Article in English | MEDLINE | ID: mdl-19571552

ABSTRACT

OBJECTIVE: Speech impairment often occurs in patients after treatment for head and neck cancer. New treatment modalities, such as surgical reconstruction and (chemo)radiation techniques, aim to spare anatomical structures involved in speech and swallowing. In randomized trials investigating the efficacy of various treatment modalities or of speech rehabilitation, objective speech analysis techniques may help improve speech outcome assessment. The goal of the present study is to investigate the role of objective acoustic-phonetic analyses in a multidimensional speech assessment protocol. PATIENTS AND METHODS: Speech recordings of 51 patients (6 months after reconstructive surgery and postoperative radiotherapy for oral or oropharyngeal cancer) and of 18 control speakers were subjectively evaluated with regard to intelligibility, nasal resonance, articulation, and patient-reported speech outcome (the speech subscale of the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire-Head and Neck 35 module). Acoustic-phonetic analyses were performed to calculate formant values of the vowels /a, i, u/, vowel space, air pressure release of /k/, and spectral slope of /x/. RESULTS: Intelligibility, articulation, and nasal resonance were best predicted by vowel space and /k/. Within patients, /k/ and /x/ differentiated tumor site and stage. Various objective speech parameters were related to speech problems as reported by patients. CONCLUSION: Objective acoustic-phonetic analysis of patients' speech is feasible and contributes to the further development of a speech assessment protocol.
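
One concrete piece of the protocol above made explicit: the vowel space spanned by /a/, /i/, /u/ can be quantified as the area of the triangle formed by their (F1, F2) values, via the shoelace formula. The formant values below are illustrative, not patient data.

```python
def vowel_space_area(a, i, u):
    """Each argument is an (F1, F2) pair in Hz; returns the triangle area in Hz^2."""
    (x1, y1), (x2, y2), (x3, y3) = a, i, u
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2

# Hypothetical formant values for one speaker.
print(vowel_space_area(a=(750, 1300), i=(300, 2300), u=(350, 800)))
```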


Subject(s)
Mouth Neoplasms/therapy , Oropharyngeal Neoplasms/therapy , Phonetics , Speech Acoustics , Speech Production Measurement/methods , Speech , Adult , Aged , Air Pressure , Female , Humans , Male , Middle Aged , Mouth Neoplasms/radiotherapy , Mouth Neoplasms/surgery , Oropharyngeal Neoplasms/radiotherapy , Oropharyngeal Neoplasms/surgery , Sex Characteristics , Speech Articulation Tests , Speech Intelligibility , Surveys and Questionnaires , Treatment Outcome , Voice Quality , Young Adult