Pesquisa | Portal Regional da BVS (teste)

1.

A perceptual similarity space for speech based on self-supervised speech representations.

Chernyak, Bronya R; Bradlow, Ann R; Keshet, Joseph; Goldrick, Matthew.

J Acoust Soc Am ; 155(6): 3915-3929, 2024 Jun 01.

Artigo em Inglês | MEDLINE | ID: mdl-38904539

RESUMO

Speech recognition by both humans and machines frequently fails in non-optimal yet common situations. For example, word recognition error rates for second-language (L2) speech can be high, especially under conditions involving background noise. At the same time, both human and machine speech recognition sometimes shows remarkable robustness against signal- and noise-related degradation. Which acoustic features of speech explain this substantial variation in intelligibility? Current approaches align speech to text to extract a small set of pre-defined spectro-temporal properties from specific sounds in particular words. However, variation in these properties leaves much cross-talker variation in intelligibility unexplained. We examine an alternative approach utilizing a perceptual similarity space acquired using self-supervised learning. This approach encodes distinctions between speech samples without requiring pre-defined acoustic features or speech-to-text alignment. We show that L2 English speech samples are less tightly clustered in the space than L1 samples reflecting variability in English proficiency among L2 talkers. Critically, distances in this similarity space are perceptually meaningful: L1 English listeners have lower recognition accuracy for L2 speakers whose speech is more distant in the space from L1 speech. These results indicate that perceptual similarity may form the basis for an entirely new speech and language analysis approach.

Assuntos

Acústica da Fala , Inteligibilidade da Fala , Percepção da Fala , Humanos , Masculino , Feminino , Adulto , Adulto Jovem , Multilinguismo , Reconhecimento Psicológico , Ruído

2.

Automatic recognition of second language speech-in-noise.

Kim, Seung-Eun; Chernyak, Bronya R; Seleznova, Olga; Keshet, Joseph; Goldrick, Matthew; Bradlow, Ann R.

JASA Express Lett ; 4(2)2024 Feb 01.

Artigo em Inglês | MEDLINE | ID: mdl-38350077

RESUMO

Measuring how well human listeners recognize speech under varying environmental conditions (speech intelligibility) is a challenge for theoretical, technological, and clinical approaches to speech communication. The current gold standard-human transcription-is time- and resource-intensive. Recent advances in automatic speech recognition (ASR) systems raise the possibility of automating intelligibility measurement. This study tested 4 state-of-the-art ASR systems with second language speech-in-noise and found that one, whisper, performed at or above human listener accuracy. However, the content of whisper's responses diverged substantially from human responses, especially at lower signal-to-noise ratios, suggesting both opportunities and limitations for ASR--based speech intelligibility modeling.

Assuntos

Percepção da Fala , Humanos , Percepção da Fala/fisiologia , Ruído/efeitos adversos , Inteligibilidade da Fala/fisiologia , Interface para o Reconhecimento da Fala , Reconhecimento Psicológico

3.

Speech characteristics yield important clues about motor function: Speech variability in individuals at clinical high-risk for psychosis.

Hitczenko, Kasia; Segal, Yael; Keshet, Joseph; Goldrick, Matthew; Mittal, Vijay A.

Schizophrenia (Heidelb) ; 9(1): 60, 2023 Sep 16.

Artigo em Inglês | MEDLINE | ID: mdl-37717025

RESUMO

BACKGROUND AND HYPOTHESIS: Motor abnormalities are predictive of psychosis onset in individuals at clinical high risk (CHR) for psychosis and are tied to its progression. We hypothesize that these motor abnormalities also disrupt their speech production (a highly complex motor behavior) and predict CHR individuals will produce more variable speech than healthy controls, and that this variability will relate to symptom severity, motor measures, and psychosis-risk calculator risk scores. STUDY DESIGN: We measure variability in speech production (variability in consonants, vowels, speech rate, and pausing/timing) in N = 58 CHR participants and N = 67 healthy controls. Three different tasks are used to elicit speech: diadochokinetic speech (rapidly-repeated syllables e.g., papapa, pataka), read speech, and spontaneously-generated speech. STUDY RESULTS: Individuals in the CHR group produced more variable consonants and exhibited greater speech rate variability than healthy controls in two of the three speech tasks (diadochokinetic and read speech). While there were no significant correlations between speech measures and remotely-obtained motor measures, symptom severity, or conversion risk scores, these comparisons may be under-powered (in part due to challenges of remote data collection during the COVID-19 pandemic). CONCLUSION: This study provides a thorough and theory-driven first look at how speech production is affected in this at-risk population and speaks to the promise and challenges facing this approach moving forward.

4.

Using automated acoustic analysis to explore the link between planning and articulation in second language speech production.

Goldrick, Matthew; Shrem, Yosi; Kilbourn-Ceron, Oriana; Baus, Cristina; Keshet, Joseph.

Lang Cogn Neurosci ; 36(7): 824-839, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34485588

RESUMO

Speakers learning a second language show systematic differences from native speakers in the retrieval, planning, and articulation of speech. A key challenge in examining the interrelationship between these differences at various stages of production is the need for manual annotation of fine-grained properties of speech. We introduce a new method for automatically analyzing voice onset time (VOT), a key phonetic feature indexing differences in sound systems cross-linguistically. In contrast to previous approaches, our method allows reliable measurement of prevoicing, a dimension of VOT variation used by many languages. Analysis of VOTs, word durations, and reaction times from German-speaking learners of Spanish (Baus et al., 2013) suggest that while there are links between the factors impacting planning and articulation, these two processes also exhibit some degree of independence. We discuss the implications of these findings for theories of speech production and future research in bilingual language processing.

5.

Intrapersonal and interpersonal vocal affect dynamics during psychotherapy.

Paz, Adar; Rafaeli, Eshkol; Bar-Kalifa, Eran; Gilboa-Schectman, Eva; Gannot, Sharon; Laufer-Goldshtein, Bracha; Narayanan, Shrikanth; Keshet, Joseph; Atzil-Slonim, Dana.

J Consult Clin Psychol ; 89(3): 227-239, 2021 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-33829810

RESUMO

OBJECTIVE: The present study implements an automatic method of assessing arousal in vocal data as well as dynamic system models to explore intrapersonal and interpersonal affect dynamics within psychotherapy and to determine whether these dynamics are associated with treatment outcomes. METHOD: The data of 21,133 mean vocal arousal observations were extracted from 279 therapy sessions in a sample of 30 clients treated by 24 therapists. Before and after each session, clients self-reported their well-being level, using the Outcome Rating Scale. RESULTS: Both clients' and therapists' vocal arousal showed intrapersonal dampening. Specifically, although both therapists and clients departed from their baseline, their vocal arousal levels were "pulled" back to these baselines. In addition, both clients and therapists exhibited interpersonal dampening. Specifically, both the clients' and the therapists' levels of arousal were "pulled" toward the other party's arousal level, and clients were "pulled" by their therapists' vocal arousal toward their own baseline. These dynamics exhibited a linear change over the course of treatment: whereas interpersonal dampening decreased over time, there was an increase in intrapersonal dampening over time. In addition, higher levels of interpersonal dampening were associated with better session outcomes. CONCLUSIONS: These findings demonstrate the advantages of using automatic vocal measures to capture nuanced intrapersonal and interpersonal affect dynamics in psychotherapy and demonstrate how these dynamics are associated with treatment gains. (PsycInfo Database Record (c) 2021 APA, all rights reserved).

Assuntos

Afeto/fisiologia , Transtornos Mentais/terapia , Relações Profissional-Paciente , Psicoterapia/métodos , Comportamento Verbal/fisiologia , Adulto , Nível de Alerta , Bases de Dados Factuais , Feminino , Humanos , Masculino , Autorrelato , Resultado do Tratamento

6.

Formant estimation and tracking: A deep learning approach.

Dissen, Yehoshua; Goldberger, Jacob; Keshet, Joseph.

J Acoust Soc Am ; 145(2): 642, 2019 02.

Artigo em Inglês | MEDLINE | ID: mdl-30823790

RESUMO

Formant frequency estimation and tracking are among the most fundamental problems in speech processing. In the estimation task, the input is a stationary speech segment such as the middle part of a vowel, and the goal is to estimate the formant frequencies, whereas in the task of tracking the input is a series of speech frames, and the goal is to track the trajectory of the formant frequencies throughout the signal. The use of supervised machine learning techniques trained on an annotated corpus of read-speech for these tasks is proposed. Two deep network architectures were evaluated for estimation: feed-forward multilayer-perceptrons and convolutional neural-networks and, correspondingly, two architectures for tracking: recurrent and convolutional recurrent networks. The inputs to the former are composed of linear predictive coding-based cepstral coefficients with a range of model orders and pitch-synchronous cepstral coefficients, where the inputs to the latter are raw spectrograms. The performance of the methods compares favorably with alternative methods for formant estimation and tracking. A network architecture is further proposed, which allows model adaptation to different formant frequency ranges that were not seen at training time. The adapted networks were evaluated on three datasets, and their performance was further improved.

7.

Predicting glottal closure insufficiency using fundamental frequency contour analysis.

Cohen, Jacob T; Cohen, Alma; Benyamini, Limor; Adi, Yossi; Keshet, Joseph.

Head Neck ; 41(7): 2324-2331, 2019 07.

Artigo em Inglês | MEDLINE | ID: mdl-30763459

RESUMO

BACKGROUND: Voice analysis has a limited role in a day-to-day voice clinic. We developed objective measurements of vocal folds (VF) glottal closure insufficiency (GCI) during phonation. METHODS: We examined 18 subjects with no history of voice impairment and 20 patients with unilateral VF paralysis before and after injection medialization laryngoplasty. Voice analysis was extracted. We measured settling time, slope, and area under the fundamental frequency curve from the phonation onset to its settling-time. RESULTS: The measured parameters, settling time, slope, and area under the curve were in correlation with the traditional acoustic voice assessments and clinical findings before treatment and after injection medialization laryngoplasty. CONCLUSION: We found that the fundamental frequency curve has several typical contours which correspond to different glottal closure conditions. We proposed a new set of parameters that captures the contour type, and showed that they could be used to quantitatively assess individuals with GCI.

Assuntos

Laringoplastia , Fonação , Software , Acústica da Fala , Paralisia das Pregas Vocais/terapia , Qualidade da Voz , Adulto , Idoso , Idoso de 80 Anos ou mais , Durapatita , Feminino , Géis , Humanos , Masculino , Pessoa de Meia-Idade , Estroboscopia

8.

The influence of lexical selection disruptions on articulation.

Goldrick, Matthew; McClain, Rhonda; Cibelli, Emily; Adi, Yossi; Gustafson, Erin; Moers, Cornelia; Keshet, Joseph.

J Exp Psychol Learn Mem Cogn ; 45(6): 1107-1141, 2019 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-30024252

RESUMO

Interactive models of language production predict that it should be possible to observe long-distance interactions; effects that arise at one level of processing influence multiple subsequent stages of representation and processing. We examine the hypothesis that disruptions arising in nonform-based levels of planning-specifically, lexical selection-should modulate articulatory processing. A novel automatic phonetic analysis method was used to examine productions in a paradigm yielding both general disruptions to formulation processes and, more specifically, overt errors during lexical selection. This analysis method allowed us to examine articulatory disruptions at multiple levels of analysis, from whole words to individual segments. Baseline performance by young adults was contrasted with young speakers' performance under time pressure (which previous work has argued increases interaction between planning and articulation) and performance by older adults (who may have difficulties inhibiting nontarget representations, leading to heightened interactive effects). The results revealed the presence of interactive effects. Our new analysis techniques revealed these effects were strongest in initial portions of responses, suggesting that speech is initiated as soon as the first segment has been planned. Interactive effects did not increase under response pressure, suggesting interaction between planning and articulation is relatively fixed. Unexpectedly, lexical selection disruptions appeared to yield some degree of facilitation in articulatory processing (possibly reflecting semantic facilitation of target retrieval) and older adults showed weaker, not stronger interactive effects (possibly reflecting weakened connections between lexical and form-level representations). (PsycINFO Database Record (c) 2019 APA, all rights reserved).

Assuntos

Fonética , Psicolinguística , Fala , Adolescente , Idoso , Envelhecimento/psicologia , Associação , Feminino , Humanos , Inibição Psicológica , Masculino , Pessoa de Meia-Idade , Redes Neurais de Computação , Reconhecimento Visual de Modelos , Leitura , Adulto Jovem

9.

Automatic speech recognition: A primer for speech-language pathology researchers.

Keshet, Joseph.

Int J Speech Lang Pathol ; 20(6): 599-609, 2018 11.

Artigo em Inglês | MEDLINE | ID: mdl-31274357

RESUMO

Automatic speech recognition (ASR) is increasingly becoming an integral component of our daily lives. This trend is in large part due to recent advances in machine learning, and specifically in deep learning, that have led to accurate ASR across numerous tasks. This has led to renewed interest in providing technological support to populations whose speech patterns are atypical, including identifying the presence of a specific pathology and its severity, comparing speech characteristics before and after a surgery and enhancing the quality of life of individuals with speech pathologies. The purpose of this primer is to bring readers with relatively little technical background up to speed on fundamentals and recent advances in ASR. It presents a detailed account of the anatomy of modern ASR, with examples of how it has been used in speech-language pathology research.

Assuntos

Aprendizado Profundo , Interface para o Reconhecimento da Fala , Patologia da Fala e Linguagem/métodos , Fala , Aprendizado Profundo/tendências , Humanos , Interface para o Reconhecimento da Fala/tendências , Patologia da Fala e Linguagem/tendências

10.

Assessing automatic VOT annotation using unimpaired and impaired speech.

Buz, Esteban; Buchwald, Adam; Fuchs, Tzeviya; Keshet, Joseph.

Int J Speech Lang Pathol ; 20(6): 624-634, 2018 11.

Artigo em Inglês | MEDLINE | ID: mdl-31274358

RESUMO

Investigating speech processes often involves analysing data gathered by phonetically annotating speech recordings. Yet, the manual annotation of speech can often be resource intensive-requiring substantial time and labour to complete. Recent advances in automatic annotation methods offer a way to reduce these annotation costs by replacing manual annotation. For researchers and clinicians, the viability of automatic methods depends whether one can draw similar conclusions about speech processes from automatically annotated speech as one would from manually annotated speech. Here, we evaluate how well one automatic annotation tool, AutoVOT, can approximate manual annotation. We do so by comparing analyses of automatically and manually annotated speech in two studies. We find that, with some caveats, we are able to draw the same conclusions about speech processes under both annotation methods. The findings suggest that automatic methods may be a viable way to reduce phonetic annotation costs in the right circumstances. We end with some guidelines on if and how well AutoVOT may be able to replace manual annotation in other data sets.

Assuntos

Interface para o Reconhecimento da Fala , Fala , Humanos

11.

SEQUENCE SEGMENTATION USING JOINT RNN AND STRUCTURED PREDICTION MODELS.

Adi, Yossi; Keshet, Joseph; Cibelli, Emily; Goldrick, Matthew.

Proc IEEE Int Conf Acoust Speech Signal Process ; 2017: 2422-2426, 2017 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-29033692

RESUMO

We describe and analyze a simple and effective algorithm for sequence segmentation applied to speech processing tasks. We propose a neural architecture that is composed of two modules trained jointly: a recurrent neural network (RNN) module and a structured prediction model. The RNN outputs are considered as feature functions to the structured model. The overall model is trained with a structured loss function which can be designed to the given segmentation task. We demonstrate the effectiveness of our method by applying it to two simple tasks commonly used in phonetic studies: word segmentation and voice onset time segmentation. Results suggest the proposed model is superior to previous methods, obtaining state-of-the-art results on the tested datasets.

12.

Optical remote sensor for peanut kernel abortion classification.

Ozana, Nisan; Buchsbaum, Stav; Bishitz, Yael; Beiderman, Yevgeny; Schmilovitch, Zeev; Schwarz, Ariel; Shemer, Amir; Keshet, Joseph; Zalevsky, Zeev.

Appl Opt ; 55(15): 4005-10, 2016 May 20.

Artigo em Inglês | MEDLINE | ID: mdl-27411126

RESUMO

In this paper, we propose a simple, inexpensive optical device for remote measurement of various agricultural parameters. The sensor is based on temporal tracking of backreflected secondary speckle patterns generated when illuminating a plant with a laser and while applying periodic acoustic-based pressure stimulation. By analyzing different parameters using a support-vector-machine-based algorithm, peanut kernel abortion can be detected remotely. This paper presents experimental tests which are the first step toward an implementation of a noncontact device for the detection of agricultural parameters such as kernel abortion.

13.

Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production.

Goldrick, Matthew; Keshet, Joseph; Gustafson, Erin; Heller, Jordana; Needle, Jeremy.

Cognition ; 149: 31-9, 2016 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-26779665

RESUMO

Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When "pig" is mispronounced as "big," the resulting /b/ sound differs from correct productions of "big," moving towards intended "pig"-revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing for objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing.

Assuntos

Cognição , Reconhecimento Automatizado de Padrão/métodos , Fonética , Fala , Feminino , Humanos , Aprendizado de Máquina , Masculino , Processamento de Sinais Assistido por Computador , Acústica da Fala

14.

Automatic measurement of vowel duration via structured prediction.

Adi, Yossi; Keshet, Joseph; Cibelli, Emily; Gustafson, Erin; Clopper, Cynthia; Goldrick, Matthew.

J Acoust Soc Am ; 140(6): 4517, 2016 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-28040034

RESUMO

A key barrier to making phonetic studies scalable and replicable is the need to rely on subjective, manual annotation. To help meet this challenge, a machine learning algorithm was developed for automatic measurement of a widely used phonetic measure: vowel duration. Manually-annotated data were used to train a model that takes as input an arbitrary length segment of the acoustic signal containing a single vowel that is preceded and followed by consonants and outputs the duration of the vowel. The model is based on the structured prediction framework. The input signal and a hypothesized set of a vowel's onset and offset are mapped to an abstract vector space by a set of acoustic feature functions. The learning algorithm is trained in this space to minimize the difference in expectations between predicted and manually-measured vowel durations. The trained model can then automatically estimate vowel durations without phonetic or orthographic transcription. Results comparing the model to three sets of manually annotated data suggest it outperformed the current gold standard for duration measurement, an hidden Markov model-based forced aligner (which requires orthographic or phonetic transcription as an input).

Assuntos

Fonética , Acústica , Algoritmos , Aprendizado de Máquina , Acústica da Fala , Percepção da Fala

15.

VOWEL DURATION MEASUREMENT USING DEEP NEURAL NETWORKS.

Adi, Yossi; Keshet, Joseph; Goldrick, Matthew.

IEEE Int Workshop Mach Learn Signal Process ; 20152015 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-29034132

RESUMO

Vowel durations are most often utilized in studies addressing specific issues in phonetics. Thus far this has been hampered by a reliance on subjective, labor-intensive manual annotation. Our goal is to build an algorithm for automatic accurate measurement of vowel duration, where the input to the algorithm is a speech segment contains one vowel preceded and followed by consonants (CVC). Our algorithm is based on a deep neural network trained at the frame level on manually annotated data from a phonetic study. Specifically, we try two deep-network architectures: convolutional neural network (CNN), and deep belief network (DBN), and compare their accuracy to an HMM-based forced aligner. Results suggest that CNN is better than DBN, and both CNN and HMM-based forced aligner are comparable in their results, but neither of them yielded the same predictions as models fit to manually annotated data.

16.

Automatic measurement of voice onset time using discriminative structured prediction.

Sonderegger, Morgan; Keshet, Joseph.

J Acoust Soc Am ; 132(6): 3965-79, 2012 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-23231126

RESUMO

A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of an arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50-250 manually labeled examples, demonstrating the method's practical applicability to new datasets.

Assuntos

Algoritmos , Fonética , Processamento de Sinais Assistido por Computador , Acústica da Fala , Medida da Produção da Fala/métodos , Qualidade da Voz , Inteligência Artificial , Automação , Análise Discriminante , Humanos , Modelos Lineares , Reconhecimento Automatizado de Padrão , Periodicidade , Reprodutibilidade dos Testes , Espectrografia do Som , Fatores de Tempo

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA