Results 1 - 16 of 16
1.
Comput Speech Lang ; 29(1): 172-185, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25382935

ABSTRACT

For several decades now, there has been sporadic interest in automatically characterizing the speech impairment due to Parkinson's disease (PD). Most early studies were confined to quantifying a few speech features that were easy to compute. More recent studies have adopted a machine learning approach in which a large number of potential features are extracted and models are learned automatically from the data. In the same vein, here we characterize the disease using a relatively large cohort of 168 subjects recruited from three clinics. We elicited speech using three tasks - a sustained phonation task, a diadochokinetic task, and a reading task - all within a time budget of 4 minutes, prompted by a portable device. From these recordings, we extracted 1582 features for each subject using openSMILE, a standard feature extraction tool. We compare the effectiveness of three strategies for learning regularized regression models and find that ridge regression performs better than lasso and support vector regression on our task. We refine the feature extraction to capture pitch-related cues, including jitter and shimmer, more accurately using a time-varying harmonic model of speech. Our results show that the severity of the disease can be inferred from speech with a mean absolute error of about 5.5, explaining 61% of the variance, and consistently well above chance across all clinics. Of the three speech elicitation tasks, we find that the reading task captures cues significantly better than the diadochokinetic or sustained phonation tasks. In all, we have demonstrated that data collection and inference can be fully automated, and the results show that speech-based assessment has promising practical applications in PD. The techniques reported here are more widely applicable to other paralinguistic tasks in the clinical domain.
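A minimal sketch of the model comparison this abstract describes - ridge vs. lasso vs. linear support vector regression on high-dimensional openSMILE features, scored by mean absolute error under nested cross-validation. The feature matrix, severity scores, and hyperparameter grids below are placeholders, not the study's data or settings.

```python
# Sketch of the regularized-regression comparison: ridge vs. lasso vs.
# SVR on 1582-dimensional openSMILE-style features, evaluated by MAE.
# X, y, and the hyperparameter grids are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((168, 1582))   # placeholder for openSMILE features
y = rng.uniform(0, 108, 168)           # placeholder for severity scores

models = {
    "ridge": (Ridge(), {"ridge__alpha": np.logspace(-2, 4, 7)}),
    "lasso": (Lasso(max_iter=10000), {"lasso__alpha": np.logspace(-4, 1, 6)}),
    "svr":   (SVR(kernel="linear"), {"svr__C": np.logspace(-3, 2, 6)}),
}

for name, (estimator, grid) in models.items():
    pipe = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipe, grid, scoring="neg_mean_absolute_error", cv=3)
    # Nested cross-validation: outer folds estimate generalization error.
    scores = cross_val_score(search, X, y,
                             scoring="neg_mean_absolute_error", cv=3)
    print(f"{name}: MAE = {-scores.mean():.2f} +/- {scores.std():.2f}")
```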

2.
Comput Speech Lang ; 28(1)2014 Jan 01.
Article in English | MEDLINE | ID: mdl-24187435

ABSTRACT

Language is being increasingly harnessed not only to create natural human-machine interfaces but also to infer social behaviors and interactions. In the same vein, we investigate a novel spoken language task: inferring social relationships in two-party conversations - whether the two parties are family, strangers, or involved in business transactions. For our study, we created a corpus of all incoming and outgoing calls from a few homes over the span of a year. On this unique naturalistic corpus of everyday telephone conversations, which is unlike Switchboard or any other public domain corpus, we demonstrate that standard natural language processing techniques can achieve accuracies of about 88%, 82%, 74% and 80% in differentiating business from personal calls, family from non-family calls, familiar from unfamiliar calls, and family from other personal calls, respectively. Through a series of experiments with our classifiers, we characterize the properties of telephone conversations and find: (a) that the first 30 words of a call's opening are sufficient to distinguish business from personal calls, which could potentially be exploited in designing context-sensitive interfaces for smart phones; (b) that our corpus-based analysis does not support Schegloff and Sacks's conclusion, drawn from manual analysis of exemplars, that pre-closings differ significantly between business and personal calls - closings fared no better than a random segment; and (c) that the distribution of different types of calls is stable over durations as short as 1-2 months. In summary, our results show that social relationships can be inferred automatically in two-party conversations with sufficient accuracy to support practical applications.
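A sketch of the kind of classifier behind finding (a): a bag-of-words model over only the first 30 words of each call's opening. The two example transcripts are hypothetical placeholders, not corpus data.

```python
# Sketch: business vs. personal classification from the first 30 words
# of a call's opening. The transcripts here are hypothetical examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

calls = [
    ("hello thank you for calling acme support how may i help you", "business"),
    ("hey it's me are you coming over for dinner tonight", "personal"),
]
openings = [" ".join(text.split()[:30]) for text, _ in calls]  # first 30 words
labels = [label for _, label in calls]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(openings, labels)
print(clf.predict(["good morning this is the billing department calling"]))
```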

3.
Article in English | MEDLINE | ID: mdl-25571060

ABSTRACT

Vocalization is an important cue for recognizing monkeys' behaviors. Previous studies have shown that the frequencies, types, and lengths of vocalizations reveal significant information about social interactions in a group of monkeys. In this work, we describe a corpus of monkey vocalizations recorded at the Oregon National Primate Research Center with the goal of developing automatic methods for recognizing the social behaviors of individuals in groups. The constraints of the problem necessitated using tiny low-power recorders mounted on the monkeys' collars. The recordings from each monkey's recorder nonetheless contain vocalizations not only from the monkey wearing the recorder but also from its spatial neighbors. The devices recorded vocalizations for two consecutive days, 12 hours each day, from each monkey in the group. As in sensor networks, low-power recorders are unreliable and suffer sample loss over long durations. Furthermore, the recordings contain high levels of background noise, including the clanging of metal collars against cages and conversations of caretakers. These practical issues pose an interesting challenge in processing the recordings. In this paper, we describe automated approaches to process the data efficiently, detect the vocalizations, and align the recordings from the same sessions.
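One way to align two collar recordings from the same session, sketched below, is to cross-correlate their short-time energy envelopes; this is one plausible strategy and not necessarily the paper's method, and all signals and parameters are illustrative.

```python
# Sketch: estimating the relative offset between two recordings of the
# same session by cross-correlating short-time energy envelopes.
# This alignment strategy and its parameters are assumptions.
import numpy as np
from scipy.signal import correlate

def energy_envelope(x, frame=1024, hop=512):
    # Short-time energy, one value per hop.
    n = (len(x) - frame) // hop + 1
    return np.array([np.sum(x[i*hop : i*hop+frame]**2) for i in range(n)])

def estimate_offset(a, b, sr, hop=512):
    ea, eb = energy_envelope(a), energy_envelope(b)
    ea, eb = ea - ea.mean(), eb - eb.mean()
    xc = correlate(ea, eb, mode="full")
    lag_frames = np.argmax(xc) - (len(eb) - 1)
    return -lag_frames * hop / sr      # how far b lags a, in seconds

sr = 16000
t = np.arange(sr * 5) / sr
a = np.sin(2 * np.pi * 440 * t) * (t > 2)        # tone starts at 2 s
shift = int(0.5 * sr)
b = np.concatenate([np.zeros(shift), a[:-shift]])  # same signal, 0.5 s later
print(f"estimated offset: {estimate_offset(a, b, sr):.2f} s")
```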


Subject(s)
Social Behavior , Vocalization, Animal , Animals , Automation , Hierarchy, Social , Macaca mulatta , Tape Recording , Video Recording
4.
Article in English | MEDLINE | ID: mdl-33288990

ABSTRACT

In this paper, we investigate the problem of detecting depression from recordings of subjects' speech using speech processing and machine learning. There has been considerable interest in this problem in recent years due to the potential for developing objective assessments from real-world behaviors, which may provide valuable supplementary clinical information or may be useful in screening. The cues for depression may be present in "what is said" (content) and "how it is said" (prosody). Given the limited amount of text data, even in this relatively large study, it is difficult to employ standard methods of learning models from n-gram features. Instead, we learn models using word representations in an alternative feature space of valence and arousal. This is akin to embedding words into a real vector space, albeit with manual ratings instead of representations learned with deep neural networks [1]. For extracting prosody, we employ standard feature extractors such as those implemented in openSMILE and compare them with features extracted from the harmonic models we have been developing in recent years. Our experiments show that our harmonic-model features detect depression from spoken utterances more accurately than the alternatives. The content features provide additional improvements, yielding an accuracy of about 74%, sufficient to be useful in screening applications.
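A minimal sketch of the content representation this abstract describes: each utterance is summarized by statistics over per-word valence and arousal ratings rather than n-grams. The tiny lexicon and its values are hypothetical stand-ins for manually rated affective norms.

```python
# Sketch: utterance features as statistics over per-word valence/arousal
# ratings. The lexicon below is a hypothetical stand-in for rated norms.
import numpy as np

AFFECT = {            # word -> (valence, arousal); illustrative values
    "happy": (0.9, 0.6), "sad": (0.1, 0.3),
    "calm": (0.7, 0.1), "angry": (0.2, 0.9),
}

def affect_features(utterance):
    scores = np.array([AFFECT[w] for w in utterance.lower().split()
                       if w in AFFECT])
    if len(scores) == 0:
        return np.zeros(8)
    # Per-utterance statistics of the per-word affect values.
    return np.concatenate([scores.mean(0), scores.std(0),
                           scores.min(0), scores.max(0)])

print(affect_features("I feel sad and angry today"))
```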

5.
Article in English | MEDLINE | ID: mdl-33642942

ABSTRACT

Methods are proposed for measuring affective valence and arousal in speech. The methods apply support vector regression to prosodic and text features to predict human valence and arousal ratings of three stimulus types: speech, delexicalized speech, and text transcripts. Text features are extracted from transcripts via a lookup table of per-word valence and arousal values, from which per-utterance statistics are computed. Prediction of arousal ratings of delexicalized speech and of speech from prosodic features was successful, with accuracy not far from the limits set by the reliability of the human ratings. Prediction of valence for these stimulus types, as well as prediction of both dimensions for text stimuli, proved more difficult, even though the corresponding human ratings were equally reliable. Text-based features did, however, add to the accuracy of valence prediction for speech stimuli. We conclude that arousal of speech can be measured reliably but valence cannot, and that improving the latter requires better lexical features.
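A sketch of the regression setup named here: support vector regression from prosodic features to mean human arousal ratings, evaluated by correlation with held-out ratings. The data below is a synthetic placeholder, not the study's stimuli.

```python
# Sketch: SVR from prosodic features to human arousal ratings, scored
# by Pearson correlation on held-out data. Data is synthetic.
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))           # placeholder prosodic features
y = X[:, 0] * 0.8 + rng.normal(0, 0.3, 200)  # placeholder arousal ratings

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X_tr, y_tr)
r, _ = pearsonr(model.predict(X_te), y_te)
print(f"correlation with human ratings: r = {r:.2f}")
```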

6.
Article in English | MEDLINE | ID: mdl-33680571

ABSTRACT

In this paper, we investigate the problem of detecting social contexts from audio recordings of everyday life, such as life-logs. Unlike standard corpora of telephone speech or broadcast news, these recordings contain a wide variety of background noise. In such applications, it is inherently difficult to collect and label all the representative noise needed to learn models in a fully supervised manner; the amount of labeled data that can be expected is small relative to the available recordings. This lends itself naturally to unsupervised feature extraction using sparse auto-encoders, followed by supervised learning of a classifier for social contexts. We investigate different strategies for training these models and report results on a real-world application.
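A compact sketch of the two-stage pipeline described here: a sparse auto-encoder trained on unlabeled audio features, whose encodings then feed a supervised classifier. Feature dimensions, the sparsity penalty, and training details are assumptions.

```python
# Sketch of the two-stage pipeline: unsupervised sparse auto-encoder,
# then a supervised classifier on the learned codes. Shapes, penalty
# weights, and training details are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_in=128, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

unlabeled = torch.randn(1000, 128)           # placeholder audio features
model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):                          # unsupervised pre-training
    recon, h = model(unlabeled)
    # Reconstruction error plus an L1 sparsity penalty on activations.
    loss = nn.functional.mse_loss(recon, unlabeled) + 1e-3 * h.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = model.encoder(unlabeled).numpy()
# codes can now train any supervised classifier (e.g., logistic
# regression) on the small labeled subset.
```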

7.
Neuroimage ; 75: 165-175, 2013 Jul 15.
Article in English | MEDLINE | ID: mdl-23501054

ABSTRACT

Resting state functional connectivity MRI (rs-fcMRI) is a popular technique used to gauge the functional relatedness between regions in the brain for typical and special populations. Most of the work to date determines this relationship using Pearson's correlation on BOLD fMRI timeseries. However, it has been recognized that there are at least two key limitations to this method. First, it cannot resolve direct from indirect connections/influences. Second, it cannot determine the direction of information flow between regions. In the current paper, we follow up on recent work by Smith et al. (2011) and apply the PC algorithm to both simulated and empirical data to determine whether these two factors can be discerned with group-average, as opposed to single-subject, functional connectivity data. When applied to simulated individual subjects, the algorithm performs well at distinguishing direct from indirect connections but fails to determine directionality. However, when applied at the group level, the PC algorithm gives strong results for both direct versus indirect connections and the direction of information flow. Applying the algorithm to empirical data, with a diffusion-weighted imaging (DWI) structural connectivity matrix as the baseline, the PC algorithm outperformed direct correlations. We conclude that, under certain conditions, the PC algorithm leads to an improved estimate of brain network structure compared to traditional connectivity analysis based on correlations.
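To make the distinction between direct and indirect connections concrete, here is a heavily simplified sketch of the PC algorithm's skeleton phase using Fisher-z partial-correlation tests with conditioning sets of size 0 and 1 only; a full implementation searches larger conditioning sets and then orients edges. The toy data is synthetic.

```python
# Simplified sketch of the PC algorithm's skeleton phase: prune edges
# whose (partial) correlation is insignificant, using conditioning sets
# of size 0 and 1. Illustrative only; real use needs a full PC search.
import numpy as np
from itertools import combinations
from scipy.stats import norm

def fisher_z_pval(r, n, k):
    # Test H0: (partial) correlation is zero, n samples, k conditioners.
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    return 2 * (1 - norm.cdf(abs(z)))

def partial_corr(C, i, j, k):
    # Partial correlation of i and j given k, from correlation matrix C.
    num = C[i, j] - C[i, k] * C[j, k]
    return num / np.sqrt((1 - C[i, k]**2) * (1 - C[j, k]**2))

def pc_skeleton(X, alpha=0.01):
    n, p = X.shape
    C = np.corrcoef(X, rowvar=False)
    # Order 0: drop edges with insignificant marginal correlation.
    adj = {(i, j) for i, j in combinations(range(p), 2)
           if fisher_z_pval(C[i, j], n, 0) < alpha}
    # Order 1: drop edges explained away by a single conditioning node.
    for i, j in list(adj):
        for k in range(p):
            if k in (i, j):
                continue
            if fisher_z_pval(partial_corr(C, i, j, k), n, 1) >= alpha:
                adj.discard((i, j))
                break
    return adj

rng = np.random.default_rng(0)
a = rng.standard_normal(500)
b = a + 0.5 * rng.standard_normal(500)    # a -> b
c = b + 0.5 * rng.standard_normal(500)    # b -> c; a-c only indirect
print(pc_skeleton(np.column_stack([a, b, c])))   # expect {(0, 1), (1, 2)}
```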


Subject(s)
Algorithms , Brain/physiology , Image Interpretation, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Neural Pathways/physiology , Adult , Bayes Theorem , Female , Humans , Male
8.
Interspeech ; 2013: 191-194, 2013 Aug.
Article in English | MEDLINE | ID: mdl-33564670

ABSTRACT

In this paper, we report experiments on the Interspeech 2013 Autism Challenge, which comprises two subtasks - detecting children with ASD and classifying them into four subtypes. We apply our recently developed algorithm to extract speech features that overcome certain weaknesses of other currently available algorithms [1, 2]. From the input speech signal, we estimate the parameters of a harmonic model of the voiced speech for each frame, including the fundamental frequency (f0). From the fundamental frequencies and the reconstructed noise-free signal, we compute derived features such as the Harmonic-to-Noise Ratio (HNR), shimmer, and jitter. In previous work, we found that these features detect voiced segments and speech more accurately than other algorithms and that they are useful in rating the severity of a subject's Parkinson's disease [3]. Here, we employ these features along with standard features such as energy, cepstral, and spectral features. With these features, we detect ASD using regression and identify the subtype using a classifier. We find that our features improve performance, measured in terms of unweighted average recall (UAR), over the baseline results by 2.3% for detecting autism spectrum disorder and by 2.8% for classifying the disorder into four categories.
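For reference, a sketch of two of the derived measures named above, computed from per-cycle estimates: jitter as relative cycle-to-cycle period perturbation and shimmer as the analogous amplitude perturbation. These follow the common "local" definitions; the example tracks are placeholders.

```python
# Sketch: local jitter and shimmer from per-cycle f0 and amplitude
# tracks, using the common relative-perturbation definitions.
# The input tracks below are synthetic placeholders.
import numpy as np

def local_jitter(f0_track):
    periods = 1.0 / np.asarray(f0_track)             # seconds per cycle
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amp_track):
    amps = np.asarray(amp_track)
    return np.mean(np.abs(np.diff(amps))) / np.mean(amps)

f0 = 120 + np.random.default_rng(0).normal(0, 1.5, 100)    # Hz, per cycle
amp = 0.5 + np.random.default_rng(1).normal(0, 0.01, 100)  # linear amplitude
print(f"jitter:  {local_jitter(f0):.4f}")
print(f"shimmer: {local_shimmer(amp):.4f}")
```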

9.
SLT Workshop Spok Lang Technol ; 2012: 438-442, 2012 Dec.
Article in English | MEDLINE | ID: mdl-33644784

ABSTRACT

We investigate methods for detecting voiced segments in everyday conversations from ambient recordings. Such recordings contain a high diversity of background noise, making it difficult or infeasible to collect representative labelled samples for estimating noise-specific HMM models. The popular utility get-f0 and its derivatives compute normalized cross-correlation for detecting voiced segments, which unfortunately is sensitive to different types of noise. Exploiting the fact that voiced speech is not just periodic but also rich in harmonics, we model voiced segments using harmonic models, which have recently gained considerable attention. In previous work, the parameters of the model were estimated independently for each frame using a maximum likelihood criterion. However, since the distribution of harmonic coefficients depends on a speaker's articulators, we estimate the model parameters more robustly using a maximum a posteriori criterion. We use the likelihood of voicing, computed from the harmonic model, as the observation probability of an HMM and detect speech using this unsupervised HMM. One caveat of harmonic models is that they fail to distinguish speech from other stationary harmonic noise. We rectify this weakness by taking advantage of the non-stationary nature of speech. We evaluate our models empirically on the task of detecting speech in a large corpus of everyday speech and demonstrate that they perform significantly better than the standard voicing detection algorithms employed in popular tools.
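A minimal sketch of the per-frame harmonic fit underlying this approach: least-squares fitting of sinusoids at multiples of a candidate f0, with voicing scored by the fraction of frame energy the harmonic reconstruction explains. This is an ML-style fit; the paper's MAP estimation adds a prior on the coefficients that the sketch omits.

```python
# Sketch: fit a harmonic model to one frame by least squares for a
# candidate f0; score voicing by the explained-energy fraction.
import numpy as np

def harmonic_fit(frame, f0, sr, n_harmonics=10):
    t = np.arange(len(frame)) / sr
    # Design matrix of cosines and sines at multiples of f0.
    cols = []
    for h in range(1, n_harmonics + 1):
        cols += [np.cos(2*np.pi*h*f0*t), np.sin(2*np.pi*h*f0*t)]
    H = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(H, frame, rcond=None)
    recon = H @ coeffs
    explained = np.sum(recon**2) / np.sum(frame**2)   # in [0, 1]
    return recon, explained

sr = 16000
t = np.arange(int(0.032 * sr)) / sr                   # one 32 ms frame
voiced = np.sin(2*np.pi*150*t) + 0.3*np.sin(2*np.pi*300*t)
noise = np.random.default_rng(0).standard_normal(len(t))
print(f"voiced frame: {harmonic_fit(voiced, 150, sr)[1]:.2f}")   # near 1
print(f"noise frame:  {harmonic_fit(noise, 150, sr)[1]:.2f}")    # near 0
```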

10.
Interspeech ; 2012: 538-541, 2012 Sep.
Article in English | MEDLINE | ID: mdl-33855060

ABSTRACT

In this paper, we report our experiments on the Interspeech 2012 Speaker Trait Pathology challenge [2]. Specifically, we investigate two factors that impact the acoustic properties of the utterances collected in this task. Although the task treats utterances as independent data points, multiple utterances are recorded from individual speakers. Furthermore, the utterances correspond to readings of 17 given written sentences. In one experiment, we attempt to reduce speaker-specific variation through dimensionality reduction. While these experiments showed promising results on the development set, the performance did not carry over to the evaluation set. In another, we learn classifiers conditioned on the sentences to capture sentence-specific signatures. This approach improved performance over the baseline on the development set, and the improvement translated to marginal gains on the evaluation set. These experiments demonstrate the need to pay attention to independence assumptions when collecting data and defining clinical tasks.
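A sketch of the second strategy: one classifier per prompted sentence, dispatched by sentence ID at test time. Feature dimensions, data layout, and the choice of SVM are assumptions, and the data is a synthetic placeholder.

```python
# Sketch: sentence-conditioned classifiers - one model per prompted
# sentence, selected by sentence ID at test time. Data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, dim, n_sentences = 340, 30, 17
X = rng.standard_normal((n, dim))          # placeholder acoustic features
sent_id = rng.integers(0, n_sentences, n)  # which of the 17 sentences
y = rng.integers(0, 2, n)                  # placeholder pathology labels

# Train a sentence-specific classifier to capture per-sentence signatures.
per_sentence = {}
for s in range(n_sentences):
    mask = sent_id == s
    per_sentence[s] = SVC(kernel="linear").fit(X[mask], y[mask])

def predict(x, s):
    return per_sentence[s].predict(x.reshape(1, -1))[0]

print(predict(X[0], sent_id[0]))
```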

11.
Proc Conf ; : 112-119, 2012.
Article in English | MEDLINE | ID: mdl-24419500

ABSTRACT

This study aims to automatically infer the social nature of conversations from their content. To place this work in context, our motivation stems from the need to understand how social disengagement affects cognitive decline or depression among older adults. For this purpose, we collected a comprehensive and naturalistic corpus comprising all the incoming and outgoing telephone calls from 10 subjects over the duration of a year. As a first step, we learned a binary classifier to filter out business-related conversations, achieving an accuracy of about 85%. This classification task provides a convenient tool to probe the nature of telephone conversations. We evaluated the utility of openings and closings in differentiating personal calls, and find that empirical results on a large corpus do not support the hypothesis of Schegloff and Sacks that personal conversations are marked by unique closing structures. For classifying different types of social relationships, such as family vs. others, we investigated features related to language use (entropy), a hand-crafted dictionary (LIWC), and topics learned using unsupervised latent Dirichlet allocation (LDA). Our results show that the posteriors over LDA topics provide consistently higher accuracy (60-81%) than LIWC or language-use features in distinguishing different types of conversations.
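A minimal sketch of the LDA-feature pipeline this abstract describes: fit latent Dirichlet allocation on call transcripts, then use the per-call topic posteriors as the feature vector for a relationship classifier. The tiny corpus below is hypothetical.

```python
# Sketch: LDA topic posteriors as features for relationship
# classification. The four transcripts are hypothetical examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

transcripts = [
    "hi mom yes dinner sunday love you too",
    "your appointment is confirmed for tuesday at three",
    "grandpa called about the birthday party this weekend",
    "the invoice was sent please confirm the payment",
]
labels = ["family", "other", "family", "other"]

counts = CountVectorizer().fit_transform(transcripts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)          # per-call topic posteriors
clf = LogisticRegression().fit(theta, labels)
print(clf.predict(theta))
```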

12.
Article in English | MEDLINE | ID: mdl-21095825

ABSTRACT

Parkinson's disease is known to cause mild to profound communication impairments depending on the stage of progression of the disease. There is growing interest in home-based assessment tools for measuring the severity of Parkinson's disease, and speech is an appealing source of evidence. This paper reports tasks to elicit a versatile sample of voice production, algorithms to extract useful information from speech, and models to predict the severity of the disease. Apart from standard features from the time domain (e.g., energy, speaking rate), spectral domain (e.g., pitch, spectral entropy), and cepstral domain (e.g., mel-frequency warped cepstral coefficients), we also estimate harmonic-to-noise ratio, shimmer, and jitter using our recently developed algorithms. In a preliminary study, we evaluate the proposed paradigm on data collected at 2 clinics from 82 subjects in 116 assessment sessions. Our results show that the information extracted from speech, elicited through 3 tasks, can predict the severity of the disease to within a mean absolute error of 5.7 with respect to clinical assessment on the Unified Parkinson's Disease Rating Scale, whose target motor subscale ranges from 0 to 108. Our analysis shows that eliciting speech through less constrained tasks provides useful information not captured by the widely employed phonation task. While still preliminary, our results demonstrate that the proposed computational approach has promising real-world applications, such as home-based assessment or telemonitoring of Parkinson's disease.


Subject(s)
Parkinson Disease/pathology , Parkinson Disease/physiopathology , Severity of Illness Index , Speech/physiology , Humans , Regression Analysis , Reproducibility of Results
13.
Article in English | MEDLINE | ID: mdl-33659095

ABSTRACT

Speech pathologists often describe voice quality in hypokinetic dysarthria or Parkinsonism as harsh or breathy, which has been largely attributed to incomplete closure of the vocal folds. Exploiting its harmonic nature, we separate the voiced portion of the speech to obtain an objective estimate of this quality. The utility of the proposed approach was evaluated by predicting 116 clinical ratings of Parkinson's disease for 82 subjects. Our results show that the information extracted from speech, elicited through 3 tasks, can predict the motor subscore (range 0 to 108) of the clinical measure, the Unified Parkinson's Disease Rating Scale, within a mean absolute error of 5.7 and a standard deviation of about 2.0. While still preliminary, our results are significant and demonstrate that the proposed computational approach has promising real-world applications, such as home-based assessment or telemonitoring of Parkinson's disease.

14.
Article in English | MEDLINE | ID: mdl-22754884

ABSTRACT

The ability to reliably infer the nature of telephone conversations opens up a variety of applications, ranging from designing context-sensitive user interfaces on smartphones to providing new tools for social psychologists and social scientists to study and understand the social life of different subpopulations within different contexts. Using a unique corpus of everyday telephone conversations collected from eight residences over the duration of a year, we investigate the utility of popular features, extracted solely from the content, in distinguishing business-oriented calls from others. Through feature selection experiments, we find that the discrimination can be performed robustly for a majority of the calls using a small set of features. Remarkably, features learned with unsupervised methods, specifically latent Dirichlet allocation, perform almost as well as those from supervised methods. The unsupervised clusters learned in this task show promise for finer-grained inference of the social nature of telephone conversations.

15.
Comput Aided Surg ; 11(5): 220-30, 2006 Sep.
Article in English | MEDLINE | ID: mdl-17127647

ABSTRACT

This paper reports our progress in developing techniques for "parsing" raw motion data from a simple surgical task into a labeled sequence of surgical gestures. The ability to automatically detect and segment surgical motion can be useful in evaluating surgical skill, providing surgical training feedback, or documenting essential aspects of a procedure. If processed online, the information can be used to provide context-specific information or motion enhancements to the surgeon. However, in every case, the key step is to relate recorded motion data to a model of the procedure being performed. Robotic surgical systems such as the da Vinci system from Intuitive Surgical provide a rich source of motion and video data from surgical procedures. The application programming interface (API) of the da Vinci outputs 192 kinematic values at 10 Hz. Through a series of feature-processing steps tailored to this task, these highly redundant features are projected into a compact and discriminative space. The resulting classifier is simple and effective. Cross-validation experiments show that the proposed approach can achieve accuracies higher than 90% when segmenting gestures in a 4-throw suturing task, for both expert and intermediate surgeons. These preliminary results suggest that gesture-specific features can be extracted to provide highly accurate surgical skill evaluation.
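A sketch of the overall pipeline shape: per-frame kinematic vectors projected into a compact discriminative space and then classified frame by frame into gestures. Linear discriminant analysis stands in here for the paper's feature-processing steps, and all dimensions and data are placeholders for the 192-value da Vinci stream.

```python
# Sketch of the pipeline shape: project per-frame kinematics into a
# discriminative space (here via LDA, a stand-in for the paper's
# feature processing), then classify frames into gestures. Data is
# a synthetic placeholder for the 192-value da Vinci stream at 10 Hz.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_frames, n_kin, n_gestures = 2000, 192, 8
X = rng.standard_normal((n_frames, n_kin))     # placeholder kinematics
y = rng.integers(0, n_gestures, n_frames)      # placeholder gesture labels

# LDA projects to at most (n_gestures - 1) discriminative dimensions.
model = make_pipeline(
    LinearDiscriminantAnalysis(n_components=n_gestures - 1),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X[:1500], y[:1500])
print("frame accuracy:", model.score(X[1500:], y[1500:]))
```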


Subject(s)
Robotics/instrumentation , Surgery, Computer-Assisted/instrumentation , Algorithms , Clinical Competence , Humans , Man-Machine Systems , Models, Theoretical , Motion , Psychomotor Performance , Robotics/methods , Robotics/trends , Software , Surgery, Computer-Assisted/methods , Surgery, Computer-Assisted/trends , Time Factors , User-Computer Interface
16.
Article in English | MEDLINE | ID: mdl-16685920

ABSTRACT

Robotic surgical systems such as Intuitive Surgical's da Vinci system provide a rich source of motion and video data from surgical procedures. In principle, this data can be used to evaluate surgical skill, provide surgical training feedback, or document essential aspects of a procedure. If processed online, the data can be used to provide context-specific information or motion enhancements to the surgeon. However, in every case, the key step is to relate recorded motion data to a model of the procedure being performed. This paper describes our progress in developing techniques for "parsing" raw motion data from a surgical task into a labelled sequence of surgical gestures. Our current techniques have achieved >90% fully automated recognition rates on 15 datasets.


Subject(s)
Artificial Intelligence , Image Interpretation, Computer-Assisted/methods , Imaging, Three-Dimensional/methods , Pattern Recognition, Automated/methods , Robotics/methods , Surgery, Computer-Assisted/methods , Video Recording/methods , Motion , Photography/methods