Results 1 - 20 of 672
1.
Int Arch Otorhinolaryngol ; 28(3): e473-e480, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38974622

ABSTRACT

Introduction: In clinical practice, patients with the same degree and configuration of hearing loss, or even with normal audiometric thresholds, show substantially different speech perception performance. This likely happens because factors other than auditory sensitivity interfere with speech perception. Studies are therefore needed to investigate listeners' performance in unfavorable listening conditions and identify the processes that interfere with their speech perception. Objective: To verify the influence of age, temporal processing, and working memory on speech recognition in noise. Methods: Thirty-eight adult and elderly individuals with normal hearing thresholds participated in the study. Participants were divided into two groups: the adult group (G1), composed of 10 individuals aged 21 to 33 years, and the elderly group (G2), with 28 participants aged 60 to 81 years. They underwent audiological assessment with the Portuguese Sentence List Test, Gaps-in-Noise test, Digit Span Memory test, Running Span Task, Corsi Block-Tapping test, and Visual Pattern test. Results: The Running Span Task score proved to be a statistically significant predictor of the listening-in-noise variable. This result showed that the difference in listening-in-noise performance between groups G1 and G2 is due not only to aging but also to changes in working memory. Conclusion: The study showed that working memory is a predictor of listening performance in noise in individuals with normal hearing, and that this task can provide important information for investigating individuals who have difficulty hearing in unfavorable environments.

2.
Sensors (Basel) ; 24(12)2024 Jun 14.
Article in English | MEDLINE | ID: mdl-38931629

ABSTRACT

Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and Transformer. However, error accumulation in these hybrid decoders hinders further improvements in accuracy. Additionally, most existing models are built on the Transformer architecture, which tends to be complex and unfriendly to small datasets. Hence, we propose a Nonlinear Regularization Decoding Method for Speech Recognition. First, we introduce a nonlinear Transformer decoder that breaks away from traditional left-to-right or right-to-left decoding orders and enables associations between any characters, mitigating the limitations of Transformer architectures on small datasets. Second, we propose a novel regularization attention module that optimizes the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model variant to address the challenge of excessive model size. The experimental results indicate that our model performs well: compared to the baseline, it achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on the Aishell1, Primewords, Free ST Chinese Corpus, and Uyghur Common Voice 16.1 datasets, respectively.
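The abstract does not specify the form of the regularization applied to the attention score matrix, so the sketch below only illustrates the general idea on top of standard scaled dot-product attention, using a row-entropy penalty as a stand-in regularizer; all names and values are hypothetical.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention; returns outputs and the score matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # raw attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

def entropy_regularizer(weights, eps=1e-9):
    """Illustrative penalty on the attention matrix (mean row entropy);
    the paper's actual regularization module is not described in the abstract."""
    return float(-(weights * np.log(weights + eps)).sum(axis=-1).mean())

# toy example: 4 query positions, 6 key positions, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, round(entropy_regularizer(attn), 3))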


Subject(s)
Algorithms , Speech Recognition Software , Humans , Speech/physiology , Nonlinear Dynamics , Pattern Recognition, Automated/methods
3.
Cogn Res Princ Implic ; 9(1): 35, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38834918

ABSTRACT

Multilingual speakers can find speech recognition in everyday environments like restaurants and open-plan offices particularly challenging. In a world where speaking multiple languages is increasingly common, effective clinical and educational interventions will require a better understanding of how factors like multilingual contexts and listeners' language proficiency interact with adverse listening environments. For example, word and phrase recognition is facilitated when competing voices speak different languages. Is this due to a "release from masking" from lower-level acoustic differences between languages and talkers, or higher-level cognitive and linguistic factors? To address this question, we created a "one-man bilingual cocktail party" selective attention task using English and Mandarin speech from one bilingual talker to reduce low-level acoustic cues. In Experiment 1, 58 listeners more accurately recognized English targets when distracting speech was Mandarin compared to English. Bilingual Mandarin-English listeners experienced significantly more interference and intrusions from the Mandarin distractor than did English listeners, exacerbated by challenging target-to-masker ratios. In Experiment 2, 29 Mandarin-English bilingual listeners exhibited linguistic release from masking in both languages. Bilinguals experienced greater release from masking when attending to English, confirming an influence of linguistic knowledge on the "cocktail party" paradigm that is separate from primarily energetic masking effects. Effects of higher-order language processing and expertise emerge only in the most demanding target-to-masker contexts. The "one-man bilingual cocktail party" establishes a useful tool for future investigations and characterization of communication challenges in the large and growing worldwide community of Mandarin-English bilinguals.
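As a rough illustration of how a target and a competing masker can be combined at a prescribed target-to-masker ratio in this kind of selective attention task, the following minimal sketch scales the masker so that the power ratio reaches a given TMR in dB; the signals and values are placeholders, not the study's stimuli.

import numpy as np

def mix_at_tmr(target, masker, tmr_db):
    """Mix a target and a masker at a target-to-masker ratio of tmr_db,
    scaling the masker so that 10*log10(P_target / P_masker) = tmr_db."""
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2)
    desired_p_masker = p_target / (10 ** (tmr_db / 10))
    gain = np.sqrt(desired_p_masker / p_masker)
    return target + gain * masker

# toy example: 1 s of noise standing in for speech, mixed at -6 dB TMR
sr = 16000
rng = np.random.default_rng(1)
target = rng.normal(size=sr)
masker = rng.normal(size=sr)
mixture = mix_at_tmr(target, masker, tmr_db=-6.0)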


Subject(s)
Attention , Multilingualism , Speech Perception , Humans , Speech Perception/physiology , Adult , Female , Male , Young Adult , Attention/physiology , Perceptual Masking/physiology , Psycholinguistics
4.
ACS Appl Mater Interfaces ; 16(25): 32727-32738, 2024 Jun 26.
Article in English | MEDLINE | ID: mdl-38864718

ABSTRACT

Enhancing the sensitivity of capacitive pressure sensors through microstructure design may compromise device reliability and relies on intricate manufacturing processes. Balancing the intrinsic properties (elastic modulus and dielectric constant) of the dielectric layer material is an effective way to address this issue. Here, we introduce a liquid metal (LM) hybrid elastomer prepared from a chain-extension-free polyurethane (PU) and LM. The synergistic strategies of extender-free synthesis and LM doping effectively reduce the elastic modulus (from 7.6 ± 0.2 to 2.1 ± 0.3 MPa) and enhance the dielectric constant (from 5.12 to 8.17 at 1 kHz) of the LM hybrid elastomers. Interestingly, the LM hybrid elastomer combines reprocessability, recyclability, and photothermal conversion. The resulting flexible pressure sensor can detect hand and throat muscle movements, and high-precision recognition of seven spoken words has been achieved using a convolutional neural network (CNN). This work provides a route toward designing and manufacturing wearable, recyclable, and intelligent-control pressure sensors.
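The abstract does not describe the CNN used for the seven-word recognition task, so the following is only a minimal, hypothetical PyTorch sketch of a 1D convolutional classifier mapping a single-channel sensor waveform to seven word classes; layer sizes and input length are assumptions.

import torch
import torch.nn as nn

class SevenWordCNN(nn.Module):
    """Minimal 1D CNN mapping a single-channel sensor trace to 7 word classes."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):            # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)

# toy forward pass on a batch of 4 fake capacitance traces, 2048 samples each
model = SevenWordCNN()
logits = model(torch.randn(4, 1, 2048))
print(logits.shape)  # torch.Size([4, 7])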

5.
Psychol Res Behav Manag ; 17: 2205-2232, 2024.
Article in English | MEDLINE | ID: mdl-38835654

ABSTRACT

Purpose: Speech disorders profoundly impact overall quality of life by impeding social participation and hindering effective communication. This study addresses the gap in systematic reviews concerning machine learning-based assistive technology for individuals with speech disorders. The overarching purpose is to offer a comprehensive overview of the field through a Systematic Literature Review (SLR) and provide valuable insights into the landscape of ML-based solutions and related studies. Methods: The research employs a systematic approach, utilizing an SLR methodology. The study extensively examines the existing literature on machine learning-based assistive technology for speech disorders. Specific attention is given to ML techniques, the characteristics of the datasets exploited in the training phase, speaker languages, feature extraction techniques, and the features employed by ML algorithms. Originality: This study contributes to the existing literature by systematically exploring the machine learning landscape in assistive technology for speech disorders. The originality lies in the focused investigation of ML-based speech recognition for users with impaired speech over ten years (2014-2023). The emphasis on systematic research questions related to ML techniques, dataset characteristics, languages, feature extraction techniques, and feature sets adds a unique and comprehensive perspective to the current discourse. Findings: The systematic literature review identifies significant trends and critical studies published between 2014 and 2023. In the analysis of the 65 papers from prestigious journals, support vector machines and neural networks (CNN, DNN) were the most utilized ML techniques (20% and 16.92%, respectively), and the most studied condition was dysarthria (35/65 studies, 54%). Furthermore, an upsurge in the use of neural network-based architectures, mainly CNN and DNN, was observed after 2018. Almost half of the included studies were published between 2021 and 2022.

6.
J Clin Hypertens (Greenwich) ; 26(6): 656-664, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38778548

ABSTRACT

Artificial intelligence (AI) telephone follow-up is reliable for the follow-up and management of hypertensive patients: it takes less time and agrees with manual follow-up to a high degree. We conducted a reliability study to evaluate the efficiency of AI telephone follow-up in the management of hypertension. Between May 18 and June 30, 2020, 350 hypertensive patients managed by the Pengpu Community Health Service Center in Shanghai were recruited for follow-up, once by AI and once by a human. The second follow-up was conducted within 3-7 days (mean 5.5 days). The mean durations of the two calls were compared with a paired t-test, and Cohen's kappa coefficient was used to evaluate the reliability of the results between the two follow-up visits. The mean duration of AI calls was shorter (4.15 min) than that of manual calls (5.24 min, P < .001). The answers related to symptoms showed moderate to substantial consistency (κ: .465-.624, P < .001), and those related to complications showed fair consistency (κ: .349, P < .001). In terms of lifestyle, the answer related to smoking showed very high consistency (κ: .915, P < .001), while those addressing salt consumption, alcohol consumption, and exercise showed moderate to substantial consistency (κ: .402-.645, P < .001). There was moderate consistency in regular usage of medication (κ: .484, P < .001).
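For readers unfamiliar with the statistics used here, the snippet below sketches how Cohen's kappa and a paired t-test can be computed with scikit-learn and SciPy; the answer vectors and call durations are invented for illustration and are not the study's data.

from scipy.stats import ttest_rel
from sklearn.metrics import cohen_kappa_score

# hypothetical paired yes/no answers (1 = symptom reported) for 10 patients
ai_call     = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
manual_call = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
print("kappa:", cohen_kappa_score(ai_call, manual_call))

# hypothetical paired call durations (minutes) for the same patients
ai_minutes     = [4.0, 4.3, 3.9, 4.2, 4.1, 4.4, 4.0, 4.2, 4.1, 4.3]
manual_minutes = [5.1, 5.4, 5.0, 5.3, 5.2, 5.6, 5.1, 5.2, 5.3, 5.4]
print("paired t-test:", ttest_rel(ai_minutes, manual_minutes))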


Subject(s)
Artificial Intelligence , Hypertension , Telephone , Humans , Hypertension/drug therapy , Hypertension/diagnosis , Hypertension/epidemiology , Female , Male , Middle Aged , Reproducibility of Results , China/epidemiology , Follow-Up Studies , Aged , Adult
7.
Hear Res ; 448: 109031, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38761554

ABSTRACT

In recent studies, psychophysiological measures have been used as markers of listening effort, but there is limited research on the effect of hearing loss on such measures. The aim of the current study was to investigate the effect of hearing acuity on physiological responses and subjective measures acquired during different levels of listening demand, and to investigate the relationship between these measures. A total of 125 participants (37 males and 88 females, age range 37-72 years, pure-tone average hearing thresholds at the best ear between -5.0 and 68.8 dB HL, and asymmetry between ears between 0.0 and 87.5 dB) completed a listening task. A speech reception threshold (SRT) test was used with target sentences spoken by a female voice masked by male speech. Listening demand was manipulated using three levels of intelligibility: 20%, 50%, and 80% correct speech recognition (IL20%/IL50%/IL80%, respectively). During the task, peak pupil dilation (PPD), heart rate (HR), pre-ejection period (PEP), respiratory sinus arrhythmia (RSA), and skin conductance level (SCL) were measured. For each condition, subjective ratings of effort, performance, difficulty, and tendency to give up were also collected. Linear mixed effects models tested the effect of intelligibility level, hearing acuity, hearing asymmetry, and tinnitus complaints on the physiological reactivity (compared to baseline) and the subjective measures. PPD and PEP reactivity showed a non-monotonic relationship with intelligibility level, but no such effects were found for HR, RSA, or SCL reactivity. Participants with worse hearing acuity had lower PPD at all intelligibility levels and showed lower PEP baseline levels. Additionally, PPD and SCL reactivity were lower for participants who reported tinnitus complaints. For IL80%, but not IL50% or IL20%, participants with worse hearing acuity rated their listening effort as relatively high compared to participants with better hearing. Reactivity across the different physiological measures was not, or only weakly, correlated. Together, the results suggest that hearing acuity may be associated with altered sympathetic nervous system (re)activity. Research using psychophysiological measures as markers of listening effort to study the effect of hearing acuity is best served by the use of PPD and PEP.
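A linear mixed effects analysis of this kind can be sketched with statsmodels as below, using a random intercept per subject and fixed effects for intelligibility level and hearing acuity; the data frame is synthetic and the model formula is a simplified assumption, not the authors' exact specification.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic stand-in data: 20 subjects x 3 intelligibility levels
rng = np.random.default_rng(2)
subjects = np.repeat(np.arange(20), 3)
level = np.tile([20, 50, 80], 20)                       # intelligibility level (%)
acuity = np.repeat(rng.uniform(-5, 60, 20), 3)          # pure-tone average (dB HL)
ppd = 0.2 - 0.001 * level - 0.002 * acuity + rng.normal(0, 0.02, 60)

data = pd.DataFrame({"subject": subjects, "level": level,
                     "acuity": acuity, "ppd_reactivity": ppd})

# random intercept per subject; fixed effects of intelligibility level and hearing acuity
model = smf.mixedlm("ppd_reactivity ~ level + acuity", data, groups=data["subject"])
print(model.fit().summary())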


Subject(s)
Auditory Threshold , Hearing , Heart Rate , Speech Intelligibility , Speech Perception , Speech Reception Threshold Test , Humans , Male , Female , Middle Aged , Adult , Aged , Audiometry, Pure-Tone , Acoustic Stimulation , Perceptual Masking , Galvanic Skin Response , Pupil/physiology , Persons With Hearing Impairments/psychology
8.
Audiol Neurootol ; : 1-7, 2024 May 20.
Article in English | MEDLINE | ID: mdl-38768568

ABSTRACT

INTRODUCTION: This study aimed to verify the influence of speech stimulus presentation mode and speed on auditory recognition in cochlear implant (CI) users with poorer performance. METHODS: This cross-sectional observational study applied auditory speech perception tests to fifteen adults, using three different ways of presenting the stimulus in the absence of competing noise: monitored live voice (MLV); recorded speech at typical speed (RSTS); and recorded speech at slow speed (RSSS). Scores were assessed using the Percent Sentence Recognition Index (PSRI). The data were analysed inferentially using the Friedman and Wilcoxon tests with a 95% confidence interval and a 5% significance level (p < 0.05). RESULTS: The mean age was 41.1 years, the mean duration of CI use was 11.4 years, and the mean hearing threshold was 29.7 ± 5.9 dB HL. Test performance, as determined by the PSRI, was MLV = 42.4 ± 17.9%; RSTS = 20.3 ± 14.3%; RSSS = 40.6 ± 20.7%. A significant difference was identified for RSTS compared to MLV and RSSS. CONCLUSION: The mode and speed of stimulus presentation influence auditory speech recognition in CI users, with greater recognition, and hence better comprehension, when the tests are applied in the MLV and RSSS modalities.
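The Friedman and Wilcoxon tests used to compare the three presentation modes can be reproduced in outline with SciPy as follows; the PSRI scores below are made-up placeholders matched only in group size (fifteen listeners), not the study's measurements.

from scipy.stats import friedmanchisquare, wilcoxon

# hypothetical PSRI scores (%) for the same 15 CI users under the three conditions
mlv  = [42, 55, 38, 60, 45, 30, 52, 41, 47, 36, 58, 44, 39, 50, 43]
rsts = [20, 28, 15, 35, 22, 10, 25, 18, 24, 12, 30, 21, 17, 26, 19]
rsss = [40, 50, 35, 58, 43, 28, 49, 39, 45, 33, 55, 42, 36, 48, 41]

print(friedmanchisquare(mlv, rsts, rsss))          # omnibus test across the three modes
print(wilcoxon(rsts, mlv), wilcoxon(rsts, rsss))   # pairwise follow-ups involving RSTS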

9.
Diagnostics (Basel) ; 14(9)2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38732310

ABSTRACT

This study introduces a specialized Automatic Speech Recognition (ASR) system, leveraging the Whisper Large-v2 model, specifically adapted for radiological applications in the French language. The methodology focused on adapting the model to accurately transcribe medical terminology and diverse accents within the French language context, achieving a notable Word Error Rate (WER) of 17.121%. This research involved extensive data collection and preprocessing, utilizing a wide range of French medical audio content. The results demonstrate the system's effectiveness in transcribing complex radiological data, underscoring its potential to enhance medical documentation efficiency in French-speaking clinical settings. The discussion extends to the broader implications of this technology in healthcare, including its potential integration with electronic health records (EHRs) and its utility in medical education. This study also explores future research directions, such as tailoring ASR systems to specific medical specialties and languages. Overall, this research contributes significantly to the field of medical ASR systems, presenting a robust tool for radiological transcription in the French language and paving the way for advanced technology-enhanced healthcare solutions.
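A minimal sketch of the underlying workflow, transcribing French audio with an off-the-shelf Whisper Large-v2 checkpoint and scoring it against a reference with a word error rate, is shown below using the openai-whisper and jiwer packages; the file name and reference text are hypothetical, and the study's domain adaptation and fine-tuning are not reproduced here.

import whisper          # openai-whisper package
from jiwer import wer

# load the public Large-v2 checkpoint (the study fine-tuned it on French radiology data)
model = whisper.load_model("large-v2")
result = model.transcribe("rapport_radiologique.wav", language="fr")  # hypothetical file
hypothesis = result["text"]

reference = "scanner thoracique sans injection de produit de contraste"  # hypothetical ground truth
print("WER:", wer(reference, hypothesis))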

10.
Sensors (Basel) ; 24(10)2024 May 09.
Article in English | MEDLINE | ID: mdl-38793860

ABSTRACT

In environments where silent communication is essential, such as libraries and conference rooms, the need for a discreet means of interaction is paramount. Here, we present a single-electrode, contact-separated triboelectric nanogenerator (CS-TENG) characterized by robust high-frequency sensing capabilities and long-term stability. Integrating this TENG onto the inner surface of a mask allows for the capture of conversational speech signals through airflow vibrations, generating a comprehensive dataset. Employing advanced signal processing techniques, including short-time Fourier transform (STFT), Mel-frequency cepstral coefficients (MFCC), and deep learning neural networks, facilitates the accurate identification of speaker content and verification of their identity. The accuracy rates for each category of vocabulary and identity recognition exceed 92% and 90%, respectively. This system represents a pivotal advancement in facilitating secure and efficient unobtrusive communication in quiet settings, with promising implications for smart home applications, virtual assistant technology, and potential deployment in security and confidentiality-sensitive contexts.
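The signal processing front end mentioned here (STFT and MFCC features) can be sketched with librosa as follows; the file name and frame parameters are assumptions, not the paper's settings.

import librosa

# hypothetical recording of mask-captured airflow vibrations
y, sr = librosa.load("mask_signal.wav", sr=16000)

stft = librosa.stft(y, n_fft=512, hop_length=128)        # short-time Fourier transform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # 13 Mel-frequency cepstral coefficients

print(stft.shape, mfcc.shape)   # (257, frames) and (13, frames)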

11.
eNeuro ; 11(5)2024 May.
Article in English | MEDLINE | ID: mdl-38811162

ABSTRACT

This study compared the impact of spectral and temporal degradation on vocoded speech recognition between early-blind and sighted subjects. The participants included 25 early-blind subjects (30.32 ± 4.88 years; male:female, 14:11) and 25 age- and sex-matched sighted subjects. Tests included monosyllable recognition in noise at various signal-to-noise ratios (-18 to -4 dB), matrix sentence-in-noise recognition, and vocoded speech recognition with different numbers of channels (4, 8, 16, and 32) and temporal envelope cutoff frequencies (50 vs 500 Hz). Cortical evoked potentials (N2 and P3b) were measured in response to spectrally and temporally degraded stimuli. The early-blind subjects displayed better monosyllable and sentence recognition than the sighted subjects (all p < 0.01). In the vocoded speech recognition test, a three-way repeated-measures analysis of variance (two groups × four channels × two cutoff frequencies) revealed significant main effects of group, channel, and cutoff frequency (all p < 0.001). Early-blind subjects showed increased sensitivity to spectral degradation for speech recognition, evident in the significant interaction between group and channel (p = 0.007). N2 responses in early-blind subjects exhibited shorter latency and greater amplitude in the 8-channel condition (p = 0.022 and 0.034, respectively) and shorter latency in the 16-channel condition (p = 0.049) compared with sighted subjects. In conclusion, early-blind subjects demonstrated speech recognition advantages over sighted subjects, even in the presence of spectral and temporal degradation. Spectral degradation had a greater impact on speech recognition in early-blind subjects, while the effect of temporal degradation was similar in both groups.


Subject(s)
Blindness , Speech Perception , Humans , Male , Female , Speech Perception/physiology , Adult , Blindness/physiopathology , Young Adult , Electroencephalography/methods , Acoustic Stimulation , Recognition, Psychology/physiology , Evoked Potentials, Auditory/physiology
12.
PeerJ Comput Sci ; 10: e1981, 2024.
Article in English | MEDLINE | ID: mdl-38660198

ABSTRACT

Background: In today's world, numerous applications integral to daily life rely on automatic speech recognition. The development of a successful automatic speech recognition system can therefore significantly increase the convenience of people's daily routines. While many automatic speech recognition systems have been established for widely spoken languages like English, there has been insufficient progress for less common languages such as Turkish. Moreover, because of Turkish's agglutinative structure, designing a speech recognition system for it presents greater challenges than for many other languages. Our study therefore focused on proposing deep learning models for automatic speech recognition in Turkish, complemented by the integration of a language model. Methods: In our study, deep learning models were built from convolutional neural networks, gated recurrent units, long short-term memory layers, and transformer layers. The Zemberek library was employed to craft the language model and improve system performance. Furthermore, Bayesian optimization was applied to fine-tune the hyper-parameters of the deep learning models. To evaluate model performance, standard metrics widely used in automatic speech recognition systems, namely word error rate and character error rate, were employed. Results: The experimental results show that, with optimal hyper-parameters applied to the models developed with various layers, the scores are as follows: without a language model, the Turkish Microphone Speech Corpus dataset yields a word error rate of 22.2 and a character error rate of 14.05, while the Turkish Speech Corpus dataset yields a word error rate of 11.5 and a character error rate of 4.15. With the language model incorporated, notable improvements were observed: for the Turkish Microphone Speech Corpus dataset, the word error rate decreased to 9.85 and the character error rate to 5.35; similarly, the word error rate improved to 8.4 and the character error rate decreased to 2.7 for the Turkish Speech Corpus dataset. These results demonstrate that our model outperforms those reported in the existing literature.
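For reference, the two evaluation metrics used throughout these results can be computed from a Levenshtein edit distance, as in the minimal sketch below; the example sentence pair is invented for illustration.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def wer(ref_text, hyp_text):
    """Word error rate: word-level edits divided by reference word count."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(ref_text, hyp_text):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(ref_text), list(hyp_text)) / len(ref_text)

# hypothetical Turkish sentence pair
ref = "bugün hava çok güzel"
hyp = "bugün hava cok güzel"
print(wer(ref, hyp), cer(ref, hyp))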

13.
IEEE J Transl Eng Health Med ; 12: 382-389, 2024.
Article in English | MEDLINE | ID: mdl-38606392

ABSTRACT

Acoustic features extracted from speech can help with the diagnosis of neurological diseases and monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step needed prior to extracting acoustic features. Machine learning techniques could be used to automate speech segmentation via automatic speech recognition (ASR) and sequence to sequence alignment. While state-of-the-art ASR models achieve good performance on healthy speech, their performance significantly drops when evaluated on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance in dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to increase ASR performance on dysarthric speech: 1) healthy individuals varying their speaking rate and loudness (as is often used in assessments of pathological speech); 2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, a mean intersection-over-union of 89.2% was obtained against manual parsing (ground truth). It can be concluded that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, for speech segmentation.
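The intersection-over-union measure used to score word segmentation against manual parsing can be computed per word interval as in the brief sketch below; the boundary times are hypothetical.

def interval_iou(seg_a, seg_b):
    """Intersection-over-union of two time intervals (start, end) in seconds."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0

# hypothetical automatic vs manual word boundaries for three words
auto   = [(0.10, 0.55), (0.60, 1.05), (1.20, 1.80)]
manual = [(0.12, 0.57), (0.58, 1.00), (1.25, 1.90)]
ious = [interval_iou(a, m) for a, m in zip(auto, manual)]
print(sum(ious) / len(ious))   # mean IoU against the manual parsing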


Subject(s)
Speech Perception , Speech , Humans , Speech Recognition Software , Dysarthria/diagnosis , Speech Disorders
14.
J Clin Med ; 13(5)2024 Feb 28.
Article in English | MEDLINE | ID: mdl-38592239

ABSTRACT

Background: Hearing in noise is challenging for cochlear implant users and requires significant listening effort. This study investigated the influence of ForwardFocus and the number of maxima of the Advanced Combination Encoder (ACE) strategy, as well as age, on speech recognition threshold and listening effort in noise. Methods: A total of 33 cochlear implant recipients were included (age ≤ 40 years: n = 15; >40 years: n = 18). The Oldenburg Sentence Test was used to measure 50% speech recognition thresholds (SRT50) in fluctuating and stationary noise. Speech was presented frontally, while three frontal or rear noise sources were used, and the number of ACE maxima varied between 8 and 12. Results: ForwardFocus significantly improved the SRT50 when noise was presented from the back, independent of subject age. The use of 12 maxima further improved the SRT50 when ForwardFocus was activated and when noise and speech were presented frontally. Listening effort was significantly worse in the older age group compared to the younger age group and was reduced by ForwardFocus but not by increasing the number of ACE maxima. Conclusion: ForwardFocus can improve speech recognition in noisy environments and reduce listening effort, especially in older cochlear implant users.

15.
Sensors (Basel) ; 24(7)2024 Apr 03.
Article in English | MEDLINE | ID: mdl-38610492

ABSTRACT

In recent years, interest in realizing a distributed fiber-optic microphone for the detection and recognition of the human voice has increased; the most popular schemes are based on φ-OTDR. However, many issues related to the selection of optimal system parameters and the recognition of the registered signals remain unresolved. In this research, we conducted theoretical studies of these issues based on the φ-OTDR mathematical model and verified them experimentally. We designed an algorithm for fiber sensor signal processing, applied a testing kit, and developed a method for the quantitative evaluation of the obtained results. We also proposed a new setup model for lab tests of φ-OTDR single-coordinate sensors, which allows their parameters to be varied quickly. As a result, it was possible to define requirements for the best speech recognition quality; estimation using the percentage of recognized words yielded a value of 96.3%, and estimation with the Levenshtein distance yielded a value of 15.

16.
PeerJ Comput Sci ; 10: e1973, 2024.
Article in English | MEDLINE | ID: mdl-38660177

ABSTRACT

This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI's Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated using data from Taiwanese talk shows and political commentary programs, featuring 46 diverse speakers. The results showed a promising word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, highlighting the system's ability to adapt to various complex conversational dynamics, a significant advancement in the field of real-time multilingual speech processing.

17.
Sensors (Basel) ; 24(8)2024 Apr 17.
Article in English | MEDLINE | ID: mdl-38676191

ABSTRACT

This paper addresses a joint training approach applied to a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, where an acoustic tokenizer is included in the pipeline to transfer linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides a pseudo-label through K-means clustering. To transfer the linguistic information, represented by pseudo-labels, from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed; it is a self-supervised contrastive loss function and is combined with an information noise contrastive estimation (infoNCE) loss function. This combined loss function prevents the SE model from overfitting to outlier samples and represents the pronunciation variability among samples with the same pseudo-label. The effectiveness of the proposed CBPC loss function is evaluated on a noisy LibriSpeech dataset by measuring both speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach using the CBPC loss function achieves a lower WER than conventional joint training approaches. In addition, the speech quality scores of the SE model trained with the proposed approach are higher than those of the standalone SE model and of SE models trained with conventional joint training approaches. An ablation study is also conducted to investigate the effects of different combinations of loss functions on the speech quality scores and WER. It reveals that the proposed CBPC loss function combined with infoNCE contributes to a reduced WER and an increase in most of the speech quality scores.
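Of the two loss terms, infoNCE is the standard one; a minimal PyTorch sketch is given below. The cluster-based pairwise contrastive (CBPC) extension, which additionally groups samples by pseudo-label, is not reproduced here, and the embedding sizes are assumptions.

import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor should be most similar to its own positive
    among all positives in the batch. anchors, positives: (batch, dim)."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# toy example: 8 enhanced-speech embeddings paired with their clean counterparts
anchors = torch.randn(8, 128)
positives = torch.randn(8, 128)
print(info_nce(anchors, positives).item())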


Subject(s)
Noise , Speech Recognition Software , Humans , Cluster Analysis , Algorithms , Speech/physiology
18.
Front Neurosci ; 18: 1360300, 2024.
Article in English | MEDLINE | ID: mdl-38680445

ABSTRACT

Spiking neural networks (SNNs) distinguish themselves from artificial neural networks (ANNs) through their inherent temporal processing and spike-based computations, enabling power-efficient implementation in neuromorphic hardware. In this study, we demonstrate that data processing with spiking neurons can be enhanced by co-learning the synaptic weights with two other biologically inspired neuronal features: (1) a set of parameters describing neuronal adaptation processes and (2) synaptic propagation delays. The former allows a spiking neuron to learn how to specifically react to incoming spikes based on its past. The trained adaptation parameters result in neuronal heterogeneity, which leads to a greater variety of available spike patterns and is also found in the brain. The latter enables the network to learn to explicitly correlate spike trains that are temporally distant. Synaptic delays reflect the time an action potential requires to travel from one neuron to another. We show that each of the co-learned features separately leads to an improvement over the baseline SNN and that the combination of both leads to state-of-the-art SNN results on all speech recognition datasets investigated, using a simple feed-forward network with two hidden layers. Our SNN outperforms the benchmark ANN on the neuromorphic datasets (Spiking Heidelberg Digits and Spiking Speech Commands), even with fewer trainable parameters. On the 35-class Google Speech Commands dataset, our SNN also outperforms a GRU of similar size. Our study presents brain-inspired improvements to SNNs that enable them to excel over equivalent ANNs of similar size on tasks with rich temporal dynamics.
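As a toy illustration of the neuronal adaptation feature described here (not the authors' exact neuron model), the following NumPy sketch simulates a leaky integrate-and-fire neuron with a spike-triggered adaptation current whose build-up progressively lengthens inter-spike intervals; all constants are illustrative.

import numpy as np

def adaptive_lif(input_current, dt=1e-3, tau_m=20e-3, tau_w=100e-3,
                 v_thresh=1.0, v_reset=0.0, b=0.2):
    """Leaky integrate-and-fire neuron with a spike-triggered adaptation current w.
    Returns the membrane trace and spike times (illustrative only)."""
    v, w = 0.0, 0.0
    spikes, v_trace = [], []
    for t, i_in in enumerate(input_current):
        v += dt / tau_m * (-v + i_in - w)     # leaky integration minus adaptation
        w += dt / tau_w * (-w)                # adaptation decays over time
        if v >= v_thresh:
            spikes.append(t * dt)
            v = v_reset
            w += b                            # adaptation increments at each spike
        v_trace.append(v)
    return np.array(v_trace), spikes

# constant input: the adaptation term progressively lengthens inter-spike intervals
v_trace, spikes = adaptive_lif(np.full(1000, 2.0))
print(len(spikes), np.diff(spikes)[:3])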

19.
BMC Res Notes ; 17(1): 95, 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38553773

ABSTRACT

BACKGROUND: Verbatim transcription of qualitative audio data is a cornerstone of analytic quality and rigor, yet the time and energy required for such transcription can drain resources, delay analysis, and hinder the timely dissemination of qualitative insights. In recent years, software programs have presented a promising mechanism to accelerate transcription, but the broad application of such programs has been constrained due to expensive licensing or "per-minute" fees, data protection concerns, and limited availability of such programs in many languages. In this article, we outline our process of adapting a free, open-source, speech-to-text algorithm (Whisper by OpenAI) into a usable and accessible tool for qualitative transcription. Our program, which we have dubbed "Vink" for voice to ink, is available under a permissive open-source license (and thus free of cost). RESULTS: We conducted a proof-of-principle assessment of Vink's performance in transcribing authentic interview audio data in 14 languages. A majority of pilot-testers evaluated the software performance positively and indicated that they were likely to use the tool in their future research. Our usability assessment indicates that Vink is easy-to-use, and we performed further refinements based on pilot-tester feedback to increase user-friendliness. CONCLUSION: With Vink, we hope to contribute to facilitating rigorous qualitative research processes globally by reducing time and costs associated with transcription and by expanding free-of-cost transcription software availability to more languages. With Vink running on standalone computers, data privacy issues arising within many other solutions do not apply.


Subject(s)
Ink , User-Computer Interface , Speech , Software
20.
Assist Technol ; : 1-8, 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38537126

ABSTRACT

The Voiceitt app is designed for people with dysarthric speech, to support vocal communication and access to voice-driven technologies. Sixty-six participants were recruited to test the Voiceitt app and share feedback with developers. Most had physical, sensory, or cognitive impairments in addition to atypical speech. The project team liaised with individuals, their families and local support teams to provide access to the app and associated equipment. Testing was user-led, with participants asked to identify and test use cases most relevant to their daily lives over three months or more. Ongoing technical support and training were provided remotely and in-person throughout their testing. Semi-structured interviews were used to collect feedback on users' experiences, with delivery adapted to individuals' needs and preferences. Informal feedback was collected through ongoing contact between participants, their families and support teams and the project team. User feedback has led to improvements to the user interface and functionality, including faster voice training, simplified navigation, the introduction of game-style features and of switch access as an alternative to touchscreen access. This work offers a case-study in meaningful engagement with diverse disabled users of assistive technology in commercial software development.
