Impact of Audio Data Compression on Feature Extraction for Vocal Biomarker Detection: Validation Study.

Oreskovic, Jessica; Kaufman, Jaycee; Fossat, Yan.

JMIR Biomed Eng ; 9: e56246, 2024 Apr 15.

Article in English | MEDLINE | ID: mdl-38875677

ABSTRACT

BACKGROUND: Vocal biomarkers, derived from acoustic analysis of vocal characteristics, offer noninvasive avenues for medical screening, diagnostics, and monitoring. Previous research demonstrated the feasibility of predicting type 2 diabetes mellitus through acoustic analysis of smartphone-recorded speech. Building upon this work, this study explores the impact of audio data compression on acoustic vocal biomarker development, which is critical for broader applicability in health care.

OBJECTIVE: The objective of this research is to analyze how common audio compression algorithms (MP3, M4A, and WMA) applied by 3 different conversion tools at 2 bitrates affect features crucial for vocal biomarker detection.

METHODS: The impact of audio data compression on acoustic vocal biomarker development was investigated using uncompressed voice samples converted into MP3, M4A, and WMA formats at 2 bitrates (320 and 128 kbps) with MediaHuman (MH) Audio Converter, WonderShare (WS) UniConverter, and Fast Forward Moving Picture Experts Group (FFmpeg). The data set comprised recordings from 505 participants, totaling 17,298 audio files, collected using a smartphone. Participants recorded a fixed English sentence up to 6 times daily for up to 14 days. Feature extraction, including pitch, jitter, intensity, and Mel-frequency cepstral coefficients (MFCCs), was conducted using Python and Parselmouth. The Wilcoxon signed rank test and the Bonferroni correction for multiple comparisons were used for statistical analysis.

RESULTS: In this study, 36,970 audio files were initially recorded from 505 participants, with 17,298 recordings meeting the fixed sentence criteria after screening. Differences between the audio conversion software, MH, WS, and FFmpeg, were notable, impacting compression outcomes such as constant or variable bitrates. Analysis encompassed diverse data compression formats and a wide array of voice features and MFCCs. Wilcoxon signed rank tests yielded P values, with those below the Bonferroni-corrected significance level indicating significant alterations due to compression. The results indicated feature-specific impacts of compression across formats and bitrates. MH-converted files exhibited greater resilience compared to WS-converted files. Bitrate also influenced feature stability, with 38 cases affected uniquely by a single bitrate. Notably, voice features showed greater stability than MFCCs across conversion methods.

CONCLUSIONS: Compression effects were found to be feature specific, with MH and FFmpeg showing greater resilience. Some features were consistently affected, emphasizing the importance of understanding feature resilience for diagnostic applications. Considering the implementation of vocal biomarkers in health care, finding features that remain consistent through compression for data storage or transmission purposes is valuable. Focused on specific features and formats, future research could broaden the scope to include diverse features, real-time compression algorithms, and various recording methods. This study enhances our understanding of audio compression's influence on voice features and MFCCs, providing insights for developing applications across fields. The research underscores the significance of feature stability in working with compressed audio data, laying a foundation for informed voice data use in evolving technological landscapes.
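
As a rough illustration of the statistical step named in this abstract, the sketch below pairs a Wilcoxon signed rank test with a Bonferroni-corrected threshold. It is not the authors' code: the function names, the use of the normal approximation for the P value, and the default alpha of 0.05 are all assumptions made here, and the abstract does not say which implementation or settings were used.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(before, after):
    """Two-sided Wilcoxon signed rank test (normal approximation, an assumption).

    `before` and `after` are paired measurements of one feature, e.g. the
    value from the uncompressed file and from its compressed copy.
    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences share their average rank.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, p_value


def bonferroni_flags(p_values, alpha=0.05):
    """Flag each feature whose P value falls below alpha / number of tests."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}
```

In this reading, one test would be run per feature, format, bitrate, and converter, and a feature would count as altered by compression only when its flag is true after correction.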

Investigation of Deepfake Voice Detection Using Speech Pause Patterns: Algorithm Development and Validation.

Kulangareth, Nikhil Valsan; Kaufman, Jaycee; Oreskovic, Jessica; Fossat, Yan.

JMIR Biomed Eng ; 9: e56245, 2024 Mar 21.

Article in English | MEDLINE | ID: mdl-38875685

ABSTRACT

BACKGROUND: The digital era has witnessed an escalating dependence on digital platforms for news and information, coupled with the advent of "deepfake" technology. Deepfakes, leveraging deep learning models on extensive data sets of voice recordings and images, pose substantial threats to media authenticity, potentially leading to unethical misuse such as impersonation and the dissemination of false information.

OBJECTIVE: To counteract this challenge, this study aims to introduce the concept of innate biological processes to discern between authentic human voices and cloned voices. We propose that the presence or absence of certain perceptual features, such as pauses in speech, can effectively distinguish between cloned and authentic audio.

METHODS: A total of 49 adult participants representing diverse ethnic backgrounds and accents were recruited. Each participant contributed voice samples for the training of up to 3 distinct voice cloning text-to-speech models and 3 control paragraphs. Subsequently, the cloning models generated synthetic versions of the control paragraphs, resulting in a data set consisting of up to 9 cloned audio samples and 3 control samples per participant. We analyzed the speech pauses caused by biological actions such as respiration, swallowing, and cognitive processes. Five audio features corresponding to speech pause profiles were calculated. Differences between authentic and cloned audio for these features were assessed, and 5 classical machine learning algorithms were implemented using these features to create a prediction model. The generalization capability of the optimal model was evaluated through testing on unseen data, incorporating a model-naive generator, a model-naive paragraph, and model-naive participants.

RESULTS: Cloned audio exhibited significantly increased time between pauses (P<.001), decreased variation in speech segment length (P=.003), increased overall proportion of time speaking (P=.04), and decreased rates of micro- and macropauses in speech (both P=.01). Five machine learning models were implemented using these features, with the AdaBoost model demonstrating the highest performance, achieving a 5-fold cross-validation balanced accuracy of 0.81 (SD 0.05). Other models included support vector machine (balanced accuracy 0.79, SD 0.03), random forest (balanced accuracy 0.78, SD 0.04), logistic regression, and decision tree (balanced accuracies 0.76, SD 0.10 and 0.72, SD 0.06). When evaluating the optimal AdaBoost model, it achieved an overall test accuracy of 0.79 when predicting unseen data.

CONCLUSIONS: The incorporation of perceptual, biological features into machine learning models demonstrates promising results in distinguishing between authentic human voices and cloned audio.
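
To make the five pause-profile features more concrete, here is a minimal sketch that derives them from a list of speech segments. The abstract does not give the exact feature definitions, so everything here is an assumption: the function name `pause_profile`, the input format, the reading of "time between pauses" as mean speech segment length, and especially the 0.5-second cut-off separating micropauses from macropauses, which is a hypothetical placeholder.

```python
from statistics import mean, pstdev

# Hypothetical cut-off between micro- and macropauses; not taken from the study.
MICRO_MAX_S = 0.5


def pause_profile(segments, micro_max_s=MICRO_MAX_S):
    """Compute five pause-related features from speech segments.

    `segments` is a time-ordered list of non-overlapping (start, end)
    speech intervals in seconds; the gaps between them are the pauses.
    """
    speech = [end - start for start, end in segments]
    pauses = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]
    total = segments[-1][1] - segments[0][0]  # duration of the whole sample

    micro = sum(1 for p in pauses if p < micro_max_s)
    macro = len(pauses) - micro
    return {
        # Mean stretch of continuous speech between two pauses.
        "mean_time_between_pauses_s": mean(speech),
        # Population SD of speech segment lengths.
        "speech_segment_sd_s": pstdev(speech),
        # Share of the sample spent speaking.
        "proportion_speaking": sum(speech) / total,
        # Pause counts normalized by sample duration.
        "micropause_rate_per_s": micro / total,
        "macropause_rate_per_s": macro / total,
    }


# Made-up segment boundaries, for illustration only.
example = pause_profile([(0.0, 2.0), (2.25, 4.25), (5.25, 7.25)])
```

A dictionary like `example` could then serve as one row of the feature table passed to a classifier; the segment boundaries themselves would have to come from a voice activity detector, which is outside this sketch.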

A Radar-Based Opioid Overdose Detection Device for Public Restrooms: Design, Development, and Evaluation Study.

Oreskovic, Jessica; Kaufman, Jaycee; Thommandram, Anirudh; Fossat, Yan.

JMIR Biomed Eng ; 8: e51754, 2023 Oct 24.

Article in English | MEDLINE | ID: mdl-38875668

ABSTRACT

BACKGROUND: The opioid epidemic is a growing crisis worldwide. While many interventions have been put in place to try to protect people from opioid overdoses, they typically rely on the person to take initiative in protecting themselves, requiring forethought, preparation, and action. Respiratory depression or arrest is the mechanism by which opioid overdoses become fatal, but it can be reversed with the timely administration of naloxone.

OBJECTIVE: In this study, we described the development and validation of an opioid overdose detection radar (ODR), specifically designed for use in public restroom stalls. In-laboratory testing was conducted to validate the noncontact, privacy-preserving device against a respiration belt and to determine the accuracy and reliability of the device.

METHODS: We used an ODR system with a high-frequency pulsed coherent radar sensor and a Raspberry Pi (Raspberry Pi Ltd), combining advanced technology with a compact and cost-effective setup to monitor respiration and detect opioid overdoses. To determine the optimal position for the ODR within the confined space of a restroom stall, iterative testing was conducted, considering the radar's bounded capture area and the limitations imposed by the stall's dimensions and layout. By adjusting the orientation of the ODR, we were able to identify the most effective placement where the device reliably tracked respiration in a number of expected positions. Experiments used a mock restroom stall setup that adhered to building code regulations, creating a controlled environment while maintaining the authenticity of a public restroom stall. By simulating different body positions commonly associated with opioid overdoses, the ODR's ability to accurately track respiration in various scenarios was assessed. To determine the accuracy of the ODR, testing was performed using a respiration belt as a reference. The radar measurements were compared with those obtained from the belt in experiments where participants were seated upright and slumped over.

RESULTS: The results demonstrated favorable agreement between the radar and belt measurements, with an overall mean error in respiration cycle duration of 0.0072 (SD 0.54) seconds for all recorded respiration cycles (N=204). During the simulated overdose experiments where participants were slumped over, the ODR successfully tracked respiration with a mean period difference of 0.0091 (SD 0.62) seconds compared with the reference data.

CONCLUSIONS: The findings suggest that the ODR has the potential to detect significant deviations in respiration patterns that may indicate an opioid overdose event. The success of the ODR in these experiments indicates the device should be further developed and implemented to enhance safety and emergency response measures in public restrooms. However, additional validation is required for unhealthy opioid-influenced respiratory patterns to guarantee the ODR's effectiveness in real-world overdose situations.
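
The agreement figures reported above (a mean error and an SD over paired respiration cycles) can be sketched as follows. This is only one plausible reading: the helper names are hypothetical, the cycles are assumed to be already matched one-to-one between radar and belt, and the choice of the sample SD is an assumption, since the abstract does not state how the statistic was computed.

```python
from statistics import mean, stdev


def cycle_durations(breath_times):
    """Turn a list of breath onset times (seconds) into cycle durations."""
    return [b - a for a, b in zip(breath_times, breath_times[1:])]


def agreement(radar_cycles, belt_cycles):
    """Mean and sample SD of the per-cycle duration error (radar minus belt).

    Both lists must describe the same respiration cycles in the same order.
    """
    if len(radar_cycles) != len(belt_cycles):
        raise ValueError("radar and belt must contain the same number of cycles")
    errors = [r - b for r, b in zip(radar_cycles, belt_cycles)]
    return mean(errors), stdev(errors)


# Made-up cycle durations in seconds, for illustration only.
mean_error, sd_error = agreement([4.0, 4.5, 3.5, 4.0], [4.0, 4.0, 4.0, 4.0])
```

A mean error near zero with a nonzero SD, as in the reported results, indicates that the radar does not systematically over- or underestimate cycle length even though individual cycles differ from the reference.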
