Search | VHL Regional Portal

1.

Author Correction: Song lyrics have become simpler and more repetitive over the last five decades.

Parada-Cabaleiro, Emilia; Mayerl, Maximilian; Brandl, Stefan; Skowron, Marcin; Schedl, Markus; Lex, Elisabeth; Zangerle, Eva.

Sci Rep ; 14(1): 11712, 2024 May 22.

Article in English | MEDLINE | ID: mdl-38778218

2.

Song lyrics have become simpler and more repetitive over the last five decades.

Parada-Cabaleiro, Emilia; Mayerl, Maximilian; Brandl, Stefan; Skowron, Marcin; Schedl, Markus; Lex, Elisabeth; Zangerle, Eva.

Sci Rep ; 14(1): 5531, 2024 03 28.

Article in English | MEDLINE | ID: mdl-38548740

ABSTRACT

Music is ubiquitous in our everyday lives, and lyrics play an integral role when we listen to music. The complex relationships between lyrical content, its temporal evolution over the last decades, and genre-specific variations, however, are yet to be fully understood. In this work, we investigate the dynamics of English lyrics of Western, popular music over five decades and five genres, using a wide set of lyrics descriptors, including lyrical complexity, structure, emotion, and popularity. We find that pop music lyrics have become simpler and easier to comprehend over time: not only does the lexical complexity of lyrics decrease (for instance, captured by vocabulary richness or readability of lyrics), but we also observe that the structural complexity (for instance, the repetitiveness of lyrics) has decreased. In addition, we confirm previous analyses showing that the emotion described by lyrics has become more negative and that lyrics have become more personal over the last five decades. Finally, a comparison of lyrics view counts and listening counts shows that when it comes to the listeners' interest in lyrics, for instance, rock fans mostly enjoy lyrics from older songs; country fans are more interested in new songs' lyrics.

Subject(s)

Music , Music/psychology , Emotions , Vocabulary

3.

Exploring emotions in Bach chorales: a multi-modal perceptual and data-driven study.

Parada-Cabaleiro, Emilia; Batliner, Anton; Zentner, Marcel; Schedl, Markus.

R Soc Open Sci ; 10(12): 230574, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38126059

ABSTRACT

The relationship between music and emotion has been addressed within several disciplines, from more historico-philosophical and anthropological ones, such as musicology and ethnomusicology, to others that are traditionally more empirical and technological, such as psychology and computer science. Yet, understanding the link between music and emotion is limited by the scarce interconnections between these disciplines. Trying to narrow this gap, this data-driven exploratory study aims at assessing the relationship between linguistic, symbolic and acoustic features-extracted from lyrics, music notation and audio recordings-and perception of emotion. Employing a listening experiment, statistical analysis and unsupervised machine learning, we investigate how a data-driven multi-modal approach can be used to explore the emotions conveyed by eight Bach chorales. Through a feature selection strategy based on a set of more than 300 Bach chorales and a transdisciplinary methodology integrating approaches from psychology, musicology and computer science, we aim to initiate an efficient dialogue between disciplines, able to promote a more integrative and holistic understanding of emotions in music.

4.

Identifying languages in a novel dataset: ASMR-whispered speech.

Song, Meishu; Yang, Zijiang; Parada-Cabaleiro, Emilia; Jing, Xin; Yamamoto, Yoshiharu; Schuller, Björn.

Front Neurosci ; 17: 1120311, 2023.

Article in English | MEDLINE | ID: mdl-37397449

ABSTRACT

Introduction: The Autonomous Sensory Meridian Response (ASMR) is a combination of sensory phenomena involving electrostatic-like tingling sensations, which emerge in response to certain stimuli. Despite the overwhelming popularity of ASMR in the social media, no open source databases on ASMR related stimuli are yet available, which makes this phenomenon mostly inaccessible to the research community; thus, almost completely unexplored. In this regard, we present the ASMR Whispered-Speech (ASMR-WS) database. Methods: ASWR-WS is a novel database on whispered speech, specifically tailored to promote the development of ASMR-like unvoiced Language Identification (unvoiced-LID) systems. The ASMR-WS database encompasses 38 videos-for a total duration of 10 h and 36 min-and includes seven target languages (Chinese, English, French, Italian, Japanese, Korean, and Spanish). Along with the database, we present baseline results for unvoiced-LID on the ASMR-WS database. Results: Our best results on the seven-class problem, based on segments of 2s length, and on a CNN classifier and MFCC acoustic features, achieved 85.74% of unweighted average recall and 90.83% of accuracy. Discussion: For future work, we would like to focus more deeply on the duration of speech samples, as we see varied results with the combinations applied herein. To enable further research in this area, the ASMR-WS database, as well as the partitioning considered in the presented baseline, is made accessible to the research community.

5.

Automated composition of Galician Xota-tuning RNN-based composers for specific musical styles using deep Q-learning.

Mira, Rodrigo; Coutinho, Eduardo; Parada-Cabaleiro, Emilia; Schuller, Björn W.

PeerJ Comput Sci ; 9: e1356, 2023.

Article in English | MEDLINE | ID: mdl-37346708

ABSTRACT

Music composition is a complex field that is difficult to automate because the computational definition of what is good or aesthetically pleasing is vague and subjective. Many neural network-based methods have been applied in the past, but they lack consistency and in most cases, their outputs fail to impress. The most common issues include excessive repetition and a lack of style and structure, which are hallmarks of artificial compositions. In this project, we build on a model created by Magenta-the RL Tuner-extending it to emulate a specific musical genre-the Galician Xota. To do this, we design a new rule-set containing rules that the composition should follow to adhere to this style. We then implement them using reward functions, which are used to train the Deep Q Network that will be used to generate the pieces. After extensive experimentation, we achieve an implementation of our rule-set that effectively enforces each rule on the generated compositions, and outline a solid research methodology for future researchers looking to use this architecture. Finally, we propose some promising future work regarding further applications for this model and improvements to the experimental procedure.

6.

Emotion-aware music tower blocks (EmoMTB ): an intelligent audiovisual interface for music discovery and recommendation.

Melchiorre, Alessandro B; Penz, David; Ganhör, Christian; Lesota, Oleg; Fragoso, Vasco; Fritzl, Florian; Parada-Cabaleiro, Emilia; Schubert, Franz; Schedl, Markus.

Int J Multimed Inf Retr ; 12(1): 13, 2023.

Article in English | MEDLINE | ID: mdl-37274943

ABSTRACT

Music listening has experienced a sharp increase during the last decade thanks to music streaming and recommendation services. While they offer text-based search functionality and provide recommendation lists of remarkable utility, their typical mode of interaction is unidimensional, i.e., they provide lists of consecutive tracks, which are commonly inspected in sequential order by the user. The user experience with such systems is heavily affected by cognition biases (e.g., position bias, human tendency to pay more attention to first positions of ordered lists) as well as algorithmic biases (e.g., popularity bias, the tendency of recommender systems to overrepresent popular items). This may cause dissatisfaction among the users by disabling them to find novel music to enjoy. In light of such systems and biases, we propose an intelligent audiovisual music exploration system named EmoMTB . It allows the user to browse the entirety of a given collection in a free nonlinear fashion. The navigation is assisted by a set of personalized emotion-aware recommendations, which serve as starting points for the exploration experience. EmoMTB adopts the metaphor of a city, in which each track (visualized as a colored cube) represents one floor of a building. Highly similar tracks are located in the same building; moderately similar ones form neighborhoods that mostly correspond to genres. Tracks situated between distinct neighborhoods create a gradual transition between genres. Users can navigate this music city using their smartphones as control devices. They can explore districts of well-known music or decide to leave their comfort zone. In addition, EmoMTB integrates an emotion-aware music recommendation system that re-ranks the list of suggested starting points for exploration according to the user's self-identified emotion or the collective emotion expressed in EmoMTB 's Twitter channel. Evaluation of EmoMTB has been carried out in a threefold way: by quantifying the homogeneity of the clustering underlying the construction of the city, by measuring the accuracy of the emotion predictor, and by carrying out a web-based survey composed of open questions to obtain qualitative feedback from users.

7.

Perception and classification of emotions in nonsense speech: Humans versus machines.

Parada-Cabaleiro, Emilia; Batliner, Anton; Schmitt, Maximilian; Schedl, Markus; Costantini, Giovanni; Schuller, Björn.

PLoS One ; 18(1): e0281079, 2023.

Article in English | MEDLINE | ID: mdl-36716307

ABSTRACT

This article contributes to a more adequate modelling of emotions encoded in speech, by addressing four fallacies prevalent in traditional affective computing: First, studies concentrate on few emotions and disregard all other ones ('closed world'). Second, studies use clean (lab) data or real-life ones but do not compare clean and noisy data in a comparable setting ('clean world'). Third, machine learning approaches need large amounts of data; however, their performance has not yet been assessed by systematically comparing different approaches and different sizes of databases ('small world'). Fourth, although human annotations of emotion constitute the basis for automatic classification, human perception and machine classification have not yet been compared on a strict basis ('one world'). Finally, we deal with the intrinsic ambiguities of emotions by interpreting the confusions between categories ('fuzzy world'). We use acted nonsense speech from the GEMEP corpus, emotional 'distractors' as categories not entailed in the test set, real-life noises that mask the clear recordings, and different sizes of the training set for machine learning. We show that machine learning based on state-of-the-art feature representations (wav2vec2) is able to mirror the main emotional categories ('pillars') present in perceptual emotional constellations even in degradated acoustic conditions.

Subject(s)

Speech Perception , Speech , Humans , Emotions , Machine Learning , Acoustics , Perception

8.

Audio self-supervised learning: A survey.

Liu, Shuo; Mallol-Ragolta, Adria; Parada-Cabaleiro, Emilia; Qian, Kun; Jing, Xin; Kathan, Alexander; Hu, Bin; Schuller, Björn W.

Patterns (N Y) ; 3(12): 100616, 2022 Dec 09.

Article in English | MEDLINE | ID: mdl-36569546

ABSTRACT

Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.

9.

Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection.

Liu, Shuo; Mallol-Ragolta, Adria; Yan, Tianhao; Qian, Kun; Parada-Cabaleiro, Emilia; Hu, Bin; Schuller, Bjorn W.

IEEE J Biomed Health Inform ; 26(8): 4291-4302, 2022 08.

Article in English | MEDLINE | ID: mdl-35522639

ABSTRACT

The importance of detecting whether a person wears a face mask while speaking has tremendously increased since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting human speech characteristics related to frequency, face masks cause temporal interferences in speech, altering the pace, rhythm, and pronunciation speed. In this regard, this paper presents two effective neural network models to detect surgical masks from audio. The proposed architectures are both based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of the audio signals. One architecture applies a Long Short-Term Memory (LSTM) network to model the time-dependencies. Through an additional attention mechanism, the LSTM-based architecture enables the extraction of more salient temporal information. The other architecture (named ConvTx) retrieves the relative position of a sequence through the positional encoder of a transformer module. In order to assess to which extent both architectures can complement each other when modelling temporal dynamics, we also explore the combination of LSTM and Transformers in three hybrid models. Finally, we also investigate whether data augmentation techniques, such as, using transitions between audio frames and considering gender-dependent frameworks might impact the performance of the proposed architectures. Our experimental results show that one of the hybrid models achieves the best performance, surpassing existing state-of-the-art results for the task at hand.

Subject(s)

COVID-19 , Masks , Humans , Neural Networks, Computer , SARS-CoV-2 , Speech

10.

The Emotion Probe: On the Universality of Cross-Linguistic and Cross-Gender Speech Emotion Recognition via Machine Learning.

Costantini, Giovanni; Parada-Cabaleiro, Emilia; Casali, Daniele; Cesarini, Valerio.

Sensors (Basel) ; 22(7)2022 Mar 23.

Article in English | MEDLINE | ID: mdl-35408076

ABSTRACT

Machine Learning (ML) algorithms within a human-computer framework are the leading force in speech emotion recognition (SER). However, few studies explore cross-corpora aspects of SER; this work aims to explore the feasibility and characteristics of a cross-linguistic, cross-gender SER. Three ML classifiers (SVM, Naïve Bayes and MLP) are applied to acoustic features, obtained through a procedure based on Kononenko's discretization and correlation-based feature selection. The system encompasses five emotions (disgust, fear, happiness, anger and sadness), using the Emofilm database, comprised of short clips of English movies and the respective Italian and Spanish dubbed versions, for a total of 1115 annotated utterances. The results see MLP as the most effective classifier, with accuracies higher than 90% for single-language approaches, while the cross-language classifier still yields accuracies higher than 80%. The results show cross-gender tasks to be more difficult than those involving two languages, suggesting greater differences between emotions expressed by male versus female subjects than between different languages. Four feature domains, namely, RASTA, F0, MFCC and spectral energy, are algorithmically assessed as the most effective, refining existing literature and approaches based on standard sets. To our knowledge, this is one of the first studies encompassing cross-gender and cross-linguistic assessments on SER.

Subject(s)

Machine Learning , Speech , Bayes Theorem , Emotions , Female , Humans , Linguistics , Male

11.

An Exploratory Study on the Acoustic Musical Properties to Decrease Self-Perceived Anxiety.

Parada-Cabaleiro, Emilia; Batliner, Anton; Schedl, Markus.

Int J Environ Res Public Health ; 19(2)2022 01 16.

Article in English | MEDLINE | ID: mdl-35055816

ABSTRACT

Musical listening is broadly used as an inexpensive and safe method to reduce self-perceived anxiety. This strategy is based on the emotivist assumption claiming that emotions are not only recognised in music but induced by it. Yet, the acoustic properties of musical work capable of reducing anxiety are still under-researched. To fill this gap, we explore whether the acoustic parameters relevant in music emotion recognition are also suitable to identify music with relaxing properties. As an anxiety indicator, the positive statements from the six-item Spielberger State-Trait Anxiety Inventory, a self-reported score from 3 to 12, are taken. A user-study with 50 participants assessing the relaxing potential of four musical pieces was conducted; subsequently, the acoustic parameters were evaluated. Our study shows that when using classical Western music to reduce self-perceived anxiety, tonal music should be considered. In addition, it also indicates that harmonicity is a suitable indicator of relaxing music, while the role of scoring and dynamics in reducing non-pathological listener distress should be further investigated.

Subject(s)

Music , Acoustic Stimulation , Acoustics , Anxiety/prevention & control , Auditory Perception , Emotions , Humans , Music/psychology

12.

The perception of emotional cues by children in artificial background noise.

Parada-Cabaleiro, Emilia; Batliner, Anton; Baird, Alice; Schuller, Björn.

Int J Speech Technol ; 23(1): 169-182, 2020.

Article in English | MEDLINE | ID: mdl-34867074

ABSTRACT

Most typically developed individuals have the ability to perceive emotions encoded in speech; yet, factors such as age or environmental conditions can restrict this inherent skill. Noise pollution and multimedia over-stimulation are common components of contemporary society, and have shown to particularly impair a child's interpersonal skills. Assessing the influence of such features on the perception of emotion over different developmental stages will advance child-related research. The presented work evaluates how background noise and emotionally connoted visual stimuli affect a child's perception of emotional speech. A total of 109 subjects from Spain and Germany (4-14 years) evaluated 20 multi-modal instances of nonsense emotional speech, under several environmental and visual conditions. A control group of 17 Spanish adults performed the same perception test. Results suggest that visual stimulation, gender, and the two sub-cultures with different language background do not influence a child's perception; yet, background noise does compromise their ability to correctly identify emotion in speech-a phenomenon that seems to decrease with age.

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL