Search | VHL Regional Portal

1.

Plain-to-clear speech video conversion for enhanced intelligibility.

Sachdeva, Shubam; Ruan, Haoyao; Hamarneh, Ghassan; Behne, Dawn M; Jongman, Allard; Sereno, Joan A; Wang, Yue.

Int J Speech Technol ; 26(1): 163-184, 2023.

Article in English | MEDLINE | ID: mdl-37008883

ABSTRACT

Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine if visible speech cues in video only can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels produced by multiple male and female talkers. Via a frame-by-frame image-warping based video generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to videos of plain speech to synthesize clear speech videos. We evaluate the generated videos using a robust, state of the art AI Lip Reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract relevant visual cues for video modifications across speech styles, and have achieved enhanced intelligibility for AI; (2) this work suggests that universal talker-independent clear-speech features may be utilized to modify any talker's visual speech style; (3) we introduce "displacement factor" as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the high definition generated videos make them ideal candidates for human-centric intelligibility and perceptual training studies.

2.

Exploring the comparative adequacy of a unimanual and a bimanual stimulus-response setup for use with three-alternative choice response time tasks.

Öttl, Anton; Kim, Jonathan D; Behne, Dawn M; Gygax, Pascal; Hyönä, Jukka; Gabriel, Ute.

PLoS One ; 18(3): e0281377, 2023.

Article in English | MEDLINE | ID: mdl-36920982

ABSTRACT

Research often conceptualises complex social factors as being distinct binary categories (e.g., female vs male, feminine vs masculine). While this can be appropriate, the addition of an 'overlapping' category (e.g., non-binary, gender neutral) can contextualise the 'binary', both for participants (allowing more complex conceptualisations of the categories than the 'either/or' conceptualisation in binary tasks), and for the results (by providing a neutral baseline for comparison). However, it is not clear what the best response setup for such a task would be. In this study, we explore this topic through comparing a unimanual (N = 34) and a bimanual response setup (N = 32) for use with a three-alternative choice response time task. Crucially, one of the stimulus categories ('mixed') was composed of stimulus elements from the other two stimulus categories used in that task (Complex Task). A reference button task was included to isolate the motoric component of response registration (Simple Task). The results of the simple task indicated lower motoric costs for the unimanual compared to the bimanual setup. However, when statistically controlling for these motoric costs in the complex task, the bimanual setup had a lower error rate and faster response times than the unimanual setup. Further, in the complex task error rates and response times were higher for the mixed than the matched stimuli, indicating that responding to mixed stimuli is more challenging for encoding and/or decision making processes. This difference was more pronounced in the unimanual than the bimanual setup. Taken together these results indicate that the unimanual setup is more adequate for the reference button task, whereas the intricacy of overlapping categories in the complex task is better contained in the bimanual setup, i.e. when some response alternatives are allocated to one hand and other alternatives to the other hand.

Subject(s)

Biological Phenomena , Hand , Humans , Male , Female , Reaction Time , Hand/physiology , Upper Extremity , Concept Formation , Functional Laterality/physiology , Psychomotor Performance/physiology

3.

Effects of Spatial Speech Presentation on Listener Response Strategy for Talker-Identification.

Uhrig, Stefan; Perkis, Andrew; Möller, Sebastian; Svensson, U Peter; Behne, Dawn M.

Front Neurosci ; 15: 730744, 2021.

Article in English | MEDLINE | ID: mdl-35153653

ABSTRACT

This study investigates effects of spatial auditory cues on human listeners' response strategy for identifying two alternately active talkers ("turn-taking" listening scenario). Previous research has demonstrated subjective benefits of audio spatialization with regard to speech intelligibility and talker-identification effort. So far, the deliberate activation of specific perceptual and cognitive processes by listeners to optimize their task performance remained largely unexamined. Spoken sentences selected as stimuli were either clean or degraded due to background noise or bandpass filtering. Stimuli were presented via three horizontally positioned loudspeakers: In a non-spatial mode, both talkers were presented through a central loudspeaker; in a spatial mode, each talker was presented through the central or a talker-specific lateral loudspeaker. Participants identified talkers via speeded keypresses and afterwards provided subjective ratings (speech quality, speech intelligibility, voice similarity, talker-identification effort). In the spatial mode, presentations at lateral loudspeaker locations entailed quicker behavioral responses, which were significantly slower in comparison to a talker-localization task. Under clean speech, response times globally increased in the spatial vs. non-spatial mode (across all locations); these "response time switch costs," presumably being caused by repeated switching of spatial auditory attention between different locations, diminished under degraded speech. No significant effects of spatialization on subjective ratings were found. The results suggested that when listeners could utilize task-relevant auditory cues about talker location, they continued to rely on voice recognition instead of localization of talker sound sources as primary response strategy. Besides, the presence of speech degradations may have led to increased cognitive control, which in turn compensated for incurring response time switch costs.

4.

Effects of speech transmission quality on sensory processing indicated by the cortical auditory evoked potential.

Uhrig, Stefan; Perkis, Andrew; Behne, Dawn M.

J Neural Eng ; 17(4): 046021, 2020 08 20.

Article in English | MEDLINE | ID: mdl-32422617

ABSTRACT

OBJECTIVE: Degradations of transmitted speech have been shown to affect perceptual and cognitive processing in human listeners, as indicated by the P3 component of the event-related brain potential (ERP). However, research suggests that previously observed P3 modulations might actually be traced back to earlier neural modulations in the time range of the P1-N1-P2 complex of the cortical auditory evoked potential (CAEP). This study investigates whether auditory sensory processing, as reflected by the P1-N1-P2 complex, is already systematically altered by speech quality degradations. APPROACH: Electrophysiological data from two studies were analyzed to examine effects of speech transmission quality (high-quality, noisy, bandpass-filtered) for spoken words on amplitude and latency parameters of individual P1, N1 and P2 components. MAIN RESULTS: In the resultant ERP waveforms, an initial P1-N1-P2 manifested at stimulus onset, while a second N1-P2 occurred within the ongoing stimulus. Bandpass-filtered versus high-quality word stimuli evoked a faster and larger initial N1 as well as a reduced initial P2, hence exhibiting effects as early as the sensory stage of auditory information processing. SIGNIFICANCE: The results corroborate the existence of systematic quality-related modulations in the initial N1-P2, which may potentially have carried over into P3 modulations demonstrated by previous studies. In future psychophysiological speech quality assessments, rigorous control procedures are needed to ensure the validity of P3-based indication of speech transmission quality. An alternative CAEP-based assessment approach is discussed, which promises to be more efficient and less constrained than the established approach based on P3.

Subject(s)

Speech Perception , Speech , Acoustic Stimulation , Cognition , Electroencephalography , Evoked Potentials, Auditory , Humans

5.

Considerations in Audio-Visual Interaction Models: An ERP Study of Music Perception by Musicians and Non-musicians.

Sorati, Marzieh; Behne, Dawn M.

Front Psychol ; 11: 594434, 2020.

Article in English | MEDLINE | ID: mdl-33551911

ABSTRACT

Previous research with speech and non-speech stimuli suggested that in audiovisual perception, visual information starting prior to the onset of corresponding sound can provide visual cues, and form a prediction about the upcoming auditory sound. This prediction leads to audiovisual (AV) interaction. Auditory and visual perception interact and induce suppression and speeding up of the early auditory event-related potentials (ERPs) such as N1 and P2. To investigate AV interaction, previous research examined N1 and P2 amplitudes and latencies in response to audio only (AO), video only (VO), audiovisual, and control (CO) stimuli, and compared AV with auditory perception based on four AV interaction models (AV vs. AO+VO, AV-VO vs. AO, AV-VO vs. AO-CO, AV vs. AO). The current study addresses how different models of AV interaction express N1 and P2 suppression in music perception. Furthermore, the current study took one step further and examined whether previous musical experience, which can potentially lead to higher N1 and P2 amplitudes in auditory perception, influenced AV interaction in different models. Musicians and non-musicians were presented the recordings (AO, AV, VO) of a keyboard /C4/ key being played, as well as CO stimuli. Results showed that AV interaction models differ in their expression of N1 and P2 amplitude and latency suppression. The calculation of model (AV-VO vs. AO) and (AV-VO vs. AO-CO) has consequences for the resulting N1 and P2 difference waves. Furthermore, while musicians, compared to non-musicians, showed higher N1 amplitude in auditory perception, suppression of amplitudes and latencies for N1 and P2 was similar for the two groups across the AV models. Collectively, these results suggest that when visual cues from finger and hand movements predict the upcoming sound in AV music perception, suppression of early ERPs is similar for musicians and non-musicians. Notably, the calculation differences across models do not lead to the same pattern of results for N1 and P2, demonstrating that the four models are not interchangeable and are not directly comparable.

6.

Perception of audiovisual infant directed speech.

Englund, Nunne; Behne, Dawn M.

Scand J Psychol ; 61(2): 218-226, 2020 Apr.

Article in English | MEDLINE | ID: mdl-31820436

ABSTRACT

Infant perception often deals with audiovisual speech input and a first step in processing this input is to perceive both visual and auditory information. The speech directed to infants has special characteristics and may enhance visual aspects of speech. The current study was designed to explore the impact of visual enhancement in infant-directed speech (IDS) on audiovisual mismatch detection in a naturalistic setting. Twenty infants participated in an experiment with a visual fixation task conducted in participants' homes. Stimuli consisted of IDS and adult-directed speech (ADS) syllables with a plosive and the vowel /a:/, /i:/ or /u:/. These were either audiovisually congruent or incongruent. Infants looked longer at incongruent than congruent syllables and longer at IDS than ADS syllables, indicating that IDS and incongruent stimuli contain cues that can make audiovisual perception challenging and thereby attract infants' gaze.

Subject(s)

Language , Speech Perception/physiology , Speech , Visual Perception/physiology , Acoustic Stimulation , Cues , Female , Humans , Infant , Male

7.

Assessing the Formation of Experience-Based Gender Expectations in an Implicit Learning Scenario.

Öttl, Anton; Behne, Dawn M.

Front Psychol ; 8: 1485, 2017.

Article in English | MEDLINE | ID: mdl-28936186

ABSTRACT

The present study investigates the formation of new word-referent associations in an implicit learning scenario, using a gender-coded artificial language with spoken words and visual referents. Previous research has shown that when participants are explicitly instructed about the gender-coding system underlying an artificial lexicon, they monitor the frequency of exposure to male vs. female referents within this lexicon, and subsequently use this probabilistic information to predict the gender of an upcoming referent. In an explicit learning scenario, the auditory and visual gender cues are necessarily highlighted prior to acqusition, and the effects previously observed may therefore depend on participants' overt awareness of these cues. To assess whether the formation of experience-based expectations is dependent on explicit awareness of the underlying coding system, we present data from an experiment in which gender-coding was acquired implicitly, thereby reducing the likelihood that visual and auditory gender cues are used strategically during acquisition. Results show that even if the gender coding system was not perfectly mastered (as reflected in the number of gender coding errors), participants develop frequency based expectations comparable to those previously observed in an explicit learning scenario. In line with previous findings, participants are quicker at recognizing a referent whose gender is consistent with an induced expectation than one whose gender is inconsistent with an induced expectation. At the same time however, eyetracking data suggest that these expectations may surface earlier in an implicit learning scenario. These findings suggest that experience-based expectations are robust against manner of acquisition, and contribute to understanding why similar expectations observed in the activation of stereotypes during the processing of natural language stimuli are difficult or impossible to suppress.

8.

Experience-Based Probabilities Modulate Expectations in a Gender-Coded Artificial Language.

Öttl, Anton; Behne, Dawn M.

Front Psychol ; 7: 1250, 2016.

Article in English | MEDLINE | ID: mdl-27602009

ABSTRACT

The current study combines artificial language learning with visual world eyetracking to investigate acquisition of representations associating spoken words and visual referents using morphologically complex pseudowords. Pseudowords were constructed to consistently encode referential gender by means of suffixation for a set of imaginary figures that could be either male or female. During training, the frequency of exposure to pseudowords and their imaginary figure referents were manipulated such that a given word and its referent would be more likely to occur in either the masculine form or the feminine form, or both forms would be equally likely. Results show that these experience-based probabilities affect the formation of new representations to the extent that participants were faster at recognizing a referent whose gender was consistent with the induced expectation than a referent whose gender was inconsistent with this expectation. Disambiguating gender information available from the suffix did not mask the induced expectations. Eyetracking data provide additional evidence that such expectations surface during online lexical processing. Taken together, these findings indicate that experience-based information is accessible during the earliest stages of processing, and are consistent with the view that language comprehension depends on the activation of perceptual memory traces.

9.

Perceived synchrony for realistic and dynamic audiovisual events.

Eg, Ragnhild; Behne, Dawn M.

Front Psychol ; 6: 736, 2015.

Article in English | MEDLINE | ID: mdl-26082738

ABSTRACT

In well-controlled laboratory experiments, researchers have found that humans can perceive delays between auditory and visual signals as short as 20 ms. Conversely, other experiments have shown that humans can tolerate audiovisual asynchrony that exceeds 200 ms. This seeming contradiction in human temporal sensitivity can be attributed to a number of factors such as experimental approaches and precedence of the asynchronous signals, along with the nature, duration, location, complexity and repetitiveness of the audiovisual stimuli, and even individual differences. In order to better understand how temporal integration of audiovisual events occurs in the real world, we need to close the gap between the experimental setting and the complex setting of everyday life. With this work, we aimed to contribute one brick to the bridge that will close this gap. We compared perceived synchrony for long-running and eventful audiovisual sequences to shorter sequences that contain a single audiovisual event, for three types of content: action, music, and speech. The resulting windows of temporal integration showed that participants were better at detecting asynchrony for the longer stimuli, possibly because the long-running sequences contain multiple corresponding events that offer audiovisual timing cues. Moreover, the points of subjective simultaneity differ between content types, suggesting that the nature of a visual scene could influence the temporal perception of events. An expected outcome from this type of experiment was the rich variation among participants' distributions and the derived points of subjective simultaneity. Hence, the designs of similar experiments call for more participants than traditional psychophysical studies. Heeding this caution, we conclude that existing theories on multisensory perception are ready to be tested on more natural and representative stimuli.

10.

Audio-visual identification of place of articulation and voicing in white and babble noise.

Alm, Magnus; Behne, Dawn M; Wang, Yue; Eg, Ragnhild.

J Acoust Soc Am ; 126(1): 377-87, 2009 Jul.

Article in English | MEDLINE | ID: mdl-19603894

ABSTRACT

Research shows that noise and phonetic attributes influence the degree to which auditory and visual modalities are used in audio-visual speech perception (AVSP). Research has, however, mainly focused on white noise and single phonetic attributes, thus neglecting the more common babble noise and possible interactions between phonetic attributes. This study explores whether white and babble noise differentially influence AVSP and whether these differences depend on phonetic attributes. White and babble noise of 0 and -12 dB signal-to-noise ratio were added to congruent and incongruent audio-visual stop consonant-vowel stimuli. The audio (A) and video (V) of incongruent stimuli differed either in place of articulation (POA) or voicing. Responses from 15 young adults show that, compared to white noise, babble resulted in more audio responses for POA stimuli, and fewer for voicing stimuli. Voiced syllables received more audio responses than voiceless syllables. Results can be attributed to discrepancies in the acoustic spectra of both the noise and speech target. Voiced consonants may be more auditorily salient than voiceless consonants which are more spectrally similar to white noise. Visual cues contribute to identification of voicing, but only if the POA is visually salient and auditorily susceptible to the noise type.

Subject(s)

Auditory Perception , Noise , Phonation , Phonetics , Visual Perception , Adult , Analysis of Variance , Female , Humans , Male , Neuroma, Acoustic , Psychoacoustics , Speech Perception , Young Adult

11.

Linguistic experience and audio-visual perception of non-native fricatives.

Wang, Yue; Behne, Dawn M; Jiang, Haisheng.

J Acoust Soc Am ; 124(3): 1716-26, 2008 Sep.

Article in English | MEDLINE | ID: mdl-19045662

ABSTRACT

This study examined the effects of linguistic experience on audio-visual (AV) perception of non-native (L2) speech. Canadian English natives and Mandarin Chinese natives differing in degree of English exposure [long and short length of residence (LOR) in Canada] were presented with English fricatives of three visually distinct places of articulation: interdentals nonexistent in Mandarin and labiodentals and alveolars common in both languages. Stimuli were presented in quiet and in a cafe-noise background in four ways: audio only (A), visual only (V), congruent AV (AVc), and incongruent AV (AVi). Identification results showed that overall performance was better in the AVc than in the A or V condition and better in quiet than in cafe noise. While the Mandarin long LOR group approximated the native English patterns, the short LOR group showed poorer interdental identification, more reliance on visual information, and greater AV-fusion with the AVi materials, indicating the failure of L2 visual speech category formation with the short LOR non-natives and the positive effects of linguistic experience with the long LOR non-natives. These results point to an integrated network in AV speech processing as a function of linguistic background and provide evidence to extend auditory-based L2 speech learning theories to the visual domain.

Subject(s)

Auditory Perception , Cues , Multilingualism , Speech Intelligibility , Speech Perception , Visual Perception , Acoustic Stimulation , Adolescent , Adult , Audiometry, Speech , Canada , China/ethnology , Facial Expression , Female , Humans , Learning , Lipreading , Male , Noise , Perceptual Masking , Young Adult

12.

Infant directed speech in natural interaction--Norwegian vowel quantity and quality.

Englund, Kjellrun T; Behne, Dawn M.

J Psycholinguist Res ; 34(3): 259-80, 2005 May.

Article in English | MEDLINE | ID: mdl-16050445

ABSTRACT

An interactive face-to-face setting is used to study natural infant directed speech (IDS) compared to adult directed speech (ADS). With distinctive vowel quantity and vowel quality, Norwegian IDS was used in a natural quasi-experimental design. Six Norwegian mothers were recorded over a period of 6 months alone with their infants and in an adult conversation. Vowel duration and spectral attributes of the vowels /a:/, /i:/ and /u:/, and their short counterparts /a/ /i/ and /u/ were analysed. Repeated measures analyses show that effects of vowel quantity did not differ between ADS and IDS, and for back vowel qualities, the vowel space was shifted upwards in IDS compared to ADS suggesting that fronted articulations in natural IDS may visually enhance speech to infants.

Subject(s)

Interpersonal Relations , Speech , Adult , Cues , Humans , Infant , Norway , Smiling , Vocabulary

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL