Results 1 - 5 of 5
1.
Sci Rep; 13(1): 6480, 2023 Apr 20.
Article in English | MEDLINE | ID: mdl-37081119

ABSTRACT

Comparing artificial neural networks with the outputs of neuroimaging techniques has recently seen substantial advances in (computer) vision and text-based language models. Here, we propose a framework to compare biological and artificial neural computations of spoken language representations and raise several new challenges to this paradigm. The proposed technique is based on the same principle that underlies electroencephalography (EEG): averaging of neural (artificial or biological) activity across neurons in the time domain, which makes it possible to compare the encoding of any acoustic property in the brain and in intermediate convolutional layers of an artificial neural network. Our approach allows a direct comparison of responses to a phonetic property in the brain and in deep neural networks without requiring linear transformations between the signals. We argue that the brain stem response (cABR) and the response in intermediate convolutional layers to the exact same stimulus are highly similar without applying any transformations, and we quantify this observation. The proposed technique not only reveals similarities, but also allows for analysis of the encoding of actual acoustic properties in the two signals: we compare peak latency (i) in the cABR relative to the stimulus in the brain stem and (ii) in intermediate convolutional layers relative to the input/output in deep convolutional networks. We also examine and compare the effect of prior language exposure on peak latency in the cABR and in intermediate convolutional layers. Substantial similarities in peak latency encoding between the human brain and intermediate convolutional layers emerge based on results from eight trained networks (including a replication experiment). The proposed technique can be used to compare the encoding of any acoustic property between the human brain and intermediate convolutional layers and can be extended to other neuroimaging techniques.


Subject(s)
Neural Networks, Computer; Speech; Humans; Electroencephalography; Brain Stem/diagnostic imaging; Language
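
The core of the method described in the abstract above can be illustrated with a short sketch: activations of one intermediate convolutional layer are averaged across filters in the time domain, analogous to averaging neural activity in EEG, and the peak latency of the resulting signal is measured relative to the stimulus. This is a minimal illustration rather than the authors' released code; the array shapes, sampling rate, and peak criterion are assumptions.

# Minimal sketch: EEG-style averaging over an intermediate conv layer,
# then peak-latency measurement (shapes and criteria are illustrative).
import numpy as np
from scipy.signal import find_peaks

def averaged_layer_response(activations: np.ndarray) -> np.ndarray:
    """Average activations of shape (n_filters, n_timesteps) across filters."""
    return activations.mean(axis=0)

def peak_latency_ms(response: np.ndarray, sample_rate: float) -> float:
    """Latency (in ms) of the largest peak in the averaged response."""
    peaks, props = find_peaks(response, height=0.0)
    if len(peaks) == 0:
        return float("nan")
    best = peaks[np.argmax(props["peak_heights"])]
    return 1000.0 * best / sample_rate

# Synthetic stand-in for a real layer output: 64 filters, 2048 time steps.
rng = np.random.default_rng(0)
fake_activations = rng.standard_normal((64, 2048))
avg = averaged_layer_response(fake_activations)
print(f"peak latency: {peak_latency_ms(avg, sample_rate=16000.0):.2f} ms")

The averaged response could then be compared directly with a cABR recording of the same stimulus, without any linear transformation between the two signals.
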
2.
iScience; 25(6): 104393, 2022 Jun 17.
Article in English | MEDLINE | ID: mdl-35663036

ABSTRACT

Machine learning has advanced dramatically over the past decade. Most strides have been made in human-centered applications, owing to the availability of large-scale datasets; however, the opportunity is ripe to apply this technology to a deeper understanding of non-human communication. We detail a scientific roadmap for advancing the understanding of whale communication that can serve as a template for deciphering other forms of animal and non-human communication. Sperm whales, with their highly developed neuroanatomical features, cognitive abilities, social structures, and discrete click-based encoding, are an excellent model for developing advanced tools that can later be applied to other animals. We outline the key elements required for collecting and processing massive datasets, detecting basic communication units and language-like higher-level structures, and validating models through interactive playback experiments. The technological capabilities developed through such an undertaking hold potential for cross-application in the broader communities investigating non-human communication and behavioral research.
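One step in the roadmap above, detecting basic communication units in click-based signals, can be illustrated with a deliberately simple sketch. This is not taken from the paper; the envelope-threshold approach, the threshold value, and the minimum inter-click gap are all illustrative assumptions.

# Purely illustrative sketch: detect candidate click-like events by
# peak-picking on the signal envelope with a minimum inter-click interval.
import numpy as np
from scipy.signal import find_peaks

def detect_clicks(audio: np.ndarray, sr: int,
                  threshold: float = 0.5, min_gap_ms: float = 20.0) -> np.ndarray:
    """Return sample indices of candidate clicks above an amplitude threshold."""
    envelope = np.abs(audio)
    min_gap = int(sr * min_gap_ms / 1000.0)
    peaks, _ = find_peaks(envelope, height=threshold * envelope.max(),
                          distance=min_gap)
    return peaks

# Synthetic example: low-level noise with three impulse-like "clicks".
sr = 48000
audio = 0.01 * np.random.default_rng(3).standard_normal(sr)
for t in (0.1, 0.4, 0.7):
    audio[int(t * sr)] = 1.0
print(detect_clicks(audio, sr))  # approximately [4800, 19200, 33600]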

3.
Neural Netw; 139: 305-325, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33873122

ABSTRACT

How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs: ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN). These architectures combine the Deep Convolutional GAN architecture for audio data (WaveGAN; Donahue et al., 2019) with the information-theoretic extension of the GAN, InfoGAN (Chen et al., 2016), and propose a new latent space structure that can model featural learning simultaneously with higher-level classification and allows for a very low-dimensional vector representation of lexical items. In addition to the Generator and Discriminator networks, the architectures introduce a network that learns to retrieve latent codes from generated audio outputs. Lexical learning is thus modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. Networks trained on lexical items from the TIMIT corpus learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space. By manipulating these variables, the network outputs specific lexical items. The network occasionally outputs innovative lexical items that deviate from the training data but are linguistically interpretable and highly informative for cognitive modeling and neural network interpretability. Innovative outputs suggest that phonetic and phonological representations learned by the network can be productively recombined, directly paralleling productivity in human speech: a fiwGAN network trained on "suit" and "dark" outputs the innovative "start", even though it never saw "start" or even a [st] sequence in the training data. We also argue that setting latent featural codes to values well beyond the training range results in near-categorical generation of prototypical lexical items and reveals the underlying value of each latent code. Probing deep neural networks trained on well-understood dependencies in speech has implications for latent space interpretability and for understanding how deep neural networks learn meaningful representations, as well as potential for unsupervised text-to-speech generation in the GAN framework.


Subject(s)
Machine Learning; Natural Language Processing; Acoustics; Speech Recognition Software
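
The latent-space structure described in the abstract above can be sketched as follows: the Generator's input concatenates a code, one-hot categorical in ciwGAN or binary featural in fiwGAN, with uniform noise, and a separate Q-network (not shown) is trained to recover the code from the generated audio. The dimensions below are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of the ciwGAN / fiwGAN latent input (dimensions illustrative).
import numpy as np

rng = np.random.default_rng(0)

def ciwgan_latent(n_classes: int = 8, noise_dim: int = 92) -> np.ndarray:
    """One-hot categorical code + uniform noise."""
    code = np.zeros(n_classes)
    code[rng.integers(n_classes)] = 1.0
    noise = rng.uniform(-1.0, 1.0, size=noise_dim)
    return np.concatenate([code, noise])

def fiwgan_latent(n_features: int = 3, noise_dim: int = 97) -> np.ndarray:
    """Binary featural code + uniform noise (2**n_features possible classes)."""
    code = rng.integers(0, 2, size=n_features).astype(float)
    noise = rng.uniform(-1.0, 1.0, size=noise_dim)
    return np.concatenate([code, noise])

z_ciw = ciwgan_latent()  # would be fed to a WaveGAN-style Generator
z_fiw = fiwgan_latent()
print(z_ciw[:8], z_fiw[:3])

Because the Q-network must recover the code from the waveform alone, the Generator is pushed to encode unique, retrievable information in its acoustic outputs, which is how lexical learning is modeled as emergent in this setup.
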
4.
Front Artif Intell; 3: 44, 2020.
Article in English | MEDLINE | ID: mdl-33733161

ABSTRACT

Training deep neural networks on well-understood dependencies in speech data can provide new insights into how they learn internal representations. This paper argues that the acquisition of speech can be modeled as a dependency between random space and generated speech data in the Generative Adversarial Network architecture and proposes a methodology to uncover the network's internal representations that correspond to phonetic and phonological properties. The Generative Adversarial architecture is uniquely appropriate for modeling phonetic and phonological learning because the network is trained on unannotated raw acoustic data and learning is unsupervised, without any language-specific assumptions or pre-assumed levels of abstraction. A Generative Adversarial Network was trained on an allophonic distribution in English, in which voiceless stops surface as aspirated word-initially before stressed vowels, unless preceded by the sibilant [s]. The network successfully learns the allophonic alternation: the network's generated speech signal contains the conditional distribution of aspiration duration. The paper proposes a technique for establishing the network's internal representations that identifies latent variables corresponding to, for example, the presence of [s] and its spectral properties. By manipulating these variables, we actively control the presence of [s] and its frication amplitude in the generated outputs. This suggests that the network learns to use latent variables as an approximation of phonetic and phonological representations. Crucially, we observe that the dependencies learned in training extend beyond the training interval, which allows for additional exploration of the learned representations. The paper also discusses how the network's architecture and innovative outputs resemble and differ from linguistic behavior in language acquisition, speech disorders, and speech errors, and how well-understood dependencies in speech data can help us interpret how neural networks learn their representations.
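
The latent-manipulation probe described in the abstract above can be sketched as follows: one latent variable is swept across values inside and well beyond the assumed training interval of [-1, 1] while the rest of the latent vector is held fixed, and the generated outputs are then inspected for the property of interest (for example, presence and amplitude of [s] frication). The generate function below is a placeholder stand-in for a trained WaveGAN-style Generator, not the paper's model, and the latent dimensionality is an assumption.

# Minimal sketch: sweep one latent variable, including values beyond the
# training interval, while keeping the rest of the latent vector fixed.
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM = 100

def generate(z: np.ndarray) -> np.ndarray:
    """Placeholder for Generator(z) -> waveform; returns dummy audio here."""
    return np.tanh(z.sum()) * np.ones(16384)

def sweep_latent(dim: int, values) -> list:
    """Hold a random latent vector fixed and vary one dimension."""
    z = rng.uniform(-1.0, 1.0, size=LATENT_DIM)
    outputs = []
    for v in values:
        z_probe = z.copy()
        z_probe[dim] = v  # values well beyond [-1, 1] probe the representation
        outputs.append(generate(z_probe))
    return outputs

# Values inside the training interval and far outside it.
audio_outputs = sweep_latent(dim=7, values=[-1.0, 0.0, 1.0, 3.0, 5.0, 10.0])
print(len(audio_outputs), audio_outputs[0].shape)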

5.
J Acoust Soc Am; 142(4): 2168, 2017 Oct.
Article in English | MEDLINE | ID: mdl-29092567

ABSTRACT

One of the most widely studied observations in linguistic phonetics is that, all else being equal, vowels are longer before voiced than before voiceless obstruents. The causes of this phonetic generalization are, however, poorly understood, and several competing explanations have been proposed. No studies have so far measured vowel duration before stops with yet another laryngeal feature: ejectives. This study fills that gap and presents results from an experiment that measures vowel duration before stops with all three laryngeal features in Georgian and simultaneously models the effects of both closure duration and voice onset time (VOT) on preceding vowel duration. The results show that vowels have significantly different durations before all three series of stops, voiced, ejective, and voiceless aspirated, even when closure and VOT durations are controlled for. The results also suggest that closure and VOT durations are inversely correlated with preceding vowel duration. Taken together, these results bear several implications for the discussion of the causes of vowel duration differences: the data support hypotheses that attribute the effect to laryngeal gestures, temporal compensation, and closure velocity. Other explanations, especially perceptual and airflow-expenditure accounts, are considerably weakened by the results.


Subject(s)
Phonetics; Speech; Voice; Adult; Female; Humans; Language; Male; Speech Production Measurement; Young Adult
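
The analysis described in the abstract above can be illustrated with a regression sketch on synthetic data (not the study's measurements): preceding-vowel duration is modeled as a function of the stop's laryngeal series while controlling for closure duration and VOT. The column names and the simple fixed-effects model are assumptions; the study's actual statistical models may differ (for example, by including random effects for speaker and item).

# Regression sketch on synthetic data: vowel duration ~ series + closure + VOT.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
series = rng.choice(["voiced", "ejective", "aspirated"], size=n)
closure = rng.normal(80, 15, size=n)  # closure duration in ms
vot = rng.normal(40, 20, size=n)      # voice onset time in ms

# Synthetic vowel durations: longer before voiced stops, inversely related
# to closure and VOT, plus noise (values are arbitrary).
base = np.where(series == "voiced", 130,
                np.where(series == "ejective", 115, 105))
vowel_dur = base - 0.3 * closure - 0.2 * vot + rng.normal(0, 10, size=n)

df = pd.DataFrame({"vowel_dur": vowel_dur, "series": series,
                   "closure": closure, "vot": vot})
model = smf.ols("vowel_dur ~ C(series) + closure + vot", data=df).fit()
print(model.summary())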