1.
Neural Netw ; 161: 494-504, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36805264

ABSTRACT

Due to the dynamic nature of human language, automatic speech recognition (ASR) systems need to continuously acquire new vocabulary. Out-of-vocabulary (OOV) words, such as trending words and new named entities, pose problems for modern ASR systems, which require long training times to adapt their large numbers of parameters. Unlike most previous research, which focuses on language-model post-processing, we tackle this problem at an earlier processing level and eliminate the bias in acoustic modeling so that OOV words are recognized acoustically. We propose to generate OOV words using text-to-speech systems and to rescale losses to encourage neural networks to pay more attention to OOV words. Specifically, when fine-tuning a previously trained model on synthetic audio, we enlarge the classification loss of utterances containing OOV words (sentence level), or rescale the gradient used for back-propagation for OOV words (word level). To overcome catastrophic forgetting, we also explore combining loss rescaling with model regularization, i.e., L2 regularization and elastic weight consolidation (EWC). Compared with previous methods that simply fine-tune on synthetic audio with EWC, experimental results on the LibriSpeech benchmark reveal that our proposed loss rescaling approach achieves a significant improvement in recall rate with only a slight decrease in word error rate. Moreover, word-level rescaling is more stable than utterance-level rescaling and leads to higher recall and precision rates on OOV word recognition. Furthermore, our combined loss rescaling and weight consolidation methods can support continual learning of an ASR system.


Subject(s)
Speech Perception , Vocabulary , Humans , Speech , Language , Neural Networks, Computer
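
A minimal sketch of the word-level rescaling idea, assuming a token-level cross-entropy objective; the function name, the oov_scale factor, and the mask convention are illustrative, and the paper's full ASR models and EWC terms are omitted:

```python
# Hypothetical sketch: per-token loss rescaling for OOV words.
import torch
import torch.nn.functional as F

def rescaled_ce_loss(logits, targets, oov_mask, oov_scale=2.0):
    """Cross-entropy in which tokens flagged as OOV receive a larger
    weight, so their gradients dominate fine-tuning on synthetic audio.
    logits: (batch, seq, vocab); targets, oov_mask: (batch, seq)."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")   # (batch, seq)
    weights = torch.where(oov_mask.bool(),
                          torch.full_like(per_token, oov_scale),
                          torch.ones_like(per_token))
    return (weights * per_token).mean()
```

At the sentence level, one would instead apply a single enlarged weight to the summed loss of any utterance containing an OOV word.
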
2.
Front Neurorobot ; 16: 844753, 2022.
Article in English | MEDLINE | ID: mdl-35966371

ABSTRACT

A cognitive agent performing in the real world needs to learn relevant concepts about its environment (e.g., objects, colors, and shapes) and react accordingly. In addition to learning the concepts, it needs to learn relations between the concepts, in particular spatial relations between objects. In this paper, we propose three approaches that allow a cognitive agent to learn spatial relations. First, using an embodied model, the agent learns to reach toward an object based on simple instructions involving left-right relations. Since the level of realism and complexity of this approach does not permit large-scale and diverse experiences, as a second approach we devise a simple visual dataset for geometric feature learning and show that recent reasoning models can learn directional relations in different frames of reference. Yet, embodied and simple simulation approaches together still do not provide sufficient experiences. To close this gap, we thirdly propose utilizing knowledge bases for disembodied spatial relation reasoning. Since the three approaches (i.e., embodied learning, learning from simple visual data, and use of knowledge bases) are complementary, we conceptualize a cognitive architecture that combines them in the context of spatial relation learning.
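
As a toy illustration of the directional relations involved, a left-right label can be computed from object coordinates in either an absolute or an observer-relative frame of reference; the function and its conventions are hypothetical, not the paper's models:

```python
# Toy sketch of directional relation labeling in two frames of reference.
import numpy as np

def left_right(target_xy, reference_xy, observer_heading=0.0,
               frame="absolute"):
    """Label `target` as 'left' or 'right' of `reference`.
    frame='absolute' uses the world x-axis; frame='relative' first
    rotates the offset into the observer's heading (radians)."""
    d = np.asarray(target_xy, float) - np.asarray(reference_xy, float)
    if frame == "relative":
        c, s = np.cos(-observer_heading), np.sin(-observer_heading)
        d = np.array([c * d[0] - s * d[1], s * d[0] + c * d[1]])
    return "left" if d[0] < 0 else "right"

print(left_right((1, 2), (3, 2)))                     # left (world frame)
print(left_right((1, 2), (3, 2), np.pi, "relative"))  # right when facing back
```
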

3.
Article in English | MEDLINE | ID: mdl-35867361

ABSTRACT

The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture and a location-aware attention mechanism that maps face image sequences directly to mel-scale spectrograms without requiring any human annotations. The proposed LipSound2 model is first pre-trained on ∼2,400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility over previous approaches in both speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify transferability. Finally, we train a cascaded lip reading (video-to-text) system by fine-tuning a pre-trained speech recognition system on the generated audio, achieving state-of-the-art performance on both English and Chinese benchmark datasets.
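
The location-aware attention named in the abstract is a standard Tacotron-style mechanism; the sketch below illustrates it under assumed dimensions (all layer sizes are illustrative, not LipSound2's actual configuration):

```python
# Sketch of location-aware (location-sensitive) attention: the score
# for each encoder step also depends on the previous alignment.
import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    def __init__(self, dim=128, conv_ch=32, kernel=31):
        super().__init__()
        self.query = nn.Linear(dim, dim, bias=False)    # decoder state
        self.memory = nn.Linear(dim, dim, bias=False)   # encoder outputs
        self.loc_conv = nn.Conv1d(1, conv_ch, kernel, padding=kernel // 2)
        self.loc_proj = nn.Linear(conv_ch, dim, bias=False)
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, s, h, prev_align):
        # s: (B, dim) decoder state; h: (B, T, dim); prev_align: (B, T)
        loc = self.loc_conv(prev_align.unsqueeze(1)).transpose(1, 2)  # (B,T,C)
        e = self.score(torch.tanh(
            self.query(s).unsqueeze(1) + self.memory(h) + self.loc_proj(loc)))
        align = torch.softmax(e.squeeze(-1), dim=1)        # (B, T)
        context = torch.bmm(align.unsqueeze(1), h).squeeze(1)
        return context, align
```
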

4.
Front Neurorobot ; 14: 52, 2020.
Article in English | MEDLINE | ID: mdl-33154720

ABSTRACT

Human infants are able to acquire natural language seemingly easily at an early age. Their language learning seems to occur simultaneously with learning other cognitive functions as well as with playful interactions with the environment and caregivers. From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration. However, characterizing the underlying mechanisms in the brain is difficult and explaining the grounding of language in crossmodal perception and action remains challenging. In this paper, we present a neurocognitive model for language grounding which reflects bio-inspired mechanisms such as an implicit adaptation of timescales as well as end-to-end multimodal abstraction. It addresses developmental robotic interaction and extends its learning capabilities using larger-scale knowledge-based data. In our scenario, we use the humanoid robot NICO to obtain the EMIL data collection, in which the cognitive robot interacts with objects in a children's playground environment while receiving linguistic labels from a caregiver. The model analysis shows that crossmodally integrated representations are sufficient for acquiring language merely from sensory input through interaction with objects in an environment. The representations self-organize hierarchically and embed temporal and spatial information through composition and decomposition. This model can also provide the basis for further crossmodal integration of perceptually grounded cognitive representations.

5.
Front Integr Neurosci ; 14: 10, 2020.
Article in English | MEDLINE | ID: mdl-32174816

ABSTRACT

Selective attention plays an essential role in information acquisition and utilization from the environment. In the past 50 years, research on selective attention has been a central topic in cognitive science. Compared with unimodal studies, crossmodal studies are more complex but necessary for solving real-world challenges in both human experiments and computational modeling. Although an increasing number of findings on crossmodal selective attention have shed light on humans' behavioral patterns and neural underpinnings, a much better understanding is still necessary to yield the same benefit for intelligent computational agents. This article reviews studies of selective attention in unimodal visual and auditory setups and in crossmodal audiovisual setups from the multidisciplinary perspectives of psychology and cognitive neuroscience, and evaluates different ways to simulate analogous mechanisms in computational models and robotics. In this interdisciplinary review, we discuss the gaps between these fields and provide insights into how psychological findings and theories can be used in artificial intelligence from different perspectives.

6.
Front Neurorobot ; 13: 93, 2019.
Article in English | MEDLINE | ID: mdl-31798437

ABSTRACT

The problem of generating structured Knowledge Graphs (KGs) is difficult and open, but relevant to a range of tasks related to decision making and information augmentation. A promising approach is to frame KG generation as producing a relational representation of the input (e.g., textual paragraphs or natural images), where nodes represent the entities and edges represent the relations. This procedure naturally decomposes into two phases: extracting primary relations from the input, and completing the KG with reasoning. In this paper, we propose a hybrid KG builder that combines these two phases in a unified framework and generates KGs from scratch. Specifically, we employ a neural relation extractor that resolves primary relations from the input and a differentiable inductive logic programming (ILP) model that iteratively completes the KG. We evaluate our framework in both textual and visual domains and achieve comparable performance on relation extraction datasets based on Wikidata and the Visual Genome. The framework surpasses neural baselines by a noticeable gap in reasoning out dense KGs and overall performs particularly well for rare relations.
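
A toy version of the two-phase idea, with hard Horn rules standing in for the differentiable ILP model; the triple and rule formats are illustrative, not the paper's representation:

```python
# Toy KG completion by forward chaining over extracted triples.
def complete_kg(triples, rules, max_iters=10):
    """triples: set of (head, rel, tail); rules: list of
    ((r1, r2), r3) meaning r1(a,b) & r2(b,c) -> r3(a,c)."""
    kg = set(triples)
    for _ in range(max_iters):
        new = set()
        for (r1, r2), r3 in rules:
            for a, ra, b in kg:
                if ra != r1:
                    continue
                for b2, rb, c in kg:
                    if rb == r2 and b2 == b:
                        new.add((a, r3, c))
        if new <= kg:          # fixed point reached
            break
        kg |= new
    return kg

kg = complete_kg({("cat", "on", "mat"), ("mat", "in", "room")},
                 [(("on", "in"), "in")])
print(("cat", "in", "room") in kg)  # True
```
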

7.
Front Neurorobot ; 12: 78, 2018.
Article in English | MEDLINE | ID: mdl-30546302

ABSTRACT

Artificial autonomous agents and robots interacting in complex environments are required to continually acquire and fine-tune knowledge over sustained periods of time. The ability to learn from continuous streams of information is referred to as lifelong learning and represents a long-standing challenge for neural network models due to catastrophic forgetting in which novel sensory experience interferes with existing representations and leads to abrupt decreases in the performance on previously acquired knowledge. Computational models of lifelong learning typically alleviate catastrophic forgetting in experimental scenarios with given datasets of static images and limited complexity, thereby differing significantly from the conditions artificial agents are exposed to. In more natural settings, sequential information may become progressively available over time and access to previous experience may be restricted. Therefore, specialized neural network mechanisms are required that adapt to novel sequential experience while preventing disruptive interference with existing representations. In this paper, we propose a dual-memory self-organizing architecture for lifelong learning scenarios. The architecture comprises two growing recurrent networks with the complementary tasks of learning object instances (episodic memory) and categories (semantic memory). Both growing networks can expand in response to novel sensory experience: the episodic memory learns fine-grained spatiotemporal representations of object instances in an unsupervised fashion while the semantic memory uses task-relevant signals to regulate structural plasticity levels and develop more compact representations from episodic experience. For the consolidation of knowledge in the absence of external sensory input, the episodic memory periodically replays trajectories of neural reactivations. We evaluate the proposed model on the CORe50 benchmark dataset for continuous object recognition, showing that we significantly outperform current methods of lifelong learning in three different incremental learning scenarios.
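
Assuming the growing networks follow the Growing-When-Required (GWR) scheme common in this line of work, a minimal sketch of one GWR update step (thresholds and rates illustrative; the recurrent and dual-memory machinery is omitted):

```python
# Minimal GWR step: grow a new prototype when the input is novel
# and the winner is already well-trained, otherwise adapt winners.
import numpy as np

def gwr_step(W, habit, x, a_T=0.35, h_T=0.1, eps_b=0.2, eps_n=0.006,
             kappa=1.05, tau_b=0.3):
    """W: (n, d) prototypes (n >= 2); habit: (n,) habituation in (0, 1]."""
    dist = np.linalg.norm(W - x, axis=1)
    b, s = np.argsort(dist)[:2]               # best-matching unit and runner-up
    activity = np.exp(-dist[b])
    if activity < a_T and habit[b] < h_T:     # novel input, habituated winner
        W = np.vstack([W, (W[b] + x) / 2.0])  # grow: insert a new prototype
        habit = np.append(habit, 1.0)
    else:                                     # adapt winner and neighbor
        W[b] += eps_b * habit[b] * (x - W[b])
        W[s] += eps_n * habit[s] * (x - W[s])
        habit[b] += tau_b * (kappa * (1.0 - habit[b]) - 1.0)  # decays to floor
    return W, habit
```
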

8.
Neural Netw ; 96: 137-149, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29017140

ABSTRACT

Lifelong learning is fundamental in autonomous robotics for the acquisition and fine-tuning of knowledge through experience. However, conventional deep neural models for action recognition from videos do not account for lifelong learning but rather learn a batch of training data with a predefined number of action classes and samples. Thus, there is a need to develop learning systems with the ability to incrementally process available perceptual cues and to adapt their responses over time. We propose a self-organizing neural architecture for incrementally learning to classify human actions from video sequences. The architecture comprises growing self-organizing networks equipped with recurrent neurons for processing time-varying patterns. We use a set of hierarchically arranged recurrent networks for the unsupervised learning of action representations with increasingly large spatiotemporal receptive fields. Lifelong learning is achieved in terms of prediction-driven neural dynamics in which the growth and the adaptation of the recurrent networks are driven by their capability to reconstruct temporally ordered input sequences. Experimental results on a classification task using two action benchmark datasets show that our model is competitive with state-of-the-art methods for batch learning, even when a significant number of sample labels are missing or corrupted during training sessions. Additional experiments show the ability of our model to adapt to non-stationary input while avoiding catastrophic interference.


Subject(s)
Machine Learning/trends , Neural Networks, Computer , Pattern Recognition, Visual , Humans , Neurons/physiology , Pattern Recognition, Visual/physiology , Robotics/methods , Robotics/trends
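
A sketch of the recurrent distance such growing networks typically use, where each unit matches both the current input and a temporal context vector (a Merge-SOM-style formulation; the mixing parameters are illustrative, not the paper's values):

```python
# Recurrent best-matching-unit search with a merged temporal context.
import numpy as np

def recurrent_bmu(W, C, x, ctx, alpha=0.7):
    """W, C: (n, d) weight and context vectors; x, ctx: (d,) input and
    global context. Returns the index of the best-matching unit."""
    d = alpha * np.linalg.norm(W - x, axis=1) ** 2 \
        + (1 - alpha) * np.linalg.norm(C - ctx, axis=1) ** 2
    return int(np.argmin(d))

def next_context(W, C, b, beta=0.5):
    """Fold the winner's weight and context into the context for t+1."""
    return beta * W[b] + (1 - beta) * C[b]
```
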
9.
Neural Netw ; 72: 140-51, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26548943

ABSTRACT

Emotional state recognition has become an important topic for human-robot interaction in recent years. By determining emotion expressions, robots can identify important variables of human behavior and use these to communicate in a more human-like fashion, thereby extending the interaction possibilities. Human emotions are multimodal and spontaneous, which makes them hard for robots to recognize. Each modality has its own restrictions and constraints which, together with the unstructured behavior of spontaneous expressions, create difficulties for the approaches in the literature, which are based on explicit feature-extraction techniques and manual modality fusion. Our model uses a hierarchical feature representation to deal with spontaneous emotions and learns how to integrate multiple modalities for non-verbal emotion recognition, making it suitable for use in an HRI scenario. Our experiments show that a significant improvement in recognition accuracy is achieved when we use hierarchical features and multimodal information; on a benchmark dataset of spontaneous emotion expressions, our model improves the accuracy from the 82.5% reported in the literature to 91.3%.


Subject(s)
Emotions , Learning , Recognition, Psychology , Robotics , Humans
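
As a rough illustration of learned (rather than manual) modality fusion, a late-fusion classifier whose fusion layer is trained jointly with the per-modality encoders; the modality names, dimensions, and class count are assumptions, not the paper's architecture:

```python
# Toy late-fusion emotion classifier with a trained fusion layer.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, face_dim=256, motion_dim=128, n_classes=7):
        super().__init__()
        self.face = nn.Sequential(nn.Linear(face_dim, 64), nn.ReLU())
        self.motion = nn.Sequential(nn.Linear(motion_dim, 64), nn.ReLU())
        self.head = nn.Linear(128, n_classes)   # fusion learned end-to-end

    def forward(self, face_feats, motion_feats):
        z = torch.cat([self.face(face_feats), self.motion(motion_feats)], -1)
        return self.head(z)                     # emotion logits
```
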
10.
Front Neurorobot ; 9: 3, 2015.
Article in English | MEDLINE | ID: mdl-26106323

ABSTRACT

The visual recognition of complex, articulated human movements is fundamental for a wide range of artificial systems oriented toward human-robot communication, action classification, and action-driven perception. These challenging tasks may generally involve the processing of a huge amount of visual information and learning-based mechanisms for generalizing a set of training actions and classifying new samples. To operate in natural environments, a crucial property is the efficient and robust recognition of actions, also under noisy conditions caused by, for instance, systematic sensor errors and temporarily occluded persons. Studies of the mammalian visual system and its outstanding ability to process biological motion information suggest separate neural pathways for the distinct processing of pose and motion features at multiple levels and the subsequent integration of these visual cues for action perception. We present a neurobiologically motivated approach to achieve noise-tolerant action recognition in real time. Our model consists of self-organizing Growing When Required (GWR) networks that obtain progressively generalized representations of sensory inputs and learn inherent spatio-temporal dependencies. During training, the GWR networks dynamically change their topological structure to better match the input space. We first extract pose and motion features from video sequences and then cluster actions in terms of prototypical pose-motion trajectories. Multi-cue trajectories from matching action frames are subsequently combined to provide action dynamics in the joint feature space. Reported experiments show that our approach outperforms previous results on a dataset of full-body actions captured with a depth sensor, and ranks among the best results for a public benchmark of domestic daily actions.
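
A toy rendering of the prototype-trajectory matching step, assuming frame-aligned windows of concatenated pose-motion features; the window format and distance are illustrative stand-ins for the learned trajectories described above:

```python
# Nearest-prototype classification over joint pose-motion trajectories.
import numpy as np

def classify_action(window, prototypes):
    """window: (T, d) concatenated pose+motion features for T frames;
    prototypes: dict label -> list of (T, d) prototype trajectories."""
    best_label, best_dist = None, np.inf
    for label, trajs in prototypes.items():
        for p in trajs:
            dist = np.linalg.norm(window - p)   # frame-aligned distance
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label
```
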

11.
Front Neurorobot ; 7: 15, 2013.
Article in English | MEDLINE | ID: mdl-24109451

ABSTRACT

As a fundamental research topic, autonomous indoor robot navigation continues to be a challenge in unconstrained real-world indoor environments. Although many models for map-building and planning exist, it is difficult to integrate them due to the high amount of noise, dynamics, and complexity. Addressing this challenge, this paper describes a neural model for environment mapping and robot navigation based on learning spatial knowledge. Considering that a person typically moves within a room without colliding with objects, this model learns the spatial knowledge by observing the person's movement with a ceiling-mounted camera. A robot can then plan and navigate to any given position in the room based on the acquired map, and adapt the map when possible obstacles are identified. In addition, salient visual features are learned and stored in the map during navigation. This anchoring of visual features in the map enables the robot to find and navigate to a target object when shown an image of it. We implement this model on a humanoid robot and conduct tests in a home-like environment. Results of our experiments show that the learned sensorimotor map enables the robot to master complex navigation tasks.
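
A toy version of the mapping-and-planning idea, with a grid BFS standing in for the neural map and planner; treating cells the person was observed in as traversable is the paper's premise, while the grid representation itself is an assumption for illustration:

```python
# Learn a traversability grid from observed positions, then plan with BFS.
from collections import deque

def learn_map(observed_cells, width, height):
    free = set(observed_cells)            # person walked here safely
    return [[(x, y) in free for x in range(width)] for y in range(height)]

def plan(grid, start, goal):
    q, came = deque([start]), {start: None}
    while q:
        cur = q.popleft()
        if cur == goal:                   # reconstruct the path
            path = []
            while cur is not None:
                path.append(cur)
                cur = came[cur]
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) \
                    and grid[ny][nx] and nxt not in came:
                came[nxt] = cur
                q.append(nxt)
    return None                           # goal unreachable

grid = learn_map({(x, 1) for x in range(5)}, 5, 3)
print(plan(grid, (0, 1), (4, 1)))  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
```
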

12.
PLoS Comput Biol ; 7(11): e1002253, 2011 Nov.
Article in English | MEDLINE | ID: mdl-22072953

ABSTRACT

Various optimality principles have been proposed to explain the characteristics of coordinated eye and head movements during visual orienting behavior. At the same time, researchers have suggested several neural models of saccade generation, but these do not include online learning as a mechanism of optimization. Here, we suggest an open-loop neural controller with a local adaptation mechanism that minimizes a proposed cost function. Simulations show that the characteristics of coordinated eye and head movements generated by this model match the experimental data in many aspects, including the relationship between amplitude, duration, and peak velocity in head-restrained conditions, and the relative contribution of eye and head to the total gaze shift in head-free conditions. Our model is a first step toward bringing together an optimality principle and an incremental local learning mechanism in a unified control scheme for coordinated eye and head movements.


Subject(s)
Eye Movements/physiology , Head Movements/physiology , Models, Neurological , Adaptation, Physiological , Computational Biology , Humans , Learning/physiology , Restraint, Physical , Saccades/physiology , Visual Perception/physiology
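
As a worked illustration of an optimality principle for gaze decomposition (not the paper's actual cost function or controller), a quadratic effort cost split between eye and head, with an oculomotor range constraint, reproduces the qualitative head-free finding that the head contributes more to larger gaze shifts:

```python
# Illustrative gaze split: minimize c_eye*e^2 + c_head*h^2 s.t. e + h = G.
import numpy as np

def split_gaze(G, c_eye=1.0, c_head=4.0, eye_limit=35.0):
    """Return (eye, head) contributions to a gaze shift G in degrees."""
    e = G * c_head / (c_eye + c_head)       # closed-form optimum
    e = np.clip(e, -eye_limit, eye_limit)   # oculomotor range constraint
    return e, G - e

for G in (10, 40, 80):
    print(G, split_gaze(G))  # head contribution grows with amplitude
```
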
13.
Neural Netw ; 22(5-6): 586-92, 2009.
Article in English | MEDLINE | ID: mdl-19616917

ABSTRACT

The brain is able to perform actions based on an adequate internal representation of the world, in which task-irrelevant features are ignored and incomplete sensory data are estimated. Traditionally, it is assumed that such abstract state representations are obtained purely from the statistics of sensory input, for example by unsupervised learning methods. However, more recent findings suggest an influence of the dopaminergic system, which can be modeled by a reinforcement learning approach. Standard reinforcement learning algorithms act on a single-layer network connecting the state space to the action space. Here, we add a feature detection stage and a memory layer, which together construct the state space for a learning agent. The memory layer consists of the state activation at the previous time step as well as the previously chosen action. We present a temporal-difference-based learning rule for training the weights from these additional inputs to the state layer. As a result, the performance of the network is maintained both in the presence of task-irrelevant features and at randomly occurring time steps during which the input is invisible. Interestingly, a goal-directed forward model emerges from the memory weights, which covers only the state-action pairs that are relevant to the task. The model presents a link between reinforcement learning, feature detection, and forward models, and may help to explain how reward systems recruit cortical circuits for goal-directed feature detection and prediction.


Subject(s)
Neural Networks, Computer , Algorithms , Goals , Humans , Learning , Memory , Reinforcement, Psychology , Time Factors
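
A minimal TD(0) update as an illustration of the kind of rule involved; this generic sketch adjusts value weights from a feature vector, whereas the paper applies a TD-based rule to the weights from the memory inputs into the state layer (names illustrative):

```python
# Generic TD(0) weight update driven by the prediction error.
import numpy as np

def td_update(w, phi, phi_next, reward, alpha=0.1, gamma=0.9):
    """phi, phi_next: feature vectors (incl. memory inputs); w: weights."""
    v, v_next = w @ phi, w @ phi_next
    delta = reward + gamma * v_next - v     # TD error (dopamine-like signal)
    return w + alpha * delta * phi          # adjust the incoming weights
```
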
14.
Neural Comput ; 20(5): 1261-84, 2008 May.
Article in English | MEDLINE | ID: mdl-18194109

ABSTRACT

Current models for learning feature detectors work on two timescales: on a fast timescale, the internal neurons' activations adapt to the current stimulus; on a slow timescale, the weights adapt to the statistics of the set of stimuli. Here we explore the adaptation of a neuron's intrinsic excitability, termed intrinsic plasticity, which occurs on a separate timescale. Through intrinsic plasticity, a neuron maintains homeostasis of an exponentially distributed firing rate in a dynamic environment. We exploit this in the context of a generative model to impose sparse coding. With natural image input, localized edge detectors emerge as models of V1 simple cells. An intermediate timescale for the intrinsic plasticity parameters allows modeling aftereffects. In the tilt aftereffect, after a viewer adapts to a grid of a certain orientation, grids of a nearby orientation are perceived as tilted away from the adapted orientation. Our results show that adapting the neurons' gain parameter, but not the threshold parameter, accounts for this effect. It occurs because neurons coding for the adapting stimulus attenuate their gain, while others increase it. Despite its simplicity and low maintenance, the intrinsic plasticity model accounts for more experimental details than previous models without this mechanism.


Subject(s)
Models, Neurological , Neuronal Plasticity/physiology , Neurons/physiology
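
One published gradient rule of this kind is Triesch's (2005) intrinsic plasticity rule, which adapts the gain and threshold of a sigmoid neuron so its output approaches an exponential distribution with mean firing rate mu; it is sketched here as an illustration of the mechanism, and whether it matches this paper's exact update is an assumption (rates illustrative):

```python
# Triesch-style intrinsic plasticity for y = sigmoid(a*x + b).
import numpy as np

def ip_step(a, b, x, mu=0.2, eta=0.01):
    """One gain/threshold update toward an exponential rate distribution."""
    y = 1.0 / (1.0 + np.exp(-(a * x + b)))
    db = eta * (1.0 - (2.0 + 1.0 / mu) * y + y ** 2 / mu)
    da = eta / a + x * db      # gain update shares the threshold term
    return a + da, b + db
```
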
15.
16.
Neural Netw ; 19(4): 339-53, 2006 May.
Article in English | MEDLINE | ID: mdl-16352416

ABSTRACT

We describe a hybrid generative and predictive model of the motor cortex. The generative model is related to the hierarchically directed cortico-cortical (or thalamo-cortical) connections; its unsupervised training leads to a topographic and sparse hidden representation of its sensory and motor input. The predictive model is related to lateral intra-area and inter-area cortical connections, functions as a hetero-associative attractor network, and is trained to predict the future state of the network. Given partial input, the generative model can map sensory input to motor actions and can thereby perform learnt action sequences of the agent within the environment. The predictive model can additionally predict a longer perception and action sequence (mental simulation). The models' performance is demonstrated on a visually guided robot docking manoeuvre. We propose that the motor cortex might take over functions previously learnt by reinforcement in the basal ganglia, and we relate this to mirror neurons and imitation.


Subject(s)
Models, Biological , Models, Psychological , Motor Cortex/physiology , Reinforcement, Psychology , Brain Mapping , Humans , Motor Cortex/anatomy & histology , Neural Pathways/physiology , Predictive Value of Tests , Psychomotor Performance
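
A toy stand-in for the predictive model: a linear hetero-associator trained with a delta rule to predict the next hidden state, then rolled forward without input as a crude "mental simulation"; dimensions, rates, and the linear form are assumptions, not the paper's attractor network:

```python
# Delta-rule hetero-associative predictor over hidden-state sequences.
import numpy as np

def train_predictor(states, lr=0.05, epochs=50):
    """states: (T, d) sequence of hidden codes; learns M with s_{t+1} ~ M s_t."""
    d = states.shape[1]
    M = np.zeros((d, d))
    for _ in range(epochs):
        for t in range(len(states) - 1):
            pred = M @ states[t]
            M += lr * np.outer(states[t + 1] - pred, states[t])  # delta rule
    return M

def mental_simulation(M, s0, steps=5):
    """Roll the predictor forward with no sensory input."""
    seq = [s0]
    for _ in range(steps):
        seq.append(M @ seq[-1])
    return np.array(seq)
```
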