Search | VHL Regional Portal

Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge.

Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru.

IEEE Trans Pattern Anal Mach Intell ; 39(4): 652-663, 2017 04.

Article in English | MEDLINE | ID: mdl-28055847

ABSTRACT

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. Finally, given the recent surge of interest in this task, a competition was organized in 2015 using the newly released COCO dataset. We describe and analyze the various improvements we applied to our own baseline and show the resulting performance in the competition, which we won ex-aequo with a team from Microsoft Research.

PRIME: probabilistic initial 3D model generation for single-particle cryo-electron microscopy.

Elmlund, Hans; Elmlund, Dominika; Bengio, Samy.

Structure ; 21(8): 1299-306, 2013 Aug 06.

Article in English | MEDLINE | ID: mdl-23931142

ABSTRACT

Low-dose electron microscopy of cryo-preserved individual biomolecules (single-particle cryo-EM) is a powerful tool for obtaining information about the structure and dynamics of large macromolecular assemblies. Acquiring images with low dose reduces radiation damage, preserves atomic structural details, but results in low signal-to-noise ratio of the individual images. The projection directions of the two-dimensional images are random and unknown. The grand challenge is to achieve the precise three-dimensional (3D) alignment of many (tens of thousands to millions) noisy projection images, which may then be combined to obtain a faithful 3D map. An accurate initial 3D model is critical for obtaining the precise 3D alignment required for high-resolution (<10 Å) map reconstruction. We report a method (PRIME) that, in a single step and without prior structural knowledge, can generate an accurate initial 3D map directly from the noisy images.

Subject(s)

Cryoelectron Microscopy/methods , Macromolecular Substances/ultrastructure , Imaging, Three-Dimensional/methods , Models, Molecular , Models, Statistical , Ribosomes/ultrastructure , Software

Sound retrieval and ranking using sparse auditory representations.

Lyon, Richard F; Rehn, Martin; Bengio, Samy; Walters, Thomas C; Chechik, Gal.

Neural Comput ; 22(9): 2390-416, 2010 Sep 01.

Article in English | MEDLINE | ID: mdl-20569181

ABSTRACT

To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large-scale task. We have adapted a machine-vision method, the passive-aggressive model for image retrieval (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach, we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use an adaptive pole-zero filter cascade (PZFC) auditory filter bank and sparse-code feature extraction from stabilized auditory images with multiple vector quantizers. In addition to auditory image models, we compare a family of more conventional mel-frequency cepstral coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. When thousands of sound files with a query vocabulary of thousands of words were ranked, the best precision at top-1 was 73% and the average precision was 35%, reflecting a 18% improvement over the best competing MFCC front end.

Subject(s)

Auditory Perception/physiology , Models, Neurological , Humans , Sound

A discriminative kernel-based approach to rank images from text queries.

Grangier, David; Bengio, Samy.

IEEE Trans Pattern Anal Mach Intell ; 30(8): 1371-84, 2008 Aug.

Article in English | MEDLINE | ID: mdl-18566492

ABSTRACT

This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.

Subject(s)

Algorithms , Artificial Intelligence , Image Interpretation, Computer-Assisted/methods , Information Storage and Retrieval/methods , Natural Language Processing , Pattern Recognition, Automated/methods , Discriminant Analysis , Image Enhancement/methods , Vocabulary, Controlled

Performance generalization in biometric authentication using joint user-specific and sample bootstraps.

Poh, Norman; Martin, Alvin; Bengio, Samy.

IEEE Trans Pattern Anal Mach Intell ; 29(3): 492-8, 2007 Mar.

Article in English | MEDLINE | ID: mdl-17224618

ABSTRACT

Biometric authentication performance is often depicted by a detection error trade-off (DET) curve. We show that this curve is dependent on the choice of samples available, the demographic composition and the number of users specific to a database. We propose a two-step bootstrap procedure to take into account the three mentioned sources of variability. This is an extension to the Bolle et al.'s bootstrap subset technique. Preliminary experiments on the NIST2005 and XM2VTS benchmark databases are encouraging, e.g., the average result across all 24 systems evaluated on NIST2005 indicates that one can predict, with more than 75 percent of DET coverage, an unseen DET curve with eight times more users. Furthermore, our finding suggests that with more data available, the confidence intervals become smaller and, hence, more useful.

Subject(s)

Algorithms , Artificial Intelligence , Biometry/methods , Face/anatomy & histology , Image Interpretation, Computer-Assisted/methods , Pattern Recognition, Automated/methods , Speech Recognition Software , Computer Simulation , Humans , Models, Statistical , Reproducibility of Results , Sensitivity and Specificity

Automatic analysis of multimodal group actions in meetings.

McCowan, Iain; Gatica-Perez, Daniel; Bengio, Samy; Lathoud, Guillaume; Barnard, Mark; Zhang, Dong.

IEEE Trans Pattern Anal Mach Intell ; 27(3): 305-17, 2005 Mar.

Article in English | MEDLINE | ID: mdl-15747787

ABSTRACT

This paper investigates the recognition of group actions in meetings. A framework is employed in which group actions result from the interactions of the individual participants. The group actions are modeled using different HMM-based approaches, where the observations are provided by a set of audiovisual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modeling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audio-based events, motivating a multimodal approach to meeting analysis.

Subject(s)

Algorithms , Artificial Intelligence , Behavioral Sciences/methods , Group Processes , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Social Behavior , Cluster Analysis , Computer Simulation , Humans , Models, Biological , Models, Statistical , Reproducibility of Results , Sensitivity and Specificity

Evaluation of formant-like features on an automatic vowel classification task.

de Wet, Febe; Weber, Katrin; Boves, Louis; Cranen, Bert; Bengio, Samy; Bourlard, Hervé.

J Acoust Soc Am ; 116(3): 1781-92, 2004 Sep.

Article in English | MEDLINE | ID: mdl-15478445

ABSTRACT

Numerous attempts have been made to find low-dimensional, formant-related representations of speech signals that are suitable for automatic speech recognition. However, it is often not known how these features behave in comparison with true formants. The purpose of this study was to compare two sets of automatically extracted formant-like features, i.e., robust formants and HMM2 features, to hand-labeled formants. The robust formant features were derived by means of the split Levinson algorithm while the HMM2 features correspond to the frequency segmentation of speech signals obtained by two-dimensional hidden Markov models. Mel-frequency cepstral coefficients (MFCCs) were also included in the investigation as an example of state-of-the-art automatic speech recognition features. The feature sets were compared in terms of their performance on a vowel classification task. The speech data and hand-labeled formants that were used in this study are a subset of the American English vowels database presented in Hillenbrand et al. [J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data and in noisy acoustic conditions. When using clean data, the classification performance of the formant-like features compared very well to the performance of the hand-labeled formants in a gender-dependent experiment, but was inferior to the hand-labeled formants in a gender-independent experiment. The results that were obtained in noisy acoustic conditions indicated that the formant-like features used in this study are not inherently noise robust. For clean and noisy data as well as for the gender-dependent and gender-independent experiments the MFCCs achieved the same or superior results as the formant features, but at the price of a much higher feature dimensionality.

Subject(s)

Phonetics , Speech Acoustics , Algorithms , Databases, Factual , Discriminant Analysis , Female , Humans , Male , Markov Chains , Models, Biological , Noise , Sex Factors

Offline recognition of unconstrained handwritten texts using HMMs and statistical language models.

Vinciarelli, Alessandro; Bengio, Samy; Bunke, Horst.

IEEE Trans Pattern Anal Mach Intell ; 26(6): 709-20, 2004 Jun.

Article in English | MEDLINE | ID: mdl-18579932

ABSTRACT

This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, the error rate is reduced by approximately 50 percent for single writer data and by approximately 25 percent for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.

Subject(s)

Artificial Intelligence , Biometry/methods , Electronic Data Processing/methods , Handwriting , Image Interpretation, Computer-Assisted/methods , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Algorithms , Computer Graphics , Documentation , Image Enhancement/methods , Markov Chains , Models, Statistical , Numerical Analysis, Computer-Assisted , Reproducibility of Results , Sensitivity and Specificity , Signal Processing, Computer-Assisted , Subtraction Technique , User-Computer Interface

10.

A parallel mixture of SVMs for very large scale problems.

Collobert, Ronan; Bengio, Samy; Bengio, Yoshua.

Neural Comput ; 14(5): 1105-14, 2002 May.

Article in English | MEDLINE | ID: mdl-11972909

ABSTRACT

Support vector machines (SVMs) are the state-of-the-art models for many classification problems, but they suffer from the complexity of their training algorithm, which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. This article proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole data set. Experiments on a large benchmark data set (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and surprisingly, a significant improvement in generalization was observed.

Subject(s)

Algorithms , Artificial Intelligence , Software

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL