Results 1 - 2 of 2
1.
Neural Comput ; 21(6): 1601-21, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19018704

ABSTRACT

We study an expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector in RBMs). We are particularly interested in estimators of the gradient of the log likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation--running only a short Gibbs chain, which is the main idea behind the contrastive divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train autoassociators and stacked autoassociators. The derivation is not specific to the particular parametric forms used in RBMs and requires only convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is correct most of the time, even when the bias is large, so that CD-k is a good descent direction even for small k.
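
To make the truncation concrete, the following is a minimal NumPy sketch of one CD-k update for a binary RBM with sigmoid units; the variable names (v0, W, b, c), the learning rate, and the choice to use hidden probabilities rather than samples in the sufficient statistics are implementation conventions assumed here, not notation from the article.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, b, c, k=1, lr=0.01):
    """One CD-k update for a binary RBM on a minibatch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden-unit probabilities given the data.
    ph0 = sigmoid(v0 @ W + c)

    # Negative phase: truncated Gibbs chain of length k, started at the data.
    vk, phk = v0, ph0
    for _ in range(k):
        hk = (rng.random(phk.shape) < phk).astype(float)   # sample h ~ p(h | v)
        pvk = sigmoid(hk @ W.T + b)                         # p(v | h)
        vk = (rng.random(pvk.shape) < pvk).astype(float)    # sample v ~ p(v | h)
        phk = sigmoid(vk @ W + c)                           # p(h | v_k)

    # CD-k gradient estimate: positive statistics minus statistics at Gibbs step k.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - vk.T @ phk) / n
    b += lr * (v0 - vk).mean(axis=0)
    c += lr * (ph0 - phk).mean(axis=0)
    return W, b, c

Setting k = 1 gives the usual CD-1 update; increasing k lengthens the Gibbs chain and, per the analysis above, reduces the bias of the estimator.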


Subject(s)
Bias , Learning/physiology , Likelihood Functions , Models, Statistical , Humans
2.
Neural Comput ; 16(10): 2197-219, 2004 Oct.
Article in English | MEDLINE | ID: mdl-15333211

ABSTRACT

In this letter, we show a direct relation between spectral embedding methods and kernel principal components analysis and how both are special cases of a more general learning problem: learning the principal eigenfunctions of an operator defined from a kernel and the unknown data-generating density. Whereas spectral embedding methods provided only coordinates for the training points, the analysis justifies a simple extension to out-of-sample examples (the Nyström formula) for multidimensional scaling (MDS), spectral clustering, Laplacian eigenmaps, locally linear embedding (LLE), and Isomap. The analysis provides, for all such spectral embedding methods, the definition of a loss function, whose empirical average is minimized by the traditional algorithms. The asymptotic expected value of that loss defines a generalization performance and clarifies what these algorithms are trying to learn. Experiments with LLE, Isomap, spectral clustering, and MDS show that this out-of-sample embedding formula generalizes well, with a level of error comparable to the effect of small perturbations of the training set on the embedding.
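
As a concrete illustration of the out-of-sample idea, here is a minimal NumPy sketch of kernel PCA with a Nyström-style projection of new points; the Gaussian kernel, the function names, and the centering details are assumptions chosen for brevity, not the data-dependent kernels (for MDS, spectral clustering, Laplacian eigenmaps, LLE, Isomap) analyzed in the article.

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of X and the rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_kernel_pca(X, n_components=2, gamma=1.0):
    # Eigendecompose the doubly centered Gram matrix of the training set.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    lam, U = np.linalg.eigh(Kc)                     # ascending eigenvalues
    idx = np.argsort(lam)[::-1][:n_components]      # keep the top components
    lam, U = lam[idx], U[:, idx]
    alphas = U / np.sqrt(lam)                       # normalize so lam * ||alpha||^2 = 1
    return {"X": X, "K": K, "alphas": alphas, "gamma": gamma}

def transform_out_of_sample(model, Xnew):
    # Nystrom-style projection: embed points not seen during training.
    X, K, alphas, gamma = model["X"], model["K"], model["alphas"], model["gamma"]
    Kx = rbf_kernel(Xnew, X, gamma)
    # Center the new kernel rows consistently with the training Gram matrix.
    Kx_c = (Kx
            - Kx.mean(axis=1, keepdims=True)
            - K.mean(axis=0, keepdims=True)
            + K.mean())
    return Kx_c @ alphas                            # embedding coordinates of Xnew

Applying transform_out_of_sample to the training points themselves reproduces their original embedding, which is the consistency property underlying the out-of-sample extension discussed above.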


Subject(s)
Algorithms , Artificial Intelligence , Learning/physiology , Models, Statistical , Neural Networks, Computer , Cluster Analysis , Generalization, Psychological , Humans