1.
Adv Neural Inf Process Syst ; 32: 11782-11793, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31885428

ABSTRACT

Compressing word embeddings is important for deploying NLP models in memory-constrained settings. However, understanding what makes compressed embeddings perform well on downstream tasks is challenging: existing measures of compression quality often fail to distinguish between embeddings that perform well and those that do not. We thus propose the eigenspace overlap score as a new measure. We relate the eigenspace overlap score to downstream performance by developing generalization bounds for the compressed embeddings in terms of this score, in the context of linear and logistic regression. We then show that we can lower bound the eigenspace overlap score for a simple uniform quantization compression method, helping to explain the strong empirical performance of this method. Finally, we show that by using the eigenspace overlap score as a selection criterion between embeddings drawn from a representative set we compressed, we can efficiently identify the better-performing embedding with up to 2× lower selection error rates than the next best measure of compression quality, while avoiding the cost of training a model for each task of interest.
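
The abstract does not reproduce the definition of the score, so the sketch below (Python/NumPy; the normalization and the use of left singular vectors are assumptions, not taken verbatim from the paper) illustrates one natural reading: the squared Frobenius norm of the product of the two embedding matrices' left singular bases, normalized by the larger dimension. The toy comparison mirrors the abstract's point that uniform quantization preserves the eigenspace while aggressive dimensionality reduction does not.

    import numpy as np

    def eigenspace_overlap_score(X, X_tilde):
        """Overlap between the left singular subspaces of an embedding
        matrix X (n x d) and a compressed version X_tilde (n x d').

        Assumed form: ||U^T U_tilde||_F^2 / max(d, d'), where U and U_tilde
        hold the left singular vectors of X and X_tilde; higher means the
        compressed embeddings retain more of the original eigenspace.
        """
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
        overlap = np.linalg.norm(U.T @ U_t, "fro") ** 2
        return overlap / max(X.shape[1], X_tilde.shape[1])

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 64))            # "original" embeddings

    # Compression A: 4-bit uniform quantization (keeps all 64 directions).
    levels = np.linspace(X.min(), X.max(), 16)
    X_quant = levels[np.abs(X[..., None] - levels).argmin(-1)]

    # Compression B: dimensionality reduction to 8 dimensions (truncated SVD).
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    X_lowrank = U[:, :8] * s[:8]

    print(eigenspace_overlap_score(X, X_quant))    # close to 1: eigenspace largely preserved
    print(eigenspace_overlap_score(X, X_lowrank))  # 8/64 = 0.125: most directions discarded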

2.
Proc Mach Learn Res ; 89: 1264-1274, 2019 Apr.
Article in English | MEDLINE | ID: mdl-31777846

ABSTRACT

We investigate how to train kernel approximation methods that generalize well under a memory budget. Building on recent theoretical work, we define a measure of kernel approximation error which we find to be more predictive of the empirical generalization performance of kernel approximation methods than conventional metrics. An important consequence of this definition is that a kernel approximation matrix must be high rank to attain close approximation. Because storing a high-rank approximation is memory intensive, we propose using a low-precision quantization of random Fourier features (LP-RFFs) to build a high-rank approximation under a memory budget. Theoretically, we show that quantization has a negligible effect on generalization performance in important settings. Empirically, we demonstrate across four benchmark datasets that LP-RFFs can match the performance of full-precision RFFs and the Nyström method, with 3×-10× and 50×-460× less memory, respectively.
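
A minimal sketch of the idea behind LP-RFFs, assuming a plain uniform quantizer applied to standard random Fourier features for a Gaussian kernel; the paper's actual quantizer and training pipeline may differ. It illustrates the trade-off the abstract describes: at a fixed bit budget, many low-precision features give a higher-rank (and typically better) kernel approximation than fewer full-precision ones.

    import numpy as np

    def rff(X, D, gamma, seed=0):
        """Full-precision random Fourier features for the Gaussian kernel
        k(x, y) = exp(-gamma * ||x - y||^2)."""
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
        b = rng.uniform(0, 2 * np.pi, size=D)
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    def quantize(Z, n_bits):
        """Uniform quantization over the features' known range
        [-sqrt(2/D), sqrt(2/D)]; only the integer codes would be stored."""
        r = np.sqrt(2.0 / Z.shape[1])
        step = 2 * r / (2 ** n_bits - 1)
        codes = np.clip(np.round((Z + r) / step), 0, 2 ** n_bits - 1)
        return codes * step - r            # dequantized values for this demo

    # Same memory budget, two ways: 500 features at 32 bits each, or
    # 4000 features at 4 bits each.  The higher-rank low-precision map
    # tends to approximate the kernel matrix at least as well.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 10))
    K = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))

    Z_fp = rff(X, D=500, gamma=0.5)
    Z_lp = quantize(rff(X, D=4000, gamma=0.5), n_bits=4)
    print(np.linalg.norm(Z_fp @ Z_fp.T - K, "fro"))
    print(np.linalg.norm(Z_lp @ Z_lp.T - K, "fro"))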

3.
Proc Mach Learn Res ; 97: 1517-1527, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31777847

ABSTRACT

Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural priors they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the O(N log N) Cooley-Tukey FFT algorithm to machine precision, for dimensions N up to 1024. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points (the first time a structured approach has done so), with 4× faster inference speed and 40× fewer parameters.
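
The abstract's FFT example can be checked directly: the Cooley-Tukey algorithm is exactly a factorization of the DFT matrix into O(log N) sparse "butterfly" factors plus a bit-reversal permutation, which is the kind of sparse product the learned parameterization recovers. The sketch below builds those factors explicitly and verifies them against numpy.fft; it is a hand-written illustration of the target structure, not the paper's learned parameterization.

    import numpy as np

    def butterfly_factor(n):
        """B_n = [[I, D], [I, -D]] with D = diag(w^0, ..., w^(n/2 - 1)),
        w = exp(-2*pi*i/n): the radix-2 butterfly that combines two
        half-size DFTs into a full-size one."""
        half = n // 2
        w = np.exp(-2j * np.pi / n)
        D = np.diag(w ** np.arange(half))
        I = np.eye(half)
        return np.block([[I, D], [I, -D]])

    def bit_reversal(N):
        """Permutation matrix that puts the input into bit-reversed order."""
        bits = int(np.log2(N))
        idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(N)]
        P = np.zeros((N, N))
        P[np.arange(N), idx] = 1
        return P

    def dft_as_sparse_product(N):
        """Cooley-Tukey as a sparse factorization of the DFT matrix:
        F_N = B_N (I_2 kron B_{N/2}) ... (I_{N/2} kron B_2) R_N.
        Each butterfly factor has only 2N nonzeros, so applying the
        factored form to a vector costs O(N log N) instead of O(N^2)."""
        factors, n = [], N
        while n >= 2:
            factors.append(np.kron(np.eye(N // n), butterfly_factor(n)))
            n //= 2
        factors.append(bit_reversal(N))
        return factors

    N = 16
    F_dense = np.fft.fft(np.eye(N))                       # dense N^2 DFT matrix
    F_factored = np.linalg.multi_dot(dft_as_sparse_product(N))
    print(np.allclose(F_dense, F_factored))               # True, to machine precision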

4.
Proc Mach Learn Res ; 97: 1528-1537, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31777848

ABSTRACT

Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding data augmentation. We approach this from two directions: First, we provide a general model of augmentation as a Markov process, and show that kernels appear naturally with respect to this model, even when we do not employ kernel classification. Next, we analyze more directly the effect of augmentation on kernel classifiers, showing that data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. These frameworks both serve to illustrate the ways in which data augmentation affects the downstream learning model, and the resulting analyses provide novel connections between prior work in invariant kernels, tangent propagation, and robust optimization. Finally, we provide several proof-of-concept applications showing that our theory can be useful for accelerating machine learning workflows, such as reducing the amount of computation needed to train using augmented data, and predicting the utility of a transformation prior to training.
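
A small sketch of the first-order (feature-averaging) approximation the abstract describes, using an explicit random Fourier feature map so that the averaging can be done directly in feature space; the jitter transformation, the ridge objective, and all names below are illustrative choices, not the paper's exact setup.

    import numpy as np

    def rff_features(X, W, b):
        """Explicit feature map z(x) = sqrt(2/D) * cos(Wx + b), so that
        z(x)^T z(y) approximates a Gaussian kernel and augmented copies
        can be averaged in feature space."""
        D = W.shape[1]
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    rng = np.random.default_rng(0)
    n, d, D, m = 300, 5, 1000, 20
    X = rng.standard_normal((n, d))
    y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(n))
    W = rng.normal(size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)

    # Illustrative class-preserving transformation: small additive jitter,
    # sampled m times per training point.
    copies = X[None] + 0.1 * rng.standard_normal((m, n, d))          # (m, n, d)
    Z_copies = rff_features(copies.reshape(-1, d), W, b).reshape(m, n, D)

    lam = 1e-2
    # Exact augmentation: every augmented copy is its own training point.
    Z_aug, y_aug = Z_copies.reshape(-1, D), np.tile(y, m)
    w_exact = np.linalg.solve(Z_aug.T @ Z_aug / m + lam * np.eye(D),
                              Z_aug.T @ y_aug / m)
    # First-order approximation: one *averaged* feature vector per example.
    Z_avg = Z_copies.mean(axis=0)
    w_approx = np.linalg.solve(Z_avg.T @ Z_avg + lam * np.eye(D),
                               Z_avg.T @ y)
    # The gap between the two solutions reflects the second-order
    # (variance-regularization) term; it shrinks as the transformation's
    # effect on the features shrinks.
    print(np.linalg.norm(w_exact - w_approx) / np.linalg.norm(w_exact))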

5.
Adv Neural Inf Process Syst ; 2018: 9052-9060, 2018 Dec.
Article in English | MEDLINE | ID: mdl-31130799

ABSTRACT

The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual. Existing use of LDR matrices in deep learning has applied fixed displacement operators encoding forms of shift invariance akin to convolutions. We introduce a rich class of LDR matrices with more general displacement operators, and explicitly learn over both the operators and the low-rank component. This class generalizes several previous constructions while preserving compression and efficient computation. We prove bounds on the VC dimension of multi-layer neural networks with structured weight matrices and show empirically that our compact parameterization can reduce the sample complexity of learning. When replacing weight layers in fully-connected, convolutional, and recurrent neural networks for image classification and language modeling tasks, our new classes exceed the accuracy of existing compression approaches, and on some tasks even outperform general unstructured layers while using more than 20× fewer parameters.
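
For intuition, the sketch below checks the fixed-operator case the abstract contrasts against: with the standard shift operators, a Toeplitz matrix has Sylvester displacement rank at most 2, while an unstructured matrix has full displacement rank. The paper's contribution, learning the operators themselves jointly with the low-rank residual, is not implemented here; the operator names and the toy check are assumptions for illustration.

    import numpy as np

    def shift_operator(n, f):
        """Z_f: ones on the subdiagonal and f in the top-right corner, the
        classic fixed displacement operator for Toeplitz-like structure."""
        Z = np.diag(np.ones(n - 1), k=-1)
        Z[0, -1] = f
        return Z

    def displacement_rank(M, A, B):
        """Rank of the Sylvester displacement A M - M B.  A small rank r
        means M can be stored as the two operators plus a rank-r residual,
        i.e. with O(n r) instead of O(n^2) parameters."""
        return np.linalg.matrix_rank(A @ M - M @ B)

    n = 64
    A, B = shift_operator(n, 1.0), shift_operator(n, -1.0)

    rng = np.random.default_rng(0)
    t = rng.standard_normal(2 * n - 1)            # Toeplitz entries t_{i-j}
    i, j = np.indices((n, n))
    T = t[i - j + n - 1]                          # Toeplitz: constant diagonals
    M = rng.standard_normal((n, n))               # unstructured baseline

    print(displacement_rank(T, A, B))   # <= 2: shift-invariant structure
    print(displacement_rank(M, A, B))   # = n (generically): no structure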

6.
Adv Neural Inf Process Syst ; 30: 6109-6119, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29398882

ABSTRACT

Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that O(ε^(-2)) samples are required to achieve an approximation error of at most ε. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any γ > 0, to achieve error ε with O(e^γ + ε^(-1/γ)) samples as ε goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.
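
A sketch of the basic deterministic construction, assuming a tensor-product Gauss-Hermite rule for the Gaussian kernel's spectral integral; the gamma below is the kernel bandwidth, not the γ in the abstract's sample-complexity bound, and the paper's sparse-grid, reweighted, and sparse-ANOVA schemes, which avoid the q^d blow-up, are not shown.

    import numpy as np
    from itertools import product

    def gauss_hermite_features(X, gamma, q=8):
        """Deterministic feature map for the Gaussian kernel
        k(x, y) = exp(-gamma * ||x - y||^2), built from a tensor-product
        Gauss-Hermite rule with q nodes per dimension (q**d grid points)."""
        d = X.shape[1]
        t, h = np.polynomial.hermite.hermgauss(q)   # nodes/weights for e^{-t^2}
        nodes = 2.0 * np.sqrt(gamma) * t            # frequencies: omega ~ N(0, 2*gamma)
        weights = h / np.sqrt(np.pi)

        grid = np.array(list(product(range(q), repeat=d)))   # (q^d, d) index grid
        Omega = nodes[grid]                                   # (q^d, d) frequency grid
        a = weights[grid].prod(axis=1)                        # (q^d,) product weights

        proj = X @ Omega.T
        return np.concatenate([np.sqrt(a) * np.cos(proj),
                               np.sqrt(a) * np.sin(proj)], axis=1)

    # Toy check in low dimension; the bandwidth is chosen mild enough that
    # q = 8 nodes per dimension resolve the kernel's spectrum.  Sharper
    # kernels need more nodes, or the paper's smarter constructions.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    gamma = 0.05
    K = np.exp(-gamma * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
    Z = gauss_hermite_features(X, gamma, q=8)     # 2 * 8^3 = 1024 features
    print(np.abs(Z @ Z.T - K).max())              # entrywise error; no sampling variance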
