1.
Front Artif Intell ; 5: 891624, 2022.
Article in English | MEDLINE | ID: mdl-35615470

ABSTRACT

Analog crossbar arrays comprising programmable non-volatile resistors are under intense investigation for acceleration of deep neural network training. However, the ubiquitous asymmetric conductance modulation of practical resistive devices critically degrades the classification performance of networks trained with conventional algorithms. Here we first describe the fundamental reasons behind this incompatibility. Then, we explain the theoretical underpinnings of a novel fully-parallel training algorithm that is compatible with asymmetric crosspoint elements. By establishing a powerful analogy with classical mechanics, we explain how device asymmetry can be exploited as a useful feature for analog deep learning processors. Instead of conventionally tuning weights in the direction of the error function gradient, network parameters can be programmed to successfully minimize the total energy (Hamiltonian) of the system that incorporates the effects of device asymmetry. Our technique enables immediate realization of analog deep learning accelerators based on readily available device technologies.
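
As a toy illustration of why asymmetric devices and conventional gradient updates are incompatible (the soft-bounds model and constants below are assumptions for illustration, not the paper's device model), an equal number of up and down pulses no longer cancels on an asymmetric device, so the weight drifts toward the device's symmetry point regardless of the gradient:

```python
import numpy as np

def pulse(w, direction, beta=0.1, w_max=1.0):
    """Soft-bounds device model (illustrative only): the increment shrinks near
    +w_max and the decrement near -w_max, so up and down responses are
    asymmetric everywhere except at the symmetry point w = 0."""
    if direction > 0:
        return w + beta * (1.0 - w / w_max)
    return w - beta * (1.0 + w / w_max)

w_ideal, w_device = 0.5, 0.5
rng = np.random.default_rng(0)
for d in rng.permutation([+1, -1] * 500):    # equal numbers of up and down pulses
    w_ideal += 0.1 * d                       # symmetric (ideal) device: net change is zero
    w_device = pulse(w_device, d)            # asymmetric device: drifts toward its symmetry point

print(f"ideal: {w_ideal:+.3f}  asymmetric: {w_device:+.3f}")  # ~ +0.500 vs ~ 0.0
```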

2.
Front Artif Intell ; 4: 699148, 2021.
Article in English | MEDLINE | ID: mdl-34568813

ABSTRACT

Deep neural networks (DNNs) are typically trained using the conventional stochastic gradient descent (SGD) algorithm. However, SGD performs poorly when used to train networks on non-ideal analog hardware composed of resistive device arrays with non-symmetric conductance modulation characteristics. We recently proposed the Tiki-Taka algorithm, which overcomes this stringent symmetry requirement. Here we build on Tiki-Taka and describe a more robust algorithm that further relaxes other stringent hardware requirements. This second version of the Tiki-Taka algorithm (referred to as TTv2) (1) reduces the required number of device conductance states from thousands to only tens, (2) increases the tolerance to noise in the device conductance modulations by about 100x, and (3) increases the tolerance to noise in the matrix-vector multiplications performed by the analog arrays by about 10x. Empirical simulation results show that TTv2 can train various neural networks close to their ideal accuracy even under extremely noisy hardware settings. TTv2 achieves these capabilities by complementing the original Tiki-Taka algorithm with lightweight, low-computational-complexity digital filtering operations performed outside the analog arrays. Therefore, the implementation cost of TTv2 relative to SGD and Tiki-Taka is minimal, and it maintains the usual power and speed benefits of using analog hardware for training workloads. We also show how to extract the neural network from the analog hardware once training is complete, for further model deployment. Similar to Bayesian model averaging, we form analog-hardware-compatible averages over the neural network weights derived from TTv2 iterates. This model average can then be transferred to other analog or digital hardware with notable improvements in test accuracy, outperforming the trained model itself. In short, we describe an end-to-end training and model extraction technique for extremely noisy crossbar-based analog hardware that can be used to accelerate DNN training workloads and match the performance of full-precision SGD.
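
The following is a minimal, purely digital sketch of the loop structure the abstract describes, written for a toy least-squares problem; the variable names, constants, decay term, and averaging scheme are illustrative assumptions, not the reference TTv2 implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
y = X @ rng.normal(size=8)                     # toy least-squares task

A = np.zeros(8)                                # fast auxiliary array (analog in hardware)
H = np.zeros(8)                                # low-pass filter state (digital, the TTv2 addition)
C = np.zeros(8)                                # main weight array (analog in hardware)
C_avg = np.zeros(8)                            # running average of iterates for model extraction
lr_a, lr_c, beta, transfer_every = 0.05, 0.2, 0.9, 5

for step in range(3000):
    i = rng.integers(0, len(X))
    grad = (X[i] @ C - y[i]) * X[i]            # per-example gradient at the main weights
    A -= lr_a * grad                           # fast gradient accumulation on A
    H = beta * H + (1 - beta) * A              # lightweight digital filtering outside the arrays
    if (step + 1) % transfer_every == 0:
        C += lr_c * H                          # slow, filtered transfer into C
        A *= 0.5                               # crude stand-in for the decay of A toward zero
    C_avg += (C - C_avg) / (step + 1)          # weight averaging, as in the model-extraction step

print("train MSE:", float(np.mean((X @ C - y) ** 2)))
```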

3.
Front Neurosci ; 15: 767953, 2021.
Article in English | MEDLINE | ID: mdl-35069098

ABSTRACT

Recent progress in novel non-volatile memory-based synaptic device technologies and their suitability for matrix-vector multiplication (MVM) has ignited active research on implementing analog neural network training accelerators with resistive crosspoint arrays. While significant performance gains as well as area and power efficiency are theoretically predicted, the realization of such analog accelerators is largely limited by the non-ideal switching characteristics of crosspoint elements. One of the most performance-limiting non-idealities is conductance update asymmetry, which is known to distort the actual weight changes away from the values calculated by error back-propagation and therefore significantly deteriorates neural network training performance. To address this issue algorithmically, the Tiki-Taka algorithm was proposed and shown to be effective for neural network training with asymmetric devices. However, a systematic analysis of the asymmetry specification required to guarantee neural network performance has been lacking. Here, we quantitatively analyze the impact of update asymmetry on training performance with the Tiki-Taka algorithm by exploring the space of asymmetry and hyper-parameters and measuring the classification accuracy. We find that the update asymmetry level of the auxiliary array affects how the optimizer weights the importance of previous gradients, whereas that of the main array affects how frequently those gradients are accepted. We propose a calibration method to find the optimal operating point in terms of device and network parameters. By searching over the hyper-parameter space of the Tiki-Taka algorithm using interpolation and Gaussian filtering, we find the optimal hyper-parameters efficiently and reveal the optimal range of asymmetry, namely the asymmetry specification. Finally, we show that the analysis and calibration method are also applicable to spiking neural networks.
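
As a sketch of the calibration procedure the abstract outlines (the two hyper-parameter axes and the accuracy surface below are synthetic, for illustration only), a coarse, noisy grid of measured accuracies can be smoothed with a Gaussian filter, interpolated onto a finer grid, and then searched for its optimum:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.interpolate import RegularGridInterpolator

lr_grid = np.linspace(-3, -1, 9)           # log10 learning rate (assumed axis)
asym_grid = np.linspace(0.0, 0.8, 9)       # device asymmetry level (assumed axis)
L, A = np.meshgrid(lr_grid, asym_grid, indexing="ij")
rng = np.random.default_rng(1)
# Synthetic, noisy accuracy measurements on the coarse grid (not data from the paper)
acc = 0.98 - 0.5 * (L + 2.0) ** 2 - 0.3 * A ** 2 + 0.01 * rng.normal(size=L.shape)

smoothed = gaussian_filter(acc, sigma=1.0)                     # suppress run-to-run noise
interp = RegularGridInterpolator((lr_grid, asym_grid), smoothed)
fine_lr = np.linspace(-3, -1, 81)
fine_asym = np.linspace(0.0, 0.8, 81)
pts = np.array(np.meshgrid(fine_lr, fine_asym, indexing="ij")).reshape(2, -1).T
fine_acc = interp(pts).reshape(81, 81)                         # interpolated accuracy surface
i, j = np.unravel_index(np.argmax(fine_acc), fine_acc.shape)
print("best log10(lr):", fine_lr[i], " best asymmetry:", fine_asym[j])
```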

4.
Front Neurosci ; 14: 103, 2020.
Article in English | MEDLINE | ID: mdl-32174807

ABSTRACT

Hardware architectures composed of resistive cross-point device arrays can provide significant power and speed benefits for deep neural network training workloads that use the stochastic gradient descent (SGD) and backpropagation (BP) algorithms. The training accuracy on this imminent analog hardware, however, strongly depends on the switching characteristics of the cross-point elements. One of the key requirements is that these resistive devices must change conductance symmetrically when subjected to positive or negative pulse stimuli. Here, we present a new training algorithm, the so-called "Tiki-Taka" algorithm, that eliminates this stringent symmetry requirement. We show that device asymmetry introduces an unintentional implicit cost term into the SGD algorithm, whereas in the "Tiki-Taka" algorithm a coupled dynamical system simultaneously minimizes the original objective function of the neural network and the unintentional cost term due to device asymmetry in a self-consistent fashion. We tested the validity of this new algorithm on a range of network architectures, including fully connected, convolutional, and LSTM networks. Simulation results on these networks show that the accuracy achieved with the conventional SGD algorithm and symmetric (ideal) device switching characteristics is matched by the accuracy achieved with the "Tiki-Taka" algorithm and non-symmetric (non-ideal) device switching characteristics. Moreover, all operations performed on the arrays remain parallel, so the implementation cost of this new algorithm on array architectures is minimal and it maintains the aforementioned power and speed benefits. These algorithmic improvements are crucial to relaxing the material specifications and to realizing technologically viable resistive crossbar arrays that outperform digital accelerators on similar training tasks.
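
For concreteness, the sketch below shows one toy version of such a coupled update (the soft-bounds device model, variable names, and all constants are illustrative assumptions, not the paper's exact rules): gradients are applied to an auxiliary array A through an asymmetric device whose asymmetry also pulls A back toward its symmetry point, and A is periodically transferred into the main array C.

```python
import numpy as np

def tiki_taka_step(A, C, grad, step, lr=0.1, gamma=0.5, transfer_every=10, w_max=1.0):
    """One toy Tiki-Taka-style step (a hedged sketch, not the published rules).
    The gradient is applied to the auxiliary array A through an asymmetric
    soft-bounds device model whose asymmetry also pulls A back toward its
    symmetry point at zero; every `transfer_every` steps A is added into the
    main array C. The network reads its weights as W = gamma * A + C."""
    up, down = grad < 0, grad > 0                                   # desired update direction per weight
    A[up] += lr * np.abs(grad[up]) * (1.0 - A[up] / w_max)          # asymmetric increment
    A[down] -= lr * np.abs(grad[down]) * (1.0 + A[down] / w_max)    # asymmetric decrement
    if (step + 1) % transfer_every == 0:
        C += gamma * A                                              # slow transfer into the main array
    return gamma * A + C                                            # effective weights seen by the network
```

In the full algorithm the transfer itself is performed with parallel analog read and update operations, which is why the implementation cost on the arrays stays minimal.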

5.
Nanotechnology ; 31(2): 025204, 2020 Jan 10.
Article in English | MEDLINE | ID: mdl-31553955

ABSTRACT

Training of deep neural networks (DNNs) is a computationally intensive task that requires massive volumes of data transfer. Performing these operations with conventional von Neumann architectures creates unmanageable time and power costs. Recent studies have shown that mixed-signal designs involving resistive crossbar architectures are capable of achieving acceleration factors as high as 30,000× over state-of-the-art digital processors. These approaches utilize non-volatile memory elements as local processors. However, no technology developed to date satisfies the strict device requirements for the unit cell. This paper presents a superconducting nanowire-based processing element as a crosspoint device. The unit cell has many programmable non-volatile states that can be used to perform analog multiplication. Importantly, these states are intrinsically discrete due to the quantization of flux, which provides symmetric switching characteristics. Operation of these devices in a crossbar is described and verified with electro-thermal circuit simulations. Finally, the concept is validated on an actual DNN training task using an emulator.
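
As a toy contrast with the asymmetric analog devices discussed elsewhere on this page (the numbers below are assumptions, not the paper's device parameters), a flux-quantized crosspoint can be thought of as storing an integer number of flux quanta, with each programming pulse adding or removing exactly one quantum, so positive and negative updates are symmetric by construction:

```python
PHI_STEP = 0.02                       # weight change per flux quantum (assumed scale)
n_quanta = 10                         # integer, non-volatile device state

pulses = [+1, -1] * 250               # equal numbers of up and down programming pulses
for d in pulses:
    n_quanta += d                     # exactly one quantum per pulse, in either direction

weight = n_quanta * PHI_STEP          # the analog multiplication uses this conductance
print(n_quanta, weight)               # 10 and 0.2: equal +/- pulses cancel exactly
```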

6.
Front Neurosci ; 13: 753, 2019.
Article in English | MEDLINE | ID: mdl-31417340

ABSTRACT

Analog arrays are a promising emerging hardware technology with the potential to drastically speed up deep learning. Their main advantage is that they employ analog circuitry to compute matrix-vector products in constant time, irrespective of the size of the matrix. However, ConvNets map very unfavorably onto analog arrays when done in a straightforward manner, because kernel matrices are typically small and the constant-time operation must be iterated sequentially a large number of times. Here, we propose to parallelize training by replicating the kernel matrix of a convolution layer on distinct analog arrays and randomly dividing parts of the compute among them. With this modification, analog arrays execute ConvNets with a large acceleration factor that is proportional to the number of kernel matrices used per layer (tested here for 16-1024 replicas). Despite having more free parameters, we show analytically and in numerical experiments that this new convolution architecture is self-regularizing and implicitly learns similar filters across arrays. We also report superior performance on a number of datasets and increased robustness to adversarial attacks. Our investigation suggests revising the notion that emerging hardware architectures featuring analog arrays for fast matrix-vector multiplication are not suitable for ConvNets.
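
A minimal sketch of the replication idea, assuming an im2col-style formulation and randomly routing each input patch to one of k replicated kernel matrices (array shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
k, out_ch, patch_dim, n_patches = 4, 16, 27, 1024      # e.g. 3x3 kernels over 3 input channels

kernels = rng.normal(scale=0.1, size=(k, out_ch, patch_dim))   # k replicas of the kernel matrix
patches = rng.normal(size=(patch_dim, n_patches))              # im2col'd input patches

assign = rng.integers(0, k, size=n_patches)             # each patch is routed to a random replica
out = np.empty((out_ch, n_patches))
for r in range(k):
    cols = np.where(assign == r)[0]
    out[:, cols] = kernels[r] @ patches[:, cols]         # each replica's MVMs run on its own array

# During training each replica sees (and learns from) only its random share of
# patches, which is the source of the implicit regularization described above.
```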

7.
Front Neurosci ; 12: 745, 2018.
Article in English | MEDLINE | ID: mdl-30405334

ABSTRACT

In our previous work we have shown that resistive cross-point devices, so-called resistive processing unit (RPU) devices, can provide significant power and speed benefits when training deep fully connected networks as well as convolutional neural networks. In this work, we further extend the RPU concept to training recurrent neural networks (RNNs), namely LSTMs. We show that the mapping of recurrent layers is very similar to the mapping of fully connected layers, and therefore the RPU concept can potentially provide large acceleration factors for RNNs as well. In addition, we study the effect of various device imperfections and system parameters on training performance. Symmetry of updates becomes even more crucial for RNNs: already a few percent of asymmetry results in an increase in test error compared to the ideal case trained with floating-point numbers. Furthermore, the input signal resolution to the device arrays needs to be at least 7 bits for successful training. However, we show that a stochastic rounding scheme can reduce the required input signal resolution to 5 bits. Further, we find that RPU device variations and hardware noise are enough to mitigate overfitting, so that there is less need for dropout. Here we attempt to study the validity of the RPU approach by simulating large-scale networks. For instance, the models studied here are roughly 1500 times larger than the more commonly studied multilayer perceptron models trained on the MNIST dataset, in terms of the total number of multiplication and summation operations performed per epoch.
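
As an illustration of the kind of stochastic rounding the abstract refers to (the grid, bit convention, and function below are assumptions, not the exact pulse scheme used on the arrays), rounding up or down with probability given by the fractional part keeps the quantization unbiased on average, which is what allows a lower input resolution:

```python
import numpy as np

def stochastic_round(x, bits=5, x_max=1.0, rng=np.random.default_rng()):
    """Quantize x onto a signed, uniform grid in [-x_max, x_max], rounding up
    or down at random with probability equal to the fractional part so the
    quantization is unbiased on average (an illustrative sketch only)."""
    levels = 2 ** (bits - 1) - 1                   # e.g. 15 positive levels for 5 bits
    scaled = np.clip(x, -x_max, x_max) / x_max * levels
    low = np.floor(scaled)
    frac = scaled - low
    q = low + (rng.random(x.shape) < frac)         # round up with probability `frac`
    return q / levels * x_max

x = np.linspace(-1, 1, 7)
print(np.mean([stochastic_round(x) for _ in range(10000)], axis=0))  # ~ x, unbiased on average
```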

8.
Front Neurosci ; 11: 538, 2017.
Article in English | MEDLINE | ID: mdl-29066942

ABSTRACT

In previous work we detailed the requirements for obtaining the maximal deep learning performance benefit by implementing fully connected deep neural networks (DNNs) in the form of arrays of resistive devices. Here we extend the concept of resistive processing unit (RPU) devices to convolutional neural networks (CNNs). We show how to map the convolutional layers to fully connected RPU arrays such that the parallelism of the hardware can be fully utilized in all three cycles of the backpropagation algorithm. We find that the noise and bound limitations imposed by the analog nature of the computations performed on the arrays significantly affect the training accuracy of CNNs. We present noise and bound management techniques that mitigate these problems without introducing any additional complexity in the analog circuits, since they can be handled by the digital circuits. In addition, we discuss digitally programmable update management and device variability reduction techniques that can be applied selectively to some of the layers in a CNN. We show that a combination of all these techniques enables a successful application of the RPU concept to training CNNs. The techniques discussed here are general and can be applied beyond CNN architectures, and they therefore extend the applicability of the RPU approach to a large class of neural network architectures.
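
The sketch below illustrates, under assumed noise and bound values, the general flavor of such noise and bound management: the input vector is rescaled digitally so it uses the analog dynamic range well, and if the analog output saturates at its bound the input scale is reduced further and the result is rescaled back in digital (the function names and constants are illustrative, not the paper's exact procedure):

```python
import numpy as np

def analog_mvm(W, x, noise_std=0.06, bound=12.0, rng=np.random.default_rng(0)):
    """Stand-in for a noisy, bounded analog matrix-vector multiply."""
    y = W @ x + noise_std * rng.normal(size=W.shape[0])
    return np.clip(y, -bound, bound)                    # outputs saturate at the bound

def managed_mvm(W, x, bound=12.0):
    """Hedged sketch of noise and bound management performed in digital."""
    alpha = float(np.max(np.abs(x))) or 1.0             # noise management: use the full input range
    while True:
        y = analog_mvm(W, x / alpha, bound=bound)
        if np.max(np.abs(y)) < bound:                   # no saturation: accept the result
            return y * alpha                            # undo the scaling digitally
        alpha *= 2.0                                    # bound management: shrink the input further

rng = np.random.default_rng(1)
W, x = rng.normal(size=(32, 64)), 3.0 * rng.normal(size=64)
print(np.max(np.abs(managed_mvm(W, x) - W @ x)))        # small compared with the entries of W @ x
```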

9.
ACS Appl Mater Interfaces ; 9(20): 17024-17033, 2017 May 24.
Article in English | MEDLINE | ID: mdl-28452464

ABSTRACT

Combinations of sub-1-µm absorber films with high-work-function back-surface contact layers are expected to induce internal fields large enough to overcome the adverse effects of bulk defects on thin-film photovoltaic performance, particularly in earth-abundant kesterites. However, back-surface engineering, which includes exfoliation, thinning, and contact-layer optimization, involves numerous experimental challenges. In the present study, a unique combination of nanocharacterization tools, including nano-Auger spectroscopy, Kelvin probe force microscopy (KPFM), and cryogenic focused ion beam measurements, is employed to gauge the possibility of modifying the surface potential at the absorber back surface via direct deposition of high-work-function metal oxides on exfoliated surfaces. Nano-Auger measurements showed large compositional nonuniformities on the exfoliated surfaces, which can be minimized by a brief bromine-methanol etching step. Cross-sectional nano-Auger and KPFM measurements on Au/MoO3/Cu2ZnSn(S,Se)4 (CZTSSe) showed an upward band bending as large as 400 meV within the CZTSSe layer, consistent with the high work function of MoO3, despite Au incorporation into the oxide layer. Density functional theory simulations of the atomic structure of bulk amorphous MoO3 demonstrated the presence of large voids within MoO3 that enable Au in-diffusion. With a less diffusive metal electrode, such as Pt or Pd, upward band bending beyond this level is expected to be achievable.

10.
Front Neurosci ; 10: 333, 2016.
Article in English | MEDLINE | ID: mdl-27493624

ABSTRACT

In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, and pattern extraction. Training large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources for many days. Here we propose the concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update weight values locally, thus minimizing data movement during training and allowing the locality and parallelism of the training algorithm to be fully exploited. We evaluate the effect of various RPU device features/non-idealities and system parameters on performance in order to derive device- and system-level specifications for implementing an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000× compared to state-of-the-art microprocessors while providing a power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines could be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators would be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and the integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
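
As a toy illustration of the fully parallel update that makes this acceleration possible (pulse-train details and constants below are assumptions, not the exact RPU scheme), the column inputs and row errors can be encoded as stochastic pulse trains so that pulse coincidences at each crosspoint implement, on average, the outer-product weight update without computing it digitally:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, BL = 4, 6, 100                      # BL = number of pulse slots (assumed)
dw = 0.001                                       # conductance change per coincidence (assumed)

x = rng.uniform(0, 1, size=n_in)                 # column activations
delta = rng.uniform(-1, 1, size=n_out)           # row error signals

x_pulses = rng.random((BL, n_in)) < x                 # pulse train per column, P(pulse) = x_j
d_pulses = rng.random((BL, n_out)) < np.abs(delta)    # pulse train per row, P(pulse) = |delta_i|
sign = np.sign(delta)[:, None]                        # pulse polarity carries the sign of delta

coincidences = d_pulses.astype(float).T @ x_pulses.astype(float)  # per-crosspoint coincidence counts
W_update = -dw * sign * coincidences                  # update applied in place, fully in parallel

# Deviation from the expected outer-product update is small for BL = 100 slots:
print(np.abs(W_update + dw * BL * np.outer(delta, x)).max())
```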

11.
Adv Mater ; 26(44): 7427-31, 2014 Nov 26.
Article in English | MEDLINE | ID: mdl-25155874

ABSTRACT

High-efficiency Cu2ZnSn(S,Se)4 solar cells are reported by applying In2S3/CdS double emitters. This new structure offers a high doping concentration within the Cu2ZnSn(S,Se)4 solar cells, resulting in a substantial enhancement in open-circuit voltage. A 12.4%-efficient device is obtained with a record-low open-circuit voltage deficit of 593 mV.
