Search | VHL Regional Portal

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description.

Donahue, Jeff; Hendricks, Lisa Anne; Rohrbach, Marcus; Venugopalan, Subhashini; Guadarrama, Sergio; Saenko, Kate; Darrell, Trevor.

IEEE Trans Pattern Anal Mach Intell ; 39(4): 677-691, 2017 04.

Article in English | MEDLINE | ID: mdl-27608449

ABSTRACT

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.

Region-Based Convolutional Networks for Accurate Object Detection and Segmentation.

Girshick, Ross; Donahue, Jeff; Darrell, Trevor; Malik, Jitendra.

IEEE Trans Pattern Anal Mach Intell ; 38(1): 142-58, 2016 Jan.

Article in English | MEDLINE | ID: mdl-26656583

ABSTRACT

Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 50 percent relative to the previous best result on VOC 2012-achieving a mAP of 62.4 percent. Our approach combines two ideas: (1) one can apply high-capacity convolutional networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data are scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, boosts performance significantly. Since we combine region proposals with CNNs, we call the resulting model an R-CNN or Region-based Convolutional Network. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL