Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580-587
Abstract
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
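The first insight, computing features on each bottom-up region proposal and then scoring it, can be illustrated with a toy sketch. This is not the authors' implementation (their code is at the URL above). The box coordinates, the 8x8 patch size, the nearest-neighbor resize, and the names `warp_region`, `toy_features`, and `score_regions` are all assumptions made for illustration, and a flattening function stands in for the CNN.

```python
import numpy as np

def warp_region(image, box, size=8):
    """Crop box (x0, y0, x1, y1) and resize it to size x size.

    Nearest-neighbor sampling is an assumption chosen for brevity.
    """
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]

def toy_features(patch):
    """Hypothetical stand-in for a CNN: flatten the fixed-size patch."""
    return patch.astype(np.float64).ravel()

def score_regions(image, boxes, weights, size=8):
    """Score every proposal with one linear scorer over its region features."""
    feats = np.stack([toy_features(warp_region(image, b, size)) for b in boxes])
    return feats @ weights

rng = np.random.default_rng(0)
image = rng.random((32, 48))                                # toy grayscale image (H x W)
boxes = [(0, 0, 16, 16), (10, 5, 40, 30), (20, 8, 48, 32)]  # made-up proposals
weights = rng.standard_normal(8 * 8)                        # made-up linear scorer
scores = score_regions(image, boxes, weights)               # one score per proposal
```

Because every proposal is mapped to the same fixed-size input, a single feature extractor and scorer can be shared across regions of any shape, which is the structural point the sketch is meant to show.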
Related Material
[pdf] [bibtex]
@InProceedings{Girshick_2014_CVPR,
author = {Girshick, Ross and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra},
title = {Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2014}
}