Regions with Convolutional Neural Network

R-CNN is a state-of-the-art visual object detection system that combines bottom-up region proposals with rich features computed by a convolutional neural network. At the time of its release, R-CNN improved the previous best detection performance on PASCAL VOC 2012 by 30% relative, going from 40.9% to 53.3% mean average precision. Unlike the previous best results, R-CNN achieves this performance without using contextual rescoring or an ensemble of feature types.
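The "30% relative" figure above follows directly from the two mAP values quoted in the same sentence. A small sketch of the arithmetic; the function name relative_improvement is only an illustrative helper, not something from the R-CNN code base.

```python
def relative_improvement(old_map, new_map):
    """Relative gain of new_map over old_map, in percent."""
    return (new_map - old_map) / old_map * 100.0

# mAP values on PASCAL VOC 2012 as quoted in the text.
gain = relative_improvement(40.9, 53.3)
print(f"{gain:.1f}% relative improvement")  # prints "30.3% relative improvement"
```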

R-CNN

The algorithm can be summarized as follows.

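The first object detection network based on region proposals.
R-CNN generates 2,000 candidate regions (windows) per image and evaluates each one.
Region proposal algorithms include Selective Search and EdgeBox.
An existing classification CNN is reused to compute features.
Because the 2,000 regions generated within a test image overlap one another, the amount of computation is large.
Each candidate region (window) is warped to 227x227¹ pixels and passed through the network (CNN) to generate features.
Finally, these features are classified with an SVM.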
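The steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: propose_regions returns random boxes as a stand-in for Selective Search or EdgeBox, extract_features is a toy placeholder for the CNN, and the SVM weights are made-up numbers. Only the overall flow (propose, warp to 227x227, extract features, score with per-class linear SVMs) comes from the text.

```python
import random

WARP_SIZE = 227  # side length of the fixed CNN input, from the text


def propose_regions(width, height, count=2000, seed=0):
    """Stand-in for Selective Search / EdgeBox: random boxes (x0, y0, x1, y1)."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(count):
        x0 = rng.randrange(0, width)
        y0 = rng.randrange(0, height)
        x1 = rng.randrange(x0 + 1, width + 1)   # exclusive right edge
        y1 = rng.randrange(y0 + 1, height + 1)  # exclusive bottom edge
        boxes.append((x0, y0, x1, y1))
    return boxes


def warp(image, box, size=WARP_SIZE):
    """Nearest-neighbour resize of the box to size x size, ignoring aspect ratio."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return [[image[y0 + (r * h) // size][x0 + (c * w) // size]
             for c in range(size)]
            for r in range(size)]


def extract_features(patch):
    """Toy placeholder for the CNN: mean intensity of each image quadrant."""
    n = len(patch)
    half = n // 2
    feats = []
    for rows in (range(0, half), range(half, n)):
        for cols in (range(0, half), range(half, n)):
            total = sum(patch[r][c] for r in rows for c in cols)
            feats.append(total / (len(rows) * len(cols)))
    return feats


def classify(features, svms):
    """Score with one linear SVM per class (w . x + b); return the best class."""
    scores = {label: sum(w * x for w, x in zip(weights, features)) + bias
              for label, (weights, bias) in svms.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def detect(image, svms, count=2000, size=WARP_SIZE):
    """Run every proposal through warp -> features -> SVM scoring."""
    height, width = len(image), len(image[0])
    results = []
    for box in propose_regions(width, height, count):
        patch = warp(image, box, size)
        label, score = classify(extract_features(patch), svms)
        results.append((box, label, score))
    return results


# Tiny synthetic grayscale image and made-up SVM weights, for illustration only.
demo_image = [[(x + y) % 256 for x in range(64)] for y in range(48)]
demo_svms = {"cat": ([0.01, 0.0, 0.0, 0.0], -1.0),
             "dog": ([0.0, 0.0, 0.0, 0.01], -1.0)}
for box, label, score in detect(demo_image, demo_svms, count=3):
    print(box, label, round(score, 3))
```

Note that detect calls the feature extractor once per proposal. With the default of 2,000 proposals this means 2,000 separate passes per image, which is the cost the summary above points to.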

Performance is good, but R-CNN has the following drawbacks.

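Each region proposal is warped or cropped before the CNN classifies it, so the resulting image distortion or loss degrades performance.
About 2,000 region proposals are extracted and the CNN is run on every one of them, which makes detection slow.
Region proposal algorithms run on the CPU and gain nothing from fast GPU computation, so they act as a bottleneck between the CPU and the GPU.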
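To make the first drawback concrete, the sketch below computes how much a proposal is stretched when it is forced into the fixed 227x227 input. The box sizes are hypothetical examples, and warp_scales is an illustrative helper rather than part of any R-CNN release.

```python
WARP_SIZE = 227  # fixed CNN input side length, from the text


def warp_scales(box_width, box_height, size=WARP_SIZE):
    """Return (horizontal scale, vertical scale, aspect distortion) for a warp."""
    scale_x = size / box_width
    scale_y = size / box_height
    # Distortion is scale_y / scale_x, which simplifies to width / height:
    # 1.0 means the shape is preserved, anything else means stretching.
    distortion = box_width / box_height
    return scale_x, scale_y, distortion


# Hypothetical proposals: a square box and a wide, flat box.
for width, height in [(227, 227), (400, 100)]:
    sx, sy, d = warp_scales(width, height)
    print(f"{width}x{height}: x scale {sx:.2f}, y scale {sy:.2f}, distortion {d:.1f}")
```

A 400x100 proposal is shrunk horizontally and enlarged vertically, so the object inside it appears four times taller relative to its width than it did in the original image.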

Of the three drawbacks above, the first and second were addressed by SPPnet. The second is also addressed by Fast-RCNN, introduced later, and the third by Faster-RCNN.

RCNN-Object_detection_system_overview.png

Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [39] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat [34], which had the previous best result at 24.3%.