Faster-RCNN:Paper

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

Korean translation of the Faster R-CNN paper.

Abstract

ENG

KOR

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet ⁴⁰ and Fast R-CNN ⁴¹ have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model ⁴², our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

최첨단 물체 감지 네트워크는 객체의 위치를 가설 영역 제안 알고리즘에 따라 달라집니다. SPPnet ⁴³과 같은 빠른 진보는 R-CNN ⁴⁴ 병목 영역 제안 연산 노출이 검출 네트워크의 실행 시간을 단축하고있다. 이 작품에서 우리는 따라서 거의 비용이없는 지역의 제안을 가능하게 검출 네트워크와 전체 이미지 길쌈 기능을 공유하는 지역 제안 네트워크 (RPN)를 소개합니다. RPN 동시에 각각의 위치에서 경계와 objectness 점수를 객체 예측 완전 컨볼 루션 네트워크입니다. RPN 검출 빠른 R-CNN으로 사용되는 고품질 영역 제안을 생성하기 위해 엔드 - 투 - 엔드 훈련된다. 우리는 또한 "주의"메커니즘 신경 네트워크의 최근 인기있는 용어를 사용하여 기능 - 자신의 길쌈을 공유하여 하나의 네트워크로 RPN 및 고속 R-CNN을 병합, RPN 구성 요소가 어디 있는지하는 통합 네트워크를 말한다. 매우 깊은 VGG-16 모델 ⁴⁵의 경우, 우리의 검출 시스템은, GPU에 (모든 단계를 포함) 5fps의 프레임 레이트를 갖지만 PASCAL VOC 2007 2012 최첨단 객체 검출의 정확도를 달성 한 MS COCO 이미지 당 300의 제안으로 데이터 세트. ILSVRC 및 COCO 2015 대회에서 빠른 R-CNN과 RPN 여러 트랙의 항목을이기는 첫번째 장소의 기초입니다. 코드는 공개적으로 사용할 수있게되었습니다.

Index Terms: Object Detection, Region Proposal, Convolutional Neural Network.

INTRODUCTION

ENG	KOR
Recent advances in object detection are driven by the success of region proposal methods (e.g., ⁴⁶) and region-based convolutional neural networks (RCNNs) ⁴⁷. Although region-based CNNs were computationally expensive as originally developed in ⁴⁸, their cost has been drastically reduced thanks to sharing convolutions across proposals ⁴⁹, ⁵⁰. The latest incarnation, Fast R-CNN ⁵¹, achieves near real-time rates using very deep networks ⁵², when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.	물체 검출 최근 발전은 영역 제안 방법의 성공에 의해 구동된다 (예컨대, ⁵³)과 영역 기반 컨볼 루션 신경망 (RCNNs) ⁵⁴. 지역 기반 CNNs 원래 ⁵⁵ 개발로 계산 비싼했지만, 그 비용이 대폭 제안을 통해 공유 회선 덕분에 감소되었다 ⁵⁶, ⁵⁷. 최신 화신, ⁵⁸ 빠른 R-CNN은 매우 깊은 네트워크를 사용하여 실시간 요금 근처에 달성 ⁵⁹, 지역의 제안에 소요되는 시간을 무시하는 경우. 지금, 제안 최첨단 탐지 시스템의 테스트 시간 계산 병목 현상입니다.
Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search ⁶⁰, one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks ⁶¹, Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes ⁶² currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.	지역 제안 방법은 일반적으로 저렴 기능과 경제적 인 추론 방식에 의존하고있다. 선택적 검색 ⁶³, 가장 인기있는 방법 중 하나는 탐욕 설계 낮은 수준의 기능에 기초하여 슈퍼 픽셀을 병합. 효율적인 탐지 네트워크 ⁶⁴에 비해 그러나, 선택적 검색은 크기의 순서는 CPU 구현의 이미지에 2 초, 느립니다. EdgeBoxes ⁶⁵ 현재 이미지 당 0.2 초, 제안 품질과 속도의 최적 균형을 제공합니다. 그럼에도 불구하고, 영역 제안서 단계는 여전히 검출 네트워크만큼 실행 시간을 소비한다.
One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to reimplement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.	하나의 연구에서 사용되는 영역 제안 방법은 CPU 상에 구현되는 동안 고속 영역 기반 CNNs 같은 런타임 비교는 불공평하게, GPU를 활용할 수 있습니다. 제안 계산을 가속화하는 확실한 방법은 GPU를 위해 그것을 다시 구현하는 것입니다. 이것은 효과적인 엔지니어링 용액 일 수도 있지만 재 구현 하류 감지 네트워크를 무시하고, 따라서 계산을 공유하는 중요한 기회를 그리워.
In this paper, we show that an algorithmic change—computing proposals with a deep convolutional neural network—leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks ⁶⁶, ⁶⁷. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).	본 논문에서는 보여 그 제안의 계산이 거의 비용없이 검출 네트워크의 계산 주어 우아하고 효율적인 솔루션에 깊은 길쌈 신경망 - 리드 알고리즘 변경 컴퓨팅 제안. 이를 위해 첨단 물체 감지 네트워크와 컨벌루션 층 주 신규 지역 제안 통신망 (RPNs)를 도입 ⁶⁸, ⁶⁹. 시험시 회선을 공유함으로써 제안 산출 한계 비용이 작은 (예를 들면, 화상은 10ms마다).
Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) ⁷⁰ and can be trained end-to-end specifically for the task of generating detection proposals.	우리의 관찰 영역 기반 검출기에서 사용 컨벌루션 기능 맵 빠르고 RCNN 같은 영역도 제안을 생성하기 위해 사용될 수 있다는 것이다. 이러한 길쌈 기능의 정상에, 우리는 동시에 일반 그리드에 각 위치에서의 영역 경계와 objectness 점수를 퇴행 몇 가지 추가 길쌈 레이어를 추가하여 RPN을 구성. RPN 따라서 완전 컨볼 루션 네트워크 (FCN)의 일종이다 ⁷¹ 및 검출 제안을 생성하는 작업을 위해 특별히 엔드-투-엔드 훈련 될 수있다.
RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods ⁷², ⁷³, ⁷⁴, ⁷⁵ that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.	RPNs 효율적 비늘 종횡비 광범위한 영역 제안을 예측하도록 설계된다. 널리 방법 ⁷⁶, ⁷⁷ ⁷⁸, ⁷⁹ 화상 피라미드를 사용하는 (도 1) 또는 필터의 피라미드 (도 1, b), 우리가 도입 신규 "앵커"박스 달리 그 여러 저울과 가로 세로 비율에서 참조 역할을합니다. 우리의 방식은 다수의 비늘이나 종횡비 이미지 또는 필터를 열거 피할 회귀 참조 피라미드 (c도 1)로 간주 할 수있다. 훈련 및 단일 스케일 이미지와 속도를 실행하여 혜택을 사용하여 테스트 할 때이 모델은 잘 수행합니다.

Faster-RCNN_-_figure1.jpg
Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.

ENG	KOR
To unify RPNs with Fast R-CNN ⁸⁰ object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks. (Since the publication of the conference version of this paper ⁸¹, we have also found that RPNs can be trained jointly with Fast R-CNN networks leading to less training time)	빠른 R-CNN ⁸² 물체 감지 네트워크와 RPNs를 통합하기 위해, 우리는 고정 된 제안을 유지하면서, 지역의 제안서 작업 및 객체 검출을위한 다음 미세 조정을위한 미세 조정 번갈아 훈련 기법을 제안한다. 이 방식은 빠르게 수렴하고 두 작업 사이에 공유되는 길쌈 기능과 함께 통합 네트워크를 생성합니다. (본 논문 ⁸³의 컨퍼런스 버전의 출판 이후, 우리는 또한 RPNs가 공동으로 빠른 R-CNN 네트워크가 적은 교육 시간에 선도적으로 훈련 할 수 있음을 발견했다)
We comprehensively evaluate our method on the PASCAL VOC detection benchmarks ⁸⁴ where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time—the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of ⁸⁵, our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset ⁸⁶ and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).	우리는 포괄적으로 PASCAL VOC 검출 벤치 마크에 우리의 방법을 평가 ⁸⁷ 여기서 빠른 R-CNNs 선택적 검색의 강한베이스 라인보다 더 빠른 R-CNNs 생산 검출 정확도와 RPNs. 한편, 우리의 방법에 선택적 검색의 거의 모든 계산 부담 포기 테스트 시간을-제안에 대한 효과적인 실행 시간은 10 밀리 초입니다. 따라서, ⁸⁸, 우리의 검출 방법이 여전히 GPU에 (모든 단계를 포함) 5fps의 프레임 레이트를 가지고의 비용이 매우 깊은 모델을 사용하고,은 속도와 정확성 모두의 측면에서 실제 객체 검출 시스템이다. 우리는 또한 MS 코코 데이터 세트 ⁸⁹에 결과를보고하고 COCO 데이터를 사용하여 PASCAL VOC에 대한 개선 사항을 조사. 코드 (MATLAB에서) https://github.com/shaoqingren/faster_rcnn에서 공개적으로 제공하고 https://github.com/rbgirshick/py-faster-rcnn (파이썬)되었습니다.
A preliminary version of this manuscript was published previously ⁹⁰. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection ⁹¹, part-based detection ⁹², instance segmentation ⁹³, and image captioning ⁹⁴. Our fast and effective object detection system has also been built in commercial systems such as at Pinterests ⁹⁵, with user engagement improvements reported.	이 원고의 예비 버전은 이전에 ⁹⁶ 출판되었다. 이후 RPN 빠른 R-CNN의 틀이 채택 된 그러한 3D 객체 검출과 같은 다른 방법으로 일반화 ⁹⁷ 부분 기반 탐지 ⁹⁸ 예 세그먼트 ⁹⁹, 및 화상 자막 ¹⁰⁰ . 우리의 신속하고 효과적인 물체 감지 시스템도보고 사용자 참여의 개선 등 Pinterests ¹⁰¹에서와 같은 상용 시스템에 내장하고있다.
In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries ¹⁰² in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in ¹⁰³). Faster R-CNN and RPN are also used by several other leading entries in these competitions ( http://image-net.org/challenges/LSVRC/2015/results ). These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy	ILSVRC 및 COCO 2015 대회에서 빠른 R-CNN과 RPN은 ImageNet 감지, ImageNet 현지화, 코코 감지, 코코 분할의 트랙에서 여러 1 위를 항목 ¹⁰⁴의 기초가된다. RPNs 완전히 쉽게 더 깊고 더 표현 기능을 활용할 수 있습니다 따라서 데이터에서 영역을 제안 배우고, (예 년에 채택 된 101 층의 잔류 그물로 ¹⁰⁵). 빠른 R-CNN과 RPN은 다음 콘테스트에 여러 다른 주요 항목에 의해 사용된다 ( http://image-net.org/challenges/LSVRC/2015/results ). 이러한 결과는 우리의 방법은 실제 사용에 대한 비용 효율적인 솔루션뿐만 아니라 객체 검출의 정확도를 향상시키는 효과적인 방법뿐만 아니라 제안

S. Ren is with University of Science and Technology of China, Hefei, China. This work was done when S. Ren was an intern at Microsoft Research.
K. He and J. Sun are with Visual Computing Group, Microsoft Research. E-mail: {kahe,jiansun}@microsoft.com
R. Girshick is with Facebook AI Research. The majority of this work was done when R. Girshick was with Microsoft Research.

ENG	KOR
Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in ¹⁰⁶, ¹⁰⁷, ¹⁰⁸. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search ¹⁰⁹, CPMC ¹¹⁰, MCG ¹¹¹) and those based on sliding windows (e.g., objectness in windows ¹¹², EdgeBoxes ¹¹³). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search ¹¹⁴ object detectors, RCNN ¹¹⁵, and Fast R-CNN ¹¹⁶).	개체 제안. 오브젝트 제안 방법에 큰 문헌이있다. 포괄적 인 설문 조사 및 개체 제안 방법의 비교는 ¹¹⁷, ¹¹⁸, ¹¹⁹에서 확인할 수 있습니다. 널리 개체 제안 방법 등이 사용되는 슈퍼 픽셀을 그룹화를 기반으로하는 (예를 들어, 선택적 검색 ¹²⁰, CPMC ¹²¹, MCG ¹²²) 및 그 슬라이딩 윈도우 (예를 들어, objectness 창에서 ¹²³, EdgeBoxes ¹²⁴기반으로). 개체 제안 방법은 검출기의 외부 모듈을 독립적으로 채택되었다 (예를 들어, 선택적 검색 ¹²⁵ 개체 감지기, RCNN ¹²⁶, 및 고속 R-CNN ¹²⁷).
Deep Networks for Object Detection. The R-CNN method ¹²⁸ trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly acts as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in ¹²⁹). Several papers have proposed ways of using deep networks for predicting object bounding boxes ¹³⁰, ¹³¹, ¹³², ¹³³. In the OverFeat method ¹³⁴, a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods ¹³⁵, ¹³⁶ generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the “single-box” fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN ¹³⁷. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method ¹³⁸ is developed for learning segmentation proposals.	물체 감지에 대한 깊은 네트워크. R-CNN 방법 ¹³⁹ 오브젝트 카테고리 또는 배경으로 제안 영역 분류 CNNs의 종단 열차. R-CNN은 주로 분류로 재생, 그리고 (상자 회귀를 경계로 정제 제외) 객체의 경계를 예측하지 않습니다. 그 정확성이 지역 제안 모듈의 성능에 따라 달라집니다 (의 비교를 참조하십시오 ¹⁴⁰). 여러 논문 오브젝트 경계 박스 ¹⁴¹을 예측 깊은 네트워크를 사용하는 방법을 제안 하였다 ¹⁴², ¹⁴³, ¹⁴⁴. OverFeat 방법 ¹⁴⁵에서 완벽하게 연결 층은 상자가 하나의 개체를 가정 현지화 작업에 대한 좌표를 예측하는 훈련을한다. 완전 연결 층은 여러 class-specific 객체를 검출하는 길쌈 층으로 설정되어 있습니다. 멀티 박스 방법 ¹⁴⁶, ¹⁴⁷ OverFeat의 single-box 패션 일반화, 마지막으로 완전히 연결된 계층 동시에 여러 클래스에 얽매이지 박스를 예측하는 네트워크 영역 제안을 생성합니다. 이 클래스에 얽매이지 박스는 R-CNN ¹⁴⁸의 제안으로 사용됩니다. 멀티 박스 제안서 네트워크는 우리의 완전 컨볼 루션 방식 대조적으로, (예를 들어 224 × 224)를 하나의 이미지 자르기 또는 여러 큰 이미지 작물에 적용됩니다. 멀티 박스는 제안 및 탐지 네트워크 사이의 기능을 공유하지 않습니다. 우리는 우리의 방법과 관련하여 나중에 더 깊이 OverFeat 및 멀티 박스에 대해 설명합니다. 우리의 작업과 동시, DeepMask 방법 ¹⁴⁹ 분할 제안을 학습 개발되고있다.
Shared computation of convolutions ¹⁵⁰, ¹⁵¹, ¹⁵², ¹⁵³, ¹⁵⁴ has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper ¹⁵⁵ computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) ¹⁵⁶ on shared convolutional feature maps is developed for efficient region-based object detection ¹⁵⁷, ¹⁵⁸ and semantic segmentation ¹⁵⁹. Fast R-CNN ¹⁶⁰ enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.	컨볼 루션의 연산 공유 ¹⁶¹ ¹⁶², ¹⁶³, ¹⁶⁴, ¹⁶⁵, 정확하면서도 효율적 시인성 향상을 위해 관심을 받고있다. OverFeat 용지 ¹⁶⁶ 분류, 현지화 및 검출을위한 이미지 피라미드에서 길쌈 기능을 계산한다. 적응 크기 풀링 (SPP)은 ¹⁶⁷ 공유 길쌈 기능지도에 효율적으로 지역 기반 물체 검출 ¹⁶⁸, ¹⁶⁹과 의미 분할 ¹⁷⁰가 개발되고있다. 빠른 R-CNN ¹⁷¹ 공유 길쌈 기능에 대한 엔드 - 투 - 엔드 검출기 교육을 가능하게하고 뛰어난 정확도와 속도를 보여줍니다.

FASTER R-CNN

Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the 'attention' of this unified network.

ENG

KOR

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector ¹⁷² that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with ‘attention’ ¹⁷³ mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

빠른 R-CNN라고 우리 물체 감지 시스템은 두 개의 모듈로 구성된다. 제 모듈 영역을 제안 깊은 완전 컨볼 루션 네트워크이고, 제 2 모듈은 제안 된 영역을 사용하는 고속 R-CNN 검출기 ¹⁷⁴이다. 전체 시스템은 물체 감지 (그림 2)에 대한 하나의 통합 된 네트워크입니다. '관심'¹⁷⁵ 메커니즘 신경 네트워크의 최근 인기있는 용어를 사용하여, RPN 모듈은 어디 있는지하는 빠른 R-CNN 모듈을 알려줍니다. 3.1 절에서 우리는 디자인과 지역의 제안에 대한 네트워크의 특성을 소개합니다. 3.2 절에서 우리는 공유 기능이 두 모듈 훈련을위한 알고리즘을 개발한다.

Region Proposal Networks

ENG	KOR
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. (“Region” is a generic term and in this paper we only consider rectangular regions, as is common for many methods (e.g., ¹⁷⁶, ¹⁷⁷, ¹⁷⁸). “Objectness” measures membership to a set of object classes vs. background.) We model this process with a fully convolutional network ¹⁷⁹, which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network ¹⁸⁰, we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model ¹⁸¹ (ZF), which has 5 shareable convolutional layers and the Simonyan and Zisserman model ¹⁸² (VGG-16), which has 13 shareable convolutional layers.	지역 제안 네트워크 (RPN)를 입력으로 (어떤 크기의) 이미지를 받아 직사각형 물체 제안의 집합을 출력 objectness score. ("지역"는 일반적인 용어이며,이 논문에서 우리는 사각형 영역을 고려, 공통으로 많은 방법 (예를 들어, ¹⁸³, ¹⁸⁴, ¹⁸⁵). 객체 클래스 대 배경의 세트에 "Objectness"측정 회원.) 각각 우리는 우리가 설명 완전 컨볼 루션 네트워크 ¹⁸⁶과이 프로세스를 모델링 이 섹션. 우리의 궁극적 인 목표는 빠른 R-CNN 물체 감지 네트워크 ¹⁸⁷로 계산을 공유 할 수 있기 때문에, 우리는 두 그물 길쌈 층 공통 세트를 공유하는 것으로 가정한다. 우리의 실험에서, 우리는 조사 ZEILER 5 공유 길쌈 층을 가지고 있으며, 퍼거스 모델 ¹⁸⁸ (ZF), Simonyan 및 Zisserman 모델 ¹⁸⁹ (13) 공유 길쌈 층이 (VGG-16),.
To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU ¹⁹⁰ following). This feature is fed into two sibling fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).	지역의 제안을 생성하기 위해, 우리는 마지막 공유 길쌈 층에 의해 길쌈 기능지도 출력 위에 작은 네트워크를 밀어 넣습니다. 이 작은 네트워크는 입력으로 입력 길쌈 기능지도 n 개의 공간 창 × N 걸립니다. 각 슬라이딩 윈도우는 (ReLU 다음 ¹⁹¹와 함께, ZF 256-D와 VGG 512-D) 낮은 차원 기능에 매핑됩니다. 이 기능은 두 형제 fully-connected 층 - 상자 회귀 층 (REG)과 상자 분류 층 (CLS)로 공급된다. 우리는 입력 영상의 효과적인 수용 필드 (ZF와 VGG 171 및 228 픽셀, 각각) 큰 것을주의, 본 논문에서 N = 3 사용합니다. 이 미니 네트워크는 그림 3 (왼쪽)에서 하나의 위치에 도시되어있다. 미니 네트워크가 슬라이딩 윈도우 방식으로 작동하기 때문에, 완전 연결 층은 모든 공간 위치에서 공유되어 있습니다. 이 아키텍처는 자연스럽게 (등록 번호 및 CLS에 대한 각각) 1 × 1 길쌈 층 형제 두 뒤에 N 길쌈 층 × N으로 구현된다.
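The sliding-window mini-network described above can be sketched in plain Python. This is an illustrative toy, not the released implementation: the helper names (`window`, `rpn_head`), the zero padding at the border, the tiny `mid_dim`, and the random weights are all assumptions made for the example. The output widths 2k and 4k follow the cls and reg layers described in the Anchors subsection.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def linear(weights, bias, values):
    # weights is a list of rows; each row has len(values) entries.
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]

def random_layer(n_out, n_in, rng, std=0.01):
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def window(feature_map, row, col, n):
    """Flatten the n x n window centred at (row, col); zero padding is an assumption."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    half = n // 2
    flat = []
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                flat.extend(feature_map[r][c])
            else:
                flat.extend([0.0] * channels)
    return flat

def rpn_head(feature_map, k=9, n=3, mid_dim=8, seed=0):
    """Apply one shared mini-network at every position of an H x W x C feature map."""
    rng = random.Random(seed)
    channels = len(feature_map[0][0])
    shared = random_layer(mid_dim, n * n * channels, rng)  # acts like an n x n conv
    cls = random_layer(2 * k, mid_dim, rng)                # acts like a 1 x 1 conv
    reg = random_layer(4 * k, mid_dim, rng)                # acts like a 1 x 1 conv
    cls_out, reg_out = [], []
    for row in range(len(feature_map)):
        cls_row, reg_row = [], []
        for col in range(len(feature_map[0])):
            hidden = relu(linear(*shared, window(feature_map, row, col, n)))
            cls_row.append(linear(*cls, hidden))
            reg_row.append(linear(*reg, hidden))
        cls_out.append(cls_row)
        reg_out.append(reg_row)
    return cls_out, reg_out
```

Because the same three weight sets are reused at every position, the loop is equivalent to an n × n convolution followed by two sibling 1 × 1 convolutions, which is how the text says the architecture is naturally implemented.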

Faster-RCNN_-_figure3.jpg
Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.

Anchors

ENG	KOR
At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate the probability of object or not object for each proposal. (For simplicity we implement the cls layer as a two-class softmax layer. Alternatively, one may use logistic regression to produce k scores.) The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W × H (typically ∼2,400), there are W × H × k anchors in total.	각 슬라이딩 윈도우 위치에서, 우리는 동시에 각각의 위치에 대한 최대 가능한 제안의 수 (K)로 표시되는 여러 영역 제안을 예측한다. 그래서 등록 층 CLS 층을 출력 대상의 확률을 추정 각 제안서 반대하지 2K 점수 K 박스 좌표를 코딩 4K 출력을 가지며,(단순화를 위해 우리는 두 클래스 softmax를 층으로 CLS 층을 구현합니다. 대안 적으로, 하나는 K 점수를 생성하기 위해 로지스틱 회귀 분석을 이용할 수있다.). K 제안은 우리가 앵커를 호출 기준 상자를, 케이에 대해 매개 변수가 있습니다. 앵커는 해당 이동 구간을 중심으로하고, (좌측,도 3) 스케일의 종횡비와 관련된다. 기본적으로 우리는 각각의 슬라이딩 위치에 K = 9 앵커를 산출, 3 저울과 3 화면 비율을 사용합니다. W 크기의 길쌈 기능지도를 들면 H × (일반적으로 ~2,400), 총 W × H × k 앵커가 있습니다.
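As a rough illustration of the anchor layout, the sketch below enumerates k = 9 anchors per position and W × H × k anchors in total. The concrete scales (128, 256, 512 pixels), the aspect ratios (1:2, 1:1, 2:1), and the stride of 16 are assumptions taken from the paper's implementation details rather than from this section. The function names and the choice to centre each anchor in the middle of its stride cell are hypothetical.

```python
import math

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at the origin."""
    anchors = []
    for scale in scales:
        area = float(scale * scale)
        for ratio in ratios:            # ratio is height / width
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors

def all_anchors(feat_w, feat_h, stride=16):
    """Place the same k base anchors at every position of a feat_w x feat_h map."""
    base = base_anchors()
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            cx = (gx + 0.5) * stride    # centring convention is an assumption
            cy = (gy + 0.5) * stride
            for x1, y1, x2, y2 in base:
                anchors.append((x1 + cx, y1 + cy, x2 + cx, y2 + cy))
    return anchors

anchors = all_anchors(60, 40)           # W x H = 2,400 positions
print(len(anchors))                     # 60 * 40 * 9 anchors
```

Every position reuses the same nine reference shapes, so moving one grid cell to the right shifts each anchor by exactly one stride.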
Translation-Invariant Anchors	변환-불변 앵커
An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method (As is the case of FCNs ¹⁹², our network is translation invariant up to the network’s total stride.). As a comparison, the MultiBox method ¹⁹³ uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.	우리의 접근의 중요한 속성은 앵커의 용어와 앵커에 대해 제안을 계산하는 기능을 모두 번역 불변 것입니다. 하나의 이미지에서 객체를 해석하는 경우 제안은 번역되어야하고 동일한 기능을 어느 위치에 제안을 예측할 수 있어야한다. 이 번역 불변 속성은 우리의 방법에 의해 보장된다 (FCNs의 경우와 마찬가지로 ¹⁹⁴, 우리의 네트워크는 번역은 네트워크의 총 보폭까지 불변이다.). 비교, 멀티 박스 방법 ¹⁹⁵ 불변 번역하지 않은 800 앵커를 생성하는 K-수단을 사용합니다. 그래서 멀티 박스는 객체가 변환되는 경우 동일한 제안이 생성되는 것을 보장하지 않습니다.
The translation-invariant property also reduces the model size. MultiBox has a (4 + 1) × 800-dimensional fully-connected output layer, whereas our method has a (4 + 2) × 9-dimensional convolutional output layer in the case of k = 9 anchors. As a result, our output layer has 2.8 × 10^4 parameters (512 × (4 + 2) × 9 for VGG-16), two orders of magnitude fewer than MultiBox’s output layer that has 6.1 × 10^6 parameters (1536 × (4 + 1) × 800 for GoogleNet ¹⁹⁶ in MultiBox ¹⁹⁷). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox. (Considering the feature projection layers, our proposal layers’ parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 10^6; MultiBox’s proposal layers’ parameter count is 7 × 7 × (64 + 96 + 64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 10^6.) We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.	번역 불변 속성은 또한 모델의 크기를 줄일 수 있습니다. 우리의 방법은 K = 9의 앵커 경우 (4 + 2) × 9 차원 컨벌루션 출력 층을 갖는 반면, 멀티 박스, (4 + 1) × 800 차원 완전히 연결된 출력 층을 갖는다. 그 결과, 우리 출력층 2.8 × 10^4 파라미터 (512 × (4 + 2) VGG-16 × 9) 6.1 × 10^6 파라미터 (1536 × (4 + 1을 갖는 멀티 박스의 출력 층보다 적은 2 차의 크기를 가지고 ) ¹⁹⁸) 멀티 박스에서 GoogleNet ¹⁹⁹의 800 ×. 기능 투영 층을 고려하면 제안서 층 여전히 MultiBox 크기보다 적은 파라미터 순서를 갖는다 (기능 투사 층 고려할 때, 우리의 제안 층 '매개 변수 수는 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 10 ^ 6입니다; 멀티 박스 제안서 층 '매개 변수 수는 1536 + 1536 × 5 × 800 = 27 × 10 ^ 6 × 7 × 7 × (+ 64 64 + 96 + 64).). 우리는 우리의 방법은 PASCAL VOC와 같은 작은 데이터 세트에 overfitting 덜 위험이 예상된다.
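The parameter counts quoted above can be checked with a few lines of arithmetic. The sketch uses only the layer dimensions given in the text; the function names are hypothetical, and bias terms are ignored, as they are in the text's own products.

```python
def rpn_output_params(feature_dim=512, k=9):
    # cls (2 scores) and reg (4 coordinates) outputs per anchor, VGG-16 case
    return feature_dim * (4 + 2) * k

def multibox_output_params(feature_dim=1536, boxes=800):
    # 4 coordinates and 1 confidence per box
    return feature_dim * (4 + 1) * boxes

rpn_out = rpn_output_params()                         # 27,648, about 2.8 x 10^4
multibox_out = multibox_output_params()               # 6,144,000, about 6.1 x 10^6

# Including the feature projection layers:
rpn_total = 3 * 3 * 512 * 512 + rpn_out               # 2,386,944, about 2.4 x 10^6
multibox_total = 7 * 7 * (64 + 96 + 64 + 64) * 1536 + multibox_out  # 27,820,032

print(multibox_out / rpn_out)        # roughly 222: two orders of magnitude
print(multibox_total / rpn_total)    # roughly 11.7: one order of magnitude
```

The last product evaluates to about 2.8 × 10^7, which the text writes as 27 × 10^6.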
Multi-Scale Anchors as Regression References	회귀 참고로 멀티 스케일 앵커
Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM ²⁰⁰ and CNN-based methods ²⁰¹, ²⁰², ²⁰³. The images are resized at multiple scales, and feature maps (HOG ²⁰⁴ or deep convolutional features ²⁰⁵, ²⁰⁶, ²⁰⁷) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM ²⁰⁸, models of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way is used to address multiple scales, it can be thought of as a “pyramid of filters” (Figure 1(b)). The second way is usually adopted jointly with the first way ²⁰⁹.	앵커의 우리의 디자인은 여러 저울 (그리고 가로 세로 비율)을 해결하기위한 새로운 방식을 제시한다. 도 1에 도시 된 바와 같이, 멀티 - 스케일 예측을위한 두 가지 방법으로 인기가 있었다. 첫 번째 방법은 DPM으로, 예를 들어 영상 / 기능 피라미드에 기초한다 ²¹⁰와 CNN-based 방법 ²¹¹ ²¹², ²¹³. 이미지는 복수의 스케일로 크기가 조정되고, 기능 맵 (HOG ²¹⁴ 딥 컨벌루션 기능 ²¹⁵ ²¹⁶, ²¹⁷) 각각의 스케일에 대해 계산된다 (도 1 (a)). 이 방법들은 유용하지만, 시간 소모적이다. 두 번째 방법은 기능지도 다중 스케일 (및 / 또는 종횡비)의 슬라이딩 윈도우를 사용하는 것이다. 예를 들어, DPM ²¹⁸에서, 상이한 종횡비의 모델 (예를 들면 5 × 7, 7 × 5) 다른 크기의 필터를 사용하여 개별적으로 훈련된다. 이 방법은 다수의 비늘을 해결하기 위해 사용되는 경우, 그것은 "피라미드 필터"로 생각 될 수있다 (도 1 (b)). 두 번째 방법은 일반적으로 첫 번째 방법 ²¹⁹와 공동으로 채용하고 있습니다.
As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).	비교, 우리의 앵커 기반의 방법보다 비용 효율적인 앵커의 피라미드에 내장되어 있습니다. 우리의 방법은 여러 분류 비늘 종횡비 박스를 고정 참조하여 바운딩 박스 회귀. 그것은 단지 이미지와 단일 규모의 기능지도에 의존하고, 하나의 크기 (기능지도를 슬라이딩 창) 필터를 사용합니다. 우리는 실험을 통해 여러 스케일과 크기 (표 8)를 해결하기위한이 제도의 효과를 보여줍니다.
Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector ²²⁰. The design of multiscale anchors is a key component for sharing features without extra cost for addressing scales.	또한 빠른 R-CNN 검출기 ²²¹에 의해 수행되는 것처럼 때문에 앵커에 따라이 멀티 스케일 설계 우리는 단지 단일 크기의 이미지를 계산 컨벌루션 기능을 사용할 수있다. 다중 스케일 앵커의 디자인은 비늘을 해결하기위한 추가 비용없이 기능을 공유하기위한 핵심 구성 요소입니다.

Loss Function

ENG	KOR
For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.	교육 RPNs을 위해, 우리는 각각의 앵커 (객체 여부가되는) 이진 클래스 레이블을 지정합니다. 우리는 앵커되어 2 가지 긍정적 인 레이블을 할당 (I) 앵커 / IOU를 가지고 접지 진실 상자 높은 교차 - overUnion (IOU) 오버랩 또는 (II)의 앵커와 앵커 이상 0.7 겹쳐 모든 지상 진실 상자. 단일 지상 진실 상자가 여러 앵커에 긍정적 인 라벨을 할당 할 수 있습니다. 일반적으로 두 번째 조건은 양의 샘플을 결정하기에 충분하다; 그러나 우리는 아직도 일부 드문 경우에 두 번째 조건이 더 긍정적 인 샘플을 찾을 수있는 이유에 대한 첫 번째 조건을 채택한다. 그 IOU 비율은 모든 지상 진실 박스보다 낮은 0.3의 경우 우리는 비 양성 앵커에 부정적인 레이블을 할당합니다. 긍정적이나 부정적인도 있습니다 앵커는 교육 목적에 기여하지 않는다.
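The labeling rules above can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the label encoding (1 positive, 0 negative, -1 ignored) is a convention chosen for this example. Skipping rule (i) for a ground-truth box that overlaps no anchor at all is also an assumption.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    if not anchors:
        return []
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row) if row else 0.0
        if best > pos_thresh:
            labels.append(1)        # rule (ii): IoU above 0.7 with some box
        elif best < neg_thresh:
            labels.append(0)        # below 0.3 for all ground-truth boxes
        else:
            labels.append(-1)       # neither: excluded from the objective
    # Rule (i): the anchor(s) with the highest IoU for each ground-truth box.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in overlaps]
        top = max(column)
        if top > 0:
            for i, value in enumerate(column):
                if value == top:
                    labels[i] = 1
    return labels

anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30), (100, 100, 110, 110)]
gt_boxes = [(0, 0, 10, 10), (100, 100, 120, 120)]
print(label_anchors(anchors, gt_boxes))   # [1, -1, 0, 1]
```

In the usage example, the last anchor overlaps its ground-truth box with IoU 0.25, which is below both thresholds. It is still labeled positive by rule (i), which is the rare case the text says the first condition exists to cover.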
With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN ²²². Our loss function for an image is defined as:	이러한 정의, 우리는 빠른 R-CNN ²²³의 다중 작업 손실 다음과 같은 목적 함수를 최소화 할 수 있습니다. 이미지에 대한 우리의 손실 함수는 다음과 같이 정의된다 :
Faster-RCNN_-_equation1.jpg
Here, i is the index of an anchor in a mini-batch and p_i is the predicted probability of anchor i being an object. The ground-truth label p_i^* is 1 if the anchor is positive, and is 0 if the anchor is negative. t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i^* is that of the ground-truth box associated with a positive anchor. The classification loss L_cls is log loss over two classes (object vs. not object). For the regression loss, we use L_reg(t_i, t_i^*) = R(t_i − t_i^*) where R is the robust loss function (smooth L1) defined in ²²⁴. The term p_i^* L_reg means the regression loss is activated only for positive anchors (p_i^* = 1) and is disabled otherwise (p_i^* = 0). The outputs of the cls and reg layers consist of {p_i} and {t_i} respectively.	여기에, 나는 미니 배치에서 앵커의 인덱스이며, p_i 내가 객체 인 앵커의 예측 가능성이다. 지상 진실 라벨 p_i^* 앵커가 양수이면 1이고, 앵커가 음수 인 경우 0이다. t_i는 예측 된 경계 상자의 4 매개 변수 좌표를 나타내는 벡터이며, t_i^* 하는 것은 긍정적 인 앵커와 관련된 지상 진실 상자의입니다. 분류 손실 L_cls는 두 개의 클래스를 통해 로그 손실 (반대하지 대 개체)입니다. R ²²⁵에 정의 된 강력한 손실 함수 (부드러운 L1)는 - (t_i^* t_i) 회귀 손실, 우리는 L_reg (t_i, t_i^*) = R을 사용합니다. 용어 p_i^* (L_reg)가 회귀 손실 만 긍정적 인 앵커 활성화 수단 (p_i^* = 1), 그렇지 않으면 사용할 수 없습니다 (p_i^* = 0). CLS와 레지 층의 출력은 {p_i}과 {t_i} 각각 구성되어있다.
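For reference, the loss described by these definitions can be written out as below. The original equation is present on this page only as an image, so this is a reconstruction from the surrounding definitions. N_cls, N_reg, and λ are the two normalizers and the balancing weight that the text discusses next.

```latex
L(\{p_i\}, \{t_i\})
  = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*),
\qquad
L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)
```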
The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ. In our current implementation (as in the released code), the cls term in Eqn. (1) is normalized by the mini-batch size (i.e., N_cls = 256) and the reg term is normalized by the number of anchor locations (i.e., N_reg ∼ 2,400). By default we set λ = 10, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of λ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.	이 용어는 균형 파라미터 λ에 의해 Ncls 및 Nreg에 의해 정규화 및 가중된다. ,는 식의 CLS 용어. (1) 미니 배치 크기에 의해 정규화 현재 구현에서 (해제 코드에서와 같이) (즉, = 256 Ncls) 및 등록 기간은 (앵커 위치의 숫자로 정규화 즉, , Nreg ~ 2,400). 기본적으로 우리는 λ 10 = 설정하고, 따라서 모두 CLS 및 등록 조건은 거의 동일 가중된다. 우리는 결과가 넓은 범위 (표 9)에서 λ의 값에 둔감 실험을 통해 보여준다. 또한 상기와 같이 정규화를 요구하지 않고 간소화 할 수 있습니다.
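A minimal sketch of the loss with these defaults is given below. It assumes the inputs hold only the sampled anchors, with labels 0 or 1. The smooth L1 form (0.5x² for |x| < 1, |x| − 0.5 otherwise) is the one used in Fast R-CNN and is not restated in this section. The function names and the `eps` clamp are hypothetical.

```python
import math

def smooth_l1(x):
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def log_loss(p, label, eps=1e-12):
    # p is the predicted probability that the anchor is an object.
    return -math.log(max(p, eps)) if label == 1 else -math.log(max(1.0 - p, eps))

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p, p_star: per-anchor probabilities and 0/1 labels; t, t_star: 4-tuples."""
    cls_term = sum(log_loss(pi, li) for pi, li in zip(p, p_star)) / n_cls
    reg_term = sum(
        li * sum(smooth_l1(a - b) for a, b in zip(ti, gi))
        for li, ti, gi in zip(p_star, t, t_star)
    ) / n_reg
    return cls_term + lam * reg_term

# Per-anchor weights of the two terms with the default settings:
print(1 / 256)      # cls weight, about 0.0039
print(10 / 2400)    # reg weight, about 0.0042
```

The two printed weights are close, which is the sense in which λ = 10 makes the cls and reg terms roughly equally weighted.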
For bounding box regression, we adopt the parameterizations of the 4 coordinates following ²²⁶:	박스 회귀 경계를 들어, 우리는 다음과 같은 네 좌표의 파라미터 화를 채용 ²²⁷ :
Faster-RCNN_-_equation2.jpg
where x, y, w, and h denote the box’s center coordinates and its width and height. Variables x, x_a, and x^* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.	w X, Y, 그리고 시간이 나타내는 위치 상자의 중심 좌표와 너비와 높이. 변수 X, XA이며, x ^ *입니다 (시간, w 마찬가지로 Y에 대한) 예측 상자, 앵커 박스 및 groundtruth 상자를 각각합니다. 이것은 주변 접지 진실 상자 앵커 상자 바운딩 박스 회귀로 생각할 수있다.
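The equation itself appears on this page only as an image. In the published paper the parameterization is t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), with the starred targets defined the same way from the ground-truth box. The sketch below encodes a box against an anchor and inverts the mapping. The function names and the `(x, y, w, h)` tuple layout are assumptions for the example.

```python
import math

def encode(box, anchor):
    """Map a box (x, y, w, h) to offsets (t_x, t_y, t_w, t_h) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert encode: recover (x, y, w, h) from offsets and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 128.0, 128.0)
ground_truth = (116.0, 92.0, 160.0, 96.0)
target = encode(ground_truth, anchor)     # t^* used as the regression target
recovered = decode(target, anchor)        # matches ground_truth up to rounding
```

Encoding a ground-truth box gives the regression target t^*; decoding a predicted t against the same anchor gives the proposal box.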
Nevertheless, our method achieves bounding-box regression in a different manner from previous RoI-based (Region of Interest) methods ²²⁸, ²²⁹. In ²³⁰, ²³¹, bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors is learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.	그럼에도 불구하고, 우리의 방법은 다른 방법에 의해 경계 박스 회귀를 달성 이전 RoIbased (관심 지역)에서 방법 ²³², ²³³. ²³⁴, ²³⁵, 바운딩 박스 회귀 임의적 크기의 ROI에서 풀링 된 기능을 수행하고, 상기 가중치는 회귀 모든 영역 크기에 의해 공유된다. 우리의 제형에 회귀 사용 기능 피쳐 맵의 동일한 공간의 크기 (3 × 3)으로한다. 다양한 크기를 설명하기 위해, K 경계 박스 회귀 변수의 집합 알게된다. 각 회귀 한 스케일 한 종횡비 부담이며, K 개의 회귀 가중치를 공유하지 않는다. 이와 같이,이 기능은, 고정 사이즈 / 스케일 앵커 설계 덕분에 있더라도 다양한 크기의 박스를 예측하는 것이 여전히 가능하다.

Training RPNs

ENG	KOR
The RPN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) ²³⁶. We follow the “image-centric” sampling strategy from ²³⁷ to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.	RPN은 역 전파 및 확률 그라데이션 하강 (SGD) ²³⁸에 의해 엔드 - 투 - 엔드 훈련을 할 수 있습니다. ²³⁹이 네트워크를 훈련에서 우리는 "이미지 중심의"표본 추출 방법을 따릅니다. 각 미니 배치는 많은 양 및 음의 예 앵커가 포함 된 하나의 이미지에서 발생한다. 모든 앵커 손실 함수를 최적화하는 것이 가능하지만, 마이너스 방향이 샘플 의지 바이어스들은 지배 같다. 대신, 우리는 랜덤하게 샘플링 포지티브 및 네거티브 앵커가 1의 비율이 미니 일괄 손실 함수를 계산하기위한 화상 (256)에 앵커 샘플 1. 부정적인 사람과 미만 128 긍정적 인 샘플 이미지가있는 경우, 우리 패드 미니 배치.
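The sampling rule can be sketched as follows, assuming per-anchor labels encoded as 1 (positive), 0 (negative), and -1 (ignored). The function name and the label encoding are hypothetical.

```python
import random

def sample_minibatch(labels, batch_size=256, rng=None):
    """Pick up to batch_size // 2 positives, then fill the rest with negatives."""
    rng = rng or random.Random()
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    n_pos = min(len(positives), batch_size // 2)
    n_neg = min(len(negatives), batch_size - n_pos)   # pad with negatives
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)

# An image with only 30 positive anchors: the batch is padded with 226 negatives.
labels = [1] * 30 + [0] * 5000 + [-1] * 100
pos_idx, neg_idx = sample_minibatch(labels, rng=random.Random(0))
print(len(pos_idx), len(neg_idx))   # 30 226
```

When an image has at least 128 positives, the same function returns 128 of each, giving the 1:1 ratio described in the text.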
We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pretraining a model for ImageNet classification ²⁴⁰, as is standard practice ²⁴¹. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory ²⁴². We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 ²⁴³. Our implementation uses Caffe ²⁴⁴.	우리는 무작위로 표준 편차는 0.01으로 평균이 0 가우스 분포에서 가중치를 그림으로써 모든 새로운 레이어를 초기화합니다. 모든 다른 층 (즉, 공유 컨벌루션 층) 표준 관행이다 같이 ImageNet 분류 ²⁴⁵에 대한 모델을 pretraining으로 초기화된다 ²⁴⁶. 우리 조정 ZF 그물의 모든 레이어 및 conv3_1 위로 VGG 그물을위한 메모리를 절약하기 위해 ²⁴⁷. 우리는 PASCAL VOC 데이터 세트의 다음 20K 미니 배치에 대한 학습 60K 미니 일괄 0.001의 속도 및 0.0001을 사용합니다. 우리는 0.9의 모멘텀과 0.0005 ²⁴⁸의 중량 붕괴를 사용합니다. 우리의 구현은 CAFFE ²⁴⁹을 사용합니다.
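
The mini-batch sampling rule described above (256 anchors per image, at most half of them positive, padded with negatives when positives are scarce) can be illustrated with a minimal sketch. The function name `sample_anchor_minibatch` and the label convention (1 = positive, 0 = negative, -1 = ignored) are assumptions made for this illustration and are not taken from the paper.

```python
import random


def sample_anchor_minibatch(labels, batch_size=256, max_pos_fraction=0.5, rng=None):
    """Pick anchor indices for one RPN mini-batch from a single image.

    labels: one entry per anchor, 1 = positive, 0 = negative, -1 = ignored
            (hypothetical convention for this sketch).
    Returns (positive_indices, negative_indices).
    """
    rng = rng or random.Random()
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]

    # At most half of the batch (128 of 256) may be positive.
    num_pos = min(len(positives), int(batch_size * max_pos_fraction))
    # Negatives fill the remainder, which pads the batch when positives are few.
    num_neg = min(len(negatives), batch_size - num_pos)

    return rng.sample(positives, num_pos), rng.sample(negatives, num_neg)
```

With 30 positive anchors available, this sketch returns all 30 positives and 226 negatives; with 500 positives available it returns 128 of each.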

ENG	KOR
Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN ²⁵⁰. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).	지금까지 우리는이 제안을 이용할 것이다 지역 기반 물체 검출 CNN을 고려하지 않고, 지역 제안 생성을위한 네트워크를 훈련하는 방법을 설명했다. 감지 네트워크를 위해, 우리는 빠른 R-CNN 채택 ²⁵¹. 다음에 우리는 공유 길쌈 층 (그림 2)와 RPN 및 고속 R-CNN로 구성된 통합 네트워크를 배울 알고리즘을 설명합니다.
Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:	독립적으로 훈련을 모두 RPN 및 고속 R-CNN은 다른 방법으로 자신의 길쌈 레이어를 수정합니다. 따라서 우리는 두 네트워크 사이의 컨벌루션 층을 공유하기보다는 두 개의 네트워크를 학습 할 수있는 기술을 개발할 필요가있다. 우리는 공유 기능 교육 네트워크를위한 세 가지 방법에 대해 설명:
(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.	(ⅰ) 교류 교육. 이 솔루션에서는 먼저 RPN을 훈련하고 빠른 R-CNN를 양성하는 제안을 사용합니다. 빠른 R-CNN으로 튜닝 네트워크는 RPN을 초기화하는 데 사용되며, 이러한 과정은 반복된다. 이는 본 논문의 모든 실험에 사용되는 솔루션입니다.
(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes’ coordinates that are also network responses, so it is approximate. In our experiments, we have empirically found this solver produces close results, yet reduces the training time by about 25-50% compared with alternating training. This solver is included in our released Python code.	(ii) 대략 합동 훈련. 이 솔루션으로, RPN 및 빠른 R-CNN 네트워크는 각 SGD 반복에서 그림 2에서와 같이 훈련 기간 동안 하나의 네트워크로 통합되어, 앞으로 패스 고속 훈련 지역 단지 등이 고정 처리 제안, 미리 계산 된 제안을 생성 R-CNN 검출기. 역 전파는 공유 층의 RPN 손실 및 빠른 R-CNN 손실 모두에서 역방향 전파 신호가 결합되는 경우, 평소와 같이 일어난다. 이 해결책은 구현하기가 용이하다. 하지만이 솔루션은 파생 w.r.t.를 무시 또한 네트워크 응답이다 제안 상자 '좌표는, 그래서 대략이다. 우리의 실험에서, 우리는 경험적으로이 해결사 가까운 결과를 아직 교육 교류에 비해 약 25-50%하여 교육 시간을 단축 발견했다. 이 해석은 우리의 발표 파이썬 코드에 포함되어 있습니다.
(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer ²⁵² in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an “RoI warping” layer as developed in ²⁵³, which is beyond the scope of this paper.	(iii) 비 대략적인 합동 훈련. 상술 한 바와 같이, RPN 예측 경계 박스는 입력의 함수이다. 빠른 R-CNN의 투자 수익 (ROI) 풀링 층 ²⁵⁴ 길쌈 기능과 또한 입력으로 예측 된 경계 상자, 그래서 이론적으로 유효한 역 전파 해결사는 그라디언트 w.r.t. 포함한다을 받아 상자를 조정합니다. 이 경사는 위의 대략적인 합동 훈련에 무시됩니다. 비 대략 합동 훈련 용액에서는 미분 w.r.t.가 ROI를 풀링 층을 필요 상자를 조정합니다. 이 사소 문제이며이 문서의 범위를 벗어 인 ²⁵⁵에서 개발로 솔루션은 "RoI에 왜곡"층에 의해 제공 될 수있다.
4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.	4 단계 교류 교육. 본 논문에서는 최적화 교류를 통해 공유 기능을 배울 수있는 실용적인 4 단계 훈련 알고리즘을 채택한다. 3.1.3 절에서 설명한 바와 같이, 제 1 단계에서는 RPN 훈련. 이 네트워크는 ImageNet 미리 훈련 된 모델로 초기화하고, 엔드 - 투 - 엔드 지역의 제안 작업을 미세 조정. 두 번째 단계에서는 단계 1에서 생성 RPN 제안하여 빠른 R-CNN에 의해 분리 검출 된 네트워크를 훈련. 이 감지 네트워크는 또한 ImageNet 사전 훈련 모델에 의해 초기화됩니다. 이 시점에서 두 네트워크는 길쌈 층을 공유하지 않는다. 세 번째 단계에서, 우리는 RPN 교육을 초기화 검출기 네트워크를 사용하지만, 우리는 공유 길쌈 층에만 미세 조정 RPN에 고유 한 레이어를 수정합니다. 이제 두 네트워크는 길쌈 레이어를 공유 할 수 있습니다. 마지막으로, 공유 길쌈 층은 우리가 미세 조정 빠른 R-CNN의 고유 층을 고정시키고. 이와 같이, 두 네트워크는 동일한 컨벌루션 층을 공유하고 통합 네트워크를 형성한다. 유사한 교류 훈련은 더 반복에 출마 할 수 있지만, 우리는 무시할 개선을 관찰했다.
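
The 4-step alternating schedule described above can be summarized as a small data structure. This is only a descriptive sketch: the names `TrainingStep`, `FOUR_STEP_SCHEDULE`, and `networks_share_conv` are hypothetical, and no actual training is performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingStep:
    number: int
    network: str              # which network is trained in this step
    conv_init: str            # where its convolutional layers come from
    shared_conv_frozen: bool  # True when the shared conv layers are kept fixed


# One entry per step of the 4-step alternating training described in the text.
FOUR_STEP_SCHEDULE = (
    TrainingStep(1, "RPN", "ImageNet-pre-trained model", False),
    TrainingStep(2, "Fast R-CNN", "ImageNet-pre-trained model", False),
    TrainingStep(3, "RPN", "step-2 detector network", True),
    TrainingStep(4, "Fast R-CNN", "shared layers kept fixed from step 3", True),
)


def networks_share_conv(after_step):
    """Return True once some step up to `after_step` trained with frozen shared layers."""
    return any(
        step.shared_conv_frozen
        for step in FOUR_STEP_SCHEDULE
        if step.number <= after_step
    )
```

Under this encoding, `networks_share_conv(2)` is false (the two networks are still separate after step 2) and `networks_share_conv(3)` is true, matching the description in the text.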

Faster-RCNN_-_table1.jpg

Implementation Details

ENG	KOR
We train and test both region proposal and object detection networks on images of a single scale ²⁵⁶, ²⁵⁷. We re-scale the images such that their shorter side is s = 600 pixels ²⁵⁸. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off ²⁵⁹. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.	우리는 단일 크기의 이미지를 감지 네트워크를 훈련 두 영역 안을 테스트 오브젝트 ²⁶⁰, ²⁶¹. 우리는 다시 스케일 이미지들은 짧은 변을들되도록 픽셀 = 600 ²⁶². 정확도를 개선 할 수있다 (화상 피라미드를 사용하여)이지만 양호한 속도 정확도 트레이드 - 오프가 발생하지 않는 다중 스케일 피쳐 추출 ²⁶³. 재 스케일링 된 이미지에서 마지막 컨벌루션 층에 모두 ZF 및 VGG 그물의 총 스트라이드는 16 픽셀이고, 따라서 (~500 × 375)을 리사이징하기 전에 전형적인 PASCAL 이미지 ~ 10 픽셀이다. 정확도가 더 작은 보폭으로 개선 될 수 있지만 심지어 큰 보폭, 좋은 결과를 제공합니다.
For anchors, we use 3 scales with box areas of 128^2, 256^2 , and 512^2 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.	1, 1 : 2, 2 : 1의 앵커를 위해, 우리는 2 ^ (128)의 박스 부분 3 배율 256 ^ 2 ^ 2, 512 픽셀 및 하나의 3 종횡비를 사용한다. 이러한 하이퍼 매개 변수를주의 깊게 특정 데이터 세트를 위해 선택되지 않고, 우리는 다음 섹션에서 그 효과에 절제 실험을 제공합니다. 논의 된 바와 같이, 우리의 솔루션은 상당한 실행 시간을 절약 여러 규모의 지역을 예측하는 이미지 피라미드 또는 필터 피라미드를 필요로하지 않습니다. 도 3은 (오른쪽) 비늘 종횡비 광범위한 우리의 방법의 성능을 나타낸다. 표 1은 ZF 망을 이용하여 각각의 고정 용 학습 제안 평균 크기를 나타낸다. 우리는 우리의 알고리즘은 기본 수용 필드보다 큰 예측을 할 수 있습니다. 이러한 예측은 개체의 중간 표시되면 여전히 대략 객체의 범위를 추정 할 수있다 불가능하지는 온이다.
The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult-to-correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.	이미지의 경계를 교차 앵커 박스는주의하여 취급해야합니다. 그들은 손실에 기여하지 않도록 훈련하는 동안, 우리는 상호 경계 앵커를 무시합니다. 전형적인 1000 × 600 이미지의 경우, 총 약 20000 (≈ 60 × 40 × 9) 앵커가있을 것이다. 무시 간 경계 앵커로, 교육에 대한 이미지 당 약 6000 앵커가 있습니다. 경계 교차 특이점이 교육에서 무시하지 않는 경우, 그들은 목적에 오류 조건을 정정하기 어려운, 큰 소개, 교육은 수렴하지 않습니다. 테스트하는 동안, 그러나, 우리는 여전히 전체 이미지에 완전히 길쌈 RPN을 적용합니다. 이것은 우리가 이미지의 경계에 클립 cross-boundary 제안 상자를 생성 할 수있다.
Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.	일부 RPN 제안은 매우 서로 중첩된다. 중복을 줄이기 위해, 우리는 그들의 CLS 점수에 따라 제안서 지역에 비 최대 억제 (NMS)을 채택한다. 우리는 이미지 당 2,000 제안 영역에 대한 정보를 나뭇잎 0.7에서 NMS의 IOU 임계 값을 수정합니다. 우리가 보여주는 바와 같이, NMS는 궁극적 검출 정밀도를 손상하지 않고, 실질적으로 제안의 수를 감소시킨다. NMS 후, 우리는 검색을 위해 상단-N 위의 제안 영역을 사용합니다. 이하에서는, 테스트시에 제안 다른 수의 빠른 R-CNN 2000 RPN 제안서를 이용하여 훈련하지만 평가한다.
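
The anchor layout, the cross-boundary handling, and the NMS step described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the released implementation: all function names are hypothetical, anchors are assumed to be centred on feature-map cells, the aspect ratio is read as height / width, and boxes are (x1, y1, x2, y2) tuples in continuous pixel coordinates.

```python
import math


def anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) (width, height) pairs.

    Each shape has area scale**2; ratio is taken as height / width (assumption).
    """
    return [(s / math.sqrt(r), s * math.sqrt(r)) for s in scales for r in ratios]


def grid_anchors(img_w, img_h, stride=16):
    """Place every anchor shape at the centre of every feature-map cell."""
    boxes = []
    for row in range(img_h // stride):
        for col in range(img_w // stride):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for w, h in anchor_shapes():
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def inside_image(box, img_w, img_h):
    """Training-time rule: keep only anchors that do not cross the image boundary."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h


def clip_box(box, img_w, img_h):
    """Test-time rule: clip a cross-boundary proposal to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1), max(0, y1), min(img_w, x2), min(img_h, y2))


def iou(a, b):
    """Intersection-over-union of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7, top_n=None):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if top_n is not None and len(keep) == top_n:
                break
    return keep
```

For a 1000 × 600 image, `grid_anchors(1000, 600)` yields 62 × 37 × 9 = 20646 anchors under these assumptions, in line with the "roughly 20000" quoted in the text. Filtering them with `inside_image` gives the training set of anchors, whose exact size depends on the centring convention assumed here.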

EXPERIMENTS

Faster-RCNN_-_table2.jpg	Faster-RCNN_-_table3.jpg	Faster-RCNN_-_table4.jpg	Faster-RCNN_-_table5.jpg	Faster-RCNN_-_table6.jpg	Faster-RCNN_-_table7.jpg	Faster-RCNN_-_table8.jpg	Faster-RCNN_-_table9.jpg	Faster-RCNN_-_figure4.jpg	Faster-RCNN_-_table10.jpg	Faster-RCNN_-_table11.jpg	Faster-RCNN_-_table12.jpg

Experiments on PASCAL VOC

Experiments on MS COCO

From MS COCO to PASCAL VOC

CONCLUSION

ENG

KOR

We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.

우리는 효율적이고 정확한 지역 제안 생성을위한 RPNs에 발표했다. 하류 검출 네트워크 컨벌루션 기능을 공유함으로써, 영역 제안서 단계 거의 비용 부담이다. 우리의 방법은 실시간에 가까운 프레임 속도로 실행하는 통합 깊은 학습 기반 물체 검출 시스템을 가능하게한다. 학습 RPN은 영역 제안 품질 때문에 전체 객체 검출의 정확도를 향상시킨다.

Documentation

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (v3): https://github.com/rbgirshick/py-faster-rcnn (Python Caffe version); https://github.com/ShaoqingRen/faster_rcnn (Matlab version); 1506.01497v3.pdf

References

K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision (ECCV), 2014. ↩
R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015. ↩
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015. ↩
J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), 2013. ↩
R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. ↩
C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision (ECCV), 2014. ↩
J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ↩
P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010. ↩
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in International Conference on Learning Representations (ICLR), 2014. ↩
S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Neural Information Processing Systems (NIPS), 2015. ↩
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” 2007. ↩
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision (ECCV), 2014. ↩
S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” arXiv:1511.02300, 2015. ↩
J. Zhu, X. Chen, and A. L. Yuille, “DeePM: A deep part-based model for object detection and semantic part localization,” arXiv:1511.07131, 2015. ↩
J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015. ↩
J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional localization networks for dense captioning,” arXiv:1511.07571, 2015. ↩
D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, “Human curation and convnets: Powering item-to-item recommendations on pinterest,” arXiv:1511.04003, 2015. ↩
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015. ↩
J. Hosang, R. Benenson, and B. Schiele, “How good are detection proposals, really?” in British Machine Vision Conference (BMVC), 2014. ↩
J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What makes for effective detection proposals?” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015. ↩
N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, “Object-Proposal Evaluation Protocol is ‘Gameable’,” arXiv:1505.05836, 2015. ↩
J. Carreira and C. Sminchisescu, “CPMC: Automatic object segmentation using constrained parametric min-cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012. ↩
P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. ↩
B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012. ↩
C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Neural Information Processing Systems (NIPS), 2013. ↩
D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. ↩
C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable, high-quality object detection,” arXiv:1412.1441 (v1), 2015. ↩
P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in Neural Information Processing Systems (NIPS), 2015. ↩
J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ↩
S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” arXiv:1504.06066, 2015. ↩
J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Neural Information Processing Systems (NIPS), 2015. ↩
M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014. ↩
V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in International Conference on Machine Learning (ICML), 2010. ↩
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ↩
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, 1989. ↩
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” in International Journal of Computer Vision (IJCV), 2015. ↩
A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems (NIPS), 2012. ↩
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014. ↩
K. Lenc and A. Vedaldi, “R-CNN minus R,” in British Machine Vision Conference (BMVC), 2015. ↩