Fast R-CNN:Paper

Fast R-CNN. Ross Girshick, Microsoft Research.

Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn


Introduction

Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.

In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).

R-CNN and SPPnet

The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:
1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.

SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.

Contributions

We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast R-CNN method has several advantages:
1. Higher detection quality (mAP) than R-CNN, SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching

Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn

Fast R-CNN architecture and training

Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.


The RoI pooling layer

The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
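
The sub-window arithmetic described above can be sketched in a few lines. This is a minimal single-channel illustration in plain Python, not the paper's Caffe layer. The name `roi_max_pool` and the floor/ceil bin boundaries are assumptions, chosen so that every sub-window is non-empty; the exact calculation in [11] may differ in detail.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (one channel only).
    roi: (r, c, h, w), top-left corner plus height and width, in feature-map cells.
    """
    r, c, h, w = roi
    pooled = []
    for ph in range(out_h):
        # Assumed bin boundaries: floor for the start, ceil for the end,
        # so neighbouring sub-windows may overlap by one cell but are never empty.
        r0 = r + math.floor(ph * h / out_h)
        r1 = r + math.ceil((ph + 1) * h / out_h)
        row = []
        for pw in range(out_w):
            c0 = c + math.floor(pw * w / out_w)
            c1 = c + math.ceil((pw + 1) * w / out_w)
            row.append(max(feature_map[i][j]
                           for i in range(r0, r1)
                           for j in range(c0, c1)))
        pooled.append(row)
    return pooled
```

A multi-channel version would apply the same function to each channel independently, as the text notes for standard max pooling.
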

Initializing from pre-trained networks

We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.

First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).

Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).

Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

Fine-tuning for detection

Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.

The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).
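
The hierarchical sampling step can be illustrated as follows. This is a rough sketch in plain Python, assuming a hypothetical dictionary `rois_by_image` that maps an image identifier to its list of candidate RoIs; the function name and data layout are illustrative and are not taken from the released code.

```python
import random


def sample_minibatch(rois_by_image, n_images=2, batch_size=128, rng=random):
    """Sample N images first, then R / N RoIs from each sampled image."""
    per_image = batch_size // n_images
    # Sorting the keys only makes the draw reproducible for a seeded rng.
    chosen = rng.sample(sorted(rois_by_image), n_images)
    return [(image_id, roi)
            for image_id in chosen
            for roi in rng.sample(rois_by_image[image_id], per_image)]
```

With N = 2 and R = 128, the 128 RoIs in a batch come from only 2 images, so only 2 images pass through the conv layers instead of 128. That ratio is presumably the source of the rough 64× figure quoted above.
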
One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.

Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), $p = (p_0, \ldots, p_K)$, over K + 1 categories. As usual, p is computed by a softmax over the K + 1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the K object classes, indexed by k. We use the parameterization for $t^k$ given in [9], in which $t^k$ specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.
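
To make the "scale-invariant translation and log-space height/width shift" concrete, here is a small sketch of the parameterization as it is commonly stated for R-CNN [9]: centre offsets divided by the proposal size, and log ratios for width and height. Representing boxes as (centre x, centre y, width, height) and the names `encode_box` and `decode_box` are assumptions made for this illustration.

```python
import math


def encode_box(proposal, target):
    """Regression targets (tx, ty, tw, th) of a target box relative to a proposal."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw,       # translation in units of proposal width
            (gy - py) / ph,       # translation in units of proposal height
            math.log(gw / pw),    # log-space width change
            math.log(gh / ph))    # log-space height change


def decode_box(proposal, t):
    """Invert encode_box: apply offsets t to a proposal."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

Because the offsets are relative to the proposal's own size, the same values describe the same relative correction for small and large proposals.
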
Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:

$$L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda [u \ge 1] L_{\text{loc}}(t^u, v), \tag{1}$$

in which $L_{\text{cls}}(p, u) = -\log p_u$ is log loss for true class u. The second task loss, $L_{\text{loc}}$, is defined over a tuple of true bounding-box regression targets for class u, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class u. The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence $L_{\text{loc}}$ is ignored. For bounding-box regression, we use the loss

$$L_{\text{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t^u_i - v_i), \tag{2}$$

in which

$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases} \tag{3}$$

is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.

The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets $v_i$ to have zero mean and unit variance. All experiments use λ = 1.
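
The per-RoI loss can be written out directly. The sketch below is plain Python for a single RoI and assumes the usual smooth L1 form (quadratic for |x| < 1, linear otherwise). The function names, the argument layout, and passing only the predicted offsets for the true class as `t_u` are choices made for this illustration.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear in the tails."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Classification log loss plus, for non-background RoIs, the box loss.

    p:   class probabilities (p[0] is background), e.g. a softmax output.
    u:   ground-truth class index; 0 means background.
    t_u: predicted offsets (tx, ty, tw, th) for class u (unused when u == 0).
    v:   ground-truth regression targets (vx, vy, vw, vh) (unused when u == 0).
    """
    l_cls = -math.log(p[u])
    if u == 0:
        # The Iverson bracket [u >= 1] is 0 for background RoIs.
        return l_cls
    l_loc = sum(smooth_l1(t - target) for t, target in zip(t_u, v))
    return l_cls + lam * l_loc
```

For example, an offset error of 0.5 contributes 0.125 to the box loss, while an error of 2.0 contributes 1.5 rather than the 2.0 that a squared loss of the form 0.5x² would give. This linear tail is how large outliers are damped.
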
We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers; however, these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).

Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
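
The foreground/background split for one image might look like the following sketch. It is plain Python under several assumptions: boxes are `(x1, y1, x2, y2)` tuples with continuous coordinates, the helper names are hypothetical, and the handling of images with too few foreground or background proposals (simply taking what is available) is a guess rather than something the text specifies.

```python
import random


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def sample_rois(proposals, gt_boxes, rois_per_image=64, fg_fraction=0.25, rng=random):
    """Return (foreground, background) proposals sampled for one image."""
    fg, bg = [], []
    for box in proposals:
        best = max((iou(box, g) for g in gt_boxes), default=0.0)
        if best >= 0.5:
            fg.append(box)          # labeled with an object class, u >= 1
        elif best >= 0.1:
            bg.append(box)          # max IoU in [0.1, 0.5): background, u = 0
        # proposals with max IoU below 0.1 are not sampled at all
    n_fg = min(int(round(fg_fraction * rois_per_image)), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)
```

With 64 RoIs per image and a 25% foreground fraction, this yields up to 16 foreground and 48 background RoIs.
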
Back-propagation through RoI pooling layers. Back-propagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.

Let $x_i \in \mathbb{R}$ be the i-th activation input into the RoI pooling layer and let $y_{rj}$ be the layer’s j-th output from the r-th RoI. The RoI pooling layer computes $y_{rj} = x_{i^*(r,j)}$, in which $i^*(r, j) = \operatorname{argmax}_{i' \in \mathcal{R}(r,j)} x_{i'}$. $\mathcal{R}(r, j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. A single $x_i$ may be assigned to several different outputs $y_{rj}$.

The RoI pooling layer’s backwards function computes the partial derivative of the loss function with respect to each input variable $x_i$ by following the argmax switches:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [i = i^*(r, j)] \frac{\partial L}{\partial y_{rj}}. \tag{4}$$

In words, for each mini-batch RoI r and for each pooling output unit $y_{rj}$, the partial derivative $\partial L / \partial y_{rj}$ is accumulated if i is the argmax selected for $y_{rj}$ by max pooling. In back-propagation, the partial derivatives $\partial L / \partial y_{rj}$ are already computed by the backwards function of the layer on top of the RoI pooling layer.
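
The accumulation described in words above can be sketched as follows, in plain Python over flattened indices. It assumes the forward pass stored, for each RoI r and output unit j, the index `argmax[r][j]` of the input that won the max; the function name and data layout are illustrative only.

```python
def roi_pool_backward(num_inputs, argmax, grad_out):
    """Route output gradients back to the inputs selected by max pooling.

    argmax[r][j]:   index i*(r, j) of the input chosen for output y_rj.
    grad_out[r][j]: dL/dy_rj, supplied by the layer above.
    Returns dL/dx_i for every input i.
    """
    grad_in = [0.0] * num_inputs
    for r, winners in enumerate(argmax):
        for j, i_star in enumerate(winners):
            # One input can win in several RoIs or cells, so gradients add up.
            grad_in[i_star] += grad_out[r][j]
    return grad_in
```

Inputs that were never selected as a maximum receive zero gradient, and an input shared by overlapping RoIs receives the sum of all the gradients routed to it.
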
SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
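
For reference, the schedule above can be collected into a few constants and one helper. This is only a restatement of the quoted numbers in plain Python; the constant and function names are made up for this sketch and do not correspond to the Caffe solver configuration shipped with the code.

```python
BASE_LR = 0.001            # global learning rate for the first 30k iterations
LOWERED_LR = 0.0001        # global learning rate for the final 10k iterations
STEP_ITERATION = 30_000    # iteration at which the rate is lowered
TOTAL_ITERATIONS = 40_000  # 30k + 10k on VOC07 or VOC12 trainval
LR_MULT = {"weight": 1.0, "bias": 2.0}  # per-layer multipliers
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005      # applied to weights and biases


def learning_rate(iteration, kind):
    """Effective learning rate for a weight or bias at a given iteration."""
    base = BASE_LR if iteration < STEP_ITERATION else LOWERED_LR
    return base * LR_MULT[kind]
```

So biases train at an effective rate of 0.002 during the first phase and 0.0002 afterwards.
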

Scale invariance

We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.

The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.

Fast R-CNN detection

Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈ 45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to $224^2$ pixels in area [11].
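
The scale-assignment rule can be sketched as a one-line search. The snippet assumes that each pyramid level is described by a multiplicative resize factor relative to the original image. That representation, and the name `best_scale`, are assumptions for illustration.

```python
def best_scale(roi_w, roi_h, scale_factors, target_area=224 * 224):
    """Pick the pyramid level whose scaled RoI area is closest to target_area."""
    return min(scale_factors,
               key=lambda s: abs((roi_w * s) * (roi_h * s) - target_area))
```

For instance, a 112 × 112 proposal is best matched by a level that doubles the image size, since its scaled area is then exactly 224 × 224.
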
For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability $\Pr(\text{class} = k \mid r) \triangleq p_k$. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN [9].
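
A common greedy form of non-maximum suppression, applied to the detections of a single class, is sketched below in plain Python. The text defers to R-CNN [9] for the actual algorithm and settings, so this version and its `iou_threshold` parameter are illustrative assumptions rather than a restatement of those settings. Boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold):
    """Greedy NMS over (box, score) pairs belonging to one class.

    Keeps the highest-scoring box, drops every remaining box that overlaps
    a kept box by more than iou_threshold, and repeats.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
```

Running the function once per class, each time on that class's refined boxes and scores $p_k$, matches the "independently for each class" wording above.
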

Truncated SVD for faster detection

For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].

In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as

$$W \approx U \Sigma_t V^T \tag{5}$$

using SVD. In this factorization, U is a u × t matrix comprising the first t left-singular vectors of W, $\Sigma_t$ is a t × t diagonal matrix containing the top t singular values of W, and V is a v × t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v). To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix $\Sigma_t V^T$ (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.
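
The two-layer replacement can be illustrated with a toy forward pass. The sketch below is plain Python and assumes the factors have already been computed elsewhere (for example by a linear algebra library); it only shows how an input flows through the $\Sigma_t V^T$ layer and then the U layer. The function names are hypothetical.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def factored_fc(x, sigma_vt, u, bias):
    """Apply the compressed layer: y = U (Sigma_t V^T x) + b.

    sigma_vt: t x v matrix of the first layer (no bias).
    u:        u x t matrix of the second layer, which keeps the original bias.
    """
    hidden = matvec(sigma_vt, x)  # t intermediate activations, no non-linearity
    return [y + b for y, b in zip(matvec(u, hidden), bias)]
```

As a hand-checkable case, take W = diag(3, 1), whose SVD has U = V = I and singular values 3 and 1. Keeping t = 1 gives `sigma_vt = [[3, 0]]` and `u = [[1], [0]]`, so the input (2, 5) with bias (0.5, -1) produces (6.5, -1) instead of the exact (6.5, 4). The second component is lost because its singular value was discarded.
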

Main results

Three main results support this paper’s contributions:
1. State-of-the-art mAP on VOC07, 2010, and 2012
2. Fast training and testing compared to R-CNN, SPPnet
3. Fine-tuning conv layers in VGG16 improves mAP

Experimental setup

Our experiments use three pre-trained ImageNet models that are available online. The first is the CaffeNet (essentially AlexNet [14]) from R-CNN [9]. We alternatively refer to this CaffeNet as model S, for “small.” The second network is VGG_CNN_M_1024 from [3], which has the same depth as S, but is wider. We call this network model M, for “medium.” The final network is the very deep VGG16 model from [20]. Since this model is the largest, we call it model L. In this section, all experiments use single-scale training and testing (s = 600; see Section 5.2 for details).

VOC 2010 and 2012 results

On these datasets, we compare Fast R-CNN (FRCN, for short) against the top methods on the comp4 (outside data) track from the public leaderboard (Table 2, Table 3). For the NUS_NIN_c2000 and BabyLearning methods, there are no associated publications at this time and we could not find exact information on the ConvNet architectures used; they are variants of the Network-in-Network design [17]. All other methods are initialized from the same pre-trained VGG16 network.

Fast R-CNN achieves the top result on VOC12 with a mAP of 65.7% (and 68.4% with extra data). It is also two orders of magnitude faster than the other methods, which are all based on the “slow” R-CNN pipeline. On VOC10, SegDeepM [25] achieves a higher mAP than Fast R-CNN (67.2% vs. 66.1%). SegDeepM is trained on VOC12 trainval plus segmentation annotations; it is designed to boost R-CNN accuracy by using a Markov random field to reason over R-CNN detections and segmentations from the O2P [1] semantic-segmentation method. Fast R-CNN can be swapped into SegDeepM in place of R-CNN, which may lead to better results. When using the enlarged 07++12 training set (see Table 2 caption), Fast R-CNN’s mAP increases to 68.8%, surpassing SegDeepM.

VOC 2007 results

On VOC07, we compare Fast R-CNN to R-CNN and SPPnet. All methods start from the same pre-trained VGG16 network and use bounding-box regression. The VGG16 SPPnet results were computed by the authors of [11]. SPPnet uses five scales during both training and testing. The improvement of Fast R-CNN over SPPnet illustrates that even though Fast R-CNN uses single-scale training and testing, fine-tuning the conv layers provides a large improvement in mAP (from 63.1% to 66.9%). R-CNN achieves a mAP of 66.0%. As a minor point, SPPnet was trained without examples marked as “difficult” in PASCAL. Removing these examples improves Fast R-CNN mAP to 68.1%. All other experiments use “difficult” examples.


Training and testing time

Fast training and testing times are our second main result. Table 4 compares training time (hours), testing rate (seconds per image), and mAP on VOC07 between Fast R-CNN, R-CNN, and SPPnet. For VGG16, Fast R-CNN processes images 146× faster than R-CNN without truncated SVD and 213× faster with it. Training time is reduced by 9×, from 84 hours to 9.5 hours. Compared to SPPnet, Fast R-CNN trains VGG16 2.7× faster (in 9.5 vs. 25.5 hours) and tests 7× faster without truncated SVD or 10× faster with it. Fast R-CNN also eliminates hundreds of gigabytes of disk storage, because it does not cache features.

Truncated SVD. Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression. Fig. 2 illustrates how using the top 1024 singular values from the 25088 × 4096 matrix in VGG16’s fc6 layer and the top 256 singular values from the 4096 × 4096 fc7 layer reduces runtime with little loss in mAP. Further speed-ups are possible with smaller drops in mAP if one fine-tunes again after compression.
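
The compression implied by the ranks quoted for Fig. 2 can be checked with a little arithmetic. The sketch below assumes the standard truncated-SVD factorization, in which a u × v weight matrix is replaced by two stacked layers holding t(u + v) weights in total. The function name `svd_param_counts` is hypothetical, and weight count is only a rough proxy for runtime.

```python
def svd_param_counts(u, v, t):
    """Return (original, compressed) weight counts for a u x v matrix
    approximated by its top-t singular values, W ~ U_t S_t V_t^T.

    The compressed form is two stacked layers: (S_t V_t^T) is t x v and
    U_t is u x t, so together they hold t * (u + v) weights.
    """
    return u * v, t * (u + v)


# Layer shapes and ranks quoted in the text for VGG16.
layers = {
    "fc6": (25088, 4096, 1024),
    "fc7": (4096, 4096, 256),
}

for name, (u, v, t) in layers.items():
    full, compressed = svd_param_counts(u, v, t)
    print(f"{name}: {full:,} -> {compressed:,} weights "
          f"({full / compressed:.1f}x fewer)")
```

Under these assumptions fc6 shrinks by roughly 3.4× and fc7 by 8×, which is consistent with the fully connected layers accounting for a large share of detection time.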

Which layers to fine-tune?

For the less deep networks considered in the SPPnet paper [11], fine-tuning only the fully connected layers appeared to be sufficient for good accuracy. We hypothesized that this result would not hold for very deep networks. To validate that fine-tuning the conv layers is important for VGG16, we use Fast R-CNN to fine-tune, but freeze the thirteen conv layers so that only the fully connected layers learn. This ablation emulates single-scale SPPnet training and decreases mAP from 66.9% to 61.4% (Table 5). This experiment verifies our hypothesis: training through the RoI pooling layer is important for very deep nets.

Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.
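
The layer counts in this discussion can be reproduced with a short sketch. It assumes the usual Caffe-style `conv<block>_<index>` names for VGG16's thirteen conv layers. The helper `trainable_conv_layers` is hypothetical and only selects layer names; it does not touch a real network.

```python
# VGG16's thirteen conv layers (naming convention is an assumption).
VGG16_CONV_LAYERS = [
    "conv1_1", "conv1_2",
    "conv2_1", "conv2_2",
    "conv3_1", "conv3_2", "conv3_3",
    "conv4_1", "conv4_2", "conv4_3",
    "conv5_1", "conv5_2", "conv5_3",
]


def trainable_conv_layers(first_trainable):
    """Return the conv layers that receive updates when fine-tuning starts
    at `first_trainable`; every earlier layer stays frozen."""
    start = VGG16_CONV_LAYERS.index(first_trainable)
    return VGG16_CONV_LAYERS[start:]


for first in ("conv3_1", "conv2_1", "conv1_1"):
    updated = trainable_conv_layers(first)
    print(f"from {first}: {len(updated)} of "
          f"{len(VGG16_CONV_LAYERS)} conv layers updated")
```

Starting at conv3_1 updates 9 of the 13 layers, matching the figure in the text. Starting at conv2_1 adds only two more layers, yet the text reports that this slows training by 1.3×.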

Design evaluation

We conducted experiments to understand how Fast R-CNN compares to R-CNN and SPPnet, as well as to evaluate design decisions. Following best practices, we performed these experiments on the PASCAL VOC07 dataset.

Does multi-task training help?

Multi-task training is convenient because it avoids managing a pipeline of sequentially-trained tasks. But it also has the potential to improve results because the tasks influence each other through a shared representation (the ConvNet) [2]. Does multi-task training improve object detection accuracy in Fast R-CNN?

To test this question, we train baseline networks that use only the classification loss, L_cls, in Eq. 1 (i.e., setting λ = 0). These baselines are printed for models S, M, and L in the first column of each group in Table 6. Note that these models do not have bounding-box regressors. Next (second column per group), we take networks that were trained with the multi-task loss (Eq. 1, λ = 1), but we disable bounding-box regression at test time. This isolates the networks’ classification accuracy and allows an apples-to-apples comparison with the baseline networks.

Across all three networks we observe that multi-task training improves pure classification accuracy relative to training for classification alone. The improvement ranges from +0.8 to +1.1 mAP points, showing a consistent positive effect from multi-task learning.

Finally, we take the baseline models (trained with only the classification loss), tack on the bounding-box regression layer, and train them with L_loc while keeping all other network parameters frozen. The third column in each group shows the results of this stage-wise training scheme: mAP improves over column one, but stage-wise training underperforms multi-task training (fourth column per group).
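
To make the role of λ in these ablations concrete, here is a minimal per-RoI sketch of the multi-task loss. It assumes Eq. 1 has the form L = L_cls + λ[u ≥ 1] L_loc, with a log loss for classification and a smooth-L1 localization term, as defined earlier in the paper. The function and argument names are hypothetical.

```python
import math


def smooth_l1(x):
    """Robust L1 loss: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def multitask_loss(probs, true_class, pred_box, target_box, lam=1.0):
    """L = L_cls + lam * [true_class >= 1] * L_loc for a single RoI.

    probs       -- softmax probabilities over K + 1 classes (index 0 = background)
    true_class  -- ground-truth class index u
    pred_box    -- predicted regression offsets for class u (4 numbers)
    target_box  -- regression targets v (4 numbers)
    lam         -- balance weight; lam = 0 gives the classification-only baseline
    """
    l_cls = -math.log(probs[true_class])
    if true_class == 0:  # background RoIs have no box target
        return l_cls
    l_loc = sum(smooth_l1(t - v) for t, v in zip(pred_box, target_box))
    return l_cls + lam * l_loc


# Toy values, made up purely for illustration.
probs = [0.1, 0.7, 0.2]
pred, target = [0.1, -0.2, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]
print(multitask_loss(probs, 1, pred, target, lam=0.0))  # classification only
print(multitask_loss(probs, 1, pred, target, lam=1.0))  # joint training
```

With λ = 0 the box term vanishes, which corresponds to the first-column baselines. With λ = 1 both terms send gradients into the shared ConvNet.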

Scale invariance: to brute force or finesse?

We compare two strategies for achieving scale-invariant object detection: brute-force learning (single scale) and image pyramids (multi-scale). In either case, we define the scale s of an image to be the length of its shortest side.

All single-scale experiments use s = 600 pixels; s may be less than 600 for some images as we cap the longest image side at 1000 pixels and maintain the image’s aspect ratio. These values were selected so that VGG16 fits in GPU memory during fine-tuning. The smaller models are not memory bound and can benefit from larger values of s; however, optimizing s for each model is not our main concern. We note that PASCAL images are 384 × 473 pixels on average and thus the single-scale setting typically upsamples images by a factor of 1.6. The average effective stride at the RoI pooling layer is thus ≈ 10 pixels.
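
The upsampling factor and effective stride quoted above follow from the resizing rule. The sketch below is one way to write that rule, assuming a stride of 16 pixels at VGG16's last conv layer (four 2 × 2 poolings). The function name `rescale_factor` is hypothetical.

```python
def rescale_factor(height, width, target_short=600, max_long=1000):
    """Scale factor that brings the shortest side to `target_short` pixels,
    reduced if necessary so the longest side stays within `max_long`.
    Aspect ratio is preserved because both sides share one factor."""
    short_side, long_side = min(height, width), max(height, width)
    factor = target_short / short_side
    if factor * long_side > max_long:
        factor = max_long / long_side
    return factor


CONV5_STRIDE = 16  # assumed: four 2x2 poolings precede VGG16's last conv layer

factor = rescale_factor(384, 473)  # average PASCAL image size from the text
print(f"upsampling factor: {factor:.2f}")
print(f"effective stride:  {CONV5_STRIDE / factor:.1f} px in the original image")
```

For the average 384 × 473 image the cap on the longest side is not reached, so the factor is 600 / 384 ≈ 1.6 and the effective stride is 16 / 1.6 ≈ 10 pixels. For a very elongated image, such as 300 × 1000, the cap takes over and s ends up below 600.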
In the multi-scale setting, we use the same five scales specified in [11] (s ∈ {480, 576, 688, 864, 1200}) to facilitate comparison with SPPnet. However, we cap the longest side at 2000 pixels to avoid exceeding GPU memory.

Table 7 shows models S and M when trained and tested with either one or five scales. Perhaps the most surprising result in [11] was that single-scale detection performs almost as well as multi-scale detection. Our findings confirm their result: deep ConvNets are adept at directly learning scale invariance. The multi-scale approach offers only a small increase in mAP at a large cost in compute time (Table 7). In the case of VGG16 (model L), we are limited to using a single scale by implementation details. Yet it achieves a mAP of 66.9%, which is slightly higher than the 66.0% reported for R-CNN [10], even though R-CNN uses “infinite” scales in the sense that each proposal is warped to a canonical size.

Since single-scale processing offers the best tradeoff between speed and accuracy, especially for very deep models, all experiments outside of this sub-section use single-scale training and testing with s = 600 pixels.

Do we need more training data?

A good object detector should improve when supplied with more training data. Zhu et al. [24] found that DPM [8] mAP saturates after only a few hundred to thousand training examples. Here we augment the VOC07 trainval set with the VOC12 trainval set, roughly tripling the number of images to 16.5k, to evaluate Fast R-CNN. Enlarging the training set improves mAP on VOC07 test from 66.9% to 70.0% (Table 1). When training on this dataset we use 60k mini-batch iterations instead of 40k.

We perform similar experiments for VOC10 and VOC12, for which we construct a dataset of 21.5k images from the union of VOC07 trainval, test, and VOC12 trainval. When training on this dataset, we use 100k SGD iterations and lower the learning rate by 0.1× each 40k iterations (instead of each 30k). For VOC10 and VOC12, mAP improves from 66.1% to 68.8% and from 65.7% to 68.4%, respectively.
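
The schedule described for the 21.5k-image set is a plain step decay. A minimal sketch follows; the base learning rate of 0.001 is an illustrative assumption rather than a value stated in this section, and `step_lr` is a hypothetical helper.

```python
def step_lr(iteration, base_lr=0.001, step_size=40_000, gamma=0.1):
    """Step learning-rate schedule: multiply by `gamma` every `step_size`
    iterations. `base_lr` is an assumed starting value."""
    return base_lr * gamma ** (iteration // step_size)


# 100k SGD iterations with a drop every 40k, as described in the text.
for it in (0, 39_999, 40_000, 80_000, 99_999):
    print(f"iteration {it:>6}: lr = {step_lr(it):.0e}")
```

Over 100k iterations this gives two drops, at 40k and 80k, so the last 20k iterations run at one hundredth of the starting rate.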

Do SVMs outperform softmax?

Fast R-CNN uses the softmax classifier learnt during fine-tuning instead of training one-vs-rest linear SVMs post-hoc, as was done in R-CNN and SPPnet. To understand the impact of this choice, we implemented post-hoc SVM training with hard negative mining in Fast R-CNN. We use the same training algorithm and hyper-parameters as in R-CNN.

Table 8 shows softmax slightly outperforming SVM for all three networks, by +0.1 to +0.8 mAP points. This effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches. We note that softmax, unlike one-vs-rest SVMs, introduces competition between classes when scoring a RoI.
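
The remark about competition between classes can be illustrated with a toy comparison. In the sketch below a per-class logistic function stands in for independent one-vs-rest scoring; this is an illustrative substitute, not an SVM. The scores are made up.

```python
import math


def softmax(scores):
    """Normalize raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_vs_rest(scores):
    """Score each class independently of the others (logistic stand-in
    for per-class one-vs-rest decision values)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


before = [2.0, 1.0, 0.0]
after = [2.0, 3.0, 0.0]  # only the raw score of class 1 changes

print(softmax(before)[0], softmax(after)[0])          # class 0 probability drops
print(one_vs_rest(before)[0], one_vs_rest(after)[0])  # class 0 score is unchanged
```

Raising one class's raw score lowers every other class's softmax probability for that RoI, whereas independent scorers leave the other classes untouched.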

Are more proposals always better?

There are (broadly) two types of object detectors: those that use a sparse set of object proposals (e.g., selective search [21]) and those that use a dense set (e.g., DPM [8]). Classifying sparse proposals is a type of cascade [22] in which the proposal mechanism first rejects a vast number of candidates leaving the classifier with a small set to evaluate. This cascade improves detection accuracy when applied to DPM detections [21]. We find evidence that the proposal-classifier cascade also improves Fast R-CNN accuracy.

Using selective search’s quality mode, we sweep from 1k to 10k proposals per image, each time re-training and re-testing model M. If proposals serve a purely computational role, increasing the number of proposals per image should not harm mAP.

We find that mAP rises and then falls slightly as the proposal count increases (Fig. 3, solid blue line). This experiment shows that swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy.

This result is difficult to predict without actually running the experiment. The state-of-the-art for measuring object proposal quality is Average Recall (AR) [12]. AR correlates well with mAP for several proposal methods using R-CNN, when using a fixed number of proposals per image. Fig. 3 shows that AR (solid red line) does not correlate well with mAP as the number of proposals per image is varied. AR must be used with care; higher AR due to more proposals does not imply that mAP will increase. Fortunately, training and testing with model M takes less than 2.5 hours. Fast R-CNN thus enables efficient, direct evaluation of object proposal mAP, which is preferable to proxy metrics.

We also investigate Fast R-CNN when using densely generated boxes (over scale, position, and aspect ratio), at a rate of about 45k boxes / image. This dense set is rich enough that when each selective search box is replaced by its closest (in IoU) dense box, mAP drops only 1 point (to 57.7%, Fig. 3, blue triangle). The statistics of the dense boxes differ from those of selective search boxes. Starting with 2k selective search boxes, we test mAP when adding a random sample of 1000 × {2, 4, 6, 8, 10, 32, 45} dense boxes. For each experiment we re-train and re-test model M. When these dense boxes are added, mAP falls more strongly than when adding more selective search boxes, eventually reaching 53.0%.
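
The replacement step, in which each selective search box is swapped for its closest dense box, can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples in continuous coordinates. The helper names and the toy boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def snap_to_dense(proposals, dense_boxes):
    """Replace each proposal with the dense box that overlaps it most."""
    return [max(dense_boxes, key=lambda d: iou(p, d)) for p in proposals]


# Toy data, made up purely for illustration.
dense = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 40, 40)]
proposals = [(1, 1, 11, 11), (22, 18, 38, 42)]
print(snap_to_dense(proposals, dense))
```

This brute-force search costs one IoU evaluation per proposal and dense box pair. With about 45k dense boxes per image, a vectorized or spatially indexed version would be preferable in practice.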
We also train and test Fast R-CNN using only dense boxes (45k / image). This setting yields a mAP of 52.9% (blue diamond). Finally, we check if SVMs with hard negative mining are needed to cope with the dense box distribution. SVMs do even worse: 49.3% (blue circle).

Preliminary MS COCO results

We applied Fast R-CNN (with VGG16) to the MS COCO dataset [18] to establish a preliminary baseline. We trained on the 80k image training set for 240k iterations and evaluated on the “test-dev” set using the evaluation server. The PASCAL-style mAP is 35.9%; the new COCO-style AP, which also averages over IoU thresholds, is 19.7%.

Conclusion

This paper proposes Fast R-CNN, a clean and fast update to R-CNN and SPPnet. In addition to reporting state-of-the-art detection results, we present detailed experiments that we hope provide new insights. Of particular note, sparse object proposals appear to improve detector quality. This issue was too costly (in time) to probe in the past, but becomes practical with Fast R-CNN. Of course, there may exist yet undiscovered techniques that allow dense boxes to perform as well as sparse proposals. Such methods, if developed, may help further accelerate object detection.

Acknowledgements. I thank Kaiming He, Larry Zitnick, and Piotr Dollár for helpful discussions and encouragement.

Documentation

Fast R-CNN (Ross Girshick, Microsoft Research): https://github.com/rbgirshick/fast-rcnn; 1504.08083v2.pdf

References

J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
R. Caruana. Multitask learning. Machine learning, 28(1), 1997.
K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010.
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. TPAMI, 2015.
K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? arXiv preprint arXiv:1502.05082, 2015.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. of the ACM International Conf. on Multimedia, 2014.
A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comp., 1989.
M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. arXiv e-prints, arXiv:1405.0312 [cs.CV], 2014.
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In ICLR, 2014.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, 2013.
X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes. Do we need more training data or better models for object detection? In BMVC, 2012.
Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. In CVPR, 2015.