SPPnet:Paper

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun


Abstract

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image. This requirement is “artificial” and may reduce the recognition accuracy for images or sub-images of arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning.
The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007.
In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvements made for this competition.

INTRODUCTION

We are witnessing a rapid, revolutionary change in our vision community, mainly caused by deep convolutional neural networks (CNNs) [1] and the availability of large-scale training data [2]. Deep-networks-based approaches have recently been substantially improving upon the state of the art in image classification [3], [4], [5], [6], object detection [7], [8], [5], many other recognition tasks [9], [10], [11], [12], and even non-recognition tasks.
However, there is a technical issue in the training and testing of the CNNs: the prevalent CNNs require a fixed input image size (e.g., 224×224), which limits both the aspect ratio and the scale of the input image. When applied to images of arbitrary sizes, current methods mostly fit the input image to the fixed size, either via cropping [3], [4] or via warping [13], [7], as shown in Figure 1 (top). But the cropped region may not contain the entire object, while the warped content may result in unwanted geometric distortion. Recognition accuracy can be compromised due to the content loss or distortion. Besides, a pre-defined scale may not be suitable when object scales vary. Fixing input sizes overlooks the issues involving scales.
So why do CNNs require a fixed input size? A CNN mainly consists of two parts: convolutional layers, and fully-connected layers that follow. The convolutional layers operate in a sliding-window manner and output feature maps which represent the spatial arrangement of the activations (Figure 2). In fact, convolutional layers do not require a fixed image size and can generate feature maps of any size. On the other hand, the fully-connected layers need to have fixed-size/length input by their definition. Hence, the fixed-size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.
In this paper, we introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words, we perform some information “aggregation” at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning. Figure 1 (bottom) shows the change of the network architecture by introducing the SPP layer. We call the new network structure SPP-net.
Spatial pyramid pooling [14], [15] (popularly known as spatial pyramid matching or SPM [15]), as an extension of the Bag-of-Words (BoW) model [16], is one of the most successful methods in computer vision. It partitions the image into divisions from finer to coarser levels, and aggregates local features in them. SPP has long been a key component in the leading and competition-winning systems for classification (e.g., [17], [18], [19]) and detection (e.g., [20]) before the recent prevalence of CNNs. Nevertheless, SPP has not been considered in the context of CNNs. We note that SPP has several remarkable properties for deep CNNs: 1) SPP is able to generate a fixed-length output regardless of the input size, while the sliding window pooling used in the previous deep networks [3] cannot; 2) SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations [15]; 3) SPP can pool features extracted at variable scales thanks to the flexibility of input scales. Through experiments we show that all these factors elevate the recognition accuracy of deep networks.
SPP-net not only makes it possible to generate representations from arbitrarily sized images/windows for testing, but also allows us to feed images with varying sizes or scales during training. Training with variable-size images increases scale-invariance and reduces over-fitting. We develop a simple multi-size training method. For a single network to accept variable input sizes, we approximate it by multiple networks that share all parameters, while each of these networks is trained using a fixed input size. In each epoch we train the network with a given input size, and switch to another input size for the next epoch. Experiments show that this multi-size training converges just as the traditional single-size training does, and leads to better testing accuracy.
The advantages of SPP are orthogonal to the specific CNN designs. In a series of controlled experiments on the ImageNet 2012 dataset, we demonstrate that SPP improves four different CNN architectures in existing publications [3], [4], [5] (or their modifications), over the no-SPP counterparts. These architectures have various filter numbers/sizes, strides, depths, or other designs. It is thus reasonable for us to conjecture that SPP should improve more sophisticated (deeper and larger) convolutional architectures. SPP-net also shows state-of-the-art classification results on Caltech101 [21] and Pascal VOC 2007 [22] using only a single full-image representation and no fine-tuning.
SPP-net also shows great strength in object detection. In the leading object detection method R-CNN [7], the features from candidate windows are extracted via deep convolutional networks. This method shows remarkable detection accuracy on both the VOC and ImageNet datasets. But the feature computation in R-CNN is time-consuming, because it repeatedly applies the deep convolutional networks to the raw pixels of thousands of warped regions per image. In this paper, we show that we can run the convolutional layers only once on the entire image (regardless of the number of windows), and then extract features by SPP-net on the feature maps. This method yields a speedup of over one hundred times over R-CNN. Note that training/running a detector on the feature maps (rather than image regions) is actually a more popular idea [23], [24], [20], [5]. But SPP-net inherits the power of the deep CNN feature maps and also the flexibility of SPP on arbitrary window sizes, which leads to outstanding accuracy and efficiency. In our experiment, the SPP-net-based system (built upon the R-CNN pipeline) computes features 24-102× faster than R-CNN, while having better or comparable accuracy. With the recent fast proposal method of EdgeBoxes [25], our system takes 0.5 seconds to process an image (including all steps). This makes our method practical for real-world applications.
A preliminary version of this manuscript has been published in ECCV 2014. Based on this work, we attended the competition of ILSVRC 2014 [26], and ranked #2 in object detection and #3 in image classification (both are provided-data-only tracks) among all 38 teams. There are a few modifications made for ILSVRC 2014. We show that the SPP-nets can boost various networks that are deeper and larger (Sec. 3.1.2-3.1.4) over the no-SPP counterparts. Further, driven by our detection framework, we find that multi-view testing on feature maps with flexibly located/sized windows (Sec. 3.1.5) can increase the classification accuracy. This manuscript also provides the details of these modifications.
We have released the code to facilitate future research: http://research.microsoft.com/en-us/um/people/kahe/.

DEEP NETWORKS WITH SPATIAL PYRAMID POOLING

Convolutional Layers and Feature Maps

Consider the popular seven-layer architectures [3], [4]. The first five layers are convolutional, some of which are followed by pooling layers. These pooling layers can also be considered as “convolutional”, in the sense that they are using sliding windows. The last two layers are fully connected, with an N-way softmax as the output, where N is the number of categories.
The deep network described above needs a fixed image size. However, we notice that the requirement of fixed sizes is only due to the fully-connected layers that demand fixed-length vectors as inputs. On the other hand, the convolutional layers accept inputs of arbitrary sizes. The convolutional layers use sliding filters, and their outputs have roughly the same aspect ratio as the inputs. These outputs are known as feature maps [1]; they involve not only the strength of the responses, but also their spatial positions.
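The claim that sliding-window layers accept any input size, and roughly keep the aspect ratio, can be checked with the usual output-size arithmetic. The sketch below is only illustrative: the three-layer stack in `LAYERS` is hypothetical (it is not one of the paper's architectures), and the floor-based size formula is an assumed convention.

```python
def conv_output_size(size, kernel, stride, pad=0):
    """Spatial size after one sliding-window layer (floor convention, assumed)."""
    return (size + 2 * pad - kernel) // stride + 1


# Hypothetical stack of (kernel, stride, pad) tuples; NOT the paper's architecture.
LAYERS = [(7, 2, 3), (3, 2, 1), (3, 2, 1)]


def feature_map_size(height, width, layers=LAYERS):
    """Propagate an input height/width through every sliding-window layer."""
    for kernel, stride, pad in layers:
        height = conv_output_size(height, kernel, stride, pad)
        width = conv_output_size(width, kernel, stride, pad)
    return height, width


if __name__ == "__main__":
    for h, w in [(224, 224), (180, 180), (320, 480)]:
        print((h, w), "->", feature_map_size(h, w))
```

With this toy stack, different input sizes simply give feature maps of different sizes, and a 2:3 input stays 2:3; nothing in the layers themselves forces a fixed input.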
In Figure 2, we visualize some feature maps. They are generated by some filters of the conv5 layer. Figure 2(c) shows the strongest activated images of these filters in the ImageNet dataset. We see a filter can be activated by some semantic content. For example, the 55-th filter (Figure 2, bottom left) is most activated by a circle shape; the 66-th filter (Figure 2, top right) is most activated by a ∧-shape; and the 118-th filter (Figure 2, bottom right) is most activated by a ∨-shape. These shapes in the input images (Figure 2(a)) activate the feature maps at the corresponding positions (the arrows in Figure 2).
It is worth noticing that we generate the feature maps in Figure 2 without fixing the input size. These feature maps generated by deep convolutional layers are analogous to the feature maps in traditional methods [27], [28]. In those methods, SIFT vectors [29] or image patches [28] are densely extracted and then encoded, e.g., by vector quantization [16], [15], [30], sparse coding [17], [18], or Fisher kernels [19]. These encoded features constitute the feature maps, and are then pooled by Bag-of-Words (BoW) [16] or spatial pyramids [14], [15]. Analogously, the deep convolutional features can be pooled in a similar way.

The Spatial Pyramid Pooling Layer

The convolutional layers accept arbitrary input sizes, but they produce outputs of variable sizes. The classifiers (SVM/softmax) or fully-connected layers require fixed-length vectors. Such vectors can be generated by the Bag-of-Words (BoW) approach [16] that pools the features together. Spatial pyramid pooling [14], [15] improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling of the previous deep networks [3], where the number of sliding windows depends on the input size.
To adapt the deep network to images of arbitrary sizes, we replace the last pooling layer (e.g., pool5, after the last convolutional layer) with a spatial pyramid pooling layer. Figure 3 illustrates our method. In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM-dimensional vectors, where M is the number of bins and k is the number of filters in the last convolutional layer. The fixed-dimensional vectors are the input to the fully-connected layer.
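A minimal pure-Python sketch of such a layer may help make the kM-dimensional output concrete. It assumes feature maps stored as nested lists, bin edges computed with floor/ceil of proportional positions, and an arbitrary concatenation order (level, then filter, then bin); none of these details are specified in the text above, and `spp_max_pool` is a hypothetical helper name.

```python
def spp_max_pool(feature_maps, levels=(3, 2, 1)):
    """Max-pool k feature maps into a vector of length k * sum(n*n for n in levels).

    feature_maps: list of k channels, each a list of h rows of w floats.
    Bin edges scale with the map size, so the number of bins is fixed.
    """
    output = []
    for n in levels:
        for channel in feature_maps:
            h, w = len(channel), len(channel[0])
            for i in range(n):
                # Row range of bin i: floor of the start, ceil of the end.
                r0, r1 = (i * h) // n, -(-((i + 1) * h) // n)
                for j in range(n):
                    c0, c1 = (j * w) // n, -(-((j + 1) * w) // n)
                    output.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return output


def make_maps(k, h, w):
    """Toy feature maps whose values increase along rows and columns."""
    return [[[float(r * w + c + ch) for c in range(w)] for r in range(h)]
            for ch in range(k)]


if __name__ == "__main__":
    # Two filters (k = 2), 3-level pyramid (M = 9 + 4 + 1 = 14): length 28 either way.
    print(len(spp_max_pool(make_maps(2, 13, 13))))
    print(len(spp_max_pool(make_maps(2, 10, 7))))
```

The 1×1 level in this sketch reduces each filter to its global maximum, which is the "global pooling" case; the finer levels keep coarse spatial layout.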
With spatial pyramid pooling, the input image can be of any size. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g., min(w, h)=180, 224, ...) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales. The scales play important roles in traditional methods, e.g., the SIFT vectors are often extracted at multiple scales [29], [27] (determined by the sizes of the patches and Gaussian filters). We will show that the scales are also important for the accuracy of deep networks.
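As a small illustration of resizing so that min(w, h) equals a chosen scale, the helper below computes the target dimensions while keeping the aspect ratio. The function name and the rounding to the nearest integer are assumptions made for this sketch; the actual image resampling is left to whatever image library is in use.

```python
def resize_to_min_side(width, height, s):
    """Return (new_width, new_height) so that the smaller side equals s.

    The aspect ratio is preserved up to integer rounding.
    """
    scale = s / min(width, height)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    for s in (180, 224):
        print(s, resize_to_min_side(500, 375, s))
```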
Interestingly, the coarsest pyramid level has a single bin that covers the entire image. This is in fact a “global pooling” operation, which is also investigated in several concurrent works. In [31], [32] a global average pooling is used to reduce the model size and also reduce overfitting; in [33], a global average pooling is used in the testing stage after all fc layers to improve accuracy; in [34], a global max pooling is used for weakly supervised object recognition. The global pooling operation corresponds to the traditional Bag-of-Words method.

Training the Network

Theoretically, the above network structure can be trained with standard back-propagation [1], regardless of the input image size. But in practice the GPU implementations (such as cuda-convnet [3] and Caffe [35]) are preferably run on fixed input images. Next we describe our training solution that takes advantage of these GPU implementations while still preserving the spatial pyramid pooling behaviors.

Single-size training

As in previous works, we first consider a network taking a fixed-size input (224×224) cropped from images. The cropping is for the purpose of data augmentation. For an image with a given size, we can pre-compute the bin sizes needed for spatial pyramid pooling. Consider the feature maps after conv5 that have a size of a×a (e.g., 13×13). With a pyramid level of n×n bins, we implement this pooling level as a sliding window pooling, where the window size is $win = \lceil a/n \rceil$ and the stride is $str = \lfloor a/n \rfloor$, with $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denoting ceiling and floor operations. With an l-level pyramid, we implement l such layers. The next fully-connected layer (fc6) will concatenate the l outputs. Figure 4 shows an example configuration of 3-level pyramid pooling (3×3, 2×2, 1×1) in the cuda-convnet style [3].
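The window and stride rule can be checked numerically. The sketch below computes the ceiling/floor values for each pyramid level and counts how many windows fit, assuming no padding and the usual floor-based window count (an assumption; pooling implementations differ in how they treat borders).

```python
import math


def pooling_params(a, n):
    """Window size ceil(a/n) and stride floor(a/n) for an n x n pyramid level."""
    return math.ceil(a / n), math.floor(a / n)


def num_windows(a, win, stride):
    """Windows along one axis of an a x a map (no padding, floor convention; assumed)."""
    return (a - win) // stride + 1


if __name__ == "__main__":
    # 3-level pyramid (3x3, 2x2, 1x1) on 13x13 and 10x10 conv5 maps.
    for a in (13, 10):
        for n in (3, 2, 1):
            win, stride = pooling_params(a, n)
            print(f"a={a} n={n}: win={win} stride={stride} "
                  f"windows={num_windows(a, win, stride)}")
```

For a=13 this gives (win, stride) of (5, 4), (7, 6) and (13, 13), each producing exactly n windows per axis. Under the convention assumed here the rule does not give n windows for every pair (a, n): for example a=10, n=6 yields win=2, stride=1 and 9 windows, so an implementation may need padding or explicit bin edges in such cases.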
The main purpose of our single-size training is to enable the multi-level pooling behavior. Experiments show that this is one reason for the gain in accuracy.

Multi-size training

Our network with SPP is expected to be applied to images of any size. To address the issue of varying image sizes in training, we consider a set of predefined sizes. We consider two sizes: 180×180 in addition to 224×224. Rather than crop a smaller 180×180 region, we resize the aforementioned 224×224 region to 180×180. So the regions at both scales differ only in resolution but not in content/layout. For the network to accept 180×180 inputs, we implement another fixed-size-input (180×180) network. The feature map size after conv5 is a×a = 10×10 in this case. Then we still use $win = \lceil a/n \rceil$ and $str = \lfloor a/n \rfloor$ to implement each pyramid pooling level. The output of the spatial pyramid pooling layer of this 180-network has the same fixed length as the 224-network. As such, this 180-network has exactly the same parameters as the 224-network in each layer. In other words, during training we implement the varying-input-size SPP-net by two fixed-size networks that share parameters.
To reduce the overhead of switching from one network (e.g., 224) to the other (e.g., 180), we train each full epoch on one network, and then switch to the other one (keeping all weights) for the next full epoch. This is iterated. In experiments, we find the convergence rate of this multi-size training to be similar to the above single-size training.
The main purpose of our multi-size training is to simulate the varying input sizes while still leveraging the existing well-optimized fixed-size implementations. Besides the above two-scale implementation, we have also tested a variant using s×s as input where s is randomly and uniformly sampled from [180, 224] at each epoch. We report the results of both variants in the experiment section.
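The two epoch-level schedules described above can be sketched as follows. This is only an illustration of the scheduling logic: the function names, starting with 224, the fixed random seed, and treating [180, 224] as an inclusive integer range are all assumptions, and the actual training step for each epoch is omitted.

```python
import random


def alternating_sizes(num_epochs, sizes=(224, 180)):
    """One input size per epoch, cycling through the predefined sizes."""
    return [sizes[epoch % len(sizes)] for epoch in range(num_epochs)]


def random_sizes(num_epochs, low=180, high=224, seed=0):
    """One input size s per epoch, drawn uniformly from the integers low..high."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(num_epochs)]


if __name__ == "__main__":
    # The weights are shared across sizes; only the input resolution changes.
    for epoch, size in enumerate(alternating_sizes(4)):
        print(f"epoch {epoch}: train on {size}x{size} inputs")
    print(random_sizes(4))
```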
Note that the above single/multi-size solutions are for training only. At the testing stage, it is straightforward to apply SPP-net to images of any size.

SPP-NET FOR IMAGE CLASSIFICATION

Experiments on ImageNet 2012 Classification

We train the networks on the 1000-category training set of ImageNet 2012. Our training algorithm follows the practices of previous work [3], [4], [36]. The images are resized so that the smaller dimension is 256, and a 224×224 crop is picked from the center or the four corners of the entire image. The data are augmented by horizontal flipping and color altering [3]. Dropout [3] is used on the two fully-connected layers. The learning rate starts from 0.01, and is divided by 10 (twice) when the error plateaus. Our implementation is based on the publicly available code of cuda-convnet [3] and Caffe [35]. All networks in this paper can be trained on a single GeForce GTX Titan GPU (6 GB memory) within two to four weeks.
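The crop geometry described above (center plus four corners of the whole resized image) can be written down as box coordinates. The sketch below is an assumption-laden illustration: the function name, the (left, top, right, bottom) box convention, and integer division for the center offset are choices made here, not details from the text.

```python
def five_crop_boxes(width, height, crop=224):
    """(left, top, right, bottom) boxes for the four corners and the center.

    Assumes the image has already been resized so that min(width, height) >= crop.
    """
    max_x, max_y = width - crop, height - crop
    origins = [
        (0, 0),                    # top-left corner
        (max_x, 0),                # top-right corner
        (0, max_y),                # bottom-left corner
        (max_x, max_y),            # bottom-right corner
        (max_x // 2, max_y // 2),  # center
    ]
    return [(x, y, x + crop, y + crop) for x, y in origins]


if __name__ == "__main__":
    # Example: an image resized so that its smaller side is 256.
    for box in five_crop_boxes(341, 256):
        print(box)
```

Adding the horizontal flip of each of these five crops gives ten views per image.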

Baseline Network Architectures

The advantages of SPP are independent of the convolutional network architectures used. We investigate four different network architectures in existing publications [3], [4], [5] (or their modifications), and we show SPP improves the accuracy of all these architectures. These baseline architectures are in Table 1 and briefly introduced below:
• ZF-5: this architecture is based on Zeiler and Fergus’s (ZF) “fast” (smaller) model [4]. The number indicates five convolutional layers.
• Convnet*-5: this is a modification of Krizhevsky et al.’s network [3]. We put the two pooling layers after conv2 and conv3 (instead of after conv1 and conv2). As a result, the feature maps after each layer have the same size as ZF-5.
• Overfeat-5/7: this architecture is based on the Overfeat paper [5], with some modifications as in [6]. In contrast to ZF-5/Convnet*-5, this architecture produces a larger feature map (18×18 instead of 13×13) before the last pooling layer. A larger filter number (512) is used in conv3 and the following convolutional layers. We also investigate a deeper architecture with 7 convolutional layers, where conv3 to conv7 have the same structures.
In the baseline models, the pooling layer after the last convolutional layer generates 6×6 feature maps, with two 4096-d fc layers and a 1000-way softmax layer following. Our replications of these baseline networks are in Table 2 (a). We train 70 epochs for ZF-5 and 90 epochs for the others. Our replication of ZF-5 is better than the one reported in [4]. This gain is because the corner crops are from the entire image, as is also reported in [36].

Multi-level Pooling Improves Accuracy

In Table 2 (b) we show the results using single-size training. The training and testing sizes are both 224×224. In these networks, the convolutional layers have the same structures as the corresponding baseline models, whereas the pooling layer after the final convolutional layer is replaced with the SPP layer. For the results in Table 2, we use a 4-level pyramid. The pyramid is {6×6, 3×3, 2×2, 1×1} (50 bins in total). For fair comparison, we still use the standard 10-view prediction with each view a 224×224 crop. Our results in Table 2 (b) show considerable improvement over the no-SPP baselines in Table 2 (a). Interestingly, the largest gain in top-1 error (1.65%) is given by the most accurate architecture. Since we are still using the same 10 cropped views as in (a), these gains are solely because of multi-level pooling.
It is worth noticing that the gain of multi-level pooling is not simply due to more parameters; rather, it is because the multi-level pooling is robust to the variance in object deformations and spatial layout [15]. To show this, we train another ZF-5 network with a different 4-level pyramid: {4×4, 3×3, 2×2, 1×1} (30 bins in total). This network has fewer parameters than its no-SPP counterpart, because its fc6 layer has 30×256-d inputs instead of 36×256-d. The top-1/top-5 errors of this network are 35.06/14.04. This result is similar to the 50-bin pyramid above (34.98/14.14), but considerably better than the no-SPP counterpart (35.99/14.76).
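The bin counts and the fc6 parameter comparison follow from simple arithmetic, sketched below. The 256 filters and 4096-d fc6 width come from the text; counting weights only (ignoring biases) and the helper names are assumptions of this sketch.

```python
def num_bins(levels):
    """Total number of spatial bins M in a pyramid with n x n bins per level."""
    return sum(n * n for n in levels)


def fc6_weights(bins, filters=256, fc_units=4096):
    """Weight count of fc6 when its input is bins * filters values (biases ignored)."""
    return bins * filters * fc_units


if __name__ == "__main__":
    configs = [
        ("no-SPP 6x6 pooling", 36),
        ("SPP {6,3,2,1}", num_bins((6, 3, 2, 1))),
        ("SPP {4,3,2,1}", num_bins((4, 3, 2, 1))),
    ]
    for name, bins in configs:
        print(f"{name}: {bins} bins, fc6 weights = {fc6_weights(bins):,}")
```

Under these assumptions the 30-bin pyramid gives fc6 about 31.5M weights versus about 37.7M for the 6×6 no-SPP baseline, while the 50-bin pyramid gives about 52.4M.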

Multi-size Training Improves Accuracy

ENG	KOR
Table 2 (c) shows our results using multi-size training. The training sizes are 224 and 180, while the testing size is still 224. We still use the standard 10-view prediction. The top-1/top-5 errors of all architectures drop further. The top-1 error of SPP-net (Overfeat-7) drops to 29.68%, which is 2.33% better than its no-SPP counterpart and 0.68% better than its single-size trained counterpart.	표 2 (c)는 다양한 크기의 훈련을 사용하여 우리의 결과를 보여줍니다. 테스트 크기가 여전히 우리는 여전히 10 표준 뷰 예측 (224)을 사용하는 동안 훈련 크기는 224, 180이다. 모든 아키텍처의 상단-1 / 최고 5 오류가 더 놓습니다. SPP-그물의 위쪽-1 오류 (Overfeat-7)의 noSPP 대응보다 2.33 % 더 나은와 싱글 사이즈 훈련 대응보다 0.68 % 더 나은 29.68 %로 떨어진다.
Besides using the two discrete sizes of 180 and 224, we have also evaluated using a random size uniformly sampled from [180, 224]. The top-1/top-5 error of SPP-net (Overfeat-7) is 30.06%/10.96%. The top-1 error is slightly worse than the two-size version, possibly because the size of 224 (which is used for testing) is visited less. But the results are still better than the single-size version.	180과 224의 두 개의 분리 된 크기를 사용하는 외에, 또한 균일 [180, 224]에서 샘플을 임의의 크기로 평가했다. SPP-NET의 최고 5분의 1 오류 (Overfeat-7) / 10.96 %, 30.06 %이다. 리면 1 에러 (테스트에 사용된다) (224)의 크기가 작은 방문 할 가능성이 있기 때문에, 2 크기의 버전보다 약간 더 나쁘다. 그러나 그 결과는 단일 - 사이즈 버전 여전히 낫다.
There are previous CNN solutions [5], [36] that deal with various scales/sizes, but they are mostly based on testing. In Overfeat [5] and Howard’s method [36], the single network is applied at multiple scales in the testing stage, and the scores are averaged. Howard further trains two different networks on low/high-resolution image regions and averages the scores. To our knowledge, our method is the first one that trains a single network with input images of multiple sizes.	이전 CNN 솔루션이있다 [5], 다양한 스케일 / 크기를 처리하지만, 그들은 주로 테스트를 기반으로 [36]. Overfeat [5] 및 하워드의 방법 [36]에있어서, 하나의 네트워크는 테스트 단계에서 다수의 스케일에 도포하고, 점수를 평균화된다. 하워드 더 열차 저 / highresolution 이미지 영역 평균 점수에 두 개의 서로 다른 네트워크. 우리의 지식에, 우리의 방법은 여러 크기의 입력 이미지가 하나의 네트워크를 훈련 첫 번째입니다.
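Multi-size training is possible because the SPP layer emits the same number of values whatever the size of the conv5 map. The following is a minimal pure-Python sketch of that property for one channel, not the authors' implementation. It assumes conv5 maps of 13×13 for a 224 input and 10×10 for a 180 input, a floor/ceil rule for bin boundaries, and one training size per epoch; all function names are hypothetical.

```python
import random


def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def spp_max_pool(fmap, pyramid=(6, 3, 2, 1)):
    """Max-pool one channel (a list of rows) into sum(n * n) values."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in pyramid:
        for i in range(n):
            # Assumed bin rule: rows floor(i*h/n) .. ceil((i+1)*h/n) - 1.
            r0, r1 = (i * h) // n, ceil_div((i + 1) * h, n)
            for j in range(n):
                c0, c1 = (j * w) // n, ceil_div((j + 1) * w, n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def size_for_epoch(epoch, sizes=(224, 180)):
    """Alternate between the discrete training sizes, one size per epoch (assumed schedule)."""
    return sizes[epoch % len(sizes)]


def random_size(rng, low=180, high=224):
    """Draw a training size uniformly from the integers in [low, high]."""
    return rng.randint(low, high)


def make_map(side):
    """Toy single-channel feature map with distinct, increasing values."""
    return [[r * side + c for c in range(side)] for r in range(side)]


# Both map sizes yield a 50-value vector per channel.
print(len(spp_max_pool(make_map(13))), len(spp_max_pool(make_map(10))))  # 50 50
print([size_for_epoch(e) for e in range(4)])  # [224, 180, 224, 180]
```

Because the pooled length depends only on the pyramid, the fully-connected layers can be shared across training sizes without any change.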

Full-image Representations Improve Accuracy

ENG	KOR
Next we investigate the accuracy of the full-image views. We resize the image so that min(w, h)=256 while maintaining its aspect ratio. The SPP-net is applied on this full image to compute the scores of the full view. For fair comparison, we also evaluate the accuracy of the single view in the center 224×224 crop (which is used in the above evaluations). The comparisons of single-view testing accuracy are in Table 3. Here we evaluate ZF-5/Overfeat-7. The top-1 error rates are all reduced by the full-view representation. This shows the importance of maintaining the complete content. Even though our network is trained using square images only, it generalizes well to other aspect ratios.	다음으로 우리는 전체 이미지보기의 정확성을 조사합니다. 분 (H, W) = 256의 종횡비를 유지하기 때문에 우리는 이미지의 크기를 조정. SPP-NET은 전체 뷰의 점수를 계산하기 위해이 전체 이미지에 적용됩니다. 공정한 비교를 위해, 우리는 (상기 평가에 사용되는) 중앙 224 × 224 작물의 단일보기의 정확성을 평가한다. 단일 뷰 테스트 정확도 비교 우리는 ZF-5 / Overfeat -7- 평가를 표 3에있다. 최상위 1 오류율은 모두 풀 뷰 표현에 의해 감소된다. 이것은 완전한 콘텐츠를 유지하는 것의 중요성을 보여준다. 우리의 네트워크 만 정사각형 이미지를 사용하여 훈련된다하더라도, 다른 종횡비 잘 일반화.
Comparing Table 2 and Table 3, we find that the combination of multiple views is substantially better than the single full-image view. However, the full-image representations still have merits. First, we empirically find (as discussed in the next subsection) that even for the combination of dozens of views, the additional two full-image views (with flipping) can still boost the accuracy by about 0.2%. Second, the full-image view is methodologically consistent with the traditional methods [15], [17], [19] where the encoded SIFT vectors of the entire image are pooled together. Third, in other applications such as image retrieval [37], an image representation, rather than a classification score, is required for similarity ranking. A full-image representation can be preferred.	표 2, 표 3을 비교, 우리는 여러보기의 조합이 하나의 전체 이미지보기보다 실질적으로 더 나은 것을 찾을 수 있습니다. 그러나, fullimage 표현은 여전히 좋은 장점입니다. 첫째, 우리는 경험적으로 (다음 섹션에서 설명)도보기 수십 조합, (뒤집기)와 추가로 두 개의 전체 이미지보기가 여전히 약 0.2 %의 정확도를 높일 수를 찾을 수 있습니다. 두 번째로, 전체 이미지 화면 전체 화상의 부호화 SIFT 벡터 함께 풀링 전통적인 방법 [15], [17], [19]을 방법 론적으로 일치한다. 셋째, 이러한 화상 검색 [37], 이미지 표현보다는 분류 스코어와 같은 다른 애플리케이션에서 유사성 순위 요구된다. 전체 이미지 표시가 바람직 할 수있다.
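The two inputs compared in Table 3 can be described with a little geometry: the full view is the whole image resized so that min(w, h) = 256, and the single-crop view is the central 224×224 region of that resized image. The sketch below computes both; rounding the longer side to the nearest integer is an assumption, and the helper names are hypothetical.

```python
def resize_min_side(w, h, s=256):
    """Return the (width, height) after scaling so the shorter side equals s."""
    if w <= h:
        return s, round(h * s / w)
    return round(w * s / h), s


def center_crop_box(w, h, crop=224):
    """Return (x0, y0, x1, y1) of the central crop inside a w x h image."""
    x0 = (w - crop) // 2
    y0 = (h - crop) // 2
    return x0, y0, x0 + crop, y0 + crop


full_w, full_h = resize_min_side(500, 375)
print(full_w, full_h)                   # 341 256
print(center_crop_box(full_w, full_h))  # (58, 16, 282, 240)
```

For a 500×375 image the central crop covers 224 of the 341 columns, so roughly a third of the width is discarded; the full view keeps it.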

Multi-view Testing on Feature Maps

ENG	KOR
Inspired by our detection algorithm (described in the next section), we further propose a multi-view testing method on the feature maps. Thanks to the flexibility of SPP, we can easily extract the features from windows (views) of arbitrary sizes from the convolutional feature maps.	(다음 절에 설명) 우리 검출 알고리즘에 의해 고무, 우리는 상기 기능지도 멀티 뷰 검사 방법을 제안한다. SPP의 유연성 덕분에, 우리는 쉽게 길쌈 기능 맵에서 임의의 크기의 창 (보기)에서 특징을 추출 할 수 있습니다.
In the testing stage, we resize an image so that min(w, h) = s, where s represents a predefined scale (like 256). Then we compute the convolutional feature maps from the entire image. For the usage of flipped views, we also compute the feature maps of the flipped image. Given any view (window) in the image, we map this window to the feature maps (the way of mapping is in Appendix), and then use SPP to pool the features from this window (see Figure 5). The pooled features are then fed into the fc layers to compute the softmax score of this window. These scores are averaged for the final prediction. For the standard 10-view, we use s = 256 and the views are 224×224 windows on the corners or center. Experiments show that the top-5 error of the 10-view prediction on feature maps is within 0.1% of the original 10-view prediction on image crops.	테스트 단계에서, 우리는 이미지의 크기를 조정들 (256 등) 미리 정의 된 배율을 나타내는 = s의 (시간, w) 분 그래서. 그 다음 우리는 전체 이미지에서 길쌈 기능 맵을 계산한다. 플립 뷰의 사용을 위해, 우리는 또한 대칭 이미지의 피쳐 맵을 계산한다. 이미지의 모든보기 (창)을 감안할 때, 우리는 (그림 5 참조) 기능지도에이 창을 (매핑의 방법은 부록에)지도, 다음이 창에서 기능을 풀 SPP를 사용합니다. 풀링 특징은 다음이 창 softmax를 점수를 계산하는 FC 층으로 공급된다. 이 점수는 최종 예측에 대한 평균 있습니다. 표준 10-보기 위해, 우리는 S = 256를 사용하고보기는 모서리 또는 중앙에 224 × 224 창입니다. 실험 기능지도 10보기 예측의 상위 5 오류가 이미지 작물의 원래 10 뷰 예측 주위에 0.1 % 내에 있음을 보여줍니다.
We further apply this method to extract multiple views from multiple scales. We resize the image to six scales s ∈ {224, 256, 300, 360, 448, 560} and compute the feature maps on the entire image for each scale. We use 224 × 224 as the view size for any scale, so these views have different relative sizes on the original image for different scales. We use 18 views for each scale: one at the center, four at the corners, and four on the middle of each side, with/without flipping (when s = 224 there are 6 different views). The combination of these 96 views reduces the top-5 error from 10.95% to 9.36%. Combining the two full-image views (with flipping) further reduces the top-5 error to 9.14%.	우리는 또한 여러 스케일에서 여러보기를 추출하는이 방법을 적용 할 수 있습니다. 우리는 여섯 규모의 ∈ {224, 256, 300, 360, 448, 560}에 이미지 크기를 조정하고 각 규모에 대한 전체 이미지에 기능 맵을 계산한다. 우리는 어떤 규모의 뷰 크기가 224 × 224을 사용하므로이 뷰는 다른 스케일 원본 이미지에 다른 상대적인 크기를 가지고있다. 우리는 각각의 규모 18 뷰 사용 틀지 않고 함께 / 각면의 중앙에 모서리 중앙에 하나, 네, 네을 (들 (224) = 때 6 개의보기가 있습니다). 이들 96 개의 뷰의 조합 10.95 %에서 9.36 %로 상위 5 에러를 감소시킨다. (뒤집기)와 두 fullimage보기를 결합하면 더 9.14 %로 상위 5 오류를 줄일 수 있습니다.
In the Overfeat paper [5], the views are also extracted from the convolutional feature maps instead of image crops. However, their views cannot have arbitrary sizes; rather, the windows are those where the pooled features match the desired dimensionality. We empirically find that these restricted windows are less beneficial than our flexibly located/sized windows.	길쌈 기능은 이미지 작물 대신지도에서 Overfeat 종이 [5]에서,보기도 추출된다. 그러나, 그들의 견해는 임의의 크기를 가질 수 없습니다; 오히려, 창 풀링 기능을 원하는 차원과 일치하는 곳입니다. 우리는 경험적으로 이러한 제한 창 우리의 유연 위치 / 크기의 창보다 유익한 것을 찾을 수 있습니다.
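The count of 96 views can be reproduced by enumerating window positions. The sketch below is one reading of the description above: at each scale the 224×224 window is placed on a 3×3 grid of positions (center, four corners, four side midpoints), duplicates are dropped, and each position is used with and without flipping. It assumes a non-square image (4:3 here), for which the scale s = 224 leaves three distinct positions; the helper names are hypothetical.

```python
SCALES = (224, 256, 300, 360, 448, 560)


def resize_min_side(w, h, s):
    """Return the (width, height) after scaling so the shorter side equals s."""
    if w <= h:
        return s, round(h * s / w)
    return round(w * s / h), s


def views_at_scale(w, h, view=224):
    """Return unique (x0, y0, flipped) views of size view x view in a w x h image."""
    xs = sorted({0, (w - view) // 2, w - view})
    ys = sorted({0, (h - view) // 2, h - view})
    # The 3 x 3 grid covers the center, the four corners, and the four side midpoints.
    positions = [(x, y) for y in ys for x in xs]
    return [(x, y, flip) for (x, y) in positions for flip in (False, True)]


def all_views(orig_w, orig_h):
    """Collect (scale, x0, y0, flipped) tuples over all six scales."""
    views = []
    for s in SCALES:
        w, h = resize_min_side(orig_w, orig_h, s)
        views += [(s,) + v for v in views_at_scale(w, h)]
    return views


views = all_views(400, 300)
print(len(views))                                # 96
print(sum(1 for v in views if v[0] == 224))      # 6
```

Each view would then be mapped onto the conv5 feature maps of its scale and pooled with SPP; that mapping is described in the paper's appendix and is not reproduced here.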

Summary and Results for ILSVRC 2014

ENG	KOR
In Table 4 we compare with previous state-of-the-art methods. Krizhevsky et al.’s [3] is the winning method in ILSVRC 2012; Overfeat [5], Howard’s [36], and Zeiler and Fergus’s [4] are the leading methods in ILSVRC 2013. We only consider single-network performance for manageable comparisons.	표 4에서 우리는 이전 국가의 theart 방법과 비교한다. Krizhevsky 등의 [3] ILSVRC 2012 년 우승 방법입니다.; Overfeat [5] 하워드의 [36], 그리고 ZEILER 및 퍼거스의 [4] ILSVRC 2013 년 최고의 방법은 우리는 단지 관리 비교를 위해 단일 네트워크 성능을 고려한다.
Our best single network achieves 9.14% top-5 error on the validation set. This is exactly the single-model entry we submitted to ILSVRC 2014 [26]. The top-5 error is 9.08% on the testing set (ILSVRC 2014 has the same training/validation/testing data as ILSVRC 2012). After combining eleven models, our team’s result (8.06%) is ranked #3 among all 38 teams attending ILSVRC 2014 (Table 5). Since the advantages of SPPnet should be in general independent of architectures, we expect that it will further improve the deeper and larger convolutional architectures [33], [32].	우리의 가장 좋은 하나의 네트워크는 검증 세트에 9.14 % 최고 5 오류를 얻을 수있다. 이것이 바로 우리가 ILSVRC 2014 [26]에 제출 단일 모델 항목입니다. 상위 5 오류 (ILSVRC 2014 ILSVRC 2012과 동일한 교육 / 검증 / 테스트 데이터가) 테스트 세트에 9.08 %이다. 열한 모델을 조합 한 후, 우리 팀의 결과는 (8.06 %) ILSVRC 2014 (표 5)에 참석하는 모든 38 팀 중 3 위를 기록하고 있습니다. SPPnet의 장점이 아키텍처의 일반적인 독립에 있어야하기 때문에, 우리는 더 깊고 더 큰 길쌈 구조 [33], [32] 개선 될 것으로 기대.

Experiments on VOC 2007 Classification

ENG	KOR
Our method can generate a full-view image representation. With the above networks pre-trained on ImageNet, we extract these representations from the images in the target datasets and re-train SVM classifiers [38]. In the SVM training, we intentionally do not use any data augmentation (flip/multi-view). We l2-normalize the features for SVM training.	우리의 방법은 전체 뷰 이미지 표현을 생성 할 수있다. 위의 네트워크 ImageNet에 사전 교육을받은, 우리는 대상 데이터 세트의 이미지에서 이러한 표현을 추출하고 다시 기차 SVM 분류 [38]. SVM 훈련에서, 우리는 의도적으로 데이터의 증가 (플립 / 멀티 뷰)를 사용하지 않습니다. 우리는 SVM 훈련 기능을-정상화 L2.
The classification task in Pascal VOC 2007 [22] involves 9,963 images in 20 categories. 5,011 images are for training, and the rest are for testing. The performance is evaluated by mean Average Precision (mAP). Table 6 summarizes the results.	파스칼 VOC 2007 [22]의 분류 작업은 20 카테고리에 9,963 이미지를 포함한다. 5011 이미지는 트레이닝 용으로, 나머지는 시험이다. 성능은 평균 평균 정밀 (MAP)에 의해 평가된다. 결과를 표 6에 요약한다.
We start from a baseline in Table 6 (a). The model is ZF-5 without SPP. To apply this model, we resize the image so that its smaller dimension is 224, and crop the center 224×224 region. The SVM is trained via the features of a layer. On this dataset, the deeper the layer is, the better the result is. In Table 6 (b), we replace the no-SPP net with our SPP-net. As a first-step comparison, we still apply the SPP-net on the center 224×224 crop. The results of the fc layers improve. This gain is mainly due to multi-level pooling.	우리는 표 6의 (a)에베이스 라인에서 시작합니다. 이 모델은 ZF-5 SPP하지 않고있다. 이 모델을 적용하기 위해, 우리는 그 작은 치수가 224이되도록 이미지 크기를 조정하고, 중앙 224 × 224 영역을 자르기. SVM은 층의 기능을 통해 훈련된다. 이 세트에서 깊은 층인수록 결과이다. 표 6의 (b)에서, 우리는 우리의 SPP-그물없는 SPP 그물을 교체합니다. 첫 번째 단계의 비교, 우리는 여전히 중앙 224 × 224 작물에 SPP-그물을 적용합니다. FC 층들의 결과를 향상시킨다. 이 이득은 다단계 풀링 주로 기인한다.
Table 6 (c) shows our results on full images, where the images are resized so that the shorter side is 224. We find that the results are considerably improved (78.39% vs. 76.45%). This is due to the full-image representation that maintains the complete content.	표 6 (c)는 짧은면 우리는 결과가 상당히 (78.39 % 대 76.45 %)을 개선 것을 발견 224이되도록 이미지의 크기가 변경되어 전체 이미지에 우리의 결과를 보여줍니다. 이는 전체 콘텐츠를 유지 전체 이미지 표현이다.
Because the usage of our network does not depend on scale, we resize the images so that the smaller dimension is s and use the same network to extract features. We find that s = 392 gives the best results (Table 6 (d)) based on the validation set. This is mainly because the objects occupy smaller regions in VOC 2007 but larger regions in ImageNet, so the relative object scales are different between the two sets. These results indicate scale matters in the classification tasks, and SPP-net can partially address this “scale mismatch” issue.	우리의 네트워크의 사용 규모에 의존하지 않기 때문에, 우리는 작은 치수의 수 있도록 이미지 크기를 조정하고 기능을 추출하기 위해 동일한 네트워크를 사용합니다. 우리는이 = 392 검증 세트를 기반으로 최상의 결과 (표 6 (d)를) 제공 s를 찾을 수 있습니다. 개체 ImageNet에서 VOC 2007 작은 영역이지만 큰 영역을 차지하기 때문에 주로이므로 상대 오브젝트 비늘 두 세트 사이 다르다. 이 결과는 분류 태스크에 스케일 물질을 나타내고, SPP-그물 부분이 "스케일 불일치"문제를 해결할 수있다.
In Table 6 (e) the network architecture is replaced with our best model (Overfeat-7, multi-size trained), and the mAP increases to 82.44%. Table 8 summarizes our results and the comparisons with the state-of-the-art methods. Among these methods, VQ [15], LCC [18], and FK [19] are all based on spatial pyramid matching, and [13], [4], [34], [6] are based on deep networks. In these results, Oquab et al.’s (77.7%) and Chatfield et al.’s (82.42%) are obtained by network fine-tuning and multi-view testing. Our result is comparable with the state of the art, using only a single full-image representation and without fine-tuning.	표 6 (e)에 네트워크 아키텍처는 우리의 최고의 모델 (Overfeat-7, 훈련 멀티 크기) 및 82.44 %로지도 증가로 대체됩니다. 표 8은 우리의 결과와 국가의 theart 방법과 비교를 요약 한 것입니다. 이들 방법 중에서, VQ [15], LCC [18], 및 FK [19] 일치하는 모든 공간 피라미드에 기초되고 [13], [4], [34], [6] 깊은 네트워크에 기초한다. 이러한 결과에서 Oquab 외.의 (77.7 %)과 Chatfield 외.의 (82.42 %)는 네트워크 미세 조정 및 멀티 뷰 테스트를 통해 얻을 수있다. 이러한 결과는 하나의 전체 이미지 표현과 미세 조정을 사용하지 않고, 종래 기술과 비교할 수있다.
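The SVM classifiers in these experiments are trained on l2-normalized features. As a small illustration, assuming a feature is a plain list of floats and adding a guard against zero vectors (the guard is not mentioned in the text), the normalization can be sketched as:

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a feature vector to unit Euclidean length; eps guards against zero vectors."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


feature = [3.0, 4.0]
unit = l2_normalize(feature)
print(unit)  # [0.6, 0.8]
```

Normalizing puts features extracted from different layers and image scales on a common length before they reach the linear classifier.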

Experiments on Caltech101

ENG	KOR
The Caltech101 dataset [21] contains 9,144 images in 102 categories (one background). We randomly sample 30 images per category for training and up to 50 images per category for testing. We repeat 10 random splits and average the accuracy. Table 7 summarizes our results.	Caltech101 데이터 세트는 [21] (102) 범주 (한 배경)에서 9144 이미지가 포함되어 있습니다. 우리는 무작위로 훈련 및 테스트 카테고리 당 50 이미지까지 카테고리 당 30 이미지를 샘플링. 우리는 10 임의의 분할을 반복 정확성을 평균. 표 7은 우리의 결과를 요약 한 것입니다.
There are some common observations in the Pascal VOC 2007 and Caltech101 results: SPP-net is better than the no-SPP net (Table 7 (b) vs. (a)), and the full-view representation is better than the crop ((c) vs. (b)). But the results in Caltech101 differ from Pascal VOC in some respects. The fully-connected layers are less accurate, and the SPP layers are better. This is possibly because the object categories in Caltech101 are less related to those in ImageNet, and the deeper layers are more category-specialized. Further, we find that the scale 224 has the best performance among the scales we tested on this dataset. This is mainly because the objects in Caltech101 also occupy large regions of the images, as is the case of ImageNet.	파스칼 VOC 2007 Caltech101 결과 몇 가지 일반적인 관찰이있다 : SPP-그물 노 SPP 그물보다 좋다 (표 7의 (b) 대 (A)), 그리고 fullview 표현은 농작물보다 더 ((c ) 대 (b)). 그러나 Caltech101의 결과는 파스칼 VOC와 약간의 차이가 있습니다. 완전히 연결된 층은 덜 정확하고, SPP 층 더 낫다. Caltech101에서 오브젝트 카테고리 ImageNet 형태와 관련 이하이고, 깊은 층 이상의 카테고리 특화되어 있기 때문에 가능하다. 또한, 우리는 규모 (224) 우리는이 데이터 집합 테스트에서 규모 중 최고의 성능을 가지고 찾을 수 있습니다. Caltech101의 객체는 이미지에 큰 영역을 차지하기 때문에 ImageNet의 경우와 같이이 중심이다.
Besides cropping, we also evaluate warping the image to fit the 224×224 size. This solution maintains the complete content, but introduces distortion. On the SPP (ZF-5) model, the accuracy is 89.91% using the SPP layer as features - lower than 91.44% which uses the same model on the undistorted full image.	자르기 외에, 우리는 또한 224 × 224 크기에 맞게 이미지를 휘게 평가합니다. 이 용액의 전체 콘텐츠를 유지하지만 왜곡을 도입한다. 이하 91.44 % 왜곡 전체 이미지에서 동일한 모델을 사용 - SPP (ZF-5) 모델, 정밀도 등의 기능 층을 SPP하여 89.91 %이다.
Table 8 summarizes our results compared with the state-of-the-art methods on Caltech101. Our result (93.42%) exceeds the previous record (88.54%) by a substantial margin (4.88%).	표 8 Caltech101에 최첨단 방법에 비해 우리의 결과를 요약 한 것입니다. 우리의 결과 (93.42 %)이 상당한 마진 (4.88 %)에 의해 이전 기록 (88.54 %)를 초과합니다.
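The Caltech101 protocol described above (30 training images per category, up to 50 test images per category, averaged over 10 random splits) can be sketched as follows. This is an illustration only: `evaluate` is a hypothetical callback standing in for feature extraction, SVM training, and scoring, and the seed handling is an assumption.

```python
import random


def split_category(items, rng, n_train=30, max_test=50):
    """Shuffle one category and return (train, test) with at most max_test test items."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + max_test]


def mean_accuracy_over_splits(dataset, evaluate, n_splits=10, seed=0):
    """Average evaluate(train, test) over n_splits random splits of a {category: items} dict."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_splits):
        train, test = {}, {}
        for category, items in dataset.items():
            train[category], test[category] = split_category(items, rng)
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / len(accuracies)


# Toy usage with a dummy evaluator that reports the fraction of items used for testing.
toy = {"large": list(range(100)), "small": list(range(45))}


def dummy_evaluate(train, test):
    n_test = sum(len(v) for v in test.values())
    n_train = sum(len(v) for v in train.values())
    return n_test / (n_test + n_train)


print(mean_accuracy_over_splits(toy, dummy_evaluate))  # 0.52
```

Categories with fewer than 80 images contribute all of their remaining images to the test set, which is why the limit is stated as "up to 50".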

SPP-NET FOR OBJECT DETECTION

ENG	KOR
Deep networks have been used for object detection. We briefly review the recent state-of-the-art R-CNN method [7]. R-CNN first extracts about 2,000 candidate windows from each image via selective search [20]. Then the image region in each window is warped to a fixed size (227×227). A pre-trained deep network is used to extract the feature of each window. A binary SVM classifier is then trained on these features for detection. R-CNN generates results of compelling quality and substantially outperforms previous methods. However, because R-CNN repeatedly applies the deep convolutional network to about 2,000 windows per image, it is time-consuming. Feature extraction is the major timing bottleneck in testing.	깊은 네트워크 객체 검출을 위해 사용되었다. 우리는 잠시 최근 최첨단 R-CNN 방법을 검토 [7]. R-CNN은 제 1 선택 검색 [20]를 통해 각 이미지에서 약 2,000 후보 창을 추출합니다. 각 윈도우의 화상 영역은 고정 된 크기 (227 × 227)로 변형 된 것이다. 사전 교육을받은 깊은 네트워크는 각 윈도우의 특징을 추출하는 데 사용됩니다. 이진 SVM 분류는 다음 검출을위한 이러한 기능에 대한 교육을한다. R-CNN은 뛰어난 품질의 결과를 생성하고, 실질적으로 이전의 방법을 능가하는 성능. R-CNN 반복적 이미지 당 약 2,000 창문 깊은 컨벌루션 네트워크를 적용하기 때문에, 이는 시간 소모적이다. 특징 추출 테스트의 주요 타이밍 병목 현상입니다.
Our SPP-net can also be used for object detection. We extract the feature maps from the entire image only once (possibly at multiple scales). Then we apply the spatial pyramid pooling on each candidate window of the feature maps to pool a fixed-length representation of this window (see Figure 5). Because the time-consuming convolutions are only applied once, our method can run orders of magnitude faster.	우리 SPP-NET는 물체 검출에 사용될 수있다. 우리는 (아마도 여러 규모에서) 한 번만 전체 이미지에서 기능 맵의 압축을 풉니 다. 그렇다면 우리는이 윈도우의 고정 길이 표현을 풀 피쳐 맵 각 후보 창 풀링 공간 피라미드 적용 (도 5 참조). 시간이 많이 소요되는 회선은 한 번만 적용되기 때문에, 우리의 방법은 빠른 진도의 명령을 실행할 수 있습니다.
Our method extracts window-wise features from regions of the feature maps, while R-CNN extracts directly from image regions. In previous works, the Deformable Part Model (DPM) [23] extracts features from windows in HOG [24] feature maps, and the Selective Search (SS) method [20] extracts from windows in encoded SIFT feature maps. The Overfeat detection method [5] also extracts from windows of deep convolutional feature maps, but needs to predefine the window size. On the contrary, our method enables feature extraction in arbitrary windows from the deep convolutional feature maps.	R-CNN은 이미지 영역에서 직접 추출하는 동안 우리의 방법은, 기능지도의 영역에서 창 많다는 특징을 추출한다. 이전 작품에서, 변형 부 모델 (DPM) [23] [24] 기능지도 및 선택적 검색 (SS) 방법 인코딩 된 SIFT 기능지도에서 창에서 [20] 추출 HOG의 창에서 기능을 추출한다. Overfeat 검출 방법 [5] 또한 깊은 컨벌루션 기능 맵 윈도우에서 추출되지만 윈도우 사이즈를 미리 정의 할 필요가있다. 반대로, 우리의 방법은 깊은 길쌈 기능 맵에서 임의의 창에서 특징 추출 할 수 있습니다.

Detection Algorithm

ENG	KOR
We use the “fast” mode of selective search [20] to generate about 2,000 candidate windows per image. Then we resize the image such that min(w, h) = s, and extract the feature maps from the entire image. We use the SPP-net model of ZF-5 (single-size trained) for the time being. In each candidate window, we use a 4-level spatial pyramid (1×1, 2×2, 3×3, 6×6, 50 bins in total) to pool the features. This generates a 12,800-d (256×50) representation for each window. These representations are provided to the fully-connected layers of the network. Then we train a binary linear SVM classifier for each category on these features.	우리는 이미지 당 약 2,000 후보 창을 생성하는 선택적 검색 [20]의 "빠른"모드를 사용합니다. 그 다음 우리는 이미지와 같은 그 분 (W, H) = s의 크기를 조정하고, 전체 이미지에서 기능 맵의 압축을 풉니 다. 우리는 당분간 ZF-5의 SPP-네트 모델 (훈련 싱글 사이즈)를 사용합니다. 각 후보 창에서, 우리는 4 수준의 공간 피라미드를 사용 (1 × 1, 2 × 2, 3 × 3, 6 × 6, 완전히 50 쓰레기통) 기능을 풀 수 있습니다. 이는 각 창 12,800- D (256 × 50) 표현을 생성합니다. 이 표현은 네트워크의 완전 연결 층에 제공된다. 그 다음 우리는이 기능에 대한 각 범주에 대한 이진 선형 SVM 분류기를 훈련.
Our implementation of the SVM training follows [20], [7]. We use the ground-truth windows to generate the positive samples. The negative samples are those overlapping a positive window by at most 30% (measured by the intersection-over-union (IoU) ratio). Any negative sample is removed if it overlaps another negative sample by more than 70%. We apply the standard hard negative mining [23] to train the SVM. This step is iterated once. It takes less than 1 hour to train SVMs for all 20 categories. In testing, the classifier is used to score the candidate windows. Then we use non-maximum suppression [23] (threshold of 30%) on the scored windows.	SVM 훈련의 우리의 구현은 [20] 다음, [7]. 우리는 양의 샘플을 생성하기 위해 접지 진실 창을 사용한다. 부정적인 샘플 (교차 오버 노조 (IOU) 비율로 측정) 최대 30 % 긍정적 인 창을 겹치는 것들이다. 이 70 % 이상으로 다른 음성 시료를 중복하는 경우 어떤 부정적인 샘플이 제거됩니다. 우리는 SVM을 훈련 할 수있는 표준 하드 부정적인 광산 [23]을 적용합니다. 이 단계가 한 번 반복된다. 그것은 모든 20 범주에 대한 SVM을 훈련을 1 시간 미만 소요됩니다. 테스트에서, 분류는 후보 창을 득점하는 데 사용됩니다. 그 다음 우리는 득점 창에 비 최대 억제 [23] (30 % 임계 값)를 사용합니다.
Our method can be improved by multi-scale feature extraction. We resize the image such that min(w, h) = s ∈ S = {480, 576, 688, 864, 1200}, and compute the feature maps of conv5 for each scale. One strategy of combining the features from these scales is to pool them channel-by-channel. But we empirically find that another strategy provides better results. For each candidate window, we choose a single scale s ∈ S such that the scaled candidate window has a number of pixels closest to 224×224. Then we only use the feature maps extracted from this scale to compute the feature of this window. If the pre-defined scales are dense enough and the window is approximately square, our method is roughly equivalent to resizing the window to 224×224 and then extracting features from it. Nevertheless, our method only requires computing the feature maps once (at each scale) from the entire image, regardless of the number of candidate windows.	우리의 방법은 멀티 - 스케일 피쳐 추출함으로써 개선 될 수있다. 우리는 이미지의 크기를 조정되도록 분 = S ∈ S = {480, 576, 688, 864, 1200}, 각 규모에 대한 conv5의 기능 맵을 계산 (시간, w). 이러한 스케일에서 기능을 결합 한 전략은 그들에게 채널로 채널을 풀 것입니다. 그러나 우리는 경험적으로 또 다른 전략이 더 나은 결과를 제공하는 찾을 수 있습니다. 각 후보 창을 위해, 우리는 확장 후보 창이 픽셀 224 × 224에 가장 가까운 수를 가지고 단일 규모의 ∈ S는 선택합니다. 그리고 우리는이 윈도우의 기능을 계산하기 위해이 규모에서 추출 기능 맵을 사용합니다. 미리 정의 된 비늘 충분히 치밀하고 창은 대략 사각형 인 경우, 우리의 방법은 224 × 224 윈도우 크기를 조절하고 그것으로부터 특징 추출과 거의 동일하다. 그럼에도 불구하고, 우리의 방법은 관계없이 후보 윈도우의 수, 전체 이미지 (각 규모) 일단 피쳐 맵을 계산이 필요하다.
We also fine-tune our pre-trained network, following [7]. Since our features are pooled from the conv5 feature maps from windows of any sizes, for simplicity we only fine-tune the fully-connected layers. In this case, the data layer accepts the fixed-length pooled features after conv5, and the fc6,7 layers and a new 21-way (one extra negative category) fc8 layer follow. The fc8 weights are initialized with a Gaussian distribution of σ=0.01. We fix all the learning rates to 1e-4 and then adjust to 1e-5 for all three layers. During fine-tuning, the positive samples are those overlapping with a ground-truth window by [0.5, 1], and the negative samples by [0.1, 0.5). In each mini-batch, 25% of the samples are positive. We train 250k minibatches using the learning rate 1e-4, and then 50k mini-batches using 1e-5. Because we only fine-tune the fc layers, the training is very fast and takes about 2 hours on the GPU (excluding pre-caching feature maps which takes about 1 hour). Also following [7], we use bounding box regression to post-process the prediction windows. The features used for regression are the pooled features from conv5 (as a counterpart of the pool5 features used in [7]). The windows used for the regression training are those overlapping with a ground-truth window by at least 50%.	우리는 또한 미세 조정 우리의 사전 교육 네트워크, 다음 [7]. 우리의 기능은 간단하게 우리 만 미세 조정 완전히 연결된 레이어, 어떤 크기의 창에서 conv5 기능지도에서 풀링되기 때문에. 이 경우, 데이터 영역은 고정 길이를 풀링 conv5 후 기능 및 fc6,7 층 새로운 21 웨이 (엑스트라 마이너스 카테고리) FC8 층 추적을 받아 들인다. FC8 가중치는 σ = 0.01의 가우스 분포로 초기화된다. 우리는 1E-4에 대한 모든 학습 속도를 수정하고 다음 세 가지 레이어 1E-5로 조정합니다. 미세 조정하는 동안, 긍정적 인 샘플은 [0.5, 1]에 의해 지상 진실 창에 중복 해당하고, [0.1, 0.5)에 의해 부정적인 샘플입니다. 각각의 미니 - 배치에서, 샘플의 25 %는 긍정적이다. 우리는 그 다음 1E-5를 사용하여 50K 미니 일괄 학습 속도를 1E-4를 사용 250K minibatches를 양성합니다. 우리 만 미세 조정 FC 층 때문에, 훈련은 매우 빠르고 및 (약 1 시간 소요 미리 캐싱 기능 맵 제외) GPU에 약 2 시간이 소요됩니다. 또한 다음 [7], 우리는 사후 처리에 예측 창을 상자 회귀를 경계 사용합니다. 회귀 사용 기능 conv5으로부터 (에 사용 pool5 기능으로 대응 [7]) 풀링 특징이다. 회귀 훈련에 사용 창 50 % 이상 지상 진실 윈도우와 중첩들이다.
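Several steps of the detection pipeline above reduce to box arithmetic: the IoU overlap, non-maximum suppression at a 30% threshold, choosing the scale at which a window is closest to 224×224 pixels, and assigning fine-tuning labels from the IoU ranges [0.5, 1] and [0.1, 0.5). The sketch below illustrates them under stated assumptions: boxes are (x0, y0, x1, y1) tuples, NMS is the usual greedy variant with IoU as its overlap measure, and all function names are hypothetical rather than the authors' code.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(scored_boxes, threshold=0.3):
    """Greedy NMS over (score, box) pairs: keep a box unless it overlaps a kept one too much."""
    kept = []
    for score, box in sorted(scored_boxes, key=lambda sb: sb[0], reverse=True):
        if all(iou(box, other) <= threshold for _, other in kept):
            kept.append((score, box))
    return kept


def pick_scale(box, img_w, img_h, scales=(480, 576, 688, 864, 1200), target=224 * 224):
    """Pick the scale s at which the resized window's pixel count is closest to target."""
    area = (box[2] - box[0]) * (box[3] - box[1])

    def scaled_area(s):
        factor = s / min(img_w, img_h)  # image is resized so that min(w, h) = s
        return area * factor * factor

    return min(scales, key=lambda s: abs(scaled_area(s) - target))


def finetune_label(box, gt_boxes):
    """Return 'pos' for IoU in [0.5, 1], 'neg' for [0.1, 0.5), and None otherwise."""
    best = max((iou(box, gt) for gt in gt_boxes), default=0.0)
    if best >= 0.5:
        return "pos"
    if best >= 0.1:
        return "neg"
    return None


print(pick_scale((0, 0, 100, 100), 500, 375))             # 864
print(finetune_label((0, 0, 10, 10), [(5, 0, 15, 10)]))   # neg
```

In the example, a 100×100 window in a 500×375 image becomes about 230×230 pixels at s = 864, the closest of the five scales to 224×224, so only the feature maps of that scale would be used for it.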

Detection Results

ENG	KOR
We evaluate our method on the detection task of the Pascal VOC 2007 dataset. Table 9 shows our results on various layers, by using 1-scale (s=688) or 5-scale. Here the R-CNN results are as reported in [7] using the AlexNet [3] with 5 conv layers. Using the pool5 layers (in our case the pooled features), our result (44.9%) is comparable with R-CNN’s result (44.2%). But using the non-fine-tuned fc6 layers, our results are inferior. An explanation is that our fc layers are pretrained using image regions, while in the detection case they are used on the feature map regions. The feature map regions can have strong activations near the window boundaries, while the image regions may not. This difference of usages can be addressed by fine-tuning. Using the fine-tuned fc layers (ftfc6,7), our results are comparable with or slightly better than the fine-tuned results of R-CNN. After bounding box regression, our 5-scale result (59.2%) is 0.7% better than R-CNN (58.5%), and our 1-scale result (58.0%) is 0.5% worse.	우리는 파스칼 VOC 2007 데이터 세트의 검출 작업에 우리의 방법을 평가합니다. 표 9는 1 규모 (들 = 688) 또는 5 스케일을 사용하여, 다양한 계층에 우리의 결과를 보여줍니다. 여기서 R-CNN 결과 [3] 5 층으로 전환 [7]을 이용 AlexNet에보고 된 바와 같다. (우리의 경우 풀링 기능) pool5 레이어를 사용하여, 우리의 결과 (44.9 %)은 R-CNN의 결과 (44.2 %)와 비교입니다. 그러나 비 미세 조정 FC6 층을 사용하여, 우리의 결과는 떨어진다. 설명 검출 경우가 기능 맵 영역에 사용되지만 우리 FC 층, 화상 영역을 이용 pretrained된다는 것이다. 피쳐 맵 영역하면서 화상 영역이 아닌 수 창 경계 근처 강한 활성화를 가질 수있다. 사용량의 차이는 미세 조정에 의해 해결 될 수있다. 미세 조정 FC 층 (ftfc6,7)를 사용하여, 우리의 결과와 비교 또는 R-CNN의 미세 조정 결과보다 약간 더 낫다. 상자의 회귀를 경계 한 후, 우리의 5 스케일 결과 (59.2 %)은 R-CNN (58.5 %)보다 0.7 % 더 나은, 그리고 우리의 1 규모의 결과 (58.0 %)이 0.5 % 더 나쁘다.
In Table 10 we further compare with R-CNN using the same pre-trained model of SPP-net (ZF-5). In this case, our method and R-CNN have comparable averaged scores. The R-CNN result is boosted by this pre-trained model. This is because of the better architecture of ZF-5 than AlexNet, and also because of the multi-level pooling of SPP-net (if using the no-SPP ZF-5, the R-CNN result drops). Table 11 shows the results for each category.	표 10에서 우리는 또한 R-CNN은 SPPnet (ZF-5)의 동일한 사전 훈련 모델을 사용하여 비교. 이 경우, 우리의 방법과 R-CNN은 비교가 점수를 평균 있습니다. R-CNN 결과는이 사전 훈련 모델에 의해 증폭된다. 이것은 (사용하는 경우 noSPP ZF-5, R-CNN 결과 방울)로 인해, 또한 때문에 SPPnet의 다단계 풀링 AlexNet보다 ZF-5 나은 아키텍처이다. 표 11은 각 범주에 대한 결과를 보여줍니다.
Table 11 also includes additional methods. Selective Search (SS) [20] applies spatial pyramid matching on SIFT feature maps. DPM [23] and Regionlet [39] are based on HOG features [24]. The Regionlet method improves to 46.1% [8] by combining various features including conv5. DetectorNet [40] trains a deep network that outputs pixel-wise object masks. This method only needs to apply the deep network once to the entire image, as is the case for our method. But this method has lower mAP (30.5%).	표 11 추가적인 방법을 포함한다. 선택적 검색 (SS) [20] SIFT 기능지도에 공간 피라미드 매칭을 적용합니다. DPM [23]과 Regionlet [39] HOG 기능 [24]을 기반으로합니다. Regionlet 방법은 conv5 등 다양한 기능을 결합하여 46.1 % [8]로 향상시킨다. DetectorNet는 [40] 픽셀 현명한 객체 마스크를 출력하는 깊은 네트워크를 훈련한다. 우리의 방법에 대한 경우와 같이,이 방법은, 화상 전체를 한 번 깊은 네트워크를 적용 할 필요가있다. 그러나이 방법은 낮은지도 (30.5 %)가 있습니다.

Complexity and Running Time

ENG	KOR
Despite having comparable accuracy, our method is much faster than R-CNN. The complexity of the convolutional feature computation in R-CNN is O(n ·227^2) with the window number n (∼2000). This complexity of our method is O(r · s^2) at a scale s, where r is the aspect ratio. Assume r is about 4/3. In the single-scale version when s = 688, this complexity is about 1/160 of R-CNN’s; in the 5-scale version, this complexity is about 1/24 of R-CNN’s.	비교 정확도를 갖는에도 불구하고, 우리의 방법은 R-CNN보다 훨씬 빠릅니다. R-CNN의 컨벌루션 특성 연산의 복잡도는 윈도우 수 N (~2000)와 (N ^ 2 * 227) O이다. 우리의 방법이 복잡 R은 종횡비 스케일들에서, O (R · S ^ 2)이다. r은 약 4/3 가정합니다. s는 688 = 단일 스케일 버전에서,이 복잡성은 R-CNN의 약 1/160이고; 5 스케일 버전에서, 이러한 복잡성은 약 1/24의 R-CNN이다.
In Table 10, we provide a fair comparison on the running time of the feature computation using the same SPP (ZF-5) model. The implementation of R-CNN is from the code published by the authors implemented in Caffe [35]. We also implement our feature computation in Caffe. In Table 10 we evaluate the average time of 100 random VOC images using GPU. R-CNN takes 14.37s per image for convolutions, while our 1-scale version takes only 0.053s per image. So ours is 270× faster than R-CNN. Our 5-scale version takes 0.293s per image for convolutions, so is 49× faster than R-CNN. Our convolutional feature computation is so fast that the computational time of fc layers takes a considerable portion. Table 10 shows that the GPU time of computing the 4,096-d fc7 features is 0.089s per image. Considering both convolutional and fully-connected features, our 1-scale version is 102× faster than R-CNN and is 1.2% inferior; our 5-scale version is 38× faster and has comparable results.	표 10에서는 동일한 SPP (ZF-5) 모델을 이용하여 기능 계산의 실행 시간의 공정한 비교를 제공한다. RCNN의 구현은 CAFFE [35]에 구현 된 저자에 의해 발표 된 코드에서입니다. 우리는 또한 CAFFE 우리의 기능 계산을 구현한다. 표 10에서, 우리는 GPU (100)를 사용하여 임의의 VOC 화상의 평균 시간을 평가한다. 우리의 1 규모의 버전은 0.053s 이미지 당을 취하면서 R-CNN은 회선에 대한 이미지 당 14.37s 걸립니다. 그래서 우리는 R-CNN 270 × 빠릅니다. 우리의 5 규모의 버전은 그래서 R-CNN보다 49 × 빠른, 회선에 대한 이미지 당 0.293s합니다. 우리 컨벌루션 연산 기능은 FC 층 계산 시간이 상당 부분을 차지하도록 빠르다. 표 10은 4096-D FC7 기능을 계산하는 GPU 시간은 이미지 당 0.089s 있음을 보여줍니다. 모두 길쌈과 완벽하게 연결 기능을 고려할 때, 우리의 1 규모의 버전은 R-CNN보다 102 × 빠른 1.2 % 떨어진다; 우리의 5 규모의 버전은 38 × 빠릅니다과 유사한 결과가 있습니다.
We also compare the running time in Table 9, where R-CNN uses AlexNet [3] as in the original paper [7]. Our method is 24× to 64× faster. Note that the AlexNet [3] has the same number of filters as our ZF-5 on each conv layer. The AlexNet is faster because it uses splitting on some layers, which was designed for two GPUs in [3].	우리는 또한 R-CNN이 사용하는 표 9에서 실행 시간 AlexNet을 비교 [3] 원래 종이에서와 같이 [7]. 우리의 방법은 64 × 빠른 24 ×입니다. AlexNet [3] 우리 ZF- 5 각 전환 층에 같은 필터의 같은 번호를 가지고 있습니다. AlexNet는 두 개의 GPU를 위해 설계되었습니다 일부 층에 분할을 사용하기 때문에 빠릅니다 [3].
We further achieve an efficient full system with the help of the recent window proposal method [25]. The Selective Search (SS) proposal [20] takes about 1-2 seconds per image on a CPU. The method of EdgeBoxes [25] only takes ∼ 0.2s. Note that it is sufficient to use a fast proposal method during testing only. Using the same model trained as above (using SS), we test proposals generated by EdgeBoxes only. The mAP is 52.8 without bounding box regression. This is reasonable considering that EdgeBoxes are not used for training. Then we use both SS and EdgeBox as proposals in the training stage, and adopt only EdgeBoxes in the testing stage. The mAP is 56.3 without bounding box regression, which is better than 55.2 (Table 10) due to additional training samples. In this case, the overall testing time is ∼0.5s per image including all steps (proposal and recognition). This makes our method practical for real-world applications.	우리는 더 최근의 창 제안 방법 [25]의 도움으로 효율적인 전체 시스템을 얻을 수 있습니다. 선택적 검색 (SS)의 제안 [20] CPU에 이미지 당 약 1 ~ 2 초가 소요됩니다. EdgeBoxes의 방법은 [25]은 0.2 ~합니다. 단지 테스트하는 동안 빠른 제안 방법을 사용하기에 충분합니다. (SS를 사용) 상기와 같이 훈련 같은 모델을 사용하여, 우리는 EdgeBoxes에 의해 생성 된 제안을 테스트합니다. 지도는 상자의 회귀를 경계없이 52.8이다. 이 EdgeBoxes 훈련에 사용하지 않는 것을 고려하면 합리적이다. 그 다음 우리는 교육 단계에서의 제안으로 SS와 edgeBOX를 모두 사용하고, 테스트 단계 만 EdgeBoxes을 채택한다. 지도로 인해 추가 교육 샘플을 55.2 (표 10)보다 더 박스 회귀를 경계없이 56.3이다. 이 경우, 전체 테스트 시간은 모든 단계 (제안 및 인식)를 포함한 이미지 당 ~0.5s이다. 이는 실제 애플리케이션을위한 우리의 방법은 실제합니다.
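The complexity ratios and speedups quoted in this subsection can be checked with a few lines of arithmetic. The sketch below plugs in the stated values (n = 2000 windows of 227×227, aspect ratio r = 4/3, the five scales from the detection algorithm, and the Table 10 timings). It assumes that the 0.089s fc time applies equally to R-CNN when computing the combined speedup; the function names are illustrative.

```python
def rcnn_conv_cost(n_windows=2000, side=227):
    """Pixels pushed through the conv layers by R-CNN: O(n * 227^2)."""
    return n_windows * side ** 2


def sppnet_conv_cost(scales, aspect_ratio=4 / 3):
    """Pixels pushed through the conv layers by SPP-net: O(r * s^2), summed over scales."""
    return sum(aspect_ratio * s ** 2 for s in scales)


one_scale = rcnn_conv_cost() / sppnet_conv_cost([688])
five_scale = rcnn_conv_cost() / sppnet_conv_cost([480, 576, 688, 864, 1200])
print(round(one_scale), round(five_scale))  # 163 24

# Measured times per image in seconds, as quoted from Table 10.
rcnn_conv, spp_conv_1, spp_conv_5, fc_time = 14.37, 0.053, 0.293, 0.089
print(round(rcnn_conv / spp_conv_1), round(rcnn_conv / spp_conv_5))  # 271 49
print(round((rcnn_conv + fc_time) / (spp_conv_1 + fc_time)),
      round((rcnn_conv + fc_time) / (spp_conv_5 + fc_time)))          # 102 38
```

The theoretical ratios (about 1/160 and 1/24) and the measured conv-only speedups (about 270× and 49×) are of the same order; once the fixed fc cost is included the speedups drop to 102× and 38×.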

Model Combination for Detection

ENG	KOR
Model combination is an important strategy for boosting CNN-based classification accuracy [3]. We propose a simple combination method for detection.	모델 조합은 CNN 기반의 분류 정확도 [3] 강화를위한 중요한 전략이다. 우리는 검출을위한 단순한 조합 방법을 제안한다.
We pre-train another network in ImageNet, using the same structure but different random initializations. Then we repeat the above detection algorithm. Table 12 (SPP-net (2)) shows the results of this network. Its mAP is comparable with the first network (59.1% vs. 59.2%), and outperforms the first network in 11 categories.	우리는 같은 구조지만 다른 임의의 초기화를 사용하여, ImageNet에서 다른 네트워크를-훈련 사전. 그 다음 우리는 위의 검출 알고리즘을 반복합니다. 표 12 (SPP-NET (2))이 네트워크의 결과를 나타낸다. 그지도는 제 네트워크 (59.2 % 대 59.1 %)와 비교하고, 11 종류의 제 네트워크를 능가.
Given the two models, we first use either model to score all candidate windows on the test image. Then we perform non-maximum suppression on the union of the two sets of candidate windows (with their scores). A more confident window given by one method can suppress those less confident given by the other method. After combination, the mAP is boosted to 60.9% (Table 12). In 17 out of all 20 categories the combination performs better than either individual model. This indicates that the two models are complementary.	두 모델을 감안할 때, 우리는 먼저 테스트 이미지에 모든 후보 창을 점수로 어느 모델을 사용합니다. 그 다음 우리는 (자신의 점수) 후보 윈도우의 두 세트의 결합에 비 최대 억제를 수행합니다. 하나의 방법에 의해 주어진 더 자신감 창은 다른 방법에 의해 주어진 그 이하 확신을 억제 할 수있다. 조합 한 후,지도 60.9 % (표 12)에 밀어됩니다. (17) (20) 모든 중 카테고리의 조합은 하나의 개별 모델보다 더 나은 수행합니다. 이 두 모델은 보완을 나타냅니다.
We further find that the complementarity is mainly because of the convolutional layers. We have tried to combine two randomly initialized fine-tuned results of the same convolutional model, and found no gain.	우리는 더 보완이 주로 인해 길쌈 층는 사실을 알게 될 것입니다. 우리는 같은 컨볼 루션 모델이 무작위로 초기화 미세 조정 결과를 결합하려고 더 이득을 발견했다.
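The combination step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`iou_one_to_many`, `nms`, `combine_models`) are hypothetical, boxes are assumed to be `[x1, y1, x2, y2]` arrays for a single category, and the IoU threshold of 0.3 is an assumption, since this section does not state one.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(-scores)  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        # Drop every remaining window that overlaps the kept one too much.
        order = rest[overlaps <= iou_threshold]
    return keep

def combine_models(boxes_a, scores_a, boxes_b, scores_b, iou_threshold=0.3):
    """Run NMS on the union of the scored windows from two models."""
    boxes = np.vstack([boxes_a, boxes_b])
    scores = np.concatenate([scores_a, scores_b])
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]
```

Because the two score sets are pooled before suppression, a confident window from one model can suppress a less confident overlapping window from the other, which is the behavior the text describes.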

ILSVRC 2014 Detection

ENG	KOR
The ILSVRC 2014 detection [26] task involves 200 categories. There are ∼450k/20k/40k images in the training/validation/testing sets. We focus on the task of the provided-data-only track (the 1000-category CLS training data may not be used).	ILSVRC 2014 검출 [26] 작업은 200 범주를 포함한다. 교육 / 검증 / 테스트 세트에서 ~450k / 20K / 40K 이미지가 있습니다. 우리는 (1000 카테고리 CLS 훈련 데이터를 사용할 수 없습니다) 제공 데이터 전용 트랙의 작업에 초점을 맞 춥니 다.
There are three major differences between the detection (DET) and classification (CLS) training datasets, which greatly impact the pre-training quality. First, the DET training data is merely 1/3 of the CLS training data. This seems to be a fundamental challenge of the provided-data-only DET task. Second, the category number of DET is 1/5 of CLS. To overcome this problem, we harness the provided subcategory labels for pre-training. There are 499 non-overlapping subcategories in total (i.e., the leaf nodes in the provided category hierarchy). So we pre-train a 499-category network on the DET training set. Third, the distributions of object scales are different between the DET and CLS training sets. The dominant object scale in CLS is about 0.8 of the image length, but in DET it is about 0.5. To address the scale difference, we resize each training image to min(w, h) = 400 (instead of 256), and randomly crop 224×224 views for training. A crop is only used when it overlaps with a ground truth object by at least 50%.	검출 (DET) 및 분류 (CLS) 훈련 데이터 세트 크게 영향을 사전에 교육의 질 사이의 세 가지 주요 차이점이 있습니다. 먼저, DET 훈련 데이터는 단지 1/3 CLS 훈련 데이터이다. 이렇게 제공된 데이터 전용 DET 작업의 근본적인 문제가 될 것 같다. 둘째, DET의 카테고리 번호 CLS의 1/5이다. 이러한 문제를 극복하기 위해 사전 훈련 제공 하위 라벨 2를 활용. 완전히 499 겹치지 않는 하위 범주 (제공 범주 계층 구조 즉, 잎 노드)가 있습니다. 그래서 우리는 DET 훈련 세트 499- 카테고리 네트워크를 사전 훈련. 셋째, 객체 규모의 분포는 DET / CLS 훈련 세트 사이 다릅니다. CLS의 지배적 인 객체 규모는 약 0.8 이미지 길이이지만, DET 약 0.5입니다. 규모의 차이를 해결하기 위해, 우리는 각 교육 이미지 (시간, w) = 400 (대신 256)를 min으로, 무작위로 훈련을위한 224 × 224보기자를 크기를 조정합니다. 적어도 50 % 지표 사실 오브젝트와 겹치는 경우 작물에만 사용된다.
We verify the effect of pre-training on Pascal VOC 2007. For a CLS-pre-training baseline, we consider the pool5 features (mAP 43.0% in Table 9). When this is replaced with a 200-category network pre-trained on DET, the mAP drops significantly to 32.7%. A 499-category pre-trained network improves the result to 35.9%. Interestingly, even though the amount of training data does not increase, training a network with more categories boosts the feature quality. Finally, training with min(w, h) = 400 instead of 256 further improves the mAP to 37.8%. Even so, there is still a considerable gap to the CLS-pre-training result. This indicates the importance of big data to deep learning.	우리는 우리가 pool5 기능 (지도 표 9의 43.0 %)을 고려, CLS-사전 훈련 기준은 파스칼 VOC 2007 년에 사전 교육의 효과를 확인합니다. 200 카테고리 네트워크 DET에 사전 훈련으로 대체,지도 크게 32.7 %로 떨어진다. 499 카테고리 사전 훈련 네트워크는 35.9 %의 결과를 향상시킨다. 흥미롭게도, 훈련 데이터의 양이 증가하지 않는 경우에도, 이상의 카테고리들의 네트워크를 훈련하는 기능 품질 향상. 마지막으로, (시간, w) 분 = 대신 256의 400 훈련은 더욱 37.8 %로 맵을 향상시킨다. 그럼에도 불구하고, 우리는 CLS-사전 교육 결과에 상당한 차이가 여전히 존재 것을 알 수있다. 이 깊은 학습 빅 데이터의 중요성을 나타냅니다.
For ILSVRC 2014, we train a 499-category Overfeat-7 SPP-net. The remaining steps are similar to the VOC 2007 case. Following [7], we use the validation set to generate the positive/negative samples, with windows proposed by the selective search fast mode. The training set only contributes positive samples using the ground truth windows. We fine-tune the fc layers and then train the SVMs using the samples in both validation and training sets. The bounding box regression is trained on the validation set.	ILSVRC 2014 년, 우리는 499 카테고리를 Overfeat- 7 SPP-그물을 훈련. 나머지 단계는 VOC 2007의 경우와 유사하다. 다음 [7], 우리는 선택적 검색 고속 모드에 의해 제안 된 창문 양 / 음 샘플을 생성하도록 설정 유효성 검사를 사용합니다. 만 설정 한 훈련은 지상 진실 창을 사용하여 긍정적 인 샘플을 기여한다. 우리는 미세 조정 후 FC 레이어와 모두 검증 및 훈련 세트 샘플을 사용하여 SVM을 훈련. 경계 상자의 회귀는 검증 세트에 대한 교육을한다.
Our single model leads to 31.84% mAP in the ILSVRC 2014 testing set [26]. We combine six similar models using the strategy introduced in this paper. The mAP is 35.11% in the testing set [26]. This result ranks #2 in the provided-data-only track of ILSVRC 2014 (Table 13) [26]. The winning result is 37.21% from NUS, which uses contextual information.	우리의 하나의 모델은 ILSVRC 2014 테스트 세트 [26]에서 31.84 %의지도로 연결됩니다. 우리는이 논문에서 소개 된 전략을 사용 여섯 유사한 모델을 결합한다. 지도는 테스트 세트 [26]에서 35.11 %이다. 이 결과는 ILSVRC 2014 (표 13) [26]의 제공 데이터 전용 트랙에서 2 위를 기록하고 있습니다. 수상 결과는 상황에 맞는 정보를 사용 NUS에서 37.21 %이다.
Our system still shows great advantages on speed for this dataset. It takes our single model 0.6 seconds (0.5 for conv, 0.1 for fc, excluding proposals) per testing image on a GPU extracting convolutional features from all 5 scales. Using the same model, it takes 32 seconds per image in the way of RCNN. For the 40k testing images, our method requires 8 GPU·hours to compute convolutional features, while RCNN would require 15 GPU·days.	우리의 시스템은 여전히이 데이터 집합에 대한 속도에 큰 장점을 보여줍니다. 그것은 모든 5 스케일에서 길쌈 특징을 추출하는 GPU에 테스트 이미지 당 우리의 단일 모델 (제안서 제외 전환 0.5, FC 0.1) 0.6 초 정도 걸립니다. 동일한 모델을 사용하여, 그것은 RCNN의 방법으로 이미지 당 32 초 정도 걸립니다. RCNN가 · 일 15 GPU를 필요로하는 동안 40K 테스트 이미지의 경우, 우리의 방법은, 길쌈 기능을 계산하기 위해 8 GPU · 시간을 필요로한다.
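The DET pre-training crop rule described earlier in this section (resize to min(w, h) = 400, then keep a random 224×224 crop only if it overlaps a ground truth object by at least 50%) can be sketched on box coordinates alone. This is an illustrative sketch under stated assumptions: the names `sample_crop` and `overlap_ratio` are hypothetical, the retry limit `max_tries` is invented for the example, and the overlap measure (intersection area divided by the smaller of the two areas) is an assumption, since the text does not define how overlap is measured.

```python
import numpy as np

def overlap_ratio(a, b):
    """Intersection area over the smaller box area (assumed overlap measure)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (iw * ih) / min(area_a, area_b)

def sample_crop(width, height, gt_boxes, rng, crop=224, short_side=400,
                min_overlap=0.5, max_tries=100):
    """Return a crop window (x1, y1, x2, y2) in the resized image, or None."""
    scale = short_side / min(width, height)  # makes min(w, h) == short_side
    new_w = int(round(width * scale))
    new_h = int(round(height * scale))
    scaled = [tuple(v * scale for v in box) for box in gt_boxes]
    for _ in range(max_tries):
        x0 = int(rng.integers(0, new_w - crop + 1))
        y0 = int(rng.integers(0, new_h - crop + 1))
        window = (x0, y0, x0 + crop, y0 + crop)
        # Keep the crop only if it sufficiently overlaps some ground truth box.
        if any(overlap_ratio(window, box) >= min_overlap for box in scaled):
            return window
    return None
```

For example, `sample_crop(800, 600, [(0, 0, 800, 600)], np.random.default_rng(0))` returns a 224×224 window inside the 533×400 resized image, while an image with no ground truth boxes yields `None`.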

CONCLUSION

ENG

KOR

SPP is a flexible solution for handling different scales, sizes, and aspect ratios. These issues are important in visual recognition, but have received little consideration in the context of deep networks. We have suggested a solution to train a deep network with a spatial pyramid pooling layer. The resulting SPP-net shows outstanding accuracy in classification/detection tasks and greatly accelerates DNN-based detection. Our studies also show that many time-proven techniques/insights in computer vision can still play important roles in deep-networks-based recognition.

SPP 다른 비늘, 크기 및 종횡비를 처리하기위한 융통성있는 해결책이다. 이러한 문제는 시인성이 중요하지만, 깊은 네트워크 환경에서 작은 대가를 받았다. 우리는 공간 피라미드 풀링 층으로 깊은 네트워크를 훈련 할 수있는 솔루션을 제안했다. 그 결과 SPP-그물 분류 / 탐지 작업에서 뛰어난 정확성을 보여주고 크게는 DNN 기반의 검색을 가속화합니다. 우리의 연구는 컴퓨터 비전에 많은 시간 입증 된 기술 / 통찰력은 여전히 깊은 네트워크 기반의 인식에 중요한 역할을 할 수 있음을 보여준다.

APPENDIX A

ENG	KOR
In the appendix, we describe some implementation details:	부록에서 우리는 몇 가지 구현 세부 사항을 설명합니다

Mean Subtraction.

ENG

KOR

The 224×224 cropped training/testing images are often pre-processed by subtracting the per-pixel mean [3]. When input images are of arbitrary size, the fixed-size mean image is not directly applicable. In the ImageNet dataset, we warp the 224×224 mean image to the desired size and then subtract it. In Pascal VOC 2007 and Caltech101, we use the constant mean (128) in all the experiments.

224 × 224 이미지가 종종 픽셀 별 의미 차감하여 사전 처리 테스트 / 교육 자른 [3]. 입력 이미지가 어떤 크기에있을 때, FixedSize (크기 고정) 이미지 직접 적용 할 수 없습니다 의미한다. ImageNet 데이터 세트에서, 우리는 원하는 크기로 224 × 224 평균 이미지를 휘게 한 다음 뺍니다. 파스칼 VOC 2007 Caltech101에서, 우리는 모든 실험에서 일정한 평균 (128)를 사용합니다.
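The two mean-subtraction variants can be sketched as below. This is a minimal illustration with hypothetical function names; nearest-neighbor warping is an assumption made to keep the example dependency-free, since the text does not say which interpolation is used to warp the mean image.

```python
import numpy as np

def warp_nearest(image, out_h, out_w):
    """Resize an (H, W, C) array to (out_h, out_w, C) by nearest-neighbor lookup."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source column for each output column
    return image[rows[:, None], cols[None, :]]

def subtract_warped_mean(image, mean_image):
    """Warp the fixed-size mean image to the input size, then subtract it."""
    h, w = image.shape[:2]
    return image.astype(np.float32) - warp_nearest(mean_image, h, w)

def subtract_constant_mean(image, mean_value=128.0):
    """Subtract a single constant from every pixel."""
    return image.astype(np.float32) - mean_value
```

The first function corresponds to the ImageNet setting described above, and the second to the constant mean of 128 used for Pascal VOC 2007 and Caltech101.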

Implementation of Pooling Bins.

ENG

KOR

We use the following implementation to handle all bins when applying the network. Denote the width and height of the conv5 feature maps (which can be the full image or a window) as w and h. For a pyramid level with n×n bins, the (i, j)-th bin is in the range of $[\lfloor \frac{i-1}{n} w \rfloor, \lceil \frac{i}{n} w \rceil] \times [\lfloor \frac{j-1}{n} h \rfloor, \lceil \frac{j}{n} h \rceil]$. Intuitively, if rounding is needed, we take the floor operation on the left/top boundary and the ceiling on the right/bottom boundary.

우리는 네트워크를 적용하는 경우 모든 빈들을 처리하기 위해 다음의 구현을 사용한다. conv5 기능 맵의 폭 및 높이를 나타내는 w 및 h와 같은 (전체 이미지 나 창이 될 수있다). N 개의 빈들 (I, j) 번째의 빈 × N을 가진 피라미드 레벨 반올림이 필요한 경우 직관적으로, 우리는 오른쪽 / 아래쪽 경계의 왼쪽 / 상단 경계 및 천장에 바닥 작업을.
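The bin rule (floor on the left/top boundary, ceiling on the right/bottom boundary) can be sketched as follows. This is an illustrative sketch, not the released implementation: `bin_range` and `spp_pool` are hypothetical names, bins are treated as half-open index ranges, max pooling is assumed within each bin, and the pyramid levels `(1, 2, 3, 6)` are an assumed configuration.

```python
import numpy as np

def bin_range(index, n, size):
    """Half-open [start, end) range of the 1-based bin `index` along one axis."""
    start = ((index - 1) * size) // n  # floor((index - 1) / n * size)
    end = -((-index * size) // n)      # ceil(index / n * size)
    return start, end

def spp_pool(feature_map, levels=(1, 2, 3, 6)):
    """Pool a (channels, h, w) map into a fixed-length vector.

    The output length is channels * sum(n * n for n in levels),
    regardless of h and w.
    """
    _, h, w = feature_map.shape
    pooled = []
    for n in levels:
        for j in range(1, n + 1):          # bins along the height
            y0, y1 = bin_range(j, n, h)
            for i in range(1, n + 1):      # bins along the width
                x0, x1 = bin_range(i, n, w)
                pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

Since the ceiling of a bin's upper bound always exceeds the floor of its lower bound, no bin is empty even when the feature map is smaller than n×n; neighboring bins simply overlap in that case.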

Mapping a Window to Feature Maps.

ENG	KOR
In the detection algorithm (and multi-view testing on feature maps), a window is given in the image domain, and we use it to crop the convolutional feature maps (e.g., conv5) which have been sub-sampled several times. So we need to align the window on the feature maps.	검출 알고리즘 (및 기능지도 멀티 뷰 검사)에서, 윈도우는 화상 영역에 주어진, 우리는 길쌈 기능 맵 서브 - 샘플링 수회왔다 (예를 들어, conv5)을 잘라내 사용된다. 그래서 우리는 기능지도 창을 정렬 할 필요가있다.
In our implementation, we project the corner point of a window onto a pixel in the feature maps, such that this corner point in the image domain is closest to the center of the receptive field of that feature map pixel. The mapping is complicated by the padding of all convolutional and pooling layers. To simplify the implementation, during deployment we pad $\lfloor p/2 \rfloor$ pixels for a layer with a filter size of p. As such, for a response centered at $(x', y')$, its effective receptive field in the image domain is centered at $(x, y) = (Sx', Sy')$, where S is the product of all previous strides. In our models, S = 16 for ZF-5 on conv5, and S = 12 for Overfeat-5/7 on conv5/7. Given a window in the image domain, we project the left (top) boundary by $x' = \lfloor x/S \rfloor + 1$ and the right (bottom) boundary by $x' = \lceil x/S \rceil - 1$. If the padding is not $\lfloor p/2 \rfloor$, we need to add a proper offset to x.	본 구현에서는, 화상 영역이 코너 지점이 기능 맵 픽셀 수용 필드의 중심에 가장 가까운하도록 기능 맵의 픽셀에 창 구석 점을 투영. 매핑은 모든 길쌈 및 풀링 층의 패딩 복잡하다. (P)의 필터 크기를 갖는 층 배포 우리 패드 BP / 2C 화소 중, 구현을 단순화한다. 이와 같이, 중심으로 응답 (X ^ 0, Y0)의 경우, 화상 영역에서의 효과적인 수용 필드가 (X, Y) = 중심된다 (내지 Sx ^ 0 ^ 0 싸이) S는 모든 이전의 생성물이고 진보. 우리의 모델에서, S는 Overfeat-5 conv5 / 7 / 7 ZF-5 conv5에, 및 S = 12 16 =. 이미지 영역에서 창을 감안할 때, 우리는에 의해 왼쪽 (위) 경계를 프로젝트 : X ^ 0 = BX / SC + 1과 오른쪽 (아래) 경계 X ^ 0 = DX / 괜찮다 - 1. 패딩이없는 경우 [ P / 2], 우리는 X에 적절한 오프셋 (offset)를 추가해야합니다.
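The boundary projection above reduces to a few integer operations. The sketch below is illustrative only: `project_window` is a hypothetical name, image coordinates are assumed to be non-negative integers, and the formulas are applied exactly as stated (floor plus one for left/top, ceiling minus one for right/bottom) without the extra offset needed when the padding differs from ⌊p/2⌋.

```python
def project_window(x1, y1, x2, y2, stride):
    """Map an image-domain window to feature-map coordinates.

    `stride` is S, the product of all strides before the feature map
    (e.g., 16 for ZF-5 on conv5, 12 for Overfeat-5/7 on conv5/7).
    """
    left = x1 // stride + 1          # floor(x / S) + 1
    top = y1 // stride + 1
    right = -(-x2 // stride) - 1     # ceil(x / S) - 1
    bottom = -(-y2 // stride) - 1
    return left, top, right, bottom
```

For instance, with S = 16 the window (32, 48, 320, 240) maps to (3, 4, 19, 14). Very small windows can produce `right < left`; a caller would need to handle that case, which the text does not discuss.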

Documentation

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition (v4): https://github.com/ShaoqingRen/SPP_net; 1406.4729v4.pdf

References

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, 1989. ↩
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009. ↩
A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012. ↩
M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” arXiv:1311.2901, 2013. ↩
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv:1312.6229, 2013. ↩
K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv:1405.3531, 2014. ↩
R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014. ↩
W. Y. Zou, X. Wang, M. Sun, and Y. Lin, “Generic object detection with dense neural patterns and regionlets,” in ArXiv:1404.4316, 2014. ↩
A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: An astounding baseline for recognition,” in CVPR 2014, DeepVision Workshop, 2014. ↩
Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in CVPR, 2014. ↩
N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “Panda: Pose aligned networks for deep attribute modeling,” in CVPR, 2014. ↩
Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” in ArXiv:1403.1840, 2014. ↩
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” arXiv:1310.1531, 2013. ↩
K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in ICCV, 2005. ↩
S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006. ↩
J. Sivic and A. Zisserman, “Video google: a text retrieval approach to object matching in videos,” in ICCV, 2003. ↩
J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, 2009. ↩
J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in CVPR, 2010. ↩
F. Perronnin, J. Sánchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in ECCV, 2010. ↩
K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders, “Segmentation as selective search for object recognition,” in ICCV, 2011. ↩
L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” CVIU, 2007. ↩
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” 2007. ↩
P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” PAMI, 2010. ↩
N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005. ↩
C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014. ↩
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” arXiv:1409.0575, 2014. ↩
K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: an evaluation of recent feature encoding methods,” in BMVC, 2011. ↩
A. Coates and A. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in ICML, 2011. ↩
D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004. ↩
J. C. van Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders, “Kernel codebooks for scene categorization,” in ECCV, 2008. ↩
M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv:1312.4400, 2013. ↩
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv:1409.4842, 2014. ↩
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014. ↩
M. Oquab, L. Bottou, I. Laptev, J. Sivic et al., “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014. ↩
Y. Jia, “Caffe: An open source convolutional architecture for fast feature embedding,” http://caffe.berkeleyvision.org/, 2013. ↩
A. G. Howard, “Some improvements on deep convolutional neural network based image classification,” ArXiv:1312.5402, 2013. ↩
H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, “Aggregating local image descriptors into compact codes,” TPAMI, vol. 34, no. 9, pp. 1704–1716, 2012. ↩
C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), 2011. ↩
X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in ICCV, 2013. ↩
C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in NIPS, 2013. ↩