SSD:Paper

SSD: Single Shot MultiBox Detector (v5) Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

Abstract

ENG

KOR

We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on an Nvidia Titan X, and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at: https://github.com/weiliu89/caffe/tree/ssd .

우리는 하나의 심층 신경망 (deep neural network)을 사용하여 이미지의 객체를 검출하는 방법을 제시합니다. SSD라는 이름의 이 접근 방식은 경계 상자의 출력 공간을 특징 맵 위치별로 서로 다른 종횡비와 스케일을 갖는 기본 상자 집합으로 이산화합니다. 예측 시 네트워크는 각 기본 상자에서 각 객체 범주의 존재에 대한 점수를 생성하고 객체 모양에 더 잘 맞도록 상자를 조정합니다. 또한 네트워크는 해상도가 서로 다른 여러 특징 맵의 예측을 결합하여 다양한 크기의 객체를 자연스럽게 처리합니다. SSD는 제안 생성과 후속 픽셀 또는 특징 재샘플링 단계를 완전히 제거하고 모든 계산을 단일 네트워크에 캡슐화하기 때문에 객체 제안을 필요로 하는 방법에 비해 간단합니다. 따라서 SSD는 학습하기 쉽고 탐지 구성 요소가 필요한 시스템에 통합하기도 쉽습니다. PASCAL VOC, COCO 및 ILSVRC 데이터 세트에 대한 실험 결과에 따르면 SSD는 추가 객체 제안 단계를 사용하는 방법에 필적하는 정확도를 가지면서 훨씬 빠르며, 학습과 추론 모두에 통일된 프레임워크를 제공합니다. 300 × 300 입력의 경우 SSD는 Nvidia Titan X에서 59 FPS로 VOC2007 테스트에서 74.3 % mAP를 달성하고, 512 × 512 입력의 경우 76.9 % mAP를 달성하여 이에 필적하는 최신 Faster R-CNN 모델을 능가합니다. 다른 싱글 스테이지 방식에 비해 SSD는 입력 이미지 크기가 더 작아도 훨씬 더 정확합니다. 코드는 https://github.com/weiliu89/caffe/tree/ssd 에서 사용할 수 있습니다.

Keywords

Real-time Object Detection; Convolutional Neural Network.

Introduction

ENG

KOR

Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection, all based on Faster R-CNN [2] albeit with deeper features such as [3]. While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications. Often detection speed for these approaches is measured in seconds per frame (SPF), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline (see related work in Sec. 4), but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy.

현재의 최첨단 객체 검출 시스템은 다음과 같은 접근 방식의 변형입니다: 경계 상자를 가정하고, 각 상자의 픽셀 또는 특징을 재샘플링하고, 고품질 분류기를 적용합니다. 이 파이프라인은 Selective Search 연구 [1] 이후 탐지 벤치마크에서 우세했으며, 현재 PASCAL VOC, COCO 및 ILSVRC 탐지의 최고 결과도 [3]과 같은 더 깊은 특징을 사용하기는 하지만 모두 Faster R-CNN [2]을 기반으로 합니다. 이러한 접근 방식은 정확하지만 임베디드 시스템에는 연산량이 너무 많고 고급 하드웨어에서도 실시간 응용 프로그램에는 너무 느립니다. 이러한 접근 방식의 검출 속도는 프레임당 초 (SPF) 단위로 측정되는 경우가 많으며, 가장 빠른 고정밀 검출기인 Faster R-CNN조차도 초당 7 프레임 (FPS)으로만 작동합니다. 탐지 파이프라인의 각 단계를 개선하여 더 빠른 검출기를 구축하려는 시도가 많이 있었지만 (4 절의 관련 연구 참조), 지금까지는 탐지 정확도를 크게 희생해야만 속도를 크게 높일 수 있었습니다.

ENG

KOR

This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP 74.3% on VOC2007 test, vs. Faster R-CNN 7 FPS with mAP 73.2% or YOLO 45 FPS with mAP 63.4%). The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. We are not the first to do this (cf [4,5]), but by adding a series of improvements, we manage to increase the accuracy significantly over previous attempts. Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. With these modifications (especially using multiple layers for prediction at different scales) we can achieve high accuracy using relatively low resolution input, further increasing detection speed. While these contributions may seem small independently, we note that the resulting system improves accuracy on real-time detection for PASCAL VOC from 63.4% mAP for YOLO to 74.3% mAP for our SSD. This is a larger relative improvement in detection accuracy than that from the recent, very high-profile work on residual networks [3]. Furthermore, significantly improving the speed of high-quality detection can broaden the range of settings where computer vision is useful.

이 논문은 경계 상자 가설에 대해 픽셀이나 특징을 재샘플링하지 않으면서도 재샘플링을 수행하는 접근법만큼 정확한 최초의 딥 네트워크 기반 객체 검출기를 제시합니다. 그 결과 고정밀 검출의 속도가 크게 향상되었습니다 (VOC2007 테스트에서 SSD는 mAP 74.3 %로 59 FPS인 반면, Faster R-CNN은 mAP 73.2 %로 7 FPS, YOLO는 mAP 63.4 %로 45 FPS). 속도의 근본적인 향상은 경계 상자 제안과 후속 픽셀 또는 특징 재샘플링 단계를 없앤 데서 옵니다. 이를 시도한 것은 우리가 처음이 아니지만 (cf [4,5]), 일련의 개선 사항을 추가하여 이전 시도에 비해 정확도를 크게 높였습니다. 우리의 개선점은 작은 컨볼루션 필터를 사용하여 객체 범주와 경계 상자 위치의 오프셋을 예측하고, 서로 다른 종횡비의 검출에 별도의 예측기 (필터)를 사용하며, 여러 스케일에서 검출을 수행하기 위해 네트워크 후반 단계의 여러 특징 맵에 이러한 필터를 적용하는 것입니다. 이러한 수정을 통해 (특히 서로 다른 스케일의 예측에 여러 레이어를 사용함으로써) 상대적으로 낮은 해상도의 입력으로도 높은 정확도를 달성할 수 있으며, 이는 검출 속도를 더욱 높여 줍니다. 이러한 기여는 개별적으로는 작아 보일 수 있지만, 결과 시스템은 PASCAL VOC의 실시간 검출 정확도를 YOLO의 63.4 % mAP에서 SSD의 74.3 % mAP로 향상시킵니다. 이는 최근 크게 주목받은 잔차 네트워크 (residual networks) 연구 [3]에서 얻은 것보다 더 큰 상대적 검출 정확도 향상입니다. 또한 고품질 검출의 속도를 크게 높이면 컴퓨터 비전이 유용하게 쓰이는 환경의 범위를 넓힐 수 있습니다.

ENG

KOR

We summarize our contributions as follows:

We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state-of-the-art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio.
These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs accuracy trade-off.
Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches.

우리는 우리의 공헌을 다음과 같이 요약합니다 :

우리는 여러 범주를 위한 싱글 샷 검출기인 SSD를 소개합니다. SSD는 기존 최고 수준의 싱글 샷 검출기 (YOLO)보다 빠르고 훨씬 더 정확하며, 실제로 명시적인 영역 제안과 풀링을 수행하는 더 느린 기법 (Faster R-CNN 포함)만큼 정확합니다.
SSD의 핵심은 기능 맵에 적용된 작은 컨볼 루션 필터를 사용하여 고정 된 기본 경계 상자 집합에 대한 범주 점수 및 상자 오프셋을 예측하는 것입니다.
높은 탐지 정확도를 달성하기 위해 서로 다른 스케일의 기능 맵과 다른 스케일의 예측을 생성하고 명시 적으로 종횡비로 예측을 분리합니다.
이러한 디자인 기능은 저해상도 입력 이미지에서도 단순한 엔드 투 엔드 교육과 높은 정확도를 제공하여 속도와 정확성의 균형을 더욱 향상시킵니다.
실험에는 PASCAL VOC, COCO 및 ILSVRC에서 평가되는 다양한 입력 크기를 가진 모델에 대한 타이밍 및 정확도 분석이 포함되어 있으며 최신 최첨단 접근 방법의 범위와 비교됩니다.

The Single Shot Detector (SSD)

This section describes our proposed SSD framework for detection (Sec. 2.1) and the associated training methodology (Sec. 2.2). Afterwards, Sec. 3 presents dataset-specific model details and experimental results.

Ssd-v5-fig1.jpg
Fig. 1: SSD framework. (a) SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. 8 × 8 and 4 × 4 in (b) and (c)). For each default box, we predict both the shape offsets and the confidences for all object categories ((c1, c2, ... , cp)). At training time, we first match these default boxes to the ground truth boxes. For example, we have matched two default boxes with the cat and one with the dog, which are treated as positives and the rest as negatives. The model loss is a weighted sum between localization loss (e.g. Smooth L1 [6]) and confidence loss (e.g. Softmax).
Fig. 1 : SSD 프레임 워크. (a) SSD는 훈련 중 각 물체에 대한 입력 이미지 및 접지 진실 상자 만 필요합니다. 컨벌루션 방식으로, 상이한 스케일 (예를 들어, (b) 및 (c)의 8x8 및 4x4)을 갖는 몇몇 특징 맵에서 각각의 위치에서 상이한 종횡비의 작은 세트 (예를 들어 4)를 평가한다. 각 기본 상자에 대해 모든 객체 범주 ((c1, c2, ..., cp))에 대한 모양 오프셋과 신뢰도를 예측합니다. 교육 시간에, 우리는 먼저이 기본 상자를 지상 진실 상자와 일치시킵니다. 예를 들어, 우리는 고양이와 두 개의 기본 상자를 매치 시켰고, 한 개는 양성으로 취급하고 나머지는 네거티브로 처리했습니다. 모델 손실은 위치 파악 손실 (예 : 부드러운 L1 [6])과 신뢰 손실 (예 : Softmax) 사이의 가중 합이다.

Model

ENG	KOR
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network. We then add auxiliary structure to the network to produce detections with the following key features:	SSD 접근 방식은 고정 크기의 경계 상자 모음과 해당 상자에 있는 객체 클래스 인스턴스의 존재 여부에 대한 점수를 생성하는 피드 포워드 컨볼루션 네트워크를 기반으로 하며, 그 뒤에 최종 검출을 생성하는 비최대 억제 단계가 이어집니다. 초기 네트워크 계층은 고품질 이미지 분류에 사용되는 표준 아키텍처 (분류 계층 앞에서 잘라 낸 것)를 기반으로 하며, 이를 기본 네트워크라고 부르겠습니다. 그런 다음 네트워크에 보조 구조를 추가하여 다음과 같은 주요 특징을 갖는 검출을 생성합니다.
Multi-scale feature maps for detection We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer (cf Overfeat[4] and YOLO[5] that operate on a single scale feature map).	탐지를위한 다중 스케일 피쳐 맵 절단 된 기본 네트워크의 끝에 컨볼 루션 피쳐 레이어를 추가합니다. 이러한 레이어는 점진적으로 크기가 감소하고 여러 배율에서 감지를 예측할 수 있습니다. 탐지를 예측하기위한 컨볼 루션 모델은 각 기능 레이어마다 다릅니다 (단일 배율 피쳐 맵에서 작동하는 Overfeat [4] 및 YOLO [5] 참조).
Convolutional predictors for detection Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture in Fig. 2. For a feature layer of size m × n with p channels, the basic element for predicting parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m × n locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf the architecture of YOLO[5] that uses an intermediate fully connected layer instead of a convolutional filter for this step).	탐지를 위한 컨볼루션 예측기 추가된 각 특징 레이어 (또는 선택적으로 기본 네트워크의 기존 특징 레이어)는 컨볼루션 필터 세트를 사용하여 고정된 검출 예측 세트를 생성할 수 있습니다. 이는 Fig. 2의 SSD 네트워크 아키텍처 상단에 표시되어 있습니다. p 채널을 갖는 크기 m × n의 특징 레이어의 경우, 잠재적인 검출의 매개 변수를 예측하기 위한 기본 요소는 범주에 대한 점수 또는 기본 상자 좌표에 대한 모양 오프셋을 산출하는 3 × 3 × p 크기의 작은 커널입니다. 커널이 적용되는 m × n 개의 위치 각각에서 출력 값이 생성됩니다. 경계 상자 오프셋 출력 값은 각 특징 맵 위치에 대한 기본 상자 위치를 기준으로 측정됩니다 (이 단계에서 컨볼루션 필터 대신 중간 완전 연결 레이어를 사용하는 YOLO [5] 아키텍처 참조).
Default boxes and aspect ratios We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c + 4)k filters that are applied around each location in the feature map, yielding (c + 4)kmn outputs for an m × n feature map. For an illustration of default boxes, please refer to Fig. 1. Our default boxes are similar to the anchor boxes used in Faster R-CNN [2]; however, we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.	기본 상자 및 종횡비 네트워크 상단의 여러 특징 맵에 대해, 각 특징 맵 셀에 기본 경계 상자 세트를 연결합니다. 기본 상자는 특징 맵을 컨볼루션 방식으로 타일링하므로 해당 셀에 대한 각 상자의 상대 위치는 고정됩니다. 각 특징 맵 셀에서 셀의 기본 상자 모양에 대한 오프셋과 각 상자에 클래스 인스턴스가 있음을 나타내는 클래스별 점수를 예측합니다. 구체적으로, 주어진 위치의 k 개 상자 각각에 대해 c 개의 클래스 점수와 원래 기본 상자 모양에 대한 4 개의 오프셋을 계산합니다. 그 결과 총 (c + 4)k 개의 필터가 특징 맵의 각 위치 주변에 적용되어 m × n 특징 맵에 대해 (c + 4)kmn 개의 출력을 산출합니다. 기본 상자에 대한 그림은 Fig. 1을 참조하십시오. 우리의 기본 상자는 Faster R-CNN [2]에서 사용된 앵커 상자와 비슷하지만, 해상도가 다른 여러 특징 맵에 적용합니다. 여러 특징 맵에서 서로 다른 기본 상자 모양을 허용하면 가능한 출력 상자 모양의 공간을 효율적으로 이산화할 수 있습니다.
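The filter and output counts in the paragraph above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code; the function name `detection_outputs` and the example values (an 8 × 8 map, 4 default boxes, 21 classes) are assumptions chosen only for illustration.

```python
def detection_outputs(m, n, k, c):
    """Return (filters per location, total outputs) for one m x n feature map.

    k is the number of default boxes per location; each box gets
    c class scores plus 4 box offsets, as described in the text.
    """
    filters_per_location = (c + 4) * k
    total_outputs = filters_per_location * m * n
    return filters_per_location, total_outputs


# Hypothetical example: an 8 x 8 feature map, 4 default boxes, 21 classes.
filters, outputs = detection_outputs(m=8, n=8, k=4, c=21)
print(filters, outputs)
```

With these illustrative values the map needs (21 + 4) × 4 filters and produces that many values at each of the 64 locations.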

Ssd-v5-fig2.jpg
Fig. 2: A comparison between two single shot detection models: SSD and YOLO [5]. Our SSD model adds several feature layers to the end of a base network, which predict the offsets to default boxes of different scales and aspect ratios and their associated confidences. SSD with a 300 × 300 input size significantly outperforms its 448 × 448 YOLO counterpart in accuracy on VOC2007 test while also improving the speed.
Fig. 2 : 두 개의 단일 샷 검출 모델 인 SSD와 YOLO [5]의 비교. 우리의 SSD 모델은 기본 네트워크의 끝 부분에 몇 가지 기능 레이어를 추가합니다.이 기능 레이어는 서로 다른 크기와 종횡비 및 관련 신뢰도의 기본 상자에 대한 오프셋을 예측합니다. 300 × 300 입력 크기의 SSD는 VOC2007 테스트의 정확도에서 448 × 448 YOLO 성능보다 월등히 뛰어남과 동시에 속도가 향상되었습니다.

Training

ENG	KOR
The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO[5] and for the region proposal stage of Faster R-CNN[2] and MultiBox[7]. Once this assignment is determined, the loss function and back propagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies.	SSD 학습과 영역 제안을 사용하는 일반적인 검출기 학습의 주요 차이점은 ground truth 정보를 고정된 검출기 출력 세트의 특정 출력에 할당해야 한다는 것입니다. 이와 유사한 할당은 YOLO [5]의 학습 및 Faster R-CNN [2]과 MultiBox [7]의 영역 제안 단계에도 필요합니다. 이 할당이 결정되면 손실 함수와 역전파가 엔드 투 엔드로 적용됩니다. 또한 학습에는 하드 네거티브 마이닝 및 데이터 증강 전략뿐 아니라 탐지용 기본 상자 및 스케일 세트의 선택이 포함됩니다.
Matching strategy During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. For each ground truth box we are selecting from default boxes that vary over location, aspect ratio, and scale. We begin by matching each ground truth box to the default box with the best jaccard overlap (as in MultiBox [7]). Unlike MultiBox, we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.	일치하는 전략 훈련 중 우리는 지상실 진위 감지에 해당하는 기본 상자를 판별하고 이에 따라 네트워크를 훈련해야합니다. 각 지상 진실 상자에 대해 우리는 위치, 종횡비 및 규모에 따라 다른 기본 상자에서 선택합니다. 우선 각 진실 상자를 가장 좋은 jaccard가 겹치는 기본 상자에 일치 시켜서 시작합니다 (MultiBox [7]). MultiBox와는 달리 임계 값 (0.5)보다 높은 jaccard 겹침을 사용하여 기본 상자를 접지 진리와 일치시킵니다. 이것은 학습 문제를 단순화하여, 네트워크가 최대 겹침이있는 것을 선택하도록 요구하지 않고, 다수의 겹쳐진 디폴트 박스에 대한 높은 점수를 예측할 수있게한다.
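The two-step matching rule described above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples, the names `jaccard` and `match_default_boxes` are hypothetical, and a default box that overlaps several ground truth boxes above the threshold is assigned to the one with the highest overlap (the text does not specify this tie-break).

```python
def jaccard(a, b):
    """Jaccard overlap (intersection over union) of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(defaults, truths, threshold=0.5):
    """Return {default_index: truth_index} for default boxes treated as positives."""
    if not defaults or not truths:
        return {}
    matches = {}
    # Threshold rule: match a default box to its best-overlapping ground truth
    # when that overlap is higher than the threshold.
    for i, d in enumerate(defaults):
        best_j = max(range(len(truths)), key=lambda j: jaccard(d, truths[j]))
        if jaccard(d, truths[best_j]) > threshold:
            matches[i] = best_j
    # Best-overlap rule: every ground truth box gets the default box with the
    # best overlap. Applied last so it overrides the threshold rule.
    for j, t in enumerate(truths):
        best_i = max(range(len(defaults)), key=lambda i: jaccard(defaults[i], t))
        matches[best_i] = j
    return matches


defaults = [(0, 0, 2, 2), (0, 0, 2, 1.5), (5, 5, 6, 6), (3, 3, 4, 4)]
truths = [(0, 0, 2, 2), (5, 5, 7, 7)]
print(match_default_boxes(defaults, truths))
```

In this toy example the third default box overlaps the second ground truth by only 0.25, yet it is still a positive because it is that ground truth's best match; the fourth default box stays unmatched and would be treated as a negative.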
Training objective The SSD training objective is derived from the MultiBox objective [7,8] but is extended to handle multiple object categories. Let \(x_{ij}^{p} \in \{1,0\}\) be an indicator for matching the \(i\)-th default box to the \(j\)-th ground truth box of category \(p\). In the matching strategy above, we can have \(\sum_{i}x_{ij}^{p} \geq 1\). The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):	학습 목표 SSD 학습 목표는 MultiBox 목표 [7,8]에서 파생되었지만 여러 객체 범주를 처리하도록 확장되었습니다. \(x_{ij}^{p} \in \{1,0\}\)를 \(i\) 번째 기본 상자를 범주 \(p\)의 \(j\) 번째 ground truth 상자와 매칭하는 지표라고 합시다. 위의 매칭 전략에서는 \(\sum_{i}x_{ij}^{p} \geq 1\)이 될 수 있습니다. 전체 목표 손실 함수는 위치 손실 (loc)과 신뢰 손실 (conf)의 가중 합입니다.
Ssd-v5-eq1.jpg
where N is the number of matched default boxes. If N = 0, we set the loss to 0. The localization loss is a Smooth L1 loss [6] between the predicted box (l) and the ground truth box (g) parameters. Similar to Faster R-CNN [2], we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h).	여기서 N은 매칭된 기본 상자의 수입니다. N = 0이면 손실을 0으로 설정합니다. 위치 손실은 예측 상자 (l)와 ground truth 상자 (g) 매개 변수 사이의 Smooth L1 손실 [6]입니다. Faster R-CNN [2]과 마찬가지로 기본 경계 상자 (d)의 중심 (cx, cy)과 폭 (w) 및 높이 (h)에 대한 오프셋으로 회귀합니다.
Ssd-v5-eq2.jpg
The confidence loss is the softmax loss over multiple class confidences (c).	신뢰 손실은 여러 클래스 신뢰도 (c)에 대한 softmax 손실입니다.
Ssd-v5-eq3.jpg
and the weight term α is set to 1 by cross validation.	교차 검증에 의해 가중 항 α가 1로 설정된다.
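The three equations of this subsection appear here only as images (Ssd-v5-eq1.jpg to Ssd-v5-eq3.jpg). For readers without access to them, the block below is a LaTeX sketch of the objective that is consistent with the definitions in the text (\(N\), \(\alpha\), \(x_{ij}^{p}\), \(l\), \(g\), \(d\), \(c\)) and, to the best of our knowledge, with the published paper. The category superscript is written \(p\) throughout to match the indicator defined above, and \(m\) here ranges over box parameters rather than feature maps; treat it as a reading aid rather than a replacement for the figures.

```latex
% Overall objective: weighted sum of confidence and localization losses
L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)

% Localization loss: Smooth L1 between predicted and encoded ground truth offsets
L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}}
    x_{ij}^{p} \, \mathrm{smooth}_{L1}\left( l_i^{m} - \hat{g}_j^{m} \right)

\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \qquad
\hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}

\hat{g}_j^{w} = \log\left( g_j^{w} / d_i^{w} \right), \qquad
\hat{g}_j^{h} = \log\left( g_j^{h} / d_i^{h} \right)

% Confidence loss: softmax loss over class confidences (class 0 is background)
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left( \hat{c}_i^{p} \right)
    - \sum_{i \in Neg} \log\left( \hat{c}_i^{0} \right),
\qquad
\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}
```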
Choosing scales and aspect ratios for default boxes To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8×8 and 4×4) which are used in the framework. In practice, we can use many more with small computational overhead.	기본 상자의 비율 및 종횡비 선택 서로 다른 객체 크기를 처리하기 위해 일부 방법 [4,9]에서는 이미지를 다른 크기로 처리하고 나중에 결과를 결합하는 것이 좋습니다. 그러나 예측을 위해 단일 네트워크에서 여러 레이어의 기능 맵을 활용함으로써 동일한 효과를 모방하고 모든 객체 척도에서 매개 변수를 공유 할 수 있습니다. 이전 연구 [10,11]에서는 하위 계층이 입력 객체의 세부 묘사를 더 잘 포착하기 때문에 하위 계층의 특성 맵을 사용하면 의미 론적 세분화 품질이 향상 될 수 있음을 보여주었습니다. 유사하게, [12]는 feature map으로부터 풀링 된 global context를 추가하는 것이 segmentation 결과를 부드럽게하는데 도움을 줄 수 있음을 보였다. 이 방법들에 의해 동기 부여를 위해 우리는 탐지를 위해 하부 및 상부 특징 맵 모두를 사용한다. 그림 1은 프레임 워크에서 사용되는 두 개의 표본 피쳐 맵 (8 × 8 및 4 × 4)을 보여줍니다. 실제로는 적은 계산 오버 헤드로 더 많은 것을 사용할 수 있습니다.
Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. Fortunately, within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each layer. We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use m feature maps for prediction. The scale of the default boxes for each feature map is computed as:	네트워크 내의 서로 다른 레벨의 특징 맵은 서로 다른 (경험적) 수용 영역 크기를 갖는 것으로 알려져 있습니다 [13]. 다행히 SSD 프레임워크에서는 기본 상자가 각 레이어의 실제 수용 영역과 반드시 일치할 필요는 없습니다. 특정 특징 맵이 객체의 특정 스케일에 반응하도록 학습되게끔 기본 상자의 타일링을 설계합니다. 예측에 m 개의 특징 맵을 사용한다고 가정합니다. 각 특징 맵의 기본 상자 스케일은 다음과 같이 계산됩니다.
Ssd-v5-eq4.jpg
where \(s_{min}\) is 0.2 and \(s_{max}\) is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, and all layers in between are regularly spaced. We impose different aspect ratios for the default boxes, and denote them as \(a_r \in \{1,2,3,\frac{1}{2},\frac{1}{3}\}\). We can compute the width (\(w_k^a = s_k \sqrt{a_r}\)) and height (\(h_k^a = s_k / \sqrt{a_r}\)) for each default box. For the aspect ratio of 1, we also add a default box whose scale is \(s'_k = \sqrt{s_{k}s_{k+1}}\), resulting in 6 default boxes per feature map location. We set the center of each default box to \((\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|})\), where \(|f_k|\) is the size of the \(k\)-th square feature map, \(i, j \in [0, |f_k|)\). In practice, one can also design a distribution of default boxes to best fit a specific dataset. How to design the optimal tiling is an open question as well.	여기서 \(s_{min}\)은 0.2이고 \(s_{max}\)는 0.9입니다. 즉 가장 낮은 레이어의 스케일은 0.2, 가장 높은 레이어의 스케일은 0.9이며, 그 사이의 모든 레이어는 일정한 간격으로 배치됩니다. 기본 상자에는 서로 다른 종횡비를 적용하며 이를 \(a_r \in \{1,2,3,\frac{1}{2},\frac{1}{3}\}\)로 표시합니다. 각 기본 상자의 너비 (\(w_k^a = s_k \sqrt{a_r}\))와 높이 (\(h_k^a = s_k / \sqrt{a_r}\))를 계산할 수 있습니다. 종횡비가 1인 경우 스케일이 \(s'_k = \sqrt{s_{k}s_{k+1}}\)인 기본 상자를 추가하여 특징 맵 위치당 6 개의 기본 상자가 생성됩니다. 각 기본 상자의 중심은 \((\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|})\)로 설정합니다. 여기서 \(|f_k|\)는 \(k\) 번째 정사각형 특징 맵의 크기이고 \(i, j \in [0, |f_k|)\)입니다. 실제로는 특정 데이터 세트에 가장 잘 맞도록 기본 상자의 분포를 설계할 수도 있습니다. 최적의 타일링을 설계하는 방법 역시 아직 열려 있는 문제입니다.
By combining predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps, we have a diverse set of predictions, covering various input object sizes and shapes. For example, in Fig. 1, the dog is matched to a default box in the 4 × 4 feature map, but not to any default boxes in the 8 × 8 feature map. This is because those boxes have different scales and do not match the dog box, and therefore are considered as negatives during training.	모든 기본 기능 상자에 대한 예측을 여러 기능 맵의 모든 위치에서 다른 비율 및 종횡비로 결합하여 다양한 입력 개체 크기 및 모양을 포괄하는 다양한 예측을 제공합니다. 예를 들어, Fig. 1이면 개가 4x4 기능 맵에서 기본 상자와 일치하지만 8x8 기능 맵에서는 기본 상자와 일치하지 않습니다. 이러한 상자는 다른 크기와 도그 박스와 일치하지 않기 때문에 훈련 중에 네거티브로 간주되기 때문입니다.
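The scale rule, aspect ratios, and box centers described above can be put together in a short sketch. Several points are assumptions rather than statements from the text: "regularly spaced" is read as linear spacing in the layer index k, the scale \(s_{k+1}\) needed for the extra box on the last feature map is extrapolated with the same step, and the function names are hypothetical.

```python
import math


def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scales s_1..s_m, assumed to be linearly spaced between s_min and s_max."""
    if m == 1:
        return [s_min]
    step = (s_max - s_min) / (m - 1)
    return [s_min + step * (k - 1) for k in range(1, m + 1)]


def default_boxes_for_map(k, f_k, scales, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """Default boxes (cx, cy, w, h), in relative coordinates, for the k-th map.

    k is 1-indexed and f_k is the side length of the square feature map.
    """
    s_k = scales[k - 1]
    if k < len(scales):
        s_next = scales[k]
    else:
        # Assumption: extrapolate s_{k+1} for the last map with the same step.
        s_next = s_k + (scales[-1] - scales[0]) / max(len(scales) - 1, 1)
    # One box per aspect ratio: w = s_k * sqrt(a_r), h = s_k / sqrt(a_r).
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Extra box for aspect ratio 1 with scale sqrt(s_k * s_{k+1}).
    s_extra = math.sqrt(s_k * s_next)
    shapes.append((s_extra, s_extra))
    boxes = []
    for i in range(f_k):
        for j in range(f_k):
            cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k
            boxes.extend((cx, cy, w, h) for w, h in shapes)
    return boxes


scales = default_box_scales(6)
boxes = default_boxes_for_map(k=1, f_k=4, scales=scales)
print([round(s, 2) for s in scales], len(boxes))
```

For six feature maps this spacing gives a step of 0.14 between consecutive scales, and a 4 × 4 map carries 6 boxes at each of its 16 locations.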
Hard negative mining After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a significant imbalance between the positive and negative training examples. Instead of using all the negative examples, we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and a more stable training.	단점 마이닝 (Hard Negative mining) 매칭 단계가 끝나면 기본 상자의 대부분이 네거티브이며, 특히 가능한 기본 상자 수가 많으면 특히 그렇습니다. 이것은 긍정적 인 것과 부정적인 훈련의 예들 사이에 상당한 불균형을 가져온다. 모든 음화 예제를 사용하는 대신, 각 기본 상자에 대해 가장 높은 신뢰도 손실을 사용하여 정렬하고 음수와 양수 비율이 3 : 1 이하가되도록 상위 상자를 선택합니다. 우리는 이것이보다 빠른 최적화와보다 안정적인 교육을 유도한다는 것을 발견했습니다.
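The selection rule described above can be sketched in a few lines. This is an illustration only: the function name `select_hard_negatives` and the list-based inputs are assumptions, and the per-box confidence losses are taken as given rather than computed.

```python
def select_hard_negatives(conf_losses, is_positive, max_ratio=3):
    """Return indices of the negatives kept for training, hardest first.

    conf_losses[i] is the confidence loss of default box i and
    is_positive[i] says whether box i was matched to a ground truth box.
    At most max_ratio negatives are kept per positive.
    """
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Sort negatives by confidence loss, highest (hardest) first.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[: max_ratio * num_pos]


losses = [0.1, 2.0, 0.5, 3.0, 0.2, 0.05, 1.0]
positive = [True, False, False, False, False, False, False]
print(select_hard_negatives(losses, positive))
```

With one positive and six negatives, only the three negatives with the highest loss are kept, so the negative to positive ratio is 3:1. With no positives the function keeps no negatives, which matches setting the loss to 0 when N = 0.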
Data augmentation To make the model more robust to various input object sizes and shapes, each training image is randomly sampled by one of the following options: Use the entire original input image. Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9. Randomly sample a patch.	데이터 증가 다양한 입력 개체 크기 및 모양에 대한 모델을보다 강력하게 만들기 위해 각 교육 이미지는 다음 옵션 중 하나에 의해 무작위로 샘플링됩니다. 전체 원본 입력 이미지를 사용하십시오. 최소 jaccard가 0.1, 0.3, 0.5, 0.7 또는 0.9가되도록 패치를 샘플링하십시오. 무작위로 패치를 샘플링하십시오.
The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio is between 1/2 and 2. We keep the overlapped part of the ground truth box if the center of it is in the sampled patch. After the aforementioned sampling step, each sampled patch is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [14].	샘플링 된 각 패치의 크기는 원본 이미지 크기의 [0.1, 1]이고 종횡비는 1/2에서 2 사이입니다. 중앙에있는 샘플이 샘플링 된 패치 인 경우에는 지상 진실 상자의 중첩 된 부분을 유지합니다 . 앞서 언급 한 샘플링 단계 후에 샘플링 된 각 패치는 고정 크기로 크기가 조정되고 [14]에서 설명한 것과 유사한 몇 가지 사진 메트릭 왜곡을 적용 할뿐만 아니라 0.5의 확률로 수평으로 대칭 이동됩니다.
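The patch sampling and box filtering steps above can be sketched as follows. This is a simplified illustration under several assumptions: coordinates are relative to the image (0 to 1), the "[0.1, 1]" size range is read as applying to each side of the patch, the aspect ratio is measured in these relative coordinates, and the minimum jaccard overlap constraint, resizing, flipping, and photometric distortions are left out. The function names are hypothetical.

```python
import random


def sample_patch(rng, min_size=0.1, max_size=1.0, min_ar=0.5, max_ar=2.0):
    """Sample a patch (xmin, ymin, xmax, ymax) in relative image coordinates."""
    while True:
        w = rng.uniform(min_size, max_size)
        h = rng.uniform(min_size, max_size)
        # Reject shapes whose aspect ratio falls outside [1/2, 2].
        if min_ar <= w / h <= max_ar:
            break
    x = rng.uniform(0.0, 1.0 - w)
    y = rng.uniform(0.0, 1.0 - h)
    return (x, y, x + w, y + h)


def keep_boxes(patch, boxes):
    """Keep ground truth boxes whose center lies in the patch, clipped to it."""
    px0, py0, px1, py1 = patch
    kept = []
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if px0 <= cx <= px1 and py0 <= cy <= py1:
            # Keep only the part of the box that overlaps the patch.
            kept.append((max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1)))
    return kept


rng = random.Random(0)
patch = sample_patch(rng)
gt_boxes = [(0.1, 0.1, 0.5, 0.5), (0.0, 0.0, 0.2, 0.2), (0.7, 0.7, 1.0, 1.0)]
print(patch, keep_boxes(patch, gt_boxes))
```

A fuller version would repeat the sampling until the patch also satisfies the chosen minimum jaccard overlap with the objects, then resize the patch and apply the random horizontal flip.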