Going Deeper with Convolutions

We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.

Introduction

In the last three years, mainly due to advances in deep learning, and more concretely convolutional networks ²², the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. For example, the top entries in the ILSVRC 2014 competition used no new data sources besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al. ²³ from two years ago, while being significantly more accurate. The biggest gains in object detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. ²⁴.
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up being a purely academic curiosity, but could be put to real-world use, even on large datasets, at a reasonable cost.
In this paper, we focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in Network paper by Lin et al. ²⁶ in conjunction with the famous “we need to go deeper” internet meme ²⁷. In our case, the word “deep” is used in two different senses: first, in the sense that we introduce a new level of organization in the form of the “Inception module”, and second, in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of ²⁸ while taking inspiration and guidance from the theoretical work by Arora et al. ²⁹. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, on which it significantly outperforms the current state of the art.

Starting with LeNet-5 ³⁰, convolutional neural networks (CNNs) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR and most notably on the ImageNet classification challenge ³¹³². For larger datasets such as ImageNet, the recent trend has been to increase the number of layers ³³ and layer size ³⁴³⁵, while using dropout ³⁶ to address the problem of overfitting.
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as ³⁷ has also been successfully employed for localization ³⁸³⁹, object detection ⁴⁰⁴¹⁴²⁴³ and human pose estimation ⁴⁴. Inspired by a neuroscience model of the primate visual cortex, Serre et al. ⁴⁵ use a series of fixed Gabor filters of different sizes in order to handle multiple scales, similarly to the Inception model. However, contrary to the fixed 2-layer deep model of ⁴⁶, all filters in the Inception model are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.
Network-in-Network is an approach proposed by Lin et al. ⁴⁷ in order to increase the representational power of neural networks. When applied to convolutional layers, the method can be viewed as additional 1×1 convolutional layers, typically followed by the rectified linear activation ⁴⁸. This enables it to be easily integrated into current CNN pipelines. We use this approach heavily in our architecture. However, in our setting, 1×1 convolutions have a dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks that would otherwise limit the size of our networks. This allows for increasing not just the depth, but also the width of our networks without a significant performance penalty.
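To make the 1×1 convolution concrete, the following is a minimal pure-Python sketch, not the implementation used for GoogLeNet. It applies the same linear map to the channel vector at every spatial position and then a rectified linear activation, which is how a 1×1 layer can reduce the channel dimension. The function name `conv1x1_relu`, the nested-list layout (height × width × channels) and the toy numbers are assumptions made for illustration.

```python
def conv1x1_relu(feature_map, weights, bias):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: H x W x C_in nested lists (assumed layout)
    weights:     C_out x C_in nested lists
    bias:        list of C_out values
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            # The same linear map is applied at every spatial position.
            projected = [
                sum(w * x for w, x in zip(w_row, pixel)) + b
                for w_row, b in zip(weights, bias)
            ]
            # Rectified linear activation.
            out_row.append([max(0.0, v) for v in projected])
        out.append(out_row)
    return out


# Toy example: a 2x2 map with 4 channels reduced to 2 channels.
fmap = [
    [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
    [[-1.0, -2.0, -3.0, -4.0], [2.0, 0.0, 2.0, 0.0]],
]
weights = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0],
]
bias = [0.0, 0.0]
reduced = conv1x1_relu(fmap, weights, bias)
print(reduced)
```

The spatial grid is unchanged; only the number of channels per position drops from 4 to 2.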
The current leading approach for object detection is Regions with Convolutional Neural Networks (R-CNN), proposed by Girshick et al. ⁴⁹. R-CNN decomposes the overall detection problem into two subproblems: first, to utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion, and then to use CNN classifiers to identify object categories at those locations. Such a two-stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification ability of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box ⁵⁰ prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.

Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes increasing both the depth of the network (the number of levels) and its width (the number of units at each level). This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This can become a major bottleneck, since the creation of high quality training sets can be tricky and expensive, especially if expert human raters are necessary to distinguish between fine-grained visual categories like those in ImageNet (even in the 1000-class ILSVRC subset), as demonstrated by Figure 1.
Another drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up close to zero), then a lot of computation is wasted. Since in practice the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of results.
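The quadratic growth can be checked with a short calculation. The sketch below counts multiply-adds for two chained convolutional layers, assuming stride 1 and padding that preserves the spatial size; the function names and the layer sizes (28×28 grid, 3×3 kernels, 64 input channels) are illustrative assumptions, not values from the paper.

```python
def conv_mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds of one convolution, assuming stride 1 and 'same' padding."""
    return height * width * kernel * kernel * c_in * c_out


def chained_cost(height, width, kernel, c_in, filters):
    """Cost of two chained conv layers that both use `filters` filters."""
    first = conv_mult_adds(height, width, kernel, c_in, filters)
    # The second layer sees `filters` input channels and emits `filters` outputs,
    # so its cost scales with filters * filters.
    second = conv_mult_adds(height, width, kernel, filters, filters)
    return first, second


base = chained_cost(28, 28, 3, 64, 128)
doubled = chained_cost(28, 28, 3, 64, 256)
print("first layer ratio: ", doubled[0] / base[0])
print("second layer ratio:", doubled[1] / base[1])
```

Doubling the filter count doubles the cost of the first layer but quadruples the cost of the second, because both its input and output channel counts grow.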
The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. ⁵¹. Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well-known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.
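As a rough illustration of "clustering neurons with highly correlated outputs", the sketch below groups units whose activations over the same inputs have a high Pearson correlation. This is only a toy greedy grouping written for this note and is not the construction of Arora et al.; the function names, the threshold of 0.9 and the activation values are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long activation sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant unit is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def cluster_units(activations, threshold=0.9):
    """Greedy grouping: a unit joins the first cluster whose first member
    (used as the representative) it correlates with at or above `threshold`."""
    clusters = []
    for name, values in activations.items():
        for cluster in clusters:
            representative = activations[cluster[0]]
            if pearson(values, representative) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical activations of four units over the same four inputs.
activations = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
    "d": [1.0, 0.0, 1.0, 0.0],
}
print(cluster_units(activations))
```

Units "a" and "b" fire together and end up in one group, which in the layer-by-layer picture would feed a common unit in the next layer.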
On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware ⁵²⁵³. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems utilize sparsity in the spatial domain just by virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since ⁵⁴ in order to break the symmetry and improve learning, yet the trend changed back to full connections with ⁵⁵ in order to better optimize parallel computing. The uniformity of the structure, a large number of filters and a greater batch size allow for utilizing efficient dense computation.
This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. ⁵⁶) suggests that clustering sparse matrices into relatively dense submatrices tends to give state-of-the-art practical performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep learning architectures in the near future.
The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by ⁵⁷ for vision networks and covering the hypothesized outcome by dense, readily available components. Despite this being a highly speculative undertaking, after only two iterations on the exact choice of topology, we could already see modest gains against the reference architecture based on ⁵⁸. After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for ⁵⁹ and ⁶⁰. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least locally optimal.
One must be cautious though: although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction. Making sure would require much more thorough analysis and verification: for example, checking whether automated tools based on the principles described below would find similar, but better topologies for the vision networks. The most convincing proof would be if an automated system created network topologies resulting in similar gains in other domains using the same algorithm but with a very different-looking global architecture. At the very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.

Architectural Details

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. ⁶¹ suggest a layer-by-layer construction in which one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from the earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input), correlated units would concentrate in local regions. This means that we would end up with a lot of clusters concentrated in a single region, and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in ⁶². However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; however, this decision was based more on convenience than on necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage.
Additionally, since pooling operations have been essential for the success of current state-of-the-art convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have an additional beneficial effect, too (see Figure 2(a)).
As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease, suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.
One big problem with the above modules, at least in this naive form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. Even while this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow-up within a few stages.
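The growth in the number of outputs can be seen by tracking channel counts through a few stacked naive modules. In this sketch the pooling branch passes all input channels through unchanged, so each module's output is strictly wider than its input. The branch widths (64, 128 and 32 filters) and the starting width of 192 are illustrative assumptions.

```python
def naive_module_out_channels(c_in, n1x1, n3x3, n5x5):
    """Output channels of a naive module: three conv branches plus a pooling
    branch that keeps all c_in channels, concatenated along the channel axis."""
    return n1x1 + n3x3 + n5x5 + c_in


channels = 192  # assumed input width
history = [channels]
for _ in range(4):
    channels = naive_module_out_channels(channels, 64, 128, 32)
    history.append(channels)

print(history)
```

Since every convolution in the next module takes all of these channels as input, the cost of its 5×5 branch grows along with the channel count.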
This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form, and compressed information is harder to model. We would like to keep our representation sparse at most places (as required by the conditions of ⁶³) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation, which makes them dual-purpose. The final result is depicted in Figure 2(b).
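A back-of-the-envelope comparison shows how much a 1×1 reduction can save in front of a 5×5 convolution. The sketch assumes stride 1, size-preserving padding, and illustrative sizes chosen for this note (a 28×28 grid, 192 input channels, a reduction to 16 channels, 32 5×5 filters, and a 32-filter projection after pooling); the function names are likewise hypothetical.

```python
def mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds of one convolution with stride 1 and 'same' padding."""
    return height * width * kernel * kernel * c_in * c_out


H = W = 28
C_IN = 192
N_5X5 = 32
REDUCE_5X5 = 16

# 5x5 convolution applied directly to all input channels.
direct = mult_adds(H, W, 5, C_IN, N_5X5)

# 1x1 reduction to a few channels, then the 5x5 convolution on the reduced map.
with_reduction = (
    mult_adds(H, W, 1, C_IN, REDUCE_5X5)
    + mult_adds(H, W, 5, REDUCE_5X5, N_5X5)
)


def module_out_channels(n1x1, n3x3, n5x5, pool_proj):
    """With a 1x1 projection after pooling, the output width no longer
    depends on the input width."""
    return n1x1 + n3x3 + n5x5 + pool_proj


print(direct, with_reduction, round(direct / with_reduction, 2))
print(module_out_channels(64, 128, 32, 32))
```

Under these assumptions the reduced branch needs roughly a tenth of the multiply-adds, and the pooling projection keeps the module's output width fixed regardless of the input width.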
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary; it simply reflects some infrastructural inefficiencies in our current implementation.
One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage from the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
The improved use of computational resources allows for increasing both the width of each stage and the number of stages without getting into computational difficulties. Another way to utilize the Inception architecture is to create slightly inferior, but computationally cheaper versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2-3× faster than similarly performing networks with a non-Inception architecture; however, this requires careful manual design at this point.

Figure 2: Inception module

GoogLeNet

Figure 3: GoogLeNet network with all the bells and whistles

We chose GoogLeNet as our team name in the ILSVRC14 competition. This name is an homage to Yann LeCun's pioneering LeNet-5 network ⁶⁴. We also use GoogLeNet to refer to the particular incarnation of the Inception architecture used in our submission for the competition. We have also used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally. We omit the details of that network, since our experiments have shown that the influence of the exact architectural parameters is relatively minor. Here, the most successful particular instance (named GoogLeNet) is described in Table 1 for demonstrational purposes. The exact same topology (trained with different sampling methods) was used for 6 out of the 7 models in our ensemble.
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224, taking RGB color channels with mean subtraction. “#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.
The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with a low memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. However, this number depends on the machine learning infrastructure system used. The use of average pooling before the classifier is based on ⁶⁵, although our implementation differs in that we use an extra linear layer. This enables adapting and fine-tuning our networks for other label sets easily, but it is mostly a convenience and we do not expect it to have a major effect. It was found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the fully connected layers.
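The sketch below illustrates average pooling before a linear classifier and what it does to the parameter count. It assumes, for illustration only, a final 7×7×1024 feature map and 1000 output classes; the nested-list layout and the function names are also assumptions. The comparison with a fully connected layer on the flattened map is a hypothetical baseline, not a configuration reported above.

```python
def global_average_pool(feature_map):
    """Average each channel over all spatial positions (H x W x C nested lists)."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for i, value in enumerate(pixel):
                totals[i] += value
    return [t / (height * width) for t in totals]


def linear_params(n_in, n_out):
    """Weights plus biases of a linear layer."""
    return n_in * n_out + n_out


# Tiny check of the pooling itself: a 1x2 map with 2 channels.
print(global_average_pool([[[1.0, 2.0], [3.0, 4.0]]]))

# Assumed shapes: 7x7x1024 final feature map, 1000 classes.
fc_on_flattened = linear_params(7 * 7 * 1024, 1000)
linear_after_pooling = linear_params(1024, 1000)
print(fc_on_flattened, linear_after_pooling)
```

Under these assumed shapes, pooling first shrinks the classifier from about 50 million parameters to about 1 million.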
Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. One interesting insight is that the strong performance of relatively shallow networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.
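The weighting of the auxiliary losses can be written down in a few lines. This is a minimal sketch assuming a softmax cross-entropy loss per classifier and a plain weighted sum; the function names and the example logits are hypothetical, and only the 0.3 weight comes from the text.

```python
import math

AUX_WEIGHT = 0.3  # discount weight stated in the text


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def total_training_loss(main_loss, aux_losses, aux_weight=AUX_WEIGHT):
    """Main loss plus the discounted sum of the auxiliary classifier losses."""
    return main_loss + aux_weight * sum(aux_losses)


# Hypothetical logits from the main head and the two auxiliary heads.
label = 2
main = softmax_cross_entropy([0.1, 0.2, 3.0], label)
aux_4a = softmax_cross_entropy([0.5, 0.4, 1.0], label)
aux_4d = softmax_cross_entropy([0.2, 0.1, 2.0], label)
print(total_training_loss(main, [aux_4a, aux_4d]))
```

At inference time only the main head is evaluated, so the auxiliary terms play no role there.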
The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
• An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.
• A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
• A fully connected layer with 1024 units and rectified linear activation.
• A dropout layer with 70% ratio of dropped outputs.
• A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
A schematic view of the resulting network is depicted in Figure 3.
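The shapes in the auxiliary head can be traced with a small helper. The sketch assumes a 14×14 spatial grid at the (4a) and (4d) outputs and no padding in the pooling layer, which is consistent with the stated 4×4 result; the function names and layer labels are illustrative.

```python
def pool_output_size(size, kernel, stride):
    """Output length of a pooling window without padding."""
    return (size - kernel) // stride + 1


def aux_head_shapes(in_h, in_w, in_c, num_classes=1000):
    """Trace output shapes through the auxiliary classifier described above."""
    h = pool_output_size(in_h, 5, 3)
    w = pool_output_size(in_w, 5, 3)
    return [
        ("avg_pool_5x5_s3", (h, w, in_c)),
        ("conv_1x1_128_relu", (h, w, 128)),
        ("fc_1024_relu", (1024,)),   # input is the flattened h*w*128 map
        ("dropout_0.7", (1024,)),    # active during training only
        ("linear_softmax", (num_classes,)),
    ]


for stage, channels in (("4a", 512), ("4d", 528)):
    print(stage, aux_head_shapes(14, 14, channels))
```

With a 14×14 input the pooling gives (14 - 5) // 3 + 1 = 4 positions per side, so the fully connected layer sees 4 · 4 · 128 = 2048 inputs for both stages.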

Training Methodology

Our networks were trained using the DistBelief ⁶⁶ distributed machine learning system with a modest amount of model and data parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum ⁶⁷ and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging ⁶⁸ was used to create the final model used at inference time.
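The schedule and the averaging step can be sketched as follows. The text gives neither the base learning rate nor the exact averaging variant, so `base_lr` is a hypothetical parameter, the decay is read as a step-wise multiplication by 0.96 every 8 epochs, and the averaging is shown as a uniform running mean over parameter snapshots.

```python
def learning_rate(epoch, base_lr, decay=0.04, step=8):
    """Step-wise schedule: reduce the rate by 4% every 8 epochs.

    base_lr is hypothetical; the text does not state its value.
    """
    return base_lr * (1.0 - decay) ** (epoch // step)


class PolyakAverager:
    """Uniform running mean of parameter vectors seen during training.

    This is one common reading of Polyak averaging; the exact variant
    used for the final model is not specified in the text.
    """

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, params):
        """Fold one parameter snapshot (a list of floats) into the mean."""
        self.count += 1
        if self.mean is None:
            self.mean = list(params)
        else:
            self.mean = [
                m + (p - m) / self.count for m, p in zip(self.mean, params)
            ]
        return self.mean
```

With this reading, epochs 0 through 7 use `base_lr`, epochs 8 through 15 use `0.96 * base_lr`, and so on; the averaged parameters, not the last iterate, would be the ones used at inference time.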
Our image sampling methods changed substantially over the months leading up to the competition, and already converged models were trained further with other options, sometimes in conjunction with changed hyperparameters such as dropout and learning rate, so it is hard to give definitive guidance on the single most effective way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by ⁶⁹. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3. Also, we found that the photometric distortions by Andrew Howard ⁷⁰ were useful to combat overfitting to some extent. In addition, we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing relatively late and in conjunction with other hyperparameter changes, so we could not tell definitively whether the final results were affected positively by their use.
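The patch sampling prescription above can be sketched as a small rejection sampler. The text does not say how the aspect ratio is distributed or what happens when a sampled patch does not fit, so the uniform aspect-ratio draw, the retry loop and the whole-image fallback are assumptions, as are all names.

```python
import math
import random


def sample_crop(width, height, rng=random, max_tries=10):
    """Sample a patch as (left, top, crop_width, crop_height).

    The patch area is uniform in [8%, 100%] of the image area and the
    aspect ratio (width / height) is drawn from [3/4, 4/3]. Drawing the
    ratio uniformly, retrying, and the fallback are assumptions.
    """
    image_area = width * height
    for _ in range(max_tries):
        target_area = rng.uniform(0.08, 1.0) * image_area
        aspect = rng.uniform(3.0 / 4.0, 4.0 / 3.0)
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        # Keep only patches that fit inside the image.
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Fallback after repeated rejections: use the whole image.
    return 0, 0, width, height
```

A call such as `sample_crop(640, 480, rng=random.Random(0))` returns one patch; the patch would then be resized to the network input size, optionally with a randomly chosen interpolation method as described in the text.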

ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying an image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
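The top-k criterion described above can be illustrated with a few lines of code. This is a sketch on plain Python lists, not the official evaluation script; the function name and input layout are assumptions.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of images whose ground truth is not among the top k.

    scores -- one list of per-class scores per image
    labels -- one ground truth class index per image
    """
    errors = 0
    for image_scores, label in zip(scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(
            range(len(image_scores)),
            key=lambda c: image_scores[c],
            reverse=True,
        )
        # The rank inside the top k does not matter, only membership.
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)
```

Setting `k=1` gives the top-1 error (one minus the top-1 accuracy), and `k=5` gives the top-5 error used for ranking.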
We participated in the challenge with no external data used for training. In addition to the training techniques described above, we adopted a set of techniques during testing to obtain higher performance, which we describe below. We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, mainly because of an oversight) and learning rate policies, and they only differ in sampling methodologies and the random order in which they see input images. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. ⁷¹. Specifically, we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, and take the left, center and right squares of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224 x 224 crop as well as the square resized to 224 x 224, and their mirrored versions. This results in 4 x 3 x 6 x 2 = 144 crops per image. A similar approach was used by Andrew Howard ⁷² in the previous year's entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on). The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they led to performance inferior to that of simple averaging.
In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.
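The crop count can be verified by enumerating the combinations described above. The sketch below only lists crop descriptors and averages probability vectors; the labels and function names are illustrative, and the actual image resizing and cropping are left out.

```python
from itertools import product

SCALES = (256, 288, 320, 352)  # length of the shorter image side
SQUARES = ("left or top", "center", "right or bottom")
SUB_CROPS = (
    "top-left 224", "top-right 224", "bottom-left 224",
    "bottom-right 224", "center 224", "square resized to 224",
)
MIRRORED = (False, True)


def enumerate_crops():
    """Return one descriptor per test-time crop of a single image."""
    return list(product(SCALES, SQUARES, SUB_CROPS, MIRRORED))


def average_probabilities(prob_vectors):
    """Average softmax vectors element-wise (over crops and models)."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


if __name__ == "__main__":
    print(len(enumerate_crops()))
```

The enumeration has 4 * 3 * 6 * 2 = 144 entries. With 7 models, the final prediction would average 7 * 144 probability vectors per image using `average_probabilities`.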
Our final submission in the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among the participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about 40% relative reduction compared to the previous year's best approach (Clarifai), both of which used external data for training the classifiers. The following table shows the statistics of some of the top-performing approaches.
We also analyze and report the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image, in the following table. When we use one model, we choose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order not to overfit to the testing data statistics.

ILSVRC 2014 Detection Challenge Setup and Results

The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary from large to tiny. Results are reported using the mean average precision (mAP).
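The overlap criterion can be sketched as follows. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples in continuous coordinates, and the function names are illustrative; the official evaluation has additional rules (for example, how multiple detections of one object are counted) that are not modeled here.

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (
        (ax1 - ax0) * (ay1 - ay0)
        + (bx1 - bx0) * (by1 - by0)
        - intersection
    )
    return intersection / union if union > 0 else 0.0


def is_correct_detection(pred_class, pred_box, gt_class, gt_box,
                         threshold=0.5):
    """True if the class matches and the boxes overlap by >= threshold."""
    return (
        pred_class == gt_class
        and jaccard_index(pred_box, gt_box) >= threshold
    )
```

For instance, two 2 x 2 boxes shifted by one unit horizontally intersect in an area of 2 and have a union of 6, giving a Jaccard index of 1/3, which is below the 50% threshold.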
The approach taken by GoogLeNet for detection is similar to the R-CNN approach of ⁷³, but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the Selective Search ⁷⁴ approach with multi-box ⁷⁵ predictions for higher object bounding box recall. In order to cut down the number of false positives, the superpixel size was increased by 2x. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box ⁷⁶, resulting, in total, in about 60% of the proposals used by ⁷⁷, while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 ConvNets when classifying each region, which improves results from 40% to 43.9% accuracy. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.
We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use Convolutional Networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pre-training.
In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and, surprisingly, only improves by 0.3 points with an ensemble of 3 models, while GoogLeNet obtains significantly stronger results with the ensemble.

Conclusions

Our results seem to yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks. Also note that our detection work was competitive despite neither utilizing context nor performing bounding box regression, and this fact provides further evidence of the strength of the Inception architecture. Although it is expected that similar quality of result can be achieved by much more expensive networks of similar depth and width, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests promising future work towards creating sparser and more refined structures in automated ways on the basis of ⁷⁸.

Acknowledgements

We would like to thank Sanjeev Arora and Aditya Bhaskara for fruitful discussions on ⁷⁹. We are also indebted to the DistBelief ⁸⁰ team for their support, especially to Rajat Monga, Jon Shlens, Alex Krizhevsky, Jeff Dean, Ilya Sutskever and Andrea Frome. We would also like to thank Tom Duerig and Ning Ye for their help on photometric distortions. Our work also would not have been possible without the support of Chuck Rosenberg and Hartwig Adam.

Inception model

(Redirection from "Inception model" is possible)

One by One (1x1) Convolution - counter-intuitively useful
[Recommended] Google Inception Model v1~v4 (ko) ⁸¹

Documentation

Going deeper with convolutions (GoogLeNet): http://arxiv.org/pdf/1409.4842v1.pdf; Going_deeper_with_convolutions.pdf

GoogLeNet caffe train_val.prototxt: https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/train_val.prototxt

Favorite site

Going Deeper with Convolutions
[Recommended] Inception (GoogLeNet) review

References

Know your meme: We need to go deeper. http://knowyourmeme.com/memes/we-need-to-go-deeper. Accessed: 2014-09-15. ↩
Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013. ↩
Ümit V. Çatalyürek, Cevdet Aykanat, and Bora Uçar. On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput., 32(2):656–683, February 2010. ↩
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1232–1240. 2012. ↩
Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014. ↩
Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014. ↩
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. ↩
Andrew G. Howard. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013. ↩
Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012. ↩
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, December 1989. ↩
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. ↩
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013. ↩
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992. ↩
Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013. ↩
Thomas Serre, Lior Wolf, Stanley M. Bileschi, Maximilian Riesenhuber, and Tomaso Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007. ↩
Fengguang Song and Jack Dongarra. Scaling up matrix computations on shared-memory manycore systems with 1000 cpu cores. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, pages 333–342, New York, NY, USA, 2014. ACM. ↩
Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013. ↩
Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2553–2561, 2013. ↩
Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. CoRR, abs/1312.4659, 2013. ↩
Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, and Arnold W. M. Smeulders. Segmentation as selective search for object recognition. In Proceedings of the 2011 International Conference on Computer Vision, ICCV ’11, pages 1879–1886, Washington, DC, USA, 2011. IEEE Computer Society. ↩
Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David J. Fleet, Tomáš Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer, 2014. ↩
Norman3.github.io_-_Google_Inception_Model_v1-v4.pdf ↩