This post is a close reading of the ResNet paper. As this is a first pass, mistakes or shallow interpretations are inevitable; corrections are welcome.

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers — 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions [1], where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Translation

Deeper neural networks are harder to train. We propose a residual learning framework to ease the training of networks substantially deeper than those used before. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, rather than learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual networks up to 152 layers deep — 8× deeper than VGG nets, yet with lower complexity. An ensemble of these residual networks achieves a 3.57% error rate on the ImageNet test set, a result that won first place in the ILSVRC 2015 classification task. We also analyze 100-layer and 1000-layer networks on CIFAR-10.

The depth of representations is of central importance for many visual recognition tasks. Solely on account of our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual networks are the foundation of our entries to the ILSVRC and COCO 2015 competitions, where we also won first place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
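A note on the reformulation mentioned in the abstract: writing H(x) for the desired underlying mapping of a few stacked layers, the paper (formalized later in its method section) lets those layers fit the residual F(x) = H(x) - x instead, so that the block outputs F(x) + x through an identity shortcut connection. The hypothesis is that optimizing the residual F(x), which is defined with reference to the input x, is easier than optimizing the original unreferenced mapping H(x).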

Vocabulary

reformulate: to restate or express in a different form

functions with reference to the layer inputs: functions defined relative to the layer inputs

unreferenced functions: functions without a reference, i.e. mappings that do not explicitly refer to the input

comprehensive empirical evidence: thorough evidence obtained from experiments

ensemble: combining the predictions of multiple models

solely: only, merely

Terminology

residual learning framework: a framework that reparameterizes network layers to learn residual mappings instead of the original, unreferenced mappings. This eases the optimization difficulties of very deep networks and makes them easier to train (see the minimal code sketch after this list).

ImageNet dataset: a well-known large-scale image classification dataset; the classification benchmark used here covers 1,000 classes with over a million images. It is widely used in computer vision for training and evaluating models.

VGG nets: VGGNet, a classic family of convolutional neural networks from the Visual Geometry Group at the University of Oxford. It performed excellently on ImageNet and related benchmarks, but at 16–19 layers it is much shallower than the deepest residual networks.

ILSVRC: the ImageNet Large Scale Visual Recognition Challenge, a well-known annual competition covering image classification, object detection, object localization, and related tasks, built on the ImageNet dataset.

CIFAR-10: a small-scale image classification dataset with 10 classes and 60,000 32×32 images, commonly used for testing and debugging models.

COCO object detection dataset: a large-scale dataset for object detection, segmentation, and related tasks, released by Microsoft and collaborators; it contains roughly 330K images spanning 80 object categories.
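To make the residual learning framework concrete, here is a minimal PyTorch sketch of a basic residual block in the spirit of the paper: the stacked convolutional layers learn the residual F(x), and the block outputs F(x) + x through an identity shortcut. The channel count, kernel sizes, and test shape are illustrative assumptions rather than a reproduction of any particular architecture table.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal basic residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        # The stacked layers that learn the residual function F(x).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        out = self.bn2(self.conv2(out))           # second half of F(x)
        return self.relu(out + x)                 # F(x) + x via the identity shortcut


# Input and output shapes match, so the shortcut is a plain identity
# and adds no extra parameters.
x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```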

1. Introduction

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs in image classification [21, 50, 40]. Deep networks naturally integrate low-, mid-, and high-level features [50] and classifiers in an end-to-end, multi-layer fashion, and the "levels" of features can be enriched by stacking more layers (depth). Recent evidence [41, 44] shows that network depth is of crucial importance: the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with depths ranging from 16 [41] to 30 [16] layers. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also benefited greatly from very deep models.

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

Driven by the significance of depth, a question arises: is learning better networks as easy as stacking more layers? One obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hampers convergence from the very start of training. This problem, however, has largely been addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation [22].
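As an illustration of the two remedies named above, the sketch below (PyTorch, with layer sizes chosen arbitrarily for the example) pairs a normalized initialization, here Kaiming/He initialization, with an intermediate normalization layer, batch normalization:

```python
import torch.nn as nn

# A conv layer with a normalized initialization, followed by an
# intermediate normalization layer and a nonlinearity.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')  # He init

block = nn.Sequential(
    conv,
    nn.BatchNorm2d(64),    # intermediate normalization layer
    nn.ReLU(inplace=True),
)
```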

When deeper networks are able to start converging, a degradation problem is exposed: as network depth increases, accuracy saturates (which is perhaps unsurprising) and then degrades rapidly. Unexpectedly, this degradation is not caused by overfitting: adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Figure 1 shows a typical example.

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

The degradation of training accuracy indicates that not all systems are equally easy to optimize. Consider a shallower architecture and its deeper counterpart obtained by adding more layers onto it. A solution to the deeper model exists by construction: the added layers are identity mappings, and the remaining layers are copied from the learned shallower model. The existence of this constructed solution implies that a deeper model should produce no higher training error than its shallower counterpart. Yet experiments show that our current solvers are unable to find solutions that are comparably good or better than the constructed one (or cannot do so in feasible time).
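The constructed solution in this argument can be made tangible with a small sketch (PyTorch, with hypothetical layer sizes): copy a shallower model and append layers that compute the identity; the deeper model then reproduces the shallower one exactly, so its training error can be no higher.

```python
import torch
import torch.nn as nn

# A hypothetical "learned" shallow model (random weights stand in for trained ones).
shallow = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Deeper counterpart by construction: copy the shallow layers and
# append extra layers that compute the identity mapping.
deeper = nn.Sequential(*shallow, nn.Identity(), nn.Identity())

x = torch.randn(4, 32)
# Identical outputs, so the deeper model's training error can be no higher.
assert torch.allclose(shallow(x), deeper(x))
```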