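This post is a close reading of the ResNet paper. It is a first read, so some parts may be wrong or only superficially understood; corrections are welcome.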


Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions [1], where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.


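In plain terms: deeper neural networks are harder to train. The authors propose a residual learning framework that makes it easier to train networks much deeper than those used before. Instead of having the layers learn an unreferenced function, they explicitly reformulate the layers to learn a residual function that takes the layer input as its reference. Extensive experiments show that these residual networks are easier to optimize and gain accuracy from greatly increased depth. On ImageNet, the deepest residual network evaluated has 152 layers, 8 times deeper than VGG nets yet with lower complexity. An ensemble of these residual networks reaches a 3.57% (top-5) error rate on the ImageNet test set, which took first place in the ILSVRC 2015 classification task. The paper also analyzes networks with 100 and 1000 layers on CIFAR-10.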

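The depth of the representation matters a great deal for many visual recognition tasks. Thanks to the very deep representations alone, the authors obtain a 28% relative improvement on the COCO object detection dataset. Deep residual networks underpin their entries to the ILSVRC and COCO 2015 competitions, where they also took first place in ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.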
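To make "28% relative improvement" concrete, here is a small arithmetic sketch. The two mAP values are the ones I recall from the paper's COCO detection table (VGG-16 versus ResNet-101 under the mAP@[.5, .95] metric); treat them as assumptions and check them against the paper. The helper name `relative_improvement` is my own.

```python
def relative_improvement(baseline, improved):
    """Gain expressed as a fraction of the baseline value."""
    return (improved - baseline) / baseline


# Assumed figures (mAP@[.5, .95] in percent); verify against the paper.
vgg16_map = 21.2
resnet101_map = 27.2

absolute_gain = resnet101_map - vgg16_map
relative_gain = relative_improvement(vgg16_map, resnet101_map)

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1%}")
```

The point is only that a gain of about 6 mAP points on a baseline of about 21 is a relative gain of roughly 28%, which is a much larger number than the absolute difference.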



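functions with reference to the layer inputs: functions expressed relative to the layer input, i.e. the layers learn what to add to their input rather than the whole output

unreferenced functions: functions learned with no explicit reference to the input, i.e. the layers have to produce the desired mapping directly

comprehensive empirical evidence: extensive evidence obtained through experiments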




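residual learning framework: the layers are reparameterized to learn a residual mapping F(x) = H(x) - x instead of the original mapping H(x), and the input x is added back through a shortcut. Note that the paper motivates this by the degradation problem rather than by vanishing gradients, which it says are largely handled by normalized initialization and normalization layers. The residual formulation makes very deep networks easier to optimize.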
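A minimal sketch of the idea in plain Python, assuming a two-layer residual branch F(x) = W2 * relu(W1 * x) and a shortcut that simply adds x back. The function names and the toy weights are mine, and the final ReLU that the paper applies after the addition is left out to keep the identity case easy to see.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_branch(x, w1, w2):
    """F(x) = W2 * relu(W1 * x): the residual that the stacked layers learn."""
    return matvec(w2, relu(matvec(w1, x)))


def residual_block(x, w1, w2):
    """y = F(x) + x: the shortcut adds the input back element by element."""
    return [f + xi for f, xi in zip(residual_branch(x, w1, w2), x)]


def plain_block(x, w1, w2):
    """The same two layers without a shortcut: they must output H(x) directly."""
    return residual_branch(x, w1, w2)


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero, the residual block passes x through unchanged,
# while the plain block collapses to the zero vector.
print(residual_block(x, zeros, zeros))
print(plain_block(x, zeros, zeros))
```

With zero weights the residual block is already an identity mapping, so the layers only have to learn a deviation from identity. The plain block would need specific non-zero weights to reproduce its input.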

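ImageNet dataset: a well-known large-scale image classification dataset. The ILSVRC classification subset used in the paper has 1000 classes and about 1.28 million training images. It is widely used in computer vision to train and evaluate models.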

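VGG nets: VGGNet, a classic convolutional neural network from the Visual Geometry Group at the University of Oxford. It performs well on ImageNet and other datasets, but at 16 to 19 weight layers it is shallow compared with the residual networks in this paper.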



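COCO object detection dataset: the COCO dataset, released by Microsoft and collaborators, a large-scale dataset for object detection, segmentation, and related tasks. It contains about 330K images, roughly 1.5 million object instances, and 80 object categories.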

1. Introduction

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.


Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
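To get a feel for why the initialization scale matters, here is a rough sketch that pushes one random vector through a stack of linear + ReLU layers and compares the output norm with the input norm. It looks at the forward signal only, not at gradients. The scaling std = sqrt(2 / fan_in) is my choice of one example of a normalized initialization, and the width, depth, and the fixed std of 0.01 are arbitrary values picked for illustration.

```python
import math
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def random_layer(width, std, rng):
    """A width x width weight matrix with zero-mean Gaussian entries."""
    return [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))


def output_to_input_norm_ratio(width, depth, std, seed=0):
    """Push one random vector through `depth` linear + ReLU layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    h = x
    for _ in range(depth):
        w = random_layer(width, std, rng)
        h = relu([sum(wij * hj for wij, hj in zip(row, h)) for row in w])
    return norm(h) / norm(x)


width, depth = 64, 20
fixed_small = output_to_input_norm_ratio(width, depth, std=0.01)
scaled = output_to_input_norm_ratio(width, depth, std=math.sqrt(2.0 / width))

print(f"std = 0.01:            ratio = {fixed_small:.3e}")
print(f"std = sqrt(2/fan_in):  ratio = {scaled:.3e}")
```

With the fixed small std the signal shrinks by a roughly constant factor at every layer and is effectively gone after 20 layers, while the fan-in-scaled weights keep the norm at a comparable order of magnitude. That is the sense in which normalized initialization lets networks with tens of layers start converging.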

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.



The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
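The construction argument can be sketched in a few lines. The "learned" shallow model, its weights, and the toy training set below are all made up for illustration, and the added layers are plain pass-through functions rather than weight layers that have been trained to act as identity.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def dense_relu(w):
    """Build a layer v -> relu(W v) from a fixed weight matrix (list of rows)."""
    def layer(v):
        return relu([sum(wij * vj for wij, vj in zip(row, v)) for row in w])
    return layer


def identity(v):
    """An added layer that passes its input through unchanged."""
    return list(v)


def forward(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def training_error(layers, data):
    """Mean squared error over (input, target) pairs."""
    total = 0.0
    for x, target in data:
        pred = forward(layers, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(data)


# Hypothetical "learned" shallow model and toy training set.
shallow = [dense_relu([[0.5, -1.0], [1.5, 0.25]]),
           dense_relu([[1.0, 0.5], [-0.5, 2.0]])]
data = [([1.0, 2.0], [0.0, 3.0]), ([0.5, -1.0], [1.0, 1.0])]

# Deeper counterpart: copy the shallow layers, then stack identity layers.
deeper = shallow + [identity] * 3

shallow_error = training_error(shallow, data)
deeper_error = training_error(deeper, data)
print(shallow_error, deeper_error)
```

The two errors are identical, so a solution at least as good as the shallow one always exists for the deeper model. The degradation problem is that the solver does not find it when the identity has to be learned by ordinary stacked nonlinear layers.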
