
Theoretically, as we add layers to a neural network, the number of parameters grows, so training performance should keep improving. In practice, however, the training error decreases for a while and then starts to increase at some point. We would like to understand why.

Residual Block

The reason this problem occurs in deep neural networks is that, as the network gets deeper, the gradient either grows explosively or shrinks toward zero.

These phenomena are called gradient exploding and gradient vanishing, and ResNet is an architecture designed to solve them.

A block of layers in a plain neural network is composed as follows.

(figure: a plain block of layers)

Let \(a^{[L]}\) denote the output of the L-th layer after its activation is applied, and let \(z^{[L]}\) denote the affine (forward-propagation) output of the L-th layer before activation. Similarly, let \(W^{[L]}\) be the weight, \(b^{[L]}\) the bias, and \(g^{[L]}\) the activation function of the L-th layer.

(figure: a residual block with a skip connection)

A residual block adds \(a^{[L]}\) to \(z^{[L+2]}\) before the activation that produces \(a^{[L+2]}\) is applied.

\[\begin{align} z^{[L+1]} &= W^{[L+1]}a^{[L]} + b^{[L+1]} \\ a^{[L+1]} &= g^{[L+1]}(z^{[L+1]}) \\ z^{[L+2]} &= W^{[L+2]}a^{[L+1]} + b^{[L+2]} \\ a^{[L+2]} &= g^{[L+2]}(z^{[L+2]} +a^{[L]} ) \\ \end{align}\]

Expressed as formulas, the residual block computes the above.
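To make the four equations concrete, here is a minimal NumPy sketch of the residual-block forward pass. The layer sizes and helper names are my own illustrative assumptions, not from the original post.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block_forward(a_L, W1, b1, W2, b2):
    """Compute a[L+2] from a[L] following the equations above."""
    z1 = W1 @ a_L + b1      # z[L+1] = W[L+1] a[L] + b[L+1]
    a1 = relu(z1)           # a[L+1] = g[L+1](z[L+1])
    z2 = W2 @ a1 + b2       # z[L+2] = W[L+2] a[L+1] + b[L+2]
    a2 = relu(z2 + a_L)     # a[L+2] = g[L+2](z[L+2] + a[L])  <- skip connection
    return a2

# Example: n = 4 units, batch of m = 3 examples stored as columns,
# with the same width going in and coming out.
n, m = 4, 3
rng = np.random.default_rng(0)
a_L = rng.standard_normal((n, m))
W1, b1 = rng.standard_normal((n, n)), np.zeros((n, 1))
W2, b2 = rng.standard_normal((n, n)), np.zeros((n, 1))
print(residual_block_forward(a_L, W1, b1, W2, b2).shape)  # (4, 3)
```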

Why Does ResNet Work?

How does ResNet work? That is, how does it solve gradient vanishing and gradient exploding? If the activation of the (L+2)-th layer is expanded, we get \(a^{[L+2]} = g^{[L+2]}(W^{[L+2]}a^{[L+1]} + b^{[L+2]} + a^{[L]})\). If \(W^{[L+2]}\) and \(b^{[L+2]}\) are assumed to be 0, the expression simplifies to \(a^{[L+2]} = g^{[L+2]}(a^{[L]}) = a^{[L]}\), because the activation function is ReLU and \(a^{[L]}\), being an activation itself, is already non-negative. So during forward propagation the activation of the (L+2)-th layer reduces to the identity function, which is much easier to learn than a more complex function.

Then, if the affine transformation of the (L+2)-th layer can be ignored like this, you may wonder whether the hidden layers are really necessary at all.

If there were no hidden layers and the activation were simply passed along, the network could only learn the identity function. With hidden layers, it can also learn more complex functions. A residual block therefore gives the best of both worlds: the skip connection carries the activation through as an identity mapping, guaranteeing a minimum level of performance, while the hidden layers remain free to learn a more complex function on top of it and improve performance. When the hidden layers do learn something useful, their contribution is reflected in the (L+2)-th activation and trained through back-propagation. And even when the residual branch is active, vanishing and exploding gradients are avoided, because the gradient is transmitted directly back to the L-th layer through the skip connection.

Therefore, adding hidden layers in the form of residual blocks is a good choice.
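As a quick sanity check of that claim, the following PyTorch sketch (my own illustration, with arbitrary sizes) sets the residual branch's weights and biases to zero and confirms that the gradient still reaches \(a^{[L]}\) unchanged through the skip connection.

```python
import torch

n = 4
# a[L] is an activation coming out of ReLU, so it is non-negative.
a_L = torch.rand(n, requires_grad=True)

W1 = torch.zeros(n, n)   # W[L+1] = 0
b1 = torch.zeros(n)      # b[L+1] = 0
W2 = torch.zeros(n, n)   # W[L+2] = 0
b2 = torch.zeros(n)      # b[L+2] = 0

a1 = torch.relu(W1 @ a_L + b1)
a2 = torch.relu(W2 @ a1 + b2 + a_L)   # skip connection adds a[L] back in

a2.sum().backward()
print(a_L.grad)  # all ones: the gradient reaches a[L] directly, undiminished
```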

Implementation

Since VGG, architectures that double the number of channels every couple of Linear/ReLU (or Conv/ReLU) blocks have become common, and ResNet follows the same pattern. In that case the number of channels in the (L+2)-th layer differs from the number of channels in the L-th layer; for example, the (L+2)-th layer may have 256 channels while the L-th layer has 128. The skip connection then has to change the dimension with a matrix \(W_s\): \(a^{[L+2]} = g^{[L+2]}(z^{[L+2]} + W_{s}a^{[L]})\), where \(a^{[L+2]} \in R^{n^{[L+2]} \times m}\), \(a^{[L]} \in R^{n^{[L]} \times m}\), and \(W_s \in R^{n^{[L+2]} \times n^{[L]}}\). Simply put, if you think of \(W_s\) as a Linear layer, these are the dimensions to use.
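Here is a minimal PyTorch sketch of such a block. The module and argument names are my own illustrative choices, not the original post's code: `fc1` and `fc2` play the roles of the two affine layers, and the shortcut is an `nn.Linear` projection standing in for \(W_s\) whenever the input and output widths differ. Note that PyTorch's `Linear` uses a (batch, features) layout rather than the column-vector convention above.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(in_features, out_features)   # W[L+1], b[L+1]
        self.fc2 = nn.Linear(out_features, out_features)  # W[L+2], b[L+2]
        # W_s: identity if the widths already match, otherwise a learned projection.
        if in_features == out_features:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Linear(in_features, out_features, bias=False)

    def forward(self, a_L: torch.Tensor) -> torch.Tensor:
        z = self.fc2(torch.relu(self.fc1(a_L)))       # z[L+2]
        return torch.relu(z + self.shortcut(a_L))     # a[L+2] = g(z[L+2] + W_s a[L])

# Example: 128 -> 256 features, batch of 32.
block = ResidualBlock(128, 256)
x = torch.randn(32, 128)
print(block(x).shape)  # torch.Size([32, 256])
```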
