Stochastic gradient descent

A deep neural network can be trained discriminatively with the standard backpropagation algorithm. The weights can then be updated by stochastic gradient descent using the following equation.

$$ w_{ij}(t + 1) = w_{ij}(t) - \eta\frac{\partial C}{\partial w_{ij}} $$

Here, $\eta$ is the learning rate and $C$ is the cost function. The choice of cost function depends on factors such as the type of learning (supervised learning, unsupervised learning, reinforcement learning, and so on) and the activation function.
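As a minimal sketch of the update rule, the code below trains a two-weight linear unit with per-example steps against the gradient. The squared-error cost, the toy data, and the names `sgd_step`, `squared_error_grad`, and `train` are assumptions chosen for illustration; they are not part of the original text.

```python
import random

def sgd_step(weights, grads, eta):
    """Apply one update: w <- w - eta * dC/dw (step against the gradient)."""
    return [w - eta * g for w, g in zip(weights, grads)]

def squared_error_grad(weights, x, d):
    """Gradient of C = 0.5 * (w . x - d)^2 for a single example (x, d)."""
    y = sum(w * xi for w, xi in zip(weights, x))
    return [(y - d) * xi for xi in x]

def train(samples, eta=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        # "Stochastic": visit the examples one at a time, in random order.
        order = samples[:]
        rng.shuffle(order)
        for x, d in order:
            grads = squared_error_grad(weights, x, d)
            weights = sgd_step(weights, grads, eta)
    return weights

# Hypothetical toy data generated from d = 2*x1 - 1*x2.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
weights = train(samples)
print(weights)
```

With this noise-free data the weights should approach the generating values 2 and -1; on real data the result depends on $\eta$ and on the cost function chosen.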

For example, when performing supervised learning on a multiclass classification problem, the activation function and the cost function are commonly chosen to be the softmax function and the cross entropy function, respectively.

The softmax function is defined as $p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}$, where $p_j$ is the class probability, and $x_j$ and $x_k$ are the total inputs to units $j$ and $k$, respectively.
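The softmax definition can be sketched in a few lines. Subtracting the largest input before exponentiating is a common numerical-stability trick that is assumed here; it does not change the result, because the common factor cancels in the ratio.

```python
import math

def softmax(x):
    """Return p_j = exp(x_j) / sum_k exp(x_k) for a list of total inputs x."""
    m = max(x)  # shifting by the max avoids overflow and leaves p unchanged
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
print(p)  # three probabilities that sum to 1; the largest input gets the largest share
```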

The cross entropy is defined as $C = -\sum_j d_j \log(p_j)$, where $d_j$ is the target probability for output unit $j$ and $p_j$ is the probability output for $j$ after applying the activation function.
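The sketch below evaluates the cross entropy for a softmax output. It also uses the standard result that, when the targets $d_j$ sum to 1, the derivative of $C$ with respect to the total input $x_j$ simplifies to $p_j - d_j$. The function names and the one-hot target are illustrative assumptions.

```python
import math

def softmax(x):
    """p_j = exp(x_j) / sum_k exp(x_k), shifted by max(x) for stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(d, p):
    """C = -sum_j d_j * log(p_j); terms with d_j == 0 contribute nothing."""
    return -sum(dj * math.log(pj) for dj, pj in zip(d, p) if dj > 0)

def grad_wrt_inputs(d, p):
    """dC/dx_j = p_j - d_j, valid when p = softmax(x) and sum(d) == 1."""
    return [pj - dj for dj, pj in zip(d, p)]

x = [0.5, -1.0, 2.0]   # total inputs to the output units (made-up values)
d = [0.0, 0.0, 1.0]    # one-hot target: the third class is correct
p = softmax(x)
C = cross_entropy(d, p)
grad = grad_wrt_inputs(d, p)
print(C, grad)
```

The gradient is negative for the correct class and positive for the others, so a descent step raises the input of the correct unit and lowers the rest.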

Comparison of a few optimization methods

Comparison of a few optimization methods (animation by Alec Radford). The star denotes the global minimum on the error surface. Notice that stochastic gradient descent (SGD) without momentum is the slowest method to converge in this example. We're using Nesterov's Accelerated Gradient Descent (NAG) throughout this tutorial.

Machine_learning_-_Comparison_of_a_few_optimization_methods.gif

Momentum

Optimization: Stochastic Gradient Descent

If the objective has the form of a long shallow ravine leading to the optimum and steep walls on the sides, standard SGD will tend to oscillate across the narrow ravine since the negative gradient will point down one of the steep sides rather than along the ravine towards the optimum.

The objectives of deep architectures have this form near local optima and thus standard SGD can lead to very slow convergence particularly after the initial steep gains.

Momentum is one method for pushing the objective more quickly along the shallow ravine.

$$ v = \gamma v + \alpha\nabla_\theta J(\theta; x^{(i)}, y^{(i)}) $$

$$ \theta = \theta - v $$

In the above equation, $v$ is the current velocity vector, which has the same dimension as the parameter vector $\theta$. The learning rate $\alpha$ plays the same role as $\eta$ in the update rule at the top of this page, although with momentum $\alpha$ may need to be smaller, since the magnitude of the accumulated gradient will be larger. Finally, $\gamma \in (0, 1]$ determines for how many iterations the previous gradients are incorporated into the current update. Generally $\gamma$ is set to 0.5 until the initial learning stabilizes and is then increased to 0.9 or higher.
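To make the ravine picture concrete, the sketch below applies the two momentum equations to a hypothetical quadratic objective $J(\theta) = \frac{1}{2}\sum_i c_i \theta_i^2$ with one shallow direction ($c = 1$) and one steep direction ($c = 50$). For simplicity it uses the exact gradient instead of a per-example estimate and keeps $\alpha$ the same in both runs; the objective, the constants, and the function names are all assumptions made for illustration.

```python
def grad(theta, curvatures):
    """Gradient of J(theta) = 0.5 * sum_i c_i * theta_i^2."""
    return [c * t for c, t in zip(curvatures, theta)]

def loss(theta, curvatures):
    return 0.5 * sum(c * t * t for c, t in zip(curvatures, theta))

def run(theta, curvatures, alpha, gamma, steps):
    """Iterate v = gamma*v + alpha*grad, theta = theta - v; gamma=0 is plain descent."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta, curvatures)
        v = [gamma * vi + alpha * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

curvatures = [1.0, 50.0]  # shallow along theta[0], steep along theta[1]
start = [10.0, 1.0]

plain = run(start, curvatures, alpha=0.03, gamma=0.0, steps=100)
with_momentum = run(start, curvatures, alpha=0.03, gamma=0.9, steps=100)

loss_plain = loss(plain, curvatures)
loss_momentum = loss(with_momentum, curvatures)
print(loss_plain, loss_momentum)
```

In this toy setting the run with $\gamma = 0.9$ ends at a lower objective value after the same number of steps, because the velocity keeps accumulating along the shallow direction. Behavior on a real network objective will differ.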

Favorite site

References

Daniel_Nouris_Blog_-_Using_convolutional_neural_nets_to_detect_facial_keypoints_tutorial.pdf
