Activation function
[Figure: Artificial neuron model (ArtificialNeuronModel_english.png)]
In an artificial neural network model, the main job of a neuron is to compute NET, the weighted sum of its inputs and connection strengths, and then to produce an output through an activation function. The choice of activation function can therefore change the output the neuron produces.
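As a minimal sketch of this computation (the names `neuron_output`, `inputs`, and `weights` are illustrative, not from the source), NET is formed as a dot product and then passed through whichever activation function is chosen:

```python
import numpy as np

def neuron_output(inputs, weights, activation):
    """Compute NET as the weighted sum of inputs, then apply the activation function."""
    net = np.dot(weights, inputs)  # NET = sum_i w_i * x_i
    return activation(net)

# The same NET value produces different outputs under different activations.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, 0.3])
print(neuron_output(x, w, np.tanh))                      # bipolar, continuous
print(neuron_output(x, w, lambda net: float(net >= 0)))  # binary step
```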
An activation function should be monotonically increasing, and activation functions are generally classified as follows. Because these categories overlap, the description here focuses on commonly used functions rather than drawing strict boundaries between them.
Type
- Unipolar / bipolar functions
- Linear / nonlinear functions
- Continuous / binary functions
Sigmoid function
The sigmoid function is a unipolar or bipolar nonlinear continuous function, as shown in the figure below, and is the most widely used activation function in neural network models. Because of its S-shaped graph it is also called an S-curve. See the Sigmoid function article for details.
[Figure: Logistic curve (Logistic-curve.svg.png)]
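A small sketch of the two common sigmoid variants (the function names here are illustrative, not from the source): the unipolar logistic function maps into (0, 1), while the bipolar tanh maps into (-1, 1):

```python
import numpy as np

def logistic(x):
    """Unipolar sigmoid: S-shaped, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    """Bipolar sigmoid (tanh): S-shaped, output in (-1, 1)."""
    return np.tanh(x)

xs = np.linspace(-6.0, 6.0, 7)
print(logistic(xs))         # rises smoothly from ~0 to ~1
print(bipolar_sigmoid(xs))  # rises smoothly from ~-1 to ~1
```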
Effective activation functions
- ReLU / Rectified-Linear and Leaky-ReLU
- Sigmoid function
- TanH / Hyperbolic Tangent
- Absolute Value
- Power
- BNLL
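A rough sketch of a few entries from the list above (the helper names are illustrative, and the BNLL definition as \(\log(1+e^x)\) is an assumption, not stated in the source):

```python
import numpy as np

def relu(x):
    """Rectified linear: max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: small nonzero slope for x < 0 instead of a hard zero."""
    return np.where(x >= 0, x, negative_slope * x)

def bnll(x):
    """BNLL, assumed here to be log(1 + e^x), a softplus-style transform."""
    return np.log1p(np.exp(x))

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(leaky_relu(x))  # [-0.03 -0.005 0. 2.]
print(bnll(x))
```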
Comparison of activation functions
Some desirable properties in an activation function include:
- Nonlinear: When the activation function is non-linear, then a two-layer neural network can be proven to be a universal function approximator. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
- Continuously differentiable: This property is necessary for enabling gradient-based optimization methods. The binary step activation function is not differentiable at 0, and its derivative is 0 everywhere else, so gradient-based methods can make no progress with it.
- Monotonic: When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex.
- \(f(x)\approx x\) when \(x \approx 0\): This property enables the neural network to train efficiently when its weights are initialized with small random values. When the activation function does not satisfy this property, special care must be used when initializing the weights.
- Range: When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.
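For example, the \(f(x)\approx x\) property can be checked numerically: tanh stays close to the identity near 0, while the logistic function sits near 0.5 (a small illustrative check, not from the source):

```python
import numpy as np

# tanh approximates the identity near 0 (tanh(0) = 0, tanh'(0) = 1), so small
# random initial weights keep the network in a nearly linear regime; the
# logistic function does not (logistic(0) = 0.5), so its weights need more
# careful initialization.
x = np.array([-0.1, -0.01, 0.01, 0.1])
print(np.tanh(x))                # close to x itself
print(1.0 / (1.0 + np.exp(-x)))  # close to 0.5, not to x
```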
The following table compares the properties of several activation functions:
| Name | Equation | Derivative | Monotonic | \(f(x)\approx x\) when \(x \approx 0\) | Range |
| --- | --- | --- | --- | --- | --- |
| Identity | \(f(x)=x\) | \(f'(x)=1\) | Yes | Yes | \((-\infty,\infty)\) |
| Binary step | \(f(x) = \left \{ \begin{array}{rcl} 0 & \mbox{for} & x < 0\\ 1 & \mbox{for} & x \ge 0\end{array} \right.\) | \(f'(x) = \left \{ \begin{array}{rcl} 0 & \mbox{for} & x \ne 0\\ ? & \mbox{for} & x = 0\end{array} \right.\) | Yes | No | \(\{0,1\}\) |
| Logistic (a.k.a. Soft step) | \(f(x)=\frac{1}{1+e^{-x}}\) | \(f'(x)=f(x)(1-f(x))\) | Yes | No | \((0,1)\) |
| TanH | \(f(x)=\tanh(x)=\frac{2}{1+e^{-2x}}-1\) | \(f'(x)=1-f(x)^2\) | Yes | Yes | \((-1,1)\) |
| ArcTan | \(f(x)=\tan^{-1}(x)\) | \(f'(x)=\frac{1}{x^2+1}\) | Yes | Yes | \((-\frac{\pi}{2},\frac{\pi}{2})\) |
| ReLU (Rectified Linear Unit) | \(f(x) = \left \{ \begin{array}{rcl} 0 & \mbox{for} & x < 0\\ x & \mbox{for} & x \ge 0\end{array} \right.\) | \(f'(x) = \left \{ \begin{array}{rcl} 0 & \mbox{for} & x < 0\\ 1 & \mbox{for} & x \ge 0\end{array} \right.\) | Yes | No | \([0,\infty)\) |
| SoftPlus | \(f(x)=\log_e(1+e^x)\) | \(f'(x)=\frac{1}{1+e^{-x}}\) | Yes | No | \((0,\infty)\) |
| Bent identity | \(f(x)=\frac{\sqrt{x^2 + 1} - 1}{2} + x\) | \(f'(x)=\frac{x}{2\sqrt{x^2 + 1}} + 1\) | Yes | Yes | \((-\infty,\infty)\) |
| SoftExponential | \(f(\alpha,x) = \left \{ \begin{array}{rcl} -\frac{\log_e(1-\alpha (x + \alpha))}{\alpha} & \mbox{for} & \alpha < 0\\ x & \mbox{for} & \alpha = 0\\ \frac{e^{\alpha x} - 1}{\alpha} + \alpha & \mbox{for} & \alpha > 0\end{array}\right.\) | \(f'(\alpha,x) = \left \{ \begin{array}{rcl} \frac{1}{1-\alpha (\alpha + x)} & \mbox{for} & \alpha < 0\\ e^{\alpha x} & \mbox{for} & \alpha \ge 0\end{array} \right.\) | Yes | Yes iff \(\alpha\approx0\) | \((-\infty,\infty)\) |
| Sinusoid | \(f(x)=\sin(x)\) | \(f'(x)=\cos(x)\) | No | Yes | \([-1,1]\) |
| Sinc | \(f(x)=\left \{ \begin{array}{rcl} 1 & \mbox{for} & x = 0\\ \frac{\sin(x)}{x} & \mbox{for} & x \ne 0\end{array} \right.\) | \(f'(x)=\left \{ \begin{array}{rcl} 0 & \mbox{for} & x = 0\\ \frac{\cos(x)}{x} - \frac{\sin(x)}{x^2} & \mbox{for} & x \ne 0\end{array} \right.\) | No | No | \([\approx-0.217234,1]\) |
| Gaussian | \(f(x)=e^{-x^2}\) | \(f'(x)=-2xe^{-x^2}\) | No | No | \((0,1]\) |
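A quick way to sanity-check the Equation and Derivative columns is a finite-difference comparison; the sketch below (illustrative helper names, not from the source) verifies a few rows of the table:

```python
import numpy as np

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
rows = {
    "Logistic": (logistic,                      lambda x: logistic(x) * (1.0 - logistic(x))),
    "TanH":     (np.tanh,                       lambda x: 1.0 - np.tanh(x) ** 2),
    "ArcTan":   (np.arctan,                     lambda x: 1.0 / (x ** 2 + 1.0)),
    "SoftPlus": (lambda x: np.log1p(np.exp(x)), logistic),
}
x = np.linspace(-2.0, 2.0, 9)
for name, (f, f_prime) in rows.items():
    assert np.allclose(numeric_derivative(f, x), f_prime(x), atol=1e-5), name
print("derivative formulas agree with finite-difference estimates")
```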
See also
- Perceptron
- Sigmoid function
- Neural network
- Combination function
- Loss function
- Error function