Activation functions are mathematical functions attached to each neuron in the network; they determine whether that neuron should be activated or not. Typically, in each layer, the neurons first perform a linear transformation on the input using the weights and bias:

z = w₁x₁ + w₂x₂ + … + wₙxₙ + b

Then, an activation function f is applied to this result:

a = f(z)

The output of the current layer becomes the input of the next layer. This process is repeated through all hidden layers of the network. This forward movement of information is called forward propagation.
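As a minimal sketch of one forward-propagation step (the layer sizes, weights, and the forward_layer helper below are made up for illustration and are not part of the original post):

import numpy as np

def forward_layer(x, W, b, activation):
    # Linear transformation using the weights and bias, followed by the activation
    z = W @ x + b
    return activation(z)

# Toy example: a layer of 4 neurons fed by 3 inputs, with a sigmoid activation
x = np.array([0.5, -1.2, 3.0])
W = np.random.randn(4, 3) * 0.1
b = np.zeros(4)
a = forward_layer(x, W, b, lambda z: 1 / (1 + np.exp(-z)))
# 'a' becomes the input of the next layer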
Activation functions are important for two main reasons:
- They introduce a non-linear transformation into the neural network so that it has the capacity to learn and perform more complex tasks. Imagine a neural network without activation functions: there would only be linear transformations of the inputs using the weights and biases. In that case the network behaves like a linear regression model, which is far less powerful and unable to perform complex tasks (see the sketch after this list).
- Besides, some activation functions also help normalize the output of each neuron into a fixed range, such as between 0 and 1 or between -1 and 1.
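To see why non-linearity matters, here is a small sketch (with made-up weights) showing that two stacked layers without an activation function collapse into a single linear transformation:

import numpy as np

W1, b1 = np.random.randn(4, 3), np.random.randn(4)   # first "layer"
W2, b2 = np.random.randn(2, 4), np.random.randn(2)   # second "layer"
x = np.random.randn(3)

stacked = W2 @ (W1 @ x + b1) + b2          # two linear layers, no activation
W, b = W2 @ W1, W2 @ b1 + b2               # one equivalent linear layer
single = W @ x + b

print(np.allclose(stacked, single))        # True: the network is still linear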
Now, let us look at some popular activation functions:
1. Sigmoid function (Logistic activation function)
This function is defined by:

σ(x) = 1 / (1 + e⁻ˣ)
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-6, 6, 0.01)

def plot(func, ylim):
    # Helper used throughout the post to plot an activation function over x
    plt.plot(x, func(x), c='r', lw=3)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.axhline(c='k', lw=1)
    plt.axvline(c='k', lw=1)
    plt.ylim(ylim)
    plt.box(on=None)
    plt.grid(alpha=0.4, ls='-.')
Visualization of the sigmoid function:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

plot(sigmoid, ylim=(-0.25, 1.25))
- Since the output values are bounded between 0 and 1, this function is especially used in models where we have to predict a probability as the output.
- This function is differentiable, and its derivative is given by:

σ′(x) = σ(x)(1 − σ(x))
def derivative_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

plot(derivative_sigmoid, ylim=(-0.02, 0.35))
Advantages:
- Outputs are normalized to the range (0, 1)
- Smooth gradient
Disadvantages:
- Vanishing gradient: based on the figure above, the gradient values are close to 0 when x > 6 or x < -6 (a quick numerical check is shown after this list). The vanishing gradient may stop the neural network from training any further.
- The outputs are not zero-centered.
- It is computationally expensive, since it involves an exponential.
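As a quick numerical check of the vanishing-gradient point above (reusing the derivative_sigmoid function defined earlier; the sample points are chosen only for illustration):

print(derivative_sigmoid(np.array([-8., -6., 0., 6., 8.])))
# ≈ [0.00033 0.00247 0.25 0.00247 0.00033]
# The gradient peaks at 0.25 when x = 0 and practically vanishes for |x| > 6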
2. Tanh function
The tanh function is defined as:

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) = 2σ(2x) − 1
This function is very similar to the sigmoid function, but it is symmetric around the origin: its outputs are zero-centered and range between -1 and 1.
def tanh(x):
    return 2 * sigmoid(2 * x) - 1

plot(tanh, ylim=(-1.4, 1.4))
tanh(x) is a monotone, continuous, and differentiable function. Its derivative is given by:

tanh′(x) = 1 − tanh²(x)
def derivative_tanh(x):
    return 1 - tanh(x)**2

plot(derivative_tanh, ylim=(-0.2, 1.2))
Similar to the sigmoid function, the derivative of this function is close to zero when x > 3 or x < -3. Hence, tanh also leads to the vanishing-gradient phenomenon when training neural networks.
3. Rectified Linear Unit (ReLU) function
This function is widely used in the deep learning domain. It is defined by:

ReLU(x) = max(0, x)
relu = np.vectorize(lambda x: x if x > 0 else 0, otypes=[float])

plot(relu, ylim=(-0.3, 1.5))
This function is continuous and monotone on ℝ, and differentiable for all x ≠ 0.
The derivative of the ReLU function is given by:

ReLU′(x) = 1 if x > 0, and 0 if x < 0
derivative_relu = np.vectorize(lambda x: 1 if x > 0 else 0, otypes=[float])

plot(derivative_relu, ylim=(-0.5, 1.5))
Advantages:
- This function is more computationally efficient than the sigmoid and tanh functions, so using it speeds up training.
Disadvantages:
- The outputs are neither normalized nor centered.
- This function is not differentiable at the origin.
- Since ReLU passes only positive signals and suppresses all negative ones, it does not provide consistent predictions for negative input values.
- When the input is negative, the derivative is zero. Hence, during back-propagation, the weights and biases of the corresponding neurons are not updated (a small demonstration follows this list).
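A small demonstration of that last point, reusing the derivative_relu defined above (the pre-activation values are made up for the example):

z = np.array([-3.0, -0.5, 0.2, 4.0])   # hypothetical pre-activations of four neurons
print(derivative_relu(z))              # [0. 0. 1. 1.]
# The two neurons with negative pre-activations receive a zero gradient,
# so their weights and biases get no update during back-propagation.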
4. Leaky ReLU function
This function was introduced as an improved version of the ReLU function, in which some of ReLU's drawbacks are overcome.
a = 0.01
leaky_relu = np.vectorize(lambda x: x if x > 0 else a*x, otypes=[float])

plot(leaky_relu, ylim=(-0.5, 1.5))
In this case, all negative values x are replaced by ax, where a ∈ (0, 1). That allows the weights and biases of every neuron to keep being updated.
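The original code does not plot the derivative of leaky ReLU, but following the same pattern as above, a sketch could be:

derivative_leaky_relu = np.vectorize(lambda x: 1 if x > 0 else a, otypes=[float])

plot(derivative_leaky_relu, ylim=(-0.2, 1.2))
# The derivative is 1 for positive inputs and a = 0.01 for negative ones,
# so no neuron ever receives an exactly zero gradient.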
Advantages:
- This function is simple, and its computational cost is still smaller than that of the sigmoid and tanh functions.
- The weights and biases of all neurons keep being updated, so gradients keep flowing during back-propagation.
Disadvantages: Although it keeps negative neurons active, their outputs are rescaled to very small values. In other words, the signals corresponding to these negative neurons are significantly reduced. Hence, leaky ReLU still cannot provide consistent predictions for negative input values.
5. Softmax function
The softmax function is defined by:

sᵢ = softmax(x)ᵢ = exp(xᵢ) / Σₖ exp(xₖ), for i = 1, …, n
def softmax(x):
    z = np.exp(x)
    return z / z.sum()

softmax([0.4, 2, 5])
>>> array([0.00948431, 0.04697607, 0.94353962])
The derivative of the softmax function, when j = i:

∂sᵢ/∂xⱼ = sᵢ(1 − sᵢ)

When j ≠ i:

∂sᵢ/∂xⱼ = −sᵢ sⱼ

Let's denote the Kronecker delta δᵢⱼ, which equals 1 when i = j and 0 otherwise. Then, the derivative of the softmax function can be rewritten as:

∂sᵢ/∂xⱼ = sᵢ(δᵢⱼ − sⱼ)
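To mirror the derivative code shown for the other activations, here is a sketch (not from the original post) of the full softmax Jacobian using the formula above:

def jacobian_softmax(x):
    # J[i, j] = s_i * (delta_ij - s_j)
    s = softmax(np.array(x))
    return np.diag(s) - np.outer(s, s)

jacobian_softmax([0.4, 2, 5])
# Diagonal entries equal s_i*(1 - s_i); off-diagonal entries equal -s_i*s_j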
The softmax function returns output values between 0 and 1 whose sum equals 1. Hence, this function is useful for multiclass classification, where we aim to predict the probability that a data point belongs to each particular class.
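Continuing the example above, the outputs indeed form a probability distribution, and the predicted class is simply the index with the highest probability (this snippet is illustrative, not from the original post):

probs = softmax([0.4, 2, 5])
print(probs.sum())       # ≈ 1.0 -> the outputs sum to one
print(np.argmax(probs))  # 2    -> the data point is assigned to the third class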
Conclusion: We have covered some activation functions that appear frequently when building convolutional neural networks. I hope this post is helpful for you.
Feel free to write in the comment section if you have any questions.
Thank you!
My blog page: https://lekhuyen.medium.com/