Activation Functions#

Note

We will refer to activation functions as activations.

What are activation functions?#

Activations transform an input vector into another vector. They are usually applied after a linear layer, so that the output of the linear layer obeys some rule. The \( \mathrm{ReLU}(x) \) function enforces the rule that only non-negative values pass through: it is defined as \( \max\{0, x\} \), so every negative input becomes zero. The \( \mathrm{Sigmoid}(x) \) function limits the output values to the interval \( (0, 1) \) because it is defined as \( \frac{1}{1 + e^{-x}} \).
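
To make this concrete, here is a minimal NumPy sketch of applying ReLU and Sigmoid to the output of a linear layer. The weights and input below are made up purely for illustration.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max{0, x}, applied element-wise: negatives become 0
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid(x) = 1 / (1 + e^{-x}), squashes every value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# A toy "linear layer": y = W x  (weights chosen arbitrarily for illustration)
W = np.array([[1.0, -2.0],
              [3.0,  0.5]])
x = np.array([-1.0, 2.0])

y = W @ x              # raw linear output, can be any real number
print(relu(y))         # only non-negative values survive
print(sigmoid(y))      # every value lies strictly between 0 and 1
```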

In deep learning, though, where many linear layers are stacked on top of each other, we usually don't pay much attention to the outputs of the hidden activations, that is, the ones that are not at the last layer.

Why activations?#

Without activations, deep learning is meaningless. To understand why that is the case, let's work through a very simple example.

Suppose that there is a small neural network that has only two layers. To simplify the problem further, let's assume that these two layers have weight matrices but no bias terms. That is, the network can be represented by the function \( F(x) = B(Ax) \), where \( A \) is the weight matrix of the first layer and \( B \) is the weight matrix of the second layer.

However, we can look at the same function in a different way: \( F(x) = (BA)x \). This means we can construct a simpler network, with only one layer whose weight matrix is \( BA \), that does exactly the same thing! In other words, adding a layer literally doesn't help us in any way.
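
A quick numerical check of this collapse, using arbitrary random weights (the shapes here are just an example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # weight of the first layer
B = rng.standard_normal((2, 4))   # weight of the second layer
x = rng.standard_normal(3)

two_layer = B @ (A @ x)    # F(x) = B(Ax): pass x through both layers
one_layer = (B @ A) @ x    # G(x) = (BA)x: a single layer with weight BA

# The two networks produce the same output (up to floating-point error)
print(np.allclose(two_layer, one_layer))   # True
```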

Indeed, because a network without activations is nothing but a chain of matrix multiplications, simply adding more linear layers is never going to expand the class of functions we can approximate: with only linear functions, we can only represent linear functions. That is what happens without activation functions.

If we apply an activation \( \sigma \) to the network, \( F(x) = \sigma(B\,\sigma(Ax)) \), then we can no longer collapse the function into a single matrix product. In fact, with enough hidden units and layers, such a network can approximate essentially any continuous function arbitrarily well (this is the universal approximation theorem). And that is all because of the power of activation functions.
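
One way to see that the activation breaks the collapse, again with made-up random weights and ReLU standing in for \( \sigma \): a linear map must satisfy \( F(x_1 + x_2) = F(x_1) + F(x_2) \), and the network with activations generally does not.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((2, 4))

def linear_net(x):
    return B @ (A @ x)               # F(x) = B(Ax), no activation

def nonlinear_net(x):
    return relu(B @ relu(A @ x))     # F(x) = sigma(B sigma(Ax)), sigma = ReLU

x1, x2 = rng.standard_normal(3), rng.standard_normal(3)

# The purely linear network is additive; the one with activations is not.
print(np.allclose(linear_net(x1 + x2), linear_net(x1) + linear_net(x2)))           # True
print(np.allclose(nonlinear_net(x1 + x2), nonlinear_net(x1) + nonlinear_net(x2)))  # almost surely False
```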