## What are activation functions? Why are they used?

An activation function determines how strong a signal an individual neuron passes on. In some cases, the signal is removed entirely (the function returns 0).

When you train a Neural Network, each layer computes dot products between a data vector and a weight matrix. Even in a multi-layer model, stacking these dot products still results in a single linear operation. Activation functions break that linearity. To do that, you need to pick a function that does not satisfy the following expression for all inputs $x_1, x_2$ and scalars $\alpha, \beta$:

$f(\alpha x_1 + \beta x_2) = \alpha f(x_1) + \beta f(x_2)$

A Neural Network without an activation function would not be able to learn non-linear relationships.
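As a minimal sketch of the linearity condition above, the snippet below tests it at a single sample point for two functions. The helper `is_linear_at`, the weight `3.0`, and the sample values are illustrative assumptions. One failing point is enough to show a function is non-linear, while one passing point does not prove linearity.

```python
def linear(x):
    # A purely linear map with a hypothetical weight of 3.0 and no bias
    return 3.0 * x


def relu(x):
    # Non-linear: negative inputs are clamped to zero
    return max(0.0, x)


def is_linear_at(f, x1, x2, alpha, beta, tol=1e-9):
    """Check f(alpha*x1 + beta*x2) == alpha*f(x1) + beta*f(x2) at one point."""
    lhs = f(alpha * x1 + beta * x2)
    rhs = alpha * f(x1) + beta * f(x2)
    return abs(lhs - rhs) <= tol


# Sample point chosen so that x1 + x2 is negative while x1 is positive
linear_ok = is_linear_at(linear, 2.0, -3.0, 1.0, 1.0)
relu_ok = is_linear_at(relu, 2.0, -3.0, 1.0, 1.0)
print(linear_ok, relu_ok)  # True False
```

For ReLU the left side is $f(-1) = 0$ while the right side is $f(2) + f(-3) = 2$, so the condition fails.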

A wide variety of activation functions exist, some of which are 30+ years old. Here, we'll have a look at the most common ones. Most of them also have slight variations that try to improve on some aspect of the original.

## Sigmoid

One of the first well-known activation functions is the Sigmoid. Its definition is:

$\text{Sigmoid}(x) = \frac{e^x}{1 + e^x}$

Unfortunately, the gradient values become very small at the tails of the function (the "vanishing gradient" problem). The result is weight updates that are too small to train the model effectively.
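To make the shrinking gradient concrete, here is a small sketch in plain Python. It relies on the standard identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The function names and the sample inputs are illustrative choices, not part of any particular library.

```python
import math


def sigmoid(x):
    # Branch on the sign so math.exp never receives a large positive argument
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and decays quickly toward the tails
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.6f}  grad={sigmoid_grad(x):.6f}")
```

The gradient is at most 0.25 (at $x = 0$) and falls below $10^{-4}$ by $x = 10$, so a neuron operating far out in either tail passes back almost no gradient.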

## ReLU

ReLU is currently the most popular activation function. It zeroes out negative numbers and leaves positive ones untouched:

$\text{ReLU}(x) = \max(0, x)$

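Computing the ReLU is very efficient (compared to other activation functions) - it is simple thresholding. But zeroing out negative values has a cost: a neuron whose input is negative outputs 0 and receives a gradient of 0. If this happens for every training example, the neuron "dies" - its weights are no longer updated during training.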
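The thresholding and its effect on the gradient can be sketched in a few lines of plain Python. The value used for the gradient at exactly $x = 0$ is a convention assumed here (0), and the sample inputs are made up for illustration.

```python
def relu(x):
    # Simple thresholding: negatives become 0, positives pass through
    return max(0.0, x)


def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise.
    # Using 0 at x == 0 is an assumed convention; ReLU is not differentiable there.
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 1.5]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]

print(outputs)  # [0.0, 0.0, 0.0, 1.5]
print(grads)    # [0.0, 0.0, 0.0, 1.0]
```

Only the last input produces a non-zero gradient, so only that case would contribute to a weight update.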

## Softmax

Softmax is used as an "output normalizer" for classification, when you want a probability for each class.

The softmax forces the values of the output neurons to sum to 1, with each value between 0 and 1. The definition is:

$\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

It applies the exponential function to each element $x_i$ of the input vector $x$. Normalization is ensured by dividing by the sum of all exponentials.
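A minimal sketch of the definition in plain Python follows. Subtracting the maximum input before exponentiating is a common numerical-stability trick added here as an assumption. It does not change the result, because the common factor cancels in the ratio. The input scores are made up for illustration.

```python
import math


def softmax(logits):
    # Shift by the max so the largest exponent is exp(0) = 1 (avoids overflow)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    # Divide each exponential by the sum so the outputs add up to 1
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest input gets the largest probability, and the ordering of the inputs is preserved in the outputs.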