Entropy can be defined as the uncertainty, unpredictability, randomness, or disorder in a set of possible outcomes.

When we build a machine learning model, the model assigns probabilities to possible outcomes. If the model is not very sure about its predictions, the unpredictability and randomness in those predictions are high. That is not what we want: we want our model to be certain about its predictions. Because of that, when we train a model, we should update its parameters so that we reduce the overall uncertainty and randomness of these predictions, in other words, their entropy. But how do we represent this unpredictability and randomness mathematically?

One way to measure the unpredictability of an outcome is to measure how surprised we are after observing that outcome. If the outcome is highly predictable, unpredictability is low, and we don’t tend to be very surprised when we observe it. If the outcome is not predictable and we observe it anyway, we tend to be surprised. Therefore, if we can calculate how surprised we are after observing an outcome, we can treat this surprise level as a measure of unpredictability. Now the question is: how do we express the level of surprise?

The simplest way to express the level of surprise mathematically is to use the negative probability of the outcome. If we denote the probability of observing a data point \(x\) with \(p(x)\), we tend to be more surprised after observing \(x\) when \(p(x)\) is low, and less surprised when \(p(x)\) is high.

If we press a light switch, for example, the probability of the light turning on is very high. Therefore, we are not surprised at all when we observe that the light is on after pressing the switch. However, if the light does not turn on, we are quite surprised since this is not an expected situation.

We can express this dynamic mathematically by using the formula below.

\[ \begin {aligned} - p(x) \end {aligned} \]

Because as \(p(x)\) increases, the value of \(-p(x)\) decreases, and as \(p(x)\) decreases, the value of \(-p(x)\) increases. In other words, as the probability of observing data \(x\) increases, the surprise we experience after observing \(x\) decreases; and as the probability of observing data \(x\) decreases, the surprise we experience after observing \(x\) increases, because observing it is an unlikely event for us.
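To make this concrete with the light-switch example (the probabilities below are illustrative):

\[ \begin{aligned} p(\text{light turns on}) = 0.99 &\implies -p(x) = -0.99 \\ p(\text{light stays off}) = 0.01 &\implies -p(x) = -0.01 \end{aligned} \]

Since \(-0.99 < -0.01\), the highly probable outcome yields the smaller surprise value, just as we wanted.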

But what if we observe multiple outcomes \(x_i\)? Let’s say that we have a machine-learning model that takes 3 images as input. And let’s assume that the 1st image is about a dog, the 2nd image is about a bear and the 3rd image is about a bee. In total, there are 3 classes (dog, bear, and bee) and let’s assume that the model assigns different probabilities to these 3 classes for each image.

If the model assigns a probability of 0.75 to the dog class in the 1st image, a probability of 0.6 to the bear class in the 2nd image, and a probability of 0.4 to the bee class in the 3rd image, we can calculate the probability of classifying the 1st image as a dog AND the 2nd image as a bear AND the 3rd image as a bee by multiplying these probabilities, as in the formula below.

\[ \prod _{i=1}^{n} p(x_i) \]

Previously, we mentioned that if there is a single outcome \(x\) with probability \(p(x)\), we can express the level of surprise by using \(-p(x)\). If we apply the same logic here, when there are multiple outcomes, we can measure the overall surprise level by taking the negative of the product of the probabilities.

\[ - \prod _{i=1}^{n} p(x_i) \]

So if we return to the image classification example, the formula above attempts to measure the level of surprise we will experience after seeing that the model classified the 1st image as a dog with 0.75 probability, the 2nd image as a bear with 0.6 probability, and the 3rd image as a bee with 0.4 probability.

However, there are some issues with this method. First of all, multiplying many probabilities that are between 0 and 1 results in a very small number. For instance, if we multiply 0.75, 0.6, and 0.4, the result is 0.18, and with hundreds of such factors the product quickly shrinks toward 0 and becomes hard to represent and compare reliably. To handle this issue, we can simply take the logarithm of the product, which turns it into a sum.

\[ \begin {aligned} -\log \left ( \prod _{i=1}^{n} p(x_i)\right ) &= - \sum _ {i=1}^{n} \log p(x_i) \\ \end {aligned} \]
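As a quick sketch in Python (reusing the illustrative probabilities from the image example above), we can check both that the product of probabilities shrinks toward 0 and that the negative log of the product equals the negative sum of the logs:

```python
import math

probs = [0.75, 0.6, 0.4]  # illustrative probabilities from the image example

# Multiplying many values between 0 and 1 quickly shrinks toward 0
product = math.prod(probs)  # 0.18

# The log of a product equals the sum of the logs
neg_log_of_product = -math.log(product)
neg_sum_of_logs = -sum(math.log(p) for p in probs)

print(product)             # 0.18
print(neg_log_of_product)  # ~1.715
print(neg_sum_of_logs)     # ~1.715 (identical, as the identity above states)
```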

But there is still an issue with this sum of negative log probabilities. What if, for instance, we have thousands of outcomes? The sum grows with the number of outcomes, which makes it hard to interpret and to compare across scenarios. To solve this issue, we can simply take the average by multiplying the result by \(\frac{1}{n}\).

\[ \begin {aligned} - \frac {1}{n} \left ( \sum _ {i=1}^{n} \log p(x_i) \right ) \\ \end {aligned} \]

So the function above gives us the overall uncertainty/entropy of the system when there are multiple data points \(x_i\).

But note that in the formula above, every outcome contributes equally to the overall surprise level. For instance, a very improbable event surprises us a lot, but considering that this event won’t happen frequently, it shouldn’t dominate the overall surprise level. With the formula above, it does. Therefore, we should incorporate the probability of each event as a weight. And once each surprise term is weighted by its probability \(p(x_i)\), the weights already sum to 1 and take over the role of the \(\frac{1}{n}\) average, so we can drop it.

\[ \begin{aligned} - \sum_{i=1}^{n} p(x_i) \log p(x_i) \end{aligned} \]
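As a minimal sketch in Python, we can compute this quantity for a few illustrative distributions; note that a uniform distribution (maximum uncertainty) has the highest value, while a certain outcome has zero (zero-probability terms are skipped, following the convention \(0 \log 0 = 0\)):

```python
import math

def entropy(p):
    """Entropy of a discrete distribution: -sum(p_i * log(p_i))."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))  # ~0.693 (fair coin: maximum uncertainty)
print(entropy([0.9, 0.1]))  # ~0.325 (biased coin: more predictable)
print(entropy([1.0, 0.0]))  # 0.0    (certain outcome: no surprise at all)
```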

Since we now take the probabilities of the events into account, this is a natural way to combine all the surprise levels: it is the expected surprise, which is exactly the entropy of the distribution \(p\). But the formula above only describes a single distribution (\(p\)). So, what if we have another distribution (let’s call it \(q\)) and we want to measure how well this distribution (\(q\)) approximates the other distribution (\(p\))?

The probability of observing an outcome \(x_i\) using the \(q\) distribution is denoted with \(q(x_i)\). In that case, if some frequent event happens (\(p(x_i)\) is high) and \(q\) is not very surprised after that event happens (high \(q(x_i)\) and low \(-\log q(x_i)\)), this can be seen as a good indicator that the distribution \(q\) resembles \(p\).

Similarly, if some rare event happens (\(p(x_i)\) is low) and \(q\) is very surprised when that event happens (low \(q(x_i)\) and high \(-\log q(x_i)\)), this is another good indicator that \(q\) acts like \(p\).

If, however, some rare event happens (\(p(x_i)\) is low) and \(q\) is not surprised at all when that event happens (high \(q(x_i)\) and low \(-\log q(x_i)\)), this means that \(q\) acts differently from \(p\).

Lastly, if some frequent event happens (\(p(x_i)\) is high) and \(q\) is very surprised (low \(q(x_i)\) and high \(-\log q(x_i)\)), this is another indicator that \(q\) acts quite differently from \(p\).

We can express these dynamics with the simple and intuitive equation below:

\[ \begin{aligned} - \sum_{i=1}^{n} p(x_i) \log q(x_i) \end{aligned} \]

And this is what is called cross-entropy. It measures how well one distribution (\(q\)) approximates the other (\(p\)): the closer \(q\) is to \(p\), the lower the cross-entropy.
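Here is a minimal sketch with illustrative distributions: when \(q\) equals \(p\), the cross-entropy equals the entropy of \(p\), and it grows as \(q\) drifts away from \(p\):

```python
import math

def cross_entropy(p, q):
    """Cross-entropy: -sum(p_i * log(q_i)), how surprised q is by outcomes drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (illustrative)

print(cross_entropy(p, [0.7, 0.2, 0.1]))  # ~0.802 (q == p: equals the entropy of p)
print(cross_entropy(p, [0.5, 0.3, 0.2]))  # ~0.887 (q close to p)
print(cross_entropy(p, [0.1, 0.2, 0.7]))  # ~1.969 (q far from p)
```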

However, knowing this might still not be enough. In some cases, we may also want to measure how much extra surprise we experience by using \(q\) instead of the true distribution \(p\). Subtracting the surprise under \(p\) from the surprise under \(q\) for each event \(x_i\), weighted by the probability \(p(x_i)\) that the event actually happens, gives the equation below:

\[ \begin{aligned} \sum_{i=1}^{n} p(x_i) \left( \log p(x_i) - \log q(x_i) \right) &= \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)} \end{aligned} \]

And this is called Kullback-Leibler (KL) divergence. It measures the difference between two probability distributions: it is always non-negative, and it equals zero only when \(q\) matches \(p\) exactly. Notice that it is simply the cross-entropy between \(p\) and \(q\) minus the entropy of \(p\).
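Continuing the sketch above, KL divergence is exactly the gap between the cross-entropy and the entropy of \(p\):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum(p_i * log(p_i / q_i)) = cross_entropy(p, q) - entropy(p)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # same illustrative distribution as before

print(kl_divergence(p, [0.7, 0.2, 0.1]))  # 0.0    (identical distributions)
print(kl_divergence(p, [0.5, 0.3, 0.2]))  # ~0.085 (close to p)
print(kl_divergence(p, [0.1, 0.2, 0.7]))  # ~1.168 (far from p)
```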

Now, let’s say that we have a machine learning model and it outputs the probability distribution \(q\). If this model is not optimal, the probabilities it assigns won’t be very accurate, and computing the overall surprise level from inaccurate probabilities is not ideal. That’s why we should also incorporate the parameters of our model when we calculate the overall surprise level we experience.

To do this, we can use the concept of likelihood instead of probability. Probability refers to the chance of a specific event occurring, while likelihood is a measure of how plausible a set of model parameters is, given that the specific event occurred. For instance, let’s say that we have some information about the current date, the temperature outside, the location, etc., and we want to predict whether it will rain tomorrow. Here, probability measures how likely it is to rain tomorrow given that we are currently in New York, it is summer, today was sunny, and so on. Likelihood, on the other hand, is a function of the model parameters and the observed data: it represents how probable the observed data is if the model with those parameter values is true. The model parameters that maximize this likelihood are considered the best fit for the model given the data.

We can write the likelihood as \(L(\theta \mid x) = p(x \mid \theta)\). Here \(\theta\) represents the set of parameters of the model and \(x\) represents the observed data (whether the weather is rainy or not). \(L(\theta \mid x)\) represents the probability of observing the data \(x\) under the model with parameters \(\theta\).

Suppose it is winter, and the daily temperature has been around 0°F for the last 30 days. If our model assigns a high probability to summer-like temperatures, this indicates a low likelihood given the observed winter temperatures. Thus, likelihood helps us evaluate the suitability of our model parameters based on the observed data. Finding the model parameters that maximize the likelihood, i.e., the parameters that make the observed data \(x_i\) most probable under the assumed model, is called maximum likelihood estimation. In practice, we usually minimize the negative logarithm of the likelihood instead, which is numerically more convenient and has the same optimum.

\[ \begin{aligned} - \sum_{i=1}^{N} \log q(x_i \mid \theta) \end{aligned} \]

The formula above is called the negative log-likelihood. Here, \(\theta\) represents the parameters of the model, \(x_i\) represents the observed data points, and \(q(x_i \mid \theta)\) represents how likely it is to observe the data point \(x_i\) under the model with parameters \(\theta\). Since the logarithm is monotonic and the negation flips maximization into minimization, the \(\theta\) that minimizes the negative log-likelihood is exactly the \(\theta\) that maximizes the likelihood \(L(\theta \mid x)\).
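As a minimal sketch of maximum likelihood estimation, consider a coin-flip (Bernoulli) model with a single parameter \(\theta\), the probability of heads; the flips below are hypothetical. We sweep candidate values of \(\theta\) and keep the one that minimizes the negative log-likelihood:

```python
import math

# Hypothetical observations: 1 = heads, 0 = tails (7 heads out of 10 flips)
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def negative_log_likelihood(theta, data):
    """NLL(theta) = -sum(log q(x_i | theta)) under a Bernoulli model."""
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in data)

# Simple grid search over candidate parameter values
candidates = [i / 100 for i in range(1, 100)]
best_theta = min(candidates, key=lambda t: negative_log_likelihood(t, flips))

print(best_theta)  # 0.7 (the empirical frequency of heads, as expected for this model)
```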

If we have a machine learning model that assigns probability \(q(x_i)\) to observing data \(x_i\) while the real probability of observing this data is \(p(x_i)\), we can measure the overall performance of the model and optimize its parameters in a way that decreases the difference between the true probabilities and the predicted probabilities. In fact, when \(p\) is the empirical distribution of the observed data, minimizing the negative log-likelihood above is equivalent to minimizing the cross-entropy between \(p\) and \(q\).

Once we find the parameters with maximum likelihood estimation and ensure that they are optimal and reliable, we can switch back from likelihood to probability and measure the predictability and certainty of the probabilities predicted by the resulting model.

\[ \begin {aligned} - \sum _{i=1}^{N} p(x_i) \log q(x_i) \end {aligned} \]

If there are only 2 distinct labels in our data, we can write the probability as below.

\[ p(x_i) = \begin {cases} p & \text {if } x_i = 1 \\ 1 - p & \text {if } x_i = 0 \end {cases} \]

In that case, writing \(q\) for the probability the model assigns to the label 1, we can rewrite the cross-entropy for the two possible outcomes as below.

\[ \begin{aligned} - \sum_{x_i \in \{0, 1\}} p(x_i) \log q(x_i) = - p \log q - (1-p) \log (1-q) \end{aligned} \]

And we call this formula binary cross-entropy.
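A minimal sketch of binary cross-entropy over a batch of examples (the labels and predicted probabilities below are illustrative; averaging over the batch is a common convention):

```python
import math

def binary_cross_entropy(labels, preds):
    """Average of -[p * log(q) + (1 - p) * log(1 - q)] over all examples."""
    total = sum(-(p * math.log(q) + (1 - p) * math.log(1 - q))
                for p, q in zip(labels, preds))
    return total / len(labels)

labels = [1, 0, 1, 1]
confident_preds = [0.9, 0.1, 0.8, 0.95]  # confident and correct
uncertain_preds = [0.4, 0.6, 0.3, 0.5]   # uncertain or wrong

print(binary_cross_entropy(labels, confident_preds))  # ~0.121 (low loss)
print(binary_cross_entropy(labels, uncertain_preds))  # ~0.932 (higher loss)
```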

When we use a loss function like the one above, we update the parameters of the model whenever there is a difference between the actual probability \(p(x_i)\) and the predicted probability \(q(x_i)\), and the larger this difference, the more the model is penalized. Notice, however, that this loss only penalizes the mismatch between probabilities; it does not directly constrain the size of the model’s weights:

\[ \begin {aligned} - \sum _{i=1}^{N} p(x_i) \log q(x_i) \end {aligned} \]

Sometimes, we may also want to penalize the parameters (weights) of the model directly, to prevent them from becoming very large and to keep the learned function smooth. To do this, we can add the sum of squares of all model parameters to the loss. This way, if there is a big difference between the actual probability \(p(x_i)\) and the predicted probability \(q(x_i)\) and the system wants to make drastic updates to the parameters as a result, the sum-of-squares term discourages this, because large weights now increase the value of the loss function and are therefore avoided. This is called regularization, and there are two common types: L1 and L2. The function below is an example of L2 regularization.

\[ - \sum_{i=1}^{N} p(x_i) \log q(x_i) + \lambda \sum_{j=1}^{M} W_j^2 \]

Another way to penalize large weights is L1 regularization, which uses the sum of the absolute values of the weights instead.

\[ - \sum_{i=1}^{N} p(x_i) \log q(x_i) + \lambda \sum_{j=1}^{M} |W_j| \]
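As a final sketch, here is how the two penalty terms change the value of a loss; \(\lambda\) and the weights below are illustrative:

```python
def l2_penalty(weights, lam):
    """L2 regularization term: lambda * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """L1 regularization term: lambda * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

weights = [0.5, -2.0, 3.0]  # illustrative model weights
base_loss = 0.932           # e.g., a binary cross-entropy value

print(base_loss + l2_penalty(weights, lam=0.01))  # 0.932 + 0.01 * 13.25 = ~1.065
print(base_loss + l1_penalty(weights, lam=0.01))  # 0.932 + 0.01 * 5.5   = ~0.987
```

Because the squared term grows faster for large weights, L2 punishes big weights more aggressively, while L1 tends to push small weights all the way to zero.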