Entropy can be defined as the uncertainty, unpredictability, randomness, or disorder
in a set of possible outcomes.
When we build a machine learning model and the model assigns probabilities
to some scenarios, if the model is not very sure about its predictions, this
means that the unpredictability and randomness in the model’s predictions
are high. But we don’t want that; we want our model to be sure and
certain about its predictions. Because of that, when we build a model, we
should update the parameters of this model so that we can reduce the overall
uncertainty and randomness (in other words, the entropy) of these predictions. But
how do we represent this unpredictability and randomness mathematically?
One way to measure the unpredictability of an outcome is to measure how much
we are surprised after observing that outcome. Because if the outcome is highly
predictable, unpredictability is low and we don’t tend to be surprised that much after
observing that outcome. If the outcome is not predictable and we observe that
outcome, we tend to be surprised. Therefore, if we can calculate how surprised
we are after observing an outcome, we can treat this surprise level as
unpredictability. Now the question is: how do we express the level of surprise?
The simplest way to express the level of surprise mathematically is by using the
negative probability of the outcome. Because if we observe data point \(x\), for instance,
and if we show the probability of observing data point \(x\) with \(p(x)\), we tend to be more
surprised after observing data \(x\) if the probability of observing data \(x\) is low. And if the
probability of observing data \(x\) is high, we tend to be less surprised after observing
data \(x\).
If we press a light switch, for example, the probability of the light turning on is
very high. Therefore, we are not surprised at all when we observe that the light is on
after pressing the switch. However, if the light does not turn on, we are quite
surprised since this is not an expected situation.
We can express this dynamic mathematically by using the formula below.
\[ \begin {aligned} - p(x) \end {aligned} \]
Because as \(p(x)\) increases, the value of \(-p(x)\) decreases, and as \(p(x)\) decreases, the value of \(-p(x)\)
increases. In other words, as the probability of observing data \(x\) increases, the surprise
we experience after observing \(x\) will decrease, and as the probability of observing data \(x\)
decreases, the surprise we experience after observing \(x\) will increase because it is an
unlikely event for us.
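As a quick worked example with made-up numbers: if the light turns on with probability \(p(\text {on}) = 0.99\), this measure gives a surprise of \(-0.99\), while the rare event of the light staying off, with \(p(\text {off}) = 0.01\), gives \(-0.01\). Since \(-0.01 > -0.99\), the rarer outcome indeed receives the higher surprise value under this crude measure.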
But what if we observe multiple outcomes \(x_i\)? Let’s say that we have a
machine learning model that takes 3 images as input. And let’s assume that the 1st
image shows a dog, the 2nd image shows a bear, and the 3rd image shows a
bee. In total, there are 3 classes (dog, bear, and bee) and let’s assume that
the model assigns different probabilities to these 3 classes for each image.
If the model assigns a probability of 0.75 to the dog class in the 1st image, a
probability of 0.6 to the bear class in the 2nd image, and a probability of 0.4 to the
bee class in the 3rd image, we can calculate the probability of classifying the
1st image as a dog AND the 2nd image as a bear AND the 3rd
image as a bee by multiplying these probabilities with the formula below.
\[ \prod _{i=1}^{n} p(x_i) \]
Previously, we mentioned that if there is a single outcome \(x\) and the probability \(p(x)\)
is assigned to this outcome, we can express the level of surprise by using \(-p(x)\). If we
apply the same logic here, when there are multiple outcomes, we can measure the
overall surprise level by taking the negative of the product of the probabilities.
\[ - \prod _{i=1}^{n} p(x_i) \]
So if we turn back to the image classification example, the formula above
attempts to measure the level of surprise we will experience after seeing that the
model classified the 1st image as a dog with 0.75 probability, the 2nd image as
a bear with 0.6 probability, and the 3rd image as a bee with 0.4 probability.
However, there are some issues with this method. First of all, the multiplication
of many probabilities that are between 0 and 1 results in a very small
number. For instance, if we multiply 0.75, 0.6, and 0.4, the result is 0.18,
and with thousands of probabilities the product quickly shrinks toward zero,
which causes numerical underflow and makes different scenarios hard to compare.
To handle this issue, we can simply take the \(\log (.)\) of this operation, which turns the product into a sum.
\[ \begin {aligned} -\log \left ( \prod _{i=1}^{n} p(x_i)\right ) &= - \sum _ {i=1}^{n} \log p(x_i) \\ \end {aligned} \]
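To see this numerical issue concretely, here is a minimal Python sketch; the probability value 0.3 and the count of 1000 outcomes are made up for the illustration:

```python
import math

# 1000 hypothetical outcome probabilities, each between 0 and 1
probs = [0.3] * 1000

# Multiplying them directly underflows to 0.0 in floating point:
# the true value (~1e-523) is far below the smallest positive float
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing log probabilities instead stays in a workable range
log_sum = sum(math.log(p) for p in probs)
print(-log_sum)  # ~1203.97, the negative log of the same product
```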
But there is still an issue with this formula. What if, for instance, we have thousands of outcomes? The sum of their negative log probabilities grows with the number of outcomes, which is not intuitive and makes it difficult to compare with other scenarios. To solve this issue, we can simply take the average by multiplying the result by \(\frac {1}{n}\):
\[ \begin {aligned} - \frac {1}{n} \left ( \sum _ {i=1}^{n} \log p(x_i) \right ) \\ \end {aligned} \]
So the expression above gives us the average surprise of the system when
there are multiple data points \(x_i\).
But note that in the formula above, all outcomes contribute equally to the
overall surprise level. For instance, if we have a less probable
event, we are surprised a lot when it happens, but considering that this event won’t happen frequently,
it shouldn’t dominate the overall surprise level. But with the formula above, it
does. Therefore, we should weight each term by the probability of the event itself.
And once each term is weighted by \(p(x_i)\), the separate \(\frac {1}{n}\) average is no
longer needed, because the probabilities already sum to 1.
\[ \begin {aligned} - \sum _{i=1}^{n} p(x_i) \log p(x_i) \end {aligned} \]
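As a small illustration, here is a Python sketch of this entropy formula; the two three-class distributions below are made up for the example:

```python
import math

def entropy(p):
    """Entropy -sum(p_i * log(p_i)) of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A peaked (predictable) distribution vs. a uniform (unpredictable) one
print(entropy([0.98, 0.01, 0.01]))  # ~0.112 -- low uncertainty
print(entropy([1/3, 1/3, 1/3]))     # ~1.099 -- maximal for 3 outcomes
```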
Since we take the probabilities of the events into account, this is a sensible
way to combine all the surprise levels; this quantity is the entropy of the
distribution \(p\). But the formula above only involves a single distribution (\(p\)). So, what if we have another distribution
(let’s call it \(q\)) and we want to measure how well this distribution (\(q\)) approximates the
other distribution (\(p\))?
The probability of observing an outcome \(x_i\) using the \(q\) distribution is denoted with \(q(x_i)\).
In that case, if some frequent event happens (\(p(x_i)\) is high) and \(q\) is not very surprised after
that event happens (high \(q(x_i)\) and low \(-\log q(x_i)\)), this can be seen as a good indicator that the
distribution \(q\) resembles \(p\).
Similarly, if some rare event happens (\(p(x_i)\) is low) and \(q\) is very surprised when that
event happens (low \(q(x_i)\) and high \(-\log q(x_i)\)), this is another good indicator that \(q\) acts like \(p\).
If, however, some rare event happens (\(p(x_i)\) is low) and \(q\) is not surprised at all when
that event happens (high \(q(x_i)\) and low \(-\log q(x_i)\)), this means that \(q\) acts differently from \(p\).
Lastly, if some frequent event happens (\(p(x_i)\) is high) and \(q\) is very surprised
(low \(q(x_i)\) and high \(-\log q(x_i)\)), this is another indicator that \(q\) acts quite differently from \(p\).
We can express these dynamics with the simple and intuitive equation below:
\[ \begin {aligned} - \sum _{i=1}^{n} p(x_i) \log q(x_i) \end {aligned} \]
And this is called cross-entropy. It measures how well one
distribution (\(q\)) approximates the other (\(p\)): the closer \(q\) is to \(p\), the lower the cross-entropy.
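Here is a minimal Python sketch of this cross-entropy formula, with made-up distributions; an approximation that resembles \(p\) gives a value close to the entropy of \(p\) itself, while a poor approximation gives a much higher value:

```python
import math

def cross_entropy(p, q):
    """Cross-entropy -sum(p_i * log(q_i)) between distributions p and q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]          # hypothetical "true" distribution
q_close = [0.6, 0.25, 0.15]  # approximation that resembles p
q_far   = [0.1, 0.2, 0.7]    # approximation that does not

print(cross_entropy(p, p))        # ~0.802 -- the entropy of p itself
print(cross_entropy(p, q_close))  # ~0.825 -- slightly higher
print(cross_entropy(p, q_far))    # ~1.969 -- much higher
```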
However, knowing this might still not be enough. In some cases, we may also want
to measure the difference between the surprise level we experience with \(q\) and the
surprise level we experience with \(p\) after an event \(x_i\) happens. If we express the probability of an
event \(x_i\) happening with \(p(x_i)\), we can measure the expected difference between these
surprise levels with the equation below:
\[ \begin {aligned} \sum _{i=1}^{n} p(x_i) (\log p(x_i) - \log q(x_i)) &= \sum _{i=1}^{n} p(x_i) \log \frac {p(x_i)}{q(x_i)} \end {aligned} \]
And this is called Kullback-Leibler (KL) Divergence. It measures how much one
probability distribution diverges from another: it is always non-negative, it is zero only when the two distributions are identical, and it is exactly the cross-entropy between \(p\) and \(q\) minus the entropy of \(p\).
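A minimal Python sketch of the same quantity, reusing made-up distributions like those in the cross-entropy example:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum(p_i * log(p_i / q_i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.6, 0.25, 0.15]

# KL divergence equals cross-entropy minus entropy and is always >= 0
print(kl_divergence(p, q))  # ~0.023
print(kl_divergence(p, p))  # 0.0 -- identical distributions diverge by zero
```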
Now, let’s say that we have a machine learning model and it outputs the
probability distribution \(q\). If this model is not optimal, its parameters may not
assign the most accurate probabilities to events, and computing the overall surprise
level with inaccurate probabilities is not ideal. That’s why we
should incorporate the parameters of our model as well when we calculate the overall
surprise level we experience.
To do this, we can use the concept of likelihood instead of probability. Probability
refers to the chance of a specific event occurring. And likelihood is a measure of
how plausible the set of model parameters is, given that the specific event
occurred. For instance, let’s say that we have some information about the
current date, temperature outside, location, etc. We want to predict the
probability of observing rain tomorrow. Here, probability is the measure
of how likely it is to rain tomorrow given the fact that we are currently
in New York, in summer, today was sunny, etc. Likelihood, on the other
hand, is a function of the model parameters and the observed data, and
it represents the probability of observing the given data if the model with
those parameter values is true. The model parameters that maximize this
likelihood are considered the best fit for the model given the data. We can
write the likelihood as \(L(\theta | x) = p(x | \theta )\). Here \(\theta \) represents the set of parameters of the model
and \(x\) represents the observed data (whether the weather is rainy or not). So \(L(\theta | x)\)
is numerically the probability of observing the data \(x\) under the model with parameters \(\theta \), read as a function of \(\theta \).
Suppose it is winter, and the daily temperature has been around 0°F for the last
30 days. If our model assigns a high probability to summer-like temperatures, this
would indicate a low likelihood, given the observed winter temperatures. Thus,
likelihood helps us evaluate the suitability of our model parameters based on the
observed data. Finding the model parameters that maximize the likelihood,
i.e., the parameters that make the observed data \(x_i\) most probable under the
assumed model, is called maximum likelihood estimation.
\[ \begin {aligned} - \sum _{i=1}^{N} \log q(x_i \mid \theta ) \end {aligned} \]
The formula above is called the negative log-likelihood. Here, \(\theta \) represents
the parameters of the model, \(x_i\) represents the observed data, and \(q(x_i \mid \theta )\)
represents how likely it is to observe the data \(x_i\) under the model with
parameters \(\theta \). Minimizing this quantity with respect to \(\theta \) is the same as
maximizing the likelihood of the parameters.
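As an illustration, here is a Python sketch that evaluates this negative log-likelihood for a simple Bernoulli (rain / no rain) model; the observations, the candidate parameter values, and the grid scan standing in for a proper optimizer are all made up for the example:

```python
import math

# Hypothetical observations: 1 = rain, 0 = no rain, over 10 days
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # 7 rainy days out of 10

def neg_log_likelihood(theta, data):
    """Negative log-likelihood of a Bernoulli parameter theta."""
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in data)

# The NLL is lowest (likelihood is highest) at theta = 0.7,
# the observed frequency of rain in the data
for theta in [0.3, 0.5, 0.7, 0.9]:
    print(theta, round(neg_log_likelihood(theta, data), 3))
```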
If we have a machine learning model and that model assigns the probability \(q(x_i)\)
to observing data \(x_i\) while the real probability of observing this data is \(p(x_i)\), we can
measure the overall performance of the model and optimize the parameters of the
model in such a way that decreases the difference between the true probabilities and
the predicted probabilities. In fact, when \(p\) is the empirical distribution of the observed
data, minimizing the cross-entropy \(- \sum _{i} p(x_i) \log q(x_i)\) amounts to minimizing the
negative log-likelihood above.
Once we find the parameters with maximum likelihood estimation and once we
ensure that the model parameters are optimal and reliable, we can use probability
instead of likelihood and measure the predictability/certainty of the predicted
probabilities with the best model we found.
\[ \begin {aligned} - \sum _{i=1}^{N} p(x_i) \log q(x_i) \end {aligned} \]
If there are only 2 distinct labels in our data, we can write the probability
as below.
\[ p(x_i) = \begin {cases} p & \text {if } x_i = 1 \\ 1 - p & \text {if } x_i = 0 \end {cases} \]
In that case, we can rewrite the cross-entropy formula as below.
\[ \begin {aligned} - \sum _{i=1}^{N} \Big ( p(x_i) \log q(x_i) + (1 - p(x_i)) \log (1 - q(x_i)) \Big ) \end {aligned} \]
And we call this formula binary cross-entropy.
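Here is a minimal Python sketch of binary cross-entropy as written above, with made-up labels and predictions:

```python
import math

def binary_cross_entropy(p_true, q_pred):
    """Binary cross-entropy between true probabilities and predictions."""
    return -sum(p * math.log(q) + (1 - p) * math.log(1 - q)
                for p, q in zip(p_true, q_pred))

labels = [1, 1, 0, 1]               # true labels written as probabilities
good_preds = [0.9, 0.8, 0.1, 0.7]   # confident, mostly correct predictions
bad_preds  = [0.2, 0.4, 0.9, 0.3]   # predictions that contradict the labels

print(binary_cross_entropy(labels, good_preds))  # ~0.79 -- low loss
print(binary_cross_entropy(labels, bad_preds))   # ~6.03 -- high loss
```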
When we use a loss function like the one above, we update the parameters of the model whenever there is a difference between the actual probability \(p(x_i)\) and the predicted probability \(q(x_i)\), and the larger this difference, the more the model is penalized.
\[ \begin {aligned} - \sum _{i=1}^{N} p(x_i) \log q(x_i) \end {aligned} \]
Sometimes, we may want to penalize the parameters (weights) of the model directly to prevent them from becoming very large. To do this, we can add the sum of the squares of all model parameters to the loss. This way, if there is a big difference between the actual probability \(p(x_i)\) and the predicted probability \(q(x_i)\) and the system wants to make drastic updates to the parameters of the model as a result, the sum of squares of the model parameters discourages this, because large weights now increase the value of the loss function and are therefore avoided. This is called regularization, and there are two common types of regularization: L1 and L2. The function below is an example of L2 regularization.
\[ - \sum _{i=1}^{N} p(x_i) \log q(x_i) + \lambda \sum _{j=1}^{M} {W_j}^2 \]
Another way to penalize large weights is L1 regularization.
\[ - \sum _{i=1}^{N} p(x_i) \log q(x_i) + \lambda \sum _{j=1}^{M} |W_j| \]
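To make the two penalties concrete, here is a minimal Python sketch; the weights, the cross-entropy value, and the \(\lambda \) value are all made up for the example:

```python
def l2_regularized_loss(ce_loss, weights, lam):
    """Cross-entropy loss plus an L2 penalty on the model weights."""
    return ce_loss + lam * sum(w ** 2 for w in weights)

def l1_regularized_loss(ce_loss, weights, lam):
    """Cross-entropy loss plus an L1 penalty on the model weights."""
    return ce_loss + lam * sum(abs(w) for w in weights)

weights = [0.5, -2.0, 3.0]  # made-up model weights
ce_loss = 0.79              # made-up cross-entropy value

print(l2_regularized_loss(ce_loss, weights, lam=0.1))  # 0.79 + 0.1 * 13.25 = 2.115
print(l1_regularized_loss(ce_loss, weights, lam=0.1))  # 0.79 + 0.1 * 5.5 = 1.34
```

In practice, the L1 penalty tends to push some weights exactly to zero, while the L2 penalty shrinks all weights smoothly toward zero without eliminating them.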