At the heart of all pattern search or classification problems, whether explicitly or implicitly, lies Bayes' decision theory. Bayes' decision rule simply says: given an input observation of unknown classification, make the decision that minimizes the probability of a classification error. For example, in this unit you will be introduced to the k-nearest neighbor algorithm, and it can be shown that this algorithm approximates Bayes' decision. Read this chapter to familiarize yourself with Bayes' decision rule.
Bayesian Decision Theory
Suppose that we know both the prior probabilities \(P(\omega_j)\) and the conditional densities \(p(x \mid \omega_j)\). Suppose further that we measure the features of a sample and discover that its
value is \(x\). How does this measurement influence our attitude concerning the true state of
nature - that is, the category of the input? We note first that the (joint) probability density of
finding a pattern that is in category \(\omega_j\) and has feature value \(x\) can be written in two ways: \(p(\omega_j, x) = P(\omega_j \mid x)\,p(x) = p(x \mid \omega_j)\,P(\omega_j)\). Rearranging these leads us to the answer to our
question, which is called Bayes' formula:
\(P\left(\omega_{j} \mid x\right)=\frac{p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)}{p(x)}\) (1)
where in this case of c categories
\(p(x)=\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)\) (2)
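To make Eqs. (1) and (2) concrete, here is a minimal Python sketch; the prior and likelihood values are hypothetical numbers chosen only for illustration, not taken from the text.

```python
# A minimal sketch of Eqs. (1) and (2) for two categories.
priors = [0.7, 0.3]          # P(w_1), P(w_2)  -- hypothetical values
likelihoods = [0.05, 0.20]   # p(x | w_1), p(x | w_2) evaluated at the observed x

# Eq. (2): the evidence p(x) is the sum of likelihood * prior over categories.
evidence = sum(p_x_w * prior for p_x_w, prior in zip(likelihoods, priors))

# Eq. (1): posterior P(w_j | x) = p(x | w_j) P(w_j) / p(x).
posteriors = [p_x_w * prior / evidence for p_x_w, prior in zip(likelihoods, priors)]

print(posteriors)            # the posteriors sum to 1
```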
Two-Category Classification
If we have an observation \(x\) for which \(P(\omega_1 \mid x)\) is greater than \(P(\omega_2 \mid x)\), we would
naturally be inclined to decide that the true state of nature is \(\omega_1\). Similarly, if \(P(\omega_2 \mid x)\) is
greater than \(P(\omega_1 \mid x)\), we would be inclined to choose \(\omega_2\). Thus we have justified the
following Bayes decision rule for minimizing the probability of error:
Decide \(\omega_1\) if \(P\left(\omega_{1} \mid x\right)>P\left(\omega_{2} \mid x\right)\); otherwise decide \(\omega_2\) (3)
In Eq. (1), \(p(x)\) is a scale factor, unimportant for our problem: it merely guarantees that the
posterior probabilities sum to one, \(P(\omega_1 \mid x) + P(\omega_2 \mid x) = 1\). By using Eq. (1), we can
instead express the rule in terms of the conditional and prior probabilities. Eliminating this
scale factor, we obtain the following completely equivalent decision rule:
Decide \(\omega_1\) if \(p\left(x \mid \omega_{1}\right) P\left(\omega_{1}\right)>p\left(x \mid \omega_{2}\right) P\left(\omega_{2}\right)\); otherwise decide \(\omega_2\) (4)
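As an illustration of the rule in Eq. (4), the following sketch assumes, purely for the example, that each class-conditional density is a univariate Gaussian; the means, standard deviations, and priors below are hypothetical choices, not values from the text.

```python
# A sketch of the two-category decision rule in Eq. (4).
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2.0 * pi))

P_w1, P_w2 = 0.6, 0.4                                     # priors P(w_1), P(w_2)

def decide(x):
    """Eq. (4): decide w_1 if p(x|w_1)P(w_1) > p(x|w_2)P(w_2), else w_2."""
    score_1 = gaussian_pdf(x, mean=0.0, std=1.0) * P_w1   # p(x|w_1) P(w_1)
    score_2 = gaussian_pdf(x, mean=2.0, std=1.0) * P_w2   # p(x|w_2) P(w_2)
    return "w_1" if score_1 > score_2 else "w_2"

print(decide(0.5))   # close to the w_1 mean -> w_1
print(decide(2.5))   # close to the w_2 mean -> w_2
```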
While the two-category case is just a special instance of the multi-category case, it has
traditionally received separate treatment. Indeed, a classifier that places a pattern in one of
only two categories has a special name - a dichotomizer. Instead of using two
discriminant functions \(g_1\) and \(g_2\) and assigning \(x\) to \(\omega_1\) if \(g_1(x) > g_2(x)\), it is more common
to define a single discriminant function
\(g (x) = g_1(x) - g_2(x)\) (5)
and to use the following decision rule:
Decide \(\omega_1\) if \(g(x) > 0\); otherwise decide \(\omega_2\)
Thus, a dichotomizer can be viewed as a machine that computes a single discriminant
function \(g(x)\) and classifies \(x\) according to the algebraic sign of the result. Of the various
forms in which the minimum-error-rate discriminant function can be written, the following
two (derived from Eqs. (1) and (5)) are particularly convenient:
\(g(x)=P\left(\omega_{1} \mid x\right)-P\left(\omega_{2} \mid x\right)\) (6)
\(g(x)=\ln \frac{p\left(x \mid \omega_{1}\right)}{p\left(x \mid \omega_{2}\right)}+\ln \frac{P\left(\omega_{1}\right)}{P\left(\omega_{2}\right)}\) (7)
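The sketch below illustrates the dichotomizer of Eqs. (5)-(7): it computes \(g(x)\) as the log likelihood ratio plus the log prior ratio and decides from the algebraic sign of the result. The Gaussian class-conditional densities and the equal priors are assumptions made only for this example.

```python
# A minimal sketch of the dichotomizer in Eqs. (5)-(7).
from math import exp, log, pi, sqrt

def gaussian_pdf(x, mean, std):
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2.0 * pi))

P_w1, P_w2 = 0.5, 0.5   # hypothetical equal priors

def g(x):
    """Eq. (7): g(x) = ln[p(x|w_1)/p(x|w_2)] + ln[P(w_1)/P(w_2)]."""
    return (log(gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 2.0, 1.0))
            + log(P_w1 / P_w2))

def decide(x):
    # Decide w_1 when g(x) > 0, otherwise w_2 -- the sign carries the decision.
    return "w_1" if g(x) > 0 else "w_2"

# Points nearer to each class mean fall on opposite sides of g(x) = 0.
print(decide(0.2), decide(1.8))
```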
Multi-Category Classification
Let \(\omega_1, \ldots, \omega_c\) be the finite set of c states of nature. Let the feature vector \(x\) be a d-component vector-valued random variable, and let \(p(x \mid \omega_j)\) be the state-conditional probability density function for \(x\) - the probability density function for \(x\) conditioned on \(\omega_j\) being the true state of nature. As before, \(P(\omega_j)\) describes the prior probability that nature is in state \(\omega_j\). Then the posterior probability \(P(\omega_j \mid x)\) can be computed from \(p(x \mid \omega_j)\) by Bayes' formula:
\(P\left(\omega_{j} \mid x\right)=\frac{p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)}{p(x)}\) (8)
where the evidence is now
\(p(x)=\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)\) (9)
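As a sketch of Eqs. (8) and (9) with c categories and a d-component feature vector, the example below assumes, purely for illustration, that each class-conditional density factors into unit-variance univariate Gaussians centered on a class-specific mean vector; the mean vectors and priors are hypothetical.

```python
# A sketch of Eqs. (8) and (9) for c = 3 categories and a 2-component feature vector.
from math import exp, pi, sqrt

def class_density(x, mean):
    """p(x | w_j) under the illustrative independent unit-variance Gaussian assumption."""
    density = 1.0
    for xi, mi in zip(x, mean):
        density *= exp(-0.5 * (xi - mi) ** 2) / sqrt(2.0 * pi)
    return density

means = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]   # one hypothetical mean vector per category
priors = [0.5, 0.3, 0.2]                        # P(w_1), P(w_2), P(w_3)

def posteriors(x):
    joint = [class_density(x, m) * P for m, P in zip(means, priors)]  # p(x|w_j) P(w_j)
    evidence = sum(joint)                                             # Eq. (9)
    return [j / evidence for j in joint]                              # Eq. (8)

print(posteriors((0.5, 0.5)))   # the three posteriors sum to 1
```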
One standard way to represent a classifier is through a set of discriminant functions \(g_i(x)\), \(i = 1, \ldots, c\): the classifier assigns \(x\) to category \(\omega_i\) whenever \(g_i(x) > g_j(x)\) for all \(j \neq i\). A Bayes classifier is easily and naturally represented in this way. For the minimum-error-
rate case, we can simplify things further by taking \(g_i(x)=P(\omega_i \mid x)\), so that the maximum
discriminant function corresponds to the maximum posterior probability.
Clearly, the choice of discriminant functions is not unique. We can always multiply all the
discriminant functions by the same positive constant or shift them by the same additive
constant without influencing the decision. More generally, if we replace every \(g_i(x)\) by \(f(g_i(x))\), where \(f(\cdot)\) is a monotonically increasing function, the resulting classification is
unchanged. This observation can lead to significant analytical and computational
simplifications. In particular, for minimum-error-rate classification, any of the following
choices gives identical classification results, but some can be much simpler to understand or compute than others:
\(g_{i}(x)=P\left(\omega_{i} \mid x\right)=\frac{p\left(x \mid \omega_{i}\right) P\left(\omega_{i}\right)}{\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)}\) (10)
\(g_{i}(x)=p\left(x \mid \omega_{i}\right) P\left(\omega_{i}\right)\) (11)
\(g_{i}(x)=\ln p\left(x \mid \omega_{i}\right)+\ln P\left(\omega_{i}\right)\) (12)
where \(\ln\) denotes the natural logarithm.
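A quick numerical check of the claim that Eqs. (10)-(12) give identical classification results: the three discriminant forms differ only by a common normalization or by a monotonic (logarithmic) transformation, so they select the same category. The likelihood and prior values below are hypothetical.

```python
# The three discriminant choices in Eqs. (10)-(12) rank categories identically.
from math import log

likelihoods = [0.05, 0.20, 0.10]   # p(x | w_i) at the observed x (hypothetical)
priors = [0.5, 0.3, 0.2]           # P(w_i)
evidence = sum(l * P for l, P in zip(likelihoods, priors))

g_posterior = [l * P / evidence for l, P in zip(likelihoods, priors)]   # Eq. (10)
g_product = [l * P for l, P in zip(likelihoods, priors)]                # Eq. (11)
g_log = [log(l) + log(P) for l, P in zip(likelihoods, priors)]          # Eq. (12)

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# All three forms select the same category index.
assert argmax(g_posterior) == argmax(g_product) == argmax(g_log)
print(argmax(g_posterior))   # index of the chosen category
```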