At the heart of all pattern search or classification problems, whether explicitly or implicitly, lies Bayes' decision theory. Bayes' decision rule simply says: given an input observation of unknown classification, make the decision that minimizes the probability of a classification error. For example, in this unit you will be introduced to the k-nearest neighbor algorithm, and it can be shown that this algorithm approximates Bayes' decision. Read this chapter to familiarize yourself with Bayes' decision rule.
Bayesian Decision Theory
Suppose that we know both the prior probabilities \(P(\omega_j)\) and the conditional densities \(p(x \mid \omega_j)\). Suppose further that we measure the features of a sample and discover that its
value is \(x\). How does this measurement influence our attitude concerning the true state of
nature - that is, the category of the input? We note first that the (joint) probability density of
finding a pattern that is in category \(\omega_j\) and has feature value \(x\) can be written in two ways: \(p(\omega_j, x) = P(\omega_j \mid x)\,p(x) = p(x \mid \omega_j)\,P(\omega_j)\). Rearranging these leads us to the answer to our
question, which is called Bayes' formula:
\(P\left(\omega_{j} \mid x\right)=\frac{p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)}{p(x)}\) (1)
where in this case of c categories
\(p(x)=\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)\) (2)
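To make Eqs. (1) and (2) concrete, here is a minimal Python sketch; the prior and likelihood values are hypothetical numbers chosen only for illustration, not taken from the text.

```python
# A minimal sketch of Eqs. (1) and (2) for two categories.
priors = [0.7, 0.3]          # P(w_1), P(w_2)  -- hypothetical values
likelihoods = [0.05, 0.20]   # p(x | w_1), p(x | w_2) evaluated at the observed x

# Eq. (2): the evidence p(x) is the sum of likelihood * prior over categories.
evidence = sum(p_x_w * prior for p_x_w, prior in zip(likelihoods, priors))

# Eq. (1): posterior P(w_j | x) = p(x | w_j) P(w_j) / p(x).
posteriors = [p_x_w * prior / evidence for p_x_w, prior in zip(likelihoods, priors)]

print(posteriors)            # the posteriors sum to 1
```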
Two-Category Classification
If we have an observation \(x\) for which \(P(\omega_1 \mid x)\) is greater than \(P(\omega_2 \mid x)\), we would
naturally be inclined to decide that the true state of nature is \(\omega_1\). Similarly, if \(P(\omega_2 \mid x)\) is
greater than \(P(\omega_1 \mid x)\), we would be inclined to choose \(\omega_2\). Thus we have justified the
following Bayes decision rule for minimizing the probability of error:
Decide \(\omega_1\) if \(P\left(\omega_{1} \mid x\right)>P\left(\omega_{2} \mid x\right)\); otherwise decide \(\omega_2\) (3)
In Eq. (1), \(p(x)\) is a scale factor, unimportant for our problem: it merely guarantees that the
posterior probabilities sum to one, \(P(\omega_1 \mid x) + P(\omega_2 \mid x) = 1\). By using Eq. (1), we can
instead express the rule in terms of the conditional and prior probabilities. Eliminating this
scale factor, we obtain the following completely equivalent decision rule:
Decide \(\omega_1\) if \(p\left(x \mid \omega_{1}\right) P\left(\omega_{1}\right)>p\left(x \mid \omega_{2}\right) P\left(\omega_{2}\right)\); otherwise decide \(\omega_2\) (4)
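As an illustration of the rule in Eq. (4), the following sketch assumes, purely for the example, that each class-conditional density is a univariate Gaussian; the means, standard deviations, and priors below are hypothetical choices, not values from the text.

```python
# A sketch of the two-category decision rule in Eq. (4).
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2.0 * pi))

P_w1, P_w2 = 0.6, 0.4                                     # priors P(w_1), P(w_2)

def decide(x):
    """Eq. (4): decide w_1 if p(x|w_1)P(w_1) > p(x|w_2)P(w_2), else w_2."""
    score_1 = gaussian_pdf(x, mean=0.0, std=1.0) * P_w1   # p(x|w_1) P(w_1)
    score_2 = gaussian_pdf(x, mean=2.0, std=1.0) * P_w2   # p(x|w_2) P(w_2)
    return "w_1" if score_1 > score_2 else "w_2"

print(decide(0.5))   # close to the w_1 mean -> w_1
print(decide(2.5))   # close to the w_2 mean -> w_2
```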
While the two-category case is just a special instance of the multi-category case, it has
traditionally received separate treatment. Indeed, a classifier that places a pattern in one of
only two categories has a special name - a dichotomizer. Instead of using two
discriminant functions \(g_1\) and \(g_2\) and assigning \(x\) to \(\omega_1\) if \(g_1(x) > g_2(x)\), it is more common
to define a single discriminant function
\(g (x) = g_1(x) - g_2(x)\) (5)
and to use the following decision rule:
Decide \(\omega_1\) if \(g(x) > 0\); otherwise decide \(\omega_2\)
Thus, a dichotomizer can be viewed as a machine that computes a single discriminant
function \(g(x)\) and classifies \(x\) according to the algebraic sign of the result. Of the various
forms in which the minimum-error-rate discriminant function can be written, the following
two (derived from Eqs. (1) and (5)) are particularly convenient:
\(g(x)=P\left(\omega_{1} \mid x\right)-P\left(\omega_{2} \mid x\right)\) (6)
\(g(x)=\ln \frac{p\left(x \mid \omega_{1}\right)}{p\left(x \mid \omega_{2}\right)}+\ln \frac{P\left(\omega_{1}\right)}{P\left(\omega_{2}\right)}\) (7)
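The sketch below illustrates the dichotomizer of Eqs. (5)-(7): it computes \(g(x)\) as the log likelihood ratio plus the log prior ratio and decides from the algebraic sign of the result. The Gaussian class-conditional densities and the equal priors are assumptions made only for this example.

```python
# A minimal sketch of the dichotomizer in Eqs. (5)-(7).
from math import exp, log, pi, sqrt

def gaussian_pdf(x, mean, std):
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2.0 * pi))

P_w1, P_w2 = 0.5, 0.5   # hypothetical equal priors

def g(x):
    """Eq. (7): g(x) = ln[p(x|w_1)/p(x|w_2)] + ln[P(w_1)/P(w_2)]."""
    return (log(gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 2.0, 1.0))
            + log(P_w1 / P_w2))

def decide(x):
    # Decide w_1 when g(x) > 0, otherwise w_2 -- the sign carries the decision.
    return "w_1" if g(x) > 0 else "w_2"

# Points nearer to each class mean fall on opposite sides of g(x) = 0.
print(decide(0.2), decide(1.8))
```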
Multi-Category Classification
Let \(\omega_1, \ldots, \omega_c\) be the finite set of c states of nature. Let the feature vector \(x\) be a d-component vector-valued random variable, and let \(p(x \mid \omega_j)\) be the state-conditional probability density function for \(x\) - the probability density function for \(x\) conditioned on \(\omega_j\) being the true state of nature. As before, \(P(\omega_j)\) describes the prior probability that nature is in state \(\omega_j\). Then the posterior probability \(P(\omega_j \mid x)\) can be computed from \(p(x \mid \omega_j)\) by Bayes' formula:
\(P\left(\omega_{j} \mid x\right)=\frac{p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)}{p(x)}\) (8)
where the evidence is now
\(p(x)=\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)\) (9)
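As a sketch of Eqs. (8) and (9) with c categories and a d-component feature vector, the example below assumes, purely for illustration, that each class-conditional density factors into unit-variance univariate Gaussians centered on a class-specific mean vector; the mean vectors and priors are hypothetical.

```python
# A sketch of Eqs. (8) and (9) for c = 3 categories and a 2-component feature vector.
from math import exp, pi, sqrt

def class_density(x, mean):
    """p(x | w_j) under the illustrative independent unit-variance Gaussian assumption."""
    density = 1.0
    for xi, mi in zip(x, mean):
        density *= exp(-0.5 * (xi - mi) ** 2) / sqrt(2.0 * pi)
    return density

means = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]   # one hypothetical mean vector per category
priors = [0.5, 0.3, 0.2]                        # P(w_1), P(w_2), P(w_3)

def posteriors(x):
    joint = [class_density(x, m) * P for m, P in zip(means, priors)]  # p(x|w_j) P(w_j)
    evidence = sum(joint)                                             # Eq. (9)
    return [j / evidence for j in joint]                              # Eq. (8)

print(posteriors((0.5, 0.5)))   # the three posteriors sum to 1
```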
One standard way to represent a classifier is through a set of discriminant functions \(g_i(x)\), \(i = 1, \ldots, c\): the classifier assigns \(x\) to category \(\omega_i\) whenever \(g_i(x) > g_j(x)\) for all \(j \neq i\). A Bayes classifier is easily and naturally represented in this way. For the minimum-error-
rate case, we can simplify things further by taking \(g_i(x)=P(\omega_i \mid x)\), so that the maximum
discriminant function corresponds to the maximum posterior probability.
Clearly, the choice of discriminant functions is not unique. We can always multiply all the
discriminant functions by the same positive constant or shift them by the same additive
constant without influencing the decision. More generally, if we replace every \(g_i(x)\) by \(f(g_i(x))\), where \(f(\cdot)\) is a monotonically increasing function, the resulting classification is
unchanged. This observation can lead to significant analytical and computational
simplifications. In particular, for minimum-error-rate classification, any of the following
choices gives identical classification results, but some can be much simpler to understand or compute than others:
\(g_{i}(x)=P\left(\omega_{i} \mid x\right)=\frac{p\left(x \mid \omega_{i}\right) P\left(\omega_{i}\right)}{\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)}\) (10)
\(g_{i}(x)=p\left(x \mid \omega_{i}\right) P\left(\omega_{i}\right)\) (11)
\(g_{i}(x)=\ln p\left(x \mid \omega_{i}\right)+\ln P\left(\omega_{i}\right)\) (12)
where \(\ln\) denotes the natural logarithm.
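A quick numerical check of the claim that Eqs. (10)-(12) give identical classification results: the three discriminant forms differ only by a common normalization or by a monotonic (logarithmic) transformation, so they select the same category. The likelihood and prior values below are hypothetical.

```python
# The three discriminant choices in Eqs. (10)-(12) rank categories identically.
from math import log

likelihoods = [0.05, 0.20, 0.10]   # p(x | w_i) at the observed x (hypothetical)
priors = [0.5, 0.3, 0.2]           # P(w_i)
evidence = sum(l * P for l, P in zip(likelihoods, priors))

g_posterior = [l * P / evidence for l, P in zip(likelihoods, priors)]   # Eq. (10)
g_product = [l * P for l, P in zip(likelihoods, priors)]                # Eq. (11)
g_log = [log(l) + log(P) for l, P in zip(likelihoods, priors)]          # Eq. (12)

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# All three forms select the same category index.
assert argmax(g_posterior) == argmax(g_product) == argmax(g_log)
print(argmax(g_posterior))   # index of the chosen category
```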