Probability Distributions and their Stories
Discrete distributions
Categorical distribution
- Story. A probability is assigned to each of a set of discrete outcomes.
- Example. A hen will peck at grain A with probability \(θ_A\), grain B with probability \(θ_B\), and grain C with probability \(θ_C\).
- Parameters. The distribution is parametrized by the probabilities assigned to each outcome. We define \(θ_y\) to be the probability assigned to outcome \(y\). The set of \(θ_y\) values constitutes the parameters, which are constrained by
\(\begin{align}
\sum_y \theta_y = 1
\end{align}\).
- Support. If we index the categories with sequential integers from 1 to \(N\), the distribution is supported for integers 1 to \(N\), inclusive.
- Probability mass function.
\(\begin{align}
f(y;\{\theta_y\}) = \theta_y
\end{align}\).
- Usage (with theta of length \(N\))

| Package | Syntax |
|---|---|
| NumPy | np.random.choice(len(theta), p=theta) |
| SciPy | scipy.stats.rv_discrete(values=(range(len(theta)), theta)).rvs() |
| Stan | categorical(theta) |
- Related distributions.
- The Discrete Uniform distribution is a special case where all \(θ_y\) are equal.
- The Bernoulli distribution is a special case where there are two categories that can be encoded as having outcomes of zero or one. In this case, the parameter for the Bernoulli distribution is the probability of the outcome one, \(θ=θ_1=1−θ_0\).
- Notes.
- This distribution must be manually constructed using scipy.stats.rv_discrete() if you are using the scipy.stats module. The categories need to be encoded by an index. For interactive plotting purposes, below, we need to specify a custom PMF and CDF.
- To sample out of a Categorical distribution, use numpy.random.choice(), specifying the values of \(θ\) using the p kwarg.
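The two notes above can be illustrated with a minimal sketch. The probabilities in `theta`, the seed, and the sample size are hypothetical values chosen only for illustration. The sketch uses `Generator.choice` from `np.random.default_rng()` in place of `numpy.random.choice()`; it accepts the same `p` keyword. Categories are passed explicitly so that they are indexed from 1 to \(N\), matching the support described above.

```python
import numpy as np
import scipy.stats

# Hypothetical probabilities for three categories; they must sum to one
theta = np.array([0.2, 0.5, 0.3])

# Encode the categories by the indices 1, 2, ..., N
categories = np.arange(1, len(theta) + 1)

# Sampling with NumPy; the p kwarg carries the values of theta
rng = np.random.default_rng(seed=3252)
samples = rng.choice(categories, size=10_000, p=theta)

# Fraction of samples landing in each category; should be close to theta
freqs = np.array([(samples == k).mean() for k in categories])

# Manual construction with scipy.stats.rv_discrete()
dist = scipy.stats.rv_discrete(values=(categories, theta))
pmf = dist.pmf(categories)  # recovers theta
cdf = dist.cdf(categories)  # running sums of theta
```

Passing `len(theta)` as the first argument instead of `categories`, as in the usage table, would index the categories from 0 to \(N-1\).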
import numpy as np
import bokeh.io

# distribution_plot_app() and notebook_url are assumed to be defined
# elsewhere in this notebook.


def categorical_pmf(x, θ1, θ2, θ3):
    # The fourth probability is fixed by the constraint that all sum to one
    thetas = np.array([θ1, θ2, θ3, 1-θ1-θ2-θ3])
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    # Categories are indexed 1 to 4; cast to int so x is a valid index
    return thetas[np.asarray(x, dtype=int)-1]


def categorical_cdf_indiv(x, thetas):
    # CDF at a single point x for four categories indexed 1 to 4
    if x < 1:
        return 0
    elif x >= 4:
        return 1
    else:
        return np.sum(thetas[:int(x)])


def categorical_cdf(x, θ1, θ2, θ3):
    thetas = np.array([θ1, θ2, θ3, 1-θ1-θ2-θ3])
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    return np.array([categorical_cdf_indiv(x_val, thetas) for x_val in x])


# Sliders for the three free parameters
params = [dict(name='θ1', start=0, end=1, value=0.2, step=0.01),
          dict(name='θ2', start=0, end=1, value=0.3, step=0.01),
          dict(name='θ3', start=0, end=1, value=0.1, step=0.01)]

app = distribution_plot_app(x_min=1,
                            x_max=4,
                            custom_pmf=categorical_pmf,
                            custom_cdf=categorical_cdf,
                            params=params,
                            x_axis_label='category',
                            title='Discrete categorical')
bokeh.io.show(app, notebook_url=notebook_url)
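The PMF and CDF functions above hard-code four categories. As a rough sketch of how the same calculation might look for any number of categories, the hypothetical helpers below (`categorical_pmf_general` and `categorical_cdf_general`, which are not part of the plotting code above) take the full array of probabilities and use a cumulative sum for the CDF. The example values match the slider defaults above, with \(θ_4 = 0.4\).

```python
import numpy as np


def categorical_pmf_general(x, thetas):
    """PMF at the points x for categories indexed 1 to N; zero off the support."""
    thetas = np.asarray(thetas, dtype=float)
    x = np.asarray(x)
    n = len(thetas)
    in_support = (x >= 1) & (x <= n) & (x == np.floor(x))
    # Clip so that every lookup is valid, then zero out points off the support
    idx = np.clip(x.astype(int), 1, n) - 1
    return np.where(in_support, thetas[idx], 0.0)


def categorical_cdf_general(x, thetas):
    """CDF at the points x: the sum of theta_y over all categories y <= x."""
    thetas = np.asarray(thetas, dtype=float)
    # Prepend zero so that entry k holds the probability of a category <= k
    cumulative = np.concatenate(([0.0], np.cumsum(thetas)))
    k = np.clip(np.floor(np.asarray(x)).astype(int), 0, len(thetas))
    return cumulative[k]


# Hypothetical example with four categories
thetas = [0.2, 0.3, 0.1, 0.4]
x = np.array([0, 1, 2, 3, 4, 5])
pmf = categorical_pmf_general(x, thetas)
cdf = categorical_cdf_general(x, thetas)
```

Unlike the functions used for the interactive plot, these helpers do not check that the probabilities are nonnegative and sum to one; that validation would need to be added before using them on untrusted input.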