Probability Distributions and their Stories

Discrete distributions

Categorical distribution

Story. A probability is assigned to each of a set of discrete outcomes.
Example. A hen will peck at grain A with probability \(θ_A\), grain B with probability \(θ_B\), and grain C with probability \(θ_C\).
Parameters. The distribution is parametrized by the probabilities assigned to each outcome. We define \(θ_y\) to be the probability assigned to outcome \(y\). The \(θ_y\)'s are the parameters; each lies between zero and one, and together they are constrained by

\(\begin{align}
\sum_y \theta_y = 1
\end{align}\).
Support. If we index the categories with sequential integers from 1 to N, the distribution is supported for integers 1 to N, inclusive.
Probability mass function.

\(\begin{align}
f(y;\{\theta_y\}) = \theta_y
\end{align}\).
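
As a minimal sketch of the definitions above, the PMF can be evaluated by looking up \(θ_y\) for each category, after checking the constraint on the parameters. The function name `categorical_pmf_sketch` and the example values of `theta` are illustrative assumptions, not part of any library.

```python
import numpy as np

def categorical_pmf_sketch(y, theta):
    """PMF of a Categorical distribution with categories indexed 1..N.

    Hypothetical helper for illustration only.
    """
    theta = np.asarray(theta, dtype=float)
    # The parameters must be nonnegative and sum to one
    if (theta < 0).any() or not np.isclose(theta.sum(), 1.0):
        raise ValueError("theta must be nonnegative and sum to one")

    y = np.atleast_1d(y)
    # The PMF is zero outside the support {1, ..., N}
    in_support = (y >= 1) & (y <= len(theta)) & (y == np.floor(y))
    out = np.zeros(y.shape, dtype=float)
    # Categories are 1-based, array indices are 0-based
    out[in_support] = theta[y[in_support].astype(int) - 1]
    return out

theta = [0.2, 0.3, 0.5]  # assumed example values
pmf_values = categorical_pmf_sketch([1, 2, 3, 4], theta)
```

Here `pmf_values` holds \(θ_1\), \(θ_2\), \(θ_3\), and zero for the out-of-support value 4.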

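Usage (with `theta` of length \(N\)). The NumPy and SciPy calls below return zero-based indices, 0 to \(N-1\); Stan's `categorical` returns integers from 1 to \(N\).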

Package	Syntax
NumPy	`np.random.choice(len(theta), p=theta)`
SciPy	`scipy.stats.rv_discrete(values=(range(len(theta)), theta)).rvs()`
Stan	`categorical(theta)`
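
The NumPy entry above can be sketched as follows. This version uses NumPy's `Generator.choice`, which accepts the same `p` argument as `np.random.choice`; the seed, sample size, and values of `theta` are arbitrary assumptions. One is added to the draws so that they match the 1 to N indexing used for the support.

```python
import numpy as np

theta = np.array([0.2, 0.3, 0.5])  # assumed example values
rng = np.random.default_rng(seed=3252)

# choice() draws zero-based indices 0..N-1; add one to get categories 1..N
samples = rng.choice(len(theta), size=10_000, p=theta) + 1

# Fraction of draws in each category; should be close to theta
freqs = np.bincount(samples, minlength=len(theta) + 1)[1:] / len(samples)
```

With many draws, the entries of `freqs` should approach the corresponding entries of `theta`.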

Related distributions.
- The Discrete Uniform distribution is a special case where all \(θ_y\) are equal.
- The Bernoulli distribution is a special case where there are two categories that can be encoded as having outcomes of zero or one. In this case, the parameter for the Bernoulli distribution is the probability of outcome one, \(θ=θ_1=1−θ_0\).
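
Both special cases can be checked numerically. The sketch below builds Categorical distributions with `scipy.stats.rv_discrete` and compares their PMFs with `scipy.stats.bernoulli` and `scipy.stats.randint`. It takes the Bernoulli parameter to be the probability of outcome one, which is SciPy's convention; the probabilities and the number of categories are assumed example values.

```python
import numpy as np
import scipy.stats

# Two categories encoded as outcomes 0 and 1 (assumed example probabilities)
theta_0, theta_1 = 0.25, 0.75
two_cat = scipy.stats.rv_discrete(values=([0, 1], [theta_0, theta_1]))
bern = scipy.stats.bernoulli(theta_1)
k = np.array([0, 1])
bernoulli_match = np.allclose(two_cat.pmf(k), bern.pmf(k))

# N categories, indexed 1..N, all with equal probability
N = 4
y = np.arange(1, N + 1)
equal_cat = scipy.stats.rv_discrete(values=(y, np.full(N, 1 / N)))
unif = scipy.stats.randint(1, N + 1)  # upper bound is exclusive
uniform_match = np.allclose(equal_cat.pmf(y), unif.pmf(y))
```

Both `bernoulli_match` and `uniform_match` should be `True`.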
Notes.
- The scipy.stats module has no built-in Categorical distribution, so it must be constructed manually using scipy.stats.rv_discrete(). The categories need to be encoded by an index. For interactive plotting purposes, below, we need to specify a custom PMF and CDF.
- To sample out of a Categorical distribution, use numpy.random.choice(), specifying the values of \(θ\) using the p kwarg.
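
As a sketch of the manual construction described in these notes, the categories can be encoded as integers and passed to `scipy.stats.rv_discrete` along with their probabilities. The labels follow the hen example above; the probability values, the `name` argument, and the random seed are assumptions for illustration.

```python
import numpy as np
import scipy.stats

labels = np.array(["A", "B", "C"])   # grains from the hen example
theta = np.array([0.2, 0.3, 0.5])    # assumed example probabilities

# Encode the categories with sequential integers 1..N
index = np.arange(1, len(theta) + 1)
dist = scipy.stats.rv_discrete(name="categorical", values=(index, theta))

pmf = dist.pmf(index)   # equals theta
cdf = dist.cdf(index)   # running sum of theta

# Draw encoded samples, then map the indices back to labels
draws = dist.rvs(size=5, random_state=0)
drawn_labels = labels[draws - 1]
```

The resulting object provides `pmf`, `cdf`, and `rvs` methods, so only the mapping between labels and indices has to be handled by hand.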

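import numpy as np

def categorical_pmf(x, θ1, θ2, θ3):
    """PMF for four categories; the fourth probability is 1 - θ1 - θ2 - θ3."""
    thetas = np.array([θ1, θ2, θ3, 1-θ1-θ2-θ3])
    # A negative probability means the parameters are invalid; return NaNs
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    # x holds integer categories 1..4; array indices are zero-based
    return thetas[x-1]

def categorical_cdf_indiv(x, thetas):
    """CDF for four categories, evaluated at a single point x."""
    if x < 1:
        return 0
    elif x >= 4:
        return 1
    else:
        # Sum the probabilities of categories 1 through floor(x)
        return np.sum(thetas[:int(x)])

def categorical_cdf(x, θ1, θ2, θ3):
    """CDF for four categories, evaluated at each point in x."""
    thetas = np.array([θ1, θ2, θ3, 1-θ1-θ2-θ3])
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    return np.array([categorical_cdf_indiv(x_val, thetas) for x_val in x])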
# distribution_plot_app and notebook_url are assumed to be defined earlier in the notebook
import bokeh.io

# Sliders for the first three probabilities; the fourth is fixed by the constraint
params = [dict(name='θ1', start=0, end=1, value=0.2, step=0.01),
          dict(name='θ2', start=0, end=1, value=0.3, step=0.01),
          dict(name='θ3', start=0, end=1, value=0.1, step=0.01)]
app = distribution_plot_app(x_min=1,
                            x_max=4,
                            custom_pmf=categorical_pmf,
                            custom_cdf=categorical_cdf,
                            params=params,
                            x_axis_label='category',
                            title='Discrete categorical')
bokeh.io.show(app, notebook_url=notebook_url)