Probabilistic Language Modeling for Sentiment Analysis
The term "language model" originated in the speech
recognition community, where it refers to a probability distribution that
represents the statistical regularities governing the generation of a
language. In other words, a language model is a probabilistic
function that assigns a probability mass to a string drawn from some
vocabulary. In the context of Information Retrieval (IR), a language
model \(M_d\) is used to estimate the probability that a document \(d\) generates a
query \(q\). In particular, such a probabilistic inference is used to
mimic the concept of the "relevance" of \(d\) with respect to \(q\). The basic
unigram language model is defined according to the following formulas:
\( P(q \mid d) \propto P\left(q \mid M_d\right)=\prod_{t \in q} P\left(t \mid M_d\right) \) (1)
\( P\left(t \mid M_d\right)=(1-\lambda) P_{ML}\left(t \mid M_d\right)+\lambda P_{ML}\left(t \mid M_C\right) \) (2)
\( P_{M L}\left(t \mid M_d\right)=\frac{t f(t, d)}{|d|} \) (3)
where \(M_d\) is the language
model of the document \(d\) and \(M_C\) is the language model of the entire
document collection. With Jelinek-Mercer smoothing, the
probability that the document generates a query term \(t\) (i.e., \(P(t \mid M_d)\)) is estimated
by interpolating the maximum likelihood model of the document, \(P_{ML}(t \mid M_d)\), with the maximum likelihood
model of the entire collection, \(P_{ML}(t \mid M_C)\). \( \lambda \) is the Jelinek-Mercer smoothing
parameter. The smoothing process is used to alleviate the problem
of over-estimating the probabilities of query terms found in a
document and the problem of under-estimating the probabilities of terms
not found in the document. The function \(tf(t,d)\) returns the term frequency of
term \(t\) in the document \(d\), and \(|d|\) is the document length measured by the number
of tokens contained in the document.
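As a minimal sketch of Eqs. (1)-(3), the Jelinek-Mercer smoothed query likelihood can be computed as follows (the function name, the toy tokens, and the default \(\lambda = 0.5\) are illustrative assumptions, not part of the paper):

```python
from collections import Counter

def jm_score(query_terms, doc_tokens, collection_tokens, lam=0.5):
    """Query likelihood P(q | M_d) with Jelinek-Mercer smoothing.

    Eq. (2): mixes the document's maximum-likelihood model with the
    collection's maximum-likelihood model using the weight `lam`.
    """
    doc_tf = Counter(doc_tokens)   # missing terms count as 0
    col_tf = Counter(collection_tokens)
    d_len = len(doc_tokens)
    c_len = len(collection_tokens)
    score = 1.0
    for t in query_terms:
        p_ml_d = doc_tf[t] / d_len          # Eq. (3): tf(t, d) / |d|
        p_ml_c = col_tf[t] / c_len          # collection ML estimate
        score *= (1 - lam) * p_ml_d + lam * p_ml_c   # Eq. (2)
    return score                            # Eq. (1): product over query terms
```

Because the collection model assigns mass to every vocabulary term, a query term absent from the document still receives a non-zero probability, which is exactly the under-estimation problem smoothing addresses.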
However, previous studies
found that applying the probabilities of query-related terms from a
relevant context, instead of the probabilities of the individual query
terms estimated over the entire document collection (i.e., a general
product review context), to a document language model leads to a
more effective smoothing process, and hence to better IR performance. Following a similar idea, we develop an inferential
language model to compute the probability that a document \(d\) (e.g., a
product review) will generate a term \(t\) found in a Sentiment Lexicon (SL).
In order to ensure a more robust and effective smoothing process, the
inferential language model can take into account terms (opinion
evidences) associated with the opinion indicators in a relevant online
review context. In particular, the associated opinion evidences are
discovered based on the context-sensitive text mining process over an
online review context. The inferential language model for
context-sensitive opinion scoring is then defined as follows.
\( P(SL \mid d) \propto P\left(SL \mid M_d\right)=\prod_{t \in SL} P\left(t \mid M_d\right) \) (4)
\( P\left(t \mid M_d\right)=(1-\lambda) P_{ML}\left(t \mid M_d\right)+\lambda P_{INF}\left(t \mid M_d\right) \) (5)
\( P_{INF}\left(t \mid M_d\right)=\tanh \left(\sum_{\left(t \rightarrow t^{\prime}\right) \in OE} P\left(t \rightarrow t^{\prime}\right) \cdot P_{ML}\left(t^{\prime} \mid M_d\right)\right) \) (6)
where \(P(SL \mid d)\)
is the document language model for estimating the probability that
the document \(d\) will generate the opinion indicators defined in a
sentiment lexicon (SL). However, to address the common problem that
sentiment lexicons may not capture all possible sentiments of a problem
domain (e.g., context-sensitive opinion evidences are missing), the
proposed language model can take into account other opinion evidences
contained in the document by means of the inferential language model \(P_{INF}(t \mid M_d)\).
The set of context-sensitive opinion evidences \(OE\) is dynamically generated
according to a context-sensitive text mining technique.
The term
association (term inference) of the form \( t \rightarrow t' \) is applied to the inferential
language model to compute the probability that a document generates a
term (e.g., an opinion indicator) which is contextually associated with
another opinion indicator captured in a sentiment lexicon. For ease
of implementation, we only include the top \(x\) term associations captured
in \(OE\) for each opinion indicator \(t\). It should be noted that the inference
that \(d\) generates \(t'\) involves a certain degree of uncertainty. As a result,
the maximum likelihood estimate \(P_{ML}(t' \mid M_d)\) is moderated by the factor \(P(t \rightarrow t')\). The
hyperbolic tangent function is applied to moderate the probability
function \(P_{INF} \; (t|M_d)\) such that its values fall in the unit interval.
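Eqs. (5)-(6) can be sketched along the same lines. The function name, the dictionary encoding of the mined association set \(OE\) (opinion indicator \(\to\) list of \((t', P(t \rightarrow t'))\) pairs), and the toy review tokens are illustrative assumptions; the paper does not prescribe a data structure for \(OE\):

```python
import math
from collections import Counter

def inferential_term_prob(t, doc_tokens, oe, lam=0.5):
    """P(t | M_d) under Eq. (5): the document ML estimate interpolated
    with the inferential estimate P_INF of Eq. (6).

    `oe` maps an opinion indicator t to its mined term associations,
    a list of (t_prime, assoc_prob) pairs drawn from the set OE.
    """
    tf = Counter(doc_tokens)
    d_len = len(doc_tokens)
    p_ml = tf[t] / d_len                     # Eq. (3) applied to t
    # Eq. (6): weight each associated term's ML estimate by P(t -> t'),
    # then squash the sum with tanh so the value stays in [0, 1).
    p_inf = math.tanh(sum(w * tf[tp] / d_len for tp, w in oe.get(t, [])))
    return (1 - lam) * p_ml + lam * p_inf    # Eq. (5)
```

Note that a lexicon term absent from the review (e.g., "good") can still receive probability mass when its mined evidences (e.g., "crisp", "sharp") do occur in the document, which is the point of the inferential smoothing step.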