Probabilistic Language Modeling for Sentiment Analysis

The term "language model" originated in the speech recognition community, where it refers to a probability distribution that captures the statistical regularities of the generation of a language. In other words, a language model is a probabilistic function that assigns a probability mass to a string drawn from some vocabulary. In the context of Information Retrieval (IR), a language model \(M_d\) is used to estimate the probability that a document \(d\) generates a query \(q\). In particular, such a probabilistic inference is used to mimic the notion of the "relevance" of \(d\) with respect to \(q\). The basic unigram language model is defined according to the following formulas:

\( P(q \mid d) \propto P\left(q \mid M_d\right)=\prod_{t \in q} P\left(t \mid M_d\right) \) (1)

\( P\left(t \mid M_d\right)=(1-\lambda) P_{ML}\left(t \mid M_d\right)+\lambda P_{ML}\left(t \mid M_D\right) \) (2)

\( P_{ML}\left(t \mid M_d\right)=\frac{tf(t, d)}{|d|} \) (3)

where \(M_d\) is the language model of the document \(d\), and \(M_D\) is the language model of the entire document collection \(D\). With Jelinek-Mercer smoothing, the probability that the document generates a query term \(t\) (i.e., \(P(t \mid M_d)\)) is estimated by interpolating the maximum likelihood model of the document, \(P_{ML}(t \mid M_d)\), with the maximum likelihood model of the entire collection, \(P_{ML}(t \mid M_D)\), where \(\lambda\) is the Jelinek-Mercer smoothing parameter. Smoothing alleviates both the over-estimation of the probabilities of query terms found in a document and the under-estimation of the probabilities of terms not found in the document. The function \(tf(t,d)\) returns the term frequency of term \(t\) in the document \(d\), and \(|d|\) is the document length measured by the number of tokens contained in the document.
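For concreteness, the following Python sketch (ours, not part of the original formulation; the function name score_query, the tokenized inputs, and the default \(\lambda = 0.5\) are illustrative assumptions) shows how Eqs. (1)-(3) can be computed for a single document.

```python
from collections import Counter

def score_query(query_terms, doc_tokens, collection_tokens, lam=0.5):
    """P(q | M_d) under a unigram language model with Jelinek-Mercer
    smoothing (Eqs. 1-3). Illustrative sketch: in practice one sums
    log-probabilities instead of multiplying raw probabilities, to
    avoid floating-point underflow on long queries."""
    doc_tf, coll_tf = Counter(doc_tokens), Counter(collection_tokens)
    doc_len, coll_len = len(doc_tokens), len(collection_tokens)

    score = 1.0
    for t in query_terms:
        p_ml_d = doc_tf[t] / doc_len    # P_ML(t | M_d), Eq. 3
        p_ml_D = coll_tf[t] / coll_len  # P_ML(t | M_D), collection model
        score *= (1 - lam) * p_ml_d + lam * p_ml_D  # Eq. 2
    return score
```

Because the collection model assigns nonzero mass to any term that occurs somewhere in the collection, a document missing one query term still receives a nonzero (if discounted) score.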

However, previous studies have found that smoothing a document language model with the probabilities of query-related terms drawn from a relevant context, rather than with the probabilities of the individual query terms estimated over the entire document collection (i.e., a general product review context), leads to a more effective smoothing process and hence to better IR performance. Following a similar idea, we develop an inferential language model to compute the probability that a document \(d\) (e.g., a product review) generates a term \(t\) found in a Sentiment Lexicon (SL). To ensure a more robust and effective smoothing process, the inferential language model takes into account terms (opinion evidence) associated with the opinion indicators in a relevant online review context. In particular, the associated opinion evidence is discovered by a context-sensitive text mining process applied to an online review context. The inferential language model for context-sensitive opinion scoring is then defined as follows.

\( P(SL \mid d) \propto P\left(SL \mid M_d\right)=\prod_{t \in SL} P\left(t \mid M_d\right) \) (4)

\( P\left(t \mid M_d\right)=(1-\lambda) P_{M L}\left(t \mid M_d\right)+\lambda P_{I N F}\left(t \mid M_d\right) \) (5)

\( P_{INF}\left(t \mid M_d\right)=\tanh \left(\sum_{\left(t \rightarrow t^{\prime}\right) \in OE} P\left(t \rightarrow t^{\prime}\right) \cdot P_{ML}\left(t^{\prime} \mid M_d\right)\right) \) (6)

where \(P(SL \mid d)\) is the document language model for estimating the probability that the document \(d\) generates the opinion indicators defined in a sentiment lexicon (SL). To address the common problem that a sentiment lexicon may not capture all possible sentiment expressions of a problem domain (e.g., context-sensitive opinion evidence is missing), the proposed language model takes into account other opinion evidence contained in the document by means of the inferential language model \(P_{INF}(t \mid M_d)\). The set of context-sensitive opinion evidence \(OE\) is dynamically generated by a context-sensitive text mining technique.

A term association (term inference) of the form \(t \rightarrow t'\) is applied in the inferential language model to compute the probability that a document generates a term (e.g., an opinion indicator) that is contextually associated with another opinion indicator captured in the sentiment lexicon. For ease of implementation, we include only the top \(x\) term associations captured in \(OE\) for each opinion indicator \(t\). It should be noted that the inference that \(d\) generates \(t'\) involves a certain degree of uncertainty; accordingly, the maximum likelihood estimate \(P_{ML}(t' \mid M_d)\) is weighted by the association probability \(P(t \rightarrow t')\). The hyperbolic tangent function is applied so that the values of \(P_{INF}(t \mid M_d)\) fall in the unit interval, as illustrated in the sketch below.
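The following sketch (again ours; the representation of \(OE\) as a mapping from each opinion indicator to (associated term, association probability) pairs, and the parameters top_x and \(\lambda\), are illustrative assumptions; mining \(OE\) itself is outside its scope) implements Eqs. (4)-(6).

```python
import math
from collections import Counter

def p_ml(term, tf, length):
    """P_ML(term | M_d), Eq. 3."""
    return tf[term] / length if length else 0.0

def p_inf(t, doc_tf, doc_len, oe, top_x=5):
    """P_INF(t | M_d), Eq. 6: tanh-moderated sum over the top-x mined
    term associations t -> t' for the opinion indicator t.
    `oe` maps t to (t_prime, P(t -> t_prime)) pairs (assumed layout)."""
    assocs = sorted(oe.get(t, []), key=lambda a: a[1], reverse=True)[:top_x]
    raw = sum(p_assoc * p_ml(t_prime, doc_tf, doc_len)
              for t_prime, p_assoc in assocs)
    return math.tanh(raw)

def opinion_score(sl_terms, doc_tokens, oe, lam=0.5, top_x=5):
    """P(SL | M_d), Eqs. 4-5: each opinion indicator's maximum likelihood
    estimate is smoothed by the inferential model P_INF."""
    doc_tf, doc_len = Counter(doc_tokens), len(doc_tokens)
    score = 1.0
    for t in sl_terms:
        score *= ((1 - lam) * p_ml(t, doc_tf, doc_len)
                  + lam * p_inf(t, doc_tf, doc_len, oe, top_x))
    return score
```

Since \(\tanh\) maps the nonnegative sum into \([0, 1)\), \(P_{INF}\) behaves as a probability-like quantity, and an opinion indicator absent from the document can still contribute a nonzero factor to the document's score through its mined associations.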