Big Data Stream Analytics for Sentiment Analysis: The Big Data Stream Analytics Framework | Saylor Academy

The Big Data Stream Analytics Framework

An overview of the proposed framework, which leverages Big Data Stream Analytics for online Sentiment Analysis (BDSASA), is depicted in Figure 2. The BDSASA framework consists of seven layers: the data stream layer, data pre-processing layer, data mining layer, prediction layer, learning and adaptation layer, presentation layer, and storage layer. For these layers, we apply state-of-the-art techniques for rapid service prototyping. For instance, Storm, the open-source Distributed Data Stream Engine (DDSE) for big data, is applied at the Data Stream Layer to process streaming data fed from dedicated APIs and crawlers; the Topsy API, for example, is used to retrieve product-related comments from Twitter.

The Storage Layer leverages Apache HBase and HDFS for real-time storage and retrieval of large volumes of consumer reviews discussing products and services. The Stanford Dependency Parser and the GATE NER module are applied to build the Data Pre-processing Layer. Our pilot tests show that the multilingual social media data streams amount to between 0.2 and 0.4 gigabytes per day, and this volume is steadily growing. For the feature extraction layer, the Affect Miner utilizes a novel community-based affect intensity measure to predict consumers' moods towards products. Among the six classes commonly used in affect analysis (anger, fear, happiness, sadness, surprise, and neutral), we focus on the anger, fear, sadness, and happiness classes, which are relevant for product sentiment analysis. The Affect Miner uses the WordNet-Affect lexicon, extended by a statistical learning method. Since social media messages are generally noisy, one novelty of our framework is that we reduce the noise of the "affect intensity" measure by processing only those messages that actually relate to consumers' comments about products or services.
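The noise-reduction idea described above (score affect only on messages that actually mention a product) can be illustrated with a minimal sketch. It is not the community-based affect intensity measure itself, whose details are not given here; it uses a plain relative frequency as a stand-in. The toy lexicon, the `product_terms` set, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy affect lexicon; the framework itself uses an extended
# WordNet-Affect lexicon, which is not reproduced here.
AFFECT_LEXICON = {
    "furious": "anger", "angry": "anger",
    "worried": "fear", "afraid": "fear",
    "disappointed": "sadness", "sad": "sadness",
    "delighted": "happiness", "happy": "happiness",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def affect_intensity(messages, product_terms):
    """Share of affect words per class, counted only in product-related messages.

    This relative frequency is an assumed stand-in, not the paper's measure.
    """
    counts = Counter()
    total_tokens = 0
    for msg in messages:
        tokens = tokenize(msg)
        if not product_terms.intersection(tokens):
            continue  # skip messages that never mention the product (noise)
        total_tokens += len(tokens)
        counts.update(AFFECT_LEXICON[t] for t in tokens if t in AFFECT_LEXICON)
    if total_tokens == 0:
        return {}
    return {cls: n / total_tokens for cls, n in counts.items()}
```

Calling `affect_intensity(messages, {"phone"})` ignores any message that does not contain the token "phone", so off-topic emotional chatter does not inflate the scores.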

Previous research employed the HMM method to mine the latent "intents" of actors. We exploit a novel and more sophisticated online generative model and the corresponding distributed Gibbs sampling algorithm to build our Latent Intent Extractor, which predicts the intents of consumers for potential product or service acquisitions. The Sentiment Extractor utilizes well-known sentiment lexicons such as OpinionFinder to extract the sentiment words embedded in consumer reviews. Finally, overall sentiment polarity prediction for consumer reviews is performed based on a novel inferential language modeling method. The computational details of this inferential language modeling method for context-sensitive sentiment analysis will be explained in the next section. The overall sentiment polarity toward a product or a product category is communicated to the user of the system via the presentation layer. Different modes of presentation (e.g., text, graphics, multimedia on desktops or mobile devices) are supported by our framework.
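To make the lexicon lookup step concrete, the following is a minimal sketch of extracting sentiment words from a review. The tiny lexicon is a hypothetical placeholder rather than OpinionFinder data, and the summed score in `naive_polarity` is a deliberately simple baseline, not the inferential language modeling method mentioned above.

```python
import re

# Hypothetical toy lexicon; a real system would load a resource such as
# the OpinionFinder subjectivity lexicon instead.
TOY_LEXICON = {
    "great": 1, "love": 1, "reliable": 1,
    "poor": -1, "broken": -1, "slow": -1,
}

def extract_sentiment_words(review, lexicon=TOY_LEXICON):
    """Return (word, polarity) pairs for lexicon words found in the review."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]

def naive_polarity(review, lexicon=TOY_LEXICON):
    """Sum word polarities; an assumed baseline, not the paper's model."""
    score = sum(p for _, p in extract_sentiment_words(review, lexicon))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A context-insensitive sum like this misreads negation and domain-specific usage, which is the kind of limitation a context-sensitive method is meant to address.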

Figure 2. An overview of the BDSASA framework.


In addition, a novel parallel co-evolutionary genetic algorithm (PCGA) is designed so that the proposed prediction model is equipped with a learning and adaptation mechanism that continuously tunes the whole service with respect to possibly changing features of the problem domain. The PCGA can divide a large search space into several subspaces for a parallel and diversified search, which improves both the efficiency and the effectiveness of the heuristic search process. Each subspace (i.e., a sub-population) is hosted by a separate cluster. Three fundamental decisions are involved in the design of a genetic algorithm (GA): a fitness function, a chromosome encoding, and a procedure that drives the evolution process of chromosomes. First, the fitness function of our PCGA is developed based on a performance metric (e.g., accuracy of sentiment polarity prediction). Second, since various components of the proposed service should be continuously refined, there are multiple sub-populations of chromosomes to be encoded and co-evolved simultaneously. During each evolution cycle, the best chromosome of a sub-population (e.g., prediction features, social media sources, system parameters) is exchanged with that of other sub-populations. Armed with all the essential information, each chromosome of a sub-population represents a feasible prediction, and its fitness can be assessed accordingly.
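The exchange of best chromosomes between sub-populations can be sketched with a simple island-model GA. This is an illustrative simplification under stated assumptions, not the PCGA itself: all islands share one bit-string encoding, the fitness is a stand-in (count of 1-bits) rather than prediction accuracy, the islands run sequentially instead of on separate clusters, and every function name and parameter below is hypothetical.

```python
import random

def fitness(chromosome):
    # Stand-in objective (count of 1-bits); the paper uses a performance
    # metric such as sentiment polarity prediction accuracy.
    return sum(chromosome)

def evolve(population, rng, mutation_rate=0.05):
    """One generation: tournament selection, one-point crossover, bit-flip mutation."""
    best = max(population, key=fitness)
    children = [best[:]]  # elitism: carry the current best over unchanged
    while len(children) < len(population):
        p1 = max(rng.sample(population, 2), key=fitness)
        p2 = max(rng.sample(population, 2), key=fitness)
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
        child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
        children.append(child)
    return children

def migrate(islands):
    """Ring exchange: each island's worst member is replaced by the previous island's best."""
    bests = [max(pop, key=fitness)[:] for pop in islands]
    for i, pop in enumerate(islands):
        worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
        pop[worst] = bests[i - 1]

def run(n_islands=3, pop_size=10, length=16, cycles=30, seed=0):
    """Evolve several sub-populations and return the best chromosome found."""
    rng = random.Random(seed)
    islands = [
        [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
        for _ in range(n_islands)
    ]
    for _ in range(cycles):
        islands = [evolve(pop, rng) for pop in islands]
        migrate(islands)
    return max((c for pop in islands for c in pop), key=fitness)
```

In a parallel deployment, each call to `evolve` would run on its own worker and only the `migrate` step would require communication, which is what keeps the exchange of best chromosomes cheap relative to the search itself.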
