There are some common issues when dealing with big data. Two critical ones are data quality and data variety (such as multiple formats within the same dataset); deep learning techniques, such as dimension reduction, can be used to address these problems. Traditional data models and machine learning methods struggle with these issues, further supporting the case for deep learning, as the former cannot handle complex data within the framework of Big Data.
Using the 7Vs characteristics of big data, assign an issue you recognize in your industry to each of them and think of possible solutions. Additionally, write down the positives and consider whether some of these key points could be applied elsewhere in your industry in a different manner.

2. Disparate Data and Big Data
2.1. Data Quality Problems
Data from a single source is often considered clean. In reality, however, data in its native formats is often "messy" due to its heterogeneity and "dirty" (containing missing, mislabeled, incorrect, or possibly spurious values), which makes it incompatible with other data sources. Table 2 categorizes data quality (veracity) problems by source (single-source vs. multi-source) and by level (schema level vs. instance level).
Table 2. Categories of Data Quality Problems from Data Sources
Single-Source: Schema Level | Single-Source: Instance Level | Multi-Source: Schema Level | Multi-Source: Instance Level
Poor schema design, lack of integrity constraints (e.g., referential integrity, uniqueness, ...) | Data entry errors (e.g., misspellings, contradictory values, duplicates/redundancy, ...) | Heterogeneous schemas (e.g., structural conflicts, naming conflicts, ...) | Overlapping, inconsistent data (e.g., inconsistent timing, inconsistent aggregating, ...)
The first step in handling data quality (veracity) problems is a data cleaning process that deals with typos, missing fields, inconsistent spelling conventions, and so on. Data cleaning is also called data cleansing or scrubbing. It is often not an easy task because data ownership is unclear in many organizations; it is labor-intensive and time-consuming, but it is necessary for successful data mining. Potential methods for filtering big noise can come from classic reliability techniques: for example, data from several independent sources may be more reliable than data from dependent sources.
Data cleaning deals with detecting and removing inconsistencies and errors from data. Misspellings during data entry, missing information, or other invalid data result in severe data quality problems. Data cleaning is especially needed before integrating heterogeneous data sources and should be handled together with schema-related data transformations. Consistent, reliable, and accurate information is often achieved by eliminating duplicate data and consolidating different data representations. Data cleaning can include five complementary procedures: searching for and identifying errors, defining and determining error types, correcting inaccuracies, documenting error types and error examples, and modifying data entry procedures to reduce future errors. Data completeness, formats, rationality, and restrictions should be inspected during the cleaning process.
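A minimal sketch of such inspection steps in base R is shown below; the data frame `df` and its columns (`email`, `age`) are hypothetical placeholders, not part of the text.

```r
# Basic data-cleaning checks in base R; 'df' and its columns are hypothetical.

# Completeness: count missing values per column
colSums(is.na(df))

# Formats: flag email strings that do not match a simple pattern
bad_email <- !is.na(df$email) & !grepl("^[^@]+@[^@]+\\.[^@]+$", df$email)

# Rationality/restrictions: flag values outside a plausible range
bad_age <- !is.na(df$age) & (df$age < 0 | df$age > 120)

# Document the problem records before correcting them
problem_rows <- df[bad_email | bad_age, ]
```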
2.2. Missing Data
Missing data can be categorized into the following three types:
(1) Missing completely at random – Data is missing completely at random (MCAR) if the presence of missing data on a variable is not related to any other observed or unobserved variable.
(2) Missing at random – If the presence of missing data on a variable is related to other observed variables but not to its own unobserved value, then the data is missing at random (MAR).
(3) Not missing at random – If the missing data for a variable is neither MCAR nor MAR, then it is not missing at random (NMAR).
Most methods for handling missing data assume that the data is either MCAR or MAR. In this situation, the mechanism producing the missing data can be ignored, and the relationships of interest are modelled after replacing or deleting the missing data. If the number of records with missing values is small, those records can simply be omitted. However, if there are a large number of variables, even a small proportion of missing data can affect many records. R and its functions can be used to identify missing data in a data set.
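For example, the base R functions below can locate missing values; the data frame `df` is a hypothetical stand-in for the data set under study.

```r
# Identifying missing data with base R functions; 'df' is hypothetical.

sum(is.na(df))             # total number of missing values
colSums(is.na(df))         # missing values per variable
mean(!complete.cases(df))  # proportion of records with at least one missing value

# Inspect the incomplete records before deciding how to treat them
df[!complete.cases(df), ]
```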
Missing values can cause trouble for modelers. Some tools handle missing values by ignoring them; otherwise, suitable replacements must be determined. Data imputation is a research area that seeks to fill in missing values in order to produce improved models. Multiple imputation (MI), which is based on repeated simulations, is often the method of choice for complex missing-value problems. The following are common methods for handling missing data:
- Discarding instances – deleting all instances with at least one missing value and using the remainder. The disadvantage is that discarding data may undermine the reliability of the results obtained from the data, so this method is generally not recommended, although it may be worth trying when the proportion of missing data is small.
- Replacement by the most frequent value or an average value. An effective approach for a categorical attribute is to use the most frequently occurring (non-missing) value. A quick, simple method of filling in unknown continuous values is to use a central value such as the mean or the median. However, the presence of outliers (extreme values) can distort the calculation of the mean, making it unwise to use the mean before checking the distribution of the variable. Therefore, for variables with outliers or skewed distributions, the median is a better choice (see the sketch after this list).
- Preserving standard deviation. When the mean is used as the central value to fill in missing values, the method is called preserving the mean. Preserving the standard deviation is another method; it is better because it provides a less biased estimate of the missing value. The mean is only a measure of centrality, whereas the standard deviation also reflects the variability within a variable's distribution and therefore carries more information about the variable.
- Exploring the relationship between variables to fill in missing data. This can be done with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
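The replacement strategies above can be sketched in R as follows; `df` and its columns `colour` (categorical) and `income` (skewed numeric) are hypothetical, and the commented-out mice call is only one commonly used option for multiple imputation, shown as an illustration.

```r
# Single-imputation sketches corresponding to the strategies above;
# 'df', 'colour', and 'income' are hypothetical names.

# Categorical attribute: replace NAs with the most frequent non-missing value
mode_val <- names(which.max(table(df$colour)))
df$colour[is.na(df$colour)] <- mode_val

# Skewed numeric attribute: the median is more robust to outliers than the mean
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)

# Relationship-based imputation, e.g. multiple imputation via the 'mice'
# package (one common choice, not the only option):
# library(mice)
# imp <- mice(df, m = 5, seed = 1)
# df_complete <- complete(imp)
```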
2.3. Duplicates and Redundancy
Duplicate instances or observations in a data set often result from heterogeneous systems or sources. Duplicates can have a negative effect on the training process of machine learning. R and its functions can be used to detect and remove duplicates in a data set.
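As a brief illustration with base R (the data frame `df` is hypothetical):

```r
# Detecting and removing exact duplicate records with base R; 'df' is hypothetical.

sum(duplicated(df))               # number of duplicated rows
df[duplicated(df), ]              # inspect the duplicated rows
df_dedup <- df[!duplicated(df), ] # keep the first occurrence of each record
# unique(df) is an equivalent shorthand for the last step
```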
An attribute is redundant if it can be "derived" from another attribute or a set of other attributes. A large number of redundant values can slow down or confuse the process of knowledge discovery. For machine learning schemes, redundant attributes or variables can result in less accurate models. Redundant variables exhibit a high degree of (possibly nonlinear) correlation. Removing redundant attributes, especially if there are many of them, improves modeling speed. Some algorithms (e.g., regression and logistic regression) can fail outright in the presence of redundant data. Removing attributes that are strongly correlated with others helps avoid multicollinearity problems that can occur in various models (multicollinearity arises when two or more predictors are highly linearly related to one another). Therefore, steps must be taken to avoid redundancy, in addition to data cleaning, during the data integration process.
Many redundancies can be detected by correlation analysis. Given two variables, correlation analysis measures how strongly one variable relates to the other based on the available data. For numerical variables, the correlation between two variables can be evaluated by computing the correlation coefficient. A strong (positive or negative) correlation between two variables indicates that they contain largely overlapping information, and one of the two should be removed before further analysis.
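A simple way to flag such pairs in base R is sketched below; the data frame `df` and the 0.9 cutoff are illustrative assumptions, not fixed rules.

```r
# Flagging strongly correlated numeric variables; 'df' and the cutoff are
# illustrative assumptions.

num_vars <- df[sapply(df, is.numeric)]
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")

# List variable pairs whose absolute correlation exceeds the cutoff
high <- which(abs(corr_mat) > 0.9 & upper.tri(corr_mat), arr.ind = TRUE)
data.frame(var1 = rownames(corr_mat)[high[, 1]],
           var2 = colnames(corr_mat)[high[, 2]],
           corr = corr_mat[high])
```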
Data reduction can reduce data size by eliminating duplicates, removing redundant attributes, aggregating, and clustering. Attribute subset selection reduces the data set size by removing redundant or irrelevant attributes, thereby achieving dimension reduction. One of the major challenges when processing big data from open sources is how to handle the noise it contains; improving data quality (i.e., filtering out the big noise in big data) is therefore an important issue.
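Principal component analysis (PCA) is one widely used dimension-reduction technique; the sketch below is offered purely as an illustration alongside the attribute subset selection described above, with `df` again a hypothetical data frame.

```r
# Illustrative dimension reduction with PCA (base R); 'df' is hypothetical.

num_vars <- na.omit(df[sapply(df, is.numeric)])
pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)

# Keep the smallest number of components that explain ~95% of the variance
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.95)[1]
reduced <- pca$x[, 1:k, drop = FALSE]
```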
2.4. Variety in Disparate Data
Structured data lends itself to rapid analysis; however, unstructured data sets (emails, audio and video streams, and social media content) make analysis time-consuming or even very difficult. Unstructured data should be correctly categorized, interpreted, and consistently labelled. Major challenges to high-quality and efficient data integration include high data volumes and the heterogeneity caused by the large number of data sources. Nevertheless, high efficiency can be achieved using Big Data analytics platforms for parallel processing and blocking-like techniques.
Successful data analytics depends upon having access to semantically rich data that connects all the relevant information for a given analytical task. However, big data sources of various natures, quality levels, and forms (structured, semi-structured, and unstructured) produce heterogeneous data flows. Moreover, sensed data is often low-level and semantically poor. This makes the data integration process very difficult, because integration is an immense challenge when there is a mix of structured, semi-structured, and unstructured data; new methods for the aggregation and fusion of heterogeneous data sources are therefore needed. The following areas are some suggested strategic focuses of Big Data analytics:
- Modelling and discovering causality in structured data (well understood from a data mining point of view)
- Modelling and discovering causality in unstructured data (poorly understood, but progress is being made in machine learning and artificial intelligence, etc.)
- Integrating unstructured causality models with structured causality models (not well understood, but progress has been made in complex event processing and system dynamics, etc.)