There are some common issues when dealing with big data. Two critical ones are data quality and data variety (such as multiple formats within the same dataset); deep learning techniques, such as dimension reduction, can be used to address these problems. Traditional data models and machine learning methods struggle with these issues, further supporting the case for deep learning, as the former cannot handle complex data within the framework of Big Data.
Using the 7Vs characteristics of big data, assign an issue you recognize in your industry to each of them and think of possible solutions. Additionally, write down the positives and consider whether some of these key points could be applied elsewhere in your industry in a different manner.

2. Disparate Data and Big Data
2.1. Data Quality Problems
Data from a single source is often considered clean. In reality, however, data in its native formats is often "messy" due to its heterogeneity and "dirty" (containing missing, mislabeled, incorrect, or possibly spurious values), which makes it incompatible with other data sources. Table 2 categorizes data quality (veracity) problems by source (single-source vs. multi-source) and by level (schema level vs. instance level).
Table 2. Categories of Data Quality Problems from Data Sources
Single-Source: Schema Level | Single-Source: Instance Level | Multi-Source: Schema Level | Multi-Source: Instance Level
Poor schema design, lack of integrity constraints (e.g., referential integrity, uniqueness, ...) | Data entry errors (e.g., misspellings, contradictory values, duplicates/redundancy, ...) | Heterogeneous schemas (e.g., structural conflicts, naming conflicts, ...) | Overlapping, inconsistent data (e.g., inconsistent timing, inconsistent aggregating, ...)
The first step in handling data quality (veracity) problems is a data cleaning process that deals with typos, missing fields, inconsistent spelling conventions, and so on. Data cleaning is also called data cleansing or scrubbing. It is often not an easy task because data ownership is unclear in many organizations; it is labor-intensive and time-consuming, but it is necessary for successful data mining. Potential methods for filtering big noise can come from classic reliability techniques: for example, data from several independent sources may be more reliable than data from dependent sources.
Data cleaning deals with detecting and removing inconsistencies and errors from data. Misspellings during data entry, missing information, or other invalid data result in severe data quality problems. Data cleaning is especially needed before integrating heterogeneous data sources and should be handled together with schema-related data transformations. Consistent, reliable, and accurate information is often achieved by eliminating duplicate data and consolidating different data representations. Data cleaning can include five complementary procedures: searching for and identifying errors, defining and determining error types, correcting inaccuracies, documenting error types and error examples, and modifying data entry procedures to reduce future errors. Data completeness, formats, rationality, and restrictions should be inspected during the cleaning process.
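A minimal sketch of such inspection steps in base R is shown below; the data frame `df` and its columns (`email`, `age`) are hypothetical placeholders, not part of the text.

```r
# Basic data-cleaning checks in base R; 'df' and its columns are hypothetical.

# Completeness: count missing values per column
colSums(is.na(df))

# Formats: flag email strings that do not match a simple pattern
bad_email <- !is.na(df$email) & !grepl("^[^@]+@[^@]+\\.[^@]+$", df$email)

# Rationality/restrictions: flag values outside a plausible range
bad_age <- !is.na(df$age) & (df$age < 0 | df$age > 120)

# Document the problem records before correcting them
problem_rows <- df[bad_email | bad_age, ]
```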
2.2. Missing Data
Missing data can be categorized into the following three types:
(1) Missing completely at random – Data is missing completely at random (MCAR) if the presence of missing data on a variable is not related to any other observed or unobserved variable.
(2) Missing at random – If the presence of missing data on a variable is related to other observed variables but not to its own unobserved value, then the data is missing at random (MAR).
(3) Not missing at random – If the missing data for a variable is neither MCAR nor MAR, then it is not missing at random (NMAR).
Most methods for handling missing data assume that the data is either MCAR or MAR. In this situation, the mechanism producing the missing data can be ignored, and the relationships of interest are modelled after replacing or deleting the missing data. If the number of records with missing values is small, those records can simply be omitted. However, if there are a large number of variables, even a small proportion of missing data can affect many records. R and its functions can be used to identify missing data in a data set.
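For example, the base R functions below can locate missing values; the data frame `df` is a hypothetical stand-in for the data set under study.

```r
# Identifying missing data with base R functions; 'df' is hypothetical.

sum(is.na(df))             # total number of missing values
colSums(is.na(df))         # missing values per variable
mean(!complete.cases(df))  # proportion of records with at least one missing value

# Inspect the incomplete records before deciding how to treat them
df[!complete.cases(df), ]
```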
Missing values can cause trouble for modelers. Some tools handle missing values by ignoring them; otherwise, suitable replacements must be determined. Data imputation is a research area that seeks to fill in missing values in order to produce improved models. Multiple imputation (MI), which is based on repeated simulations, is often the method of choice for complex missing-value problems. The following are common methods for handling missing data:
- Discarding instances – deleting all instances with at least one missing value and using the remainder. The disadvantage is that discarding data may undermine the reliability of the results obtained from the data, so this method is generally not recommended, although it may be worth trying when the proportion of missing data is small.
- Replacement by the most frequent value or an average value. An effective approach for a categorical attribute is to use the most frequently occurring (non-missing) value. A quick, simple method of filling in unknown continuous values is to use a central value such as the mean or the median. However, the presence of outliers (extreme values) can distort the calculation of the mean, making it unwise to use the mean before checking the distribution of the variable. Therefore, for variables with outliers or skewed distributions, the median is a better choice (see the sketch after this list).
- Preserving standard deviation. When the mean is used as the central value to fill in missing values, the method is called preserving the mean. Preserving the standard deviation is another method; it is better because it provides a less biased estimate of the missing value. The mean is only a measure of centrality, whereas the standard deviation also reflects the variability within a variable's distribution and therefore carries more information about the variable.
- Exploring the relationship between variables to fill in missing data. This can be done with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
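The replacement strategies above can be sketched in R as follows; `df` and its columns `colour` (categorical) and `income` (skewed numeric) are hypothetical, and the commented-out mice call is only one commonly used option for multiple imputation, shown as an illustration.

```r
# Single-imputation sketches corresponding to the strategies above;
# 'df', 'colour', and 'income' are hypothetical names.

# Categorical attribute: replace NAs with the most frequent non-missing value
mode_val <- names(which.max(table(df$colour)))
df$colour[is.na(df$colour)] <- mode_val

# Skewed numeric attribute: the median is more robust to outliers than the mean
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)

# Relationship-based imputation, e.g. multiple imputation via the 'mice'
# package (one common choice, not the only option):
# library(mice)
# imp <- mice(df, m = 5, seed = 1)
# df_complete <- complete(imp)
```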
2.3. Duplicates and Redundancy
Duplicate instances or observations in a data set often result from heterogeneous systems or sources. Duplicates can have a negative effect on the training process of machine learning. R and its functions can be used to detect and remove duplicates in a data set.
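As a brief illustration with base R (the data frame `df` is hypothetical):

```r
# Detecting and removing exact duplicate records with base R; 'df' is hypothetical.

sum(duplicated(df))               # number of duplicated rows
df[duplicated(df), ]              # inspect the duplicated rows
df_dedup <- df[!duplicated(df), ] # keep the first occurrence of each record
# unique(df) is an equivalent shorthand for the last step
```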
An attribute is redundant if it can be "derived" from another attribute or a set of other attributes. A large number of redundant values can slow down or confuse the process of knowledge discovery. For machine learning schemes, redundant attributes or variables can result in less accurate models. Redundant variables exhibit a high degree of (possibly nonlinear) correlation. Removing redundant attributes, especially if there are many of them, improves modeling speed. Some algorithms (e.g., regression and logistic regression) can fail outright in the presence of redundant data. Removing attributes that are strongly correlated with others helps avoid multicollinearity problems that can occur in various models (multicollinearity arises when two or more predictors are highly linearly related to one another). Therefore, steps must be taken to avoid redundancy, in addition to data cleaning, during the data integration process.
Many redundancies can be detected by correlation analysis. Given two variables, correlation analysis measures how strongly one variable relates to the other based on the available data. For numerical variables, the correlation between two variables can be evaluated by computing the correlation coefficient. A strong (positive or negative) correlation between two variables indicates that they contain largely overlapping information, and one of the two should be removed before further analysis.
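A simple way to flag such pairs in base R is sketched below; the data frame `df` and the 0.9 cutoff are illustrative assumptions, not fixed rules.

```r
# Flagging strongly correlated numeric variables; 'df' and the cutoff are
# illustrative assumptions.

num_vars <- df[sapply(df, is.numeric)]
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")

# List variable pairs whose absolute correlation exceeds the cutoff
high <- which(abs(corr_mat) > 0.9 & upper.tri(corr_mat), arr.ind = TRUE)
data.frame(var1 = rownames(corr_mat)[high[, 1]],
           var2 = colnames(corr_mat)[high[, 2]],
           corr = corr_mat[high])
```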
Data reduction can reduce data size by eliminating duplicates, removing redundant attributes, aggregating, and clustering. Attribute subset selection reduces the data set size by removing redundant or irrelevant attributes, thereby achieving dimension reduction. One of the major challenges when processing big data from open sources is how to handle the noise it contains; improving data quality (i.e., filtering out the big noise in big data) is therefore an important issue.
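Principal component analysis (PCA) is one widely used dimension-reduction technique; the sketch below is offered purely as an illustration alongside the attribute subset selection described above, with `df` again a hypothetical data frame.

```r
# Illustrative dimension reduction with PCA (base R); 'df' is hypothetical.

num_vars <- na.omit(df[sapply(df, is.numeric)])
pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)

# Keep the smallest number of components that explain ~95% of the variance
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.95)[1]
reduced <- pca$x[, 1:k, drop = FALSE]
```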
2.4. Variety in Disparate Data
Structured data lends itself to rapid analysis; however, unstructured data sets (emails, audio and video streams, and social media content) make analysis time-consuming or even very difficult. Unstructured data should be correctly categorized, interpreted, and consistently labelled. Major challenges to high-quality and efficient data integration include high data volumes and the heterogeneity caused by the large number of data sources. Nevertheless, high efficiency can be achieved using Big Data analytics platforms for parallel processing and blocking-like techniques.
Successful data analytics depends upon having access to semantically rich data that connects all the relevant information for a given analytical task. However, big data sources of various natures, quality levels, and forms (structured, semi-structured, and unstructured) produce heterogeneous data flows. Moreover, sensed data is often low-level and semantically poor. This makes the data integration process very difficult, because integration is an immense challenge when there is a mix of structured, semi-structured, and unstructured data; new methods for the aggregation and fusion of heterogeneous data sources are therefore needed. The following areas are some suggested strategic focuses of Big Data analytics:
- Modelling and discovering causality in structured data (well understood from a data mining point of view)
- Modelling and discovering causality in unstructured data (poorly understood, but progress is being made in machine learning and artificial intelligence, etc.)
- Integrating unstructured causality models with structured causality models (not well understood, but progress has been made in complex event processing and system dynamics, etc.)