The amount of data collected is staggering. The article was written in the middle of 2019; how much data is now collected daily? The National Security Agency monitors hot spots for terrorist activity using drone feeds, and it admitted several years ago that analyzing what it had already collected would take decades, yet the collection continues. The key to effective analysis is identifying the most relevant datasets and applying the correct analytic techniques, returning to our mix of art and science. As the article indicates, little research has examined how to remove uncertainty from the value of datasets that grow daily. At least with BI you are typically looking mainly at data created within your firm, which places some limits on the amount and type of data; but in a firm as large as, say, Amazon, imagine the amount of data created every day, not only at the point of purchase but in all of its hundreds (maybe thousands) of automated fulfillment centers around the world. Looking at Fig. 1, the 5 Vs of big data characteristics, think about the challenges of the kinds and amount of data collected daily by your firm. Is it housed in a common system or in different systems depending on the department collecting and using it? How would you characterize its various Vs? Is it manageable? What level and types of uncertainty would you assign to the various datasets you regularly work with?
Background
This section reviews background information on the main characteristics of big data, uncertainty, and the analytics processes that address the uncertainty inherent in big data.
Big data
In May 2011, big data was announced as the next frontier for productivity, innovation, and competition. In 2018, the number of Internet users had grown 7.5% from 2016 to over 3.7 billion people. In 2010, over 1 zettabyte (ZB) of data was generated worldwide, a figure that rose to 7 ZB by 2014. In 2001, the emerging characteristics of big data were defined with three V's (Volume, Velocity, and Variety). Similarly, IDC defined big data using four V's (Volume, Variety, Velocity, and Value) in 2011. In 2012, Veracity was introduced as a fifth characteristic of big data. While many other V's exist, we focus on the five most common characteristics of big data, as illustrated in Fig. 1.
Fig. 1. Common big data characteristics
Volume refers to the massive amount of data generated every second and applies to the size and scale of a dataset. It is impractical to define a universal threshold for big data volume (i.e., what constitutes a 'big dataset') because the time
and type of data can influence its definition. Currently, datasets that reside in the exabyte (EB) or ZB ranges are generally considered big data; however, challenges still exist for datasets in smaller size ranges. For example, Walmart collects
2.5 PB from over a million customers every hour. Such huge volumes of data can introduce scalability and uncertainty problems (e.g., a database tool may not be able to accommodate infinitely large datasets). Many existing data analysis techniques
are not designed for large-scale databases and can fall short when trying to scan and understand the data at scale.
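As a toy illustration of this scalability concern (our own sketch, not code from the article), the snippet below computes a running mean over fixed-size batches so that memory use stays bounded no matter how large the stream grows; the synthetic data source and batch size are assumptions for illustration only.

```python
# Minimal sketch: bounded-memory scanning of an arbitrarily large data stream.
# The simulated data source and batch size are illustrative assumptions.
from typing import Iterable, Iterator, List
import random

def record_stream(n_records: int) -> Iterator[float]:
    """Simulate a data source too large to hold in memory at once."""
    for _ in range(n_records):
        yield random.gauss(mu=100.0, sigma=15.0)

def batched(stream: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Group a stream into fixed-size batches without materializing it."""
    batch: List[float] = []
    for value in stream:
        batch.append(value)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def running_mean(stream: Iterable[float], batch_size: int = 10_000) -> float:
    """Update the mean one batch at a time instead of loading everything."""
    count, total = 0, 0.0
    for batch in batched(stream, batch_size):
        count += len(batch)
        total += sum(batch)
    return total / count

if __name__ == "__main__":
    print(f"approximate mean: {running_mean(record_stream(1_000_000)):.2f}")
```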
Variety refers to the different forms of data in a dataset including structured data, semi-structured data, and unstructured
data. Structured data (e.g., stored in a relational database) is mostly well-organized and easily sorted, but unstructured data (e.g., text and multimedia content) is random and difficult to analyze. Semi-structured data (e.g., NoSQL databases) contains
tags to separate data elements, but enforcing this structure is left to the database user. Uncertainty can manifest when converting between different data types (e.g., from unstructured to structured data), in representing data of mixed data types,
and in changes to the underlying structure of the dataset at run time. From the point of view of variety, traditional big data analytics algorithms face challenges in handling multi-modal, incomplete, and noisy data. Because such techniques (e.g., data mining algorithms) are designed to consider well-formatted input data, they may not be able to deal with incomplete and/or differently formatted input data. This paper focuses on uncertainty with regard to big data analytics; however, uncertainty
can impact the dataset itself as well.
Efficiently analyzing unstructured and semi-structured data can be challenging, as the data under observation comes from heterogeneous sources with a variety of data types and representations. For example, real-world databases are negatively influenced by inconsistent, incomplete, and noisy data. Therefore, a number of data preprocessing techniques, including data cleaning, data integration, and data transformation, are used to remove noise from data. Data
cleaning techniques address data quality and uncertainty problems resulting from variety in big data (e.g., noise and inconsistent data). Such techniques for removing noisy objects during the analysis process can significantly enhance the performance
of data analysis. For example, data cleaning for error detection and correction is facilitated by identifying and eliminating mislabeled training samples, ideally resulting in an improvement in classification accuracy in ML.
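One possible realization of the mislabeled-sample idea (a hedged sketch of our own, not the specific method the literature has in mind) is to flag training points whose given label disagrees with an out-of-fold prediction and drop them before final training; the synthetic dataset, logistic model, and noise rate below are illustrative assumptions.

```python
# Sketch: flag and drop potentially mislabeled training samples using
# out-of-fold predictions (synthetic data and model are assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Inject label noise to mimic mislabeled training samples.
rng = np.random.default_rng(0)
noisy = rng.choice(len(y), size=100, replace=False)
y_noisy = y.copy()
y_noisy[noisy] = 1 - y_noisy[noisy]

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy, cv=5)

suspect = pred != y_noisy          # labels that disagree with the prediction
X_clean, y_clean = X[~suspect], y_noisy[~suspect]
print(f"flagged {suspect.sum()} of {len(y_noisy)} samples as possibly mislabeled")
```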
Velocity comprises
the speed (represented in terms of batch, near-real time, real time, and streaming) of data processing, emphasizing that the speed with which the data is processed must meet the speed with which the data is produced. For example, Internet of Things
(IoT) devices continuously produce large amounts of sensor data. If the device monitors medical information, any delays in processing the data and sending the results to clinicians may result in patient injury or death (e.g., a pacemaker that reports
emergencies to a doctor or facility). Similarly, devices in the cyber-physical domain often rely on real-time operating systems enforcing strict timing standards on execution, and as such, may encounter problems when data provided from a big data
application fails to be delivered on time.
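To make the deadline requirement concrete, the following sketch (our own illustration; the 50 ms budget and simulated sensor stream are assumptions) checks whether each reading is processed within a fixed latency budget and counts the misses that a real-time system would have to mitigate.

```python
# Sketch: enforce a per-reading latency budget on a simulated sensor stream
# (the deadline value and the stream itself are illustrative assumptions).
import random
import time

DEADLINE_SECONDS = 0.050  # illustrative real-time budget per reading

def sensor_stream(n_readings: int):
    for _ in range(n_readings):
        yield {"heart_rate": random.randint(50, 160), "ts": time.monotonic()}

def process(reading: dict) -> bool:
    """Trivial stand-in for real analysis: flag an out-of-range heart rate."""
    return reading["heart_rate"] > 140 or reading["heart_rate"] < 55

missed = 0
for reading in sensor_stream(1000):
    alert = process(reading)
    latency = time.monotonic() - reading["ts"]
    if latency > DEADLINE_SECONDS:
        missed += 1  # in a real system this would trigger mitigation
print(f"readings that missed the deadline: {missed}")
```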
Veracity represents the quality of the data (e.g., uncertain or imprecise data). For example, IBM estimates that poor data quality costs the US economy $3.1 trillion per year. Because data
can be inconsistent, noisy, ambiguous, or incomplete, data veracity is categorized as good, bad, and undefined. Due to the increasingly diverse sources and variety of data, accuracy and trust become more difficult to establish in big data analytics.
For example, an employee may use Twitter to share official corporate information but at other times use the same account to express personal opinions, causing problems with any techniques designed to work on the Twitter dataset. As another example,
when analyzing millions of health care records to determine or detect disease trends, for instance to mitigate an outbreak that could impact many people, any ambiguities or inconsistencies in the dataset can interfere with or decrease the precision of
the analytics process.
Value represents the context and usefulness of data for decision making, whereas the prior V's focus more on representing challenges in big data. For example, Facebook, Google, and Amazon have leveraged the value
of big data via analytics in their respective products. Amazon analyzes large datasets of users and their purchases to provide product recommendations, thereby increasing sales and user participation. Google collects location data from Android users
to improve location services in Google Maps. Facebook monitors users' activities to provide targeted advertising and friend recommendations. These three companies have each become massive by examining large sets of raw data and extracting useful insights to make better business decisions.
Uncertainty
Generally, "uncertainty is a situation which involves unknown or imperfect information". Uncertainty exists in every phase of big data learning and comes from many different sources, such as data collection (e.g., variance in environmental conditions
and issues related to sampling), concept variance (e.g., the aims of analytics do not present similarly) and multimodality (e.g., the complexity and noise introduced with patient health records from multiple sensors include numerical, textual, and
image data). For instance, most of the attribute values relating to the timing of big data (e.g., when events occur/have occurred) are missing due to noise and incompleteness. Furthermore, the proportion of missing links between data points in social networks is approximately 80% to 90%, and the proportion of missing attribute values within patient reports transcribed from doctor diagnoses exceeds 90%. Based on IBM research in 2014, industry analysts estimated that by 2015 80% of the world's data would be uncertain.
Various forms of uncertainty exist in big data and big data analytics that may negatively impact the effectiveness and accuracy of the results. For example, if training data is biased in any way, incomplete, or obtained
through inaccurate sampling, the learning algorithm using corrupted training data will likely output inaccurate results. Therefore, it is critical to augment big data analytic techniques to handle uncertainty. Recently, meta-analysis studies that
integrate uncertainty and learning from data have seen a sharp increase. The handling of the uncertainty embedded in the entire process of data analytics has a significant effect on the performance of learning from big data. Other research also points to two further features of big data: multimodality (very complex types of data) and changed uncertainty, meaning that the modeling and measurement of uncertainty for big data is remarkably different from that of small-size data. There is also a positive correlation between the size of a dataset and the uncertainty of both the data itself and the data processing. For example, fuzzy sets may be applied to model uncertainty in big data to combat vague or incorrect information. Moreover, because the data may contain hidden relationships, the uncertainty is further increased.
Therefore, it is not an easy task to evaluate uncertainty in big data, especially when the data may have been collected in a manner that creates bias. To combat the many types of uncertainty
that exist, many theories and techniques have been developed to model its various forms. We next describe several common techniques.
Bayesian theory assumes a subjective interpretation of probability based on past events/prior knowledge.
In this interpretation, probability is defined as an expression of a rational agent's degrees of belief about uncertain propositions. Belief function theory is a framework for aggregating imperfect data through an information
fusion process when under uncertainty. Probability theory incorporates randomness and generally deals with the statistical characteristics of the input data. Classification entropy measures ambiguity between classes to provide an
index of confidence when classifying. Entropy varies on a scale from zero to one, where values closer to zero indicate more complete classification in a single class, while values closer to one indicate membership among several different classes.
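To make this zero-to-one scale concrete, the small sketch below (our own illustration, not code from the surveyed literature) normalizes the entropy of a vector of class-membership probabilities by the logarithm of the number of classes; the example probability vectors are assumed.

```python
# Sketch: normalized classification entropy on a 0-1 scale (illustrative only).
import math

def classification_entropy(memberships):
    """Entropy of class-membership probabilities, normalized to [0, 1]."""
    k = len(memberships)
    h = -sum(p * math.log(p) for p in memberships if p > 0)
    return h / math.log(k)

print(classification_entropy([0.98, 0.01, 0.01]))  # near 0: confident, single class
print(classification_entropy([0.34, 0.33, 0.33]))  # near 1: spread across classes
```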
Fuzziness is used to measure uncertainty in classes, notably in human language (e.g., good and bad). Fuzzy logic then handles the uncertainty associated with human perception by creating an approximate reasoning mechanism. The methodology
was intended to imitate human reasoning to better handle uncertainty in the real world. Shannon's entropy quantifies the amount of information in a variable to determine the amount of missing information on average in a random source. The
concept of entropy in statistics was introduced into the theory of communication and transmission of information by Shannon. Shannon entropy provides a method of information quantification when it is not possible to measure criteria weights using
a decision-maker. Rough set theory provides a mathematical tool for reasoning on vague, uncertain, or incomplete information. With the rough set approach, concepts are described by two approximations (upper and lower) instead of one precise concept, making such methods invaluable for dealing with uncertain information systems. Probability theory and Shannon's entropy are often used to model imprecise, incomplete, and inaccurate data. Moreover, fuzzy set theory and rough set theory are used for modeling vague or ambiguous data, as shown in Fig. 2.
Fig. 2. Measuring uncertainty in big data
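As a small illustration of how fuzziness represents a vague linguistic class such as "good" (our own sketch; the membership breakpoints are assumptions, not values from the literature), a membership function assigns each element a degree of belonging between 0 and 1 instead of a hard yes/no classification.

```python
# Sketch: a simple fuzzy membership function for the linguistic class "good"
# (the 0.5 and 0.8 breakpoints are illustrative assumptions).
def membership_good(score: float) -> float:
    """Degree (0-1) to which a 0-1 quality score belongs to the fuzzy set 'good'."""
    if score <= 0.5:
        return 0.0
    if score >= 0.8:
        return 1.0
    return (score - 0.5) / 0.3  # gradual transition instead of a hard cutoff

for s in (0.55, 0.7, 0.9):
    print(f"membership of {s} in 'good': {membership_good(s):.2f}")
```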
Evaluating the level of uncertainty is a critical step in big data analytics. Although a variety of techniques exist to analyze big data, the accuracy of the analysis may be negatively affected if uncertainty in the data or in the technique itself is ignored. Uncertainty models such as probability theory, fuzziness, and rough set theory can be used to augment big data analytic techniques to provide more accurate and more meaningful results. Based on previous research, Bayesian models and fuzzy set theory are commonly used for modeling uncertainty and decision-making. Table 1 compares and summarizes the techniques we have identified as relevant, including a comparison between different uncertainty strategies, focusing on probability theory, Shannon's entropy, fuzzy set theory, and rough set theory.
Table 1 Comparison of uncertainty strategies
| Uncertainty models | Features |
| --- | --- |
| Probability theory, Bayesian theory, Shannon's entropy | Powerful for handling randomness and subjective uncertainty where precision is required; capable of handling complex data |
| Fuzziness | Handles vague and imprecise information in systems that are difficult to model; precision not guaranteed; easy to implement and interpret |
| Belief function | Handles situations with some degree of ignorance; combines distinct evidence from several sources to compute the probability of specific hypotheses; considers all evidence available for the hypothesis; ideal for incomplete and highly complex data; mathematically complex but improves uncertainty reduction |
| Rough set theory | Provides an objective form of analysis; deals with vagueness in data; requires minimal information to determine set membership; uses only the information presented within the given data |
| Classification entropy | Handles ambiguity between classes |
Big data analytics
Big data analytics describes the process of analyzing massive datasets to discover patterns, unknown correlations, market trends, user preferences, and other valuable information that previously could not be analyzed with traditional tools. With the formalization of big data's five V characteristics, analysis techniques needed to be reevaluated to overcome their limitations in processing time and space. Opportunities for utilizing big data are growing in the modern world of digital data. The
global annual growth rate of big data technologies and services is predicted to increase about 36% between 2014 and 2019, with the global income for big data and business analytics anticipated to increase more than 60%.
Several advanced data
analysis techniques (i.e., ML, data mining, NLP, and CI) and potential strategies such as parallelization, divide-and-conquer, incremental learning, sampling, granular computing, feature selection, and instance selection can convert big problems to
small problems and can be used to make better decisions, reduce costs, and enable more efficient processing.
With respect to big data analytics, parallelization reduces computation time by splitting large problems into smaller instances of themselves and performing the smaller tasks simultaneously (e.g., distributing the smaller tasks across multiple threads, cores, or processors). Parallelization does not decrease the amount of work performed but rather reduces computation time as the
small tasks are completed at the same point in time instead of one after another sequentially.
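A minimal sketch of this idea (our own example; the workload, chunk count, and worker count are assumptions) distributes independent sub-tasks across processes so they run concurrently; the total work is unchanged, only the wall-clock time shrinks.

```python
# Sketch: distribute independent sub-tasks across processes with multiprocessing
# (the workload and chunk/worker counts are illustrative assumptions).
from multiprocessing import Pool

def partial_sum(bounds):
    """Sub-task: sum one slice of the overall range."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, chunks = 10_000_000, 8
    step = n // chunks
    slices = [(i * step, n if i == chunks - 1 else (i + 1) * step) for i in range(chunks)]

    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, slices))   # sub-tasks run concurrently

    print(f"sum of squares below {n}: {total}")
```

The `if __name__ == "__main__":` guard is needed so the example also runs on platforms that spawn rather than fork worker processes.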
The divide-and-conquer strategy plays an important role in processing big data. Divide-and-conquer consists of three phases: (1) reduce
one large problem into several smaller problems, (2) complete the smaller problems, where the solving of each small problem contributes to the solving of the large problem, and (3) incorporate the solutions of the smaller problems into one large solution
such that the large problem is considered solved. For many years, the divide-and-conquer strategy has been used in massive databases to manipulate records in groups rather than processing all the data at once.
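The toy sketch below (our own illustration; the records and group size are assumptions) walks through the three phases on a simple counting task: split the records into groups, solve each group independently, and merge the partial answers.

```python
# Sketch of the three divide-and-conquer phases on a toy record set
# (the records and group size are illustrative assumptions).
from collections import Counter

records = ["error", "ok", "ok", "warn", "error", "ok", "warn", "ok", "error"]

# (1) Reduce one large problem into several smaller ones: split records into groups.
group_size = 3
groups = [records[i:i + group_size] for i in range(0, len(records), group_size)]

# (2) Complete the smaller problems: count statuses within each group independently.
partial_counts = [Counter(group) for group in groups]

# (3) Incorporate the partial solutions into one large solution: merge the counters.
total = Counter()
for partial in partial_counts:
    total += partial

print(dict(total))  # {'error': 3, 'ok': 4, 'warn': 2}
```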
Incremental learning is a learning approach, popularly used with streaming data, in which the model is continually trained on newly arriving data rather than being trained once on a fixed existing dataset. Incremental learning adjusts the parameters of the learning algorithm over time according to each new input, and each
input is used for training only once.
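As a hedged sketch of this pattern (our own example; the SGD classifier, synthetic data, and batch size are assumptions, not the article's setup), scikit-learn's `partial_fit` updates a model one batch at a time, with each batch seen exactly once.

```python
# Sketch: incremental learning on streaming batches with partial_fit
# (the classifier choice and synthetic stream are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# One underlying data source, delivered to the learner in small batches.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_stream, y_stream = X[:4000], y[:4000]
X_test, y_test = X[4000:], y[4000:]

model = SGDClassifier(random_state=0)
classes = np.unique(y)          # must be declared on the first partial_fit call
batch_size = 200

for start in range(0, len(X_stream), batch_size):
    X_batch = X_stream[start:start + batch_size]
    y_batch = y_stream[start:start + batch_size]
    model.partial_fit(X_batch, y_batch, classes=classes)  # each batch used once

print(f"accuracy on held-out data: {model.score(X_test, y_test):.2f}")
```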
Sampling can be used as a data reduction method for big data analytics for deriving patterns in large data sets by choosing, manipulating, and analyzing a subset of the data. Some research indicates
that obtaining effective results using sampling depends on the data sampling criteria used.
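The short sketch below (our own illustration; the synthetic skewed amounts and sample sizes are assumptions) shows both points: a modest random sample estimates a statistic of the full dataset well, while a biased sampling criterion (taking the first records of a sorted dataset) does not.

```python
# Sketch: sampling as data reduction, and why the sampling criterion matters
# (the synthetic amounts and sample sizes are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
amounts = np.sort(rng.lognormal(mean=3.0, sigma=1.0, size=1_000_000))  # skewed, sorted

full_mean = amounts.mean()
random_sample = rng.choice(amounts, size=10_000, replace=False)   # unbiased criterion
head_sample = amounts[:10_000]                                    # biased criterion

print(f"full data mean:        {full_mean:.2f}")
print(f"random sample mean:    {random_sample.mean():.2f}")  # close to the full mean
print(f"first-10k sample mean: {head_sample.mean():.2f}")    # badly underestimates
```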
Granular computing groups elements from a large space into subsets, or granules, to simplify them. It is an effective approach for defining the uncertainty of objects in the search space, as it reduces a large collection of objects to a smaller search space.
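One common way to form granules (a hedged sketch of our own; k-means, the cluster count, and the random data are assumptions) is to cluster the elements and then reason over cluster representatives, so a query only has to examine the members of one granule rather than the whole space.

```python
# Sketch: forming granules by clustering, then searching over granule
# representatives instead of all points (k-means and the data are assumptions).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(20_000, 5))            # large search space

granulation = KMeans(n_clusters=50, n_init=10, random_state=0).fit(points)
granules = granulation.cluster_centers_          # 50 representatives instead of 20,000

query = rng.normal(size=5)
nearest_granule = int(np.argmin(np.linalg.norm(granules - query, axis=1)))
members = np.flatnonzero(granulation.labels_ == nearest_granule)
print(f"search narrowed from {len(points)} points to {len(members)} in one granule")
```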
Feature selection is a conventional approach to handling big data with the purpose of choosing a subset of relevant features for an aggregated but more precise data representation. Feature selection is a very useful strategy in data mining for preparing large-scale data.
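A minimal sketch of this strategy (our own example; the synthetic dataset, univariate score function, and k=10 are assumptions) keeps only the features most associated with the target, shrinking the representation before further analysis.

```python
# Sketch: keep a small subset of relevant features with univariate selection
# (the synthetic dataset, score function, and k=10 are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(
    n_samples=1000, n_features=100, n_informative=10, random_state=0
)

selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(f"representation reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
print("kept feature indices:", selector.get_support(indices=True))
```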
Instance selection is practical in many ML or data mining tasks as a
major feature in data pre-processing. By utilizing instance selection, it is possible to reduce training sets and runtime in the classification or training phases.
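One common instance-selection heuristic, edited nearest neighbour (shown here as a hedged sketch of our own; the synthetic dataset, label-noise rate, and k=5 are assumptions), drops training instances whose nearest neighbours disagree with their label, shrinking the training set before classification.

```python
# Sketch: edited-nearest-neighbour style instance selection
# (the synthetic dataset and k=5 are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=3000, n_features=20, flip_y=0.1, random_state=0)

# Keep an instance only if its 5 nearest neighbours (excluding itself) agree with its label.
nn = NearestNeighbors(n_neighbors=6).fit(X)                     # 6 = self + 5 neighbours
neighbor_idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop the self column
neighbor_vote = np.array([np.bincount(y[idx]).argmax() for idx in neighbor_idx])
keep = neighbor_vote == y

X_small, y_small = X[keep], y[keep]
print(f"training set reduced from {len(X)} to {len(X_small)} instances")
```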
Managing the costs of uncertainty (both monetary and computational) and the challenges of generating effective models for uncertainty in big data analytics has become key to obtaining robust and performant systems. As such, we examine several open issues concerning the impact of uncertainty on big data analytics in the next section.