Uncertainty in Big Data Analytics
Uncertainty Perspective of Big Data Analytics
This section examines the impact of uncertainty on three AI techniques for big data analytics. Specifically, we focus on ML, NLP, and CI, although many other analytics techniques exist. For each presented technique, we examine the inherent uncertainties and discuss methods and strategies for their mitigation.
Machine learning and big data
When dealing with data analytics, ML is generally used to create models for prediction and knowledge discovery to enable data-driven decision-making. Traditional ML methods are not computationally efficient or scalable enough to handle both the characteristics of big data (e.g., large volumes, high speeds, varying types, low value density, incompleteness) and uncertainty (e.g., biased training data, unexpected data types). Several advanced ML techniques commonly proposed for big data analysis include feature learning, deep learning, transfer learning, distributed learning, and active learning. Feature learning comprises a set of techniques that enables a system to automatically discover, from raw data, the representations needed for feature detection or classification; the performance of ML algorithms is strongly influenced by the choice of data representation. Deep learning algorithms are designed to analyze and extract valuable knowledge from massive amounts of data collected from various sources (e.g., separating variations within an image, such as lighting, materials, and shapes); however, current deep learning models incur a high computational cost. Distributed learning can mitigate the scalability problem of traditional ML by carrying out calculations on data sets distributed among several workstations to scale up the learning process. Transfer learning is the ability to apply knowledge learned in one context to new contexts, effectively improving a learner in one domain by transferring information from a related domain. Active learning refers to algorithms that employ adaptive data collection (i.e., processes that automatically adjust parameters to collect the most useful data as quickly as possible) in order to accelerate ML activities and overcome labeling problems.
The uncertainty challenges of ML techniques can be mainly attributed to learning from data with low veracity (i.e., uncertain and incomplete data) and data with low value (i.e., unrelated to the current problem). We found that, among the ML techniques, active learning, deep learning, and fuzzy logic theory are uniquely suited to support the challenge of reducing uncertainty, as shown in Fig. 3. Uncertainty can impact ML in terms of incomplete or imprecise training samples, unclear classification boundaries, and rough knowledge of the target data. In some cases, the data is available without labels, which becomes a challenge: manually labeling large data collections is an expensive and strenuous task, yet learning from unlabeled data is difficult, as classifying data with unclear guidelines yields unclear results. Active learning addresses this issue by selecting a subset of the most important instances for labeling. Deep learning is another learning method that can handle incompleteness and inconsistency issues in the classification procedure. Fuzzy logic theory has also been shown to model uncertainty efficiently. For example, in fuzzy support vector machines (FSVMs), a fuzzy membership is applied to each input point of the support vector machine (SVM); the learning procedure then gains the flexibility provided by fuzzy logic, improving the SVM by reducing the effect of noise in the data points. Hence, while uncertainty is a notable problem for ML algorithms, incorporating effective techniques for measuring and modeling uncertainty can lead to systems that are more flexible and efficient.
Fig. 3
How ML techniques handle uncertainty in big data
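As an illustration of the active learning strategy described above, the following minimal sketch uses uncertainty sampling: a classifier trained on a small labeled seed set scores the unlabeled pool and selects the instances it is least certain about for labeling. The synthetic data set, model choice, and batch size are illustrative assumptions, not prescriptions from the surveyed work.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic pool; in practice this would be a large, mostly unlabeled corpus.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

labeled = rng.choice(len(X), size=20, replace=False)            # small labeled seed set
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)            # pool awaiting labels

for _ in range(5):                                              # a few active-learning rounds
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # Uncertainty sampling: pick the pool instances whose top-class probability is lowest,
    # i.e., the points the current model is least sure how to classify.
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    query = unlabeled[np.argsort(uncertainty)[-10:]]            # 10 most uncertain instances

    # In a real system an annotator would now label `query`; here the labels are already known.
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
```

Each round spends the labeling budget on the points where the model's uncertainty is highest, which is the core idea behind using active learning to cope with largely unlabeled big data.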
Natural language processing and big data
NLP is a technique grounded in ML that enables devices to analyze, interpret, and even generate text. NLP and big data analytics tackle huge amounts of text data and can derive value from such datasets in real time. Some common NLP methods include lexical acquisition (i.e., obtaining information about the lexical units of a language), word sense disambiguation (i.e., determining which sense of a word is used in a sentence when the word has multiple meanings), and part-of-speech (POS) tagging (i.e., determining the function of words by labeling categories such as verb, noun, etc.). Several NLP-based techniques have been applied to text mining, including information extraction, topic modeling, text summarization, classification, clustering, question answering, and opinion mining. For example, financial and fraud investigations may involve finding evidence of a crime in massive datasets. NLP techniques (particularly named entity extraction and information retrieval) can help manage and sift through huge amounts of textual information, such as criminal names and bank records, to support fraud investigations. Moreover, NLP techniques can help to create new traceability links and recover traceability links (i.e., missing or broken links at run-time) by finding semantic similarity among available textual artifacts. Furthermore, NLP and big data can be used to analyze news articles and predict rises and falls of the composite stock price index.
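The POS tagging and named entity extraction steps mentioned above can be sketched with an off-the-shelf pipeline. The example below uses spaCy with its small English model as an assumed environment; the library, model, and sample sentence are illustrative and not those used in the cited studies.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Acme Bank transferred $2.4 million to an offshore account in March.")

# Part-of-speech tagging: label each token with its grammatical category.
for token in doc:
    print(token.text, token.pos_, token.tag_)

# Named entity extraction: surface candidate organizations, amounts, and dates,
# the kinds of entities sifted from bank records in a fraud investigation.
for ent in doc.ents:
    print(ent.text, ent.label_)
```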
Uncertainty impacts NLP in big data in a variety of ways. For example, keyword search is a classic approach in text mining that is used to handle large amounts of textual data. Keyword search accepts as input a list of relevant words or phrases and searches the desired set of data (e.g., a document or database) for occurrences of the relevant words (i.e., search terms). Uncertainty can impact keyword search, as the presence of a keyword in a document is no assurance of the document's relevance. For example, a keyword search usually matches exact strings and ignores words with spelling errors that may still be relevant. Boolean operators and fuzzy search technologies permit greater flexibility in that they can be used to search for words similar to the desired spelling. Although keyword or key phrase search is useful, limited sets of search terms can miss key information; in comparison, using a wider set of search terms can produce a large set of 'hits' containing many irrelevant false positives. Another example of uncertainty impacting NLP involves automatic POS taggers that must handle the ambiguity of certain words (Fig. 4) (e.g., the word "bimonthly" can mean twice a month or every two months depending on the context, and the word "quite" carries different meanings for American and British audiences), as well as classification problems due to the ambiguity of periods ('.'), which can be interpreted as part of a token (e.g., an abbreviation), as punctuation (e.g., a full stop), or both. Although recent research indicates that using IBM Content Analytics (ICA) can mitigate these problems, open issues remain when applying it to large-scale data. Uncertainty and ambiguity also impact POS tagging of biomedical language, which is quite different from general English; increased uncertainty and insufficient tagging accuracy have been reported when taggers trained on the Treebank corpus are applied to biomedical data. To this end, stream processing systems deal with high data throughput while achieving low response latencies. Integrating NLP techniques with uncertainty modeling, such as fuzzy and probabilistic sets, into big data analytics may support handling big textual data in real time; however, additional work is necessary in this area.
Fig. 4
Words with more than one POS tag (ambiguity)
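The gap between exact keyword matching and fuzzy search described above can be sketched with Python's standard library. The document snippets, misspelling, and cutoff value below are illustrative assumptions; a production system would use a dedicated search engine or edit-distance index.

```python
from difflib import get_close_matches

documents = {
    "doc1": "wire transfer to offshore acount flagged for review",  # note the misspelling "acount"
    "doc2": "quarterly board meeting minutes",
}
search_term = "account"

for doc_id, text in documents.items():
    words = text.split()
    exact = search_term in words                                     # exact keyword match misses typos
    fuzzy = get_close_matches(search_term, words, n=1, cutoff=0.8)   # tolerates spelling errors
    print(doc_id, "exact:", exact, "fuzzy:", fuzzy)
```

The fuzzy match recovers the misspelled occurrence that an exact keyword search would miss, at the cost of potentially admitting irrelevant near-matches, mirroring the false-positive trade-off noted above.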
Computational intelligence and big data
CI includes a set of nature-inspired computational techniques that play an important role in big data analysis. CI techniques have been used to tackle complicated data processes and analytics challenges such as high complexity and uncertainty, as well as processes where traditional techniques are not sufficient. Common techniques currently available in CI are evolutionary algorithms (EAs), artificial neural networks (ANN), and fuzzy logic, with examples ranging from search-based problems such as parameter optimization to optimizing a robot controller.
CI techniques are suitable for dealing with the real-world challenges of big data as they are fundamentally capable of handling significant amounts of uncertainty. For example, generating models that predict users' emotions is one problem with many potential sources of uncertainty. Such models deal with large databases of information relating to human emotion and its inherent fuzziness. Many challenges still exist in current CI techniques, especially when dealing with the value and veracity characteristics of big data. Accordingly, there is great interest in developing new CI techniques that can efficiently address massive amounts of data and quickly respond to modifications in the dataset. Big data analysis can be optimized by employing algorithms such as swarm intelligence, AI, and ML. These techniques are used for training machines to perform predictive analysis tasks, collaborative filtering, and building empirical statistical predictive models. By using CI-based big data analytics solutions, it is possible to reduce the complexity and uncertainty of processing massive volumes of data and to improve analysis results.
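As one example of the swarm intelligence algorithms mentioned above, the sketch below implements a bare-bones particle swarm optimizer minimizing a toy objective. The objective function, swarm size, and coefficient values are illustrative assumptions; in a big data setting the objective would typically be a model's validation error or another analytics quality metric.

```python
import numpy as np

def objective(x):
    # Toy stand-in for an analytics quality metric (e.g., model validation error).
    return np.sum(x ** 2, axis=1)

rng = np.random.default_rng(0)
n_particles, n_dims = 30, 5
w, c1, c2 = 0.7, 1.5, 1.5                        # inertia and attraction coefficients

pos = rng.uniform(-5, 5, size=(n_particles, n_dims))
vel = np.zeros_like(pos)
pbest = pos.copy()                               # each particle's best position so far
pbest_val = objective(pbest)
gbest = pbest[np.argmin(pbest_val)]              # best position found by the whole swarm

for _ in range(100):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel

    val = objective(pos)
    improved = val < pbest_val                   # update personal bests
    pbest[improved], pbest_val[improved] = pos[improved], val[improved]
    gbest = pbest[np.argmin(pbest_val)]          # update the swarm's global best

print("best objective value:", objective(gbest[None, :])[0])
```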
To support CI, fuzzy logic provides an approach for approximate reasoning and modeling of qualitative data, addressing uncertainty challenges in big data analytics through linguistic quantifiers (i.e., fuzzy sets). It represents uncertain real-world and user-defined concepts through interpretable fuzzy rules that can be used for inference and decision-making. Big data analytics also faces challenges due to noise in the data, where the data carries high degrees of uncertainty and outlier artifacts. Iqbal et al. have demonstrated that fuzzy logic systems can efficiently handle such inherent uncertainties in the data. In another study, fuzzy logic-based matching algorithms and MapReduce were used to perform big data analytics for clinical decision support; the developed system demonstrated great flexibility and could handle data from various sources. Another useful CI technique for tackling the challenges of big data analytics is EAs, which discover the optimal solution(s) to a complex problem by mimicking the evolutionary process of gradually developing a population of candidate solutions. Since big data is characterized by high volume, high variety, and low veracity, EAs are excellent tools for analyzing such datasets. For example, applying parallel genetic algorithms to medical image processing has yielded effective results in a Hadoop-based system. However, the results of CI-based algorithms may be impacted by motion, noise, and unexpected environments. Moreover, an algorithm that can deal with one of these problems may function poorly when impacted by multiple factors.
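To make the notion of linguistic quantifiers concrete, the sketch below defines triangular fuzzy sets for a hypothetical "data veracity" variable and computes membership degrees for a single observation. The variable, breakpoints, and labels are illustrative assumptions rather than values from the cited studies.

```python
import numpy as np

def triangular(x, a, b, c):
    # Triangular membership function: 0 outside [a, c], peaking at 1 when x == b.
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Hypothetical linguistic quantifiers for a "data veracity" score in [0, 100].
fuzzy_sets = {
    "low":    lambda x: triangular(x, -1, 0, 50),
    "medium": lambda x: triangular(x, 25, 50, 75),
    "high":   lambda x: triangular(x, 50, 100, 101),
}

score = 62.0
memberships = {label: float(f(score)) for label, f in fuzzy_sets.items()}
print(memberships)   # the score is partly "medium" and partly "high" at the same time
```

A crisp veracity score thus belongs to several linguistic categories to different degrees, which is what allows fuzzy rules to reason over gradations of uncertainty instead of hard thresholds.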