3. Big Data Analytics Platforms

3.2. Common Big Data Analytics Tools

Traditionally, data and business analytics are performed using an integrated suite of machine learning and data mining algorithms. These tools provide mechanisms to analyze small- to large-scale data for the business decision-making process. The machine learning algorithms and tools for data analytics can be broadly categorized as follows (a brief illustrative Python sketch follows the list):

i. Clustering and segmentation: Divides a large collection of entities into smaller groups that show some similarity. An example is analyzing a collection of customers to identify smaller segments for targeted marketing.

ii. Classification: Organizes data into predefined classes based on attributes that are either pre-selected by an analyst or identified by a clustering model. An example is using the segmentation model above to determine the segment to which a new customer belongs.

iii. Regression: Discovers relationships between a dependent variable and one or more independent variables, and helps determine how the dependent variable's values change in relation to the independent variables' values. An example is using mobile money subscription data, usage level, transaction type, transaction amount, and geographic location to predict the future penetration of mobile money payments.

iv. Association and itemset mining: Looks for statistically relevant relationships among variables in a large data set. For example, this could help direct digital banking representatives to offer specific incentives to mobile money app users based on usage level, transaction amount, and transaction volume.

v. Similarity and correlation: Used to inform undirected clustering algorithms; similarity-scoring algorithms can determine how alike the entities placed in a candidate cluster are.
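
To make categories (i) and (ii) concrete, the short Python sketch below first segments a handful of customer records with k-means clustering and then assigns a new customer to one of the discovered segments. The use of scikit-learn and the two features (monthly spend and visit count) are illustrative assumptions, not part of any tool discussed later.

```python
# Minimal sketch: customer segmentation (clustering) followed by assignment
# of a new customer to a segment. Assumes Python with scikit-learn;
# the data and feature names are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer: [monthly_spend, monthly_visits]
customers = np.array([
    [120.0, 4], [95.0, 3], [800.0, 20],
    [760.0, 18], [15.0, 1], [22.0, 2],
])

# (i) Clustering/segmentation: discover three customer segments.
segmenter = KMeans(n_clusters=3, n_init=10, random_state=42).fit(customers)
print("Segment labels:", segmenter.labels_)

# (ii) Classification: place a new customer into the nearest existing segment.
new_customer = np.array([[100.0, 5]])
print("New customer segment:", segmenter.predict(new_customer)[0])
```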

However, the sheer volume of big data has rendered traditional data analysis approaches ineffective for processing the amounts of data generated in today's cyber-physical, mobile-connected world. Therefore, various big data tools have recently been proposed and implemented for efficient generation, transmission, processing, storage, and analysis of big data [10]. Big data analytics tools and approaches are shown in Figure 6.

Figure 6. Overview of big data and business analytics in Hadoop.

These tools are continuously updated, and new tools are introduced on a regular basis. There is usually some meaning attached to the names given to the software projects, but no naming criteria are defined, so there is no connection between the names. For example, Flume is named after the water race used in some sawmills to bring logs to the mill, and Pig was named after the fact that pigs eat anything. A few of the most common tools used in a Hadoop cluster are depicted in Figure 7. Other popular tools can be obtained from distribution companies such as Cloudera, Hortonworks, IBM, or MapR.

Figure 7. Common tools used in a Hadoop cluster.

i. MapReduce: A Hadoop distributed programming framework for batch-processing compute jobs, designed around key-value pairs; in Hadoop 1 it was also responsible for resource scheduling and job management, a role now handled by YARN. MapReduce consists of two primary parts: the mapper and the reducer. The mapper filters and transforms data. The data blocks of a large HDFS file are fed into MapReduce; each mapper reads its assigned blocks and processes them, for example by cleaning out dirty data and/or duplicates. It then produces an intermediate output file, which is shuffled across the network to the reducer for the reduce phase. The reducer sorts each file by key and aggregates the records into a larger file, and a final sort on the keys produces the output file (a minimal sketch of this flow appears after this list).

ii. Hive: A SQL-like interface to Hadoop that originated at Facebook. It allows SQL users to apply common SQL commands and familiar relational table structures to create MapReduce jobs without having to know MapReduce. Hive treats the data as if it belonged in tables, allowing table definitions to be created over the top of the data files; it converts queries into MapReduce jobs and organizes metadata about unstructured data into tables (a Hive-style query appears in the sketch after this list).

iii. Pig: A data-flow scripting language originally created at Yahoo. Pig converts scripts into MapReduce jobs. Pig scripts can draw on a shared repository of user-contributed functions called the Piggy Bank, and the Pig schema is optional at runtime.

iv. Flume: A tool that uses agents to move large amounts of data into Hadoop. Flume is well suited to gathering weblogs from multiple sources through these agents; each agent is freestanding and easily connected to others. Flume comes with many connectors, making it fast and easy to build reliable and robust agents, and it is highly scalable across many machines.

v. Sqoop: A tool for moving data into and out of relational database management systems (RDBMSs). The name Sqoop is a combination of SQL and Hadoop. It is a convenient tool for exporting and importing data between any RDBMS and the Hadoop Distributed File System. Sqoop uses both JDBC and a command-line interface, supports parallelization across the cluster, and can be deployed as a MapReduce job to manage the export or import of data.

vi. Apache Spark: Spark is an open source computing framework that can process data on disk and in memory. It is built to run on top of HDFS, is able to use YARN, and is designed to combine SQL, streaming, and complex analytics. Its high-level libraries enable programmers to rapidly write jobs for streaming, machine learning, graph processing, and the R statistical programming language. Spark's fast processing has made it more popular than earlier solutions such as Apache Mahout and MapReduce: in machine learning it runs compute jobs about ten times faster than Apache Mahout, and on large-scale statistical analysis it has been benchmarked at up to a hundred times faster in memory than the same job in MapReduce. Spark is robust and versatile, successfully combining a number of different functions in a single software solution. Spark applications can be written in Java, Scala, and Python, which lets programmers work in a familiar language, and Spark can read any existing Hadoop data file as well as HBase, Cassandra, and many other data sources. Spark scales to around 2,000 nodes, and its ability to scale compute jobs continues to grow (see the sketch following this list).

vii. Oozie: A workflow and coordination tool used in a Hadoop cluster. It runs across the cluster and allows jobs to execute in parallel while waiting for input from other jobs. One of Oozie's advantages is a sophisticated scheduling tool that coordinates jobs waiting on other dependencies within the platform.

viii. HBase: A popular NoSQL columnar database deployed on top of Hadoop. HBase is an Apache project based on Google's Bigtable model of data storage; it is schema-less and provides a column-oriented view of data.

ix. Mahout: A scalable, simple, and extensible machine learning library with Java, Scala, and Python support for building distributed learning algorithms on Hadoop. The current version of Mahout, called Samsara, focuses on a mathematical environment for tasks such as linear algebra, statistical operations, and data structures, using an R-like syntax. Commonly used distributed machine learning algorithms in the Mahout library include singular value decomposition, principal component analysis, collaborative filtering, clustering, and classification. Mahout-Samsara allows users to build their own distributed machine learning algorithms instead of depending on pre-made ones, and Mahout provides comprehensive algorithm suites for both MapReduce and Apache Spark.

x. MLlib: MLlib is an open source machine learning library native to Apache Spark. Its Spark API allows users to develop distributed machine learning algorithms in Java, Scala, Python, and R. The main features of MLlib include easy deployment and faster execution than Mahout's MapReduce-based algorithms, thanks to in-memory computation on Spark's resilient distributed datasets. MLlib also contains a number of machine learning algorithms for large-scale learning, including classification, clustering, topic modeling, model evaluation, distributed linear algebra, and feature transformation (illustrated together with Spark in the sketch after this list).

xi. Apache Tez: An open source platform built on top of YARN for executing tasks expressed as directed acyclic graphs (DAGs). It provides a simplified API in Java and Python for iterative tasks. The Tez platform delivers higher performance than MapReduce and allows Hive and Pig to run complex DAG tasks.

xii. Flink: A distributed platform for stream and batch processing that provides machine learning, Table, and DataSet APIs for creating applications in Java and Scala. It combines the flexibility, scalability, speed, and reliability of distributed MapReduce to analyze big data efficiently, and it can be deployed on a single-node cluster or in an enterprise cloud system.

xiii. Storm: Storm provides a platform for stream and real-time processing. Its basic components are spouts, which act as stream sources (for example, a Twitter streaming API), and bolts, which carry out computational logic and data processing. It supports online machine learning and real-time data analytics and is deployed by organizations such as Twitter, Yahoo, Spotify, and Yelp to process large amounts of real-time data within seconds. Storm runs heterogeneous topologies for different tasks and can be integrated with HBase, HDFS, and Kafka for large-scale data processing and storage. As an open source Apache project, Storm provides a distributed real-time computation system with programming APIs in languages such as Java and Scala, and it can be combined with Hadoop for data integration, end-to-end authentication, and data transfer between Hadoop and relational databases.
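
To illustrate the mapper/reducer flow described in item (i), the following Python sketch implements a word count in the style used with Hadoop Streaming, where the mapper emits key-value pairs and the reducer aggregates values per key after the shuffle and sort. The local driver at the bottom simulates the shuffle so the sketch runs without a cluster; the word-count task and sample lines are illustrative assumptions.

```python
# Minimal sketch of the map -> shuffle/sort -> reduce flow (word count).
# In a real Hadoop Streaming job, the mapper and reducer would be separate
# scripts reading stdin and writing tab-separated key-value pairs to stdout.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: pairs arrive grouped by key; sum the counts per word."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data needs big tools", "hadoop processes big data"]
    shuffled = sorted(mapper(sample), key=itemgetter(0))  # simulate shuffle/sort
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")
```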
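
Items (ii), (vi), and (x) can be sketched together. Assuming a local PySpark installation, the snippet below runs a Hive-style SQL aggregation (executed here by Spark SQL rather than Hive itself) over a small DataFrame and then clusters the same records with MLlib's k-means; the transaction data, column names, and choice of k are illustrative assumptions rather than anything prescribed by these tools.

```python
# Minimal PySpark sketch: a SQL aggregation through Spark SQL, followed by
# k-means clustering with MLlib. Data and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()

# A small DataFrame of hypothetical mobile money transactions.
df = spark.createDataFrame(
    [("airtime", 10.0), ("transfer", 250.0), ("airtime", 15.0), ("transfer", 300.0)],
    ["transaction_type", "amount"],
)

# SQL-style aggregation (Hive would express the same query over HDFS tables).
df.createOrReplaceTempView("transactions")
spark.sql(
    "SELECT transaction_type, AVG(amount) AS avg_amount "
    "FROM transactions GROUP BY transaction_type"
).show()

# MLlib: assemble a feature vector and cluster the transactions with k-means.
features = VectorAssembler(inputCols=["amount"], outputCol="features").transform(df)
model = KMeans(k=2, seed=1).fit(features)
model.transform(features).select("transaction_type", "amount", "prediction").show()

spark.stop()
```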

Apart from the tools listed above, other tools for big data storage, processing, and management include Apache Cassandra, NoSQL databases, ZooKeeper, Avro, Chukwa, and the Python, R, and Scala programming languages. These tools' key functions, features, strengths, and weaknesses are summarized in Table 2 below.

Table 2. Key functions, features, strengths, and weaknesses of big data analytics tools.

Key Function | Tool | Features | Strengths | Weaknesses
Data storage management | HDFS (Hadoop Distributed File System) | Used to store high volumes of data; reliable and fault tolerant. | Data are written once and read many times, with inexpensive data storage. | Lacks the ability to efficiently support random reads of small amounts of data; Hadoop clusters are also difficult to manage.
Big database management | NoSQL | Non-relational database for storage, querying, and management of structured and unstructured data. | Requires no normalization, unions, or joins when porting applications; provides elastic scaling by distributing data across multiple hosts to reduce computation overload. | Considerable complexity, overlap, and constant change, so high expertise is required to implement it.
| HBase | NoSQL column database for data storage and a column-oriented view of data. | Provides a mechanism for storing large datasets on top of HDFS; helps aggregate and analyze billions of rows in less time. | Cross-data operations and joins are difficult to implement; HBase has a single point of failure, and data migration from external RDBMS sources is challenging.
| Cassandra | Apache Cassandra was first developed at Facebook for the analysis of large volumes of data. | Used by a large number of companies to handle large generated datasets; a column-oriented database with high throughput and quick response times. | Does not support database operations such as subqueries, joins, and aggregation; provides limited storage space for single-column values.
| Apache Hive | Used for big data operations such as summarization, querying, and analysis through a SQL-like interface. | Facilitates writing and managing large datasets using an indexing approach. | Not suitable for online transaction processing; does not support database operations such as subqueries, updates, and deletes.
| Sqoop | Tool for importing and exporting large datasets into and from RDBMSs. | Provides a computational off-loading mechanism that reduces data processing time. | Change operations are complicated, and incremental data imports require special handling.
| Apache Spark | Hadoop-ecosystem tool for near-real-time processing and machine learning. | Efficient for read/write operations, batch processing, and joining streams; handles failures of worker nodes; supports implementation in multiple commonly used programming languages with built-in libraries. | Challenging for true real-time processing; has problems processing small datasets and requires manual optimization for a specific dataset.
Big data processing | MapReduce | Hadoop distributed programming framework for batch processing, resource scheduling, and compute job management. | Highly scalable, owing to its ability to store and process large volumes of distributed data; also cost-effective. | Cannot handle interactive, in-memory, or graph processing; not configured for small datasets.
| YARN | Responsible for resource allocation and job scheduling in Hadoop; the operating system of Hadoop 2.0 that manages resources across clusters, maintains metadata, and keeps track of user information. | Helps ensure efficient utilization of resources and high availability of data. | Accurate parameter configuration is challenging and requires extensive knowledge of each parameter.
| Mahout | Tool for a large array of data processing schemes such as clustering, classification, regression, collaborative filtering, statistical modeling, and segmentation. | Used for complementary and distributed mining of large volumes of data. | Limited support for some popular big data development languages; little documentation to support effective learning.
| Oozie | Workflow and coordination tool for parallelizing jobs in a Hadoop cluster. | Allows workflows of multiple jobs to execute with fault tolerance; provides a web service API for seamless control of scheduled jobs. | Not suitable for off-grid scheduling.
| Apache Tez | Data processing framework that defines workflows and execution steps using a directed acyclic graph. | Flexible, with a simplified interface for speedy data processing; easy to switch over from the MapReduce platform. | Retains MapReduce's strict map, shuffle, and reduce approach, so data that do not fit this pattern are challenging to process.
| Flink | Big data processing tool for batch and streaming operations; efficient for real-time analysis and distributed stream processing in Hadoop. | Provides high-performance data operations with an efficient fault-tolerance mechanism based on distributed snapshots; offers a single run-time environment for both data streaming and batch processing. | Not yet widely used for big data processing and lacks a large base of community contributions.
| Flume | Used for moving data into and out of Hadoop. | Simple, flexible architecture for efficiently aggregating and moving large streaming data into HDFS. | Low scalability and a potential single point of failure.
| Pig | Responsible for data-flow representation, cleaning, and analysis of large datasets using the Hadoop ecosystem. | Easy to learn; big data can be analyzed without writing complicated MapReduce programs. | Lacks adequate documentation and support when errors are encountered during operation.
| Storm | Tool for online machine learning and real-time data analytics; handles streaming and real-time processing of large amounts of data. | Efficient for uncomplicated streaming operations, with low latency and high throughput. | Lacks advanced features for event-time processing and data aggregation, and has no implicit support for state management.
| ZooKeeper | Ensures robust synchronization, configuration management, and name identification within a Hadoop cluster. | Provides high data availability, serialization, and reliability, and minimizes data inconsistencies within clusters. | Requires high maintenance of the large array of stacks within the clusters.
| Chukwa | Open source tool built on top of HDFS and the MapReduce framework for monitoring large distributed systems. | Scalable, flexible, and robust tooling for monitoring data, visualizing, and analyzing results. | Highly dependent on the Hadoop cluster and MySQL, with little technical support for users.
| Avro | Provides a platform for big data query processing and data reduction to minimize computation time. | Fast and compact, which helps to improve query processing. | Slower serialization of data.
Statistical analysis, programming, and machine learning | MLlib | Open source machine learning library in Apache Spark for big data processing, classification, and clustering; highly interoperable with Python libraries such as NumPy and SciPy and with the R language. | Very fast and dynamic, with reusable features and fault tolerance. | High latency and memory usage; requires manual optimization and lacks an efficient file-management system.
| R | Open source programming language for data visualization and analysis, complex data handling, efficient data storage, and vector operations. | Strong support for common data operations such as data cleaning, reading and writing to memory and storage, data mining, machine learning, and data visualization; appropriate for big data processing and analysis. | Issues with efficient memory management and speed; a steep learning curve that may be challenging for non-programmers.
| Python | General-purpose programming language with large open source packages for computing, data modeling, preprocessing, data mining, machine learning, natural language processing, and network graph analysis. | User-friendly, object-oriented, and flexible; supports multiple platforms and integration with other big data processing systems such as Apache Spark. | Slower and not efficient for memory-intensive operations.
| Scala | Object-oriented and functional programming language for complex application development that requires a Java virtual machine environment for data processing; supports big data processing and management through Apache Spark. | Fast, simple, and inherently immutable, which minimizes the thread-safety issues found in similar languages. | Challenging to learn, lacks easy implementation, and has limited backward compatibility.