3. Big Data Analytics Platforms
3.1. The Hadoop Ecosystem
Hadoop is an open-source project led by the Apache Software Foundation. It was originally designed to handle massive amounts of data rapidly, efficiently, and inexpensively, and it handles both structured and unstructured data. Hadoop stores data on inexpensive commodity hardware, is straightforward to use, and scales massively: a single cluster can grow to tens of racks filled with data nodes, effectively forming one supercomputing platform. Hadoop itself has the intelligence to run the distributed file system and to coordinate the parallel processing work. It consists of three parts: the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and Hadoop Common, as shown in Figure 4 below.
Figure 4. Functional view of Hadoop.
The Hadoop Distributed File System (HDFS) is the storage layer, responsible for creating a distributed data repository, while Yet Another Resource Negotiator (YARN) is the data-refinery layer, the processing level that schedules parallel compute jobs. Together, this structure abstracts away the complexities of distributed computing.
YARN provides resource management and job scheduling in the Hadoop distributed processing platform. Hadoop YARN extends Hadoop to support a variety of application types, removing the earlier limitation that Hadoop could run only MapReduce applications. Consequently, YARN enables Apache Hadoop to provide services such as interactive querying, data streaming, and real-time analytics.
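To make YARN's resource-management role concrete, the short sketch below queries the ResourceManager for the cluster's running node reports through the YarnClient API. It assumes a reachable cluster whose address is read from a default yarn-site.xml on the classpath; the class name is illustrative.

```java
// Minimal sketch: ask YARN's ResourceManager for node reports, i.e. the
// per-node resource view (capacity and usage) that YARN manages.
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration(); // reads yarn-site.xml
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(conf);
    yarn.start();
    try {
      // One report per running NodeManager: total capability and used resources.
      List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
      for (NodeReport node : nodes) {
        System.out.printf("%s  capability=%s  used=%s%n",
            node.getNodeId(), node.getCapability(), node.getUsed());
      }
    } finally {
      yarn.stop();
    }
  }
}
```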
With Hadoop, writing a MapReduce job is straightforward because the programmer does not have to determine data locations, data sizes, or the number of parallel compute tasks; a minimal word-count job sketched after Figure 5 illustrates this. The primary components of a Hadoop cluster are the master server, the switches, the racks, and the data servers, as shown in Figure 5; the data servers are commonly called data workers, data nodes, or simply nodes. The master server manages and coordinates the entire cluster (all data nodes): it performs health checks and takes corrective action when required, maps the location of all data and directs all data movement, schedules and reschedules compute jobs, and handles errors, including the loss of a data node and the rescheduling of failed compute jobs. Each data server is responsible for data storage and processing and provides resources such as CPU and memory. In addition, each data server reports its health status and ongoing job progress while data is being processed.
Figure 5. The primary components of a Hadoop cluster.
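The sketch below is a minimal word-count MapReduce job written against the standard org.apache.hadoop.mapreduce API; the class name and the command-line input and output paths are illustrative. Note that nothing in the code specifies where the data lives, how large it is, or how many map and reduce tasks run in parallel: the framework derives all of that from the HDFS blocks and the cluster configuration.

```java
// Minimal word-count job: the programmer supplies only the map and reduce
// logic; data placement and task parallelism are handled by Hadoop.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper receives one line at a time; Hadoop decides which node runs
  // it and which HDFS block it reads, not the programmer.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1)
      }
    }
  }

  // The reducer sums the counts for each word; its parallelism is again
  // scheduled by the framework (YARN), not by this code.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a JAR, such a job is typically launched with a command of the form "hadoop jar wordcount.jar WordCount /input /output", after which YARN schedules the map and reduce tasks across the data nodes.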
Another important component of the Hadoop ecosystem is Hadoop Common. Common is made up of utilities and tools for operations such as codec compression, error detection, input/output, and proxy-user authorization. Furthermore, Common is responsible for data and user authentication, service-level authorization, and the configuration of rack awareness.
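As a small illustration of this utility layer, the sketch below uses Configuration together with CompressionCodecFactory from Hadoop Common to resolve a compression codec from a file name; the path shown is hypothetical.

```java
// Minimal sketch of Hadoop Common utilities: load the site configuration
// and resolve a compression codec from a file suffix (.gz -> GzipCodec).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();           // reads core-site.xml
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    Path path = new Path("/data/events.log.gz");        // hypothetical file
    CompressionCodec codec = factory.getCodec(path);
    System.out.println(codec == null
        ? "no matching codec"
        : codec.getClass().getName());
  }
}
```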
Generally, Hadoop is deployed on rack-based servers. On top of each rack, a network switch is configured for intra-rack communication, and another network switch handles communication between the rack switches and the client machine that runs the Hadoop client software. Hadoop uses HDFS for holding files: HDFS breaks large files into smaller blocks (128 MB by default, configurable), places them on different data nodes, and replicates them to provide high availability.
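A minimal client-side sketch of this behaviour is shown below: it writes a file to HDFS through the FileSystem API with an explicit 128 MB block size and a replication factor of three. The NameNode URI and file path are assumptions made for illustration; in practice these values normally come from hdfs-site.xml (dfs.blocksize and dfs.replication).

```java
// Minimal sketch: write a small file to HDFS, stating the block size and
// replication factor explicitly. URI and path are illustrative assumptions.
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/sample.txt");
    long blockSize = 128L * 1024 * 1024;   // 128 MB chunks, as described above
    short replication = 3;                 // three copies for high availability

    try (FSDataOutputStream out =
             fs.create(file, true, 4096, replication, blockSize)) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```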