Data Modeling and Data Analytics
Data Analytics
This section presents and discusses the types of
operations that can be performed over the data models described in the
previous section and also establishes comparisons between them. A
complementary discussion is provided in Section 4.
Operational Databases
Systems
using operational databases are designed to handle a high number of
transactions that usually perform changes to the operational data, i.e.
the data an organization needs to assure its everyday normal operation.
These systems are called Online Transaction Processing (OLTP) systems
and they are the reason why RDBMSs are so essential nowadays. RDBMSs
have increasingly been optimized to perform well in OLTP systems, namely
providing reliable and efficient data processing.
The set
of operations supported by RDBMSs is derived from the relational algebra
and calculus underlying the Relational Model. As mentioned
before, SQL is the standard language to perform these operations. SQL
can be divided into two parts involving different types of operations:
Data Definition Language (SQL-DDL) and Data Manipulation Language
(SQL-DML).
SQL-DDL allows performing the creation (CREATE), alteration (ALTER) and deletion (DROP) of the various
database objects. First, it allows managing schemas, which are named
collections of all the database objects that are related to one another.
Then, inside a schema, it is possible to manage tables, specifying their
columns and types, primary keys, foreign keys and constraints. It is
also possible to manage views, domains and indexes. An index is a
structure that speeds up access to one or more columns
of a given table, possibly improving the performance of queries.
For example, considering the Academic Management System, a system manager could create a table for storing student information by executing the following SQL-DDL command:
CREATE TABLE Student (
    StudentID INT IDENTITY NOT NULL,
    Name VARCHAR(255) NOT NULL,
    DateOfBirth DATE NOT NULL,
    IDCard VARCHAR(255) NOT NULL,
    PlaceOfOrigin VARCHAR(255),
    Country VARCHAR(255),
    PRIMARY KEY (StudentID))
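Similarly, an index could be created to speed up queries that filter students by country (a minimal sketch; the index name is arbitrary):
CREATE INDEX idx_student_country ON Student (Country)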
On the other hand, SQL-DML is the language that enables users to manipulate database
objects and, in particular, to extract valuable information from the
database. The most commonly used and most complex operation is the SELECT
operation, which allows users to query data from the various tables of a
database. It is a powerful operation because it is capable of
performing in a single query the equivalent of the relational algebra's
selection, projection and join operations. The SELECT operation returns
as output a table with the results. With the SELECT operation it is
possible to simultaneously: define which tables the user wants to query
(through the FROM clause), which rows satisfy a particular condition
(through the WHERE clause), which columns should appear in the result
(through the SELECT clause), order the result (in ascending or
descending order) by one or more columns (through the ORDER BY clause),
group rows with the same column values (through the GROUP BY clause) and
filter those groups based on some condition (through the HAVING
clause). The SELECT operation also allows using aggregation functions,
which perform arithmetic computation or aggregation of data (e.g.
counting or summing the values of one or more columns).
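For example, using the Student table defined above, the following query counts the students per country, keeps only the countries with more than 10 students, and orders the result in descending order (the threshold is merely illustrative):
SELECT Country, COUNT(*) AS NumStudents
FROM Student
GROUP BY Country
HAVING COUNT(*) > 10
ORDER BY NumStudents DESC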
Often there is the need to combine columns of more than one table in the
result. To do that, the user can use the JOIN operation in the query.
This operation performs a subset of the Cartesian product between the
involved tables, i.e. returns the row pairs where the matching columns
in each table have the same value. The most common queries that use
joins involve tables that have one-to-many relationships. If the user
wants to include in the result the rows that did not satisfy the join
condition, then he can use the outer join operations (left, right and
full outer join). Besides specifying queries, SQL-DML allows modifying the
data stored in a database. Namely, it allows adding new rows to a table
(through the INSERT statement), modifying the content of a given table's
rows (through the UPDATE statement) and deleting rows from a table
(through the DELETE statement). SQL-DML also allows combining the
results of two or more queries into a single result table by applying
the Union, Intersect and Except operations, based on set theory.
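For instance, assuming a hypothetical Enrollment table that references Student through a StudentID foreign key (a one-to-many relationship), the following sketch pairs each student with his enrollments; the left outer join variant also keeps students without any enrollment:
SELECT s.Name, e.AcademicYear
FROM Student s
JOIN Enrollment e ON s.StudentID = e.StudentID

SELECT s.Name, e.AcademicYear
FROM Student s
LEFT OUTER JOIN Enrollment e ON s.StudentID = e.StudentID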
For example, considering the Academic Management System, a
system manager could get a list of all students who are from G8
countries by entering the following SQL-DML query:
SELECT Name, Country
FROM Student
WHERE Country IN ('Canada', 'France', 'Germany', 'Italy', 'Japan', 'Russia', 'UK', 'USA')
ORDER BY Country
Decision Support Databases
The most common data model used in a DW is the OLAP cube, which offers a set of operations to analyze the cube model. Since data is conceptualized as a cube with hierarchical dimensions, its operations have familiar names related to the manipulation of a cube, such as slice, dice, drill and pivot. Figure 7 depicts these operations considering the Student facts of the AMS case study (see Figure 2).
The slice operation begins by selecting one of the dimensions (or faces) of the cube. This dimension is the one we want to consult, and slicing then cuts the cube to a specific depth of interest. The slice operation leaves us with a more restricted selection of the cube, namely the dimension we wanted (front face) and the layer of that dimension (the sliced section). In the example of Figure 7 (top-left), the cube was sliced to consider only data of the year 2004.
Dice is the operation that allows restricting the front face of the cube by reducing its size to a smaller targeted domain. This means that the user produces a smaller "front face" than the one he had at the start. Figure 7 (top-right) shows that the set of students has decreased after the dice operation.
Drill is the operation that allows navigating through the different levels of the dimensions, ranging from the most detailed ones (drill down) to the most summarized ones (drill up). Figure 7 (bottom-left) shows a drill down in which the user can see the cities of origin of the students from Portugal.
The pivot operation allows changing the dimension that is being faced (the current front face) to one adjacent to it by rotating the cube. By doing this, the user obtains another perspective of the data, which requires queries with a different structure but can be more beneficial for specific queries. For instance, a user can slice and dice the cube until he gets the results he needs, but often a pivot avoids most of those operations: by perceiving a common structure in future queries and rotating the cube accordingly, fewer slice and dice steps are required. Figure 7 (bottom-right) shows a pivot operation where years are arranged vertically and countries horizontally.
The usual operations issued over the OLAP cube simply query the historical events stored in it, so a very common dimension is one associated with time. The most popular language for manipulating OLAP cubes is MDX (Multidimensional Expressions), a query language for OLAP databases that supports all the operations mentioned above. MDX is used exclusively to analyze and read data, since it was not designed to manipulate data the way SQL-DML does. The star schema and the OLAP cube are designed a priori with a specific purpose in mind and cannot answer queries that deviate much from the ones they were designed to respond to. The benefit of this is that queries are much simpler and faster, and by using a cube it is even quicker to detect patterns, find trends and navigate around the data while "slicing and dicing" it.

For example, considering the Academic Management System, the following MDX query lists the name and gender of the students enrolled in the Computer Science academic program in the year 2015:
SELECT
  { [Student].[Name],
    [Student].[Gender] } ON COLUMNS,
  { [Date].[Academic Year].&[2015] } ON ROWS
FROM [Students Cube]
WHERE ([Academic Program].[Name].&[Computer Science])
Big Data Technologies
Big Data Analytics is the process of discovering and extracting potentially useful information hidden in huge amounts of data (e.g. discovering unknown patterns and correlations). Big Data Analytics can be separated into the following categories: (1) batch-oriented processing; (2) stream processing; (3) OLTP; and (4) interactive ad-hoc queries and analysis.
Batch-oriented processing is a paradigm where a large volume of data is first stored and only then analyzed, as opposed to stream processing. This paradigm is very common for performing large-scale recurring tasks in parallel, like parsing, sorting or counting. The most popular batch-oriented processing model is MapReduce, and more specifically its open-source implementation in Hadoop. MapReduce is based on the divide and conquer (D&C) paradigm to break down complex Big Data problems into small sub-problems and process them in parallel. MapReduce, as its name hints, comprises two major functions: Map and Reduce. First, data is divided into small chunks and distributed over a network of nodes. Then, the Map function, which performs operations like filtering or sorting, is applied simultaneously to each chunk of data, generating intermediate results. After that, those intermediate results are aggregated through the Reduce function in order to compute the final result. Figure 8 illustrates an example of the application of MapReduce to calculate the number of students enrolled in a given academic program by year. This model schedules computation resources close to the data location, which avoids the communication overhead of data transmission. It is simple and widely applied in bioinformatics, web mining and machine learning.
Also related to Hadoop's environment, Pig and Hive are two frameworks used to express Big Data analysis tasks as MapReduce programs. Pig is suitable for data flow tasks and can produce sequences of MapReduce programs, whereas Hive is more suitable for data summarization, queries and analysis. Both of them use their own SQL-like languages, Pig Latin and Hive QL, respectively. These languages support both CRUD and ETL operations.
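As an illustration, a Hive QL query equivalent to the computation of Figure 8 might look like the following sketch (the enrollments table and its column names are hypothetical). Hive compiles the GROUP BY into Map and Reduce phases: the Map side emits (program, year) keys and the Reduce side counts the values per key.
SELECT academic_program, academic_year, COUNT(*) AS num_students
FROM enrollments
GROUP BY academic_program, academic_year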
Stream processing is a paradigm where data continuously arrives in a stream, in real time, and is analyzed as soon as possible in order to derive approximate results. It relies on the assumption that the potential value of data depends on its freshness. Due to its volume, only a portion of the stream is stored in memory. The stream processing paradigm is used in online applications that need real-time precision (e.g. dashboards of production lines in a factory, calculation of costs depending on usage and available resources). It is supported by Data Stream Management Systems (DSMS) that allow performing SQL-like queries (e.g. select, join, group, count) within a given window of data. This window is established either as a period of time (time-based) or as a number of events (length-based). Storm and S4 are two examples of such systems.
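For instance, a continuous query in a CQL-like syntax could count the enrollment events observed within a sliding 10-minute window (a sketch only; the exact window syntax varies between DSMSs, and EnrollmentStream is a hypothetical stream):
SELECT COUNT(*)
FROM EnrollmentStream [RANGE 10 MINUTES]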
Figure 8. Example of Map Reduce applied to the Academic Management System.

For example, considering the Academic Management System, the same query to list all students from G8 countries could be expressed in MongoDB's query language as follows:
db.students.find(
    { country: { $in: ["Canada", "France", "Germany", "Italy", "Japan", "Russia", "UK", "USA"] } },
    { name: 1, country: 1 }
).sort({ country: 1 })