Data Modeling and Data Analytics
Data Analytics
This section presents and discusses the types of
operations that can be performed over the data models described in the
previous section and also establishes comparisons between them. A
complementary discussion is provided in Section 4.
Operational Databases
Systems
using operational databases are designed to handle a high number of
transactions that usually perform changes to the operational data, i.e.
the data an organization needs to assure its everyday normal operation.
These systems are called Online Transaction Processing (OLTP) systems
and they are the reason why RDBMSs are so essential nowadays. RDBMSs
have increasingly been optimized to perform well in OLTP systems, namely
providing reliable and efficient data processing.
The set
of operations supported by RDBMSs is derived from the relational algebra
and calculus underlying the Relational Model. As mentioned
before, SQL is the standard language to perform these operations. SQL
can be divided into two parts involving different types of operations:
Data Definition Language (SQL-DDL) and Data Manipulation Language
(SQL-DML).
SQL-DDL allows performing the creation (CREATE), alteration (ALTER) and deletion (DROP) of the various
database objects. First, it allows managing schemas, which are named
collections of all the database objects that are related to one another.
Then, inside a schema, it is possible to manage tables, specifying their
columns and types, primary keys, foreign keys and constraints. It is
also possible to manage views, domains and indexes. An index is a
structure that speeds up access to one or more columns
of a given table, possibly improving the performance of queries.
For example, considering the Academic Management System, a system manager could create a table for storing student information by executing the following SQL-DDL command:
CREATE TABLE Student (
    StudentID INT IDENTITY NOT NULL,
    Name VARCHAR(255) NOT NULL,
    DateOfBirth DATE NOT NULL,
    IDCard VARCHAR(255) NOT NULL,
    PlaceOfOrigin VARCHAR(255),
    Country VARCHAR(255),
    PRIMARY KEY (StudentID))
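Similarly, an index could be created to speed up queries that filter students by country (a minimal sketch; the index name is arbitrary):
CREATE INDEX idx_student_country ON Student (Country)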
On the other hand, SQL-DML is the language that enables users to manipulate database
objects and, in particular, to extract valuable information from the
database. The most commonly used and most complex operation is the SELECT
operation, which allows users to query data from the various tables of a
database. It is a powerful operation because it is capable of
performing in a single query the equivalent of the relational algebra's
selection, projection and join operations. The SELECT operation returns
as output a table with the results. With the SELECT operation it is
possible to simultaneously: define which tables the user wants to query
(through the FROM clause), which rows satisfy a particular condition
(through the WHERE clause), which columns should appear in the result
(through the SELECT clause), order the result (in ascending or
descending order) by one or more columns (through the ORDER BY clause),
group rows with the same column values (through the GROUP BY clause) and
filter those groups based on some condition (through the HAVING
clause). The SELECT operation also allows using aggregation functions,
which perform arithmetic computation or aggregation of data (e.g.
counting or summing the values of one or more columns).
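For example, using the Student table defined above, the following query counts the students per country, keeps only the countries with more than 10 students, and orders the result in descending order (the threshold is merely illustrative):
SELECT Country, COUNT(*) AS NumStudents
FROM Student
GROUP BY Country
HAVING COUNT(*) > 10
ORDER BY NumStudents DESC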
Often there is the need to combine columns of more than one table in the
result. To do that, the user can use the JOIN operation in the query.
This operation performs a subset of the Cartesian product between the
involved tables, i.e. returns the row pairs where the matching columns
in each table have the same value. The most common queries that use
joins involve tables that have one-to-many relationships. If the user
wants to include in the result the rows that did not satisfy the join
condition, then he can use the outer join operations (left, right and
full outer join). Besides specifying queries, SQL-DML allows modifying the
data stored in a database. Namely, it allows adding new rows to a table
(through the INSERT statement), modifying the content of a given table's
rows (through the UPDATE statement) and deleting rows from a table
(through the DELETE statement). SQL-DML also allows combining the
results of two or more queries into a single result table by applying
the Union, Intersect and Except operations, based on set theory.
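For instance, assuming a hypothetical Enrollment table that references Student through a StudentID foreign key (a one-to-many relationship), the following sketch pairs each student with his enrollments; the left outer join variant also keeps students without any enrollment:
SELECT s.Name, e.AcademicYear
FROM Student s
JOIN Enrollment e ON s.StudentID = e.StudentID

SELECT s.Name, e.AcademicYear
FROM Student s
LEFT OUTER JOIN Enrollment e ON s.StudentID = e.StudentID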
For example, considering the Academic Management System, a
system manager could get a list of all students who are from G8
countries by entering the following SQL-DML query:
SELECT Name, Country
FROM Student
WHERE Country IN ('Canada', 'France', 'Germany', 'Italy', 'Japan', 'Russia', 'UK', 'USA')
ORDER BY Country
Decision Support Databases
The most common data model used in a DW is the OLAP cube, which offers a set of operations to analyze the cube model. Since data is conceptualized as a cube with hierarchical dimensions, its operations have familiar names related to the manipulation of a cube, such as slice, dice, drill and pivot. Figure 7 depicts these operations considering the Student facts of the AMS case study (see Figure 2).
The slice operation begins by selecting one of the dimensions (or faces) of the cube. This dimension is the one we want to consult, and slicing then cuts the cube to a specific depth of interest. The slice operation leaves us with a more restricted selection of the cube, namely the dimension we wanted (front face) and the layer of that dimension (the sliced section). In the example of Figure 7 (top-left), the cube was sliced to consider only data of the year 2004.
Dice is the operation that allows restricting the front face of the cube by reducing its size to a smaller targeted domain. This means that the user produces a smaller "front face" than the one he had at the start. Figure 7 (top-right) shows that the set of students has decreased after the dice operation.
Drill is the operation that allows navigating through the different levels of the dimensions, ranging from the most detailed ones (drill down) to the most summarized ones (drill up). Figure 7 (bottom-left) shows a drill down in which the user can see the cities of origin of the students from Portugal.
The pivot operation allows changing the dimension that is being faced (the current front face) to one adjacent to it by rotating the cube. By doing this, the user obtains another perspective of the data, which requires queries with a different structure but can be more beneficial for specific queries. For instance, a user can slice and dice the cube until he gets the results he needs, but often a pivot avoids most of those operations: by perceiving a common structure in future queries and rotating the cube accordingly, fewer slice and dice steps are required. Figure 7 (bottom-right) shows a pivot operation where years are arranged vertically and countries horizontally.
The usual operations issued over the OLAP cube simply query the historical events stored in it, so a very common dimension is one associated with time. The most popular language for manipulating OLAP cubes is MDX (Multidimensional Expressions), a query language for OLAP databases that supports all the operations mentioned above. MDX is used exclusively to analyze and read data, since it was not designed to manipulate data the way SQL-DML does. The star schema and the OLAP cube are designed a priori with a specific purpose in mind and cannot answer queries that deviate much from the ones they were designed to respond to. The benefit of this is that queries are much simpler and faster, and by using a cube it is even quicker to detect patterns, find trends and navigate around the data while "slicing and dicing" it.

For example, considering the Academic Management System, the following MDX query lists the name and gender of the students enrolled in the Computer Science academic program in the year 2015:
SELECT
  { [Student].[Name],
    [Student].[Gender] } ON COLUMNS,
  { [Date].[Academic Year].&[2015] } ON ROWS
FROM [Students Cube]
WHERE ([Academic Program].[Name].&[Computer Science])
Big Data Technologies
Big Data Analytics is the process of discovering and extracting potentially useful information hidden in huge amounts of data (e.g. discovering unknown patterns and correlations). Big Data Analytics can be separated into the following categories: (1) batch-oriented processing; (2) stream processing; (3) OLTP; and (4) interactive ad-hoc queries and analysis.
Batch-oriented processing is a paradigm where a large volume of data is first stored and only then analyzed, as opposed to stream processing. This paradigm is very common for performing large-scale recurring tasks in parallel, like parsing, sorting or counting. The most popular batch-oriented processing model is MapReduce, and more specifically its open-source implementation in Hadoop. MapReduce is based on the divide and conquer (D&C) paradigm to break down complex Big Data problems into small sub-problems and process them in parallel. MapReduce, as its name hints, comprises two major functions: Map and Reduce. First, data is divided into small chunks and distributed over a network of nodes. Then, the Map function, which performs operations like filtering or sorting, is applied simultaneously to each chunk of data, generating intermediate results. After that, those intermediate results are aggregated through the Reduce function in order to compute the final result. Figure 8 illustrates an example of the application of MapReduce to calculate the number of students enrolled in a given academic program by year. This model schedules computation resources close to the data location, which avoids the communication overhead of data transmission. It is simple and widely applied in bioinformatics, web mining and machine learning.
Also related to Hadoop's environment, Pig and Hive are two frameworks used to express Big Data analysis tasks as MapReduce programs. Pig is suitable for data flow tasks and can produce sequences of MapReduce programs, whereas Hive is more suitable for data summarization, queries and analysis. Both of them use their own SQL-like languages, Pig Latin and Hive QL, respectively. These languages support both CRUD and ETL operations.
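As an illustration, a Hive QL query equivalent to the computation of Figure 8 might look like the following sketch (the enrollments table and its column names are hypothetical). Hive compiles the GROUP BY into Map and Reduce phases: the Map side emits (program, year) keys and the Reduce side counts the values per key.
SELECT academic_program, academic_year, COUNT(*) AS num_students
FROM enrollments
GROUP BY academic_program, academic_year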
Stream processing is a paradigm where data continuously arrives in a stream, in real time, and is analyzed as soon as possible in order to derive approximate results. It relies on the assumption that the potential value of data depends on its freshness. Due to its volume, only a portion of the stream is stored in memory. The stream processing paradigm is used in online applications that need real-time precision (e.g. dashboards of production lines in a factory, calculation of costs depending on usage and available resources). It is supported by Data Stream Management Systems (DSMS) that allow performing SQL-like queries (e.g. select, join, group, count) within a given window of data. This window is established either as a period of time (time-based) or as a number of events (length-based). Storm and S4 are two examples of such systems.
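For instance, a continuous query in a CQL-like syntax could count the enrollment events observed within a sliding 10-minute window (a sketch only; the exact window syntax varies between DSMSs, and EnrollmentStream is a hypothetical stream):
SELECT COUNT(*)
FROM EnrollmentStream [RANGE 10 MINUTES]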
Figure 8. Example of Map Reduce applied to the Academic Management System.

For example, considering the Academic Management System, the same query to list all students from G8 countries could be expressed in MongoDB's query language as follows:
db.students.find(
    { country: { $in: ["Canada", "France", "Germany", "Italy", "Japan", "Russia", "UK", "USA"] } },
    { name: 1, country: 1 }
).sort({ country: 1 })