Analyzing Big Data can create significant advantages for an organization because it enables the discovery of patterns and correlations in datasets. This paper discusses the state of Big Data management with a particular focus on data modeling and data analytics.
Data Modeling
This section gives an in-depth look of the most
popular data models used to define and support Operational Databases,
Data Warehouses and Big Data technologies.
Table 1. Approaches and perspectives of the survey.
Approaches |
Operational |
Decision Support |
Big Data |
Data Modeling Perspective |
ER and Relational Models |
Star Schema and OLAP Cube Models |
Key-Value, Document, Wide-Column and Graph |
RDBMS |
DW |
Big Data-Based Systems |
|
Data Analytics Perspective |
OLTP |
OLAP |
Multiple Classes (Batch-oriented processing, stream-processing, OLTP and Interactive ad-hoc queries) |
Databases
are widely used either for personal or enterprise use, namely due to
their strong ACID guarantees (atomicity, consistency, isolation and
durability) guarantees and the maturity level of Database Management
Systems (DBMSs) that support them.
The data modeling
process may involve the definition of three data models (or schemas)
defined at different abstraction levels, namely Conceptual, Logical and
Physical data models. Figure 1 shows part of the three data
models for the AMS case study. All these models define three entities
(Person, Student and Professor) and their main relationships (teach and
supervise associations).
Conceptual Data Model. A conceptual data
model is used to define, at a very high and platform-independent level
of abstraction, the entities or concepts, which represent the data of
the problem domain, and their relationships. It leaves further details
about the entities (such as their attributes, types or primary keys) for
the next steps. This model is typically used to explore domain concepts
with the stakeholders and can be omitted or used instead of the logical
data model.
Logical Data Model. A logical data model is a
refinement of the previous conceptual model. It details the domain
entities and their relationships, but standing also at a
platform-independent level. It depicts all the attributes that
characterize each entity (possibly also including its unique identifier,
the primary key) and all the relationships between the entities
(possibly including the keys identifying those relationships, the
foreign keys). Despite being independent of any DBMS, this model can
easily be mapped on to a physical data model thanks to the details it
provides.
Physical Data Model. A physical data model visually
represents the structure of the data as implemented by a given class of
DBMS. Therefore, entities are represented as tables, attributes are
represented as table columns and have a given data type that can vary
according to the chosen DBMS, and the relationships between each table
are identified through foreign keys. Unlike the previous models, this
model tends to be platform-specific, because it reflects the database
schema and, consequently, some platform-specific aspects (e.g.
database-specific data types or query language extensions).
Summarizing,
the complexity and detail increase from a conceptual to a physical data
model. First, it is important to perceive at a higher level of
abstraction, the data entities and their relationships using a
Conceptual Data Model. Then, the focus is on detailing those entities
without worrying about implementation details using a Logical Data
Model. Finally, a Physical Data Model allows to represent how data is
supported by a given DBMS.
Operational Databases
Databases
had a great boost with the popularity of the Relational Model
proposed by E. F. Codd in 1970. The Relational Model overcame the
problems of predecessors data models (namely the Hierarchical Model and
the Navigational Model). The Relational Model caused the emergence
of Relational Database Management Systems (RDBMSs), which are the most
used and popular DBMSs, as well as the definition of the Structured
Query Language (SQL) as the standard language for defining and
manipulating data in RDBMSs. RDBMSs are widely used for maintaining data
of daily operations. Considering the data modeling of operational
databases there are two main models: the Relational and the
Entity-Relationship (ER) models.
Relational Model. The Relational
Model is based on the mathematical concept of relation. A relation is
defined as a set (in mathematics terminology) and is represented as a
table, which is a matrix of columns and rows, holding information about
the domain entities and the relationships among them. Each column of the
table corresponds to an entity attribute and specifies the attribute's
name and its type (known as domain). Each row of the
table (known as tuple) corresponds to a single element of the
represented domain entity. In the Relational Model each row is unique
and therefore a table has an attribute or set of attributes known as
primary key, used to univocally identify those rows. Tables are related
with each other by sharing one or more common attributes. These
attributes correspond to a primary key in the referenced (parent) table
and are known as foreign keys in the referencing (child) table. In
one-to-many relationships, the referenced table corresponds to the
entity of the "one" side of the relationship and the referencing table
corresponds to the entity of the "many" side. In many- to-many
relationships, it is used an additional association table that
associates the entities involved through their respective primary keys.
The Relational Model also features the concept of View, which is like a
table whose rows are not explicitly stored in the database, but are
computed as needed from a view definition. Instead, a view is defined as
a query on one or more base tables or other views.
Figure 1. Example of three data models (at different abstraction levels) for the Academic Management System.
Physical Data Model
Entity-Relationship
(ER) Model. The Entity Relationship (ER) Model, proposed by Chen
in 1976, appeared as an alternative to the Relational Model in order to
provide more expressiveness and semantics into the database design from
the user's point of view. The ER model is a semantic data model, i.e.
aims to represent the meaning of the data involved on some specific
domain. This model was originally defined by three main concepts:
entities, relationships and attributes. An entity corresponds to an
object in the real world that is distinguishable from all other objects
and is characterized by a set of attributes. Each attribute has a range
of possible values, known as its domain, and each entity has its own
value for each attribute. Similarly to the Relational Model, the set of
attributes that identify an entity is known as its primary key.
Entities
can be though as nouns and correspond to the tables of the Relational
Model. In turn, a relationship is an association established among two
or more entities. A relationship can be thought as a verb and includes
the roles of each participating entities with multiplicity constraints,
and their cardinality. For instance, a relationship can be of one-to-one
(1:1), one-to-many (1:M) or many-to-many (M:N). In an ER diagram,
entities are usually represented as rectangles, attributes as circles
connected to entities or relationships through a line, and relationships
as diamonds connected to the intervening entities through a line.
The
Enhanced ER Model provided additional concepts to represent more
complex requirements, such as generalization, specialization,
aggregation and composition. Other popular variants of ER diagram
notations are Crow's foot, Bachman, Barker's, IDEF1X and UML Profile for
Data Modeling.
Decision Support Databases
The
evolution of relational databases to decision support databases,
hereinafter indistinctly referred as "Data Warehouses" (DWs), occurred
with the need of storing operational but also historical data, and the
need of analyzing that data in complex dashboards and reports. Even
though a DW seems to be a relational database, it is different in the
sense that DWs are more suitable for supporting query and analysis
operations (fast reads) instead of transaction processing (fast reads
and writes) operations. DWs contain historical data that come from
transactional data, but they also might include other data sources.
DWs are mainly used for OLAP (online analytical processing) operations.
OLAP is the approach to provide report data from DW through
multidimensional queries and it is required to create a
multi-dimensional database.
Usually, DWs include a
framework that allows extracting data from multiple data sources and
transform it before loading to the repository, which is known as ETL
(Extract Transform Load) framework.
Data modeling in DW
consists in defining fact tables with several dimension tables,
suggesting star or snowflake schema data models. A star schema has
a central fact table linked with dimension tables. Usually, a fact
table has a large number of attributes (in many cases in a denormalized
way), with many foreign keys that are the primary keys to the dimension
tables. The dimension tables represent characteristics that describe the
fact table. When star schemas become too complex to be queried
efficiently they are transformed into multidimensional arrays of data
called OLAP cubes (for more information on how this transformation is
performed the reader can consult the following references).
A
star schema is transformed to a cube by putting the fact table on the
front face that we are facing and the dimensions on the other faces of
the cube. For this reason, cubes can be equivalent to star schemas
in content, but they are accessed with more platform-specific languages
than SQL that have more analytic capabilities (e.g. MDX or XMLA). A
cube with three dimensions is conceptually easier to visualize and
understand, but the OLAP cube model supports more than three dimensions,
and is called a hypercube.
Figure 2 shows two examples of star
schemas regarding the case study AMS. The star schema on the left
represents the data model for the Student's fact, while the data model
on the right represents the Professor's fact. Both of them have a
central fact table that contains specific attributes of the entity in
analysis and also foreign keys to the dimension tables. For example, a
Student has a place of origin (DIM_PLACEOFORIGIN) that is described by a
city and associated to a country (DIM_COUNTRY) that has a name and an
ISO code. On the other hand, Figure 3 shows a cube model with three
dimensions for the Student. These dimensions are repre- sented by sides
of the cube (Student, Country and Date). This cube is useful to execute
queries such as: the students by country enrolled for the first time in a
given year.
A challenge that DWs face is the growth of data,
since it affects the number of dimensions and levels in either the star
schema or the cube hierarchies. The increasing number of dimensions over
time makes the management of such systems often impracticable; this
problem becomes even more serious when dealing with Big Data scenarios,
where data is continuously being generated.
Figure 2. Example of two star schema models for the Academic Management System.

Figure 3. Example of a cube model for the Academic Management System.

Big Data Technologies
The
volume of data has been exponentially increasing over the last years,
namely due to the simultaneous growth of the number of sources (e.g.
users, systems or sensors) that are continuously producing data. These
data sources produce huge amounts of data with variable representations
that make their management by the traditional RDBMSs and DWs often
impracticable. Therefore, there is a need to devise new data models and
technologies that can handle such Big Data.
NoSQL (Not Only SQL)
is one of the most popular approaches to deal with this problem. It
consists in a group of non-relational DBMSs that consequently do not
represent databases using tables and usually do not use SQL for data
manipulation. NoSQL systems allow managing and storing large-scale
denormalized datasets, and are designed to scale horizontally. They
achieve that by compromising consistency in favor of availability and
partition-tolerance, according to Brewer's CAP theorem. Therefore,
NoSQL systems are "eventually consistent", i.e. assume that writes on
the data are eventually propagated over time, but there are limited
guarantees that different users will read the same value at the same
time. NoSQL provides BASE guarantees (Basically Available, Soft state
and Eventually consistent) instead of the traditional ACID guarantees,
in order to greatly improve performance and scalability.
NoSQL
databases can be classified in four categories: Key-value stores,
(2) Document-oriented databases, (3) Wide-column stores, and (4) Graph
databases.
Key-value Stores. A Key-Value store represents data as
a collection (known as dictionary or map) of key- value pairs. Every
key consists in a unique alphanumeric identifier that works like an
index, which is used to access a corresponding value. Values can be
simple text strings or more complex structures like arrays. The
Key-value model can be extended to an ordered model whose keys are
stored in lexicographical order. The fact of being a simple data model
makes Key-value stores ideally suited to retrieve information in a very
fast, available and scalable way. For instance, Amazon makes extensive
use of a Key-value store system, named Dynamo, to manage the products in
its shopping cart. Amazon's Dynamo and Voldemort, which is used
by Linkedin, are two examples of systems that apply this data model with
success. An example of a key-value store for both students and
professors of the Academic Managements System is shown in Figure 4.
Document-oriented
Databases. Document-oriented databases (or document stores) were
originally created to store traditional documents, like a notepad text
file or Microsoft Word document. However, their concept of document goes
beyond that, and a document can be any kind of domain object.
Documents contain encoded data in a standard format like XML, YAML, JSON
or BSON (Binary JSON) and are univocally identified in the database by a
unique key. Documents contain semi-structured data represented as
name-value pairs, which can vary according to the row and can nest other
documents. Unlike key-value stores, these systems support secondary
indexes and allow fully searching either by keys or values. Document
databases are well suited for storing and managing huge collections of
textual documents (e.g. text files or email messages), as well as
semi-struc- tured or denormalized data that would require an extensive
use of "nulls" in an RDBMS. MongoDB and CouchDB are two of the
most popular Document-oriented database systems. Figure 5 illustrates
two collections of documents for both students and professors of the
Academic Management System.
Figure 4. Example of a key-value store for the Academic Management System.

Figure 5. Example of a documents-oriented database for the Academic Management System.
Wide-column
Stores. Wide-column stores (also known as column-family stores,
extensible record stores or column-oriented databases) represent and
manage data as sections of columns rather than rows (like in RDBMS).
Each section is composed of key-value pairs, where the keys are rows and
the values are sets of columns, known as column families. Each row is
identified by a primary key and can have column families different of
the other rows. Each column family also acts as a primary key of the set
of columns it contains. In turn each column of column family consists
in a name-value pair. Column families can even be grouped in super
column families. This data model was highly inspired by Google's
BigTable. Wide-column stores are suited for scenarios like: (1)
Distributed data storage; (2) Large-scale and batch-oriented data
processing, using the famous MapReduce method for tasks like sorting,
parsing, querying or conversion and; (3) Exploratory and predictive
analytics. Cassandra and Hadoop HBase are two popular frameworks of such
data management systems. Figure 6 depicts an example of a
wide-column store for the entity "person" of the Academic Managements
System.
Graph Databases. Graph databases represent data as a
network of nodes (representing the domain entities) that are connected
by edges (representing the relationships among them) and are
characterized by properties expressed as key-value pairs. Graph
databases are quite useful when the focus is on exploring the
relationships between data, such as traversing social networks,
detecting patterns or infer recommendations. Due to their visual
representation, they are more user-friendly than the aforementioned
types of NoSQL databases. Neo4j and Allegro Graph are two examples of
such systems.