Data Modeling

This section gives an in-depth look of the most popular data models used to define and support Operational Databases, Data Warehouses and Big Data technologies.

Table 1. Approaches and perspectives of the survey.

Approaches

Operational

Decision Support

Big Data

Data Modeling Perspective

ER and Relational Models

Star Schema and OLAP Cube Models

Key-Value, Document, Wide-Column and Graph

RDBMS

DW

Big Data-Based Systems

Data Analytics Perspective

OLTP

OLAP

Multiple Classes

(Batch-oriented processing, stream-processing, OLTP and Interactive ad-hoc queries)


Databases are widely used either for personal or enterprise use, namely due to their strong ACID guarantees (atomicity, consistency, isolation and durability) guarantees and the maturity level of Database Management Systems (DBMSs) that support them.

The data modeling process may involve the definition of three data models (or schemas) defined at different abstraction levels, namely Conceptual, Logical and Physical data models. Figure 1 shows part of the three data models for the AMS case study. All these models define three entities (Person, Student and Professor) and their main relationships (teach and supervise associations).

Conceptual Data Model. A conceptual data model is used to define, at a very high and platform-independent level of abstraction, the entities or concepts, which represent the data of the problem domain, and their relationships. It leaves further details about the entities (such as their attributes, types or primary keys) for the next steps. This model is typically used to explore domain concepts with the stakeholders and can be omitted or used instead of the logical data model.

Logical Data Model. A logical data model is a refinement of the previous conceptual model. It details the domain entities and their relationships, but standing also at a platform-independent level. It depicts all the attributes that characterize each entity (possibly also including its unique identifier, the primary key) and all the relationships between the entities (possibly including the keys identifying those relationships, the foreign keys). Despite being independent of any DBMS, this model can easily be mapped on to a physical data model thanks to the details it provides.

Physical Data Model. A physical data model visually represents the structure of the data as implemented by a given class of DBMS. Therefore, entities are represented as tables, attributes are represented as table columns and have a given data type that can vary according to the chosen DBMS, and the relationships between each table are identified through foreign keys. Unlike the previous models, this model tends to be platform-specific, because it reflects the database schema and, consequently, some platform-specific aspects (e.g. database-specific data types or query language extensions).

Summarizing, the complexity and detail increase from a conceptual to a physical data model. First, it is important to perceive at a higher level of abstraction, the data entities and their relationships using a Conceptual Data Model. Then, the focus is on detailing those entities without worrying about implementation details using a Logical Data Model. Finally, a Physical Data Model allows to represent how data is supported by a given DBMS.


Operational Databases

Databases had a great boost with the popularity of the Relational Model proposed by E. F. Codd in 1970. The Relational Model overcame the problems of predecessors data models (namely the Hierarchical Model and the Navigational Model). The Relational Model caused the emergence of Relational Database Management Systems (RDBMSs), which are the most used and popular DBMSs, as well as the definition of the Structured Query Language (SQL) as the standard language for defining and manipulating data in RDBMSs. RDBMSs are widely used for maintaining data of daily operations. Considering the data modeling of operational databases there are two main models: the Relational and the Entity-Relationship (ER) models.

Relational Model. The Relational Model is based on the mathematical concept of relation. A relation is defined as a set (in mathematics terminology) and is represented as a table, which is a matrix of columns and rows, holding information about the domain entities and the relationships among them. Each column of the table corresponds to an entity attribute and specifies the attribute's name and its type (known as domain). Each row of the table (known as tuple) corresponds to a single element of the represented domain entity. In the Relational Model each row is unique and therefore a table has an attribute or set of attributes known as primary key, used to univocally identify those rows. Tables are related with each other by sharing one or more common attributes. These attributes correspond to a primary key in the referenced (parent) table and are known as foreign keys in the referencing (child) table. In one-to-many relationships, the referenced table corresponds to the entity of the "one" side of the relationship and the referencing table corresponds to the entity of the "many" side. In many- to-many relationships, it is used an additional association table that associates the entities involved through their respective primary keys. The Relational Model also features the concept of View, which is like a table whose rows are not explicitly stored in the database, but are computed as needed from a view definition. Instead, a view is defined as a query on one or more base tables or other views.

Figure 1. Example of three data models (at different abstraction levels) for the Academic Management System.

Physical Data Model

Physical Data Model

Physical Data Model

Entity-Relationship (ER) Model. The Entity Relationship (ER) Model, proposed by Chen in 1976, appeared as an alternative to the Relational Model in order to provide more expressiveness and semantics into the database design from the user's point of view. The ER model is a semantic data model, i.e. aims to represent the meaning of the data involved on some specific domain. This model was originally defined by three main concepts: entities, relationships and attributes. An entity corresponds to an object in the real world that is distinguishable from all other objects and is characterized by a set of attributes. Each attribute has a range of possible values, known as its domain, and each entity has its own value for each attribute. Similarly to the Relational Model, the set of attributes that identify an entity is known as its primary key.

Entities can be though as nouns and correspond to the tables of the Relational Model. In turn, a relationship is an association established among two or more entities. A relationship can be thought as a verb and includes the roles of each participating entities with multiplicity constraints, and their cardinality. For instance, a relationship can be of one-to-one (1:1), one-to-many (1:M) or many-to-many (M:N). In an ER diagram, entities are usually represented as rectangles, attributes as circles connected to entities or relationships through a line, and relationships as diamonds connected to the intervening entities through a line.

The Enhanced ER Model provided additional concepts to represent more complex requirements, such as generalization, specialization, aggregation and composition. Other popular variants of ER diagram notations are Crow's foot, Bachman, Barker's, IDEF1X and UML Profile for Data Modeling.


Decision Support Databases

The evolution of relational databases to decision support databases, hereinafter indistinctly referred as "Data Warehouses" (DWs), occurred with the need of storing operational but also historical data, and the need of analyzing that data in complex dashboards and reports. Even though a DW seems to be a relational database, it is different in the sense that DWs are more suitable for supporting query and analysis operations (fast reads) instead of transaction processing (fast reads and writes) operations. DWs contain historical data that come from transactional data, but they also might include other data sources. DWs are mainly used for OLAP (online analytical processing) operations. OLAP is the approach to provide report data from DW through multidimensional queries and it is required to create a multi-dimensional database.

Usually, DWs include a framework that allows extracting data from multiple data sources and transform it before loading to the repository, which is known as ETL (Extract Transform Load) framework.

Data modeling in DW consists in defining fact tables with several dimension tables, suggesting star or snowflake schema data models. A star schema has a central fact table linked with dimension tables. Usually, a fact table has a large number of attributes (in many cases in a denormalized way), with many foreign keys that are the primary keys to the dimension tables. The dimension tables represent characteristics that describe the fact table. When star schemas become too complex to be queried efficiently they are transformed into multidimensional arrays of data called OLAP cubes (for more information on how this transformation is performed the reader can consult the following references).

A star schema is transformed to a cube by putting the fact table on the front face that we are facing and the dimensions on the other faces of the cube. For this reason, cubes can be equivalent to star schemas in content, but they are accessed with more platform-specific languages than SQL that have more analytic capabilities (e.g. MDX or XMLA). A cube with three dimensions is conceptually easier to visualize and understand, but the OLAP cube model supports more than three dimensions, and is called a hypercube.

Figure 2 shows two examples of star schemas regarding the case study AMS. The star schema on the left represents the data model for the Student's fact, while the data model on the right represents the Professor's fact. Both of them have a central fact table that contains specific attributes of the entity in analysis and also foreign keys to the dimension tables. For example, a Student has a place of origin (DIM_PLACEOFORIGIN) that is described by a city and associated to a country (DIM_COUNTRY) that has a name and an ISO code. On the other hand, Figure 3 shows a cube model with three dimensions for the Student. These dimensions are repre- sented by sides of the cube (Student, Country and Date). This cube is useful to execute queries such as: the students by country enrolled for the first time in a given year.

A challenge that DWs face is the growth of data, since it affects the number of dimensions and levels in either the star schema or the cube hierarchies. The increasing number of dimensions over time makes the management of such systems often impracticable; this problem becomes even more serious when dealing with Big Data scenarios, where data is continuously being generated.

Figure 2. Example of two star schema models for the Academic Management System.

Figure 2. Example of two star schema models for the Academic Management System.


Figure 3. Example of a cube model for the Academic Management System.

Figure 3. Example of a cube model for the Academic Management System.


Big Data Technologies

The volume of data has been exponentially increasing over the last years, namely due to the simultaneous growth of the number of sources (e.g. users, systems or sensors) that are continuously producing data. These data sources produce huge amounts of data with variable representations that make their management by the traditional RDBMSs and DWs often impracticable. Therefore, there is a need to devise new data models and technologies that can handle such Big Data.

NoSQL (Not Only SQL) is one of the most popular approaches to deal with this problem. It consists in a group of non-relational DBMSs that consequently do not represent databases using tables and usually do not use SQL for data manipulation. NoSQL systems allow managing and storing large-scale denormalized datasets, and are designed to scale horizontally. They achieve that by compromising consistency in favor of availability and partition-tolerance, according to Brewer's CAP theorem. Therefore, NoSQL systems are "eventually consistent", i.e. assume that writes on the data are eventually propagated over time, but there are limited guarantees that different users will read the same value at the same time. NoSQL provides BASE guarantees (Basically Available, Soft state and Eventually consistent) instead of the traditional ACID guarantees, in order to greatly improve performance and scalability.

NoSQL databases can be classified in four categories: Key-value stores, (2) Document-oriented databases, (3) Wide-column stores, and (4) Graph databases.

Key-value Stores. A Key-Value store represents data as a collection (known as dictionary or map) of key- value pairs. Every key consists in a unique alphanumeric identifier that works like an index, which is used to access a corresponding value. Values can be simple text strings or more complex structures like arrays. The Key-value model can be extended to an ordered model whose keys are stored in lexicographical order. The fact of being a simple data model makes Key-value stores ideally suited to retrieve information in a very fast, available and scalable way. For instance, Amazon makes extensive use of a Key-value store system, named Dynamo, to manage the products in its shopping cart. Amazon's Dynamo and Voldemort, which is used by Linkedin, are two examples of systems that apply this data model with success. An example of a key-value store for both students and professors of the Academic Managements System is shown in Figure 4.

Document-oriented Databases. Document-oriented databases (or document stores) were originally created to store traditional documents, like a notepad text file or Microsoft Word document. However, their concept of document goes beyond that, and a document can be any kind of domain object. Documents contain encoded data in a standard format like XML, YAML, JSON or BSON (Binary JSON) and are univocally identified in the database by a unique key. Documents contain semi-structured data represented as name-value pairs, which can vary according to the row and can nest other documents. Unlike key-value stores, these systems support secondary indexes and allow fully searching either by keys or values. Document databases are well suited for storing and managing huge collections of textual documents (e.g. text files or email messages), as well as semi-struc- tured or denormalized data that would require an extensive use of "nulls" in an RDBMS. MongoDB and CouchDB are two of the most popular Document-oriented database systems. Figure 5 illustrates two collections of documents for both students and professors of the Academic Management System.

Figure 4. Example of a key-value store for the Academic Management System.

Figure 4. Example of a key-value store for the Academic Management System.


Figure 5. Example of a documents-oriented database for the Academic Management System.

Figure 5. Example of a documents-oriented database for the Academic Management System.

Wide-column Stores. Wide-column stores (also known as column-family stores, extensible record stores or column-oriented databases) represent and manage data as sections of columns rather than rows (like in RDBMS). Each section is composed of key-value pairs, where the keys are rows and the values are sets of columns, known as column families. Each row is identified by a primary key and can have column families different of the other rows. Each column family also acts as a primary key of the set of columns it contains. In turn each column of column family consists in a name-value pair. Column families can even be grouped in super column families. This data model was highly inspired by Google's BigTable. Wide-column stores are suited for scenarios like: (1) Distributed data storage; (2) Large-scale and batch-oriented data processing, using the famous MapReduce method for tasks like sorting, parsing, querying or conversion and; (3) Exploratory and predictive analytics. Cassandra and Hadoop HBase are two popular frameworks of such data management systems. Figure 6 depicts an example of a wide-column store for the entity "person" of the Academic Managements System.

Graph Databases. Graph databases represent data as a network of nodes (representing the domain entities) that are connected by edges (representing the relationships among them) and are characterized by properties expressed as key-value pairs. Graph databases are quite useful when the focus is on exploring the relationships between data, such as traversing social networks, detecting patterns or infer recommendations. Due to their visual representation, they are more user-friendly than the aforementioned types of NoSQL databases. Neo4j and Allegro Graph are two examples of such systems.