2. Method

2.1. Planning the SLR Study

The main goals of this phase are the identification of the need for an SLR study and the development of a review protocol.


2.1.1. Identification of the Need for an SLR Study

Following the suggestions provided by some works, we searched for SLRs and similar publications related to Big Data modeling and management, to verify whether there was a gap that could be covered by the SLR proposed in this work. We found four works, dating from 2015, 2016, 2017 and 2018, which are described below.

Ribeiro, Silva and Rodrigues da Silva completed a survey in 2015 focused on data modeling and data analytics. Although it is not an SLR study, the work describes some concepts that are relevant to Big Data models. The authors identified the four main data models for Big Data (key-value, document, wide-column and graph), which are also described in our work. They also presented a brief summary of the abstraction levels, concepts, languages, modeling tools and database tool support. However, their study is not as detailed as ours, nor does it present a bibliometric analysis. For instance, it lacks information about the data models used at the conceptual, logical and physical levels, the techniques used for transforming between the different abstraction levels, the research trends, and which data set sources, types and models for Big Data are the most studied. Furthermore, our SLR is up to date as of August 2019. Nevertheless, like our work, the study stresses that Big Data modeling and management in databases must be considered for research, documentation and development, since the authors demonstrate the need for data modeling as a means to improve the development process in Big Data.

Sousa and Val Cura cover the 2012 to 2016 timeframe. They present an SLR study about logical modeling for NoSQL databases. The authors select 12 articles and classify them under the conceptual, logical and physical levels. They also identify modeling proposals for NoSQL databases, proposals for NoSQL database migration and layer proposals. We do not consider it a complete work, since in our research we examined 1376 articles about Big Data modeling and management. Furthermore, they conclude that no research about data model conversion from the conceptual to the logical level existed at that time, even though our findings revealed the existence of several studies related to it.

Davoudian, Chen and Liu present a thorough study of the concepts and techniques used in NoSQL databases. The data models used in Big Data are described, but our work additionally presents an in-depth study of Big Data modeling methods. We consider this a relevant work, but it does not include a bibliometric analysis of authors, conferences and journals, among other information needed to identify the trends and gaps in this research topic. Additionally, our work examines each of the studies conducted in this area to give researchers a guide to current approaches and future directions.

Wu, Sark and Zhu identify some NoSQL databases and focus on comparing them according to their data model and the CAP theorem, which states that a distributed system can guarantee only two of the following three properties simultaneously: Consistency, Availability and Partition tolerance. This work is also not an SLR like the one presented here.

Furthermore, other recent surveys related to Big Data have been published; for instance, one describes the state of the art in methodologies developed for multimedia Big Data analytics, together with the challenges, techniques, applications, strategies and future outlook. Another study presents and analyzes in detail the current state of Big Data environments and platforms and the available garbage collection algorithms. These works neither cover the scope of our research questions for Big Data modeling and management nor achieve the same level of detail and precision.

In the next subsection, we detail the development of our review protocol, stating our objectives, their justification and the research questions.


2.1.2. Development of a Review Protocol

A review protocol is essential to mitigate any possible researcher bias, and it must be defined before conducting the SLR. Thus, during this stage, we defined the method to be applied. First, we proposed specific goals for our work and their justification. Then, we formulated three research questions with the intent of summarizing the existing evidence about Big Data modeling and management. Finally, we elaborated a strategy to conduct this SLR study effectively.

Objectives and Justification

The first objective is to present information about the most relevant research on Big Data modeling and management in a comprehensive bibliometric analysis. The analysis covers the studies retrieved from the digital libraries and details the authors, their institutional affiliations, countries and publication details, such as the publication year and the impact factors in the Journal Citation Reports (JCR) and the Scimago Journal Rank (SJR) for journals and in the CORE Ranking for conferences.
Based on our findings, the second objective is to detect the different approaches for Big Data modeling used in the studies, in order to determine trends and gaps within three key concepts: source, modeling and database. The SLR study conducted in this research consolidates the research related to Big Data modeling into a single document, to benefit industry, academia and the community.

Research Questions

This stage comprises the most important phase of the protocol development. Hence, we took particular care while following Kitchenham's suggestions. Firstly, we identified three actors within the population: (1) researchers, (2) information analysts and (3) software developers who research, document and implement solutions for Big Data modeling and management in databases. Secondly, we considered collecting all the approaches related to data modeling oriented to Big Data. Thirdly, as outcomes, we intend to summarize the findings and determine the trends and gaps in the studied topic. This study raised the following research questions:

Research Question (RQ1): How has the number of published papers about Big Data modeling and management changed over time?

Rationale: Our interest is to consolidate, through a bibliometric analysis, the research efforts on the topic, giving researchers access to the information about the authors and the publication data in a single document. Thus, the reader can learn how the studies on our topic of interest have grown over time, which authors made significant contributions to the subject, which studies are the most cited and which countries are most interested in this research topic. The reader can also learn which journals and conferences are involved in this topic and which scientific libraries hold the largest share of studies about Big Data modeling and management. In addition, we wanted to know whether this research was mostly funded by industry or academia.
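As an illustration only, the kind of tallies behind such a bibliometric analysis can be sketched in a few lines of Python. The record fields (`title`, `year`, `country`, `citations`), the function names and the toy records are hypothetical placeholders introduced for this sketch; they are not the data or the tooling used in this SLR.

```python
from collections import Counter

def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(record[field] for record in records)

def papers_per_year(records):
    """Return (year, number of papers) pairs in chronological order."""
    return sorted(count_by(records, "year").items())

def most_cited(records, n=3):
    """Return the n records with the highest citation counts."""
    return sorted(records, key=lambda r: r["citations"], reverse=True)[:n]

# Toy placeholder records (hypothetical, not results of this study).
toy_records = [
    {"title": "Study A", "year": 2015, "country": "Country X", "citations": 12},
    {"title": "Study B", "year": 2017, "country": "Country Y", "citations": 40},
    {"title": "Study C", "year": 2017, "country": "Country X", "citations": 5},
]

print(papers_per_year(toy_records))
print(count_by(toy_records, "country").most_common())
print([r["title"] for r in most_cited(toy_records, n=2)])
```

The same `count_by` helper could be pointed at any other field of interest, such as venue or funding source, provided the records carry that field.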

Research Question (RQ2): Are there any research studies that focus on approaches for semi-structured and unstructured data and, if so, what techniques do they apply?

Rationale: Our goal is to find out whether the studies are focused on semi-structured and unstructured data, which, according to the data specified in the Big Data Concepts subsection, comprise most of the available data. In addition, we intend to present what models the researchers propose at each modeling abstraction level and to determine three key concepts: source, modeling and database:

  • For source: The dataset sources and data types;
  • For modeling: The data abstraction levels, the data model proposed at conceptual, logical and physical levels, the techniques used for transformations between abstraction levels, the applied modeling language, the modeling methodology and the proposed tools for automatic model transformation;
  • For database: The database type and the evaluation and performance comparison between models.

Research Question (RQ3): What are the trends and gaps in Big Data modeling and management?

Rationale: Based on the data obtained in RQ2, our main interest is to present the solutions proposed by researchers in this topic in a consolidated work. The objective is to allow researchers to focus their efforts on the gaps and on solutions that allow for the standardization of existing or novel methods.

Strategy

The strategy to conduct an exhaustive compilation of studies on the topic of interest included four actions:
1. Finding primary studies in scientific digital libraries, mainly considering whether: (1) they contain indexed research documents, (2) their databases are updated frequently and (3) they publish research related to our topic of interest. The sources listed below comply with these requirements and allow us to focus our systematic review on relevant research:

  • IEEE Xplore
  • ScienceDirect
  • Scopus
  • Web of Science (WoS)

Moreover, according to a comprehensive study that evaluated the quality of 28 scientific search systems, Google Scholar is inappropriate as a principal search system, while ScienceDirect, Scopus and WoS are suitable for evidence synthesis in an SLR.

2. Applying the inclusion criteria to the primary studies. To select the studies related to Big Data modeling and management, we searched for specific matching terms within the articles' titles, abstracts and keywords. Based on our research questions, two major search terms were derived: big data and model. The terms were selected after trying several combinations, in order to obtain a significant number of articles, and they covered the majority of studies that addressed our research subject.
Because the selected digital libraries do not share a common search syntax, we list the search string applied in each one. The word "model" was used because some studies use this term when referring to modeling:

  • IEEE Xplore - ((("Document Title":"big data" and "data model") OR "Abstract":"big data" and "data model") OR "Index Terms":"big data" and "data model")
  • ScienceDirect - Title, abstract, keywords: "big data" and "data model"
  • Scopus - TITLE-ABS-KEY ("big data" AND "data model")
  • WoS - TS = ("big data" AND "data model"). TS refers to the Topic field, which includes titles, abstracts and keywords.
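For reproducibility, the four search strings listed above can be generated from the same pair of phrases, which makes it easy to check that every library was queried with equivalent terms. The following is a minimal sketch: the function name `build_queries` is hypothetical, the operator spelling is copied verbatim from the strings above, and the code only builds text; it does not call any library's search service.

```python
def build_queries(first: str, second: str) -> dict:
    """Build one search string per digital library from two quoted phrases.

    The templates mirror the strings listed above; they are not validated
    against the current search syntax of any library.
    """
    lower = f'"{first}" and "{second}"'  # spelling listed for IEEE Xplore and ScienceDirect
    upper = f'"{first}" AND "{second}"'  # spelling listed for Scopus and WoS
    ieee = (
        f'((("Document Title":{lower}) '
        f'OR "Abstract":{lower}) '
        f'OR "Index Terms":{lower})'
    )
    return {
        "IEEE Xplore": ieee,
        "ScienceDirect": lower,  # entered in the "Title, abstract, keywords" field
        "Scopus": f"TITLE-ABS-KEY ({upper})",
        "WoS": f"TS = ({upper})",
    }

queries = build_queries("big data", "data model")
for library, query in queries.items():
    print(f"{library}: {query}")
```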

For the inclusion criteria, only studies written in English and published in conferences or journals were considered. Although no date limit was defined in our search criteria, no results prior to 2010 were returned by any of the selected scientific libraries. These results are consistent with Figure 1, where a report from Google Trends shows that the term Big Data started to become popular in 2011;
3. Reviewing the studies a second time by reading the papers' content, in order to discard those not relevant to the context of our topic of interest;
4. Applying the snowballing technique to locate additional relevant articles from the references of the already-reviewed studies.
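The mechanical parts of actions 2 and 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tooling used in this SLR: the `Study` record, its field names and the toy entries are hypothetical, and the content review of action 3 is a manual reading step that is deliberately not automated here.

```python
from dataclasses import dataclass

@dataclass
class Study:
    # Hypothetical record layout for an exported search result.
    title: str
    abstract: str = ""
    keywords: tuple = ()
    language: str = "English"
    venue_type: str = "journal"   # "journal", "conference" or anything else
    references: tuple = ()        # titles cited by this study

REQUIRED_PHRASES = ("big data", "data model")

def matches_terms(study: Study) -> bool:
    """True if both phrases occur in the title, abstract or keywords."""
    text = " ".join((study.title, study.abstract, *study.keywords)).lower()
    return all(phrase in text for phrase in REQUIRED_PHRASES)

def meets_inclusion(study: Study) -> bool:
    """Inclusion criteria: written in English, published in a journal or conference."""
    return study.language == "English" and study.venue_type in {"journal", "conference"}

def snowball_candidates(included: list) -> list:
    """One backward snowballing round: cited titles not already among the included studies."""
    known = {study.title.strip().lower() for study in included}
    found = []
    for study in included:
        for ref in study.references:
            key = ref.strip().lower()
            if key not in known:
                known.add(key)
                found.append(ref)
    return found

# Toy placeholder records (hypothetical, not studies from this review).
candidates = [
    Study("A data model for big data stores", references=("Cited study X",)),
    Study("Big data model in German", abstract="data model", language="German"),
    Study("Image compression", abstract="codec design"),
]
selected = [s for s in candidates if matches_terms(s) and meets_inclusion(s)]
print([s.title for s in selected])
print(snowball_candidates(selected))
```

Each title returned by `snowball_candidates` would still have to pass the same inclusion criteria and the manual content review before being added to the selected set.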