Businesses and institutions must collect and store temporal data for accountability and traceability. This paper presents an approach to dealing with transaction lineage that considers how data can be stored at different timestamp granularities, together with methods for refreshing data warehouses with time-varying data via batch cycles.
Literature Review
Data warehouse refreshes have been a research topic for more than a decade. Most of this research relates to storing and maintaining the current state of data, which fails to provide lineage information: periodic complete reloads discard the updates that occur between two refresh points, leading to a loss of transaction lineage. Previous work on data warehousing has focused mostly on design issues, data maintenance strategies in connection with relational view materialization, and implementation aspects. Little research has been done to date on capturing transaction lineage and on the temporal view maintenance problem, and most previous work ignores the temporal aspects of data warehousing. There is a consensus that Information Systems research must respond to theoretical contributions and attempt to solve the current and anticipated problems of practitioners. Mechanisms are therefore needed to store transaction lineage in conventional databases. Current commercial database systems provide few built-in capabilities to capture transaction lineage or to support a query language for temporal data management; to date, only a few vendors have begun providing time-referenced data storage functionality and SQL facilities in their DBMS products. In a data warehouse, data comes from many sources and refreshes happen several times a day. A data warehouse is also a shared environment whose data is typically used by many applications, each of which may need a different time-slice of the data, so data warehouses must cope with the temporal granularities of their data.
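To make the storage idea concrete, the following is a minimal sketch, assuming the common convention of a pair of validity timestamps per row version, of how transaction lineage can be kept in a conventional, non-temporal database and queried by time-slice. The table and column names (account_balance, row_effective_ts, row_expired_ts) are hypothetical illustrations, not taken from any particular vendor's product.

```python
# Minimal sketch: transaction lineage in a conventional (non-temporal) database.
# Each row version carries effective/expired timestamps; a time-slice query
# reconstructs the state of the data as of any point in time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account_balance (
        account_id       INTEGER,
        balance          REAL,
        row_effective_ts TEXT,   -- when this version became current
        row_expired_ts   TEXT    -- '9999-12-31' marks the open (current) version
    )
""")

# Two versions of the same account: the first was expired when the second arrived.
conn.executemany(
    "INSERT INTO account_balance VALUES (?, ?, ?, ?)",
    [
        (1, 100.0, "2024-01-01 00:00:00", "2024-01-15 08:00:00"),
        (1, 250.0, "2024-01-15 08:00:00", "9999-12-31 00:00:00"),
    ],
)

def balance_as_of(ts: str):
    """Return the time-slice of account_balance that was current at ts."""
    return conn.execute(
        """
        SELECT account_id, balance FROM account_balance
        WHERE row_effective_ts <= ? AND ? < row_expired_ts
        """,
        (ts, ts),
    ).fetchall()

print(balance_as_of("2024-01-10 00:00:00"))  # [(1, 100.0)]
print(balance_as_of("2024-02-01 00:00:00"))  # [(1, 250.0)]
```

Under this convention, each application retrieves the time-slice it needs by supplying its own reference timestamp, which is precisely the flexibility a shared warehouse environment requires.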
Temporal data warehouses raise many issues, including consistent aggregation in the presence of time-varying data, temporal queries of multidimensional data, storage methods, and temporal view materialization. The temporal aggregation problem has been studied to address the challenges of temporal data, and much research is now being done to improve the efficiency of range aggregate queries in a temporal data warehouse. Kaufmann presents native support for temporal features in in-memory analytics databases. Stonebraker suggests a new SQL for business intelligence (BI) queries, as they are so resource-heavy that they get in the way of timely responses to transactions. New SQL support for temporal data with timestamp-based versioning is also much needed.
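As a concrete illustration of the range aggregate problem mentioned above (our own sketch, not code from the cited work), the following computes an aggregate over all row versions whose validity interval overlaps a query range; the sample data and the half-open [effective, expired) interval convention are assumptions made for the example.

```python
# Temporal range aggregate over versioned rows: aggregate only the versions
# whose [effective, expired) validity interval overlaps the query range.
from datetime import datetime

rows = [
    # (value, effective, expired) -- hypothetical readings
    (10.0, datetime(2024, 1, 1), datetime(2024, 1, 10)),
    (12.0, datetime(2024, 1, 10), datetime(2024, 2, 1)),
    (11.0, datetime(2024, 2, 1), datetime(9999, 12, 31)),
]

def range_max(rows, start, end):
    """MAX over all versions whose validity interval overlaps [start, end)."""
    overlapping = [v for (v, eff, exp) in rows if eff < end and start < exp]
    return max(overlapping) if overlapping else None

print(range_max(rows, datetime(2024, 1, 5), datetime(2024, 1, 20)))  # 12.0
```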
Wang et al. study the "problem of how to maintain temporal consistency of real-time data in distributed real-time systems." Malinowski and Zimanyi provide a conceptual model for temporal data warehouses with "support for levels, attributes, hierarchies, and measures." Chau and Chittayasothorn propose a temporal object-relational SQL language with attribute timestamping, a superset of the OR SQL language. Viqueira and Lorentzos propose an SQL extension for the management of spatio-temporal data. Mkaouar et al. study how to simplify querying and manipulating temporal facts in SQL by integrating time in a native manner. Li et al., Kvet and Matiasko, and Jestes et al. provide insights into ranking large temporal data. Gupta et al. provide an overview of outlier detection for various forms of temporal data.
In this article, we focus on an innovative approach for dealing with transaction lineage and storing it at different timestamp granularities. We present methodologies for refreshing data warehouses with time-varying data via batch cycles. This approach is suitable for large data warehouses with hundreds of subject areas and thousands of tables, where refreshes occur within one- to four-hour windows. We propose the use of conventional extract-transform-load (ETL) tools to extract data from source systems and load the staging subject areas in the data warehouse without performing any transformation tasks. As soon as the staging tables are refreshed, the data warehouse software performs transformations to insert new rows into the actual data warehouse (analytical subject area) tables and to update those tables by applying row-expired timestamps to the preexisting rows that correspond to the newly arrived rows. ETL represents the most important stage of (temporal) data warehouse design, as 70% of the risk and effort is attributed to this stage. We also examine the possibility of using metadata tables to recompile views based on subject area refresh timestamps. Finally, we show that there are opportunities to use performance improvement features of conventional commercial databases, such as indexing, to load and query temporal data in non-temporal databases; these features are very important for handling large volumes of transaction lineage data.
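The batch-cycle transformation step described above can be sketched as follows. This is a minimal illustration under assumed names (stg_customer, dw_customer, and subject_area_refresh are hypothetical), not the paper's production implementation: preexisting row versions are expired with the batch-cycle timestamp, the newly arrived rows are inserted as current versions, and the subject-area refresh timestamp is recorded in a metadata table for view recompilation.

```python
# Minimal sketch of a batch-cycle refresh against assumed tables; names are
# hypothetical. An ETL tool is presumed to have loaded stg_customer already.
import sqlite3

OPEN_TS = "9999-12-31 00:00:00"  # marks the open (current) row version

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (customer_id INTEGER, name TEXT);
    CREATE TABLE dw_customer  (customer_id INTEGER, name TEXT,
                               row_effective_ts TEXT, row_expired_ts TEXT);
    CREATE TABLE subject_area_refresh (subject_area TEXT, refresh_ts TEXT);
""")

def apply_batch_cycle(conn, batch_ts):
    # Expire the current version of every row that has a newer version in staging.
    conn.execute("""
        UPDATE dw_customer SET row_expired_ts = ?
        WHERE row_expired_ts = ?
          AND customer_id IN (SELECT customer_id FROM stg_customer)
    """, (batch_ts, OPEN_TS))
    # Insert the newly arrived rows as the current versions.
    conn.execute("""
        INSERT INTO dw_customer
        SELECT customer_id, name, ?, ? FROM stg_customer
    """, (batch_ts, OPEN_TS))
    # Record the refresh so dependent views can be recompiled against a
    # consistent time-slice of the subject area.
    conn.execute(
        "INSERT INTO subject_area_refresh VALUES ('customer', ?)", (batch_ts,))
    conn.execute("DELETE FROM stg_customer")  # empty staging for the next cycle
    conn.commit()

# Two batch cycles: the second updates customer 1 while preserving its lineage.
conn.execute("INSERT INTO stg_customer VALUES (1, 'Acme Ltd')")
apply_batch_cycle(conn, "2024-01-01 02:00:00")
conn.execute("INSERT INTO stg_customer VALUES (1, 'Acme Ltd (renamed)')")
apply_batch_cycle(conn, "2024-01-02 02:00:00")

for row in conn.execute("SELECT * FROM dw_customer ORDER BY row_effective_ts"):
    print(row)
# (1, 'Acme Ltd', '2024-01-01 02:00:00', '2024-01-02 02:00:00')
# (1, 'Acme Ltd (renamed)', '2024-01-02 02:00:00', '9999-12-31 00:00:00')
```

Because staging is loaded without transformations and the expiry logic runs inside the warehouse, no update between two refresh points is discarded; every superseded row survives as a closed version, which is what preserves the transaction lineage.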