Unit 2: Sources of Data in BI Systems

2a. Identify data sources based on the type of data and how it will be used to support decision-making

  • What are the primary differences between structured and unstructured data?
  • How do BI systems handle structured and unstructured data differently?
  • What are the main challenges associated with storing and evaluating unstructured data? 

Data can take numerous forms and come from a variety of sources, each with its own characteristics and complexities. This diversity is crucial for Business Intelligence (BI) analysts to understand, as they must be adept at identifying and working with different kinds of data. For instance, data can be structured, such as that found in databases with predefined schemas, or unstructured, like free-form text, images, and videos. Structured data is organized in a way that is easily searchable and analyzable using traditional database tools and methods. In contrast, unstructured data lacks a consistent format, which can pose significant challenges for storage and analysis.

BI systems are designed to handle these types of data differently, reflecting the unique requirements and complexities associated with each. Structured data, due to its organized nature, can be efficiently managed and queried using SQL databases and other conventional data management systems. These systems enable BI analysts to perform complex queries and generate insights with relative ease. Unstructured data, however, often requires more sophisticated techniques for processing and analysis. Tools such as natural language processing (NLP), machine learning, and specialized data mining techniques are frequently employed to extract meaningful information from unstructured sources.
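As a minimal sketch of why structured data is easy to query with conventional tools, the example below builds a small in-memory SQLite table (the table name and columns are hypothetical) and runs an aggregate query against its predefined schema:

```python
import sqlite3

# Build a small in-memory table of sales records (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("West", 80.0), ("East", 200.0)],
)

# Because every row fits the same schema, an aggregate query is trivial.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 320.0), ('West', 80.0)]
```

The same question asked of unstructured data, say a folder of scanned invoices, would first require extracting and structuring the values before any such query could run.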

The challenges associated with unstructured data are particularly notable in terms of storage and evaluation. Unlike structured data, which fits neatly into rows and columns, unstructured data can come in varied formats and sizes, making it more cumbersome to store and manage. Additionally, evaluating unstructured data often involves dealing with ambiguities and inconsistencies, requiring advanced algorithms and significant computational resources. BI analysts must therefore be equipped with the appropriate tools and methodologies to effectively handle these complexities, ensuring that both structured and unstructured data can be utilized to derive valuable insights for decision-making.
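To make the contrast concrete, the sketch below imposes a crude structure on a free-form customer comment by tokenizing and counting words. This is only a stand-in for the heavier NLP and machine-learning pipelines mentioned above, and the comment text is invented for illustration:

```python
import re
from collections import Counter

# A free-form customer comment: no schema, so structure must be
# imposed before any analysis can happen.
comment = "Great product, great support. Shipping was slow, support helped."

# Tokenize, lowercase, and count word frequencies as a crude first step.
tokens = re.findall(r"[a-z]+", comment.lower())
freq = Counter(tokens)
print(freq.most_common(2))  # [('great', 2), ('support', 2)]
```

Even this toy step (deciding what counts as a "word") is a judgment call, which is exactly the kind of ambiguity that makes unstructured data harder to evaluate than rows and columns.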


2b. Evaluate data quality and relevance through the use of the six dimensions model

  • What factors constitute data quality?
  • How can data quality be evaluated?
  • How is the six dimensions model used?

Data is the raw material of any BI system. Thus, you must evaluate all data sources for relevance and take deliberate steps to ensure that the data is of high quality. Data is obtained from a wide variety of sources and varies greatly in reliability, accuracy, timeliness, and appropriateness to the application.

Quantitative data measures numeric variables (such as how many, how much, or how often) and is expressed as values or counts. Because it is clearly defined, it can be tabulated and measured directly; for example, researchers can count the number of specific responses to a multiple-choice or yes/no question.

Qualitative data measures categorical variables, or "types" (such as what kind or what name), and may be represented by names, symbols, or number codes. It is descriptive in nature and can tell researchers how respondents feel about a particular product or service and what influences their purchase decisions.

The Six Dimensions Model of Data Quality and Relevance is a framework used to evaluate data across six critical aspects: accuracy, completeness, consistency, timeliness, relevance, and validity. Accuracy assesses how closely data reflects real-world values; completeness ensures all required data is present; consistency checks for uniformity and lack of contradictions; timeliness evaluates whether data is up-to-date and available when needed; relevance measures the data's pertinence to its intended purpose; and validity confirms that data conforms to defined formats and rules. Together, these dimensions provide a comprehensive approach to ensuring that data is reliable, usable, and effective for decision-making and analysis.
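A few of these dimensions can be turned into executable checks. The sketch below scores a hypothetical customer record against three of the six dimensions; the record fields, the email rule, and the one-year timeliness window are all illustrative assumptions, not part of the model itself:

```python
from datetime import date

# Hypothetical customer record to be scored.
record = {
    "name": "Acme Corp",
    "email": "sales@acme.example",
    "last_updated": date(2024, 5, 1),
}

def completeness(rec, required):
    """Completeness: share of required fields present and non-empty."""
    return sum(1 for f in required if rec.get(f)) / len(required)

def is_valid_email(value):
    """Validity: conformance to a defined format (crude rule here)."""
    return "@" in value and "." in value.split("@")[-1]

def is_timely(rec, as_of, max_age_days=365):
    """Timeliness: data updated within an allowed window."""
    return (as_of - rec["last_updated"]).days <= max_age_days

print(completeness(record, ["name", "email", "phone"]))  # 2 of 3 fields present
print(is_valid_email(record["email"]))
print(is_timely(record, as_of=date(2024, 12, 1)))
```

Accuracy, consistency, and relevance are harder to automate because they require a reference point outside the record itself (the real-world value, sibling systems, or the intended use).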


2c. Analyze the effectiveness of data integration strategies and technologies such as real-time processing and the ETL model

  • What is the Extract, Transform, Load (ETL), and why is it relevant to data integration in BI systems?
  • How does data transformation contribute to maintaining consistency and compatibility in data integration for business intelligence?
  • Why is data governance important during the data integration process in business intelligence?

Data integration in business intelligence refers to combining data from different sources or systems within an organization to provide a unified view for analysis and decision-making purposes. This integration typically involves merging data from disparate databases, applications, and platforms into a coherent data repository or warehouse.

Data consolidation combines data from various sources, such as transactional databases, ERP systems, CRM platforms, spreadsheets, and external sources like social media or market research data.

Data transformation involves converting and standardizing data formats, structures, and semantics to ensure consistency and compatibility across different sources. It may also include data cleansing, normalization, and enrichment to improve data quality.
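A common, concrete case of transformation is standardizing dates that arrive in different formats from different source systems. The sketch below (the two source formats are assumed for illustration) normalizes them to ISO 8601 so downstream systems see one consistent representation:

```python
from datetime import datetime

# Dates from two hypothetical source systems: ISO vs day/month/year.
raw = ["2024-03-05", "05/03/2024"]

def standardize(value):
    """Normalize mixed date formats to ISO 8601 for cross-source consistency."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

print([standardize(v) for v in raw])  # ['2024-03-05', '2024-03-05']
```

Raising on unrecognized formats, rather than guessing, is a deliberate choice: silent misparses (is 05/03 May or March?) are precisely the inconsistencies transformation is meant to eliminate.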

Data synchronization ensures that data across different systems is kept up-to-date and synchronized in real-time or at regular intervals to provide users with timely and accurate information.

Data governance involves implementing policies, processes, and controls throughout the integration process to ensure data quality, security, and compliance with regulatory requirements. 

Data modeling involves designing a logical and efficient data model that represents the integrated data in a way that supports meaningful analysis and reporting. This may involve creating dimensional models for data warehouses or data marts. By combining data from disparate sources and providing a unified view of information, data integration in business intelligence enables organizations to gain valuable insights, make informed decisions, and improve overall operational efficiency and effectiveness.

The extract, transform, and load (ETL) process, in which data is extracted from source systems, transformed into the format the target requires, and loaded into the data warehouse that feeds the BI system, is critical to the effective operation of the whole BI system. The process should be capable of identifying and correcting inaccurate data as it flows through. Doing so saves a lot of rework down the road and ensures higher levels of data quality.
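The three ETL stages can be sketched end to end in a few lines. In this minimal, assumed example, the "source system" is a CSV string containing one bad row, the transform step rejects it before loading, and the "warehouse" is an in-memory SQLite table:

```python
import csv
import io
import sqlite3

# Hypothetical source extract: one row has a non-numeric amount.
source = "order_id,amount\n1,100.0\n2,oops\n3,250.5\n"

# Extract: read raw rows from the source system.
rows = list(csv.DictReader(io.StringIO(source)))

# Transform: coerce types and drop rows that fail validation,
# catching bad data before it reaches the warehouse.
clean = []
for r in rows:
    try:
        clean.append((int(r["order_id"]), float(r["amount"])))
    except ValueError:
        pass  # in practice, log or quarantine the rejected row

# Load: write the validated rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(total)  # (2, 350.5)
```

Real ETL tools add scheduling, incremental loads, and lineage tracking, but the shape is the same: validate during transform so bad data never lands in the warehouse.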


2d. Apply big data models and NoSQL sources to BI through the use of integration of data storage and retrieval systems

  • How has the emergence of big data shifted the approach to data integration and analysis in business intelligence compared to previous practices?
  • What are some examples of non-relational data types that are suitable for querying with NoSQL techniques, and how do they differ from structured data?
  • Why might cloud-based solutions be considered advantageous for managing and analyzing large volumes of big data in business intelligence?

Big data is a term that has emerged in the last several years to describe extremely large and complex datasets that exceed the capabilities of traditional data processing tools. The most significant implication of big data for business intelligence is that we now need to think of the data feeding our BI systems as coming from potentially anywhere. In the past, the data we used was mostly generated internally and mostly in a very structured format. Now, data may come from anywhere and be in any format. Specialized tools like Hadoop have been developed to process big data at scale and prepare it for inclusion in our data warehouse.

Many relational databases are what we call SQL databases: we can use Structured Query Language (SQL) to query the database and return results. Because of the many different types of data we may want to use for business intelligence, NoSQL, which should be read as "Not only SQL", is a set of tools we can use to query data that is not structured into a formal relational database. The primary example of this type of data is social media data, but other non-relational sources exist, such as machine-generated sensor logs, document collections like emails, and multimedia files. As you review, try to identify three types of data in your own context that would be suitable for applying NoSQL techniques.
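The sketch below shows why social-media-style records resist a fixed relational schema: each document can carry different fields. The records are invented, and a plain list comprehension stands in for a document-database query:

```python
import json

# Social-media-style records (hypothetical): each document carries
# different fields, which a fixed relational schema would not tolerate.
docs = [
    {"user": "ana", "text": "Loving the new release!", "likes": 14},
    {"user": "ben", "text": "Any BI tips?", "tags": ["bi", "help"]},
]

# Document stores query by field presence and value rather than by joins;
# here a list comprehension plays the role of that query.
tagged = [d["user"] for d in docs if "tags" in d]
print(tagged)  # ['ben']

# Documents serialize naturally to JSON, the common format of many
# NoSQL document stores.
print(json.dumps(docs[0], sort_keys=True))
```

In a relational table, the missing `likes` and `tags` columns would force NULLs or a schema change; a document store simply accepts each record as it comes.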

As the amount of data required for analysis grows into so-called big data, the manager needs to consider what type of system architecture is required. Cloud-based solutions, in which computing services and resources are delivered over the internet, can be an attractive option in many situations. Cloud-based solutions offer scalable, flexible, and cost-effective options for handling big data. They provide on-demand storage, computing power, and advanced analytics tools without the need for extensive on-premises hardware. Cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, enable organizations to easily scale resources up or down based on their data processing needs. Additionally, cloud-based solutions often include integrated tools for data integration, machine learning, and real-time analytics, allowing businesses to gain valuable insights from big data quickly and effectively while minimizing infrastructure management overhead.


Unit 2 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • big data
  • cloud-based solutions
  • data consolidation
  • data governance
  • data integration
  • data synchronization
  • data transformation
  • extract, transform, and load (ETL)
  • NoSQL
  • qualitative data
  • quantitative data
  • Six Dimensions Model
  • structured data
  • unstructured data