This section expands on the data warehouse components and provides an overview of the different data areas within the data warehouse, as shown in Figure 2-15. Note that each data area can include one or more physical databases.
Figure 2-15: Data warehouse data areas
Data warehouse data areas include:
Production databases are where data is cleansed, business rules are applied, lineage is introduced, and record versions are maintained.
Data in databases contain a mirror copy of a subset of source system data.
The consumption area is accessed by business consumers and can include a data warehouse, multiple data marts, or a combination of the two.
The exception data area is where records that fail data quality checks and business rules are held.
Metadata describes the data itself, both in business terms and technical terms. It includes definitions, rules, and origins of all data in the data warehouse. In addition, process metadata helps define configurations and implement security within the data warehouse.
Logging databases are important for recording the day-to-day activity within the data warehouse and typically include logging activity from data integration processes and, optionally, consumption data area access.
Archived databases hold aged data removed from data areas to improve performance.
These data areas are the focus of the remainder of this section, which is organized by data warehouse maturity level—starting with basic configurations seen in departmental data marts and moving to more advanced configurations seen in EDWs.
One Data Area: Full Loads
The first implementation for many data warehouses and data marts is a full load, shown in Figure 2-16.
Figure 2-16: Full load
This implementation truncates and reloads the data warehouse or data mart directly from one source on a scheduled basis (i.e., daily, weekly, monthly). This is the simplest implementation and requires the least amount of data integration code. The data model is typically a denormalized dimensional model.
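As a rough sketch of this pattern (plain Python with hypothetical table and column names, not T-SQL or any specific tool), a full load simply discards the target's contents and reloads everything from the source:

```python
# Illustrative sketch of a full (truncate-and-reload) load.
# Table and column names are hypothetical, not from the text.

def full_load(source_rows, warehouse_table):
    """Truncate the warehouse table and reload it from the source."""
    warehouse_table.clear()              # truncate: discard all existing rows
    warehouse_table.extend(source_rows)  # reload everything from the source
    return len(warehouse_table)

source = [{"customer_id": 1, "city": "Seattle"},
          {"customer_id": 2, "city": "Portland"}]
dim_customer = [{"customer_id": 1, "city": "Tacoma"}]  # stale prior load

loaded = full_load(source, dim_customer)
```

Note that nothing of the prior load survives, which is exactly why full loads cannot preserve history.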
However, organizations soon realize that full loads have the following issues:
No history or no point-in-time history. Many OLTP systems keep only current data, which precludes historical reporting. Even if the OLTP systems store history, these historical records are often modified over time. This makes it impossible to capture the state of data at a particular point in time.
Extended processing times. As data volumes grow, it takes longer and longer to drop indexes, truncate, reload, and reindex all data. This becomes an issue when processing extends into business-usage hours.
Data quality issues. Data stewards have little to no visibility into record change histories, which makes it difficult to track data issues back to the source data and process.
These problems with full loads lead most organizations to an incremental load approach.
One Data Area: Incremental Loads
The first step for incremental loads is to create a data model that supports history. The dimensional data model is extended to support record versions. When implemented within dimensions, this is referred to as a Slowly Changing Dimension II (SCD II).
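The versioning behavior of an SCD II dimension can be sketched in plain Python (column names are hypothetical; real implementations use surrogate keys and set-based SQL rather than row-at-a-time code):

```python
from datetime import date

def apply_scd2(dimension, incoming, business_key, today):
    """Expire the current version of any changed row and append a new version."""
    current = {row[business_key]: row
               for row in dimension if row["end_date"] is None}
    for rec in incoming:
        live = current.get(rec[business_key])
        if live is not None and all(live[col] == val for col, val in rec.items()):
            continue                      # unchanged: keep the current version
        if live is not None:
            live["end_date"] = today      # close out the superseded version
        dimension.append(dict(rec, start_date=today, end_date=None))

# A customer moves: the old row is end-dated and a new version is appended.
dim_customer = [{"customer_id": 1, "city": "Tacoma",
                 "start_date": date(2010, 1, 1), "end_date": None}]
apply_scd2(dim_customer, [{"customer_id": 1, "city": "Seattle"}],
           "customer_id", date(2011, 6, 1))
```

Because old versions are never deleted, point-in-time reporting becomes possible: the row whose date range contains the reporting date is the version in effect at that time.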
With incremental loads, organizations can report on history, but this approach has the following issues:
Because one data area serves both consumption and production, the database is offline to business users while data integration processing is active.
Merging multiple business entities into one dimension makes it difficult to enforce a single version of the truth, and tracking down and resolving data quality issues is a time-consuming process.
The AdventureWorksDW2008R2 DimCustomer dimension provides a simple example of how one dimension is populated from many source tables. Figure 2-17 shows the dimensional table along with its sources from the AdventureWorks2008R2 OLTP database. Note that 13 tables were identified as sources for the DimCustomer dimension.
Figure 2-17: DimCustomer dimension and its sources
Note that the AdventureWorks2008R2 OLTP database presents an overly simplistic source data model. For example, the Person table contains two XML columns—contact information and demographics—whose values are used to populate DimCustomer columns, including education, number of cars owned, total children, and number of children at home. This information is typically obtained from a loan application, a credit card application, and/or a survey, which are often separate applications with separate data models.
Here are additional issues encountered when the dimensional model is used to store and track history:
Common entities (e.g., Address, Person) can’t be used to populate other data warehouse tables.
Consolidating customer information from multiple source systems directly into the denormalized data warehouse table would make it more difficult for data stewards to track changes back to the respective source systems they’re responsible for.
Hopefully this simple example helps illustrate the level of complexity that exists within data, and why it’s beneficial to have a separate production data area.
Adding a Production Data Area
Figure 2-18 shows separate data areas for production and consumption. The production area is where information loaded from sources is normalized and where business rules are applied, business keys are identified and enforced, lineage is introduced, and data is prepared for loading into downstream data layers.
Figure 2-18: Separate production data area
Production areas are typically seen in organizations that have moved from data marts to data warehouses within the data warehouse maturity model. At this point, the primary focus of the data warehouse shifts from providing results to consumers to integrating data from multiple sources.
The production area is the data integration working area where a “single version of the truth” is created and serves as the basis for data consistency and quality:
The data model is typically normalized, although less so than in the source systems.
This layer enforces relationships between entities using natural keys.
The production area is referenced by data stewards and analysts when verifying and reconciling outputs with inputs.
Production databases feed all consumption databases.
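The points above about natural keys and lineage can be illustrated with a simplified Python sketch (the email natural key, source names, and columns are hypothetical examples, not from the text):

```python
def merge_by_natural_key(sources):
    """Merge rows from several sources into one record per natural key,
    tracking which source supplied each column (simple lineage)."""
    merged = {}
    for source_name, rows in sources.items():
        for row in rows:
            key = row["email"]  # hypothetical natural key
            entry = merged.setdefault(key, {"email": key, "lineage": {}})
            for col, val in row.items():
                if col != "email" and col not in entry:
                    entry[col] = val                     # first source wins
                    entry["lineage"][col] = source_name  # record provenance
    return merged

customers = merge_by_natural_key({
    "crm":     [{"email": "ann@example.com", "name": "Ann"}],
    "billing": [{"email": "ann@example.com", "name": "A. Smith", "balance": 10}],
})
```

The lineage map is what lets a data steward trace any consolidated value back to the source system responsible for it.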
What About Very Large Tables?
Source system transaction and fact tables are the largest tables in a data warehouse and often have multi-billion record counts. Given these volumes, it's tempting for the data warehouse team to load these tables directly into the consumption area and not keep a master copy in the production area.
Although this strategy reduces the size of the production database, it also means that the production area no longer supports a single version of the truth. Given this issue, it’s generally recommended that the data warehouse team not use this approach.
See Chapter 3 for more information about loading patterns for very large tables using versioned inserts.
The Consumption Data Area
After data is loaded into the production area, the next step is to transform the data into a format more conducive to consumption—such as a dimensional data model. In addition, data will often be summarized and aggregated when there are large volumes of data.
Figure 2-19 shows the options for the consumption area: one data warehouse, multiple data marts, or a combination of both. This decision is based on many factors, including organizational structure, geographic location, security, data volumes, and platform architecture.
For the purpose of this discussion, the difference between a data mart and a data warehouse within the consumption area is scope:
A data mart contains one subject area.
Data warehouses have larger volumes of data than data marts and contain multiple subject areas.
Note that there are strong views in the industry surrounding data marts and data warehouses. You can read Bill Inmon’s perspective in the article Data Mart Does Not Equal Data Warehouse.
Figure 2-19: Consumption area options
As we noted, consumption area options are as follows:
One data warehouse with multiple subject areas provides the greatest availability to data; users can drill down and drill across subject areas within the data warehouse.
Multiple data marts typically provide better performance because the data volumes are less than those in a data warehouse.
Having both a data warehouse and multiple data marts allows the business consumer to choose between completeness and performance.
Often, the option chosen by the data warehouse team depends on data volumes, as shown in Figure 2-20.
Figure 2-20: Data volume impact on the consumption area
Note that the question of when the consumption area can no longer support one data warehouse depends not only on data volumes but also on the underlying platform architecture.
The Data in Data Area
The next decision is whether to have a Data in area or not. Figure 2-21 shows how the Data in area is populated with source data as the first step in the data integration process. The production area is then loaded from the Data in area.
Figure 2-21: Data in area
The decision about whether to have a Data in data store is based on business rules, available storage, data integration patterns, and whether your system can accommodate increased processing times.
Data in databases are probably not needed if data sources are fully available and there is an archive strategy in place for sources. However, here are some reasons to have a Data in area:
Source data is preserved for auditing and reload.
Preserving flat file sources within a database allows database backup and restore to be used as opposed to a separate file system backup.
The Data in area facilitates very large data warehouse implementations:
Landing source data for very large data warehouses in extract databases lets you manage the amount of data being loaded by segmenting data in batches.
Processing batches can be introduced in parallel fashion, reducing data latency in a data warehouse.
Aggregations and calculations can be applied at the batch level, speeding up loads that require an intermediate area for applying aggregations and other calculations to source databases.
The Data in area supports entities residing in multiple source systems:
Entities residing in multiple source systems typically have different designs. For example, the Products entity within a large financial services conglomerate can exist in many different systems.
Introducing extract databases can help when the complete definition of a dimension is not available until data from all relevant entities is made available or processed.
Such an area helps handle data inconsistency and late-arriving facts and dimensions:
Issues related to data inconsistency and dirty data are preserved in extract layers so that this information can be analyzed and corrected at the source level.
Late-arriving facts and dimensions can be hosted in the extract area for consolidation and loading into downstream data warehouse layers once the data is complete.
Preserving the source data in an extract layer also helps in identifying common patterns in entities, which allows for more efficient design of relevant dimensions as they are shared across multiple data sources.
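The batch-segmentation idea described above can be sketched as follows (illustrative Python; a real implementation would typically batch by key ranges or partitions rather than row offsets):

```python
def load_in_batches(extract_rows, batch_size, load_batch):
    """Segment a large extract into fixed-size batches and load each batch,
    so very large loads can be managed (or parallelized) piecewise."""
    batches = 0
    for start in range(0, len(extract_rows), batch_size):
        load_batch(extract_rows[start:start + batch_size])
        batches += 1
    return batches

target = []
n_batches = load_in_batches(list(range(10)), batch_size=3,
                            load_batch=target.extend)
```

Because each batch is independent, batches can be dispatched to parallel workers, which is how landing data in a Data in area helps reduce latency for very large warehouses.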
Chapter 3 covers source system extraction and data integration patterns in more detail. However, note that some SQL Server technologies can be used to populate the Data in area, including database mirroring, log shipping, and database replication. SQL Server 2008 introduced a fourth option: Change Data Capture.
The following links provide details about SQL Server high availability options:
Log Shipping Overview
SQL Server Replication
Basics of Change Data Capture
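Change Data Capture itself is a SQL Server feature; as a generic illustration of the underlying idea of incremental extraction, here is a simple high-watermark sketch in Python (integer timestamps stand in for real modified dates; this is not the CDC API):

```python
def extract_changes(source_rows, last_watermark):
    """High-watermark extract: pull only rows modified since the last run,
    and return the new watermark to persist for the next run."""
    changed = [r for r in source_rows if r["modified"] > last_watermark]
    new_watermark = max((r["modified"] for r in changed),
                        default=last_watermark)
    return changed, new_watermark

rows = [{"id": 1, "modified": 100}, {"id": 2, "modified": 250}]
delta, watermark = extract_changes(rows, last_watermark=150)
```

Only the delta crosses into the Data in area on each run, which is what keeps incremental loads cheap relative to full loads.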
For more information about whether to use a Data in area, see the “To Stage or not to Stage” section of Chapter 2 in The Data Warehouse ETL Toolkit, by Ralph Kimball and Joe Caserta.
Exception and Logging Data Areas
The exception and logging data areas are both populated by data integration processes, as Figure 2-22 shows.
Figure 2-22: Exception and logging data areas
Data integration processes use business rules, data quality checks, and data lookups to identify and move data records into the exception area. Whether the entire data record or just its natural key is stored depends on the implementation. Data stewards use the exception data area to troubleshoot and correct data exceptions.
Data integration processes also populate the logging data area with information used to monitor and track the status of data loads. In addition, data integration processing errors are logged and used by data integration developers to troubleshoot the processes. See Chapter 3 on data integration for more information about these topics.
The data warehouse may also log consumer query activity. This is often useful in determining usage patterns, optimizing queries, and developing user charge-back models.
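The interplay between the exception and logging areas can be sketched as follows (illustrative Python; the check names and record layouts are hypothetical):

```python
def run_quality_checks(rows, checks, exceptions, log):
    """Route rows failing any data quality check to the exception area and
    record load activity in the logging area."""
    loaded = []
    for row in rows:
        failures = [name for name, check in checks.items() if not check(row)]
        if failures:
            # store the natural key and which checks failed, for stewards
            exceptions.append({"natural_key": row["id"], "failed": failures})
        else:
            loaded.append(row)
    log.append({"rows_read": len(rows), "rows_loaded": len(loaded),
                "rows_rejected": len(rows) - len(loaded)})
    return loaded

checks = {"has_name": lambda r: bool(r.get("name"))}
exception_area, load_log = [], []
clean = run_quality_checks(
    [{"id": 1, "name": "Ann"}, {"id": 2, "name": ""}],
    checks, exception_area, load_log)
```

The log entry gives operators a per-run audit trail, while the exception records give data stewards enough context to trace a failure back to its source.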
The Archive Data Area
One way to optimize performance for any database is to reduce the size of the data being queried. Deleting data, however, conflicts with the data warehouse's objective of providing one version of the truth over time. When data is deleted or purged, it is lost—along with the ability to obtain certain historical perspectives.
The alternative to deleting data is to archive it. Archiving data is the process of moving data from a primary data store to a secondary data store. Figure 2-23 contains an example of an archive data area within the data warehouse.
Figure 2-23: Data warehouse with an archive data area
Note that the secondary data store in the above example is a database, but it could also be a file or set of files. Archiving improves performance while supporting the preservation of data.
Archiving data is typically based on when the data was loaded, and the archive schedule within a data warehouse varies by data area. The Data in area is frequently archived or truncated, while production data may require more frequent archiving in data warehouses with massive amounts of data. Examples of how vertical industry requirements affect consumption data include:
Health care industry: Seven to 10 years of data is required for claims data for companies doing business in the US.
Financial industry: Seven years of data is required for financial audit data for companies doing business in the US.
In general, every very large data warehouse should have an archiving strategy. This strategy also needs to support the repopulating of archived tables should the need arise for this data. In these cases, the destination can be the same or a different table. The architecture team should base its archiving requirements on legal and industry standards as well as user access patterns.
Note that the physical partitioning of tables by date simplifies the data archival processes.
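A minimal sketch of load-date-based archiving (plain Python; a partitioned implementation would switch out whole partitions rather than copying rows):

```python
from datetime import date

def archive_old_data(primary, archive, loaded_before):
    """Move rows loaded before the cutoff from the primary store to the
    archive store; data is preserved rather than deleted."""
    keep = [r for r in primary if r["load_date"] >= loaded_before]
    moved = [r for r in primary if r["load_date"] < loaded_before]
    archive.extend(moved)   # preserved for repopulation if ever needed
    primary[:] = keep       # primary store shrinks, improving performance
    return len(moved)

fact = [{"id": 1, "load_date": date(2003, 5, 1)},
        {"id": 2, "load_date": date(2010, 2, 1)}]
archive_store = []
archived = archive_old_data(fact, archive_store,
                            loaded_before=date(2004, 1, 1))
```

Because the moved rows still exist in the secondary store, the archive can be replayed into the same or a different table if the aged data is needed again.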
Metadata
As we mentioned earlier, metadata is data that describes the data itself. It includes definitions, rules, and origins of all data in the data warehouse. Metadata is important because it gives context to the data in the data warehouse and is typically classified into business and technical metadata. Ralph Kimball adds another category: process metadata, which is a variant of technical metadata.
Business metadata – Provides business definitions for database objects, including databases, tables, and columns. It also includes additional information about columns, including but not limited to business rules and field types.
Technical metadata – Documents technical aspects of data; the classic example of technical metadata is a data dictionary. Other examples include source-to-destination mappings used in data integration as well as results from data profiling and lineage.
Process metadata – A type of technical metadata used to configure key components and processes within the data warehouse, including data integration and security.
The following are examples of process metadata. Data warehouses include databases that store various configuration settings, including:
Data integration connection properties
Environment variables, XML configuration files
Size of ETL batches
Data warehouse default values
Security-related scenarios are addressed in databases residing in the data warehouse security area.
Access to data in the data warehouse is defined by business rules. There are often requirements to secure data both vertically and horizontally. Client-level access restrictions are a common example of these requirements, where data needs to be secured by specific client IDs. Additionally, there are often provisions for securing data on a row-level basis.
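These two security dimensions can be sketched as follows (hypothetical column names; real implementations typically enforce this with views or row-level security features in the database rather than application code):

```python
def secure_rows(rows, user_client_ids, allowed_columns):
    """Apply horizontal (row-level) and vertical (column-level) security:
    keep only the user's clients and only the permitted columns."""
    return [{col: row[col] for col in allowed_columns}
            for row in rows if row["client_id"] in user_client_ids]

rows = [{"client_id": 7, "balance": 100, "tax_id": "x"},
        {"client_id": 9, "balance": 200, "tax_id": "y"}]
visible = secure_rows(rows, user_client_ids={7},
                      allowed_columns=["client_id", "balance"])
```

Driving the allowed client IDs and columns from metadata tables, rather than hard-coding them, is what makes such rules auditable and easy to change.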
It is important to catalog business and technical metadata in database structures so that this information is not scattered across various documents, diagrams, and meeting notes. Once introduced into databases, metadata can be queried and analyzed, and reports can be developed to answer commonly asked questions about data transformation rules, data types, and column defaults.
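Once metadata lives in queryable structures, questions like "which columns does this rule affect?" become simple queries. A minimal Python sketch with a hypothetical catalog:

```python
def find_columns_by_rule(catalog, phrase):
    """Search a business-metadata catalog for columns whose business rule
    mentions a phrase -- the kind of question a queryable catalog answers."""
    return [(c["table"], c["column"]) for c in catalog
            if phrase.lower() in c["business_rule"].lower()]

# Hypothetical catalog entries for illustration only.
catalog = [
    {"table": "DimCustomer", "column": "TotalChildren",
     "business_rule": "Parsed from the demographics survey XML"},
    {"table": "DimCustomer", "column": "YearlyIncome",
     "business_rule": "Banded into ranges per marketing rules"},
]
hits = find_columns_by_rule(catalog, "survey")
```

The same catalog can feed reports for data stewards, which is far more maintainable than hunting through documents and diagrams.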
Using metadata also provides for more effective integration of data design tools and development tools; an entire software market is dedicated to metadata tools and metadata repositories.
This concludes our overview of data areas. Next, let’s look at brief overviews of operational data stores and data warehouse consumers.
Operational Data Stores
Operational data stores (ODSs) are databases that support operational reporting outside of source systems. An ODS is a key data area within the Corporate Information Factory and is typically categorized into different classes, with the key variable being the delay between the live source data and the ODS data, as follows:
Class I – One- to two-second delay; this short delta often requires asynchronous "push" data integration processes instead of the traditional "pull" method.
Class II – Intraday; typically a two- to four-hour delay.
Class III – One day; this can be part of the daily data integration processes.
Class IV – Loaded directly from the data warehouse.
Note that Class I and Class II ODSs require special data integration processes (i.e., processes that run more frequently than the nightly data integration batch schedule).
Also note that an ODS may be used as a Data in area depending upon the implementation.
You can read more about ODSs at the following links:
The Operational Data Store
Corporate Information Factory
Data Warehouse Consumers
Data integrated and stored in data warehouses is consumed by a variety of interfaces. Typical data consumers include external data feeds, queries, reports, OLAP structures, BI semantic layer tools, and suites of interrelated applications and services. In addition, data marts are consumers when the data warehouse architecture is a centralized EDW.
This section covers the typical data consumer interfaces to consider in your data warehouse architecture. Note that there are many products within each of these categories. The products listed below are provided as examples within the Microsoft product suite for each category.
Queries
One of the most straightforward methods for consuming data from a data warehouse is using queries. These queries typically scan large numbers of records, are often compiled in stored procedures, and provide for a consistent source of data for analysts and decision makers. Queries are also often combined with other information delivery vehicles, such as reports, and provide for uniform representation of information. Managing changes to queries and managing security are some important aspects of using this delivery method.
Reports
SQL Server Reporting Services (SSRS) is one of the most commonly used tools to access data from data warehouses. This tool provides for enterprise-wide access to information in predefined forms as well as for ad hoc access to data.
SSRS, coupled with the powerful report-authoring environment provided by Microsoft Report Builder, frequently provides the primary data consumption methods within an organization.
For more information about SSRS and Report Builder, see the following links:
SQL Server 2008 Reporting Services Web site
SQL Server 2008 R2 Books Online topic SQL Server Reporting Services
TechNet article Getting Started with Report Builder 3.0
OLAP
OLAP takes data consumption from data warehouses to a higher level. One of the most advanced OLAP technologies available is SQL Server Analysis Services (SSAS), which gives organizations advanced enterprise-wide analytical capabilities, complementing the power of data contained in data warehouses.
How well data warehouse systems are architected directly affects how efficiently you can implement SSAS. In other words, straightforward and effective data warehouse design greatly reduces the complexities around OLAP models.
BI Semantic Layer Tools
SharePoint 2010 Insights is a powerful BI semantic layer tool that enables users to create BI dashboards that include powerful analytic reports, Key Performance Indicators (KPIs), and scorecards. Using PerformancePoint Services (PPS) in SharePoint 2010 provides for even greater insight into how business is performing and the status of key indicators for business processes. Adding PPS to a list of data consumers from a data warehouse enables organizations to make real-time decisions and improve the overall ROI of data warehouse efforts.
For more information about PPS, see the following links:
What's new for PerformancePoint Services (SharePoint Server 2010)
Embedded BI Applications
This class of applications is developed to solve a targeted business problem, such as fraud detection. These applications often use data mining algorithms or provide a more guided user experience than their BI tool equivalents.
Suites of Applications and Services
Enabling ease of access to information contained in data warehouse systems is a main objective of any enterprise BI strategy. Delivering pervasive BI at low cost through the Microsoft Office tools that information users already work with every day accomplishes this goal efficiently. Applications and services within Office provide for easy and direct access to data in data warehouses, whether users connect directly from Microsoft Excel or use the advanced analytics of Microsoft PowerPivot.
For details about Microsoft Office Suite or PowerPivot, see the following links:
External Data Feeds
External data feeds can include LOB systems, such as customer relationship management (CRM) systems. CRM systems must consume data consolidated from various data sources, so a properly architected data warehouse represents an ideal source of data for CRM. Data warehouses provide data feeds that encompass a central, comprehensive, and consistent perspective of customers, making CRM systems more efficient and effective. Additionally, data warehouses represent a consistently cleansed and reliable source of customer data that is accessible enterprise-wide.