Microsoft EDW Architecture, Guidance and Deployment Best Practices
Chapter 2: Data Architecture
By Microsoft Corporation
Contributing writers from Solid Quality Mentors: Larry Barnes, Bemir Mehmedbasic
Technical reviewers from Microsoft: Eric Kraemer, Ross LoForte, Ted Tasker, Benjamin Wright-Jones
Contributing editors from Solid Quality Mentors: Kathy Blomstrom
Applies to: SQL Server 2008 R2
Chapter 2: Data Architecture 1
Chapter Focus 3
Roles and Responsibilities 5
Data Warehouse Maturity Model Overview 8
Data Quality and the Role of Governance and Stewardship 9
Data Warehouse Concepts 10
Data Warehouse Components 11
Business Intelligence = Data Warehousing? 11
Data Warehouse Architectures 12
Data Models: Normalized to Denormalized 14
Database Architecture 18
One Data Area: Full Loads 19
One Data Area: Incremental Loads 20
Adding a Production Data Area 21
The Consumption Data Area 22
The Data in Data Area 23
Exception and Logging Data Areas 24
Operational Data Stores 27
Consumer Interfaces 27
Master Data and Master Data Management 28
What Is Master Data? 30
What Is Master Data Management? 31
Where Do Data Warehousing and MDM Overlap? 32
Platform Architecture 38
Data Warehouse Server 39
Server Virtualization 40
SQL Server Fast Track Data Warehouse 40
Data Warehouse Appliances 40
Which Server Option Should You Choose? 41
Database Architecture 41
Databases, Schemas, and Filegroups 41
Considerations for Physical Database Design 45
Data Modeling 46
Data Modeling Overview 47
Conceptual Model 48
Logical Model 49
Physical Model 51
Data Modeling – Column Types 52
Fact Tables 62
Reference Data 63
Bridge Tables 71
Nulls and Missing Values 71
Referential Integrity 72
Clustered vs. Heap 72
SQL Server Data Type Considerations 74
Very Large Data Sets 76
Conclusion and Resources 76
Data architecture is an umbrella term for the standards, metadata, architectures, and data models used to ensure that an organization’s data warehouse meets the strategic decision-making needs of its business users. At its core, a data warehouse consists of a collection of databases organized into production and consumption areas, as shown in Figure 2-1.
Figure 2-: Data warehouse organized into production and consumption areas
Each database contains tables and supporting objects. These tables are populated with large amounts of data. Data integration processes are responsible for moving and transforming the data as it flows from sources to the consumption area.
This simple concept is complicated by the following factors:
Scope – Multiple, cross-organizational subject areas
Scale – Very large volumes of time-variant data
Quality – Cleansing, integrating, and conforming diverse data from multiple sources
There is a lot of literature available on data warehouse data architecture, but the two most visible data warehouse authors in the industry focus on different aspects of data architecture:
Bill Inmon’s Corporate Information Factory (CIF) focuses on database architecture, a top-down approach.
Ralph Kimball’s Enterprise Data Bus focuses on data modeling, a bottom-up approach.
We’ll look briefly at these different approaches in the next section. However, the objective of this chapter is to distill the available information into a set of concepts and patterns and then present tangible best practices for data warehouses implemented on the Microsoft SQL Server database platform.
The audience for this chapter is members of the data warehouse team responsible for the data architecture, database architecture, data models, and overall quality of the data warehouse.
The data architecture team has responsibilities for oversight and specific deliverables throughout the data warehouse development life cycle, shown in Figure 2-2.
Figure 2-2: Focus of this chapter
This chapter focuses on the deliverables that the data architecture team produces as part of the development phase or in support of development—namely, architectures and data models.
Note that physical best practices and guidance within in this chapter are for the symmetric multi-processing (SMP) versions of SQL Server 2008 R2, i.e. guidance for SQL Server 2008 R2 Parallel Data Warehouse (PDW) is out of scope for the initial release of this chapter and document.
Figure 2-3 shows the data architecture deliverables and the inputs into these deliverables.
Figure 2-3: Data architecture deliverables
Data architecture team deliverables include:
Technical metadata and standards – The team provides oversight and contributes to all database development standards and technical metadata used within the data warehouse.
Data models – The team provides oversight and understanding of all data models within the data warehouse and acts as subject matter expert for data models within data governance and other cross-organizational efforts.
Database architecture – The team has primary responsibility for the data warehouse database architecture.
Platform architecture – The team contributes to the product selection and underlying hardware and software platform that hosts the data warehouse.
One primary driver behind these data architecture deliverables is the maturity level of an organization’s data warehouse. This is briefly covered in the next section.
Roles and Responsibilities
The data architect, data developer, and data steward each play key roles within data architecture and are responsible for working with business analysts and the extended team to translate business requirements into technical requirements. Figure 2-4 shows these roles along with their responsibilities.
Figure 2-4: Roles and responsibilities on the data architecture team
Roles and responsibilities:
The data architect is a member of the data governance team and is responsible for the data warehouse architecture, metadata, overall quality of the data warehouse solution, and in some cases, the initial data model.
The database developer is responsible for the development of data models and other database objects within the data warehouse and contributes to data warehouse metadata.
The data steward, also a member of the data governance team, contributes to business and technical metadata and is responsible for the quality of data within the data warehouse.
This chapter’s primary audience is data architects and database developers. Data stewardship is a key role within a data warehouse project and will be covered when it intersects with the core data architecture team. However, data stewards and their day-to-day activities are not a focus of Chapter 2.
Data architecture for data warehouses should be first driven by business need and should also conform to organizational and technology needs.
As Figure 2-5 illustrates, as enterprise data warehouse systems are being developed, data architecture should be implemented to:
Support the business objectives
Enable information management (data flows, validations, consumption)
Productize data (i.e., turn data into an asset) for competitive advantage
Produce a single version of the truth across time
Achieve high performance
Figure 2-5: Requirements that the data architecture needs to support
Support Business Objectives
Ideally, business objectives are provided to teams developing data architecture. When business objectives are not clearly defined or don’t exist, data architecture team members need to be proactive in acquiring the missing information from business stakeholders and subject matter experts.
Information Management – Data Consumption and Integration
Business objectives are further broken down into data requirements that define the databases and data models used by business consumers. The data architecture team works closely with the data integration team to ensure that data requirements can be successfully populated by data integration processes. These data mappings, business rules, and transformations are often a joint effort between business analysts and data integration developers.
As data warehousing systems and business intelligence (BI) within the organization matures over time, organizations begin to realize the potential for productizing the data—meaning transforming data from a raw asset into a measure of business results that ultimately provides insight into the business.
Single Version of the Truth
There is no alternative to one version of the truth in successful data architecture. The data warehouse team shouldn’t underestimate the difficulty involved in achieving one version of the truth. The team must overcome technical, cultural, political, and technological obstacles to achieve what is probably the most challenging requirement when building a data warehouse.
Achieve High Performance
Achieving high performance for data retrieval and manipulation within the data warehouse is a requirement of data architecture. Later in this chapter, we discuss the different logical and physical data modeling techniques and best practices that directly relate to ensuring the most efficient performance for the very large databases within the data warehouse consumption area.
However, it’s often the performance of the data integration processes that drive the data models and architectures within the production area of very large data warehouses (VLDWs). These dual requirements for high-performance loads and high-performance queries result in different databases and data models for the production and consumption areas.
The data architecture must ensure that data is secured from individuals who do not have authorization to access it. Often, users are allowed to view only a subset of the data, and the data architecture is responsible for ensuring that the underlying constructs are in place to meet this requirement.
Providing high-performance access to one version of the truth that meets business objectives and requirements over time is not a simple task. It requires both a solid initial implementation and the ability to enhance and extend the data warehouse over a period of many years. Figure 2-6 presents some of the challenges for the data architecture team.
Figure 2-6: Challenges for the data warehouse team
These challenges include:
The lack of complete business requirements or conflicting business requirements from business stakeholders
Scope, or the cross-organizational communication toward a common goal and the need for common definitions and models
Understanding source systems, their limitations (including data quality and antiquated systems), and the impact on the downstream data warehouse data stores
Agility, or the ability to meet the needs of the business in a timely manner
Data volumes and the need for high performance for both queries and data loads
In a perfect world, business objectives are clearly defined and well understood by stakeholders and users. In reality, however, business needs are often not clearly defined or are not broken down into a clear set of requirements. Frequently, this is due to the difference between business descriptions of objectives and the technical interpretations of these objectives.
In addition, there will always be organizational change during the lifetime of the data warehouse, which requires that the data architecture be agile and able to respond to changes to the organization and environment.
The requirements and challenges depend upon the scope and scale of the data warehouse, which can often be mapped to its level of maturity. The next section provides a brief overview of a data warehouse maturity model.
Different organizations are at different levels of maturity with respect to their data warehouse initiatives. Several challenges, including scope and data volumes, are a function of a data warehouse’s maturity. Figure 2-7, from a 2004 Information Management article, Gauge Your Data Warehouse Maturity , by Wayne Eckerson, provides an overview of a maturity model for data warehouses.
Figure 2-7: Data warehouse maturity model
The following section contains excerpts from this article.
Data marts are defined as a shared, analytic structure that generally supports a single application area, business process, or department. These "independent" data marts do a great job of supporting local needs; however, their data can't be aggregated to support cross-departmental analysis.
After building their third data mart, most departments recognize the need to standardize definitions, rules, and dimensions to avoid an integration nightmare down the road. Standardizing data marts can be done in a centralized or decentralized fashion… The most common strategy is to create a central data warehouse with logical dependent data marts. This type of data warehouse is commonly referred to as a hub-and-spoke data warehouse.
Although a data warehouse delivers many new benefits, it doesn't solve the problem of analytic silos. Most organizations today have multiple data warehouses acquired through internal development, mergers, or acquisitions. Divisional data warehouses contain overlapping and inconsistent data, creating barriers to the free flow of information within and between business groups and the processes they manage.
In the adult stage, organizations make a firm commitment to achieve a single version of the truth across the organization. Executives view data as a corporate asset that is as valuable as people, equipment, and cash. They anoint one data warehouse as the system of record or build a new enterprise data warehouse (EDW) from scratch. This EDW serves as an integration machine that continuously consolidates all other analytic structures into itself.
In the adult stage, the EDW serves as a strategic enterprise resource for integrating data and supporting mission-critical applications that drive the business. To manage this resource, executives establish a strong stewardship program. Executives assign business people to own critical data elements and appoint committees at all levels to guide the development and expansion of the EDW resource.
In summary, the need for a data warehouse arises from a desire by organizations to provide a single version of the truth. The complexity of this effort is magnified when done at the enterprise level (i.e., for an EDW). As stated above, stewardship programs are central to a successful data warehouse.
The last section in this introduction briefly discusses stewardship, data governance, and how both are essential to delivering a high quality data warehouse.
Data Quality and the Role of Governance and Stewardship
Acquiring and maintaining business trust is a foundational objective of a data warehouse that requires strong communication between the data warehouse team and the business users as well as the ability to provide accessible high quality data. Data governance and data stewardship are ongoing processes in support of maximizing business trust in a data warehouse.
The Data Governance Institute defines data governance as:
A system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.
This definition can be distilled down to the following statement:
Data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise.
The data governance conceptual process flows that intersect with data architecture include:
Identifying key data assets within the organization.
Ensuring that data assets have common definitions within the organization. Once common definitions are defined, business consumers can share this data, as opposed to re-creating it within each solution.
Ensuring that quality data is loaded into the data warehouse.
The set of processes to ensure maximum data integrity is called data stewardship. Data stewardship focuses on the management of data assets to improve reusability, accessibility, and quality. The team members implementing and enforcing these objectives are data stewards. These individuals should have a thorough understanding of business processes, data flows, and data sources.
Additionally, data stewards are liaisons between data warehouse architects and developers and the business community. Typically, one data steward is responsible for one subject area in a data warehouse.
Maintaining data warehouse data quality is an ongoing process. Explicitly creating direct ownership and accountability for quality data for data warehouse sources eliminates a “shadow” role that exists in many data warehouses. This shadow role often falls to business analysts and database developers once the business starts questioning results within the data warehouse. Failure to allocate resource to data stewardship activities can result in data warehouse team member burnout and negative attrition.
Data stewards are responsible for establishing and maintaining:
Business naming standards
Base and calculated measures
Data quality analysis
Linkages to and understanding of data sources
Data security specifications
Data retention criteria
This chapter’s focus is less on the process and more on deliverables related to data architecture. The following links provide more information about data governance and data stewardship:
Data Governance Institute
Data Governance & Stewardship Community of Practice
Kimball University: Data Stewardship 101: First Step to Quality and Consistency